Journal of Econometrics 162 (2011) 149–169
Multivariate realised kernels: Consistent positive semi-definite estimators of the covariation of equity prices with noise and non-synchronous trading✩

Ole E. Barndorff-Nielsen^{a,b}, Peter Reinhard Hansen^{c}, Asger Lunde^{d,b,∗}, Neil Shephard^{e,f}

^{a} The T.N. Thiele Centre for Mathematics in Natural Science, Department of Mathematical Sciences, University of Aarhus, Ny Munkegade, DK-8000 Aarhus C, Denmark
^{b} CREATES, University of Aarhus, Denmark
^{c} Department of Economics, Stanford University, Landau Economics Building, 579 Serra Mall, Stanford, CA 94305-6072, USA
^{d} School of Economics and Management, Aarhus University, Bartholins Allé 10, DK-8000 Aarhus C, Denmark
^{e} Oxford-Man Institute, University of Oxford, Eagle House, Walton Well Road, Oxford OX2 6ED, UK
^{f} Department of Economics, University of Oxford, UK

Article history: Received 14 July 2010; Received in revised form 14 July 2010; Accepted 28 July 2010; Available online 27 January 2011.

Keywords: HAC estimator; Long run variance estimator; Market frictions; Quadratic variation; Realised variance.

✩ The second and fourth author are also affiliated with CREATES, a research center funded by the Danish National Research Foundation. The Ox language of Doornik (2006) was used to perform the calculations reported here. We thank Torben Andersen, Tim Bollerslev, Ron Gallant, Xin Huang, Oliver Linton and Kevin Sheppard for comments on a previous version of this paper.
∗ Corresponding author at: School of Economics and Management, Aarhus University, Bartholins Allé 10, DK-8000 Aarhus C, Denmark. E-mail addresses: [email protected] (O.E. Barndorff-Nielsen), [email protected] (P.R. Hansen), [email protected] (A. Lunde), [email protected] (N. Shephard).
Abstract: We propose a multivariate realised kernel to estimate the ex-post covariation of log-prices. We show that this new consistent estimator is guaranteed to be positive semi-definite, is robust to measurement error of certain types, and can handle non-synchronous trading. It is the first estimator with all three of these properties, each of which is essential for empirical work in this area. We derive the large sample asymptotics of this estimator and assess its accuracy using a Monte Carlo study. We implement the estimator on some US equity data, comparing our results to previous work which has used returns measured over 5 or 10 min intervals. We show that the new estimator is substantially more precise. © 2011 Elsevier B.V. All rights reserved.
1. Introduction

The last seven years have seen dramatic improvements in the way econometricians think about time-varying financial volatility, first brought about by harnessing high frequency data and then by mitigating the influence of market microstructure effects. Extending this work to the multivariate case is challenging, as one additionally needs to remove the effects of non-synchronous trading while simultaneously requiring that the covariance matrix estimator be positive semi-definite. In this paper we provide the first estimator which achieves all these objectives. It will be called the multivariate realised kernel, and we will define it in Eq. (1).

We study a d-dimensional log price process X = (X^{(1)}, X^{(2)}, ..., X^{(d)})′. These prices are observed irregularly and non-synchronously over the interval [0, T]. For simplicity of exposition we take T = 1
throughout the paper. These observations could be trades or quote updates. The observation times for the i-th asset will be written as t_1^{(i)}, t_2^{(i)}, .... This means the available database of prices is X^{(i)}(t_j^{(i)}), for j = 1, 2, ..., N^{(i)}(1), and i = 1, 2, ..., d. Here N^{(i)}(t) counts the number of distinct data points available for the i-th asset up to time t.

X is assumed to be driven by Y, the efficient price, abstracting from market microstructure effects. The efficient price is modelled as a Brownian semimartingale (Y ∈ BSM) defined on some filtered probability space (Ω, F, (F_t), P),

    Y(t) = ∫_0^t a(u) du + ∫_0^t σ(u) dW(u),

where a is a vector of predictable locally bounded drifts, σ is a càdlàg volatility matrix process and W is a vector of independent Brownian motions. For reviews of the econometrics of this type of process see, for example, Ghysels et al. (1996). If Y ∈ BSM then its ex-post covariation, which we will focus on for reasons explained in a moment, is
    [Y](1) = ∫_0^1 Σ(u) du,   where Σ = σσ′,

where

    [Y](1) = plim_{n→∞} Σ_{j=1}^{n} {Y(t_j) − Y(t_{j−1})}{Y(t_j) − Y(t_{j−1})}′,
(e.g. Protter (2004, pp. 66–77) and Jacod and Shiryaev (2003, p. 51)) for any sequence of deterministic synchronized partitions 0 = t_0 < t_1 < ... < t_n = 1 with sup_j {t_{j+1} − t_j} → 0 for n → ∞. This is the quadratic variation of Y.

The contribution of this paper is to construct a consistent, positive semi-definite (psd) estimator of [Y](1) from our database of asset prices. The challenges of doing this are three-fold: (i) there are market microstructure effects U = X − Y; (ii) the data are irregularly spaced and non-synchronous; (iii) the market microstructure effects are not statistically independent of the Y process.

Quadratic variation is crucial to the economics of financial risk. This is reviewed by, for example, Andersen et al. (2010) and Barndorff-Nielsen and Shephard (2007), who provide very extensive references. The economic importance of this line of research has recently been reinforced by the insight of Bollerslev et al. (2009), who showed that expected stock returns seem well explained by the variance risk premium (the difference between the implied and realised variance), and that this risk premium is only detectable using the power of high frequency data. See also the papers by Drechsler and Yaron (2011), Fleming et al. (2003) and de Pooter et al. (2008).

Our analysis builds upon earlier work on the effect of noise on univariate estimators of [Y](1) by, amongst others, Zhou (1996), Andersen et al. (2000), Bandi and Russell (2008), Zhang et al. (2005), Hansen and Lunde (2006), Hansen et al. (2008), Kalnina and Linton (2008), Zhang (2006), Barndorff-Nielsen et al. (2008), Renault and Werker (2011), Hansen and Horel (2009), Jacod et al. (2009) and Andersen et al. (2011). The case of no noise is dealt with in the same spirit as the papers by Andersen et al. (2001), Barndorff-Nielsen and Shephard (2002, 2004), Mykland and Zhang (2006, 2009), Goncalves and Meddahi (2009) and Jacod and Protter (1998).

A distinctive feature of multivariate financial data is the phenomenon of non-synchronous trading or non-trading. These two terms are distinct. The first refers to the fact that any two assets rarely trade at the same instant. The latter refers to situations where one asset trades frequently over a period while some other assets do not trade. The treatment of non-synchronous trading effects dates back to Fisher (1966). For several years researchers focused mainly on the effects that stale quotes have on daily closing prices. Campbell et al. (1997, Chapter 3) provide a survey of this literature.

When increasing the sampling frequency beyond the inter-hour level, several authors have demonstrated a severe bias towards zero in covariation statistics. This phenomenon is often referred to as the Epps effect. Epps (1979) found this bias for stock returns, and it has also been demonstrated to hold for foreign exchange returns, see Guillaume et al. (1997). This is confirmed in our empirical work, where realised covariances computed using high frequency data over specified fixed time periods, such as 15 s, dramatically underestimate the degree of dependence between assets. Some recent econometric work on this topic includes Malliavin and Mancino (2002), Reno (2003), Martens (unpublished paper), Hayashi and Yoshida (2005), Bandi and Russell (unpublished paper), Voev and Lunde (2007), Griffin and Oomen (2011) and Large (unpublished paper). We will draw ideas from this work.
Our estimator, the multivariate realised kernel, differs from the univariate realised kernel estimator of Barndorff-Nielsen et al. (2008) in important ways. The latter converges at rate n^{1/4} but critically relies on the assumption that the noise is a white noise process, and Barndorff-Nielsen et al. (2008) stress that their estimator cannot be applied to tick-by-tick data. In order not to be in obvious violation of the iid assumption, Barndorff-Nielsen et al. (2008) apply their estimator to prices that are (on average) sampled every minute or so. In the present paper we allow for a general form of noise that is consistent with the empirical features of tick-by-tick data. For this reason we adopt a larger bandwidth, which has the implication that our multivariate realised kernel estimator converges at rate n^{1/5}. Although this rate is slower than n^{1/4}, it is, from a practical viewpoint, important to acknowledge that there are only 390 one-minute returns in a typical trading day, while many shares trade several thousand times, and 390^{1/4} < 2000^{1/5}. So the rates of convergence will not (alone) tell us which estimators will be most accurate in practice — even for the univariate estimation problem.

In addition to being robust to noise with a general form of dependence, the n^{1/5} convergence rate enables us to construct an estimator that is guaranteed to be psd, which is not the case for the estimator of Barndorff-Nielsen et al. (2008). Moreover, our analysis of irregularly spaced and non-synchronous observations causes the asymptotic distribution of our estimator to be quite different from that in Barndorff-Nielsen et al. (2008). We discuss the differences between these estimators in greater detail in Section 6.1.

The structure of the paper is as follows. In Section 2 we synchronise the timing of the multivariate data using what we call Refresh Time. This allows us to define high frequency returns and in turn the multivariate realised kernel. Further, we make precise the assumptions used in our theorems to study the behaviour of our statistics. In Section 3 we give a detailed discussion of the asymptotic distribution of realised kernels in the univariate case. The analysis is then extended to the multivariate case. Section 4 contains a summary of a simulation experiment designed to investigate the finite sample properties of our estimator. Section 5 contains some results from implementing our estimators on some US stock price data taken from the TAQ database. We analyse up to 30 dimensional covariance matrices, and demonstrate efficiency gains that are around 20-fold compared to using daily data. This is followed by a section on extensions and further remarks, while the main part of the paper is finished by a conclusion. This is followed by an Appendix which contains the proofs of various theorems given in the paper, and an Appendix with results related to Refresh Time sampling. More details of our empirical results and simulation experiments are given in a web appendix which can be found at http://mit.econ.au.dk/vip_htm/alunde/BNHLS/BNHLS.htm.

2. Defining the multivariate realised kernel

2.1. Synchronising data: refresh time

Non-synchronous trading delivers fresh (trade or quote) prices at irregularly spaced times which differ across stocks. Dealing with non-synchronous trading has been an active area of research in financial econometrics in recent years, e.g. Hayashi and Yoshida (2005), Voev and Lunde (2007) and Large (unpublished paper). Stale prices are a key feature of estimating covariances in financial econometrics, as recognised at least since Epps (1979), for they induce cross-autocorrelation amongst asset price returns.

Write the number of observations in the i-th asset made up to time t as the counting process N^{(i)}(t), and the times at which trades are made as t_1^{(i)}, t_2^{(i)}, .... We now define refresh time, which will be key to the construction of multivariate realised kernels. This time scale was used in a cointegration study of price discovery by Harris et al.
(1995), and Martens (unpublished paper) has used the same idea in the context of realised covariances.

Definition 1 (Refresh Time). For t ∈ [0, 1], we define the first refresh time as τ_1 = max(t_1^{(1)}, ..., t_1^{(d)}), and then subsequent refresh times as

    τ_{j+1} = max( t^{(1)}_{N^{(1)}(τ_j)+1}, ..., t^{(d)}_{N^{(d)}(τ_j)+1} ).
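Definition 1 maps directly into code. The following is a minimal sketch (in Python, not the authors' Ox implementation) of Refresh Time for d assets, assuming each asset's observation times are supplied as a sorted array on [0, 1]; the toy sample sizes echo Fig. 1 below, and the last lines compute the data-retention statistic p = dN / Σ_{i=1}^{d} n^{(i)} used later in this section.

```python
import numpy as np

def refresh_times(times):
    """Refresh Time (Definition 1): `times` is a list of d sorted arrays of
    observation times on [0, 1]; returns the array (tau_1, tau_2, ...)."""
    times = [np.asarray(t) for t in times]
    idx = [0] * len(times)                      # next fresh observation per asset
    taus = []
    while all(i < len(t) for i, t in zip(idx, times)):
        tau = max(t[i] for i, t in zip(idx, times))   # all assets now refreshed
        taus.append(tau)
        # move each asset past tau: first observation strictly after tau
        idx = [int(np.searchsorted(t, tau, side="right")) for t in times]
    return np.array(taus)

# toy example with d = 3 assets, echoing the sample sizes in Fig. 1
rng = np.random.default_rng(0)
times = [np.sort(rng.uniform(0, 1, n)) for n in (8, 9, 10)]
tau = refresh_times(times)
p = len(times) * len(tau) / sum(len(t) for t in times)   # p = dN / sum_i n^(i)
print(len(tau), round(p, 2))
```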
Fig. 1. This figure illustrates Refresh Time in a situation with three assets. The dots represent the times {t_j^{(i)}}. The vertical lines represent the sampling times generated from the three assets with refresh time sampling. Note, in this example, N = 7, n^{(1)} = 8, n^{(2)} = 9 and n^{(3)} = 10.
The resulting Refresh Time sample size is N, while we write n^{(i)} = N^{(i)}(1). Here τ_1 is the time it has taken for all the assets to trade, i.e. all their posted prices have been updated. τ_2 is the first time when all the prices are again refreshed. This process is displayed in Fig. 1 for d = 3. Our analysis will now be based on this time clock {τ_j}. Our approach will be to:

• Assume the entire vector of up-to-date prices is seen at these refreshed times X(τ_j), which is not correct — for we only see a single new price and d − 1 stale prices.¹
• Show these stale pricing errors have no impact on the asymptotic distribution of the realised kernels.

This approach to dealing with non-synchronous data converts the problem into one where the Refresh Time sample size N is determined by the degree of non-synchronicity and n^{(1)}, n^{(2)}, ..., n^{(d)}. The degree to which we keep data is measured by the size of the retained data over the original size of the database. For Refresh Time this is p = dN / Σ_{i=1}^{d} n^{(i)}. For the data in Fig. 1, p = 21/27 ≃ 0.78.

2.2. Jittering end conditions

It turns out that our asymptotic theory dictates that we need to average m prices at the very beginning and end of the day to obtain a consistent estimator.² The theory behind this will be explained in Section 6.4, where experimentation suggests the best choice for m is around two for the kind of data we see in this paper. Now we define what we mean by jittering. Let n, m ∈ ℕ, with n − 1 + 2m = N, then set the vector observations X_0, X_1, ..., X_n as X_j = X(τ_{N,j+m}), j = 1, 2, ..., n − 1, and
    X_0 = (1/m) Σ_{j=1}^{m} X(τ_{N,j})   and   X_n = (1/m) Σ_{j=1}^{m} X(τ_{N,N−m+j}).
So X_0 and X_n are constructed by jittering initial and final time points. By allowing m to be moderately large but very small in comparison with n, these observations record the efficient price without much error, as the error is averaged away. These prices allow us to define the high frequency vector returns x_j = X_j − X_{j−1}, j = 1, 2, ..., n, that the realised kernels are built out of.

2.3. Realised kernel

Having synchronised the high frequency vector returns {x_j}, we can define our class of positive semi-definite multivariate realised kernels (RK). It takes on the following form:

    K(X) = Σ_{h=−n}^{n} k(h/H) Γ_h,   (1)

where

    Γ_h = Σ_{j=h+1}^{n} x_j x′_{j−h},   for h ≥ 0,

¹ Their degree of staleness will be limited by their Refresh Time construction to a single lag in Refresh Time. The extension to a finite number of lags is given in Section 6.5.
² This kind of averaging appears in, for example, Jacod et al. (2009).
and Γ_h = Γ′_{−h} for h < 0. Here Γ_h is the h-th realised autocovariance and k: ℝ → ℝ is a non-stochastic weight function. We focus on the class of functions, K, that is characterised by:

Assumption K. (i) k(0) = 1, k′(0) = 0; (ii) k is twice differentiable with continuous derivatives; (iii) defining k•^{0,0} = ∫_0^∞ k(x)² dx, k•^{1,1} = ∫_0^∞ k′(x)² dx, and k•^{2,2} = ∫_0^∞ k″(x)² dx, then k•^{0,0}, k•^{1,1}, k•^{2,2} < ∞; (iv) ∫_{−∞}^{∞} k(x) exp(ixλ) dx ≥ 0 for all λ ∈ ℝ.

The assumption k(0) = 1 means Γ_0 gets unit weight, while k′(0) = 0 means the kernel gives close to unit weight to Γ_h for small values of |h|. Condition (iv) guarantees that K(X) is positive semi-definite (e.g. Bochner's theorem and Andrews (1991)).

The multivariate realised kernel has the same form as a standard heteroskedasticity and autocorrelation consistent (HAC) covariance matrix estimator familiar in econometrics (e.g. Gallant (1987), Newey and West (1987) and Andrews (1991)). But there are a number of important differences. For example, the sums that define the realised autocovariances are not divided by the sample size, and k′(0) = 0 is critical in our framework. Unlike the situation in the standard HAC literature, an estimator based on the Bartlett kernel will not be consistent for the ex-post variation of prices, measured by quadratic variation, in the present setting. Later we will recommend using the Parzen kernel (its form is given in Table 1) instead.

In some of our results we use the following additional assumption on the Brownian semimartingale.

Assumption SH. Assume µ and σ are bounded, and we will write σ_+ = sup_{t∈[0,1]} |σ(t)|. This can be relaxed to locally bounded if (µ, σ) is an Ito process — e.g. Mykland and Zhang (forthcoming).
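To fix ideas, here is a small sketch of the estimator in Eq. (1) with the Parzen weight function, including the end-point jittering of Section 2.2. The function names and the (N × d) input convention are ours, not the paper's; bandwidth choice is discussed in Sections 3.2.3 and 3.4.

```python
import numpy as np

def parzen(x):
    """Parzen weight function (see Table 1); k(0) = 1 and k'(0) = 0."""
    x = abs(x)
    if x <= 0.5:
        return 1 - 6 * x**2 + 6 * x**3
    return 2 * (1 - x)**3 if x <= 1 else 0.0

def realised_kernel(P, H, m=2):
    """Multivariate realised kernel, Eq. (1).
    P: (N, d) array of log-prices at the N refresh times; H: bandwidth;
    m: number of prices averaged at each end (jittering, Section 2.2)."""
    N, d = P.shape
    X = np.vstack([P[:m].mean(0), P[m:N - m], P[N - m:].mean(0)])
    x = np.diff(X, axis=0)                  # n = N - 2m + 1 high frequency returns
    n = x.shape[0]
    K = np.zeros((d, d))
    for h in range(-n + 1, n):
        w = parzen(h / H)
        if w == 0.0:
            continue                        # Parzen weights vanish for |h| > H
        G = x[abs(h):].T @ x[:n - abs(h)]   # Gamma_h = sum_j x_j x_{j-h}'
        K += w * (G if h >= 0 else G.T)     # Gamma_{-h} = Gamma_h'
    return K
```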
2.4. Some assumptions about refresh time and noise

Having defined the positive semi-definite realised kernel, we will now write out our assumptions about the refresh times {τ_{N,j}} and the market microstructure effects U that govern the properties of the vector returns {x_j} and so K(X).

2.4.1. Assumptions about the refresh time

We use the subscript N to make the dependence on N explicit. Note that N is random, and we write the durations as

    τ_{N,i} − τ_{N,i−1} = Δ_{N,i} = D_{N,i}/N   for all i.

We make the following assumptions about the durations between observation times.
Assumption D. (i) E(D^r_{N,⌊tN⌋} | F_{τ_{N,⌊tN⌋−1}}) →^{p} υ_r(t), 0 < r ≤ 2, as N → ∞. Here we assume υ_r(t) are strictly positive càdlàg processes adapted to {F_t}; (ii) max_{i∈{j+1,...,j+R}} D_{N,i} = o_p(R^{1/2}) for any j; (iii) τ_{N,0} ≤ 0 and τ_{N,N+1} ≥ 1.

Remark 1. If we have Poisson sampling the durations are exponential, and then max_{i∈{1,...,N}} Δ_{N,i} = O_p(log(N)/N), so max_{i∈{1,...,N}} D_{N,i} = O_p(log(N)). Note both Barndorff-Nielsen et al. (2008) and Mykland and Zhang (2006) assume that max_{i∈{1,...,N}} D_{N,i} = O_p(1).

Phillips and Yu (unpublished paper) provided a novel analysis of realised volatility under random times of trades. We use their
Assumption D here, applied to the realised kernel. Deriving results for realised volatility under random times of trades is an active research area.³

Example 1 (Refresh Time). If each individual series has trade times which arrive as independent Poisson processes with the same intensity λN, then their scaled durations are i.i.d. D^{(j)} ~ exp(λ), j = 1, 2, ..., d, so the refresh time durations are D_{N,i} = max{D^{(1)}, ..., D^{(d)}}, and so (e.g. Embrechts et al. (1997, p. 189)) the refresh times have the form of a renewal process τ_{N,i} − τ_{N,i−1} = (1/N) D_{N,i}, with

    υ_1(t) = λ^{−1} Σ_{j=1}^{d} j^{−1},   υ_2(t) = λ^{−2} { (Σ_{j=1}^{d} j^{−1})² + Σ_{j=1}^{d} j^{−2} }.

Of interest is how these terms change as d increases. The former is the harmonic series, divergent at the slow rate log(d). The conditional variance converges to π²/(6λ²) as d → ∞, so lim_{d→∞} υ_2(t)/υ_1²(t) → 1. For d = 1, υ_1(t) = λ^{−1} and υ_2(t) = 2λ^{−2}, so υ_2(t)/υ_1(t) = 2λ^{−1}.

2.4.2. Assumptions about the noise

The assumptions about the noise are stated in observation time — that is, we only model the noise at exactly the times where there are trades or quote updates. This follows, for example, Zhou (1998), Bandi and Russell (unpublished paper), Zhang et al. (2005), Barndorff-Nielsen et al. (2008) and Hansen and Lunde (2006). We define the noise associated with X(τ_{N,j}) at the observation time τ_{N,j} as U_{N,j} = X(τ_{N,j}) − Y(τ_{N,j}).

Assumption U. Assume the component model

    U_{N,i} = ν_{N,i} + ζ_{N,i},   where ν_{N,i} = Σ_{h=0}^{∞} ψ_h(τ_{N,i−1−h}) ε_{N,i−h},

with ε_{N,i} = Δ_{N,i}^{−1/2} [W̃(τ_{N,i}) − W̃(τ_{N,i−1})]. Here W̃ is a standard Brownian motion and {ζ_{N,i}} is a sequence of independent random variables, with E(ζ_{N,i} | F_{τ_{N,i−1}}) = 0 and var(ζ_{N,i} | F_{τ_{N,i−1}}) = Σ_ζ(τ_{N,i−1}). Further, ε_{N,i} and ζ_{N,i} are assumed to be independent, while (ψ_h, Σ_ζ) are bounded and adapted to {F_t}, with Σ_{j=0}^{∞} |ψ_j(t)| < ∞ a.s. uniformly in t. We also assume that D ⊥⊥ ζ.

Remark 2. The auxiliary Brownian motion W̃ facilitates a general form of endogenous noise through correlation between W̃ and the Brownian motion, W, that drives the underlying process, Y. In fact, the case W̃ = W is permitted under our assumptions.

Remark 3. The standard assumption in this literature is that ψ_h(t) is zero for all t and h, but this assumption is known to be shallow empirically. A ψ_0(t) type term appears in Hansen and Lunde (2006, Example 1) and Kalnina and Linton (2008) in their analysis of endogeneity and a two scale estimator.

The ''local'' long run variance of ν is given by Σ_ν(t) = Σ_{h=−∞}^{∞} γ_h(t), where γ_h(t) = Σ_{j=1}^{∞} ψ_{j+h}(t) ψ_j(t)′ for h ≥ 0 and γ_h(t) = γ_{−h}(t)′ for h < 0, so that the local long run variance of U is given by

    Σ_U(t) = Σ_ν(t) + Σ_ζ(t).

It is convenient to define the average long run variance of U by

    Ω = ∫_0^1 Σ_U(u) du,

which is a d × d matrix. When d = 1 we write ω² in place of Ω, σ_U²(t) in place of Σ_U(t), etc. ω² appears frequently later. It reflects the variance of the average noise a frequent trader would be exposed to.

3. Asymptotic results

3.1. Consistency

We note that the multivariate realised kernel can be written as K(X) = K(Y) + K(Y, U) + K(U, Y) + K(U), where

    K(Y, U) = Σ_{h=−n+1}^{n−1} k(h/H) Σ_j y_j u′_{j−h},

with y_j and u_j defined analogously to the definition of x_j. This implies immediately that

Theorem 1. Let K hold and suppose that K(Y) = O_p(1). Then K(X) − K(Y) = K(U) + O_p(√K(U)).

Theorem 1 is a very powerful result for dealing with endogenous noise. Note, whatever the relationship between Y and U, if K(U) →^{p} 0 then K(X) − K(Y) = o_p(1), so if also K(Y) →^{p} [Y] then K(X) →^{p} [Y]. Hansen and Lunde (2006) have shown that endogenous noise is empirically important, particularly for mid-quote data. The above theorem means endogeneity does not matter for consistency. What matters is that the realised kernel applied to the noise process vanishes in probability.

Because realised kernels are built out of these n high frequency returns, it is natural to state asymptotic results in terms of n (rather than N).

Lemma 1. Let K, SH, D, and U hold. Then as H, n, m → ∞ with m/n → 0, H = c_0 n^η, c_0 > 0, and η ∈ (0, 1),

    (H²/n) K(X) →^{p} |k″(0)| Ω,   if η < 1/2,   (2)
    K(X) = ∫_0^1 Σ(u) du + c_0^{−2} |k″(0)| Ω + O_p(1),   if η = 1/2,   (3)
    K(X) →^{p} ∫_0^1 Σ(u) du,   if η > 1/2.   (4)

³ The earliest research on this that we know of is Jacod (1994). Mykland and Zhang (2006) and Barndorff-Nielsen et al. (2008) provided an analysis based on the assumption that D_{N,i} = O_p(1), which is perhaps too strong an assumption here (see Remark 1). Barndorff-Nielsen and Shephard (2005) allowed very general spacing, but assumed times and prices were independent. More recent important contributions include Hayashi et al. (unpublished paper), Li et al. (2009), Mykland and Zhang (forthcoming) and Jacod (unpublished paper), which provide insightful analysis.
This shows the crucial role of H: it needs to increase with n quite quickly to remove the influence of the noise on the estimator. For very slow rates of increase in H the realised kernel will actually estimate a scaled version of the integrated long-run variance of the noise. To estimate the integrated variance, our preferred approach is to set H = c_0 n^{3/5}, because n^{3/5} is the optimal rate in the trade-off between (squared) bias and variance. In the next section we give a rule for selecting the best possible c_0. This approach delivers a positive semi-definite consistent estimator which is robust to endogeneity and semi-staleness of prices. The results for H ∝ n^{1/2} achieve the best available rate of convergence (see below), but the estimator is inconsistent. The O_p(1) remainder in (3) can, in some cases, be shown to be o_p(1), in which case the bias, c_0^{−2}|k″(0)|Ω, can be removed by using (2) with η = 1/5, for example. However, the resulting estimator is not necessarily positive semi-definite and we do not recommend this in applications.

To study these results in more detail we will develop a central limit theory for the realised kernels, which allows us to select a
sensible H. Before introducing the multivariate results, it is helpful to consider the univariate case.

3.2. Univariate asymptotic distribution

3.2.1. Core points

The univariate version of the main results in our paper is the following.

Theorem 2. Let K, SH, D, and U hold. If n → ∞, H = c_0 n^{3/5} and m^{−1} = o(√(H/n)) = o(n^{−1/5}) we have that

    n^{1/5} ( K(X) − ∫_0^1 σ²(u) du ) →^{Ls} MN( c_0^{−2} |k″(0)| ω², 4 c_0 k•^{0,0} ∫_0^1 σ⁴(u) (υ_2(u)/υ_1(u)) du ).

The notation →^{Ls} MN means stable convergence to a mixed Gaussian distribution. This notion is important for the construction of confidence intervals and the use of the delta method. The reason is that ∫_0^1 σ⁴(u) (υ_2(u)/υ_1(u)) du is random, and stable convergence guarantees the joint convergence that is needed here. Stable convergence is discussed, for example, in Mykland and Zhang (2006), who also provide extensive references.

The minimum mean square error of the H = c_0 n^{3/5} estimator is achieved by setting c_0 = c* ξ^{4/5}, so H = c* ξ^{4/5} n^{3/5}, where

    c* = { k″(0)² / k•^{0,0} }^{1/5},   ξ² = ω² / √IQ,   IQ = ∫_0^1 σ⁴(u) (υ_2(u)/υ_1(u)) du.

Notice that the serial dependence in the noise will impact the choice of c_0, with, ceteris paribus, increasing dependence leading to larger values of H. Then

    (|k″(0)|/c_0²) ω² = κ,   c_0 k•^{0,0} IQ = κ²,   where κ = κ_0 {IQ ω²}^{2/5},   κ_0 = (|k″(0)| (k•^{0,0})²)^{1/5}.

Then

    n^{1/5} ( K(X) − ∫_0^1 σ²(u) du ) →^{Ls} MN(κ, 4κ²).

This shows that both the bias and variance of the realised kernel will increase with the value of the long-run variance of the noise. Interestingly, time-variation in the noise does not, in itself, change the precision of the realised kernel — all that matters is the average level of the long-run variance of the noise. For the Parzen kernel we have κ_0 = 0.97.

3.2.2. Some additional comments

The condition on m is caused by end effects, as these induce a bias in K(U) that is of the order 2m^{−1}ω². Empirically ω² is tiny, so 2m^{−1}ω² will be small even with m = 1, but theoretically this is an important observation.

Assumption D(i) implies Var(D_{N,⌊tN⌋} | F_{τ_{N,⌊tN⌋−1}}) →^{p} υ_2(t) − υ_1²(t), which is non-negative. Thus we have the inequality υ_2(t)/υ_1(t) ≥ υ_1(t), which means that ∫_0^1 σ⁴(u)(υ_2(u)/υ_1(u)) du ≥ ∫_0^1 σ⁴(u) υ_1(u) du. So the asymptotic variance above is higher than for a process with time-varying but nonstochastic durations. The random nature of the durations inflates the asymptotic variance.

The result looks weak compared to the corresponding result for the flat-top kernel K^F(X) introduced by Barndorff-Nielsen et al. (2008) with k′(0) = 0. They had the nicer result that⁴

    n^{1/4} ( K^F(X) − ∫_0^1 σ²(u) du ) →^{Ls} MN( 0, 4 c_0 k•^{0,0} IQ + (8/c_0) k•^{1,1} ω² ∫_0^1 σ²(u) du + (4/c_0³) k•^{2,2} ω⁴ ),   (5)

when H = c_0 n^{1/2}, under the (far more restrictive) assumption that U is white noise. Hence, the implication is that the kernel estimators proposed in this paper will be (asymptotically) inferior to K^F(X) in the special case where U is white noise. The advantage of our estimator, which has H = c_0 n^{3/5}, is that it is based on far more realistic assumptions about the noise. This has the practical implication that K(X) can be applied to prices that are sampled at the highest possible frequency. This point is forcefully illustrated in Section 6.1.2, where we compare the two estimators, K(X) and K^F(X), and show the importance of being robust to endogeneity and serial dependence. A simulation design shows that K(X) is far more accurate than K^F(X) when the noise is serially dependent. Moreover, an extra benefit of constructing our estimator from K is that it ensures positive semi-definiteness. Naturally, one can always truncate an estimator to be psd, for instance by replacing negative eigenvalues with zeros. Still, we find it convenient that the estimator is guaranteed to be psd, because it makes a check for positive definiteness, and a correction for lack thereof, entirely redundant.

⁴ See also Zhang (2006), who independently obtained an n^{1/4} consistent estimator using a multiscale approach.

Having an asymptotic bias term in the asymptotic distribution is familiar from kernel density estimation with the optimal bandwidth. The bias is modest so long as H increases at a faster rate than √n. If k″(0) = 0 we could take H ∝ n^{1/2}, which would result in a faster rate of convergence. However, no weight function with k″(0) = 0 can guarantee a positive semi-definite estimate, see Andrews (1991, p. 832, comment 5). The following lemma rules out an important class of estimators which seems to be attractive to empirical researchers.

Lemma 2. Given U and a kernel function with k′(0) ≠ 0 but otherwise satisfying K. Then, as n, H, m → ∞ we have that

    (H/n) K(U) →^{p} 2|k′(0)| ∫_0^1 {Σ_ζ(u) + γ_0(u)} du.

Remark 4. If k′(0) ≠ 0 then there does not exist a consistent K(X). This rules out, for example, the well-known Bartlett type estimator in this context.

3.2.3. Choosing the bandwidth H and weight function

The relative efficiency of different realised kernels in this class is determined solely by the constant |k″(0)(k•^{0,0})²|^{1/5}, and so can be universally determined for all Brownian semimartingales and noise processes. This constant is computed for a variety of kernel weight functions in Table 1. This shows that the Quadratic Spectral (QS), Parzen and Fejér weight functions are attractive in this context. The optimal weight function minimises |k″(0)(k•^{0,0})²|^{1/5}, which is also the situation for HAC estimators, see Andrews (1991). Thus, using Andrews' analysis of HAC estimators, it follows from our results that the QS kernel is the optimal weight function within the class of weight functions that are guaranteed to produce a non-negative realised kernel estimate. A drawback of the QS and Fejér weight functions is that they, in principle, require n (all) realised autocovariances to be computed, whereas the number of realised autocovariances needed for the Parzen kernel is only H — hence we advocate the use of Parzen weight functions. We will discuss estimating ξ² in Section 3.4.
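The constants in Table 1 follow directly from the stated weight functions. For instance, for the Parzen kernel one can verify |k″(0)| = 12, k•^{0,0} ≈ 0.269, c* ≈ 3.51 and κ_0 ≈ 0.97 numerically; a small sketch (SciPy's quad does the integral):

```python
import numpy as np
from scipy.integrate import quad

def parzen(x):
    if x <= 0.5:
        return 1 - 6 * x**2 + 6 * x**3
    return 2 * (1 - x)**3 if x <= 1 else 0.0

k00 = quad(lambda x: parzen(x)**2, 0, 1)[0]   # k.^{0,0} ~= 0.269
kpp0 = 12.0                                   # |k''(0)| for the Parzen kernel
print(round(k00, 3),
      round((kpp0**2 / k00) ** 0.2, 2),       # c*      ~= 3.51
      round((kpp0 * k00**2) ** 0.2, 2))       # kappa_0 ~= 0.97
```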
Table 1
Properties of some realised kernels.

  Kernel               k(x)                                                              |k″(0)|   k•^{0,0}   c*     |k″(0)(k•^{0,0})²|^{1/5}
  Parzen               1 − 6x² + 6x³ (0 ≤ x ≤ 1/2); 2(1 − x)³ (1/2 ≤ x ≤ 1); 0 (x > 1)   12        0.269      3.51   0.97
  Quadratic spectral   (3/x²)(sin x / x − cos x), x ≥ 0                                  1/5       3π/5       0.46   0.93
  Fejér                (sin x / x)², x ≥ 0                                               2/3       π/3        0.84   0.94
  Tukey–Hanning∞       sin²((π/2) e^{−x}), x ≥ 0                                         π²/2      0.52       2.16   1.06
  BNHLS (2008)         (1 + x) e^{−x}, x ≥ 0                                             1         5/4        0.96   1.09

|k″(0)(k•^{0,0})²|^{1/5} measures the relative asymptotic efficiency of K ∈ K.

3.3. Multivariate asymptotic distribution

To start, we extend the definition of the integrated quarticity to the multivariate context:

    IQ = ∫_0^1 {Σ(u) ⊗ Σ(u)} (υ_2(u)/υ_1(u)) du,

which is a d² × d² random matrix.

Theorem 3. Suppose H = c_0 n^{3/5}, m^{−1} = o(n^{−1/5}), and that K, SH, D, and U hold. Then

    n^{1/5} ( K(X) − ∫_0^1 Σ(u) du ) →^{Ls} MN{ c_0^{−2} |k″(0)| Ω, 4 c_0 k•^{0,0} IQ }.

This is the multivariate extension of Theorem 2, yielding a limit theorem for the consistent multivariate estimator in the presence of noise. The bias is determined by the long-run variance Ω, whereas the variance depends solely on the integrated quarticity.

Corollary 1. An implication of Theorem 3 is that for a, b ∈ ℝ^d we have

    n^{1/5} ( a′ K(X) b − ∫_0^1 a′ Σ(u) b du ) →^{Ls} MN{ c_0^{−2} |k″(0)| a′Ωb, 4 c_0 k•^{0,0} v′_{ab} IQ v_{ab} },

where v_{ab} = vec((ab′ + ba′)/2). For two different elements, a′K(X)b and c′K(X)d say, their asymptotic covariance is given by 4 c_0 k•^{0,0} v′_{ab} IQ v_{cd}.

So once a consistent estimator for IQ is obtained, Corollary 1 makes it straightforward to compute a confidence interval for any element of the integrated variance matrix.

Example 2. In the bivariate case we can write the results as

    n^{1/5} ( K(X^{(i)}) − ∫_0^1 Σ_{ii} du,  K(X^{(i)}, X^{(j)}) − ∫_0^1 Σ_{ij} du,  K(X^{(j)}) − ∫_0^1 Σ_{jj} du )′ →^{Ls} MN(A, B),   (6)

where

    A = c_0^{−2} |k″(0)| (Ω_{ii}, Ω_{ij}, Ω_{jj})′

and

    B = 2 c_0 k•^{0,0} ∫_0^1 [ 2Σ_{ii}²          •                        •
                               2Σ_{ii}Σ_{ij}     Σ_{ii}Σ_{jj} + Σ_{ij}²   •
                               2Σ_{ij}²          2Σ_{jj}Σ_{ij}            2Σ_{jj}² ] (υ_2/υ_1) du

(entries marked • follow by symmetry), which has features in common with the noiseless case discussed in Barndorff-Nielsen and Shephard (2004, Eq. 18). By the delta method we can deduce the asymptotic distribution of the kernel based regression and correlation (extending the work of, for example, Andersen et al. (2003), Barndorff-Nielsen and Shephard (2004) and Dovonon et al. (forthcoming)). For example, with β^{(i,j)} = ∫_0^1 Σ_{ij} du / ∫_0^1 Σ_{jj} du,

    n^{1/5} ( K(X^{(i)}, X^{(j)}) / K(X^{(j)}) − β^{(i,j)} ) →^{Ls} MN(A, B),

where

    A = ( c_0^{−2} |k″(0)| / ∫_0^1 Σ_{jj} du ) (Ω_{ij} − Ω_{jj} β^{(i,j)})

and

    B = ( 2 c_0 k•^{0,0} / (∫_0^1 Σ_{jj} du)² ) (1, −β^{(i,j)}) ∫_0^1 [ Σ_{ii}Σ_{jj} + Σ_{ij}²   2Σ_{jj}Σ_{ij}
                                                                       2Σ_{jj}Σ_{ij}            2Σ_{jj}² ] (υ_2/υ_1) du (1, −β^{(i,j)})′.

3.4. Practical issues: choice of H

A main feature of multivariate kernels is that there is a single bandwidth parameter H which controls the number of leads and lags used for all the series. It must grow with n at rate n^{3/5}; the key question here is how to estimate a good constant of proportionality — which controls the efficiency of the procedure.

If we applied the univariate optimal mean square error bandwidth selection to each asset price individually, we would get d bandwidths H^{(i)} = c* ξ_i^{4/5} n^{3/5}, where c* = {k″(0)²/k•^{0,0}}^{1/5} and ξ_i² = Ω_{ii}/√IQ_{ii}, where Σ_{ii}(u) is the spot variance of the i-th asset. In practice we usually approximate √IQ_{ii} by ∫_0^1 Σ_{ii}(u) du, which can be estimated relatively easily by using a low frequency estimate of ∫_0^1 Σ_{ii}(u) du and one of many sensible estimators of Ω_{ii} which use high frequency data. Then we could construct some ad hoc rules for choosing the global H, such as H_min = min(H^{(1)}, ..., H^{(d)}), H_max = max(H^{(1)}, ..., H^{(d)}), or H̄ = d^{−1} Σ_{i=1}^{d} H^{(i)}, or many others. In our empirical work we have used H̄, while our web Appendix provides an analysis of the impact of this choice.

An interesting alternative is to optimise the problem for a portfolio, e.g. letting ι be a d-dimensional vector of ones, then d^{−2} ι′K(X)ι = K(d^{−1}ι′X), which is like a ''market portfolio'' if X contains many assets. This is easy to carry out, for having converted everything into Refresh Time one computes the market (ι′X/ι′ι) return and then carries out a univariate analysis on it, choosing an optimal H for the market. This single H is then applied to the multivariate problem.
From the results in Example 2 it is straightforward to derive the optimal choice for H when the objective is to estimate a covariance, a correlation, the inverse covariance matrix (which is important for portfolio choice) or β^{(i,j)}. For β^{(1,2)} the trade-off is between

    c_0^{−4} |k″(0)|² (Ω_{12} − Ω_{22} β^{(1,2)})²   and   2 c_0 k•^{0,0} ∫_0^1 ( Σ_{11}Σ_{22} + Σ_{12}² − 4β^{(1,2)} Σ_{22}Σ_{12} + 2β^{(1,2)2} Σ_{22}² ) (υ_2/υ_1) du.
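In practice the univariate bandwidths can be computed from two realised variances, as in Section 5 below: a dense-frequency RV supplies the noise estimate ω̂² = RV_dense/(2n) and a sparse RV stands in for √IQ. A sketch of this feasible rule and the global choice H̄; the function and variable names are ours, not the paper's:

```python
import numpy as np

def bandwidth(dense_returns, rv_sparse, c_star=3.51):
    """Feasible H = c* xi^{4/5} n^{3/5} for one asset (Parzen c* = 3.51).
    dense_returns: highest-frequency returns; rv_sparse: a low-frequency
    realised variance standing in for sqrt(IQ)."""
    n = dense_returns.size
    omega2 = np.sum(dense_returns**2) / (2 * n)   # noise variance estimate
    xi2 = omega2 / rv_sparse                      # xi^2 = omega^2 / sqrt(IQ)
    return max(1, int(np.ceil(c_star * xi2**0.4 * n**0.6)))

# a global choice: H_bar, the average of the univariate bandwidths
# H_bar = int(np.mean([bandwidth(r, rv) for r, rv in zip(returns, rvs)]))
```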
4. Simulation study

So far the analysis has been asymptotic, as n → ∞. Here we carry out a simulation analysis to assess the accuracy of the asymptotic predictions in finite samples. We simulate over the interval t ∈ [0, 1]. The following multivariate factor stochastic volatility model is used:

    dY^{(i)} = µ^{(i)} dt + dV^{(i)} + dF^{(i)},   dV^{(i)} = ρ^{(i)} σ^{(i)} dB^{(i)},   dF^{(i)} = √(1 − (ρ^{(i)})²) σ^{(i)} dW,

where the elements of B are independent standard Brownian motions and W ⊥⊥ B. Here F is the common factor, whose strength is determined by 1 − (ρ^{(i)})².

This model means that each Y^{(i)} is a diffusive SV model with constant drift µ^{(i)} and random spot volatility σ^{(i)}. In turn the spot volatility obeys the independent processes σ^{(i)} = exp(β_0^{(i)} + β_1^{(i)} ϱ^{(i)}) with dϱ^{(i)} = α^{(i)} ϱ^{(i)} dt + dB^{(i)}. Thus there is perfect statistical leverage (correlation between their innovations) between V^{(i)} and σ^{(i)}, while the leverage between Y^{(i)} and ϱ^{(i)} is ρ^{(i)}. The correlation between Y^{(1)}(t) and Y^{(2)}(t) is √(1 − (ρ^{(1)})²) √(1 − (ρ^{(2)})²).

The price process is simulated via an Euler scheme,⁵ using the fact that the OU process has an exact discretisation (e.g. Glasserman (2004, p. 110)). Our simulations are based on the following configuration: (µ^{(i)}, β_0^{(i)}, β_1^{(i)}, α^{(i)}, ρ^{(i)}) = (0.03, −5/16, 1/8, −1/40, −0.3), so that β_0^{(i)} = (β_1^{(i)})²/(2α^{(i)}). Throughout we have imposed that E(∫_0^1 σ^{(i)2}(u) du) = 1. The stationary distribution of ϱ^{(i)} is utilised in our simulations to restart the process each day at ϱ^{(i)}(0) ~ N(0, (−2α^{(i)})^{−1}). For our design we have that the variance of σ² is exp(−2(β_1^{(i)})²/α^{(i)}) − 1 ≃ 2.5. This is comparable to the empirical results found in e.g. Hansen and Lunde (2005), which motivates our choice for α^{(i)}. We add noise simulated as

    U_j^{(i)} | σ, Y ~ i.i.d. N(0, ω²)   with ω² = ξ² √( N^{−1} Σ_{j=1}^{N} σ^{(i)4}(j/N) ),

where the noise-to-signal ratio ξ² takes the values 0, 0.001 and 0.01. This means that the variance of the noise increases with the volatility of the efficient price (e.g. Bandi and Russell (2006)).

To model the non-synchronously spaced data we use two independent Poisson process sampling schemes to generate the times of the actual observations {t_j^{(i)}} to which we apply our realised kernel. We control the two Poisson processes by λ = (λ_1, λ_2), such that, for example, λ = (5, 10) means that on average X^{(1)} and X^{(2)} are observed every 5 and 10 s, respectively. This means that the simulated number of observations will differ between repetitions, but on average the processes will have 23,400/λ_1 and 23,400/λ_2 observations, respectively.
⁵ We normalise one second to be 1/23,400, so that the interval [0, 1] contains 6.5 h. In generating the observed price, we discretise [0, 1] into N = 23,400 intervals.
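A condensed sketch of this design for a single asset and one day, under the configuration above. Two simplifications are ours: Euler discretisation is used for ϱ as well as the price (rather than the exact OU update), and Bernoulli thinning stands in for the Poisson sampling scheme.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 23_400                                    # one 6.5 hour day in seconds
dt = 1.0 / N
mu, b0, b1, alpha, rho = 0.03, -5 / 16, 1 / 8, -1 / 40, -0.3

dB = rng.normal(0.0, np.sqrt(dt), N)
dW = rng.normal(0.0, np.sqrt(dt), N)          # W independent of B
q = np.empty(N + 1)                           # the OU process, varrho in the text
q[0] = rng.normal(0.0, np.sqrt(-1 / (2 * alpha)))   # stationary daily restart
for t in range(N):
    q[t + 1] = q[t] + alpha * q[t] * dt + dB[t]     # Euler step, driven by B
sigma = np.exp(b0 + b1 * q[:-1])              # spot volatility
Y = np.cumsum(mu * dt + sigma * (rho * dB + np.sqrt(1 - rho**2) * dW))

lam, xi2 = 5, 0.001                           # mean 5 s between observations
obs = np.flatnonzero(rng.random(N) < 1 / lam) # Bernoulli stand-in for Poisson times
omega2 = xi2 * np.sqrt(np.mean(sigma**4))     # noise variance, as in the text
X = Y[obs] + rng.normal(0.0, np.sqrt(omega2), obs.size)
```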
We vary λ through the following configurations: (3, 6), (5, 10), (10, 20), (15, 30), (30, 60), (60, 120), motivated by the kind of data we see in databases of equity prices.

In order to calculate K(X) we need to select H. To do this we evaluate ω̂_δ^{(i)2} = [X_δ^{(i)}](1)/(2n) and [X_{15 min}^{(i)}](1), the realised variance estimator based on 15 min returns. These give us the feasible values Ĥ^{(i)} = c n^{3/5} ( ω̂_δ^{(i)2} / [X_{15 min}^{(i)}](1) )^{2/5}. The results for H_mean are presented in Table 2.

Panel A of the table reports the univariate results for estimating the integrated variance. We give the bias and root mean square error (root MSE) for the realised kernel and compare it to the standard realised variance. In the no noise case of ξ² = 0 the RV statistic is quite a bit more precise, especially when n is large. The positive bias of the realised kernel can be seen when ξ² is quite large, but it is small compared to the estimator's variance. In that situation the realised kernel is far more precise than the realised variance. None of these results is surprising or novel.

In Panel B we break new ground, as it focuses on estimating the integrated covariance. We compare the realised kernel estimator with a realised covariance. The high frequency realised covariance is a very precise estimator of the wrong quantity, as its bias is very close to its very large mean square error. In this case its bias does not really change very much as n increases. The realised kernel delivers a very precise estimator of the integrated covariance. It is downward biased due to the non-synchronous data, but the bias is very modest when n is large, and its sampling variance dominates the root MSE. Taken together this implies the realised kernel estimators of the correlation and regression (beta) are strongly negatively biased — which is due to their being non-linear functions of the noisy estimates of the integrated variance. The bias is the dominant component of the root MSE in the correlation case.

5. Empirical illustration

We analyse high-frequency asset prices for thirty assets.⁶ In the analysis the main focus will be on the empirical properties of 30 × 30 realised kernel estimates. To conserve space we will only present detailed results for a 10 × 10 submatrix of the full 30 × 30 matrix. The ten assets we will focus on are Alcoa Inc. (AA), American International Group Inc. (AIG), American Express Co. (AXP), Boeing Co. (BA), Bank of America Corp. (BAC), Citigroup Inc. (C), Caterpillar Inc. (CAT), Chevron Corp. (CVX), E.I. du Pont de Nemours & Co. (DD), and Standard & Poor's Depository Receipt (SPY). The SPY is an exchange-traded fund that holds all of the S&P 500 Index stocks and has enormous liquidity.

The sample period runs from January 3, 2002 to July 31, 2008, delivering 1503 distinct days. The data is the collection of trades and quotes recorded on the NYSE, taken from the TAQ database through the Wharton Research Data Services (WRDS) system. We present empirical results for both transaction and mid-quote prices. Throughout our analysis we will estimate quantities each day, in the tradition of the realised volatility literature following Andersen et al. (2001) and Barndorff-Nielsen and Shephard (2002). This means the target becomes functions of [Y]_s = [Y](s) − [Y](s − 1), s ∈ ℕ. The functions we will deal with are covariances, correlations and betas.

5.1. Procedure for cleaning the high-frequency data

Careful data cleaning is one of the most important aspects of volatility estimation from high-frequency data.

⁶ The ticker symbols of these assets are AA, AIG, AXP, BA, BAC, C, CAT, CVX, DD, DIS, GE, GM, HD, IBM, INTC, JNJ, JPM, KO, MCD, MMM, MRK, MSFT, PFE, PG, SPY, T, UTX, VZ, WMT, and XOM.
Table 2
Simulation results.

Panel A: Integrated variance

                     Series A                                 Series B
                     K(X)             RV_1m     RV_15m        K(X)             RV_1m     RV_15m
                     Bias     R.mse   R.mse     R.mse         Bias     R.mse   R.mse     R.mse
ξ² = 0.0
  λ = (3, 6)         0.006    0.147   0.122     0.436         0.003    0.134   0.113     0.505
  λ = (10, 20)       0.011    0.262   0.114     0.450         0.011    0.224   0.111     0.547
  λ = (60, 120)      0.003    0.557   0.227     0.517         0.001    0.490   0.229     0.504
ξ² = 0.001
  λ = (3, 6)         0.040    0.253   1.417     0.488         0.033    0.215   1.509     0.654
  λ = (10, 20)       0.041    0.359   1.318     0.492         0.035    0.295   1.432     0.660
  λ = (60, 120)      0.014    0.557   0.636     0.554         0.013    0.551   1.013     0.559
ξ² = 0.01
  λ = (3, 6)         0.096    0.410   13.67     1.168         0.084    0.351   14.39     1.531
  λ = (10, 20)       0.106    0.568   13.15     1.305         0.081    0.424   14.01     1.452
  λ = (60, 120)      0.077    0.611   5.386     1.322         0.080    0.776   8.893     1.222

Panel B: Integrated covariance/correlation

                  #rets   Cov_1m              Cov_15m             K(X) Covar          K(X) Corr           K(X) beta
                          Bias      R.mse     Bias      R.mse     Bias      R.mse     Bias      R.mse     Bias      R.mse
ξ² = 0.0
  λ = (3, 6)      3121    −0.051    0.076     −0.004    0.183     −0.007    0.062     −0.012    0.016     −0.016    0.061
  λ = (5, 10)     1921    −0.085    0.108     −0.006    0.183     −0.009    0.076     −0.015    0.020     −0.019    0.064
  λ = (10, 20)    982     −0.160    0.186     −0.011    0.186     −0.009    0.097     −0.018    0.026     −0.023    0.084
  λ = (30, 60)    332     −0.342    0.395     −0.038    0.188     −0.021    0.142     −0.028    0.042     −0.035    0.125
  λ = (60, 120)   166     −0.445    0.510     −0.071    0.203     −0.034    0.189     −0.036    0.054     −0.035    0.178
ξ² = 0.001
  λ = (3, 6)      3121    −0.046    0.091     −0.005    0.191     −0.000    0.090     −0.027    0.032     −0.034    0.085
  λ = (5, 10)     1921    −0.082    0.123     −0.006    0.186     −0.002    0.099     −0.029    0.036     −0.033    0.083
  λ = (10, 20)    982     −0.156    0.189     −0.010    0.195     −0.004    0.118     −0.032    0.040     −0.042    0.111
  λ = (30, 60)    332     −0.344    0.400     −0.039    0.187     −0.019    0.150     −0.039    0.052     −0.049    0.153
  λ = (60, 120)   166     −0.445    0.513     −0.074    0.206     −0.034    0.195     −0.044    0.060     −0.049    0.204
ξ² = 0.01
  λ = (3, 6)      3121    −0.027    0.398     −0.009    0.263     0.000     0.123     −0.063    0.071     −0.072    0.132
  λ = (5, 10)     1921    −0.073    0.431     −0.005    0.257     −0.002    0.133     −0.067    0.076     −0.082    0.149
  λ = (10, 20)    982     −0.139    0.407     −0.001    0.263     −0.005    0.153     −0.074    0.084     −0.099    0.198
  λ = (30, 60)    332     −0.354    0.486     −0.044    0.236     −0.017    0.180     −0.089    0.104     −0.119    0.242
  λ = (60, 120)   166     −0.451    0.561     −0.083    0.265     −0.032    0.222     −0.092    0.111     −0.120    0.310

Simulation results for the realised kernel using a factor SV model with non-synchronous observations and measurement noise. Panel A looks at estimating the integrated variance using the realised variance and the Parzen type realised kernel K(X). Panel B looks at estimating the integrated covariance and correlation using the realised covariance and the realised kernel. Bias and root mean square error are reported. The results are based on 1000 repetitions.
Numerous problems and solutions are discussed in Falkenberry (2001), Hansen and Lunde (2006), Brownlees and Gallo (2006) and Barndorff-Nielsen et al. (2009). In this paper we follow the step-by-step cleaning procedure used in Barndorff-Nielsen et al. (2009), who discuss in detail the various choices available and their impact on univariate realised kernels. For convenience we briefly review these steps.

All data. (P1) Delete entries with a timestamp outside the 9:30 a.m.–4 p.m. window when the exchange is open. (P2) Delete entries with a bid, ask or transaction price equal to zero. (P3) Retain entries originating from a single exchange (NYSE, except INTC and MSFT from NASDAQ and SPY, for which all retained observations are from Pacific). Delete other entries.

Quote data only. (Q1) When multiple quotes have the same timestamp, replace them with a single entry with the median bid and median ask price. (Q2) Delete rows for which the spread is negative. (Q3) Delete rows for which the spread is more than 10 times the median spread on that day. (Q4) Delete rows for which the mid-quote deviated by more than 10 mean absolute deviations from a centred median (excluding the observation under consideration) of 50 observations.

Trade data only. (T1) Delete entries with corrected trades (trades with a Correction Indicator, CORR ≠ 0). (T2) Delete entries with an abnormal Sale Condition (trades where COND has a letter code, except for ''E'' and ''F''). (T3) If multiple transactions have the same time stamp, use the median price. (T4) Delete entries with prices above the ask plus the bid–ask spread; similarly for entries with prices below the bid minus the bid–ask spread.

We note that steps P2, T1, T2, T4, Q2, Q3 and Q4 collectively reduce the sample size by less than 1%.

5.2. Sampling schemes

We applied three different sampling schemes, depending on the particular estimator. The simplest one is the estimator of Hayashi and Yoshida (2005), which uses all the available observations for a particular asset combination. Following Andersen et al. (2003), the realised covariation estimator is based on calendar time sampling. Specifically, we consider 15 s, 5 min, and 30 min intraday returns, aligned using the previous tick approach. This results in 1560, 78 and 13 daily observations, respectively. For the realised kernel the Refresh Time sampling scheme discussed in Section 2.1 is used.

In our analysis we present estimates for the upper left 10 × 10 block of the full 30 × 30 integrated covariance matrix. The estimates are constructed using three different sampling schemes: (a) Refresh Time sampling applied to the full set of DJ stocks, (b) Refresh Time sampling applied to only the 10 stocks that we focus on, and (c) Refresh Time sampling applied to each unique pair of assets. So in our analysis we will present three sets of realised kernel estimates of the elements of the integrated covariance matrix: one set that comes from a 30 × 1 vector of returns, the same set estimated using only the required 10 × 1 vector of returns, and finally a set constructed from the 45 distinct 2 × 2 covariance matrix estimates. Note that the first two estimators are positive semi-definite by construction, while the latter is not guaranteed to be so. We compute these covariance matrix estimates for each day in our sample.
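A sketch of the previous-tick alignment used for the calendar-time estimators, together with the realised covariance it feeds; variable names and the seconds-based grid are illustrative, not the paper's code:

```python
import numpy as np

def previous_tick(times, prices, grid):
    """Previous-tick alignment: latest price at or before each grid point."""
    idx = np.searchsorted(times, grid, side="right") - 1
    return prices[np.clip(idx, 0, None)]   # reuse the first price if none yet

def realised_cov(prices):
    """Realised covariance from an (n+1, d) array of aligned log-prices."""
    x = np.diff(prices, axis=0)
    return x.T @ x

grid = np.arange(300, 23_401, 300)         # a 5 min grid over a 6.5 h day (s)
# P = np.column_stack([previous_tick(t_i, p_i, grid) for t_i, p_i in series])
# cov_5m = realised_cov(P)
```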
Table 3
Summary statistics for the refresh sampling scheme, 2 × 2 case.

          AA      AIG     AXP     BA      BAC     C       CAT     CVX     DD      SPY
  AA      –       0.589   0.599   0.595   0.586   0.565   0.593   0.582   0.602   0.574
  AIG     0.635   –       0.607   0.596   0.616   0.608   0.584   0.605   0.598   0.601
  AXP     0.639   0.651   –       0.613   0.617   0.595   0.596   0.594   0.619   0.584
  BA      0.634   0.642   0.652   –       0.596   0.577   0.595   0.591   0.609   0.582
  BAC     0.636   0.656   0.662   0.649   –       0.631   0.579   0.612   0.597   0.613
  C       0.635   0.657   0.660   0.647   0.680   –       0.554   0.610   0.575   0.619
  CAT     0.630   0.631   0.641   0.636   0.633   0.630   –       0.575   0.597   0.569
  CVX     0.641   0.657   0.659   0.651   0.671   0.675   0.634   –       0.588   0.618
  DD      0.639   0.646   0.656   0.649   0.653   0.651   0.639   0.652   –       0.575
  SPY     0.609   0.642   0.630   0.622   0.667   0.685   0.602   0.670   0.625   –

Average over daily number of high frequency observations available before the refresh time transformation:

          AA      AIG     AXP     BA      BAC     C       CAT     CVX     DD      SPY
  Trades  3442    4228    3461    3529    4,544   5,480   3330    4,845   3307    5,412
  Quotes  8460    9270    8626    8553    10,091  10,809  8026    10,254  8521    15,973

Summary statistics for the refresh sampling scheme. In the upper panel we present averages over the daily data of the fraction of data maintained by the refresh sampling scheme, measured by p = dN / Σ_{i=1}^{d} n^{(i)}, in the 2 × 2 case. The upper diagonal is based on transaction prices, whereas the lower diagonal is based on mid-quotes. In the lower panel we average over the daily number of high frequency observations.
The fraction of the data retained by constructing Refresh Time is recorded in Table 3 for each of the 45 distinct 2 × 2 matrices. It records the average of the daily p statistics defined in Section 2.1 for each pair. It emerges that we never lose more than half the observations for the most frequently traded assets. For the least active assets we typically lose between 30% and 40% of the observations.

For the 10 × 1 case the data loss is more pronounced. Still, on average more than 25% of the observations remain in the sample. For transaction data the average number of Refresh Time observations is 1470, whereas the corresponding number is 4491 for the quote data. So in most cases we have an observation on average more often than every 5 s for quote data and 15 s for trade data. We observed that the data loss levels off as the dimension increases. For the 30 × 1 case we have on average more than 17% of the observations remaining in the sample. For transaction data the average number of Refresh Time observations is 966, and 2978 for the quote data. This gives an observation on average more often than every 8 s for quote data and 24 s for trade data.
5.3. Analysis of the covariance estimators $\mathrm{Cov}^{K}_s$, $\mathrm{Cov}^{HY}_s$, $\mathrm{Cov}^{1m}_s$ and $\mathrm{Cov}^{OtoC}_s$

Throughout this subsection the target we wish to estimate is $[Y^{(i)}, Y^{(j)}]_s$, $i, j = 1, 2, \ldots, d$, $s \in \mathbb{N}$. In what follows the pair $i, j$ will only be referred to implicitly. All kernels are computed with Parzen weights. We compute the realised kernel for the full 30-dimensional vector, the 10-dimensional sub-vector and (all possible) pairs of the ten assets. The resulting estimates of $[Y^{(i)}, Y^{(j)}]_s$ are denoted by $\mathrm{Cov}^{K30\times30}_s$, $\mathrm{Cov}^{K10\times10}_s$ and $\mathrm{Cov}^{K2\times2}_s$, respectively. These estimators differ in a number of ways, such as the bandwidth selection and the sampling times (due to the construction of Refresh Time). To provide useful benchmarks for these estimators we also compute: $\mathrm{Cov}^{HY}_s$, the Hayashi and Yoshida (2005) covariance estimator; $\mathrm{Cov}^{1}_s$, the realised covariance based on intraday returns that span an interval of length 1, e.g. 5 or 30 min (the previous-tick method is used); and $\mathrm{Cov}^{OtoC}_s$, the outer product of the open-to-close returns, which when averaged over many days provides an estimator of the average covariance between asset returns.

We start the empirical analysis of our covariance estimators by recalling the main statistical impact of market microstructure and the Epps effect. Table 4 contains the time series average covariance computed using the Hayashi and Yoshida (2005) estimator $\mathrm{Cov}^{HY}_s$ and the open-to-close estimator $\mathrm{Cov}^{OtoC}_s$. Quite a few tables of this type will be presented, and they all have the same structure: the numbers above the leading diagonal are results from trade data, the numbers below are from mid-quotes. It is interesting to note that the $\mathrm{Cov}^{HY}_s$ estimates are typically much lower than the corresponding $\mathrm{Cov}^{OtoC}_s$ estimates. Numbers in bold font indicate estimates that are significantly different from $\mathrm{Cov}^{OtoC}_s$ at the one percent level. This assessment is carried out in the following way. For a given estimator, e.g. $\mathrm{Cov}^{K2\times2}_s$, we consider the difference $e_s = \mathrm{Cov}^{K2\times2}_s - \mathrm{Cov}^{OtoC}_s$, and compute the sample bias as $\bar{e}$ and the robust (HAC) variance as
$$s^2_\eta = \gamma_0 + 2\sum_{h=1}^{q}\left(1 - \frac{h}{q+1}\right)\gamma_h, \qquad \gamma_h = \frac{1}{T-h}\sum_{s=h+1}^{T}\eta_s\eta_{s-h},$$
where $\eta_s = e_s - \bar{e}$ and $q = \mathrm{int}\{4(T/100)^{2/9}\}$. The number is boldfaced if $|\sqrt{T}\,\bar{e}/s_\eta| > 2.326$. The results in Table 4 indicate that $\mathrm{Cov}^{HY}_s$ is severely downward biased. Every covariance estimator for every pair of assets, for both trades and quotes, is statistically significantly biased.
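The bias assessment just described is mechanical enough to state in code. The sketch below is our illustration of the procedure in the text, with hypothetical names; `cov_est` and `cov_otoc` are daily series of estimates for one asset pair.

```python
import numpy as np

def bias_test(cov_est, cov_otoc):
    """Studentised bias test of Section 5.3: daily differences e_s,
    sample bias e_bar, HAC variance with Bartlett weights and
    q = int{4 (T/100)^(2/9)} lags, and the statistic sqrt(T) e_bar / s_eta,
    boldfaced in the tables when its absolute value exceeds 2.326."""
    e = np.asarray(cov_est) - np.asarray(cov_otoc)
    T = len(e)
    eta = e - e.mean()
    q = int(4 * (T / 100) ** (2 / 9))
    s2 = eta @ eta / T                              # gamma_0
    for h in range(1, q + 1):
        gamma_h = eta[h:] @ eta[:-h] / (T - h)      # lag-h autocovariance
        s2 += 2 * (1 - h / (q + 1)) * gamma_h
    return e.mean(), np.sqrt(T) * e.mean() / np.sqrt(s2)
```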
5.4. Results for $\mathrm{Cov}^{K30\times30}_s$, $\mathrm{Cov}^{K10\times10}_s$ and $\mathrm{Cov}^{K2\times2}_s$

We now move on to more successful estimators. The upper panel of Table 5 presents the time series average estimates for $\mathrm{Cov}^{K30\times30}_s$, the middle panel for $\mathrm{Cov}^{K10\times10}_s$, and the lower panel gives results for $\mathrm{Cov}^{K2\times2}_s$. The diagonal elements are the estimates based on transactions. Off-diagonal numbers are boldfaced if they are significantly biased (compared to $\mathrm{Cov}^{OtoC}_s$) at the 1% level. These results are quite encouraging for all three estimators. The average levels of the three estimators are roughly the same.

A much tougher comparison is to replace the noisy $e_s = \mathrm{Cov}^{K}_s - \mathrm{Cov}^{OtoC}_s$ with $e'_s = \mathrm{Cov}^{Kd\times d}_s - \mathrm{Cov}^{Kd'\times d'}_s$, where the two estimates come from applying the realised kernel to price vectors of dimension $d$ and $d'$. Our tests then ask whether there is a significant difference in the averages. The results reported in our web appendix suggest very little difference in the level of the three realised kernel estimators. When we compute the same test based on $e_s = \mathrm{Cov}^{Kd\times d}_s - \mathrm{Cov}^{5m}_s$ we find that the realised kernels and the realised covariances based on 5 min returns are also quite similar. The result in that analysis is reinforced by the information in the summary Table 6, which shows results averaged over all asset pairs for both trades and quotes. The results are not very different for most estimators as we move from trades to quotes; the counter-example is $\mathrm{Cov}^{HY}_s$, which is sensitive to this.

The table shows that $\mathrm{Cov}^{K30\times30}_s$, $\mathrm{Cov}^{K10\times10}_s$ and $\mathrm{Cov}^{K2\times2}_s$ have roughly the same average value, which is slightly below $\mathrm{Cov}^{OtoC}_s$. $\mathrm{Cov}^{K2\times2}_s$ has a seven times smaller variance than $\mathrm{Cov}^{OtoC}_s$, which shows it is a lot more precise. Of course integrated variance is itself random,
Table 4
Average high frequency realised covariance and open to close covariance.

Average of Hayashi–Yoshida covariances (all times)
        AA     AIG    AXP    BA     BAC    C      CAT    CVX    DD     SPY
AA      4.331  0.225  0.251  0.197  0.198  0.237  0.227  0.194  0.240  0.149
AIG     0.415  2.904  0.315  0.206  0.275  0.305  0.224  0.196  0.225  0.152
AXP     0.484  0.612  3.383  0.230  0.299  0.319  0.254  0.219  0.251  0.179
BA      0.383  0.387  0.447  3.274  0.177  0.199  0.210  0.176  0.206  0.143
BAC     0.372  0.520  0.580  0.344  2.466  0.303  0.197  0.180  0.196  0.142
C       0.446  0.597  0.661  0.400  0.596  4.014  0.220  0.190  0.224  0.147
CAT     0.424  0.415  0.476  0.393  0.374  0.431  2.274  0.187  0.222  0.151
CVX     0.360  0.332  0.382  0.319  0.315  0.354  0.332  2.192  0.186  0.141
DD      0.465  0.415  0.477  0.389  0.375  0.430  0.413  0.340  2.643  0.142
SPY     0.349  0.361  0.430  0.336  0.345  0.382  0.341  0.305  0.342  1.068

Open-to-close covariance
        AA     AIG    AXP    BA     BAC    C      CAT    CVX    DD     SPY
AA      –      1.092  1.220  1.075  1.015  1.265  1.365  1.026  1.211  1.029
AIG     1.100  –      1.763  0.976  1.804  2.088  1.045  0.671  1.070  1.120
AXP     1.228  1.776  –      1.049  1.653  2.053  1.178  0.862  1.092  1.177
BA      1.091  0.978  1.045  –      0.887  1.077  0.981  0.627  0.850  0.816
BAC     1.019  1.808  1.661  0.882  –      2.041  0.965  0.624  0.923  0.985
C       1.267  2.076  2.055  1.071  2.036  –      1.233  0.824  1.134  1.268
CAT     1.377  1.062  1.191  0.990  0.967  1.230  –      0.771  0.996  0.927
CVX     1.027  0.675  0.853  0.630  0.619  0.819  0.768  –      0.656  0.708
DD      1.219  1.076  1.101  0.850  0.927  1.134  1.005  0.658  –      0.829
SPY     1.036  1.120  1.180  0.820  0.988  1.261  0.933  0.705  0.829  –

The upper panel presents average estimates for $\mathrm{Cov}^{HY}_s$ and the lower panel displays these for $\mathrm{Cov}^{OtoC}_s$. In both panels the upper diagonal is based on transaction prices, whereas the lower diagonal is based on mid-quotes. The diagonal elements are computed with transaction prices. In the upper panel numbers outside the diagonal are boldfaced if the bias is significant at the 1% level.
Table 5
Average for alternative integrated covariance estimators.

Average of Parzen covariances (30 × 30)
        AA     AIG    AXP    BA     BAC    C      CAT    CVX    DD     SPY
AA      3.338  0.933  1.003  0.792  0.850  1.102  1.028  0.822  1.049  0.862
AIG     0.945  2.725  1.273  0.730  1.226  1.528  0.841  0.534  0.813  0.876
AXP     1.033  1.300  2.580  0.773  1.305  1.627  0.939  0.570  0.890  0.918
BA      0.808  0.722  0.780  2.194  0.658  0.840  0.760  0.500  0.706  0.678
BAC     0.865  1.254  1.326  0.654  2.057  1.649  0.792  0.489  0.759  0.808
C       1.118  1.556  1.656  0.831  1.681  2.942  1.001  0.640  0.984  1.058
CAT     1.055  0.856  0.974  0.765  0.805  1.011  2.225  0.607  0.874  0.775
CVX     0.818  0.533  0.578  0.488  0.485  0.638  0.603  1.655  0.589  0.602
DD      1.047  0.820  0.900  0.702  0.768  0.986  0.882  0.576  1.810  0.744
SPY     0.862  0.878  0.930  0.670  0.811  1.057  0.776  0.589  0.736  0.745

Average of Parzen covariances (10 × 10)
        AA     AIG    AXP    BA     BAC    C      CAT    CVX    DD     SPY
AA      3.400  0.931  0.995  0.798  0.844  1.101  1.019  0.824  1.045  0.865
AIG     0.953  2.766  1.271  0.739  1.204  1.518  0.843  0.544  0.819  0.876
AXP     1.024  1.298  2.593  0.783  1.290  1.616  0.931  0.583  0.892  0.917
BA      0.808  0.734  0.778  2.222  0.660  0.854  0.767  0.512  0.719  0.688
BAC     0.863  1.248  1.315  0.656  2.080  1.625  0.785  0.501  0.755  0.806
C       1.121  1.553  1.642  0.845  1.669  2.985  1.003  0.656  0.991  1.061
CAT     1.040  0.860  0.957  0.772  0.799  1.013  2.234  0.619  0.876  0.779
CVX     0.825  0.544  0.581  0.501  0.493  0.653  0.613  1.682  0.603  0.616
DD      1.046  0.829  0.900  0.712  0.765  0.992  0.879  0.592  1.850  0.752
SPY     0.868  0.885  0.926  0.679  0.811  1.062  0.779  0.603  0.745  0.755

Average of Parzen covariances (2 × 2)
        AA     AIG    AXP    BA     BAC    C      CAT    CVX    DD     SPY
AA      3.531  0.912  0.956  0.793  0.820  1.078  0.975  0.810  1.018  0.853
AIG     0.931  2.803  1.238  0.773  1.156  1.469  0.851  0.595  0.842  0.875
AXP     0.986  1.273  2.707  0.825  1.237  1.568  0.914  0.642  0.904  0.910
BA      0.812  0.775  0.832  2.371  0.683  0.893  0.794  0.577  0.759  0.726
BAC     0.837  1.180  1.268  0.679  2.096  1.550  0.775  0.553  0.760  0.795
C       1.098  1.490  1.599  0.892  1.565  3.108  1.005  0.723  0.999  1.054
CAT     1.004  0.863  0.945  0.800  0.786  1.011  2.299  0.648  0.873  0.791
CVX     0.824  0.583  0.633  0.560  0.541  0.719  0.642  1.738  0.644  0.664
DD      1.034  0.847  0.919  0.753  0.763  0.995  0.883  0.637  1.961  0.768
SPY     0.867  0.891  0.941  0.731  0.811  1.066  0.798  0.662  0.774  0.783

The upper panel presents average estimates for $\mathrm{Cov}^{K30\times30}_s$, the middle panel for $\mathrm{Cov}^{K10\times10}_s$, and the lower panel gives results for $\mathrm{Cov}^{K2\times2}_s$. In all panels the upper diagonal is based on transaction prices, whereas the lower diagonal is based on mid-quotes. The diagonal elements are computed with transaction prices. Outside the diagonals, numbers are boldfaced if the bias is significant at the 1% level.
Table 6
Summary statistics across all asset pairs.

Transaction prices

Summary stats for covariances
Estimator    Average [HAC]    Stdev  Bias    cor(.,K)  acf1  acf2  acf3  acf4  acf5  acf10
CovK30×30    0.8844 [0.089]   1.607  −0.229  1.000     0.67  0.58  0.52  0.45  0.44  0.35
CovK10×10    0.8862 [0.089]   1.596  −0.227  0.992     0.69  0.61  0.54  0.47  0.45  0.36
CovK2×2      0.8900 [0.088]   1.518  −0.223  0.960     0.75  0.66  0.59  0.53  0.51  0.42
CovHY        0.2113 [0.022]   0.362  −0.902  0.767     0.80  0.73  0.68  0.64  0.62  0.53
Cov1/4m      0.4534 [0.050]   0.805  −0.660  0.660     0.84  0.74  0.68  0.64  0.61  0.52
Cov5m        0.8505 [0.085]   1.511  −0.262  0.942     0.71  0.62  0.54  0.48  0.46  0.37
Cov30m       0.9049 [0.091]   1.838  −0.208  0.866     0.49  0.46  0.37  0.34  0.35  0.25
Cov3h        0.9566 [0.105]   2.659  −0.156  0.640     0.22  0.25  0.20  0.17  0.16  0.15
CovOtoC      1.1116 [0.150]   4.255  –       0.508     0.12  0.15  0.17  0.14  0.08  0.15

Summary stats for correlations
Estimator    Average [HAC]    Stdev  cor(.,K)  acf1  acf2  acf3  acf4  acf5  acf10
CorrK30×30   0.3862 [0.008]   0.203  1.000     0.28  0.26  0.23  0.23  0.22  0.19
CorrK10×10   0.3825 [0.008]   0.188  0.975     0.32  0.29  0.26  0.26  0.25  0.22
CorrK2×2     0.3698 [0.007]   0.155  0.824     0.44  0.40  0.37  0.35  0.34  0.30
Corr1/4m     0.1836 [0.007]   0.113  0.277     0.78  0.75  0.72  0.71  0.70  0.66
Corr5m       0.3619 [0.007]   0.169  0.756     0.34  0.31  0.28  0.26  0.24  0.22
Corr30m      0.4030 [0.010]   0.298  0.684     0.14  0.13  0.11  0.12  0.11  0.10
Corr3h       0.3869 [0.016]   0.594  0.347     0.03  0.03  0.03  0.04  0.02  0.02
Average unconditional open-to-close correlation = 0.5185

Mid-quotes

Summary stats for covariances
Estimator    Average [HAC]    Stdev  Bias    cor(.,K)  acf1  acf2  acf3  acf4  acf5  acf10
CovK30×30    0.8917 [0.089]   1.656  −0.221  1.000     0.62  0.55  0.48  0.43  0.42  0.32
CovK10×10    0.8940 [0.090]   1.636  −0.219  0.992     0.66  0.58  0.51  0.45  0.44  0.34
CovK2×2      0.9000 [0.089]   1.546  −0.213  0.941     0.74  0.66  0.59  0.53  0.51  0.41
CovHY        0.4144 [0.038]   0.627  −0.699  0.788     0.82  0.74  0.69  0.65  0.62  0.53
Cov1/4m      0.4470 [0.048]   0.776  −0.666  0.669     0.83  0.74  0.68  0.64  0.61  0.52
Cov5m        0.8530 [0.084]   1.481  −0.260  0.922     0.72  0.61  0.55  0.50  0.47  0.39
Cov30m       0.9056 [0.091]   1.833  −0.207  0.897     0.50  0.46  0.37  0.34  0.35  0.25
Cov3h        0.9574 [0.105]   2.661  −0.156  0.672     0.22  0.25  0.21  0.17  0.16  0.16
CovOtoC      1.1143 [0.150]   4.234  –       0.534     0.12  0.15  0.18  0.14  0.08  0.15

Summary stats for correlations
Estimator    Average [HAC]    Stdev  cor(.,K)  acf1  acf2  acf3  acf4  acf5  acf10
CorrK30×30   0.3904 [0.009]   0.221  1.000     0.25  0.23  0.21  0.20  0.20  0.18
CorrK10×10   0.3870 [0.008]   0.200  0.968     0.30  0.27  0.24  0.24  0.23  0.20
CorrK2×2     0.3763 [0.008]   0.165  0.818     0.41  0.37  0.34  0.32  0.31  0.27
Corr1/4m     0.1815 [0.006]   0.103  0.282     0.75  0.71  0.69  0.67  0.66  0.62
Corr5m       0.3650 [0.007]   0.168  0.724     0.35  0.30  0.28  0.26  0.25  0.22
Corr30m      0.4027 [0.010]   0.299  0.734     0.14  0.13  0.11  0.12  0.11  0.10
Corr3h       0.3873 [0.016]   0.593  0.382     0.03  0.04  0.03  0.04  0.02  0.02
Average unconditional open-to-close correlation = 0.5169

Summary statistics across all asset pairs. The first column identifies the estimator; the second gives the average estimate across all asset combinations, followed by the average Newey–West type standard error in brackets. The next column gives the average standard deviation of the estimator, then the average bias. Next is the average sample correlation with our realised kernel. The remaining columns give average autocorrelations. The upper panel is based on transaction prices, whereas the lower panel is based on mid-quotes.
so seven underestimates the efficiency gain of using $\mathrm{Cov}^{K2\times2}_s$. If volatility is close to being persistent, then $\mathrm{Cov}^{K30\times30}_s$ is at least $4.255^2/\{1.61^2(1 - \mathrm{acf}_1)\} \simeq 20$ times more informative than the cross product of daily returns. The same observation holds for mid-quotes. $\mathrm{Cov}^{15s}_s$ and $\mathrm{Cov}^{HY}_s$ are very precise estimates of the wrong quantity. $\mathrm{Cov}^{5m}_s$ is quite close to $\mathrm{Cov}^{K30\times30}_s$, $\mathrm{Cov}^{K10\times10}_s$ and $\mathrm{Cov}^{K2\times2}_s$, with $\mathrm{Cov}^{5m}_s$ and $\mathrm{Cov}^{K30\times30}_s$ having a correlation of 0.942. We note that the realised kernel results seem to show some bias compared to $\mathrm{Cov}^{OtoC}_s$; the difference is, however, statistically insignificantly different from zero, as $\mathrm{Cov}^{OtoC}_s$ turns out to be very noisy.

The corresponding results for correlations are interesting. Naturally, the computation of the correlation involves a nonlinear transformation of roughly unbiased and noisy estimates. We should therefore (by a Jensen inequality argument) expect all these estimates to be biased. The most persistent estimator is $\mathrm{Corr}^{1/4m}_s$, but the high autocorrelation merely reflects the large distortion that noise has on this estimator, as is also evident from the sample average of this correlation estimator. The largest autocorrelation amongst the more reliable estimators is that of $\mathrm{Corr}^{K2\times2}_s$, which suggests that this is the most effective estimate of the correlation.

In our web appendix we give time series plots and autocorrelograms for the various estimates of realised covariance for the AA-SPY asset combination using trade data. They show $\mathrm{Cov}^{K2\times2}_s$ performing much better than the 30 min realised covariance, but there not being a great deal of difference between the statistics when the realised covariance is based on 5 min returns. The web appendix also presents scatter plots of estimates based on transaction prices (vertical axis) against the same estimates based on mid-quotes (horizontal axis) for the same days. These show a remarkable agreement between estimates based on $\mathrm{Cov}^{K2\times2}_s$, $\mathrm{Cov}^{5m}_s$ and $\mathrm{Cov}^{30m}_s$, while once again $\mathrm{Cov}^{HY}_s$ struggles. Overall, $\mathrm{Cov}^{K2\times2}_s$ and $\mathrm{Cov}^{5m}_s$ behave in a similar manner, with $\mathrm{Cov}^{K2\times2}_s$ slightly stronger. $\mathrm{Cov}^{K10\times10}_s$ estimates roughly the same level as $\mathrm{Cov}^{K2\times2}_s$ but is discernibly noisier.
Fig. 2. Scatter plots for daily realised kernel betas for the AA and SPY asset combination.
5.5. Analysis of the correlation estimates

In this subsection we will focus on estimating $\rho_s^{(i,j)} = [Y^{(i)}, Y^{(j)}]_s/\sqrt{[Y^{(i)}]_s\,[Y^{(j)}]_s}$ by the realised kernel correlation $\hat\rho_s^{(i,j)K} = K_s^{(i,j)}/\sqrt{K_s^{(i,i)}K_s^{(j,j)}}$ and the corresponding realised correlation $\hat\rho_s^{Xm}$. A table in our web appendix reports the average estimates for $\hat\rho_s^{K2\times2}$, $\hat\rho_s^{K10\times10}$ and $\hat\rho_s^{5m}$. It shows the expected result that $\hat\rho_s^{K2\times2}$ is more precise than $\hat\rho_s^{K10\times10}$. Both have average values which are quite a bit below the unconditional correlation of the daily open-to-close returns. This is not surprising: all three ingredients of $\hat\rho_s^{(i,j)K2\times2}$ are measured with noise, and so when we form $\hat\rho_s^{K}$ it will be downward biased.
5.6. Analysis of the beta estimates

Here we will focus on estimating $\beta_s^{(i,j)} = [Y^{(i)}, Y^{(j)}]_s/[Y^{(j)}]_s$ by the realised kernel beta $\beta_s^{(i,j)K} = K_s^{(i,j)}/K_s^{(j,j)}$. Fig. 2 presents scatter plots of beta estimates based on transaction prices (vertical axis) against the same estimates based on mid-quotes (horizontal axis). The two estimators are $\beta_s^{K2\times2}$ and $\beta_s^{5m}$. The results are not very different in these two cases.

Fig. 3 compares the fitted values from ARMA models for the kernel and 5 min estimates of realised betas for the AA-SPY asset combination. These are based on the model estimates for the daily kernel based realised betas,
$$\beta_s^{K} = \underset{(0.06)}{1.20} + \underset{(0.027)}{0.923}\,\beta_{s-1}^{K} + u_s - \underset{(0.048)}{0.726}\,u_{s-1}, \qquad \text{adj-}R^2 = 0.213,$$
and for the 5 min based realised betas,
$$\beta_s^{5\mathrm{min}} = \underset{(0.06)}{1.16} + \underset{(0.024)}{0.950}\,\beta_{s-1}^{5\mathrm{min}} + u_s - \underset{(0.039)}{0.821}\,u_{s-1}, \qquad \text{adj-}R^2 = 0.145.$$
We also calculate the encompassing regressions. The estimates for the realised kernel betas are
$$\beta_s^{K} = \underset{(0.031)}{0.084} + \underset{(0.053)}{0.858}\,\beta_{s-1}^{K} + \underset{(0.043)}{0.074}\,\beta_{s-1}^{5\mathrm{min}} + u_s - \underset{(0.044)}{0.726}\,u_{s-1}, \qquad \text{adj-}R^2 = 0.215,$$
with the corresponding regression for the 5 min based realised betas,
$$\beta_s^{5\mathrm{min}} = \underset{(0.026)}{0.056} + \underset{(0.047)}{0.879}\,\beta_{s-1}^{5\mathrm{min}} + \underset{(0.035)}{0.069}\,\beta_{s-1}^{K} + u_s - \underset{(0.040)}{0.822}\,u_{s-1}, \qquad \text{adj-}R^2 = 0.150.$$
This shows that neither estimator dominates the other in terms of encompassing, although the realised kernel has a slightly stronger t-statistic. Both models have a significant memory, with autoregressive roots well above 0.9 and with large moving average roots. The fit of the realised kernel beta is a little better than that for the realised beta.
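Given a daily realised kernel estimate, both transformations used in these two subsections are immediate; the sketch below is our illustration, with hypothetical names, where `K` is one day's d × d realised kernel matrix.

```python
import numpy as np

def kernel_corr_beta(K, i, j):
    """Realised kernel correlation and beta of Sections 5.5-5.6:
    rho = K_ij / sqrt(K_ii K_jj) and beta = K_ij / K_jj, with asset j
    playing the role of the reference asset (e.g. SPY)."""
    rho = K[i, j] / np.sqrt(K[i, i] * K[j, j])
    beta = K[i, j] / K[j, j]
    return rho, beta
```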
5.7. A scalar BEKK

An important use of realised quantities is to forecast future volatilities and correlations of daily returns. The use of reduced form models has been pioneered by Andersen et al. (2001, 2003). One useful way of thinking about the forecasting problem is to fit a GARCH-type model with lagged realised quantities as explanatory variables, e.g. Engle and Gallo (2006). Here we follow this route, fitting multivariate GARCH models with $\mathrm{E}(r_s|\mathcal{F}_{s-1}) = 0$ and $\mathrm{Cov}(r_s|\mathcal{F}_{s-1}) = H_s$, where $r_s$ is the $d \times 1$ vector of daily close-to-close returns and $\mathcal{F}_{s-1}$ is the information available at time $s-1$ to predict $r_s$. A standard Gaussian quasi-likelihood, $-\frac{1}{2}\sum_{s=1}^{T}(\log|H_s| + r_s'H_s^{-1}r_s)$, is used to make inference. The model we fit is a variant of the scalar BEKK (e.g. Engle and Kroner (1995)),
$$H_s = C'C + \beta H_{s-1} + \alpha\,r_{s-1}r_{s-1}' + \gamma K_{s-1}, \qquad \alpha, \beta, \gamma \geq 0.$$
Here we follow the literature and use $H_s$ to denote the conditional variance matrix (not to be confused with our bandwidth parameter). Instead of estimating the $d(d+1)/2$ unique elements of $C$, we use a variant of variance targeting as suggested in Engle and Mezrich (1996). The general idea is to estimate the intercept matrix by the auxiliary estimator
$$\hat C'\hat C = \bar S \odot (1 - \alpha - \beta - \gamma\kappa), \qquad \bar S = \frac{1}{T}\sum_{s=1}^{T} r_s r_s', \tag{7}$$
Fig. 3. ARMA(1, 1) model for transaction based realised kernel betas for the AA and SPY combination.

Table 7
Scalar BEKK models for close-to-close returns.

Panel A: 30 × 30 case
H_{s−1}  r_{s−1}r'_{s−1}  K_{s−1}  RV5m_{s−1}  log L
0.943    0.005            0.040    –           −27,029.9
0.742    0.013            –        0.115       −27,077.7
0.984    0.008            –        –           −28,477.5
0.777    –                0.076    0.061       −26,948.3
0.784    0.009            0.067    0.059       −26,904.3

Panel B: 10 × 10 case
H_{s−1}  r_{s−1}r'_{s−1}  K_{s−1}  RV5m_{s−1}  log L
0.768    0.015            0.151    –           −7,920.7
0.687    0.022            –        0.160       −7,935.9
0.965    0.023            –        –           −8,307.5
0.705    –                0.126    0.067       −7,923.3
0.716    0.017            0.106    0.065       −7,903.0

Panel C: 2 × 2 cases

AIG-CAT
H_{s−1}  r_{s−1}r'_{s−1}  K_{s−1}  RV5m_{s−1}  log L
0.837    0.038            0.126    –           −2,584.9
0.863    0.044            –        0.098       −2,591.2
0.951    0.045            –        –           −2,629.7
0.764    –                0.236    0           −2,592.4
0.837    0.038            0.126    0           −2,584.9

BA-SPY
H_{s−1}  r_{s−1}r'_{s−1}  K_{s−1}  RV5m_{s−1}  log L
0.844    0.031            0.094    –           −1,516.9
0.843    0.032            –        0.091       −1,517.8
0.958    0.036            –        –           −1,544.4
0.717    –                0.125    0.083       −1,521.1
0.837    0.031            0.068    0.031       −1,516.6

Estimation results for scalar BEKK models for close-to-close d = 30, 10, 2 dimensional return vectors. Each row reports one specification; a dash indicates that the corresponding term is excluded.
where ⊙ denotes the Hadamard product. There is a slight deviation from the situation considered by Engle and Mezrich (1996), because $K^{2\times2}_{s-1}$ is only estimated for the part of the day where the NYSE is open. To accommodate this we follow Shephard and Sheppard (2010), who introduce the scaling matrix $\kappa$ in (7), which we estimate by
$$\hat\kappa_{ij} = \frac{\hat\mu^{RK}_{ij}}{\hat\mu_{ij}}, \qquad \hat\mu = T^{-1}\sum_{s=1}^{T} r_s r_s' \quad\text{and}\quad \hat\mu^{RK} = T^{-1}\sum_{s=1}^{T} K_s.$$
Having $\bar S$ and $\hat\kappa$ at hand, the remaining parameters are simply estimated by maximising the concentrated quasi-log-likelihood, with
$$H_s = \bar S \odot (1 - \alpha - \beta - \gamma\hat\kappa) + \beta H_{s-1} + \alpha\,r_{s-1}r_{s-1}' + \gamma K_{s-1}, \qquad \alpha, \beta, \gamma \geq 0.$$
An interesting question is whether $\gamma$ is statistically different from zero, because a non-zero $\gamma$ means that high frequency data enhance the forecast of future covariation. In our analysis we will also augment the model with $\mathrm{RV}^{5m}_{s-1}$.

We estimate scalar BEKK models for the 30 × 30, 10 × 10, and the 45 2 × 2 cases. In Table 7 we present estimates for the two larger dimensions and selected 2 × 2 cases. The results in Table 7 suggest that lagged daily returns are no longer significant in this multivariate GARCH model once we include the realised kernel covariance. This is even though the realised kernel covariance misses the overnight effect, that is, the information in the close-to-open returns. An interesting feature of the series is that in most cases including $K_{s-1}$ reduces the size of the estimated coefficient on $H_{s-1}$. It is also interesting to note that including $K_{s-1}$ in general gives a higher log-likelihood than including $\mathrm{RV}^{5m}_{s-1}$. This holds for both the 30-dimensional and the 10-dimensional cases, and for 40 of the 45 2-dimensional cases. In our web appendix we report summary statistics of two likelihood ratio tests applied to all the 45 2-dimensional cases. The average LR statistic for removing $\mathrm{RV}^{5m}_{s-1}$ from our most general specification is 0.66, whereas the corresponding average for removing $K_{s-1}$ is 11.9. These tests can be interpreted as encompassing tests, and provide overwhelming evidence that the information in $\mathrm{RV}^{5m}_{s-1}$ is contained in $K_{s-1}$.
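A minimal sketch of this estimation strategy follows, assuming daily arrays `r` (T × d returns) and `K` (T × d × d realised kernels); the recursion and the variance-targeted intercept follow (7), while the initialisation at the unconditional covariance and the optimiser settings are our own illustrative choices, not the authors' code.

```python
import numpy as np
from scipy.optimize import minimize

def neg_qloglik(params, r, K):
    """Concentrated Gaussian quasi-log-likelihood for the scalar BEKK of
    Section 5.7 with variance targeting. kappa is applied elementwise
    (Hadamard), as in (7)."""
    a, b, g = params
    T, d = r.shape
    Sbar = r.T @ r / T                    # sample covariance of daily returns
    kappa = K.mean(axis=0) / Sbar         # elementwise kappa_ij = muRK / mu
    C = Sbar * (1.0 - a - b - g * kappa)  # targeted intercept, Hadamard form
    H = Sbar.copy()                       # initialise at unconditional level
    ll = 0.0
    for s in range(1, T):
        H = C + b * H + a * np.outer(r[s - 1], r[s - 1]) + g * K[s - 1]
        sign, logdet = np.linalg.slogdet(H)
        if sign <= 0:
            return np.inf                 # penalise non-PD regions
        ll += -0.5 * (logdet + r[s] @ np.linalg.solve(H, r[s]))
    return -ll

# e.g. minimize(neg_qloglik, x0=[0.02, 0.90, 0.05], args=(r, K),
#               bounds=[(0.0, 1.0)] * 3)
```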
6. Additional remarks

6.1. Relating K(X) to the flat-top realised kernel $K^F(X)$

In the univariate case the realised kernel $K(X) = \sum_{h=-n}^{n} k\left(\frac{h}{H}\right)\Gamma_h$, with $\Gamma_h = \sum_{j=|h|+1}^{n} x_j x_{j-|h|}$, is at first sight very similar to the unbiased flat-top realised kernel of Barndorff-Nielsen et al. (2008),
$$K^F(X) = \Gamma_0 + \sum_{h=1}^{n-1} k\!\left(\frac{h-1}{H+1}\right)\left(\Gamma_h^F + \Gamma_{-h}^F\right), \qquad \Gamma_h^F = \sum_{j=1}^{n} x_j x_{j-h}.$$
Here the $\Gamma_h$ and $\Gamma_h^F$ are not divided by the sample size. This means that the end conditions, the observations at the start and end of the sample, can have influential effects. Jittering eliminates the end effects in K(X), whereas the presence of $x_{-1}, x_{-2}, \ldots$ and $x_{n+1}, x_{n+2}, \ldots$ in the definition of $\Gamma_h^F$ removes the end effects from $K^F(X)$. However, an implication of this is that the resulting estimator is not guaranteed to be positive semi-definite, whatever the choice of the weight function.

The alternative $K^F(X)$ has the advantage that it (under the restrictive independent noise assumption) converges at an $n^{1/4}$ rate and is close to the parametric efficiency bound. It has the disadvantage that it can go negative, while we see in the next subsection that it is sensitive to deviations from independent noise, such as serial dependence in the noise and endogenous noise, which K(X) is robust to. The requirement that K(X) be positive results in the bias-variance trade-off and reduces the best rate of convergence from $n^{1/4}$ to $n^{1/5}$. This resembles the effects seen in the literature on density estimation with kernel functions. The property $\int u^2 k(u)\,du = 0$ reduces the order of the asymptotic bias, but kernel functions that satisfy $\int u^2 k(u)\,du = 0$ can result in negative density estimates; see Silverman (1986, Sections 3.3 and 3.6).

6.1.1. Positivity

There are three reasons that $K^F(X)$ can go negative.7 The most obvious is the use of a kernel function that does not satisfy $\int_{-\infty}^{\infty} k(x)\exp(ix\lambda)\,dx \geq 0$ for all $\lambda \in \mathbb{R}$, such as the Tukey–Hanning kernel or the cubic kernel, $k(x) = 1 - 3x^2 + 2x^3$. Second, the flat-top kernels give unit weight to $\gamma_1$ and $\gamma_{-1}$, which can mean $K^F(X)$ may be negative. This can be verified by rewriting the estimator as a quadratic form, $x'Mx$, where M is a symmetric band matrix, $M = \mathrm{band}(1, 1, k(\frac{1}{H}), k(\frac{2}{H}), \ldots)$. The determinant of the upper-left 3 × 3 matrix is given by $-\{k(\frac{1}{H}) - 1\}^2$, so that $k(\frac{1}{H}) = 1$ is needed to avoid negative eigenvalues. Repeating this argument leads to $k(\frac{h}{H}) = 1$ for all h, which violates the condition that $k(\frac{h}{H}) \to 0$ as $h \to \infty$. Finally, the third reason that the flat-top kernel can produce a negative estimate is the construction of the realised autocovariances, $\gamma_h = \sum_{j=1}^{n} x_j x_{j-h}$. This requires the use of ''out-of-period'' intraday returns, such as $x_{1-H}$. This formulation was chosen because it makes $\mathrm{E}\{K(U)\} = 0$ when U is white noise. However, since $x_{1-H}$ only appears once in this estimator, in the term $x_1 x_{1-H}$, it is evident that a sufficiently large value of $x_{1-H}$ (positive or negative, depending on the sign of $x_1$) will cause the estimator to be negative. We have overcome the last obstacle by jittering the end-points, which makes the use of ''out-of-period'' returns redundant. They can be dropped at the expense of an $O(m^{-1})$ bias.
k( Hh ) → 0, as h → ∞. Finally, the third reason that the flattop kernel could produce a negative estimate ∑ was due to the n construction of realised autocovariances, γh = j=1 xj xj−h . This requires the use of ‘‘out-of-period’’ intraday returns, such as x1−H . This formulation was chosen because it makes E{K (U )} = 0 when U is white noise. However, since x−H only appears once in this estimator, with the term x1 x1−H , it is evident that a sufficiently large value of x1−H (positive or negative, depending on the sign of x1 ) will cause the estimator to be negative. We have overcome the last obstacle by jittering the end-points, which makes the use of ‘‘out-of-period’’ redundant. They can be dropped at the expense of a O(m−1 ) bias. 7 The flat-top kernel is only rarely negative with modern data. However, if [Y ] is very small and the ω2 very large, which we saw on slow days on the NYSE when the tick size was $1/8, then it can happen quite often when the flat-top realised kernel is used. We are grateful to Kevin Sheppard for pointing out these negative days.
6.1.2. Efficiency

An important question is how inefficient K(X) is in practice compared to the flat-top realised kernel $K^F(X)$. The answer is: quite a bit when U is white noise. Table 8 gives $\mathrm{E}[n^{1/4}\{K(X) - [Y]\}]^2/\omega$ and $\mathrm{E}[n^{1/4}\{K^F(X) - [Y]\}]^2/\omega$, the mean square error normalised by the rate of convergence of $K_P^F(X)$, which is the flat-top realised kernel using the Parzen weight function. An implication is that the scaled MSE for K(X) and $K_B^F(X)$ will increase without bound as $n \to \infty$, because these estimators converge at a rate that is slower than $n^{1/4}$. The results are given for the case of Brownian motion observed with different types of noise. Results for two flat-top kernels are given, based on the Bartlett ($K_B^F(X)$) and Parzen ($K_P^F(X)$) weight functions. Similar types of results hold for other weight functions.

Consider first the case with Gaussian white noise U with variance ω². The results show that the variance of K(X) is much bigger than its squared bias. For small n there is not much difference between the three estimators, but by the time n = 4096 (which is realistic for our applications) the flat-top $K^F(X)$ has roughly half the MSE of K(X) in the univariate case. Hence in ideal (but unrealistic) circumstances $K^F(X)$ has advantages over K(X), but we are attracted to the positivity and robustness of K(X).

The robustness advantage of K(X) can be seen using four simulation designs where $U_j$ is modelled as a dependent process. We consider the moving average specification, $U_j = \epsilon_j - \theta\epsilon_{j-1}$ with $\theta = \pm 0.5$, and the autoregressive specification, $U_j = \phi U_{j-1} + \epsilon_j$ with $\phi = \pm 0.5$, where $\epsilon_j$ is Gaussian white noise. The bandwidth for each estimator was chosen to be ''optimal'' under U being white noise, which is the default in the literature, so $H_B^F = 2.28\,\omega^{4/3}n^{2/3}$, $H_P^F = 4.77\,\omega n^{1/2}$ and $H_P = 3.51\,\omega^{4/5}n^{3/5}$, where $\omega^2 = \sum_{h=-\infty}^{\infty}\mathrm{cov}(U_j, U_{j-h})$. The results show the robustness of K(X) and the strong asymptotic bias of $K_P^F$ and $K_B^F$ when the noise is not white. The specifications $\theta = 0.5$ and $\phi = -0.5$ induce a negative first-order autocorrelation, while $\theta = -0.5$ and $\phi = 0.5$ induce positive autocorrelation. Negative first-order autocorrelation can be the product of bid-ask bounce effects; this is particularly the case if sampling only occurs when the price changes. Positive first-order autocorrelation would, for example, be relevant for the noise in bid prices, because variation in the bid-ask spread would induce such dependence.

6.2. Preaveraging without bias correction

6.2.1. Estimating multivariate QV

In independent and concurrent work, Vetter (2008, p. 29 and Section 3.2.4) has studied a univariate suboptimal preaveraging estimator of [Y] whose bias is sufficiently small that the estimator does not need to be explicitly bias corrected to be consistent (the bias corrected version can be negative). Its rate of convergence does not achieve the optimal $n^{-1/4}$ rate. Hence his suboptimal preaveraging estimator has some similarities to our non-negative realised kernel. Implicit in his work is that his non-corrected preaveraging estimator is non-negative. However, this is not remarked upon explicitly, nor developed in the multivariate case, where non-synchronously spaced data is crucial.

Here we outline what a simple multivariate uncorrected preaveraging estimator based on refresh time would look like. We define it as $\hat V = \sum_{j=1}^{n-H}\bar x_j\bar x_j'$, where $\bar x_j = (\psi_2 H)^{-1/2}\sum_{h=1}^{H} g\left(\frac{h}{H}\right)x_{j+h}$ and $\psi_2 = \int_0^1 g^2(u)\,du$. Here $g(u)$, $u \in [0,1]$, is a non-negative, continuously differentiable weight function with the properties that $g(0) = g(1) = 0$ and $\psi_2 > 0$. Now if we set $H = \theta n^{3/5}$, then the univariate result in Vetter (2008) would suggest that $\hat V$ converges at rate $n^{-1/5}$, like the univariate version of our multivariate realised kernel. There is no simple guidance, even in the univariate case, as to how to choose θ.
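A sketch of $\hat V$ under the weight function $g(x) = (1-x)\wedge x$, for which $\psi_2 = 1/12$, follows; the value θ = 0.8 in the code is purely illustrative, since, as just noted, there is no simple guidance on θ.

```python
import numpy as np

def preaveraged_estimator(x, theta=0.8):
    """Uncorrected multivariate preaveraging estimator of Section 6.2.1:
    Vhat = sum_j xbar_j xbar_j', with
    xbar_j = (psi2*H)^(-1/2) sum_{h=1}^{H} g(h/H) x_{j+h},
    g(u) = min(u, 1 - u) and psi2 = 1/12, using H = theta * n^(3/5)."""
    n, d = x.shape
    H = max(1, int(theta * n ** 0.6))
    u = np.arange(1, H + 1) / H
    g = np.minimum(u, 1 - u)
    psi2 = 1.0 / 12.0
    V = np.zeros((d, d))
    for j in range(n - H):
        xbar = (g[:, None] * x[j:j + H]).sum(axis=0) / np.sqrt(psi2 * H)
        V += np.outer(xbar, xbar)        # each term is PSD, so Vhat is PSD
    return V
```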
Table 8
Relative efficiency of the realised kernel K(X). (ω² = 0.001)

                 Normalised bias²              Normalised variance           Normalised MSE
n                K_B^F     K_P^F    K(X)       K_B^F  K_P^F  K(X)           K_B^F     K_P^F    K(X)
U ∈ WN
250              0.0       0.0      0.8        16.2   16.3   18.0           16.2      16.3     18.8
1000             0.0       0.0      2.5        11.7   12.1   16.9           11.7      12.1     19.4
4000             0.0       0.0      3.1        10.4   10.4   19.0           10.4      10.4     22.1
16,000           0.0       0.0      4.6        10.5   9.5    20.8           10.5      9.5      25.4

Uj = ϵj + 0.5ϵj−1
250              1.5       1.2      0.6        15.3   15.7   17.6           16.9      16.9     18.2
1000             22.1      7.3      2.2        11.0   11.9   16.9           33.0      19.2     19.1
4000             175.7     18.5     3.2        9.3    10.2   19.0           185.0     28.8     22.2
16,000           898.5     41.0     4.4        9.0    9.4    20.9           907.6     50.4     25.4

Uj = ϵj − 0.5ϵj−1
250              122.7     96.9     3.9        27.5   24.2   18.3           150.2     121.1    22.2
1000             1,769.1   588.0    6.1        44.8   20.4   16.9           1,813.9   608.3    23.0
4000             14,195.1  1490.4   5.0        73.1   13.9   19.3           14,268.2  1504.4   24.3
16,000           72,797.6  3326.8   5.5        88.6   10.9   20.8           72,886.2  3337.7   26.3

Uj = −0.5Uj−1 + ϵj
250              39.1      30.9     1.3        18.9   18.1   17.9           58.0      49.0     19.2
1000             1,261.0   74.9     3.3        35.9   13.2   16.8           1,296.9   88.1     20.0
4000             7,751.7   141.1    3.5        40.8   10.8   18.8           7,792.5   151.9    22.4
16,000           40,973.1  253.8    4.8        52.0   9.7    20.9           41,025.2  263.5    25.7

Uj = 0.5Uj−1 + ϵj
250              0.5       0.4      0.3        14.8   15.3   17.7           15.3      15.7     18.0
1000             9.6       6.3      1.5        9.8    10.8   16.6           19.4      17.1     18.2
4000             96.0      39.6     2.7        8.5    9.7    19.1           104.4     49.2     21.8
16,000           505.8     141.5    4.2        8.5    9.2    21.1           514.3     150.7    25.3

Relative efficiency of the realised kernel K(X) and the flat-top realised kernel $K^F(X)$. Results for five different types of noise are presented. In the MA(1) and AR(1) designs, the variance of $\epsilon_j$ was scaled such that $\mathrm{Var}(U) = \omega^2$. The squared bias, variance and MSE have been scaled by $n^{1/2}/\omega$. In the special case with Gaussian white noise, the asymptotic lower bound for the normalised MSE is 8.00 (the normalised MSE for $K_P^F(X)$ converges to 8.54 as $n \to \infty$ in this special case). The results are based on 50,000 repetitions.
In the univariate case, Jacod et al. (2009) show that the bias corrected form of $\hat V$ is asymptotically equivalent to a $K^F(X)$ with $k(x) = \psi_2^{-1}\int_x^1 g(u)g(u-x)\,du$ and $H \propto n^{1/2}$. It is clear the same result will hold for the relationship between $\hat V$ and K(X) in the multivariate case when $H = \theta n^{3/5}$. A natural choice of g is $g(x) = (1-x)\wedge x$, which delivers $\int_0^1 g^2(u)\,du = 1/12$ and a k function which is the Parzen weight function. Hence one might investigate using $\theta = c_0$, as in our paper, to drive the choice of H for $\hat V$ when applied to refresh time based high frequency returns. Following the initial draft of this paper, Christensen et al. (2009) have defined a bias corrected preaveraging estimator of the multivariate [Y] with $H = \theta n^{1/2}$, for which they derive limit theory. Their estimator has the disadvantage that it is not guaranteed to be positive semi-definite.

6.2.2. Estimating integrated quarticity

In order to construct feasible confidence intervals for our realised quantities (see Barndorff-Nielsen and Shephard (2002)), we have to estimate the stochastic $d^2 \times d^2$ matrix IQ. Our approach is based on the no-noise Barndorff-Nielsen and Shephard (2004) bipower type estimator applied to suboptimally preaveraged data, taking $H = \theta n^{3/5}$. This is not an optimal estimator, as it will converge at rate $n^{1/5}$, but it will be positive semi-definite. The proposed (positive semi-definite) estimator of vec(IQ) is
$$\hat Q = n^{-1}\sum_{j=1}^{n-H}\left\{c_j c_j' - \tfrac{1}{2}\left(c_j c_{j+H}' + c_{j+H} c_j'\right)\right\}, \qquad c_j = \mathrm{vec}(\bar x_j \bar x_j').$$
That the elements of $\hat Q$ are consistent using this choice of bandwidth is implicit in the thesis of Vetter (2008, p. 29 and Section 3.2.4).
{cj cj′ − 12 (cj cj′+H + cj+H cj )}, where cj = vec(¯xj x¯ ′j ). That the elements of Qˆ are consistent using this choice of bandwidth is implicit in the thesis of Vetter (2008, p. 29 and Section 3.2.4). 6.3. Finite sample improvements The realised kernel is non-negative so we can use log-transform n
1/5
log(K (X )) − log
1
∫
σ (u)du 2
0
Ls
→ MN
κ 1 0
σ 2 (u)du
,4 1 0
κ σ 2 (u)du
2
to improve its finite sample performance. When the data is regularly spaced and the volatility is constant then κσ −2 = (ω/σ )2/5 |k′′ (0)|1/5 (k0•,0 )2/5 , which depends less on σ 2 than the non-transformed version. 6.4. Subtlety of end effects We have introduced jittering to eliminate end-effects. The larger is m the smaller is the end-effects, however increasing m has the drawback that is reduces the sample size, n, that can be used to compute the realised autocovariances. Given N observations, the sample size available after jittering is n = N − 2(m − 1), so extensive jittering will increase the variance of the estimator. In this subsection we study this trade-off. We focus on the univariate case where U is white noise. The mean square error caused by end-effects is simply the squared bias plus the variance of U0 U0′ + Un Un′ , which is given by 4m−2 ω4 + 4m−2 ω4 = 8ω4 m−2 , see the proof of the Proposition A.2. The asymptotic variance (abstracting from end-effects) is 5κ 2 n−2/5 = 5|k′′ (0)ω2 |2/5 {k0•,0 IQ}4/5 n−2/5 . So the trade-off between contributions from end-effects and asymptotic variance is given by gN ,ω2 ,IQ (m) = m−2 8ω4 + 5|k′′ (0)ω2 |2/5 {k0•,0 IQ}4/5 (N − m)−2/5 . This function is plotted in Fig. 4 for the case where N = 1000 and IQ = 1 and ω2 = 0.0025 and 0.001. The optimal value of m ranges from 1 to 2. The effect of increasing n on optimal m can be seen from Fig. 4, where the optimal value of m has increased a little from Fig. 4 as n has increased to 5000. However, the optimal amount of jittering is still rather modest.
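The trade-off function is easy to evaluate numerically. The sketch below reproduces the Fig. 4 computation under our assumed Parzen constants $k''(0) = -12$ and $k_\bullet^{0,0} \approx 0.269$, values taken from the univariate realised kernel literature rather than stated at this point in the text.

```python
import numpy as np

# g(m) = 8 w^4 m^{-2} + 5 |k''(0) w^2|^{2/5} {kdot IQ}^{4/5} (N - m)^{-2/5}
kpp0, kdot, IQ = 12.0, 0.269, 1.0   # assumed Parzen constants, IQ = 1
for N in (1000, 5000):
    for omega2 in (0.001, 0.0025):
        m = np.arange(1, 51, dtype=float)
        g = (8 * omega2**2 / m**2
             + 5 * (kpp0 * omega2) ** 0.4 * (kdot * IQ) ** 0.8
               * (N - m) ** -0.4)
        print(N, omega2, "optimal m =", int(m[np.argmin(g)]))
```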
Fig. 4. Sensitivity to the choice of m. The figure shows the RMSE as a function of m for the sample sizes N = 1000 and N = 5000, and ω² = 0.001 and ω² = 0.0025.
6.5. Finite lag refresh time

In this paper we roughly synchronise our return data using the concept of Refresh Time. Refresh Time guarantees that our returns are not stale by more than one lag in Refresh Time. Our proofs need a somewhat less tight condition, that returns are not stale by more than a finite number of lags. This suggests it may be possible to find a different way of synchronising data which throws information away less readily than Refresh Time. We leave this problem to further research.

6.6. Jumps

In this paper we have assumed that Y is a pure BSM. The analysis could be extended to the situation where Y is a pure BSM plus a finite activity jump process. The analysis in Barndorff-Nielsen et al. (2008, Section 5.6) suggests that the realised kernel is consistent for the quadratic variation, [Y], at the same rate of convergence as before, but with a different asymptotic distribution.

7. Conclusions

In this paper we have proposed the multivariate realised kernel, which is a non-normalised HAC type estimator applied to high frequency financial returns, as an estimator of the ex-post variation of asset prices in the presence of noise and non-synchronous trading. The choice of kernel weight function is important here; for example, the Bartlett weight function yields an inconsistent estimator in this context. Our analysis is based on three innovations: (i) we used a weight function which delivers biased kernels, allowing us to use positive semi-definite estimators, (ii) we coordinated the collection of data through the idea of refresh time, and (iii) we showed the estimator is robust to the remaining staleness in the data. We are able to show consistency and asymptotic mixed Gaussianity of our estimator. Our simulation study indicates our estimator is close to being unbiased for covariances under realistic situations. Not surprisingly, the estimators of correlations are downward biased
due to the sampling variance of our estimators of variance. The empirical results based on our new estimator are striking, providing much sharper estimates of dependence amongst assets than has previously been available. We have analysed problems of up to 30 dimensions and have found that the efficiency gains of using the high frequency data are around 20 fold. Multivariate realised kernels have potentially many areas of application, improving our ability to estimate covariances. In particular, this allows us to utilise high frequency data to significantly improve our predictive models, as well as providing a better understanding of asset pricing and the management of risk in financial markets.

In the appendices that follow, we give the proofs and analyse the errors induced by stale prices. Under the assumptions given in this paper, our line of argument is as follows.

• Show the realised kernel is consistent and work out its limit theory for synchronised data. This is shown in Appendix A, where Propositions A.1–A.5 and Theorems A.4 and A.5 are used to establish the multivariate result in Theorem 3; the univariate result in Theorem 2 then follows as a corollary to Theorem 3.
• Show the staleness left by the definition of refresh time has no impact on the asymptotic distribution of the equally spaced realised kernel. This is shown in Appendix B.

Appendix A. Proofs for synchronised data

Proof of Theorem 1. We note that for all i, j,
$$K\begin{pmatrix} Y^{(i)} \\ U^{(j)} \end{pmatrix} = \begin{pmatrix} K(Y^{(i)}) & K(Y^{(i)}, U^{(j)}) \\ K(Y^{(i)}, U^{(j)}) & K(U^{(j)}) \end{pmatrix}$$
is positive semi-definite. This means that by taking the determinant of this matrix and rearranging we see that $K(Y^{(i)}, U^{(j)})^2 \leq K(Y^{(i)})K(U^{(j)})$, so that
$$K(X) = K(Y) + K(U) + O\!\left(\sqrt{\max_i K(Y^{(i)})\,\max_j K(U^{(j)})}\right),$$
and the result follows. □
Next we collect limit results about K(Y) and K(U). Due to Theorem 1 we can safely ignore the cross terms K(U, Y) as long as K(U) vanishes at the appropriate rate.

A.1. Results concerning K(U)

The aim of this subsection is to prove the following theorem.

Theorem A.4. Under K and U,
$$\frac{H^2}{n}K(U) \xrightarrow{p} -k''(0)\,\Omega,$$
as n, H, m → ∞ with $H^2/(mn) \to 0$.

Note that $\int_0^1\{\Sigma_\zeta(u) + \gamma_0(u)\}\,du$ is the average local variance of U, as opposed to the average long-run variance $\Omega = \int_0^1\{\Sigma_\zeta(u) + \sum_{h=-\infty}^{\infty}\gamma_h(u)\}\,du$.

Before we prove Theorem A.4, we establish some intermediate results. The following definitions lead to a useful representation of K(U). For h = 0, 1, …, we define
$$V_h = \sum_{j=h+1}^{n-1}\left(U_j U_{j-h}' + U_{j-h}U_j'\right) \quad\text{and}\quad Z_h = (U_0 U_h' + U_h U_0') + (U_n U_{n-h}' + U_{n-h}U_n').$$

Proposition A.1. The realised autocovariances of U can be written as
$$\Gamma_0(U) = V_0 - V_1 + \tfrac{1}{2}Z_0 - Z_1, \tag{A.1}$$
$$\Gamma_h(U) + \Gamma_h(U)' = -V_{h-1} + 2V_h - V_{h+1} + Z_h - Z_{h+1}, \tag{A.2}$$
so with $k_h = k(\frac{h}{H})$ we have
$$K(U) = (k_0 - k_1)V_0 - \sum_{h=1}^{n-1}(k_{h+1} - 2k_h + k_{h-1})V_h + \tfrac{1}{2}Z_0 + \sum_{h=1}^{n-1}(k_h - k_{h-1})Z_h. \tag{A.3}$$

Proof. The first expression, (A.1), follows from
$$\Gamma_0(U) = \sum_{j=1}^{n}(U_j - U_{j-1})(U_j - U_{j-1})' = U_0U_0' + U_nU_n' + 2\sum_{j=1}^{n-1}U_jU_j' - \sum_{j=2}^{n-1}(U_jU_{j-1}' + U_{j-1}U_j') - (U_nU_{n-1}' + U_{n-1}U_n' + U_0U_1' + U_1U_0'),$$
and (A.2) is proven similarly. □

We note that end-effects can only have an impact on K(U) through $Z_h$, h = 0, 1, …, because $U_0$ and $U_n$ do not appear in the expressions for $V_h$, h = 0, 1, ….

Proposition A.2. Given U. Then
$$\frac{1}{n}V_h \xrightarrow{p} \begin{cases} 2\int_0^1\{\Sigma_\zeta(u) + \gamma_0(u)\}\,du & \text{for } h = 0, \\[2pt] \int_0^1\{\gamma_h(u) + \gamma_h(u)'\}\,du & \text{for } h > 0, \end{cases}$$
and $Z_h = O_p(m^{-1})$ for all h = 0, 1, …; moreover, as m → ∞,
$$mZ_h \xrightarrow{p} \begin{cases} 2\{\Sigma_U(0) + \Sigma_U(1)\} & \text{for } h = 0, \\[2pt] \sum_{j=0}^{\infty}\{\gamma_{j+h}(0) + \gamma_{j+h}(0)' + \gamma_{j+h}(1) + \gamma_{j+h}(1)'\} & \text{for } h > 0. \end{cases}$$

Proof of Proposition A.2. The first result follows from the definitions of $V_h$ and U. Next, since $U_0 = m^{-1}\sum_{j=0}^{m-1}U(t_j)$, it follows that $Z_h$ is stochastic for any m < ∞, and
$$mU_0U_0' = m^{-1}\sum_{j=0}^{m-1}\sum_{i=0}^{m-1}U(t_j)U(t_i)' \xrightarrow{p} \Sigma_U(0),$$
and similarly $mU_nU_n' \xrightarrow{p} \Sigma_U(1)$. So the result for h = 0 follows from $Z_0 = 2(U_0U_0' + U_nU_n')$. Next, for h > 0,
$$mU_0U_h' = \sum_{j=0}^{m-1}U(t_j)U(t_{m-1+h})' \xrightarrow{p} \sum_{j=0}^{\infty}\gamma_{j+h}(0)',$$
and similarly we find $mU_nU_{n-h}' \xrightarrow{p} \sum_{j=0}^{\infty}\gamma_{j+h}(1)$. □

Proof of Theorem A.4. Since k'(0) = 0 and k''(x) is continuous, we have $k_0 - k_1 = -H^{-2}k''(\epsilon)/2$ for some $0 \leq \epsilon \leq H^{-1}$. Define $a_0 = -k''(\epsilon)$ and $a_h = H^2(-k_{|h|+1} + 2k_{|h|} - k_{|h|-1})$, and write the V-terms of K(U), see (A.3), as
$$(k_0 - k_1)V_0 - \sum_{h=1}^{n-1}(k_{h+1} - 2k_h + k_{h-1})V_h = H^{-2}\sum_{|h|\leq\sqrt H}a_h\sum_j U_jU_{j-h}' + H^{-2}\sum_{|h|>\sqrt H}a_h\sum_j U_jU_{j-h}'.$$
By the continuity of k''(x) it follows that
$$\sup_{|h|\leq\sqrt H}\left|a_h + k''(0)\right| \to 0, \quad \text{as } H, n \to \infty \text{ with } H/n = o(1),$$
so that the first term satisfies $H^{-2}\sum_{|h|\leq\sqrt H}a_h\sum_j U_jU_{j-h}' = -k''(0)\frac{n}{H^2}\Omega + o_p(n/H^2)$. The second term vanishes because
$$\frac{H^2}{n}\,H^{-2}\left|\sum_{|h|>\sqrt H}a_h\sum_j U_jU_{j-h}'\right| \leq \frac{1}{H^2}\sum_{|h|>\sqrt H}|Ha_h|\,\cdot\,\sup_{|h|>\sqrt H}\left|\frac{1}{n}\sum_j U_jU_{j-h}'\right|,$$
and $\sup_{|h|>\sqrt H}|\frac{1}{n}\sum_j U_jU_{j-h}'| = o_p(1)$. For the Z-terms we have by Proposition A.2 that $Z_0 = O_p(m^{-1})$, and
$$\sum_{h=1}^{n-1}(k_h - k_{h-1})Z_h = \frac{1}{mH}\sum_{h=1}^{n-1}\{k'(h/H) + o(1)\}\,mZ_h = O_p(m^{-1}),$$
so, after scaling by $H^2/n$, the Z-terms vanish under $H^2/(mn) \to 0$, and the result follows. □

Proof of Lemma 2. When $k'(0) \neq 0$ we see that the first term of (A.3) is such that $\frac{H}{n}(k_0 - k_1)V_0 \xrightarrow{p} -2k'(0)\int_0^1\{\Sigma_\zeta(u) + \gamma_0(u)\}\,du$. From the proof of Theorem A.4 it follows that the other terms in (A.3) are of lower order. □
A.2. Results concerning K(Y)

The aim of this subsection is to prove the following theorem, which concerns K(Y) in the univariate case. We then extend the result to the multivariate case in the next subsection.

Theorem A.5. Suppose K, SH, D and U hold. Then as n, m, H → ∞ with H/n = o(1) and $m^{-1} = o(\sqrt{H/n})$, we have
$$\sqrt{\frac{n}{H}}\left\{K(Y) - \int_0^1\sigma^2(u)\,du\right\} \xrightarrow{L_s} MN\!\left(0,\; 4k_\bullet^{0,0}\int_0^1\frac{\varkappa_2(u)}{\varkappa_1(u)}\,\sigma^4(u)\,du\right). \tag{A.4}$$

Before we prove this theorem for K(Y) we introduce and analyse two related quantities,
$$\tilde K(Y) = \sum_{i=1}^{N}\left(\eta^{(1)}_{N,i} + \eta^{(2)}_{N,i}\right) \quad\text{and}\quad \hat K(Y) = \sum_{i=1}^{N}\left(\hat\eta^{(1)}_{N,i} + \hat\eta^{(2)}_{N,i}\right),$$
where
$$\eta^{(1)}_{N,i} = y_{N,i}^2, \quad \eta^{(2)}_{N,i} = 2y_{N,i}\sum_{h>0}k_h\,y_{N,i-h}, \quad \hat\eta^{(1)}_{N,i} = \hat y_{N,i}^2, \quad \hat\eta^{(2)}_{N,i} = 2\hat y_{N,i}\sum_{h>0}k_h\,\hat y_{N,i-h},$$
with $y_{N,i} = Y(\tau_{N,i}) - Y(\tau_{N,i-1})$ and $\hat y_{N,i} = \sigma(\tau_{N,i-1})(W_{\tau_{N,i}} - W_{\tau_{N,i-1}})$. $\tilde K$ is similar to K, except that it is not subjected to the jittering, and $\hat K$ is similar to $\tilde K$, but is computed with auxiliary intraday returns. Note that we have (uniformly over i) the strong approximation (under SH)
$$y_{N,i} = \int_{\tau_{N,i-1}}^{\tau_{N,i}}\mu(u)\,du + \int_{\tau_{N,i-1}}^{\tau_{N,i}}\sigma(u)\,dW(u) = \hat y_{N,i}\{1 + o_p(N^{-1/2})\}; \tag{A.5}$$
see, for example, Jacod (unpublished paper, (6.25)) and Phillips and Yu (unpublished paper, Eq. (66)). Let $\varepsilon_{N,i} = \Delta_{N,i}^{-1/2}(W_{\tau_{N,i}} - W_{\tau_{N,i-1}})$, so that $\varepsilon_{N,i} \sim \text{iid } N(0,1)$, and note that $\hat y_{N,i} = N^{-1/2}\sigma(\tau_{N,i-1})D_{N,i}^{1/2}\varepsilon_{N,i}$. We use $\hat y_{N,i}$ as our estimate of $y_{N,i}$ throughout, later showing that this has no impact on the result. Note also that $y_{N,i} - \hat y_{N,i} = \int_{\tau_{N,i-1}}^{\tau_{N,i}}\{\sigma(u) - \sigma(\tau_{N,i-1})\}\,dW(u)$, so with $d[\sigma]_t = \lambda_t\,dt$ we find
$$\mathrm{E}\!\left[\left(\int_{\tau_{N,i-1}}^{\tau_{N,i}}\{\sigma(u) - \sigma(\tau_{N,i-1})\}\,dW(u)\right)^{2}\bigg|\,\mathcal F_{\tau_{N,i-1}}\right] = \lambda^2_{\tau_{N,i-1}}\int_{\tau_{N,i-1}}^{\tau_{N,i}}(u - \tau_{N,i-1})\,du\,\{1 + o_p(1)\} = \frac{\lambda^2_{\tau_{N,i-1}}D^2_{N,i}}{2N^2}\{1 + o_p(1)\}.$$

Proposition A.3. Suppose K, SH and D hold. Then as n → ∞ with H = o(n) and $m^3 = O(Hn)$,
$$\sqrt{\frac{n}{H}}\{K(Y) - \tilde K(Y)\} = o_p(1).$$

Proof. The difference between K(Y) and $\tilde K(Y)$ is tied to the m first and m last observations, so the difference vanishes if m does not grow at too fast a rate. We have $\sum_{i=1}^{m}y_{N,i}^2 = o_p(m^{3/2}/N)$, since $\max_{i=1,\ldots,m}D_{N,i} = o_p(m^{1/2})$, σ²(t) is bounded, and $\sum_{i=1}^{m}D_{N,i} = O_p(m)$. So we need $\sqrt{n/H}\,m^{3/2}/N = O(1)$, which is implied by $m^3/(Hn) = O(1)$. □

Proposition A.4. Suppose SH and D hold. Then, so long as H = o(N),
$$\sqrt{\frac{N}{H}}\{\tilde K(Y) - \hat K(Y)\} = o_p(1).$$

Proof. From, for example, Phillips and Yu (unpublished paper) it is known that $\sum_{i=1}^{N}\eta^{(1)}_{N,i} - \hat\eta^{(1)}_{N,i} = o_p(N^{-1/2})$. The only thing left to do is to prove that $\sum_{i=1}^{N}\eta^{(2)}_{N,i} - \hat\eta^{(2)}_{N,i} = o_p(\sqrt{H/N})$. First note that
$$\frac{1}{2}\sum_{i=1}^{N}\left(\eta^{(2)}_{N,i} - \hat\eta^{(2)}_{N,i}\right) = \sum_{i=1}^{N}(y_{N,i} - \hat y_{N,i})\sum_{h>0}k_h\,y_{N,i-h} + \sum_{i=1}^{N}\hat y_{N,i}\sum_{h>0}k_h(y_{N,i-h} - \hat y_{N,i-h}). \tag{A.6}$$
The first term of (A.6) is a sum of martingale difference sequences. Its conditional variance is
$$\sum_{i=1}^{N}\frac{\lambda^2_{\tau_{N,i-1}}D^2_{N,i}}{2N^2}\left(\sum_{h>0}k_h\,y_{N,i-h}\right)^{2} \leq \max_{i=1,\ldots,N}\frac{\lambda^2_{\tau_{N,i-1}}D^2_{N,i}}{2}\,\frac{1}{N}\sum_{i=1}^{N}\frac{1}{N}\left(\sum_{h>0}k_h\,y_{N,i-h}\right)^{2} = o_p(N)\,O_p(1)\,O_p\!\left(\frac{H}{N}\right)\frac{1}{N} = o_p(H/N),$$
where we have used that $\sum_{h>0}k_h\,y_{N,i-h} = O_p(\sqrt{H/N})$. The second term is
$$\sum_{i=1}^{N}\hat y_{N,i}\sum_{h>0}k_h\int_{\tau_{N,i-h-1}}^{\tau_{N,i-h}}\{\sigma(u) - \sigma(\tau_{N,i-h-1})\}\,dW(u).$$
It has zero conditional mean, and its conditional variance is
$$\frac{1}{N}\sum_{i=1}^{N}\sigma^2(\tau_{N,i-1})D_{N,i}\sum_{h>0}k_h^2\int_{\tau_{N,i-h-1}}^{\tau_{N,i-h}}\mathrm{E}\{\sigma(u) - \sigma(\tau_{N,i-h-1})\}^2\,du = \frac{1}{N}\sum_{i=1}^{N}\sigma^2(\tau_{N,i-1})D_{N,i}\sum_{h>0}k_h^2\,\lambda^2_{\tau_{N,i-h-1}}\frac{D^2_{N,i-h}}{2N^2}\{1 + o_p(1)\},$$
where $d[\sigma]_t = \lambda_t\,dt$. This is bounded by
$$\frac{1}{N^3}\sum_{i=1}^{N}\sigma^2(\tau_{N,i-1})\max_i D_{N,i}\max_i\lambda^2_{N,i}\frac{1}{2}\sum_{j=0}^{\lfloor N/q\rfloor}\max_{i=qj+1,\ldots,qj+q}D^2_{N,i}\sum_{h>qj}k_h^2\{1 + o_p(1)\} \leq \frac{1}{N^3}\sum_{i=1}^{N}\sigma^2(\tau_{N,i-1})\,o_p(N^{1/2})\,O_p(1)\,o_p(q)\,H\,o(\log N) = o_p\!\left(\frac{H}{N}\right),$$
taking $q = N^{1/2}/\log(N)$. Here we have used that $\max_{i=c+1,\ldots,c+m}D_{N,i} = o_p(m^{1/2})$, that $\frac{1}{H}\sum_{h>qj}k(\frac{h}{H})^2$ is at most of order $o(j^{-1})$, since $\frac{1}{H}\sum_{h=1}^{N}k(\frac{h}{H})^2$ is convergent, and that $\sum_{j=0}^{\lfloor N/q\rfloor}j^{-1} = O(\log m)$. □

Proposition A.5. Suppose SH and D hold and H = o(N). Then as N → ∞,
$$\sqrt{\frac{N}{H}}\left\{\hat K(Y) - \int_0^1\sigma^2(u)\,du\right\} \xrightarrow{L_s} MN\!\left(0,\; 4k_\bullet^{0,0}\int_0^1\frac{\varkappa_2(u)}{\varkappa_1(u)}\,\sigma^4(u)\,du\right).$$

Proof. We have $\hat K(Y) = \sum_{i=1}^{N}\{\sigma^2(\tau_{N,i-1})\Delta_{N,i} + \hat\eta^{(1)}_{N,i} + \hat\eta^{(2)}_{N,i}\}$, writing now $\hat\eta^{(1)}_{N,i}$ for the centred term $\hat y^2_{N,i} - \sigma^2(\tau_{N,i-1})\Delta_{N,i}$. Phillips and Yu (unpublished paper) imply that $\sqrt{N/H}\left(\sum_{i=1}^{N}\sigma^2(\tau_{N,i-1})\Delta_{N,i} - \int_0^1\sigma^2(u)\,du\right) = o_p(1/\sqrt H)$. This means that
$$\sqrt{\frac{N}{H}}\left\{\hat K(Y) - \int_0^1\sigma^2(u)\,du\right\} = \sqrt{\frac{N}{H}}\sum_{i=1}^{N}\left(\hat\eta^{(1)}_{N,i} + \hat\eta^{(2)}_{N,i}\right) + o_p(1),$$
which is the sum of the martingale differences $\{\hat\eta^{(1)}_{N,i} + \hat\eta^{(2)}_{N,i}, \mathcal F_{\tau_{N,i}}\}$. So we just need to compute its contributions to the conditional variance. The first term, $\hat\eta^{(1)}_{N,i}$, is the sampling error of the well-known realised variance. In the present context it was studied in Phillips and Yu (unpublished paper), and it follows that $\frac{N}{H}\sum_{i=1}^{N}\mathrm{Var}(\hat\eta^{(1)}_{N,i}|\mathcal F_{\tau_{N,i-1}}) = O_p(H^{-1})$. This means that unless H = O(1) this term will be asymptotically irrelevant for the realised kernel. Next,
$$\frac{N}{H}\sum_{i=1}^{N}\mathrm{Var}(\hat\eta^{(2)}_{N,i}|\mathcal F_{\tau_{N,i-1}}) = \frac{k_\bullet^{0,0}}{N}\sum_{i=1}^{N}\sigma^4(\tau_{N,i-1})\,\mathrm{E}(D^2_{N,i}|\mathcal F_{\tau_{N,i-1}}) + o_p(1),$$
where we have used that $N\hat y^2_{N,i} = D_{N,i}\sigma^2(\tau_{N,i-1})\varepsilon^2_{N,i}$ and that $\frac{N}{H}\,(\sum_{h>0}k_h\hat y_{N,i-h})^2 \xrightarrow{p} k_\bullet^{0,0}\sigma^2(\tau_{N,i})$. Now we follow Phillips and Yu (unpublished paper) and write
$$\mathrm{E}(D^2_{N,i}|\mathcal F_{\tau_{N,i-1}}) = D_{N,i}\frac{\mathrm{E}(D^2_{N,i}|\mathcal F_{\tau_{N,i-1}})}{\mathrm{E}(D_{N,i}|\mathcal F_{\tau_{N,i-1}})} - \{D_{N,i} - \mathrm{E}(D_{N,i}|\mathcal F_{\tau_{N,i-1}})\}\frac{\mathrm{E}(D^2_{N,i}|\mathcal F_{\tau_{N,i-1}})}{\mathrm{E}(D_{N,i}|\mathcal F_{\tau_{N,i-1}})}.$$
Now
$$\frac{k_\bullet^{0,0}}{N}\sum_{i=1}^{N}\sigma^4(\tau_{N,i-1})\{D_{N,i} - \mathrm{E}(D_{N,i}|\mathcal F_{\tau_{N,i-1}})\}\frac{\mathrm{E}(D^2_{N,i}|\mathcal F_{\tau_{N,i-1}})}{\mathrm{E}(D_{N,i}|\mathcal F_{\tau_{N,i-1}})} = o_p(1),$$
as this is a temporal average of a martingale difference sequence. This means that
$$\frac{k_\bullet^{0,0}}{N}\sum_{i=1}^{N}\sigma^4(\tau_{N,i-1})\,\mathrm{E}(D^2_{N,i}|\mathcal F_{\tau_{N,i-1}}) = k_\bullet^{0,0}\int_0^1\frac{\varkappa_2(u)}{\varkappa_1(u)}\,\sigma^4(u)\,du + o_p(1),$$
by Riemann integration. The result then follows by the martingale array CLT. □

Proof of Theorem A.5. Follows by combining the results of Propositions A.3–A.5. □

A.3. Multivariate results

Proof of Lemma 1. The results (2) and (4) follow by combining Theorem 1 with Proposition A.4 and Theorem A.5. From the proof of Theorem 1 we have $K(X) = K(Y) + K(U) + O_p(\sqrt{K(U)})$, and (3) follows since $K(Y) \xrightarrow{p} [Y]$ and $K(U) \xrightarrow{p} \frac{-k''(0)}{c_0^2}\Omega$ when $H = c_0 n^{1/2}$. □

Proof of Theorem 3. We analyse the joint characteristic function of the realised kernel matrix,
$$\mathrm{E}\exp\left[i\,\mathrm{tr}\{AK(X)\}\right] = \mathrm{E}\exp\left(i\sum_{j=1}^{d}\lambda_j\,\mathrm{tr}\{K(X)a_j a_j'\}\right),$$
where $A = \sum_{j=1}^{d}\lambda_j a_j a_j'$ is a symmetric matrix of constants.8 Hence it is sufficient for us to study the joint law of $a_j'K(X)a_j$, for any fixed $a_j$, j = 1, …, d. This is a convenient form, as $a_j'K(X)a_j = K(a_j'X)$, the univariate kernel applied to the process $a_j'X$. This is very convenient, as $a_j'X$ is simply a univariate process in our class. The univariate results imply that the only thing left to study is the joint distribution of $K(a_j'Y)$. Now under the conditions of the theorem, with n, H → ∞ and $H \propto n^\eta$ for $\eta \in (0,1)$ and $m/n \to 0$, we will establish that
$$\sqrt{\frac{n}{H}}\left\{K(Y) - \int_0^t\Sigma(u)\,du\right\} \xrightarrow{L_s} MN\!\left(0,\; 4k_\bullet^{0,0}\int_0^1\frac{\varkappa_2(u)}{\varkappa_1(u)}\,\Psi(u)\,du\right), \tag{A.7}$$
where $\Psi(u) = \Sigma(u)\otimes\Sigma(u)$. This will then complete the theorem. The univariate proof implies that we can replace $K(a_j'Y)$ by $\hat K(a_j'Y) = \sum_{i=1}^{n}x_{n,j,i}^2 + 2\sum_{i=1}^{n}x_{n,j,i}\sum_{h=1}^{n-1}k_h\,x_{n,j,i-h}$, where $x_{n,j,i} = a_j'\sigma^{1/2}(\tau_{n,i-1})\Delta_{n,i}\varepsilon_{n,i}$ and $\varepsilon_{n,i} = W_{\tau_{n,i}} - W_{\tau_{n,i-1}}$. But this raises no new principles, and so, using the same method as before, we have
$$\sqrt{\frac{n}{H}}\left[\begin{pmatrix} K(a_j'Y) \\ K(a_k'Y) \end{pmatrix} - \begin{pmatrix} a_j'\left(\int_0^t\Sigma(u)\,du\right)a_j \\ a_k'\left(\int_0^t\Sigma(u)\,du\right)a_k \end{pmatrix}\right] \xrightarrow{L_s} MN\!\left(0,\; 4k_\bullet^{0,0}\int_0^1\frac{\varkappa_2(u)}{\varkappa_1(u)}\begin{pmatrix} \{a_j'\Sigma(u)a_j\}^2 & \{a_j'\Sigma(u)a_k\}^2 \\ \{a_j'\Sigma(u)a_k\}^2 & \{a_k'\Sigma(u)a_k\}^2 \end{pmatrix}du\right).$$
Unwrapping the results delivers (A.7) as required. □

Proof of Theorem 2. Follows as a corollary to Theorem 3. □

8 It is well known that the distribution is characterised by the matrix characteristic function $\mathrm{E}\exp[i\,\mathrm{tr}\{A'K(X)\}]$. Without loss of generality we can assume A is symmetric, as K(X) is symmetric and $\mathrm{tr}(A'K(X)) = \sum_{i,j}a_{ij}K(X)_{ji} = \sum_i a_{ii}K(X)_{ii} + \sum_{i<j}(a_{ij} + a_{ji})K(X)_{ij} = \sum_{i,j}\frac{a_{ij} + a_{ji}}{2}K(X)_{ij} = \frac{1}{2}\mathrm{tr}\{(A + A')'K(X)\}$.
A.4. Optimal choice of bandwidth

The problem is simply to minimise the squared bias plus the contribution from the asymptotic variance with respect to $c_0$. Set $\mathrm{IQ} = \int_0^1\sigma^4(u)\,du$. The first order conditions of
$$\min_{c_0}\left\{c_0^{-4}\,k''(0)^2\omega^4 + c_0\,4k_\bullet^{0,0}\,\mathrm{IQ}\right\}$$
yield the optimal value for $c_0$,
$$c_0^* = \left\{\frac{k''(0)^2\,\omega^4}{k_\bullet^{0,0}\,\mathrm{IQ}}\right\}^{1/5} = c^*\xi^{4/5}, \qquad \text{with } c^* = \{k''(0)^2/k_\bullet^{0,0}\}^{1/5}.$$
With $H^* = c^*\xi^{4/5}n^{3/5}$ the asymptotic bias is given by
$$-\left\{\frac{k''(0)^2\omega^4}{k_\bullet^{0,0}\,\mathrm{IQ}}\right\}^{-2/5}k''(0)\,\omega^2\,n^{-1/5} = |k''(0)\,\omega^2|^{1/5}\{k_\bullet^{0,0}\,\mathrm{IQ}\}^{2/5}\,n^{-1/5},$$
and the asymptotic variance is
$$\left\{\frac{k''(0)^2\omega^4}{k_\bullet^{0,0}\,\mathrm{IQ}}\right\}^{1/5}4k_\bullet^{0,0}\,\mathrm{IQ}\,n^{-2/5} = 4|k''(0)\,\omega^2|^{2/5}\{k_\bullet^{0,0}\,\mathrm{IQ}\}^{4/5}\,n^{-2/5}.$$
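As a numerical check (our sketch, with assumed constants), $c^*$ for the Parzen weight function can be computed from $k''(0)^2 = 144$ and $k_\bullet^{0,0} \approx 0.269$, a value we take from the univariate realised kernel literature; the result, $c^* \approx 3.513$, matches the constant in the bandwidth $H_P = 3.51\,\omega^{4/5}n^{3/5}$ quoted in Section 6.1.2.

```python
# c* = {k''(0)^2 / kdot}^(1/5); H* = c* xi^(4/5) n^(3/5)
kpp0_sq, kdot = 144.0, 0.269          # assumed Parzen constants
c_star = (kpp0_sq / kdot) ** 0.2
n, xi = 4096, 0.05                    # illustrative values; xi^2 = w^2/sqrt(IQ)
H_star = c_star * xi ** 0.8 * n ** 0.6
print(round(c_star, 3), round(H_star, 1))   # c* = 3.513
```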
Appendix B. Errors induced by stale prices

The stale prices induce a particular form of noise with an endogenous component. The price indexed by time $\tau_j$ is, in fact, the price recorded at time $t_j^{(i)} \leq \tau_j$, for i = 1, …, d. With Refresh Time we have $\tau_j \geq t_j^{(i)} > \tau_{j-1}$, so that
$$X^{(i)}(\tau_j) = Y^{(i)}(t_j^{(i)}) + \tilde U^{(i)}(t_j^{(i)}) = Y^{(i)}(\tau_j) + \underbrace{\tilde U^{(i)}(t_j^{(i)}) - \{Y^{(i)}(\tau_j) - Y^{(i)}(t_j^{(i)})\}}_{U^{(i)}(\tau_j)}.$$
The endogenous component that is induced by refresh time is $Y^{(i)}(\tau_j) - Y^{(i)}(t_j^{(i)})$. But this is exactly the sort of dependence that Assumption U can accommodate, through correlation between W and $\tilde W$, and the (random) coefficients $\psi_h(t)$, h = 0, 1, ….

References

Andersen, T.G., Bollerslev, T., Diebold, F.X., 2010. Parametric and nonparametric measurement of volatility. In: Aït-Sahalia, Y., Hansen, L.P. (Eds.), Handbook of Financial Econometrics. North Holland, Amsterdam, pp. 67–138. Andersen, T.G., Bollerslev, T., Diebold, F.X., Labys, P., 2000. Great realizations. Risk 13, 105–108. Andersen, T.G., Bollerslev, T., Diebold, F.X., Labys, P., 2001. The distribution of exchange rate volatility. Journal of the American Statistical Association 96, 42–55. Correction published in 2003, volume 98, page 501. Andersen, T.G., Bollerslev, T., Diebold, F.X., Labys, P., 2003. Modeling and forecasting realized volatility. Econometrica 71, 579–625. Andersen, T.G., Bollerslev, T., Meddahi, N., 2011. Market microstructure noise and realized volatility forecasting. Journal of Econometrics 160, 220–234. Andrews, D.W.K., 1991. Heteroskedasticity and autocorrelation consistent covariance matrix estimation. Econometrica 59, 817–858. Bandi, F.M., Russell, J.R., 2008. Microstructure noise, realized variance, and optimal sampling. Review of Economic Studies 75, 339–369. Bandi, F.M., Russell, J.R., 2005. Realized covariation, realized beta and microstructure noise. Graduate School of Business, University of Chicago (unpublished paper). Bandi, F.M., Russell, J.R., 2006. Separating microstructure noise from volatility. Journal of Financial Economics 79, 655–692. Barndorff-Nielsen, O.E., Hansen, P.R., Lunde, A., Shephard, N., 2008. Designing realised kernels to measure the ex-post variation of equity prices in the presence of noise. Econometrica 76, 1481–1536. Barndorff-Nielsen, O.E., Hansen, P.R., Lunde, A., Shephard, N., 2009. Realised kernels in practice: trades and quotes. Econometrics Journal 12, 1–33. Barndorff-Nielsen, O.E., Shephard, N., 2007. Variation, jumps and high frequency data in financial econometrics. In: Blundell, R., Persson, T., Newey, W.K. (Eds.), Advances in Economics and Econometrics. Theory and Applications, Ninth World Congress. In: Econometric Society Monographs, Cambridge University Press, pp. 328–372. Barndorff-Nielsen, O.E., Shephard, N., 2002. Econometric analysis of realised volatility and its use in estimating stochastic volatility models. Journal of the Royal Statistical Society, Series B 64, 253–280.
Barndorff-Nielsen, O.E., Shephard, N., 2004. Econometric analysis of realised covariation: high frequency covariance, regression and correlation in financial economics. Econometrica 72, 885–925. Barndorff-Nielsen, O.E., Shephard, N., 2005. Power variation and time change. Theory of Probability and its Applications 50, 1–15. Bollerslev, T., Tauchen, G., Zhou, H., 2009. Expected stock returns and variance risk premia. Review of Financial Studies 22, 4463–4492. Brownlees, C.T., Gallo, G.M., 2006. Financial econometric analysis at ultra-high frequency: data handling concerns. Computational Statistics & Data Analysis 51, 2232–2245. Campbell, J.Y., Lo, A.W., MacKinlay, A.C., 1997. The Econometrics of Financial Markets. Princeton University Press. Christensen, K., Kinnebrock, S., Podolskij, M., 2009. Pre-averaging estimators of the ex-post covariance matrix. Research paper 2009-45, CREATES, Aarhus University. de Pooter, M., Martens, M., van Dijk, D., 2008. Predicting the daily covariance matrix for S&P 100 stocks using intraday data — but which frequency to use? Econometric Reviews 27, 199–229. Doornik, J.A., 2006. Ox: An Object-Oriented Matrix Programming Language, fifth ed. Timberlake Consultants Ltd., London. Dovonon, P., Goncalves, S., Meddahi, N., 2011. Bootstrapping realized multivariate volatility measures. Journal of Econometrics (forthcoming). Drechsler, I., Yaron, A., 2011. What's vol got to do with it. Review of Financial Studies 24, 1–45. Embrechts, P., Klüppelberg, C., Mikosch, T., 1997. Modelling Extremal Events for Insurance and Finance. Springer, Berlin. Engle, R.F., Gallo, J.P., 2006. A multiple indicator model for volatility using intra daily data. Journal of Econometrics 131, 3–27. Engle, R.F., Kroner, K.F., 1995. Multivariate simultaneous generalized ARCH. Econometric Theory 11, 122–150. Engle, R.F., Mezrich, J., 1996. GARCH for groups. Risk 36–40. Epps, T.W., 1979. Comovements in stock prices in the very short run. Journal of the American Statistical Association 74, 291–296. Falkenberry, T.N., 2001. High frequency data filtering. Technical Report, Tick Data. Fisher, L., 1966. Some new stock-market indexes. Journal of Business 39, 191–225. Fleming, J., Kirby, C., Ostdiek, B., 2003. The economic values of volatility timing using 'realized' volatility. Journal of Financial Economics 67, 473–509. Gallant, A.R., 1987. Nonlinear Statistical Models. John Wiley, New York. Ghysels, E., Harvey, A.C., Renault, E., 1996. Stochastic volatility. In: Rao, C.R., Maddala, G.S. (Eds.), Statistical Methods in Finance. North-Holland, Amsterdam, pp. 119–191. Glasserman, P., 2004. Monte Carlo Methods in Financial Engineering. Springer-Verlag, New York, Inc. Goncalves, S., Meddahi, N., 2009. Bootstrapping realized volatility. Econometrica 77, 283–306. Griffin, J.E., Oomen, R.C.A., 2011. Covariance measurement in the presence of nonsynchronous trading and market microstructure noise. Journal of Econometrics 160, 58–68. Guillaume, D.M., Dacorogna, M.M., Dave, R.R., Müller, U.A., Olsen, R.B., Pictet, O.V., 1997. From the bird's eye view to the microscope: a survey of new stylized facts of the intra-daily foreign exchange markets. Finance and Stochastics 2, 95–130. Hansen, P.R., Horel, G., 2009. Quadratic Variation by Markov Chains. Working paper. http://www.stanford.edu/people/peter.hansen. Hansen, P.R., Large, J., Lunde, A., 2008. Moving average-based estimators of integrated variance. Econometric Reviews 27, 79–111. Hansen, P.R., Lunde, A., 2005.
Hansen, P.R., Lunde, A., 2005. A realized variance for the whole day based on intermittent high-frequency data. Journal of Financial Econometrics 3, 525–554.
Hansen, P.R., Lunde, A., 2006. Realized variance and market microstructure noise (with discussion). Journal of Business and Economic Statistics 24, 127–218.
Harris, F., McInish, T., Shoesmith, G., Wood, R., 1995. Cointegration, error correction and price discovery on informationally-linked security markets. Journal of Financial and Quantitative Analysis 30, 563–581.
Hayashi, T., Jacod, J., Yoshida, N., 2008. Irregular sampling and central limit theorems for power variations: the continuous case. Keio University (unpublished paper).
Hayashi, T., Yoshida, N., 2005. On covariance estimation of non-synchronously observed diffusion processes. Bernoulli 11, 359–379.
Jacod, J., 1994. Limit of random measures associated with the increments of a Brownian semimartingale. Preprint number 120, Laboratoire de Probabilités, Université Pierre et Marie Curie, Paris.
Jacod, J., 2008. Statistics and high frequency data (unpublished paper).
Jacod, J., Li, Y., Mykland, P.A., Podolskij, M., Vetter, M., 2009. Microstructure noise in the continuous case: the pre-averaging approach. Stochastic Processes and their Applications 119, 2249–2276.
Jacod, J., Protter, P., 1998. Asymptotic error distributions for the Euler method for stochastic differential equations. Annals of Probability 26, 267–307.
Jacod, J., Shiryaev, A.N., 2003. Limit Theorems for Stochastic Processes, second ed. Springer, Berlin.
Kalnina, I., Linton, O., 2008. Estimating quadratic variation consistently in the presence of correlated measurement error. Journal of Econometrics 147, 47–59.
Large, J., 2007. Accounting for the Epps effect: realized covariation, cointegration and common factors. Oxford-Man Institute, University of Oxford (unpublished paper).
Li, Y., Mykland, P., Renault, E., Zhang, L., Zheng, X., 2009. Realized volatility when endogeneity of time matters. Working Paper, Department of Statistics, University of Chicago.
Malliavin, P., Mancino, M.E., 2002. Fourier series method for measurement of multivariate volatilities. Finance and Stochastics 6, 49–61.
Martens, M., 2003. Estimating unbiased and precise realized covariances. Department of Finance, Erasmus School of Economics, Rotterdam (unpublished paper).
Mykland, P.A., Zhang, L., 2006. ANOVA for diffusions and Itô processes. Annals of Statistics 34, 1931–1963.
Mykland, P.A., Zhang, L., 2009. Inference for continuous semimartingales observed at high frequency: a general approach. Econometrica 77, 1403–1455.
Mykland, P.A., Zhang, L., 2009. The econometrics of high frequency data. In: Kessler, M., Lindner, A., Sørensen, M. (Eds.), Statistical Methods for Stochastic Differential Equations. Chapman & Hall/CRC Press (forthcoming).
Newey, W.K., West, K.D., 1987. A simple positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix. Econometrica 55, 703–708.
Phillips, P.C.B., Yu, J., 2008. Information loss in volatility measurement with flat price trading. Cowles Foundation for Research in Economics, Yale University (unpublished paper).
Protter, P., 2004. Stochastic Integration and Differential Equations. Springer-Verlag, New York.
Renault, E., Werker, B., 2011. Causality effects in return volatility measures with random times. Journal of Econometrics 160, 272–279.
Reno, R., 2003. A closer look at the Epps effect. International Journal of Theoretical and Applied Finance 6, 87–102.
Shephard, N., Sheppard, K., 2010. Realising the future: forecasting with high-frequency-based volatility (HEAVY) models. Journal of Applied Econometrics 25, 197–231.
Silverman, B.W., 1986. Density Estimation for Statistics and Data Analysis. Chapman & Hall, London.
Vetter, M., 2008. Estimation methods in noisy diffusion models. Ph.D. Thesis, Institute of Mathematics, Ruhr University Bochum.
Voev, V., Lunde, A., 2007. Integrated covariance estimation using high-frequency data in the presence of noise. Journal of Financial Econometrics 5, 68–104.
Zhang, L., 2006. Efficient estimation of stochastic volatility using noisy observations: a multi-scale approach. Bernoulli 12, 1019–1043.
Zhang, L., Mykland, P.A., Aït-Sahalia, Y., 2005. A tale of two time scales: determining integrated volatility with noisy high-frequency data. Journal of the American Statistical Association 100, 1394–1411.
Zhou, B., 1996. High-frequency data and volatility in foreign-exchange rates. Journal of Business and Economic Statistics 14, 45–52.
Zhou, B., 1998. Parametric and nonparametric volatility measurement. In: Dunis, C.L., Zhou, B. (Eds.), Nonlinear Modelling of High Frequency Financial Time Series. John Wiley & Sons Ltd., New York, pp. 109–123 (Chapter 6).
Journal of Econometrics 162 (2011) 170–188
Estimating features of a distribution from binomial data✩

Arthur Lewbel a,∗, Daniel McFadden b, Oliver Linton c

a Department of Economics, Boston College, 140 Commonwealth Avenue, Chestnut Hill, MA 02467, USA
b Department of Economics, University of California, Berkeley, CA 94720-3880, USA
c Department of Economics, London School of Economics, Houghton Street, London WC2A 2AE, United Kingdom
Article history: Received 1 July 2008; Received in revised form 22 August 2010; Accepted 8 November 2010; Available online 27 November 2010.
JEL classification: C14; C25; C42; H41.
Abstract: We propose estimators of features of the distribution of an unobserved random variable W. What is observed is a sample of Y, V, X where a binary Y equals one when W exceeds a threshold V determined by experimental design, and X are covariates. Potential applications include bioassay and destructive duration analysis. Our empirical application is referendum contingent valuation in resource economics, where one is interested in features of the distribution of values W (willingness to pay) placed by consumers on a public good such as endangered species. Sampled consumers with characteristics X are asked whether they favor (with Y = 1 if yes and zero otherwise) a referendum that would provide the good at a cost V specified by experimental design. This paper provides estimators for quantiles and conditional-on-X moments of W under both nonparametric and semiparametric specifications. © 2010 Elsevier B.V. All rights reserved.
Keywords: Willingness to pay; Contingent valuation; Discrete choice; Binomial response; Bioassay; Destructive duration testing; Semiparametric; Nonparametric; Latent variable models.
1. Introduction

Consider an experiment where an individual is asked if he would be willing to pay more than V dollars for some product. Let unobserved W be the most the individual would be willing to pay, and let the individual's response be Y = 1(W > V), so Y = 1 if his latent willingness to pay (WTP) is greater than the proposed bid price V, and zero otherwise. In a typical experiment like this, V is a random draw from some distribution determined by the researcher. For example, in our empirical application V is chosen by the researcher to be one of fourteen different possible dollar values ranging from $25 to $375, and the question is whether the individual would be willing to pay more than V dollars to protect wetlands in California.
✩ This research was supported in part by the National Science Foundation through grants SES-9905010 and SBR-9730282, by the E. Morris Cox Endowment, and by the ESRC. The authors would like to thank anonymous referees and the co-editor for many helpful suggestions.
∗ Corresponding author. Tel.: +1 617 552 3678. E-mail addresses: [email protected] (A. Lewbel), [email protected] (D. McFadden), [email protected] (O. Linton).
Given data on Y and V, the goal of the analysis is estimation of features of the distribution of W across individuals, such as moments and quantiles. We also observe covariates X in addition to Y and V, and we consider estimation of µr(x) = E(r(W, X) | X = x) for some function r specified by the researcher. For example, our data include education level and gender, so we could let r(W, X) = W to estimate the mean WTP E(W | X = x) among college educated women x. Letting t denote a parameter, other examples are r(W, X) = e^{tW}, leading to the moment generating function; r(W, X) = 1(W ≤ t), leading to the probability of the event W ≤ t; and r(W, X) = 1(W ≤ t)W, leading to a trimmed mean, all conditioned on, say, likely voters x. Our analysis permits the experimental design to depend on x, e.g., wealthier individuals could be assigned relatively high bids.1 Our estimators are readily extended to interval censored data with multiple (adaptive) test levels and multinomial status.

We have given the example of a contingent valuation study using a referendum format elicitation, but the same structure arises in many other contexts. The problem can be generally described as uncovering features of conditional survival curves from unbounded interval censored data. Let W denote a random failure time, and let G(w | x) = Pr(W > w | X = x) denote the survival curve conditioned on time invariant covariates X. For example, in a bioassay, V is the time an animal exposed to an environmental hazard is sacrificed for testing, G is the distribution of survival times until the onset of abnormality, and Y is an indicator for an abnormality found by the test, termed a current status observation. Duration may also occur in dimensions other than time. In dose-response studies, V is the administered dose of a toxin, W is the lethal dose, G is the dose-response curve, and Y is a mortality indicator. For materials testing, Y indicates that the material meets some requirement at treatment level V; e.g., G could be the distribution of speeds W at which a car safety device fails, with Y = 1 indicating failure at test speed V.

A common procedure is to completely parameterize W, e.g., to assume W equals X⊤θ0 − ε with ε ∼ N(α0, σ²). The model then takes the form of a standard probit Y = 1[X⊤θ0 − V > ε] and can be estimated using maximum likelihood. However, estimation of the features of the distribution of W differs from ordinary binomial response model estimation when the model is not fully parameterized, because the goal is estimation of moments or quantiles of W, rather than response or choice probabilities of Y. So, for example, in the above parameterized model E(W | X = x) = X⊤θ0 − α0, and therefore any binomial response model estimator that fails to estimate the location term α0, such as the semiparametrically efficient estimator of Klein and Spady (1993), is inadequate for estimation of moments of W.

Another important difference is the role of the support of V. By construction G(v | x) = E(Y | V = v, X = x), so G can be estimated using ordinary parametric, semiparametric, or nonparametric conditional mean estimation. But nonparametric estimation of moments of W then requires identification of G(v | x) everywhere on the support of W, so nonparametric identification requires that the support of V contains the support of W. However, virtually all experiments only consider a small number of values for v. While the literature contains many estimators of moments of W,2 virtually all of them are parametric or semiparametric, using functional form assumptions to obtain identification, without recognizing or acknowledging the resulting failure of nonparametric identification.3

1 Other experimental designs include follow-up queries to gain more information about WTP, and open ended questions, where subjects are simply asked to state their WTP. Open ended questions often suffer from high rates of nonresponse (with possible selection bias), while referendum format follow-up responses can be biased due to the framing effect of the first bid. This shadowing effect is common in unfolding bracket survey questions. See McFadden (1994) for references and experimental evidence regarding response biases. Other issues regarding the framing of questions also impact survey responses, particularly anchoring to test values, including the initial test value; see Green et al. (1998). The data generation process may then be a convolution of the target distribution and a distribution of psychometric errors. This paper will ignore these issues and treat the data generation process as if it were the target distribution. However, we do empirically apply our estimators separately to first round and follow-up bids, and find differences in the results, which provides evidence that such biases are present. The difficult general problem of deconvoluting a target distribution in the presence of psychometric errors is left for future research.
2 See, e.g., Kanninen (1993) and Crooker and Herriges (2004) for comparisons of various, mostly parametric, WTP estimators. Estimators that are not fully parameterized include Chen and Randall (1997), Creel and Loomis (1997) and An (2000) for WTP, and Ramgopal et al. (1993) and Ho and Sen (2000) for bioassay.
3 In supplementary materials to this paper, we show that, given a fixed discrete design for V, even assuming that W = m(X) − ε with X and ε independent is still not sufficient for identification, though identification does become possible in this case if m(X) is finitely parameterized. See also Matzkin (1992).
Our nonparametric estimators obtain identification by assuming either that bids v are draws from a continuously distributed random variable V, or that the experimental design varies with the sample size n, so for any fixed n there may be a finite number of values bids can take on, but this number of possible bid values becomes dense in the support of W as n goes to infinity.4 We also show how this dependence of survey design on sample size affects the resulting limiting distributions, and we provide an alternative identifying assumption based on a semiparametric specification of W described below.5 With an estimate of the density g(w | x) = −∂G(w | x)/∂w and sufficient identifying assumptions, features of the distribution of W such as moments and percentiles can be readily recovered. In particular, moments µr(x) = E(r(W, X) | X = x) = ∫ r(w, x) g(w | x) dw can be estimated in two steps, using second-step numerical integration after plugging in a first-step estimate of the conditional density. We provide alternative semiparametric and nonparametric estimators of µr(x) that do not require estimation of g(w | x). These estimators utilize the feature that v is determined by experimental design, use integration by parts to obtain an expression for µr(x) that depends on G(w | x) but not its derivative, and use numerical integration methods that work with first-step undersmoothed estimators of G(w | x) at a limited number of convenient evaluation points. When the experimental design for v is known, we provide estimators for smooth moments of W that use only the indicator Y and do not require a first-step estimator of G(w | x).

We consider estimation for two different information conditions on the conditional distribution of W given X. In the most general case, this distribution is completely unrestricted apart from smoothness, and is estimated nonparametrically. We may write this case as W = m(X, ε) with m unknown and ε an unobserved disturbance that is independent of X. This includes as a more restrictive case the location model W = m(X) − ε. We also include here the special cases where we are interested in unconditional moments of W, or in conditional moments when X has finite support. The second case we analyze is the semiparametric model W = Λ[m(X, θ0) − ε] for known functions m and Λ, an unknown finite parameter vector θ0, and a distribution for the disturbance ε that is known only to be independent of X. This model includes as special cases the probit model discussed earlier, similar logit models, and the Weibull proportional hazards model in which Λ is exponential and ε is extreme value distributed. In this semiparametric model, identification requires that the support of m(X, θ0) − Λ−1(V) become dense in the support of ε; if X includes a continuously distributed component, this can be achieved even if the support of V is fixed and finite.

We also consider estimation for two information conditions on the asymptotic distribution of the bid values V: the case where this is known to the researcher, and the case where it is unknown. We provide estimators, and associated limiting normal distributions, for these primary information conditions, a Monte Carlo analysis of the estimators, and an empirical application estimating conditional mean WTP to protect wetland habitats in California's San Joaquin Valley. The two-step estimator that uses a first-step estimator of g(w | x) is denoted µ̂0r(x).
Table 1 gives the notation for the other estimators we offer for the various information conditions.
4 Virtually all existing contingent valuation data sets draw bids from discrete distributions. However, large surveys typically have bid distributions with more mass points than small surveys, consistent with our assumption of an increasing number of bid values as sample size grows. See, e.g., Crooker and Herriges (2004) for a study of WTP bid designs, with explicit consideration of varying numbers of mass points.
5 An approach that we do not pursue in this paper is to sacrifice point identification and instead estimate bounds on features of G, as in McFadden (1998). See also Manski and Tamer (2002).
Table 1. Information used to construct different estimators.

Information conditions       Nonparametric G      Semiparametric G
Density of v known           µ̂1r(x)               µ̂3r(x)
Density of v unknown         µ̂2r(x)               µ̂4r(x)
2. Estimators

2.1. The data generation process and estimands

Let G(w | x) = Pr(W > w | X = x), so G is the unknown complementary cumulative distribution function of a latent, continuously distributed unobserved random scalar W, conditioned on a vector of observed covariates X. Let g(w | x) denote the conditional probability density function of W, so g = −dG/dw. A test value v (a realization of V) is set by an experimental design or natural experiment. Define Y = 1(W > V), where 1(·) is the indicator function. The observed data consist of a sample of realizations of covariates X, test values V, and outcomes Y. The framework is similar to randomly censored regressions (with censoring point V), except that under random censoring we would observe W for observations having W > V, whereas in the present context we only observe Y = 1(W > V).

Given a function r(w, x), the goal is estimation of the conditional moment µr(x) = E[r(W, X) | X = x] for any chosen x in the support of X. Let r′(w, x) denote ∂r(w, x)/∂w wherever it exists, and let G−1(· | x) denote the inverse of the function G(w | x) with respect to its first argument. We assume the conditional distribution of W given X = x is not finitely parameterized, since otherwise ordinary maximum likelihood estimation would suffice.

Assumption A.1. The covariate vector X is composed of a possibly empty discrete subvector Q that ranges over a finite number of configurations, and a possibly empty continuous subvector Z that ranges over a compact rectangle in R^d; Q has a positive density p1(q), and Z has a positive Lipschitz-continuous density p2(z | q). The latent scalar W has an unknown conditional CDF 1 − G(w | x) for x = (q, z) with compact support [ρ0(x), ρ1(x)], and G(w | x) is continuously differentiable with Lipschitz-continuous derivatives and a positive density function g(w | x). The variables W and V are conditionally independent given X, and Y = 1(W > V).

Assumption A.2. The function r(w, x), chosen by the researcher, is continuous in (w, x) and is continuously differentiable in w, with a uniformly Lipschitz derivative, for each x.

We term a function r(w, x) satisfying Assumption A.2 regular. From Assumption A.1, and in particular the conditional independence of W and V,

G(v | x) = E(Y | V = v, X = x) = Pr(Y = 1 | V = v, X = x).  (1)
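As a quick illustration of Eq. (1), the following Python sketch simulates the data generating process and checks that a local average of Y recovers G(v | x). All distributional choices (normal W, uniform bids, the bandwidth) are hypothetical examples of ours, not taken from the paper.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical design: binary covariate X, latent WTP W, bids V independent of W given X.
X = rng.integers(0, 2, size=n)
W = 100.0 + 50.0 * X + rng.normal(0.0, 30.0, size=n)  # latent, never observed directly
V = rng.uniform(0.0, 400.0, size=n)                   # test value set by the design
Y = (W > V).astype(float)                             # observed referendum response

# Check G(v | x) = E(Y | V = v, X = x) at v = 120, x = 1.
v0, x0, band = 120.0, 1, 5.0
sel = (X == x0) & (np.abs(V - v0) < band)
true_G = 0.5 * (1.0 - erf((v0 - 150.0) / (30.0 * sqrt(2.0))))  # P(N(150, 30^2) > 120)
print("local average of Y:", Y[sel].mean(), " true G(120 | 1):", true_G)
```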
For a regular function r(w, x), integration by parts yields

µr(x) = ∫_{ρ0(x)}^{ρ1(x)} r(w, x) g(w | x) dw = r(ρ0(x), x) + ∫_{ρ0(x)}^{ρ1(x)} r′(v, x) G(v | x) dv.  (2)
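Eq. (2) is a direct integration-by-parts identity and is easy to check numerically. The sketch below uses a hypothetical example (W distributed Beta(2, 2) on [0, 1] and r(w) = w²) and compares the two sides of (2) on a fine grid.

```python
import numpy as np

# Hypothetical example: W ~ Beta(2, 2) on [rho0, rho1] = [0, 1], r(w) = w^2.
g = lambda w: 6.0 * w * (1.0 - w)                 # density of W
G = lambda w: 1.0 - (3.0 * w**2 - 2.0 * w**3)     # survival function Pr(W > w)
r = lambda w: w**2
r_prime = lambda w: 2.0 * w

def trapezoid(f_vals, grid):
    return float(np.sum(0.5 * (f_vals[1:] + f_vals[:-1]) * np.diff(grid)))

v = np.linspace(0.0, 1.0, 100_001)
lhs = trapezoid(r(v) * g(v), v)                   # E r(W) = integral of r * g
rhs = r(0.0) + trapezoid(r_prime(v) * G(v), v)    # r(rho0) + integral of r' * G, Eq. (2)
print(lhs, rhs)                                   # both approximately E(W^2) = 0.3
```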
The regular class includes smooth functions such as r(w, x) = w^t and r(w, x) = e^{tw} for a parameter t, corresponding to moments and to the moment generating function. For any κ(x) having ρ0(x) < κ(x) < ρ1(x), we can rewrite Eq. (2) as

µr(x) = r(κ(x), x) + ∫_{ρ0(x)}^{ρ1(x)} r′(v, x) [G(v | x) − 1(v < κ(x))] dv.  (3)

The estimators we consider below are obtained by substituting first-step estimates or empirical analogs of g(v | x) or G(v | x) into (2) or (3). The parameter κ(x) does not affect the estimand µr(x), but it can affect some estimators in finite samples even though it drops out asymptotically. For simplicity we later demean v and take κ(x) to be zero, or otherwise choose some central value for κ(x). It is possible to extend the regular class to include functions with a finite number of breaks, with a corresponding extension of the integration by parts formula (2); this can be used to obtain analogs of the estimators in this paper for conditional percentiles and trimmed moments.6

6 If r(w, x) has possible break points at ρ0(x) = w1(x) < ··· < wK(x) = ρ1(x), integrating by parts between the breakpoints gives µr(x) = r(ρ0(x)⁺, x) + Σ_{k=2}^{K−1} [r(wk(x)⁺, x) − r(wk(x)⁻, x)] G(wk(x) | x) + ∫_{ρ0(x)}^{ρ1(x)} r′(v, x) G(v | x) dv when the one-sided limits exist.

If G(w | x) is not at least partly parameterized, then Eq. (1) implies that for identification of the distribution of W the support of V should contain the support of W. As noted in the introduction, and by the identification analysis in the supplemental Appendix to this paper, the distribution of W is in general not identified when the asymptotic support of V has a finite number of elements. To identify features of the distribution of W with minimal restrictions on G, our nonparametric estimators assume an experimental design in which the test values of V become dense in the support of W as the sample size grows to infinity.

Let Hn(v, x | n) denote the empirical distribution function for observations of (V, X) for a sample of size n. Realizations could be random draws from a CDF H(v, x | n), but the data, particularly bids, could also be derived from some purposive sampling protocol. The requirement we place on the data generating process to assure nonparametric identification is the following:

Assumption A.3. There exists a CDF H(v, x) with the property that the corresponding conditional distribution of test values V given X = x, denoted H(v | x), has a strictly positive continuous density h(v | x) with a compact support [δ0(x), δ1(x)] that contains the support of W. The empirical distribution function satisfies sup |Hn(v, x | n) − H(v, x)| → 0 almost surely, and n^τ [Hn(v, x | n) − H(v, x)] converges weakly to a Gaussian process for some τ, with τ = 1/2 for root-n asymptotics.

Two examples illustrate this data generating process assumption:

1. Suppose for each sample observation i = 1, ..., n, (Xi, Vi) is drawn randomly from the CDF H(v, x). Then the required sup norm convergence follows by the Glivenko–Cantelli theorem, and the convergence to a Gaussian process with τ = 1/2 can be shown by, e.g., the Shorack and Wellner (1986, p. 108ff) treatment of triangular arrays of empirical processes.

2. For each sample size n, suppose xi is drawn at random from a distribution, and that vi is drawn with random or quota sampling from a distribution H(v | xi, n) that has a finite support containing Jn points, and let ρ0(x) = v0n(x) < ··· < v_{Jn+1,n}(x) = ρ1(x) denote these points plus the end points. Suppose that Jn ≤ n, n^{−1/2−γ} Jn → ∞ for some γ ∈ (0, 1/2), the maximum spacing Sn between the points vjn(x) satisfies plim_{n→∞} n^{1/2} Sn = 0, and H(· | xi, n) converges to a distribution with a positive density h(v | x). Let M be a bound on r′(v, x), r″(v, x) and g(v | x). Then, starting from Eq. (3),

n^{1/2} | ∫_{ρ0(x)}^{ρ1(x)} r′(v, x)[G(v | x) − 1(v < κ(x))] dv − Σ_{j=1}^{Jn} r′(vjn, x)[G(vjn | x) − 1(vjn < κ(x))] | ≤ M(1 + K + M + M²) n^{1/2} Sn →p 0,

and the numerical integration error associated with this design process for test values is root-n asymptotically negligible. If the design points are drawn randomly from a density h(v | x) that satisfies min_v h(v | x) ≥ m > 0, then from David and Nagaraja (2003, p. 327), and the constraints n^γ ≤ n^{−1/2} Jn ≤ n^{1/2} for n large, lim_n Pr(n^{1/2} Sn < c) ≥ lim_n exp(−exp(−n^{−1/2} Jn m c + ln Jn)) = 1, and the condition plim_{n→∞} n^{1/2} Sn = 0 holds. The design process in this example, with the stated constraints on Jn, therefore deviates from the previous random sampling example by a root-n asymptotically negligible amount.

This second example covers all current contingent valuation studies of WTP, provided they are embedded in design processes with test values satisfying the limit conditions on Jn and on the distributions H(v | x, n). Of course, the statement that these designs can be embedded in processes that lead to consistent, normal asymptotics does not guarantee that these asymptotics provide a good approximation to finite-sample behavior. In our simulation studies, we will examine the size of the finite sample bias that results when our estimators are applied with both discrete and continuous designs for the test values V.

2.2. Nonparametric moments

For estimation we suppose that a sample (Qi, Zi, Vi, Yi) with Xi = (Qi, Zi) is generated in accordance with Assumption A.3 for i = 1, ..., n. First consider estimation of G(v | x) and g(v | x). Let En denote the sample empirical expectation over functions of the random variables (Q, Z, V, Y). For concreteness and ease of exposition, in this section we will just consider Nadaraya–Watson kernel estimators to show consistency and asymptotic normal convergence rates. Later we provide limiting distribution theory, including explicit variance formulas, for a more general class of estimators, including local polynomials, that may be numerically preferable in applications. Let K1 and K2 be kernel functions that are symmetric continuously differentiable densities with compact support on R and R^d respectively, and let λ denote a bandwidth parameter. Our estimators for µr(x) make use of the set of first-stage estimators given in Table 2.

Standard arguments for kernel estimation (Silverman, 1986, Section 4.3) show that under Assumption A.3 and the bandwidth restrictions given in Table 2, these estimators converge in probability to the given limits, and with bandwidths that shrink to zero at the optimal λ rate or faster, the deviations of the estimators from their limits, normalized by (nλ^d)^{1/2}, converge to Gaussian processes, with associated MSE converging at (nλ^d)^{−1} rates. For example, the estimator Â with an optimal rate λ ∝ n^{−1/(d+5)} has an MSE converging to zero at the rate n^{−4/(d+5)}. The estimators Ĝ and ĝ are useful when the function h is unknown, while G̃ can be used when h is known from the experimental design.

The estimators Ĝ and G̃ are not guaranteed to be monotone non-increasing. They can be modified to satisfy this condition using either a "pool adjacent violators" algorithm (e.g., Dinse and Lagakos, 1982) or with probability weights on the terms in Â chosen to minimize distance from uniform weights, subject to the monotonicity constraint (Hall and Huang, 2001).
These modifications will not alter the asymptotic behavior of Ĝ and G̃, or necessarily improve finite-sample properties of functionals of these estimators, but they do simplify computation of statistics such as conditional quantiles. Another approach, the generalization of the Kaplan–Meier product limit estimator to conditional distributions, achieves monotonicity, but has more complex asymptotic behavior and achieves no better rate than the kernel based estimators Ĝ or G̃.

Now consider estimation of µr(x) for a piecewise regular function r(w, x). Plugging ĝ into the definition of this moment in Eq. (2) yields the second-step estimator
µ̂0r(x) = ∫_{ρ0(x)}^{ρ1(x)} r(v, x) ĝ(v | x) dv,  (4)

with evaluation requiring numerical integration that contributes an additional error that can be made asymptotically negligible. This estimator will inherit the optimal MSE rate n^{−4/(d+6)} of ĝ.

Relative to µ̂0r, we now define the estimators, suitable for varying information sets, listed in Table 1. The estimator µ̂2r is obtained by plugging Ĝ into Eq. (3) for some researcher-chosen continuous function κ(x) having ρ0(x) < κ(x) < ρ1(x). This gives
µ̂2r(x) = r(κ(x), x) + ∫_{ρ0(x)}^{ρ1(x)} r′(v, x) [Ĝ(v | x) − 1(v < κ(x))] / f(v | x) F(dv | x, n),  (5)

where Eq. (5), compared to Eq. (3), makes the required numerical integration explicit by introducing a researcher-chosen positive density f(v | x) and associated CDF F(v | x) with a support that contains [ρ0(x), ρ1(x)], and a chosen CDF F(v | x, n) with finite support for each n such that sup_v n^{1/2} |F(v | x, n) − F(v | x)| is stochastically bounded. The functions f(v | x) and κ(x) can be chosen for computational convenience or to limit variation in the integrand. The estimator µ̂2r is superior to the base case estimator µ̂0r in the sense that µ̂2r inherits the optimal rate n^{−4/(d+5)} of Ĝ, which is better than the optimal rate of µ̂0r.

When the limiting design density h(v | x) is known, plugging G̃ into Eq. (3), reversing the order of integration and empirical expectation, and substituting asymptotic limits yields the estimator
µ̂1r(x) = r(κ(x), x) + (1/Ĉ(x)) En{ r′(V, X) [(Y − 1(V < κ(X)))/h(V | X)] λ^{−d} K2((Z − z)/λ) 1(Q = q) },  (6)

which requires no numerical integration and converges with an optimal MSE rate of n^{−4/(d+4)}. This is a specific case of the general principle that when a kernel estimator is plugged into a smooth functional, a better asymptotic rate of convergence can be achieved by undersmoothing; see Goldstein and Messer (1992). In addition, if d = 0, corresponding to moments that are unconditional or conditioned only on one of the finite configurations of Q, then Eq. (6) reduces to
µ̂1r(q) = r(κ(x), x) + En{ r′(V, X) [(Y − 1(V < κ(X)))/h(V | X)] 1(Q = q) } / En{1(Q = q)},  (7)

which requires no kernel smoothing, and is root-n consistent and asymptotically normal. The estimation problem in this case is related to that of estimating unconditional survival curves from current status data; see Jewell and van der Laan (2002). The properties of the estimators introduced above are summarized in Theorem 1. The derivation of explicit variance formulas is deferred to later sections, where we generalize these results to allow for a larger class of nonparametric smoothers.
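In the unconditional case, Eq. (7) is nothing more than a ratio of sample averages. The following sketch applies it with r(w, x) = w (so µr = E(W)) under a hypothetical design with a known uniform bid density whose support contains that of W; variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000

# Hypothetical DGP: W ~ N(150, 30^2) (support effectively inside [0, 400]),
# bids V ~ Uniform(0, 400) with known density h = 1/400, and Y = 1(W > V).
W = rng.normal(150.0, 30.0, size=n)
V = rng.uniform(0.0, 400.0, size=n)
Y = (W > V).astype(float)
h = 1.0 / 400.0
kappa = 200.0                       # researcher-chosen centering value

# Eq. (7) with r(w) = w, r'(w) = 1, no covariates (d = 0, a single Q cell):
mu1r_hat = kappa + np.mean((Y - (V < kappa)) / h)
print(mu1r_hat)                     # close to E(W) = 150
```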
Table 2. First stage estimators (formula; bandwidth restrictions; optimal λ rate; probability limit):

D̂(v, x) = En{(1/λ^{d+1}) K1((V − v)/λ) K2((Z − z)/λ) 1(Q = q)}; λ → 0, nλ^{d+1} → ∞; n^{−1/(d+5)}; h(v | x) p2(z | q) p1(q).
Ĉ(x) = En{(1/λ^d) K2((Z − z)/λ) 1(Q = q)}; λ → 0, nλ^d → ∞; n^{−1/(d+4)}; p2(z | q) p1(q).
Â(v, x) = En{(Y/λ^{d+1}) K1((V − v)/λ) K2((Z − z)/λ) 1(Q = q)}; λ → 0, nλ^{d+1} → ∞; n^{−1/(d+5)}; G(v | x) h(v | x) p2(z | q) p1(q).
B̂(v, x) = En{(1/λ^{d+2}) K1′((V − v)/λ) K2((Z − z)/λ) 1(Q = q)}; λ → 0, nλ^{d+2} → ∞; n^{−1/(d+6)}; g(v | x) h(v | x) p2(z | q) p1(q).
Ĝ(v | x) = Â(v, x)/D̂(v, x); n^{−1/(d+5)}; G(v | x).
G̃(v | x) = Â(v, x)/(h(v | x) Ĉ(x)); n^{−1/(d+5)}; G(v | x).
ĝ(v | x) = B̂(v, x)/D̂(v, x); n^{−1/(d+6)}; g(v | x).

Theorem 1. Suppose Assumptions A.1–A.3 hold. Suppose K1 and K2 are kernel functions on R and R^d respectively that are each symmetric, compactly supported, continuously differentiable densities. Then the second-step estimator µ̂0r(x) in Eq. (4), using the first-step estimator ĝ(v | x) with a bandwidth λ proportional to n^{−1/(d+6)}, is consistent, and n^{2/(d+6)}[µ̂0r(x) − µr(x)] is asymptotically normal. The second-step estimator µ̂2r(x) in Eq. (5), using the first-step estimator Ĝ(v | x) with a bandwidth λ proportional to n^{−1/(d+5)}, is consistent, and n^{2/(d+5)}[µ̂2r(x) − µr(x)] is asymptotically normal. When h(v | x) is known, the second-step estimator µ̂1r(x) in Eq. (6), using the first-step estimator Ĉ with a bandwidth λ proportional to n^{−1/(d+4)}, is consistent, and n^{2/(d+4)}[µ̂1r(x) − µr(x)] is asymptotically normal. When h(v | x) is known and the moment µr(x) is unconditional or is conditioned only on the discrete configuration Q, then the estimator µ̂1r(q) in Eq. (7) is consistent and root-n asymptotically normal.

Proof of Theorem 1. We first note that Assumptions A.1 and A.2 suffice to make Eqs. (2) and (3) hold. Also, verification of the asymptotic properties of the kernel estimators in Table 2 given our assumptions is standard. Rewrite µ̂2r(x) in Eq. (5) as

µ̂2r(x) = µr(x) + ∫_{ρ0(x)}^{ρ1(x)} r′(v, x)(Ĝ(v | x) − 1(v < κ(x))) [F(dv | x, n)/f(v | x) − dv] + ∫_{ρ0(x)}^{ρ1(x)} r′(v, x)(Ĝ(v | x) − G(v | x)) dv.

Consider the terms on the right-hand side of this expression, normalized by (nλ^{d+1})^{1/2}. The first integral converges to zero since r′(v, x) is uniformly bounded, and by construction 1/f(v | x) is uniformly bounded and sup_v n^{1/2}|F(v | x, n) − F(v | x)| is stochastically bounded. The final integral converges to a normal variate, as it is a smooth bounded linear functional of a Gaussian process plus an asymptotically negligible process. A similar decomposition establishes that µ̂0r(x) in Eq. (4), normalized by (nλ^{d+2})^{1/2} and adapted computationally with a numerical integration procedure with the same properties as that employed in (11), also converges to a normal variate.

Next, define S(Y, V, X) = r′(V, X)[Y − 1(V < κ(X))]/h(V | X) and let

ψ(X) = E_{Y,V|X} S(Y, V, X) = ∫_{ρ0(X)}^{ρ1(X)} r′(V, X)(G(V | X) − 1(V < κ(X))) dV.

Define ω(X, x, λ) = λ^{−d} K2((Z − z)/λ) 1(Q = q). Then µ̂1r(x) in Eq. (6) can be rewritten as (Box I)

µ̂1r(x) − µr(x) = [En S(Y, V, X)ω(X, x, λ) − E_X ψ(X)ω(X, x, λ) − ψ(x)(ω(X, x, λ) − E_X ω(X, x, λ))] / Ĉ(x) + [E_X (ψ(X) − ψ(x))ω(X, x, λ)] / Ĉ(x).

The denominator Ĉ(x) converges in probability to p2(z | q) p1(q) > 0. The numerator of the first term, normalized by (nλ^d)^{1/2}, is a sum of a triangular array of independent bounded, mean zero random variables, and the variance of the numerator converges to a positive constant. Then a central limit theorem for triangular arrays (e.g., Pollard, 1984, pp. 170–174) establishes that the numerator converges to a normal variate. The numerator of the second term, with the transformation Z = z + λt, becomes

E_X (ψ(X) − ψ(x))ω(X, x, λ) = ∫_Z (ψ(Z, q) − ψ(z, q)) λ^{−d} K2((Z − z)/λ) p2(Z | q) p1(q) dZ = ∫_t (ψ(z + λt, q) − ψ(z, q)) K2(t) p2(z + λt | q) p1(q) dt.

Assumptions A.1 and A.2 imply that ψ(z, q) and p2(z | q) are Lipschitz-continuous in z, so that the last expression scaled by (nλ^d)^{1/2} converges to (nλ^{d+4})^{1/2} c ∫_t t² K2(t) dt for a constant c. At the optimal rate λ ∝ n^{−1/(d+4)} this numerator then converges to a constant. This establishes that the estimator µ̂1r(x) in Eq. (6) converges with an optimal MSE rate of n^{−4/(d+4)}. When d = 0, this gives Eq. (7) with a conventional MSE rate of n^{−1}.
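For concreteness, here is a minimal version of the ratio estimator Ĝ = Â/D̂ from Table 2 with a single continuous covariate (d = 1) and no discrete component. The Gaussian kernels and bandwidths are illustrative stand-ins for the compactly supported kernels assumed in Theorem 1, and the DGP is hypothetical.

```python
import numpy as np
from math import erf, sqrt

def G_hat(v, z, V, Z, Y, lam_v, lam_z):
    """Kernel ratio estimate of G(v | z) = E(Y | V = v, Z = z), i.e. A_hat / D_hat."""
    w = np.exp(-0.5 * ((V - v) / lam_v) ** 2) * np.exp(-0.5 * ((Z - z) / lam_z) ** 2)
    return float(np.sum(w * Y) / np.sum(w))

rng = np.random.default_rng(2)
n = 100_000
Z = rng.uniform(-1.0, 1.0, size=n)                    # continuous covariate
W = 150.0 + 40.0 * Z + rng.normal(0.0, 30.0, size=n)  # hypothetical latent WTP
V = rng.uniform(0.0, 400.0, size=n)                   # bids
Y = (W > V).astype(float)

v0, z0 = 120.0, 0.5
est = G_hat(v0, z0, V, Z, Y, lam_v=10.0, lam_z=0.1)
true = 0.5 * (1.0 - erf((v0 - 170.0) / (30.0 * sqrt(2.0))))  # P(W > 120 | Z = 0.5)
print(est, true)
```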
In the special case of the nonparametric location model W = Λ[m(X) − ε] with ε ⊥ X, and Λ known and invertible, these µ̂r(x) estimators can be used to estimate an unknown m(x), since m(x) = µr(x) − E(ε) with r(w, x) = Λ−1(w). Chen and Randall (1997) and Crooker and Herriges (2004) consider this case. An (2000) considers the model where Λ is unknown but m and the distribution of ε are known; this also is a special case of our nonparametric model.

2.3. Semiparametric moments

Corollary 1 will be used in place of Theorem 1 to obtain faster convergence rates using a semiparametric model for W. To simplify the analysis we take κ = 0 (or equivalently, we absorb κ into the definition of Λ), so when applying these results one could first recenter (e.g., demean) V and adjust the definition of Λ accordingly.

Assumption A.4. The latent W satisfies W = Λ[m(X, θ0) − ε], where m and Λ are known functions, Λ is invertible and differentiable with derivative denoted Λ′, θ0 ∈ Θ is a vector of parameters, and ε is a disturbance that is distributed independently of V, X, with unknown, twice continuously differentiable CDF Fε(ε) and compact support [a0, a1] that contains zero. Define U = m(X, θ0) − Λ−1(V). Let Ψn(U | n) denote the empirical CDF of U at sample size n. sup_U |Ψn(U | n) − Ψ(U)| → 0 a.s., where Ψ(U) is a CDF that has an associated PDF ψ(U) that is continuous and strictly positive on the interval [a0, a1].
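Under Assumption A.4 the binary response depends on the data only through the scalar index U, since Y = 1(W > V) = 1(ε < U). The sketch below checks that a local average E(Y | U ≈ u) tracks Fε(u) in a hypothetical specification (Λ = exp, linear m, logistic ε); all of these choices are ours, for illustration only.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400_000

# Hypothetical model: W = Lambda(m(X, theta0) - eps), Lambda = exp, m(x, theta) = theta * x.
theta0 = 1.0
X = rng.uniform(3.0, 5.0, size=n)
eps = rng.logistic(0.0, 0.5, size=n)               # logistic disturbance, scale 0.5
W = np.exp(theta0 * X - eps)
V = rng.uniform(np.exp(1.0), np.exp(7.0), size=n)  # bids on the level of W
Y = (W > V).astype(float)

U = theta0 * X - np.log(V)                         # U = m(X, theta0) - Lambda^{-1}(V)
u0 = 0.3
sel = np.abs(U - u0) < 0.05
F_eps = 1.0 / (1.0 + np.exp(-u0 / 0.5))            # logistic CDF at u0
print("E(Y | U ~ u0):", Y[sel].mean(), " F_eps(u0):", F_eps)
```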
Define s*_r(x, u, y) and t*_r(x, u) by

s*_r(x, u, y) = r[Λ(m(x, θ0)), x] + r′[Λ(m(x, θ0) − u), x] Λ′(m(x, θ0) − u) [y − 1(u > 0)] / ψ(u),
t*_r(x, u) = r′[Λ(m(x, θ0) − u), x] Λ′(m(x, θ0) − u) [Fε(u) − 1(u > 0)] / ψ(u).

If Λ is the identity function, then W equals a parameterized function of x plus an additive independent error. If Λ is the exponential function, then it is ln(W) that is modeled with an additive error. Other examples include the Box–Cox transformation, Λ−1(W) = (W^λ − 1)/λ; the Zellner–Revankar transformation, Λ−1(W) = ln W + λW; and the arcsinh transformation, Λ−1(W) = sinh−1(λW)/λ, where in each case λ is a free parameter.
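The three named transformations are collected below purely for reference, with lam playing the role of the free parameter λ in each family.

```python
import numpy as np

lam = 0.5  # free parameter lambda in each family

def boxcox_inverse_transform(w):        # Lambda^{-1}(W) = (W^lam - 1) / lam
    return (w ** lam - 1.0) / lam

def zellner_revankar_transform(w):      # Lambda^{-1}(W) = ln W + lam * W
    return np.log(w) + lam * w

def arcsinh_transform(w):               # Lambda^{-1}(W) = asinh(lam * W) / lam
    return np.arcsinh(lam * w) / lam

w = np.array([0.5, 1.0, 2.0, 5.0])
for f in (boxcox_inverse_transform, zellner_revankar_transform, arcsinh_transform):
    print(f.__name__, f(w))
```

As lam → 0, the Box–Cox case approaches ln W and the arcsinh case approaches the identity, so the family nests the leading special cases mentioned above.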
µr (x) = r [Λ(m(x, θ0 )), x] +
a1
∫
tr∗ (x, u)Ψ (du),
a0
µr (x) = E [sr (x, U , Y ) | n] + ∗
a1
∫
tr∗ (x, u)[Ψ (du) − Ψ (du | n)]
a0
= lim E [s∗r (x, U , Y ) | n] n→∞
and, if Assumption A.3 also holds:
Ψn (u | n) = E (1 − H [Λ(m(X , θ0 ) − u) | X , n]) ψn (u) = E [h[Λ(m(X , θ0 ) − u) | X , n]Λ′ (m(X , θ0 ) − u) | n] → ψ(u), where expectation is with respect to H (v | x, n). Proof of Corollary 1. Recall that Y = I (W > V ) = I (ε < U ), so E (Y | U = u) = Fε (u). Starting from the definition of µr (x),
µr ( x ) =
a1
∫
0
r [Λ(m(x, θ0 ) − u), x]
= a0
a1
∫ +
dFε (u) du
r [Λ(m(x, θ0 ) − u), x]
0
du
d[Fε (u) − 1] du
du
and applying integration by parts to each of the above integrals yields
µr (x) = r [Λ(m(x, θ0 )), x] +
∫
a1
r ′ [Λ(m(x, θ0 ) − u), x]
a0
× Λ′ (m(x, θ0 ) − u)[Fε (u) − I (u > 0)]du ∫ a1 = r [Λ(m(x, θ0 )), x] + tr∗ (x, u)Ψ (du) = r [Λ(m(x, θ0 )), x] +
a0 a1
∫
tr∗ (x, u)Ψn (du | n)
a0
∫
a1
+
tr∗ (x, u)[Ψ (du) − Ψn (du | n)].
a0
Next, apply the law of iterated expectations to obtain E [s∗r (x, U , Y )] = r [Λ(m(x, θ0 )), x]
ψ(u)
= r [Λ(m(x, θ0 )), x] +
∫
a1
tr∗ (x, u)Ψn (du | n),
a0
a
which gives the expressions for µr (x), and a 1 tr∗ (x, u)[Ψ (du) − 0 Ψn (du | n)] →p 0 by the uniform convergence of Ψn . Note that Ψn (u | n) is the empirical probability that U ≤ u, which is the same event as V ≥ Λ(m(X , θ0 ) − u). Conditioning on X = x this probability would be 1 − Hn [Λ(m(x, θ0 ) − u) | x, n], and averaging over X gives Ψn (u | n) = E (1 − Hn [Λ(m(X , θ0 ) − u) | X , n]). This implies Ψ (u) = limn→∞ E (1 − Hn [Λ(m(X , θ0 ) − u) | X , n]), where the only role of the limit is to evaluate the expectation at the limiting distribution of X . Taking the derivative with respect to u gives ψ(u) = limn→∞ E (h[Λ(m(X , θ0 ) − u | X )Λ′ (m(X , θ0 ) − u)]). Consistency of ψn (u) then follows from the uniform convergence of the distribution of X to its limiting distribution in Assumption A.3. Now consider rate root n estimation of arbitrary conditional moments based on Corollary 1. It will be convenient to first consider the case where θ0 is known, implying that the conditional mean of W is known up to an arbitrary location (since ε is not required to have mean zero). A special case of known θ0 is when x is empty, i.e., estimation of unconditional moments of W , since in that case we can without loss of generality take m to equal zero. 2.3.1. Estimation with known θ Suppose that θ0 is known. Considering first the case where the limiting design density h(v|x) is also known, for a given u define u) by the sample average ψ( n − u) = 1 ψ( h[Λ(m(Xi , θ0 ) − u) | Xi ]Λ′ (m(Xi , θ0 ) − u).
n i =1
Then, based on Corollary 1, we have consistency of the estimator

µ̂*3r(x) = r[Λ(m(x, θ0)), x] + (1/n) Σ_{i=1}^{n} r′[Λ(m(x, θ0) − Ui), x] Λ′(m(x, θ0) − Ui)[Yi − 1(Ui > 0)] / ψ̂(Ui).

This estimator is computationally extremely simple, since it entails only sample averages. Special cases of this estimator were proposed by McFadden (1994) and by Lewbel (1997).

Let ψ̃(u) be an estimator of ψ(u) that does not depend on knowledge of h. For example, ψ̃(u) could be a (one dimensional) kernel density estimator of the density of U, based on the data Ui = m(Xi, θ0) − Λ−1(Vi) and evaluated at u. We then have the estimator

µ̂*4r(x) = r[Λ(m(x, θ0)), x] + (1/n) Σ_{i=1}^{n} r′[Λ(m(x, θ0) − Ui), x] Λ′(m(x, θ0) − Ui)[Yi − 1(Ui > 0)] / ψ̃(Ui),

which may be used when h is unknown.

2.3.2. Estimation of θ

First, consider estimation of θ. By Assumption A.4, E[Λ−1(W) | X = x] = α0 + m(x, θ0) for some arbitrary location constant α0. This constant is unknown since no location constraint is imposed upon ε. Let s_{Λ−1}(X, V, Y) denote sr(X, V, Y) with r(w, x) = Λ−1(w). It then follows from Theorem 1 that

lim_{n→∞} E[s_{Λ−1}(X, V, Y) | X = x] = lim_{n→∞} E(Λ−1(W) | X = x).

Note that the limit as n → ∞ means that the expectations are taken at the limiting distributions of the data. In other words, the asymptotic conditional expectation of the known or estimable quantity s_{Λ−1} is equal to α0 + m(x, θ0). Under some identification conditions this can be used for estimation of (α0, θ0). Specifically, we could estimate θ0 by minimizing the least squares criterion

(θ̂, α̂) = arg min_{θ,α} (1/n) Σ_{i=1}^{n} [s_{Λ−1}(Xi, Vi, Yi) − α − m(Xi, θ)]².  (8)
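A compact sketch of this least squares step, in the simplest case Λ = identity and m(x, θ) = θx with a known uniform bid density, so that s_{Λ−1} reduces to κ + [Y − 1(V < κ)]/h; the numerical design is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 400_000

# Hypothetical linear model with Lambda = identity: W = theta0 * X - eps.
theta0 = 40.0
X = rng.uniform(1.0, 3.0, size=n)
eps = rng.normal(-100.0, 25.0, size=n)     # no location restriction on eps
W = theta0 * X - eps
V = rng.uniform(0.0, 400.0, size=n)        # bids with known density h = 1/400
Y = (W > V).astype(float)

kappa, h = 200.0, 1.0 / 400.0
s = kappa + (Y - (V < kappa)) / h          # s_{Lambda^{-1}} when Lambda is the identity

# Eq. (8): least squares of s on (1, X) recovers (alpha0, theta0).
A = np.column_stack([np.ones(n), X])
alpha_hat, theta_hat = np.linalg.lstsq(A, s, rcond=None)[0]
print(alpha_hat, theta_hat)                # approximately (100, 40); alpha0 = -E(eps)
```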
If m is linear in parameters, then a closed form expression results for both parameter estimates. If h is not known, one could replace h(V | X) in the expression for s_{Λ−1}(X, V, Y) with an estimate ĥ(V | X). The resulting estimator would then take the form of a two step estimator with a nonparametric first step (the estimation of h). This estimator of θ and α is equivalent to the estimator for general binary choice models proposed by Lewbel (2000), though Lewbel provides other extensions, such as to estimation with endogenous regressors. With Assumption A.4, the latent error ε is independent of X, and therefore the binary choice estimator of Klein and Spady (1993) may provide a semiparametrically efficient estimator of θ.7

7 The Klein and Spady estimator does not identify a location constant α, but that is not required for this step, since no location constraint is imposed upon ε. Also, for the present application, the limiting distribution theory for Klein and Spady would need to be extended to allow for data generating processes that vary with the sample size.

2.3.3. Estimation with unknown θ

Let θ̂ denote a root-n consistent, asymptotically normal estimator for θ0. Replacing θ0 with any θ ∈ Θ, we may rewrite the estimators of the previous section as µ̂*λr(x; θ) for λ = 3 or 4. In doing so, note that θ appears both directly in the equations for µ̂*λr, and also in the definition of Ui = m(Xi, θ) − Λ−1(Vi). We later derive the root-n consistent, asymptotically normal limiting distribution for each estimator µ̂λr(x) = µ̂*λr(x; θ̂), where we suppress the dependence on θ̂ for simplicity. The estimators are not differentiable in Ui, which complicates the derivation of their limiting distributions; e.g., even with a fixed design, Theorem 6.1 of Newey and McFadden (1994) is not directly applicable due to this nondifferentiability.

3. Estimation details and distribution theory

In this section we provide more detail about the computation of the estimators µ̂0r(x), ..., µ̂4r(x) and their distribution theory. Earlier we focused on ordinary kernel regression based estimators for ease of exposition, but in this section we allow for more general classes of estimators.

3.1. Nonparametric estimators

There are many different nonparametric methods for estimating regression functions. For purely continuous variables with density bounded away from zero throughout their support, the local linear kernel method is attractive. This method has been extensively analyzed and has some positive properties, like being design adaptive and best linear minimax under standard conditions; see Fan and Gijbels (1996) for further discussion. One issue we are particularly concerned about is how to handle discrete variables. Specifically, some elements of X could be discrete, either ordered discrete or unordered discrete, while V can be ordered discrete. When there is a single discrete variable that takes only a small number of values, the pure frequency estimator is the natural and indeed optimal estimator to take in the absence of additional structure. In fact, one obtains parametric rates of convergence in the pure discrete case [and in the mixed discrete/continuous case the rate of consistency is unaffected by how many such discrete covariates there are]; see Delgado and Mora (1995) for discussion. When there are many discrete covariates, it may be desirable to use some 'discrete smoothing', as discussed in Li and Racine (2004); see also Wang and Van Ryzin (1981). Coppejans (2003) considers a case most similar to our own: he allows the distribution of the discrete data to change with sample size. One major difference is that his data arrive from a very specific grouping scheme that introduces an extra bias problem. We shall not outline all the possibilities for estimation here with regard to the covariates X; rather, we assume that X is continuously distributed with density bounded away from zero. However, the estimators we define can be applied in all of the above situations [although they may not be optimal], and the estimators are still asymptotically normal, with the rate determined by the number of continuous variables. We will pay more attention to the potential discreteness in V, since this is key to our estimation problem. For clarity we will avoid excessive subscripts/superscripts.

We suppose that V is asymptotically continuous in the sense that for each n, Vi is drawn from a distribution H(v | Xi, n) that has finite support, increasing with n. The case where Vi is drawn from a continuous distribution H(v | Xi) for all n is really a special case of our set-up. Under our conditions there is a bias in the estimates of µr(x) of order J^{−1} in this discrete case. Therefore, for this term not to matter in the limiting distribution we require that δn J^{−1} → 0, where δn is the rate of convergence of the estimator in question [δn = √n in the parametric case but δn = √(nb^d) for some bandwidth b in the nonparametric cases]. In the nonparametric case, the spacing of the discrete covariates is closer than the bandwidth of a standard kernel estimator; that is, we know that b²J → ∞, so that J^{−1} is much smaller than the smoothing window of a kernel estimator. Therefore, the pure frequency estimator is dominated by a smoothing estimator, and we shall just construct smoothing-based estimators.

The estimator µ̂1r(x) involves smoothing the data

sr(Zi) = r[κ(Xi), Xi] + r′(Vi, Xi)[Yi − 1(Vi < κ(Xi))] / h(Vi | Xi)

against Xi, where Zi = (Vi, Xi, Yi). Define the (p − 1)-th order local polynomial regression of sr(Zi) on Xi by minimizing
Q^s_{p−1,n}(ϑ) = (1/n) Σ_{i=1}^{n} Kb(Xi − x) [ sr(Zi) − Σ_{0≤|j|≤p−1} ϑj (Xi − x)^j ]²  (9)
with respect to the vector ϑ containing all the ϑj, where Kb(t) = Π_{j=1}^{d} kb(tj) with kb(u) = k(u/b)/b, where k is a univariate kernel function and b = b(n) is a bandwidth. Here we are using multidimensional index notation: for vectors j = (j1, ..., jd)⊤ and a = (a1, ..., ad)⊤, j! = j1! × ··· × jd!, |j| = Σ_{k=1}^{d} jk, a^j = a1^{j1} × ··· × ad^{jd}, and Σ_{0≤|j|≤p−1} denotes the sum over all j with 0 ≤ |j| ≤ p − 1.
Let ϑ̂0 denote the first element of the vector ϑ̂ that minimizes (9). Then let

µ̂1r(x) = ϑ̂0.  (10)
This estimator is linear in the dependent variable and has an explicit form.
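A minimal local linear version of (9)–(10) for scalar x (p = 2, Epanechnikov kernel): at each evaluation point the criterion is a weighted least squares problem whose intercept is the fitted value. The target function and bandwidth below are illustrative.

```python
import numpy as np

def local_linear(x0, X, S, b):
    """Intercept of the minimizer of Eq. (9) with p = 2 and scalar X:
    weighted least squares of S on (1, X - x0) with kernel weights."""
    t = (X - x0) / b
    w = np.where(np.abs(t) < 1.0, 0.75 * (1.0 - t ** 2), 0.0)  # Epanechnikov kernel
    A = np.column_stack([np.ones_like(X), X - x0])
    Aw = A * w[:, None]
    beta = np.linalg.solve(A.T @ Aw, Aw.T @ S)
    return float(beta[0])

rng = np.random.default_rng(5)
n = 20_000
X = rng.uniform(0.0, 1.0, size=n)
S = np.sin(2.0 * np.pi * X) + rng.normal(0.0, 0.3, size=n)
print(local_linear(0.25, X, S, b=0.05))    # approximately sin(pi / 2) = 1
```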
In computing the estimator µ̂2r(x) we require an estimator of G(v | x), which is given by the smoothing of Yi on (Vi, Xi). Let X̃i = (Vi, Xi⊤)⊤ and x̃ = (v, x⊤)⊤, and define the (p − 1)-th order local polynomial regression of Yi on X̃i by minimizing

Q^Y_{p−1,n}(ϑ) = (1/n) Σ_{i=1}^{n} K̃b(X̃i − x̃) [ Yi − Σ_{0≤|j|≤p−1} ϑj (X̃i − x̃)^j ]²,  (11)

where K̃b(X̃i − x̃) = kb(Vi − v) Kb(Xi − x). Let ϑ̂0 denote the first element of the vector ϑ̂ that minimizes (11), and let Ĝ(v | x) = ϑ̂0. Then define
µ̂2r(x) = r(κ(x), x) + ∫_{a0}^{a1} r′(v, x)[Ĝ(v | x) − 1(v < κ(x))] dv,  (12)

where the univariate integral is interpreted in the Lebesgue–Stieltjes sense (actually, under our conditions Ĝ(v | x) is a continuous function and 1(v < κ(x)) is a simple step function). Finally, to compute µ̂0r(x) we use one higher order of polynomial, i.e., minimize Q^Y_{p,n}(ϑ) with respect to ϑ, and let ∂Ĝ(v | x)/∂v = ϑ̂v, where ϑ̂v is the second element of the vector ϑ̂. Then define
µ̂0r(x) = −∫_{a0}^{a1} r(v, x) (∂Ĝ(v | x)/∂v) dv.  (13)

The estimator (12) is in the class of marginal integration/partial mean estimators sometimes used for estimating additive nonparametric regression models, see Linton and Nielsen (1995), except that the integrand is not just a regression function and the integrating measure λ, where (asymptotically) dλ(v) = r′(v, x) 1(ρ0(x) ≤ v ≤ ρ1(x)) dv, is not necessarily a probability measure, i.e., it may not be positive or integrate to one. The distribution theory for the class of marginal integration estimators has already been worked out for a number of specific smoothing methods when the covariate distribution is absolutely continuous, see the above references. We make the following assumptions.

Assumption B.1. k is a symmetric probability density with bounded support, and is continuously differentiable on its support.

Assumption B.2. The random variables (V, X) are asymptotically continuously distributed, i.e., for some finite constant c_h,

sup_{v∈[ρ0(x),ρ1(x)]} |H(v | x, n) − H(v | x)| ≤ c_h / J,  (14)

where H(v, x) possesses a Lebesgue density h(v, x) along with conditionals h(v | x) and marginal h(x). Furthermore, inf_{ρ0(x)≤v≤ρ1(x)} h(v, x) > 0. The conditional variance of the limiting continuous distribution is equal to the limiting conditional variance σ²(v, x) = lim_{n→∞} var(Y | V = v, X = x) = G(v | x)[1 − G(v | x)]. Furthermore, G(v | x) and h(v, x) are p-times continuously differentiable for all v with ρ0(x) ≤ v ≤ ρ1(x), letting g(v | x) = −∂G(v | x)/∂v denote the conditional density of W | X. The set [ρ0(x), ρ1(x)] × {x} is strictly contained in the support of (V, X) for large enough n.

The condition (14) is satisfied provided the associated frequency function h(v | x, n) satisfies min_{v∈Jn} h(v | x, n) ≥ v̲/Jn and max_{v∈Jn} h(v | x, n) ≤ v̄/Jn for some bounds v̲ > 0 and v̄ < ∞, and provided the support Jn becomes dense in [ρ0(x), ρ1(x)]. The other conditions are standard regularity conditions for nonparametric estimation.

For a function f: R^s → R, arrange the elements of its partial derivatives ∂^p f(t)/∂t1^{π1} ··· ∂ts^{πs} (for all vectors π = (π1, ..., πs) such that Σ_{j=1}^{s} πj = p) as a large column vector f^{(p;s)}(t) of dimension (s + p − 1)!/s!(p − 1)!. Let a_{d;p}(k), a_{d+1;p}(k) and a*_{d+1;p+1}(k) be conformable vectors of constants depending only on the kernel k, and let c_{d;p}(k), c̄_{d+1;p}(k) and c*_{d+1;p+1}(k) also be constants depending only on the kernels. Define

β0(x) = a*_{d+1;p+1}(k)⊤ ∫_{ρ0(x)}^{ρ1(x)} r(v, x) G^{(p+1;d+1)}(v | x) dv;
β1(x) = a_{d;p}(k)⊤ µ_r^{(p;d)}(x);
β2(x) = a_{d+1;p}(k)⊤ ∫_{ρ0(x)}^{ρ1(x)} r′(v, x) G^{(p;d+1)}(v | x) dv;
ω0(x) = c*_{d+1;p+1}(k) ∫_{ρ0(x)}^{ρ1(x)} σ²(v, x) [ (r′(v, x) h(v, x) − r(v, x) h′(v, x)) / h²(v, x) ]² h(v, x) dv;
ω1(x) = c_{d;p}(k) var[sr(Z) | X = x] / h(x);
ω2(x) = c̄_{d+1;p}(k) ∫_{ρ0(x)}^{ρ1(x)} σ²(v, x) [ r′(v, x) / h(v, x) ]² h(v, x) dv.

Theorem 2. Suppose that Assumptions A.1–A.3, B.1 and B.2 hold and that the bandwidth sequence b = b(n) satisfies b → 0, nb^{d+2}/log n → ∞, and Jb² → ∞. Then, for j = 1, 2,

√(nb^d) [ µ̂jr(x) − µr(x) − b^p βj(x) ] ⇒ N(0, ωj(x)).

If G is (p + 1)-times continuously differentiable, then

√(nb^d) [ µ̂0r(x) − µr(x) − b^p β0(x) ] ⇒ N(0, ω0(x)).

Remarks. 1. In the local linear case (p = 2) the kernel constants in ω1(x) and ω2(x) are identical, and equal to ‖K‖²₂ = ∫ K(u)² du. A simple argument then shows that ω1(x) ≥ ω2(x). By the law of iterated expectations,

var[sr(Z) | X = x] = E[var[sr(Z) | V, X] | X = x] + var[E[sr(Z) | V, X] | X = x].

Furthermore,

E[var[sr(Z) | V, X] | X = x] = ∫_{ρ0(x)}^{ρ1(x)} h(v | x) [ r′(v, x)/h(v | x) ]² σ²(v, x) dv = h(x) ∫_{ρ0(x)}^{ρ1(x)} [ r′(v, x)/h(v, x) ]² σ²(v, x) h(v, x) dv.

It follows that ω1(x) ≥ ω2(x). In the special case that h′(v, x) = 0, which would be true if V were uniformly distributed, ω0(x) is the same as ω2(x) apart from the kernel constants.

2. Regarding the biases, in the special case of local linear estimation, and supposing that g(v | x) has two continuous derivatives and the support of V | X does not depend on X, we have:

β1(x) ∝ Σ_{j=1}^{d} ∫_{ρ0}^{ρ1} ∂²{r(v, x) g(v | x)}/∂x²_j dv,
β2(x) ∝ ∫_{ρ0}^{ρ1} Σ_{j=1}^{d} [ ∂²G(v | x)/∂v² + ∂²G(v | x)/∂x²_j ] r′(v, x) dv.

Applying integration by parts shows that these two biases are sometimes the same, depending upon boundary conditions.
3. If r(v, x) is a vector of functions, then the results are as above with the square operation replaced by the outer product of the corresponding vectors. Suppose one wants to estimate var(W|X = x) = µ_{W²}(x) − µ_W(x)², a nonlinear function of the vector (E(W²|X = x), E(W|X = x)). In this case, one obtains the asymptotic distribution by the delta method applied to the joint limiting behaviour of the estimators of µ_{W²}(x) and µ_W(x).

4. Standard errors can be constructed by plugging estimators of the unknown quantities into the asymptotic distributions. For the estimator µ1(x), standard formulae can be applied as given in Härdle and Linton (1994) and more recently reviewed in Linton and Park (2009). For the integration based estimators µ0(x) and µ2(x), one can apply the methods described in Sperlich et al. (1999). For example, ω2(x) just requires consistent estimation of σ²(v, x) and h(v, x), a regression function and a density function, and this follows from our proofs below. Note that estimation of the bias term is hard in all cases. If desired, one could as usual remove the asymptotic bias by using an undersmoothed bandwidth.

5. We have shown that generally µ2r(x) has a smaller mean squared error than µ1r(x). However, other comparisons between the estimators are also relevant. For example, the estimator µ1r(x) requires prior knowledge of h(v|x). On the other hand, µ1r(x) also uses a lower dimensional smoothing operation than µ2r(x), which may be important in small samples. A further advantage of the estimator µ1r(x) is that it takes the form of a standard nonparametric regression estimator, so known regression bandwidth selection methods can be applied automatically. Sperlich et al. (1999) present a comprehensive study of the finite sample performance of marginal integration estimators and discuss bandwidth selection.

6. The term κ(x) drops out of these limiting distributions, showing that the choice of κ(x) is asymptotically irrelevant. For simplicity, κ(x) can be taken to be some convenient value in the range of the v data, such as its mean.

3.2. Semiparametric estimators

In this section we assume that the conditions of Assumption A4 prevail. In this case, discreteness of Vi is less of an issue: even if Vi is discrete, if there are continuous variables in Xi, then Ui = m(Xi, θ0) − Λ⁻¹(Vi) can be continuously distributed. For simplicity we therefore assume a fixed design for our limiting distribution calculations. Similar asymptotics will result when the assumption that Vi is continuously distributed is replaced by an assumption like Eq. (14). Let θ̂ be some consistent estimator of θ0. Define:
\[
\hat\mu_{3r}(x) = r[\Lambda(m(x,\hat\theta)),x] + \frac{1}{n}\sum_{i=1}^{n}\frac{r'[\Lambda(m(x,\hat\theta)-\hat U_i),x]\,\Lambda'(m(x,\hat\theta)-\hat U_i)\,[Y_i - 1(\hat U_i > 0)]}{\hat\psi(\hat U_i)},
\]
\[
\hat\mu_{4r}(x) = r[\Lambda(m(x,\hat\theta)),x] + \frac{1}{n}\sum_{i=1}^{n}\frac{r'[\Lambda(m(x,\hat\theta)-\hat U_i),x]\,\Lambda'(m(x,\hat\theta)-\hat U_i)\,[Y_i - 1(\hat U_i > 0)]}{\tilde\psi(\hat U_i)},
\]
where $\hat U_i = m(X_i,\hat\theta) - \Lambda^{-1}(V_i)$ and
\[
\hat\psi(\hat U_i) = \frac{1}{n}\sum_{j=1}^{n} h[\Lambda(m(X_j,\hat\theta)-\hat U_i)\,|\,X_j]\,\Lambda'(m(X_j,\hat\theta)-\hat U_i); \qquad
\tilde\psi(\hat U_i) = \frac{1}{nb}\sum_{j=1}^{n} k\!\left(\frac{\hat U_i-\hat U_j}{b}\right).
\]
Define also the estimators µ*3r(x) and µ*4r(x) as the special cases of µ3r(x) and µ4r(x) in which θ0 is known, in which case Ûi is replaced by Ui.
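The following Python sketch illustrates how µ4r(x) can be computed from the definitions above. It is a minimal sketch only: the function names are ours, the Gaussian kernel is one illustrative choice for k, scalar-valued r and a one-dimensional covariate are assumed, and no claim is made that this matches the authors' own implementation.

```python
import numpy as np

def gauss_k(t):
    """Standard Gaussian kernel (illustrative choice for k)."""
    return np.exp(-0.5 * t**2) / np.sqrt(2.0 * np.pi)

def mu4r_hat(x, V, X, Y, theta, m, Lam, Lam_prime, Lam_inv, r, r_prime, b):
    """Plug-in estimate of mu_{4r}(x) following the displays above.

    m, Lam, Lam_prime, Lam_inv, r, r_prime are user-supplied callables
    (all names are hypothetical); b is the density bandwidth.
    """
    U = m(X, theta) - Lam_inv(V)                       # hat U_i
    # tilde psi(hat U_i): one-dimensional kernel density estimate of the U's
    psi = gauss_k((U[:, None] - U[None, :]) / b).mean(axis=1) / b
    mx = m(x, theta)                                   # m(x, hat theta)
    z = mx - U
    num = r_prime(Lam(z), x) * Lam_prime(z) * (Y - (U > 0))
    return r(Lam(mx), x) + np.mean(num / psi)
```

Per Remark 3, std(W|X = x) could then be obtained by running this with r(w, x) = (w², w) componentwise and taking the square root of the estimated variance.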
We next state the asymptotic properties of the conditional moment estimators based on Corollary 1. We need some conditions on the estimator θ̂ and on the regression functions and densities.

Assumption C.1. Suppose that
\[
\sqrt{n}(\hat\theta - \theta_0) = \frac{1}{\sqrt n}\sum_{i=1}^{n}\varsigma(Z_i,\theta_0) + o_p(1)
\]
for some function ς such that E[ς(Zi, θ0)] = 0 and Ω = E[ς(Zi, θ0)ς(Zi, θ0)⊤] < ∞. Suppose also that θ0 is an interior point of the parameter space.

Assumption C.2. The function m is twice continuously differentiable in θ and, for any δn → 0,
\[
\sup_{\|\theta-\theta_0\|\le\delta_n}\left\|\frac{\partial m}{\partial\theta}(x,\theta)\right\| \le d_1(x); \qquad
\sup_{\|\theta-\theta_0\|\le\delta_n}\left\|\frac{\partial^2 m}{\partial\theta\,\partial\theta^{\top}}(x,\theta)\right\| \le d_2(x),
\]
with $E\,d_1^r(X_i) < \infty$ and $E\,d_2^r(X_i) < \infty$ for some r > 2.

Assumption C.3. The density function h is continuous, strictly positive on its compact support, and twice continuously differentiable. The transformation Λ is three times continuously differentiable.

Assumption C.4. The kernel k is twice continuously differentiable on its support, so that sup_t |k″(t)| < ∞. The bandwidth b satisfies b → 0 and nb⁶ → ∞.

The regularity conditions are quite standard. Assumption C.4 is used for µ4r(x), which is based on a one-dimensional kernel density estimator. For each θ ∈ Θ and x ∈ X, define the stochastic processes:
\[
f_0(Z_i,\theta) = \frac{r'[\Lambda(m(x,\theta)-U_i(\theta)),x]\,\Lambda'(m(x,\theta)-U_i(\theta))\,[Y_i - 1(U_i(\theta)>0)]}{\psi(U_i)},
\]
\[
f_1(Z_i,\theta) = r[\Lambda(m(x,\theta)),x] + \frac{r'[\Lambda(m(x,\theta)-U_i(\theta)),x]\,\Lambda'(m(x,\theta)-U_i(\theta))\,[Y_i - 1(U_i(\theta)>0)]}{\psi(U_i)},
\]
where $U_i(\theta) = m(X_i,\theta) - \Lambda^{-1}(V_i)$. Then
\[
\Gamma_F = \frac{\partial E[f_1(Z_i,\theta)]}{\partial\theta}\bigg|_{\theta=\theta_0}, \qquad
\Psi_F = -E\left[f_0(Z_i,\theta_0)\frac{\psi'(U_i)}{\psi(U_i)}\gamma_i\right] + E\left[\frac{f_0(Z_i,\theta_0)}{\psi(U_i)}\tilde\zeta_{ij}\gamma_j\right],
\]
\[
\gamma_i = \frac{\partial m}{\partial\theta^{\top}}(X_i,\theta_0) - E\left[\frac{\partial m}{\partial\theta^{\top}}(X_i,\theta_0)\right],
\]
and $\zeta_{ij} = \{h'(\Lambda|X_j)(\Lambda')^2 + h(\Lambda|X_j)\Lambda''\}(m(X_j,\theta_0)-U_i)$, where $\tilde\zeta_{ij} = \zeta_{ij} - E_i\zeta_{ij}$.
The above quantities may depend on x but we have suppressed this notationally. Note also that Ef1(Zi, θ0) = µr(x).

Theorem 3. Suppose that Assumptions A.1–A.4 and C.1–C.3 hold. Then, as n → ∞,
\[
\sqrt{n}\,[\hat\mu_{3r}(x) - \mu_r(x)] \Longrightarrow N(0,\sigma_\eta^2(x)), \tag{15}
\]
where 0 < σ²η(x) = var(ηj) < ∞ with ηj = η1j + η2j + η3j, where:
\[
\eta_{1j} = f_0(Z_j,\theta_0) - Ef_0(Z_j,\theta_0), \qquad \eta_{2j} = (\Gamma_F - \Psi_F)\,\varsigma(Z_j;\theta_0),
\]
\[
\eta_{3j} = -E\left[ f_0(Z_i,\theta_0)\times\frac{h[\Lambda(m(X_j,\theta_0)-U_i)|X_j]\,\Lambda'(m(X_j,\theta_0)-U_i) - \psi(U_i)}{\psi(U_i)}\,\bigg|\, X_j \right].
\]

The three terms η1j, η2j, and η3j are all mean zero and have finite variance. They are generally mutually correlated. When θ0 is known, the term η2j = 0 and is missing from the asymptotic expansion. The term η3j is due to the estimation of ψ even when θ0 is known.

We next give the distribution theory for the semiparametric estimator µ4r(x). Let
\[
\Psi_F^{*} = E\left[\frac{\psi'(U_i)}{\psi(U_i)}\,\{f_0(Z_i,\theta_0) - E[f_0(Z_i,\theta_0)|U_i]\}\,\gamma_i^{*}\right] - E\big[E[f_0(Z_i,\theta_0)|U_i]\,m'_{\theta_0}(U_i)\big],
\]
\[
m_{\theta_0}(U_i) = E\left[\frac{\partial m}{\partial\theta^{\top}}(X_i,\theta_0)\,\bigg|\,U_i\right], \qquad
\gamma_i^{*} = \frac{\partial m}{\partial\theta^{\top}}(X_i,\theta_0) - E\left[\frac{\partial m}{\partial\theta^{\top}}(X_i,\theta_0)\,\bigg|\,U_i\right].
\]
Theorem 4. Suppose that Assumptions A.1–A.4, B.1, B.2 and C.1–C.4 hold. Then
\[
\sqrt{n}\,[\hat\mu_{4r}(x) - \mu_r(x)] \Longrightarrow N(0,\sigma_\eta^{*2}(x)),
\]
where 0 < σ*²η(x) = var(η*j) < ∞, with η*j = η*1j + η*2j + η*3j, where η*1j = η1j, while
\[
\eta_{2j}^{*} = (\Gamma_F - \Psi_F^{*})\,\varsigma(Z_j;\theta_0), \qquad
\eta_{3j}^{*} = -\big(E[f_0(Z_i,\theta_0)|U_j] - E[E[f_0(Z_i,\theta_0)|U_i]]\big).
\]
The three terms η*1j, η*2j, and η*3j are all mean zero and have finite variance. They are generally correlated. When θ0 is known, the term η*2j = 0 and is missing from the asymptotic expansion. The term η*3j is due to the estimation of ψ.
Remarks. 1. Consistent standard errors can be constructed by substituting estimated quantities for the population ones, along the lines discussed in Newey and McFadden (1994) for finite dimensional parameters. An alternative approach to inference here is based on the bootstrap. In our case, a standard i.i.d. resample from the data set can be shown to work for the nonparametric and semiparametric cases, even under our discrete/asymptotically continuous design, at least as far as approximating the asymptotic variance is concerned (see, e.g., Horowitz (2001) and Mammen (1992) for the nonparametric case and Chen et al. (2003) for the semiparametric case). We have taken this approach to inference in the application because of its simplicity, as in the sketch below.

2. Regarding the semiparametric estimators, it is not possible to rank the efficiency of the two estimators µ3r(x) and µ4r(x) uniformly throughout the 'parameter space'. The ranking partly depends on the choice of θ̂. It may be possible to develop an efficiency bound for estimation of the function µr(·) by following the calculations of Bickel et al. (1993, Chapter 5). Since there are no additional restrictions on µr, the plug-in estimator with an efficient θ̂ should be efficient; see, e.g., Brown and Newey (1998).
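As a concrete illustration of the i.i.d. resampling scheme mentioned in Remark 1, the following Python sketch computes a percentile bootstrap confidence interval for a scalar functional. All names are ours, and `estimator` stands in for any of the estimators discussed above; this is an illustrative helper, not the authors' code.

```python
import numpy as np

def bootstrap_ci(estimator, data, level=0.95, B=999, seed=None):
    """Percentile bootstrap CI for a scalar statistic.

    `estimator` maps a data array (rows = observations (Y_i, V_i, X_i))
    to a scalar; `data` stacks the observations row-wise.
    """
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    stats = np.empty(B)
    for bb in range(B):
        idx = rng.integers(0, n, size=n)        # i.i.d. resample of rows
        stats[bb] = estimator(data[idx])
    alpha = 1.0 - level
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])
```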
4. Numerical results

4.1. Monte Carlo

We report the results of a small simulation experiment based on a design of Crooker and Herriges (2004). Let
\[
W_i = \beta_1 + \beta_2 X_i + \sigma\varepsilon_i,
\]
where Xi is uniformly distributed on [−30, 30] and εi is standard normal. We take β1 = 100 and β2 = 2, which guarantees that the mean WTP is equal to 100. We vary the value of σ ∈ {5, 10, 25, 50} and the sample size n ∈ {100, 300, 500}. For our first set of experiments the bid values are five points in [25, 175] if n = 100, ten points if n = 300, and fifteen points if n = 500; these points are randomly assigned to individuals i before drawing the other data and so are fixed in repeated experiments. We take κ = 100. This design was chosen because it permits direct comparison with the parametric and SNP estimators of WTP considered by Crooker and Herriges (2004), at least when n = 100 (they did not increase the number of bids with sample size). In this case G(v|x) = 1 − Φ((v − β1 − β2x)/σ) and g(v|x) = φ((v − β1 − β2x)/σ)/σ, where Φ and φ denote the standard normal c.d.f. and density, respectively. We estimate the moments E[W|X = x], i.e., r(w, x) = w, and std(W|X = x) = \(\sqrt{E[W^2|X=x] - E^2[W|X=x]}\), which corresponds to taking r(w, x) = (w², w) and then computing the square root of \(\hat r_{w^2} - \hat r_w^2\). Then: µw(x) = β1 + β2x, µ_{w²}(x) = (β1 + β2x)² + σ², and std(W|X = x) = σ.

We compute the estimators µλ(·) for λ = 1, 2, 3, 4. We used a local linear estimator with a product Gaussian kernel and Silverman's rule of thumb bandwidths, that is, b = 1.06 s n^{−1/5}, where s is the sample standard deviation of the specific covariate. The kernel and bandwidth are not likely to be optimal choices for this problem, but they are automatic and convenient, and hence are fairly widely used in practice. We take h(v|x) to be uniform over the range of bid values. In this design, the estimator µ1(x) is predicted to be approximately unbiased, while the predicted bias of µ2(x) is small but nonzero.

In Tables 3 and 4 we report four different performance measures: root pointwise mean squared error (RPMSE), pointwise mean absolute error (PMAE), root integrated mean squared error (RIMSE), and integrated mean absolute error (IMAE). Crooker and Herriges (2004) only report pointwise results. Like Crooker and Herriges, our pointwise results are calculated at the central point x = 0. Thus, their Table 2(a) (n = 100) and Appendix Table 1(a) (n = 300) are directly comparable with a subset of our results. Our conclusions are: (A1) The performance of our estimators improves as σ decreases and as the sample size increases according to all measures: the pointwise measures improve at approximately our theoretical asymptotic rate, while the integrated measures improve much more slowly; the semiparametric estimators improve more rapidly with sample size. (A2) For the larger samples, estimator µ4 performs best according to nearly all measures, although for large σ the difference between µ4 and some other estimators is minimal. For smaller sample sizes the ranking is a bit more variable: only µ3 is never ranked first. (A4) Our best estimators always perform better than the Crooker and Herriges SNP estimator. (A5) The estimates of std(W|X = x) are subject to much more variability and bias than the estimates of E[W|X = x], particularly in the large σ case.
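To make the design concrete, here is a minimal Python sketch of one draw from the discrete-bid design and of the rule-of-thumb bandwidth. The function names are ours; for brevity the bid assignment is redrawn each call rather than held fixed across replications as in the paper, and Yi = 1(Wi > Vi), consistent with G(v|x) = P(W > v|x) above.

```python
import numpy as np

def simulate_design(n, sigma, beta1=100.0, beta2=2.0, n_bids=5, seed=None):
    """One draw of (Y_i, V_i, X_i) from the discrete-bid design above."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-30.0, 30.0, size=n)
    W = beta1 + beta2 * X + sigma * rng.standard_normal(n)  # latent WTP
    bids = np.linspace(25.0, 175.0, n_bids)                 # bid support
    V = rng.choice(bids, size=n)                            # assigned bids
    Y = (W > V).astype(int)                                 # observed responses
    return Y, V, X

def silverman_bandwidth(Z):
    """Rule-of-thumb bandwidth b = 1.06 s n^(-1/5) used in the simulations."""
    Z = np.asarray(Z, dtype=float)
    return 1.06 * Z.std(ddof=1) * Z.size ** (-0.2)
```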
Table 3
Estimation of conditional mean in discrete bid design; 500 replications.

                σ = 5                     σ = 10                    σ = 50
           n=100  n=300  n=500       n=100  n=300  n=500       n=100  n=300  n=500
RPMSE  µ1   4.04   3.86   2.39       4.20   4.54   3.10       10.54   7.11   5.88
       µ2   4.52   3.66   2.16       4.49   4.30   2.69        9.78   6.37   5.02
       µ3   5.65   2.20   1.64       5.64   2.69   1.97       10.69   4.60   3.64
       µ4   4.80   1.77   1.25       4.75   2.15   1.59        8.42   4.53   3.32
       µ6   4.77   3.89   2.07       5.03   3.69   2.05        8.37   4.32   3.28
PMAE   µ1   3.20   3.20   1.89       3.32   3.70   2.47        8.45   5.69   4.61
       µ2   3.59   3.07   1.73       3.60   3.57   2.13        7.86   5.09   3.92
       µ3   4.50   1.77   1.31       4.54   2.12   1.59        8.65   3.67   2.91
       µ4   3.75   1.44   0.99       3.76   1.73   1.28        6.74   3.59   2.62
       µ6   3.82   3.29   1.63       4.06   3.07   1.60        6.98   3.54   2.67
RIMSE  µ1  14.39   7.89   5.81      14.29   7.88   5.76       17.29  11.72   9.59
       µ2  12.85   5.29   2.50      13.22   5.60   3.06       16.60  10.90   8.94
       µ3  12.28   4.76   3.35      12.21   5.16   3.46       16.12   9.83   7.81
       µ4  11.90   4.58   3.18      11.80   4.90   3.27       14.65   9.79   7.66
       µ6  11.83   5.72   3.58      11.80   5.75   3.50       14.62   9.67   7.65
IMAE   µ1  10.88   5.71   4.10      10.77   5.90   4.17       13.41   8.98   7.33
       µ2  10.54   4.17   1.92      10.76   4.47   2.34       13.05   8.56   6.94
       µ3   9.44   3.59   2.49       9.52   3.88   2.62       12.71   7.68   6.13
       µ4   9.12   3.36   2.33       9.16   3.59   2.45       11.53   7.69   5.98
       µ6   9.32   4.50   2.73       9.39   4.47   2.65       11.74   7.64   6.04
Table 4
Estimation of conditional standard deviation in discrete bid design; 500 replications.

                σ = 5                     σ = 10                    σ = 50
           n=100  n=300  n=500       n=100  n=300  n=500       n=100  n=300  n=500
RPMSE  µ1   8.95   7.43   7.12       7.40   5.03   5.35        9.89   8.17   7.32
       µ2  16.39  14.51  12.89      13.00  10.93   9.97        8.73   7.29   6.64
       µ3   6.75   2.65   2.08       5.43   2.53   2.52        7.68   5.29   5.12
       µ4   6.60   2.13   1.39       5.15   1.98   1.49        8.27   5.33   4.87
       µ6  27.26  13.86  15.33      24.38  14.16  13.71       19.92  15.00  10.80
PMAE   µ1   8.02   7.03   6.95       6.29   4.36   4.91        7.86   6.86   6.34
       µ2  16.20  14.43  12.85      12.69  10.77   9.86        7.04   6.10   5.69
       µ3   5.64   2.04   1.60       4.51   1.97   2.08        6.11   4.45   4.60
       µ4   5.41   1.59   1.05       4.19   1.54   1.18        6.60   4.42   4.09
       µ6  23.28  10.77  12.42      21.01  12.25  12.08       14.67  10.98   8.19
RIMSE  µ1  17.56  11.35   8.11      14.23   9.73   7.68       13.02  13.08  13.61
       µ2  22.07  14.37   9.93      18.01  11.22   7.69       10.82  11.03  11.78
       µ3   6.75   2.65   2.08       5.43   2.53   2.52        7.68   5.29   5.12
       µ4   6.60   2.13   1.39       5.15   1.98   1.49        8.27   5.33   4.87
       µ6  21.78  14.62  11.68      19.31  13.37  11.36       19.93  17.30  17.30
IMAE   µ1  15.98   9.62   6.99      12.74   8.23   6.55        9.75  10.18  10.96
       µ2  21.59  13.37   9.19      17.40  10.24   7.03        8.43   8.94   9.92
       µ3   5.64   2.04   1.60       4.51   1.97   2.08        6.11   4.45   4.60
       µ4   5.41   1.59   1.05       4.19   1.54   1.18        6.60   4.42   4.09
       µ6  17.95  11.24   9.09      16.41  11.33   9.84       14.24  12.64  12.72
While our estimators seem to work reasonably well in this discrete bid case, we would expect to obtain better results when the bid distribution is actually continuous and with full support, like W. We repeated the above experiments with the bid distribution uniform on [25, 175] and report the results in Tables 5 and 6. Our conclusions are: (B1) The performance in the continuous design is somewhat better than in the discrete design. For some designs the pointwise results in Table 3 are better, but the integrated results are always better in Table 5. Note that for the pointwise results the chosen point of evaluation x = 0 corresponds to E[W|X = 0] = 100, and in Table 3 there is a point mass in the distribution of the bids at this point. (B2) The results for standard deviation estimation are in most cases better in Table 6 than in Table 4. (B3) The ranking of the estimators is the same in Table 5 as in Table 3. Once again µ4 performs best in large samples. We also computed µ0, but its performance was considerably worse than that of µ1 and µ2, especially in the discrete design case. This may be as expected, since with data as discrete as this, derivative estimates can be nonsensical.

We also considered a design with a heavier tailed, asymmetric error, specifically εi ∼ (χ² − 1)/√2. The performance results corresponding to Tables 3–6 were very similar in terms of both the level of performance and the comparison across estimators, being slightly worse everywhere, so to save space these results are not presented here. Instead, in Fig. 1 we report a QQ plot comparing the finite sample distribution of the standardized (studentized) estimator µ1(0)/se(µ1(0)) with the predicted normal distribution.⁸ Based on this plot the distributional approximation appears quite good.
⁸ The studentization is done with the estimated variance $\sum_j w_{ij}^2\hat\varepsilon_j^2$, where $w_{ij}$ are the local linear smoothing weights and $\hat\varepsilon_j$ are the nonparametric residuals.
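For concreteness, the smoothing weights wij in this variance formula can be read off as a row of the local linear "hat matrix". A Python sketch, with a Gaussian kernel, a scalar covariate, and our own (hypothetical) function name:

```python
import numpy as np

def local_linear_weights(x0, X, b):
    """Row of the local linear smoothing weights at x0 (Gaussian kernel).

    The weights w_j satisfy mu1_hat(x0) = sum_j w_j Y_j.  Sketch only.
    """
    K = np.exp(-0.5 * ((X - x0) / b) ** 2)
    Z = np.column_stack([np.ones_like(X), X - x0])   # local linear design
    W = Z * K[:, None]
    M = Z.T @ W                                      # 2 x 2 weighted moments
    e1 = np.array([1.0, 0.0])
    return e1 @ np.linalg.solve(M, W.T)              # weights w_j, j = 1..n

# se^2 at x0 = 0: (local_linear_weights(0.0, X, b) ** 2 * eps_hat ** 2).sum()
```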
Table 5
Estimation of conditional mean in continuous design; 10,000 replications.

                σ = 5                     σ = 10                    σ = 50
           n=100  n=300  n=500       n=100  n=300  n=500       n=100  n=300  n=500
RPMSE  µ1   5.64   3.30   2.58       6.34   3.79   3.07       11.39   7.25   5.90
       µ2   5.02   2.90   2.32       5.60   3.33   2.68       10.20   6.39   5.24
       µ3   4.40   2.24   1.68       5.10   2.70   2.08        8.01   4.55   3.44
       µ4   3.51   1.66   1.23       4.10   2.10   1.62        7.89   4.42   3.34
       µ6   5.85   3.36   2.59       5.96   3.39   2.65        7.57   4.34   3.31
PMAE   µ1   4.41   2.61   2.06       5.00   3.01   2.44        9.07   5.74   4.70
       µ2   4.00   2.31   1.84       4.43   2.66   2.13        8.16   5.12   4.18
       µ3   3.43   1.75   1.32       4.07   2.14   1.66        6.37   3.63   2.74
       µ4   2.76   1.32   0.98       3.26   1.68   1.29        6.22   3.51   2.65
       µ6   4.66   2.67   2.07       4.76   2.71   2.11        6.05   3.45   2.65
RIMSE  µ1  12.41   7.59   6.02      12.37   7.56   6.06       15.88  11.22   9.75
       µ2   7.85   4.42   3.42       8.30   4.80   3.77       14.36  10.33   9.16
       µ3   8.14   4.57   3.51       8.44   4.69   3.70       12.68   9.06   8.07
       µ4   7.70   4.32   3.32       7.89   4.38   3.47       12.61   9.00   8.03
       µ6   9.02   5.22   4.03       9.00   5.12   4.05       12.41   8.96   8.01
IMAE   µ1   8.88   5.44   4.33       8.97   5.48   4.40       12.11   8.56   7.50
       µ2   6.01   3.40   2.64       6.35   3.69   2.90       11.01   8.01   7.19
       µ3   6.04   3.39   2.61       6.39   3.56   2.81        9.80   7.08   6.37
       µ4   5.62   3.14   2.42       5.87   3.26   2.59        9.76   7.02   6.34
       µ6   6.95   4.03   3.11       6.99   3.97   3.14        9.61   6.98   6.33
Table 6
Estimation of conditional standard deviation in continuous design; 10,000 replications.

                σ = 5                     σ = 10                    σ = 50
           n=100  n=300  n=500       n=100  n=300  n=500       n=100  n=300  n=500
RPMSE  µ1   9.65   7.43   6.46       7.51   5.33   4.55       10.70   7.80   7.18
       µ2  16.71  14.02  12.48      13.05  10.66   9.37        9.60   7.08   6.62
       µ3   5.20   2.90   2.12       5.20   2.81   2.36        7.95   5.62   5.09
       µ4   4.31   2.16   1.46       4.14   1.95   1.42        8.67   5.71   4.89
       µ6  20.62  15.39  13.24      19.29  14.55  12.71       24.09  15.72  12.25
PMAE   µ1   8.90   7.10   6.24       6.42   4.71   4.05        8.69   6.51   6.17
       µ2  16.52  13.94  12.43      12.72  10.50   9.25        7.92   5.93   5.67
       µ3   4.02   2.19   1.64       4.03   2.24   1.94        6.51   4.78   4.51
       µ4   3.32   1.59   1.10       3.19   1.54   1.12        7.08   4.71   4.11
       µ6  15.34  11.92  10.46      16.04  12.58  11.20       18.27  11.34   9.23
RIMSE  µ1  12.18   9.98   9.02      11.03   8.99   8.09       18.70  14.29  12.92
       µ2  14.69  12.28  11.12      12.07   9.69   8.58       16.49  12.26  11.16
       µ3   5.20   2.90   2.12       5.20   2.81   2.36        7.95   5.62   5.09
       µ4   4.31   2.16   1.46       4.14   1.95   1.42        8.67   5.71   4.89
       µ6  18.69  13.95  12.18      17.50  13.44  11.73       24.26  18.15  15.80
IMAE   µ1  10.07   8.42   7.63       9.38   7.59   6.77       14.00  10.99  10.34
       µ2  13.48  11.51  10.49      11.16   9.05   8.02       12.55   9.82   9.34
       µ3   4.02   2.19   1.64       4.03   2.24   1.94        6.51   4.78   4.51
       µ4   3.32   1.59   1.10       3.19   1.54   1.12        7.08   4.71   4.11
       µ6  13.47  10.51   9.37      14.18  11.31  10.03       18.34  13.35  11.88
Fig. 1. QQ plot of the studentized estimator µ1(0)/se(µ1(0)) against the standard normal distribution, for the case n = 500 and σ = 5.
4.2. Empirical application

We examine a dataset used in An (2000), which is from a contingent valuation study conducted by Hanemann et al. (1991) to elicit the WTP for protecting wetland habitats and wildlife in California's San Joaquin Valley. Each respondent was assigned
a bid value. They were then also given a second bid that was either higher or lower than the first, depending on their acceptance or rejection of the first bid. The total number of bid values in this unfolding bracket design is 14: {25, 30, 40, 55, 65, 75, 80, 110, 125, 140, 170, 210, 250, 375}. The dataset consists of the bid responses and some personal characteristics of the respondents. The covariates X are age and number of years resident in California, education and income bracket, and binary indicators of sex, race, and membership in an environmental organization. The sample size, after excluding nonrespondents, incomplete responses, etc., is n = 518. The sample mean of Y across first bids was Ȳ1 = 0.396 and across second bids was Ȳ2 = 0.581, while V̄1 = 132.4 and V̄2 = 153.9. The second bid was more likely to receive a yes response, which is consistent with the larger mean value of the bid size. The contingency table is:

              Y1 = 1    Y1 = 0
   Y2 = 1        131       170
   Y2 = 0         74       143
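The independence statistic quoted in the next paragraph can be checked directly from this table; a short, purely illustrative Python snippet:

```python
import numpy as np

# 2 x 2 contingency table of first- vs second-bid responses (from the text)
tab = np.array([[131.0, 170.0],    # Y2 = 1 row: counts for (Y1 = 1, Y1 = 0)
                [74.0, 143.0]])    # Y2 = 0 row
expected = np.outer(tab.sum(axis=1), tab.sum(axis=0)) / tab.sum()
chi2 = ((tab - expected) ** 2 / expected).sum()
print(round(chi2, 2))              # 4.68, matching the reported statistic
```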
Fig. 2. First bid data. Marginal smooths µ1(Xi) with pointwise confidence intervals and the estimated unconditional mean.
This gives a chi-squared statistic of 4.68, which is to be compared with $\chi^2_{0.05}(1) = 3.84$, so we reject the hypothesis of independence across bids, although not strongly. The individuals for whom either Y1 = 0 and Y2 = 1 or Y1 = 1 and Y2 = 0 reveal a bound on their willingness to pay, because for these individuals we know that their WTP lies between min{V1, V2} and max{V1, V2}. Selecting these 244 individuals, we obtain that E(W) lies in the interval [112.1, 187.1]. This assumes that the first bids themselves do not influence behaviour in the second round through, e.g., framing or anchoring effects. We provide some empirical evidence below that this assumption may not hold in our data.

We first consider semiparametric specifications for W, in particular:
\[
W = X_i^{\top}\theta - \varepsilon \qquad\text{and}\qquad \log(W) = X_i^{\top}\theta - \varepsilon,
\]
so m is linear and Λ is the identity or the exponential function, respectively. With these specifications we estimate the quantity µw(x) = E(W|X = x) using our semiparametric estimators µj(x), j = 3, 4. To check for possible framing effects, we estimate this conditional mean WTP separately using the first bid data and the second bid data. Given that first bids were drawn with close to equal probabilities from a discrete distribution of bids, we assumed that the limiting design density h(V|X) is uniform on the interval [Vmin, Vmax] (which is not a bad approximation). In Table 7 we report the sample average of the estimates of E(W|X = Xi), denoted µ̄j, j = 3, 4, along with bootstrap confidence intervals.
Table 7
Estimates of WTP (bootstrap confidence intervals in brackets).

        Bid 1                                Bid 2
        Linear           Log-linear          Linear            Log-linear
µ3      110.480          112.676             172.838           356.771
        [101.3,126.0]    [106.4,126.0]       [154.3,202.3]     [317.3,631.3]
µ4      105.611          104.674             246.059           715.210
        [97.8,118.1]     [99.7,115.0]        [196.8,294.5]     [380.8,1810.4]
The computation of µ̄j is exactly as described in the simulation section. Table 8 provides parameter estimates along with their 95% bootstrap confidence intervals, with asterisks indicating significant departures from zero at the 5% level. The estimated mean WTP values based on only the first bid data agree quite closely, regardless of the estimator and of whether the linear or log-linear model is assumed, and the confidence intervals are quite narrow. Similar results were obtained for the sample median of {µj(Xi)}, i = 1, …, n, and for the estimates at the mean covariate value, µj(X̄). The results for the second bid data are rather erratic and generally produce higher mean WTP values. This may be an indicator of framing, shadowing, or anchoring effects, in which hearing the first bid and replying to it affects responses to later bids; see, e.g., McFadden (1994) and Green et al. (1998). These results may also be due to small sample problems associated with the survey design; in particular, the distribution of second bids differs markedly from the distribution of first bids, including some far larger bid values.
Fig. 3. Second bid data. Marginal smooths µ1(Xi) with pointwise confidence intervals and the estimated unconditional mean.
Table 8
Parameter estimates (95% bootstrap confidence intervals in brackets; * indicates significance at the 5% level).

                Bid 1                                Bid 2
                Linear            Log linear         Linear             Log linear
YEARCA          0.3823            0.0051             0.5382             0.0096
                [−0.118,0.935]    [−0.0011,0.01]     [−1.24,1.97]       [−0.008,0.03]
FEMALE          0.5560            0.0105             28.290             0.0931
                [−12.540,11.75]   [−0.14,0.15]       [−10.5,70.8]       [−0.17,0.20]
ln(AGE)         −13.591           −0.159             −31.714            −0.6081
                [−37.31,8.78]     [−0.5,0.12]        [−98.9,32.7]       [−1.73,0.33]
EDUC            −2.0237           −0.0266            1.2919             0.0563
                [−4.98,0.72]      [−0.067,10.01]     [−9.70,12.21]      [−0.17,0.21]
WHITE           6.238             0.0211             1.968              0.0423
                [−9.36,26.07]     [−0.05,0.15]       [−10.66,10.10]     [−0.09,0.17]
ENVORG          60.2098*          0.5206*            40.4140*           0.5033*
                [3.0,112.0]       [0.04,1.08]        [6.09,68.93]       [0.073,0.98]
ln(INCOME)      34.8597           0.2769             2.378              0.0459
                [−23.9,88.7]      [−0.56,0.66]       [−15.07,16.49]     [−0.15,0.56]
An (2000), using a very different modeling methodology, tests and accepts the hypothesis of no framing effects in these data, though he does report some large differences in coefficient estimates based on data using both bids versus just first bid data. Using different estimators and combining both first and second bid data sets, An (2000) reports WTP at the mean ranging from 155 to 227 (plus one outlier estimate of 1341), which may be compared to
our estimates of 99–113 for first bid data and 143–715 using only second bids. Finally, we conducted a purely nonparametric analysis with each of the four continuous covariates, one at a time. In Figs. 2 and 3 we provide the marginal smooths ( µ1 (Xi )) themselves along with a pointwise 95% confidence interval. These figures from the nonparametric estimator show some nonlinear effects. However, these are not likely to be statistically significant in view of the wide (pointwise) confidence intervals except for one case using second bid data. Recall that we are looking at marginal smooths and so it need not be the case that E [W |Xj ] is linear under either of our semiparametric specifications. Note also that the level of the effect is similar to that calculated in the full semiparametric specifications. 5. Concluding remarks We have provided semiparametric and nonparametric estimators of conditional moments and quantiles of the latent W . The estimators appear to perform well with both simulated and actual data. We have for convenience assumed throughout that the limiting support of V is bounded. Most of the results here should extend readily to the infinite support case, although some of the estimators may then require asymptotic trimming to deal with issues arising from division by a density estimate when the true density is not bounded away from zero.
The results here show the importance, for both identification and estimation, of experimental designs in which the distribution of bids or test values V possesses at least a fair number of mass points, and ideally is continuous. This should be taken as a recommendation to future designers of contingent valuation experiments. The precision of the estimators also depends in part on the distribution of test values. When designing experiments, one may wish to choose the limiting density h to maximize efficiency based on the variance estimators.

Acknowledgement

This paper was partly written while Oliver Linton was a Universidad Carlos III de Madrid Banco Santander Chair of Excellence, and he thanks them for financial support.

Appendix A

This appendix provides our main limiting distribution theorems. Some technical lemmas used by these theorems have been dropped to save space; they appear in a supplemental Appendix to this paper. The supplemental Appendix also contains results regarding identification when the bid distribution is discrete.

A.1. Distribution theory for nonparametric estimators

Proof of Theorem 2. First consider the estimator µ1r(x). Given our assumptions regarding the triangular array nature of the sampling scheme, replacing H(v|x, n) with H(v|x), and hence treating the estimator µ1r(x) as if the V's were drawn from a continuous distribution H(v|x), changes the limiting distribution of µ1r(x) by an amount of smaller order than the leading term, and hence is first order asymptotically negligible. Moreover, when the V's are drawn from a continuous distribution H(v|x), the estimator µ1r(x) is just an ordinary nonparametric regression (noting that Vi is part of the variable being smoothed), and so follows the standard limiting distribution theory associated with nonparametric regression, which we do not spell out here to save space.

We now turn to µ2r(x). First, we introduce some notation to define the local polynomial estimator Ĝ(v|x). Following the notation of Masry (1996a,b), let Nℓ = (ℓ + d − 1)!/[ℓ!(d − 1)!] be the number of distinct d-tuples j with |j| = ℓ. Arrange these Nℓ d-tuples as a sequence in lexicographical order and let φℓ⁻¹ denote this one-to-one map. Define $\tilde X_i = (V_i, X_i)$ and $\tilde x = (v, x)$, and write $G(v|x) = G(\tilde x)$ and $\hat G(v|x) = \hat G(\tilde x)$ for short. We have $\hat G(\tilde x) = e_1^{\top}M_n^{-1}\Psi_n$, where $e_1 = (1,0,\ldots,0)^{\top}$ is the vector with the one in the first position, and $M_n(\tilde x)$ and $\Psi_n(\tilde x)$ are a symmetric N × N matrix and an N × 1 dimensional column vector, respectively, with $N = \sum_{\ell=0}^{p-1} N_\ell$, defined as
\[
M_n(\tilde x) = \begin{bmatrix} M_{n,0,0}(\tilde x) & \cdots & M_{n,0,p-1}(\tilde x) \\ \vdots & \ddots & \vdots \\ M_{n,p-1,0}(\tilde x) & \cdots & M_{n,p-1,p-1}(\tilde x) \end{bmatrix}, \qquad
\Psi_n(\tilde x) = \begin{bmatrix} \Psi_{n,0}(\tilde x) \\ \vdots \\ \Psi_{n,p-1}(\tilde x) \end{bmatrix},
\]
where $M_{n,|j|,|k|}(\tilde x)$ is an $N_{|j|}\times N_{|k|}$ dimensional submatrix with (l, r) element given by
\[
[M_{n,|j|,|k|}]_{l,r} = \frac{1}{nb^{d}}\sum_{i=1}^{n}\left(\frac{\tilde x - \tilde X_i}{b}\right)^{\phi_{|j|}(l)+\phi_{|k|}(r)} K\!\left(\frac{\tilde x - \tilde X_i}{b}\right),
\]
and $\Psi_{n,|j|}(\tilde x)$ is an $N_{|j|}$ dimensional subvector whose r-th element is given by
\[
[\Psi_{n,|j|}]_{r} = \frac{1}{nb^{d}}\sum_{i=1}^{n}\left(\frac{\tilde x - \tilde X_i}{b}\right)^{\phi_{|j|}(r)} K\!\left(\frac{\tilde x - \tilde X_i}{b}\right) Y_i.
\]
We can write
\[
\hat G(\tilde x) - G(\tilde x) = e_1^{\top}M_n^{-1}(\tilde x)U_n(\tilde x) + e_1^{\top}M_n^{-1}(\tilde x)B_n(\tilde x). \tag{16}
\]
The stochastic term $U_n(\tilde x)$ and the bias term $B_n(\tilde x)$ are N × 1 vectors,
\[
U_n(\tilde x) = \begin{bmatrix} U_{n,0}(\tilde x) \\ \vdots \\ U_{n,p-1}(\tilde x) \end{bmatrix}, \qquad
B_n(\tilde x) = \begin{bmatrix} B_{n,0}(\tilde x) \\ \vdots \\ B_{n,p-1}(\tilde x) \end{bmatrix},
\]
where $U_{n,l}(\tilde x)$ and $B_{n,l}(\tilde x)$ are defined similarly to $\Psi_{n,l}(\tilde x)$, so that $U_{n,|j|}(\tilde x)$ and $B_{n,|j|}(\tilde x)$ are $N_{|j|}$ dimensional subvectors whose r-th elements are given by:
\[
[U_{n,|j|}]_{r} = \frac{1}{nb^{d}}\sum_{i=1}^{n}\left(\frac{\tilde x - \tilde X_i}{b}\right)^{\phi_{|j|}(r)} K\!\left(\frac{\tilde x - \tilde X_i}{b}\right)\varepsilon_i, \qquad
[B_{n,|j|}]_{r} = \frac{1}{nb^{d}}\sum_{i=1}^{n}\left(\frac{\tilde x - \tilde X_i}{b}\right)^{\phi_{|j|}(r)} K\!\left(\frac{\tilde x - \tilde X_i}{b}\right)\Delta_i(\tilde x),
\]
where $\Delta_i(\tilde x) = G(\tilde X_i) - \sum_{0\le|k|\le p-1}\frac{1}{k!}(D^{k}G)(\tilde x)(\tilde X_i - \tilde x)^{k}$, while $\varepsilon_i = Y_i - E(Y_i|\tilde X_i)$ are independent random variables with conditional mean zero and uniformly bounded variances.
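A minimal Python sketch of the local linear (p = 2) special case of $\hat G(\tilde x) = e_1^{\top}M_n^{-1}\Psi_n$ follows. The product Gaussian kernel is an illustrative stand-in for the generic K, and the function name is ours.

```python
import numpy as np

def G_hat(v, x, V, X, Y, b):
    """Local linear estimate of G(v|x) = E[Y | V = v, X = x].

    V, Y are length-n arrays, X is n x d; sketch only.
    """
    Xt = np.column_stack([V, X])                  # tilde X_i = (V_i, X_i)
    xt = np.concatenate([[v], np.atleast_1d(x)])  # tilde x = (v, x)
    U = (Xt - xt) / b
    K = np.exp(-0.5 * (U ** 2).sum(axis=1))       # product Gaussian kernel
    Z = np.column_stack([np.ones(len(Y)), U])     # local linear basis
    W = Z * K[:, None]
    beta = np.linalg.solve(Z.T @ W, W.T @ Y)      # e_1' M_n^{-1} Psi_n
    return beta[0]
```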
The argument is similar to Fan et al. (1998, Theorem 1); we just sketch the extension to our quasi-discrete case. The first part of the argument is to derive a uniform approximation to the denominator in (16). We have
\[
\sup_{v\in[\rho_0(x),\rho_1(x)]}\left|M_n(v,x) - E[M_n(v,x)]\right| = O_p(a_n), \tag{17}
\]
where $a_n = \sqrt{\log n/(nb^{d+1})}$. The justification for this comes from Masry (1996a, Theorem 2). Although he assumed a continuous covariate density, it is clear from the proofs that the argument goes through in our case. Discreteness of Vi only affects the bias calculation. We calculate $E[M_n(v,x)]$; for simplicity consider just the upper diagonal element:
\[
E[K_b(\tilde x - \tilde X_i)] = \int k_b(v-v')K_b(x-x')\,dH(v',x'|n)
= \int k_b(v-v')K_b(x-x')\,dH(v',x') + \int k_b(v-v')K_b(x-x')\,[dH(v',x'|n) - dH(v',x')].
\]
Then, using integration by parts for Lebesgue–Stieltjes integrals (Carter and van Brunt (2000, Theorem 6.2.2)), for large enough n we have
\[
\int_{\rho_0(x)}^{\rho_1(x)} k_b(v-v')\,[dH(v'|x',n) - dH(v'|x')]
= -\frac{1}{b^{2}}\int_{\rho_0(x)}^{\rho_1(x)} k'\!\left(\frac{v-v'}{b}\right)[H(v'|x',n) - H(v'|x')]\,dv',
\]
since the function k is continuous everywhere and the boundary term
\[
\mu_{kH}([\rho_0(x),\rho_1(x)]) = k_b(v-\rho_1(x))\big[H(\rho_1(x)|x',n) - H(\rho_1(x)|x')\big] - k_b(v-\rho_0(x))\big[H(\rho_0(x)|x',n) - H(\rho_0(x)|x')\big] = 0
\]
for large enough n, where $\mu_{kH}(A)$ denotes the H-measure of the set A. Therefore, by the law of iterated expectations, for some constant C < ∞,
\[
\left|\int k_b(v-v')K_b(x-x')\,[dH(v',x'|n) - dH(v',x')]\right|
= \left|\int k_b(v-v')K_b(x-x')\,[dH(v'|x',n) - dH(v'|x')]\,dH(x')\right|
\]
\[
= \left|\frac{1}{b^{2}}\int\!\!\int k'\!\left(\frac{v-v'}{b}\right)[H(v'|x',n) - H(v'|x')]\,dv'\,K_b(x-x')\,dH(x')\right|
\le \sup_{v'}\sup_{|x'-x|\le b}\big|H(v'|x',n) - H(v'|x')\big|\times\frac{1}{b^{2}}\int\left|k'\!\left(\frac{v-v'}{b}\right)\right|dv'\times\int |K_b(x-x')|\,dH(x')
\]
\[
\le C\,\frac{1}{b}\,\sup_{v'}\sup_{|x'-x|\le b}\big|H(v'|x',n) - H(v'|x')\big|\times\int|k'(t)|\,dt\int|K(u)|\,du\times\sup_{|x'-x|\le b} h(x') = O_p(J^{-1}b^{-1}),
\]
by the integrability and smoothness of k. The right-hand side does not depend on v, so the bound is uniform.

For each j with $0\le|j|\le 2(p-1)$, let $\mu_j(K) = \int_{\mathbb{R}^{d+1}} u^{j}K(u)\,du$ and $\nu_j(K) = \int_{\mathbb{R}^{d+1}} u^{j}K^{2}(u)\,du$, and define the N × N dimensional matrices M and Γ and the N × 1 vector B, where $N = \sum_{\ell=0}^{p-1} N_\ell$, by
\[
M = \begin{bmatrix} M_{0,0} & M_{0,1} & \cdots & M_{0,p-1} \\ M_{1,0} & M_{1,1} & \cdots & M_{1,p-1} \\ \vdots & & & \vdots \\ M_{p-1,0} & M_{p-1,1} & \cdots & M_{p-1,p-1} \end{bmatrix}, \quad
\Gamma = \begin{bmatrix} \Gamma_{0,0} & \Gamma_{0,1} & \cdots & \Gamma_{0,p-1} \\ \Gamma_{1,0} & \Gamma_{1,1} & \cdots & \Gamma_{1,p-1} \\ \vdots & & & \vdots \\ \Gamma_{p-1,0} & \Gamma_{p-1,1} & \cdots & \Gamma_{p-1,p-1} \end{bmatrix}, \quad
B = \begin{bmatrix} M_{0,p} \\ M_{1,p} \\ \vdots \\ M_{p-1,p} \end{bmatrix},
\]
where $M_{i,j}$ and $\Gamma_{i,j}$ are $N_i\times N_j$ dimensional matrices whose (ℓ, m) elements are, respectively, $\mu_{\phi_i(\ell)+\phi_j(m)}$ and $\nu_{\phi_i(\ell)+\phi_j(m)}$. Note that the elements of the matrices M = M(K) and Γ = Γ(K) are simply multivariate moments of the kernels K and K², respectively. Under the smoothness conditions on h(v, x) we have, for all j, k, l, r,
\[
\frac{1}{b^{d}}\int\left(\frac{\tilde x - \tilde x'}{b}\right)^{\phi_{|j|}(l)+\phi_{|k|}(r)} K\!\left(\frac{\tilde x - \tilde x'}{b}\right) dH(v',x') = h(\tilde x)\,[M_{|j|,|k|}]_{l,r} + O(b)
\]
uniformly over v. Therefore,
\[
M_n(\tilde x) = h(\tilde x)M + O_p(c_n), \tag{18}
\]
where $c_n = a_n + b + J^{-1}b^{-1}$, and the error is uniform over v in the support of H(v|x, n). There is an additional term here of order $J^{-1}b^{-1}$ due to the discreteness; this term is of small order under our conditions.

Then $e_1^{\top}M_n^{-1}(\tilde x)U_n(\tilde x) = e_1^{\top}M^{-1}U_n(\tilde x)/h(\tilde x) + \mathrm{rem}(\tilde x)$, where $\mathrm{rem}(\tilde x)$ is a remainder term that is $o_p(n^{-1/2}b^{-(d+1)/2})$. By similar arguments we obtain $e_1^{\top}M_n^{-1}(\tilde x)B_n(\tilde x) = b^{p}\beta(\tilde x) + \mathrm{rem}(\tilde x)$, where $\mathrm{rem}(\tilde x)$ is a remainder term that is $o_p(b^{p})$ and $\beta(\tilde x) = e_1^{\top}M^{-1}B\,G^{(p+1)}(\tilde x)$. Therefore, we obtain
\[
\hat G(\tilde x) - G(\tilde x) = \frac{1}{h(\tilde x)}\,e_1^{\top}M^{-1}U_n(\tilde x) + b^{p}\beta(v,x) + \mathrm{rem}(\tilde x),
\]
where $\mathrm{rem}(\tilde x)$ is a remainder term that is $o_p(n^{-1/2}b^{-(d+1)/2}) + o_p(b^{p})$. We next substitute the leading terms into µ2r(x), recalling that
\[
\hat\mu_{2r}(x) - \mu_r(x) = \int_{\rho_0(x)}^{\rho_1(x)} r'(v,x)\,[\hat G(v|x) - G(v|x)]\,dv.
\]
The standard integration argument along the lines of Fan et al. (1998) shows that the term $\mathrm{rem}(\tilde x)$ can be ignored, and we obtain
\[
\hat\mu_{2r}(x) - \mu_r(x) = e_1^{\top}M^{-1}\bar U_n(x) + b^{p}\bar\beta(x) + o_p(n^{-1/2}b^{-d/2}),
\]
where $\bar\beta(x) = \int\beta(v,x)\,d\lambda(v)$, while $\bar U_n(x) = [\bar U_{n,0}(x),\ldots,\bar U_{n,p}(x)]^{\top}$ is an N × 1 vector whose $N_{|j|}$ dimensional subvectors $\bar U_{n,|j|}(x)$ are obtained by integrating the corresponding elements of $U_n(\tilde x)$ over v. We can write
\[
\bar U_n(x) = \frac{1}{nb^{d}}\sum_{i=1}^{n} L_{d,p}\!\left(\frac{x-X_i}{b}\right)\frac{r'(V_i,x)}{h(V_i,x)}\,\varepsilon_i,
\]
where $L_{d,p}(\cdot)$ denotes the vector with typical element $((x-X_i)/b)^{\phi_{|j|}(r)}K((x-X_i)/b)$; after integrating over v, the typical element reduces to $e_1^{\top}M^{-1}\int u^{\phi_{|j|}(r)}k(u)\,du$, so the quantity $L_{d,p}$ can also be defined in terms of the basic kernel k. Under our conditions,
\[
E\left[\left(\frac{r'(V_i,x)}{h(V_i,x)}\right)^{2}\sigma^{2}(V_i,X_i)\,\bigg|\,X_i = x\right]
= \int\left(\frac{r'(v,x)}{h(v,x)}\right)^{2}\sigma^{2}(v,x)\,dH(v|x,n)
= \int\left(\frac{r'(v,x)}{h(v,x)}\right)^{2}\sigma^{2}(v,x)\,dH(v|x) + \int\left(\frac{r'(v,x)}{h(v,x)}\right)^{2}\sigma^{2}(v,x)\,[dH(v|x,n) - dH(v|x)]
= \int\left(\frac{r'(v,x)}{h(v,x)}\right)^{2}\sigma^{2}(v,x)\,dH(v|x) + o(1).
\]
It follows that the asymptotic variance of µ2r(x) is
\[
\frac{1}{nb^{d}}\,E\left[L_{d,p}^{2}\!\left(\frac{x-X_i}{b}\right)\frac{1}{b}\left(\frac{r'(V_i,x)}{h(V_i,x)}\right)^{2}\sigma^{2}(V_i,X_i)\right]
\simeq \frac{1}{nb^{d}}\,L_{d,p}^{2}\int\left(\frac{r'(v,x)}{h(v,x)}\right)^{2}\sigma^{2}(v,x)\,h(v,x)\,dv,
\]
by a change of variables and dominated convergence, taking account of the discreteness error. Furthermore, the central limit theorem holds by the arguments used in Gozalo and Linton (2000, Lemma CLT), and is not affected by the discreteness of V.

For the estimator µ0r(x), recall that
\[
\mu_{0r}(x) = -\int_{a_0}^{a_1} r(v,x)\,\frac{\partial G(v|x)}{\partial v}\,dv \tag{19}
\]
and the results follow similarly. We have
\[
\frac{\partial\hat G(v|x)}{\partial v} - \frac{\partial G(v|x)}{\partial v} = e_v^{\top}M_n^{*-1}(\tilde x)U_n^{*}(\tilde x) + e_v^{\top}M_n^{*-1}(\tilde x)B_n^{*}(\tilde x),
\]
where $e_v^{\top} = (0,1,\ldots,0)$ and $M_n^{*}(\tilde x)$, $U_n^{*}(\tilde x)$, and $B_n^{*}(\tilde x)$ are like $M_n(\tilde x)$, $U_n(\tilde x)$, and $B_n(\tilde x)$ except that they are for a one order higher polynomial. In particular, $e_v^{\top}M_n^{*-1}(\tilde x)U_n^{*}(\tilde x) = e_v^{\top}M^{*-1}U_n^{*}(\tilde x)/h(\tilde x) + \mathrm{rem}(\tilde x)$, where $\mathrm{rem}(\tilde x)$ is a small remainder term and $M^{*}$ is the corresponding matrix of kernel weights. We apply the same integration argument, except that we have to apply integration by parts to eliminate a bandwidth factor, which is why the limiting variance involves the derivative of r(v, x)/h(v, x). The bias term arguments are the same except that p is replaced by p + 1.

A.2. Distribution theory for semiparametric quantities

Let Ei denote the expectation conditional on Zi. In the proofs of Theorems 3 and 4 we make use of Lemmas 1 and 2 given in our supplemental Appendix. Define
\[
\rho_j(u,\theta) = h[\Lambda(m(X_j,\theta)-u)|X_j]\,\Lambda'(m(X_j,\theta)-u) \quad\text{and}\quad \psi_\theta(u) = E[\rho_j(u,\theta)],
\]
with $\psi(u) = \psi_{\theta_0}(u)$. Then, interchanging differentiation and integration (which is valid under our conditions), we have
\[
\psi'(u) = E\left[\frac{\partial\rho_j(u,\theta_0)}{\partial u}\right] = -E\big(\{h'(\Lambda|X_j)(\Lambda')^{2} + h(\Lambda|X_j)\Lambda''\}(m(X_j,\theta_0)-u)\big). \tag{20}
\]

Proof of Theorem 3. Recall that
\[
\hat\mu_{3r}(x) = r[\Lambda(m(x,\hat\theta)),x] + \frac1n\sum_{i=1}^{n}\frac{r'[\Lambda(m(x,\hat\theta)-\hat U_i),x]\,\Lambda'(m(x,\hat\theta)-\hat U_i)\,[Y_i-1(\hat U_i>0)]}{\hat\psi(\hat U_i)},
\]
where $\hat U_i = m(X_i,\hat\theta) - \Lambda^{-1}(V_i)$ and
\[
\hat\psi(\hat U_i) = \frac1n\sum_{j=1}^{n} h[\Lambda(m(X_j,\hat\theta)-\hat U_i)|X_j]\,\Lambda'(m(X_j,\hat\theta)-\hat U_i) = \frac1n\sum_{j=1}^{n}\rho_j(\hat U_i,\hat\theta).
\]
By a geometric series expansion of $1/\hat\psi(\hat U_i)$ about $1/\psi(\hat U_i)$ we can write
\[
\hat\mu_{3r}(x) = \frac1n\sum_{i=1}^{n} f_1(Z_i,\hat\theta) - \frac1n\sum_{i=1}^{n} f_2(Z_i,\theta_0)\,[\hat\psi(\hat U_i)-\psi(U_i)] \tag{21}
\]
\[
\phantom{\hat\mu_{3r}(x) =}\ - \frac1n\sum_{i=1}^{n}[f_2(Z_i,\hat\theta)-f_2(Z_i,\theta_0)]\,[\hat\psi(\hat U_i)-\psi(U_i)] \tag{22}
\]
\[
\phantom{\hat\mu_{3r}(x) =}\ + \frac1n\sum_{i=1}^{n}\frac{r'[\Lambda(m(x,\hat\theta)-\hat U_i),x]\,\Lambda'(m(x,\hat\theta)-\hat U_i)\,[Y_i-1(\hat U_i>0)]}{\psi^{2}(U_i)\,\hat\psi(\hat U_i)}\,[\hat\psi(\hat U_i)-\psi(U_i)]^{2}, \tag{23}
\]
where
\[
f_2(Z_i,\theta) = \frac{r'[\Lambda(m(x,\theta)-U_i(\theta)),x]\,\Lambda'(m(x,\theta)-U_i(\theta))\,[Y_i-1(U_i(\theta)>0)]}{\psi^{2}(U_i)}.
\]
The leading terms are derived from (21), while (22) and (23) contain remainder terms. We just present the leading terms here and refer to the supplementary material for the full arguments.

Leading terms. Lemma 1 implies that
\[
\frac{1}{\sqrt n}\sum_{i=1}^{n}[f_1(Z_i,\hat\theta)-Ef_1(Z_i,\theta_0)]
= \frac{1}{\sqrt n}\sum_{i=1}^{n}\big\{\Gamma_F\,\varsigma(Z_i,\theta_0) + [f_1(Z_i,\theta_0)-Ef_1(Z_i,\theta_0)]\big\} + o_p(1), \tag{24}
\]
where $Ef_1(Z_i,\theta_0) = \mu_r(x)$, and $f_1(Z_i,\theta_0)-Ef_1(Z_i,\theta_0) = f_0(Z_i,\theta_0)-Ef_0(Z_i,\theta_0)$ due to the cancellation of the common term $r[\Lambda(m(x,\theta_0)),x]$. The stochastic equicontinuity condition of Lemma 1 is verified in a separate Appendix; see below. Let $L(Z_i,Z_j) = \xi_j(U_i) + \Gamma(Z_i)\varsigma(Z_j;\theta_0)$, with
\[
\xi_j(u) = \rho_j(u,\theta_0) - E\rho_j(u,\theta_0), \qquad
\Gamma(Z_i) = E_i\left[\zeta_{ij}\,\frac{\partial m(X_j,\theta_0)}{\partial\theta}\right] - E_i[\zeta_{ij}]\,\frac{\partial m(X_i,\theta_0)}{\partial\theta}, \qquad
\zeta_{ij} = -\frac{\partial\rho_j(U_i,\theta_0)}{\partial u}.
\]
Note that $E_i[\xi_j(U_i)] = 0$ but $E_j[\xi_j(U_i)] \neq 0$. We first approximate $n^{-1}\sum_i f_2(Z_i,\theta_0)[\hat\psi(\hat U_i)-\psi(U_i)]$ by $n^{-2}\sum_i\sum_j f_2(Z_i,\theta_0)L(Z_i,Z_j)$. Specifically, by Lemma 2 and the fact that $E|f_2(Z_i,\theta_0)| < \infty$, we have
\[
\left|\frac1n\sum_{i=1}^{n} f_2(Z_i,\theta_0)\Big[\hat\psi(\hat U_i)-\psi(U_i)-\frac1n\sum_{j=1}^{n}L(Z_i,Z_j)\Big]\right|
\le \frac1n\sum_{i=1}^{n}|f_2(Z_i,\theta_0)|\times\max_{1\le i\le n}\left|\hat\psi(\hat U_i)-\psi(U_i)-\frac1n\sum_{j=1}^{n}L(Z_i,Z_j)\right| = o_p(n^{-1/2}).
\]
Next, letting $\varphi_n(z_1,z_2) = n^{-2}f_2(z_1,\theta_0)L(z_1,z_2)$, we have
\[
\frac{1}{n^{2}}\sum_{i=1}^{n}\sum_{j=1}^{n} f_2(Z_i,\theta_0)L(Z_i,Z_j) = \sum_{i=1}^{n}\sum_{j=1}^{n}\varphi_n(Z_i,Z_j),
\]
which can be approximated by a second order U-statistic as follows. Letting $p_n(z_1,z_2) = n(n-1)[\varphi_n(z_1,z_2)+\varphi_n(z_2,z_1)]/2$, we have
\[
Q_n = \sum_{i=1}^{n}\sum_{j=1}^{n}\varphi_n(Z_i,Z_j) = \frac{2}{n(n-1)}\sum_{i=1}^{n-1}\sum_{j=i+1}^{n} p_n(Z_i,Z_j) + o_p(n^{-1/2}),
\]
since $\sum_i\varphi_n(Z_i,Z_i) = o_p(n^{-1/2})$. Now $p_n$ is a symmetric kernel, i.e., $p_n(z_1,z_2) = p_n(z_2,z_1)$, so we can apply Lemma 3.1 of Powell et al. (1989). Letting
\[
\tilde Q_n = \frac{2}{n}\sum_{j=1}^{n}\omega_n(Z_j), \qquad \text{where } \omega_n(Z_i) = E_i[p_n(Z_i,Z_j)],
\]
we have $\sqrt n(Q_n - \tilde Q_n) = o_p(1)$. It remains to find $\omega_n(Z_i)$. We have
\[
2\omega_n(Z_i) = E[f_2(Z_j,\theta_0)\Gamma(Z_j)]\,\varsigma(Z_i;\theta_0) + E_i[f_2(Z_j,\theta_0)\xi_i(U_j)],
\]
because $E_i[L(Z_i,Z_j)] = 0$. Furthermore,
\[
E_j[f_2(Z_i,\theta_0)\xi_j(U_i)] = E_j\big[f_2(Z_i,\theta_0)[\rho_j(U_i,\theta_0) - E_i\rho_j(U_i,\theta_0)]\big]
= E_j\left[f_0(Z_i,\theta_0)\,\frac{\rho_j(U_i,\theta_0)-\psi(U_i)}{\psi(U_i)}\right]
\]
and
\[
E[f_2(Z_i,\theta_0)\Gamma(Z_i)]
= E\left[f_2(Z_i,\theta_0)\left(E_i\left[\zeta_{ij}\,\frac{\partial m(X_j,\theta_0)}{\partial\theta}\right] - E_i[\zeta_{ij}]\,\frac{\partial m(X_i,\theta_0)}{\partial\theta}\right)\right]
= E\left[\frac{f_0(Z_i,\theta_0)}{\psi(U_i)}\,\zeta_{ij}\left(\frac{\partial m(X_j,\theta_0)}{\partial\theta} - \frac{\partial m(X_i,\theta_0)}{\partial\theta}\right)\right].
\]
Writing $\zeta_{ij} = E_i\zeta_{ij} + \zeta_{ij} - E_i\zeta_{ij}$, where $E_i\zeta_{ij} = -\psi'(U_i)$, we have
\[
E\left[\frac{f_0(Z_i,\theta_0)}{\psi(U_i)}\,\zeta_{ij}\left(\frac{\partial m(X_j,\theta_0)}{\partial\theta} - \frac{\partial m(X_i,\theta_0)}{\partial\theta}\right)\right]
= E\left[-f_0(Z_i,\theta_0)\,\frac{\psi'(U_i)}{\psi(U_i)}\left(\frac{\partial m(X_i,\theta_0)}{\partial\theta} - E\frac{\partial m(X_i,\theta_0)}{\partial\theta}\right)\right]
+ E\left[\frac{f_0(Z_i,\theta_0)}{\psi(U_i)}\,\tilde\zeta_{ij}\left(\frac{\partial m(X_j,\theta_0)}{\partial\theta} - E\frac{\partial m(X_j,\theta_0)}{\partial\theta}\right)\right].
\]
Therefore,
\[
\tilde Q_n = E\left[-f_0(Z_i,\theta_0)\,\frac{\psi'(U_i)}{\psi(U_i)}\,\gamma_i + \frac{f_0(Z_i,\theta_0)}{\psi(U_i)}\,\tilde\zeta_{ij}\,\gamma_j\right]\frac1n\sum_{j=1}^{n}\varsigma(Z_j;\theta_0)
- \frac1n\sum_{j=1}^{n}E_j\left[f_0(Z_i,\theta_0)\,\frac{\rho_j(U_i,\theta_0)-\psi(U_i)}{\psi(U_i)}\right]. \tag{25}
\]
We have shown that $n^{-1}\sum_i f_2(Z_i,\theta_0)[\hat\psi(\hat U_i)-\psi(U_i)] = \tilde Q_n + o_p(n^{-1/2})$, where $\tilde Q_n$ is given in (25). This concludes the analysis of the leading terms. In conclusion, $\sqrt n[\hat\mu_{3r}(x) - \mu_r(x;\theta_0)] = n^{-1/2}\sum_{i=1}^{n}\eta_i + o_p(1)$, as required. The asymptotic distribution of $\sqrt n[\hat\mu_{3r}(x) - \mu_r(x)]$ follows from the central limit theorem for independent random variables with finite variance.

Proof of Theorem 4. By a geometric series expansion we can write
\[
\hat\mu_{4r}(x;\hat\theta) = \frac1n\sum_{i=1}^{n} f_1(Z_i,\hat\theta) - \frac1n\sum_{i=1}^{n} f_2(Z_i,\theta_0)\,[\tilde\psi(\hat U_i)-\psi(U_i)] \tag{26}
\]
\[
\phantom{=}\ - \frac1n\sum_{i=1}^{n}[f_2(Z_i,\hat\theta)-f_2(Z_i,\theta_0)]\,[\tilde\psi(\hat U_i)-\psi(U_i)] \tag{27}
\]
\[
\phantom{=}\ + \frac1n\sum_{i=1}^{n}\frac{r'[\Lambda(m(x,\hat\theta)-\hat U_i),x]\,\Lambda'(m(x,\hat\theta)-\hat U_i)\,[Y_i-1(\hat U_i>0)]}{\psi^{2}(U_i)\,\tilde\psi(\hat U_i)}\,[\tilde\psi(\hat U_i)-\psi(U_i)]^{2}. \tag{28}
\]
The leading terms in this expansion are derived from (26), while (27) and (28) contain the remainder terms.

Leading terms. We make use of Lemma 3 given in our supplemental Appendix. The term $n^{-1}\sum_i f_1(Z_i,\hat\theta)$ has already been analyzed above. By Lemma 3 we have, with probability tending to one and for some function d(·) with finite r-th moments,
\[
\left|\frac1n\sum_{i=1}^{n} f_2(Z_i,\theta_0)\Big[\tilde\psi(\hat U_i)-\psi(U_i)-\frac1n\sum_{j=1}^{n}L^{*}(Z_i,Z_j)\Big]\right|
\le \frac{1}{b^{3}}\,\frac1n\sum_{i=1}^{n}|f_2(Z_i,\theta_0)|\,d(X_i)\,\|\hat\theta-\theta_0\|^{2} = O_p(n^{-1}b^{-3}), \tag{29}
\]
where $L^{*}(Z_i,Z_j) = b^{-1}k((U_i-U_j)/b) - \psi(U_i) + \Gamma^{*}(Z_i)\cdot\varsigma(Z_j,\theta_0)$ and
\[
\Gamma^{*}(Z_i) = \psi'(U_i)\left[\frac{\partial m}{\partial\theta^{\top}}(X_i,\theta_0) - E\left(\frac{\partial m}{\partial\theta^{\top}}(X_i,\theta_0)\,\Big|\,U_i\right)\right] - \psi(U_i)\,m'_{\theta_0}(U_i).
\]
Here $m_{\theta_0}(U_i) = E[\partial m(X_i,\theta_0)/\partial\theta^{\top}\,|\,U_i]$. Under our bandwidth conditions, the right-hand side of (29) is $o_p(n^{-1/2})$. Next,
\[
\frac1n\sum_{i=1}^{n} f_2(Z_i,\theta_0)\,\frac1n\sum_{j=1}^{n}L^{*}(Z_i,Z_j) = \sum_{i=1}^{n}\sum_{j=1}^{n}\varphi_n(Z_i,Z_j),
\]
where
\[
\varphi_n(Z_i,Z_j) = \frac{1}{n^{2}}\,f_2(Z_i,\theta_0)\left[\frac1b\,k\!\left(\frac{U_i-U_j}{b}\right) - \psi(U_i) + \Gamma^{*}(Z_i)\cdot\varsigma(Z_j,\theta_0)\right].
\]
Note that
\[
E_i\,\varphi_n(Z_i,Z_j) = \frac{1}{n^{2}}\,f_2(Z_i,\theta_0)\left[\int\frac1b\,k\!\left(\frac{U_i-u}{b}\right)\psi(u)\,du - \psi(U_i)\right]
= \frac{1}{n^{2}}\,f_2(Z_i,\theta_0)\left[\int k(t)\,\psi(U_i+tb)\,dt - \psi(U_i)\right] = O_p(n^{-2}b^{2})
\]
uniformly in i. Define $\bar f_2(U_i) = E[f_2(Z_i,\theta_0)|U_i]$. Then, by iterated expectations,
\[
n^{2}E_j\,\varphi_n(Z_i,Z_j) = E\left[\bar f_2(U_i)\,\frac1b\,k\!\left(\frac{U_i-U_j}{b}\right)\right] - E[\bar f_2(U_i)\psi(U_i)] + E[f_2(Z_i,\theta_0)\Gamma^{*}(Z_i)]\cdot\varsigma(Z_j,\theta_0),
\]
where, using integration by parts, a change of variable, and dominated convergence,
\[
E\left[\bar f_2(U_i)\,\frac1b\,k\!\left(\frac{U_i-U_j}{b}\right)\right] = \int \bar f_2(u)\,\frac1b\,k\!\left(\frac{u-U_j}{b}\right)\psi(u)\,du = \bar f_2(U_j)\psi(U_j) + O_p(b^{2})
\]
uniformly. Note that $\bar f_2(U_j)\psi(U_j) = \bar f_0(U_j) = E[f_0(Z_j,\theta_0)|U_j]$. Furthermore,
\[
E[f_2(Z_i,\theta_0)\Gamma^{*}(Z_i)] = E\left[f_0(Z_i,\theta_0)\,\frac{\psi'(U_i)}{\psi(U_i)}\,\gamma_i^{*} - f_0(Z_i,\theta_0)\,m'_{\theta_0}(U_i)\right]
= E\left[\frac{\psi'(U_i)}{\psi(U_i)}\big(f_0(Z_i,\theta_0)-\bar f_0(U_i)\big)\gamma_i^{*}\right] - E[\bar f_0(U_i)\,m'_{\theta_0}(U_i)],
\]
by substituting in for $f_2$, decomposing $f_0(Z_i,\theta_0) = \bar f_0(U_i) + f_0(Z_i,\theta_0) - \bar f_0(U_i)$, and using that $E[\gamma_i^{*}|U_i] = 0$. Using the same U-statistic argument as in the proof of Theorem 3, we obtain
\[
\frac{1}{n^{2}}\sum_{i=1}^{n} f_2(Z_i,\theta_0)\sum_{j=1}^{n}L^{*}(Z_i,Z_j) = \frac1n\sum_{j=1}^{n}\omega_n^{*}(Z_j) + o_p(n^{-1/2}),
\]
where $\omega_n^{*}(Z_j) = \bar f_0(U_j) - E[\bar f_0(U_j)] + E[f_2(Z_i,\theta_0)\Gamma^{*}(Z_i)]\,\varsigma(Z_j)$.

Appendix B. Supplementary data

Supplementary material related to this article can be found online at doi:10.1016/j.jeconom.2010.11.006.
References

An, M.Y., 2000. A semiparametric distribution for willingness to pay and statistical inference with dichotomous choice CV data. The American Journal of Agricultural Economics 82, 487–500.
Bickel, P.J., Klaassen, C.A.J., Ritov, Y., Wellner, J., 1993. Efficient and Adaptive Estimation for Semiparametric Models. Springer, Berlin.
Brown, B.W., Newey, W.K., 1998. Efficient semiparametric estimation of expectations. Econometrica 66, 453–464.
Carter, M., van Brunt, B., 2000. The Lebesgue–Stieltjes Integral. Springer, Berlin.
Chen, X., Linton, O., Van Keilegom, I., 2003. Estimation of semiparametric models when the criterion function is not smooth. Econometrica 71, 1591–1608.
Chen, H., Randall, A., 1997. Semi-nonparametric estimation of binary response models with an application to natural resource valuation. Journal of Econometrics 76, 323–340.
Coppejans, M., 2003. Effective nonparametric estimation in the case of severely discretized data. Journal of Econometrics 117, 331–367.
Creel, M., Loomis, J., 1997. Semi-nonparametric distribution-free dichotomous choice contingent valuation. Journal of Environmental Economics and Management 32, 341–358.
Crooker, J.R., Herriges, J.A., 2004. Parametric and semi-nonparametric estimation of willingness-to-pay in the dichotomous choice contingent valuation framework. Environmental & Resource Economics 27, 451–480.
David, H.A., Nagaraja, H.N., 2003. Order Statistics. John Wiley and Sons, New Jersey.
Delgado, M., Mora, J., 1995. Nonparametric and semiparametric estimation with discrete regressors. Econometrica 63, 1477–1484.
Dinse, G.E., Lagakos, S.W., 1982. Nonparametric estimation of lifetime and disease onset distributions from incomplete observations. Biometrics 38, 921–932.
Fan, J., Gijbels, I., 1996. Local Polynomial Modelling and Its Applications. Chapman and Hall.
Fan, J., Härdle, W., Mammen, E., 1998. Direct estimation of low dimensional components in additive models. Annals of Statistics 26, 943–971.
Goldstein, L., Messer, K., 1992. Optimal plug-in estimators for nonparametric functional estimation. The Annals of Statistics 20, 1306–1328.
Gozalo, P., Linton, O.B., 2000. Local nonlinear least squares: using parametric information in nonparametric regression. Journal of Econometrics 99, 63–106.
Green, D., Jacowitz, K., Kahneman, D., McFadden, D., 1998. Referendum contingent valuation, anchoring, and willingness to pay for public goods. Resource and Energy Economics 20, 85–116.
Hall, P., Huang, L.-S., 2001. Nonparametric kernel regression subject to monotonicity constraints. The Annals of Statistics 29, 624–647.
Hanemann, W.M., Loomis, J., Kanninen, B., 1991. Statistical efficiency of double-bounded dichotomous choice contingent valuation. The American Journal of Agricultural Economics 73, 1255–1263.
Härdle, W., Linton, O.B., 1994. Applied nonparametric methods. In: Engle, R.F., McFadden, D.L. (Eds.), The Handbook of Econometrics, vol. IV. North Holland.
Ho, K., Sen, P.K., 2000. Robust procedures for bioassays and bioequivalence studies. Sankhya Series B 62, 119–133.
Horowitz, J.L., 2001. The bootstrap. In: Heckman, J.J., Leamer, E.E. (Eds.), Handbook of Econometrics, vol. V. Elsevier Science B.V., pp. 3159–3228 (Chapter 52).
Jewell, N.P., van der Laan, M., 2002. Bivariate current status data. U.C. Berkeley Division of Biostatistics Working Paper Series. Working Paper 114. Kanninen, B., 1993. Dichotomous choice contingent valuation. Land Economics 69, 138–146. Klein, R., Spady, R.H., 1993. An efficient semiparametric estimator for binary response models. Econometrica 61, 387–421. Lewbel, A., 1997. Semiparametric estimation of location and other discrete choice moments. Econometric Theory 13, 32–51. Lewbel, A., 2000. Semiparametric qualitative response model estimation with unknown heteroscedasticity or instrumental variables. Journal of Econometrics 97, 145–177. Li, Q., Racine, J., 2004. Nonparametric estimation of regression functions with both categorical and continuous data. Journal of Econometrics 119, 99–130. Linton, O., Nielsen, J.P., 1995. A kernel method of estimating structured nonparametric regression based on marginal integration. Biometrika 82, 93–100. Linton, O., Park, J., 2009. A comparison of analytic standard errors for nonparametric regression (Unpublished manuscript). Mammen, E., 1992. When Does Bootstrap Work? Asymptotic Results and Simulations. In: Lecture Notes in Statistics, vol. 77. Springer, New York. Manski, C., Tamer, E., 2002. Inference on regressions with interval data on a regressor or outcome. Econometrica 70, 519–546. Matzkin, R., 1992. Non-parametric and distribution-free estimation of the binary threshold crossing and the binary choice models. Econometrica 60, 239–270. Masry, E., 1996a. Multivariate local polynomial regression for time series: uniform strong consistency and rates. Journal of Time Series Analysis 17, 571–599. Masry, E., 1996b. Multivariate regression estimation: local polynomial fitting for time series. Stochastic Processes and their Applications 65, 81–101. McFadden, D., 1994. Contingent valuation and social choice. The American Journal of Agricultural Economics 76, 4. McFadden, D., 1998. Measuring willingness-to-pay for transportation improvements. In: Gärling, T., Laitila, T., Westin, K. (Eds.), Theoretical Foundations of Travel Choice Modeling. Elsevier Science, Amsterdam, pp. 339–364. Newey, W.K., McFadden, D., 1994. Large sample estimation and hypothesis testing. In: Engle, R.F., McFadden, D.L. (Eds.), Handbook of Econometrics, vol. IV. Elsevier, Amsterdam, pp. 2111–2245. Pollard, D., 1984. Convergence of Stochastic Processes. Springer-Verlag, New York. Powell, J.L., Stock, J.H., Stoker, T.M., 1989. Semiparametric estimation of index coefficients. Econometrica 57, 1403–1430. Ramgopal, P., Laud, P.W., Smith, A.F.M., 1993. Nonparametric Bayesian bioassay with prior constraints on the shape of the potency curve. Biometrika 80, 489–498. Shorack, G.R., Wellner, J.A., 1986. Empirical Processes with Applications to Statistics. John Wiley and Sons. Sperlich, S., Linton, O.B., Härdle, W., 1999. A simulation comparison between the backfitting and integration methods of estimating separable nonparametric models. TEST 8, 419–458. Wang, M.-C., Van Ryzin, J., 1981. A class of smooth estimators for discrete distributions. Biometrika 68, 301–309.
Journal of Econometrics 162 (2011) 189–212
A martingale approach for testing diffusion models based on infinitesimal operator✩ Zhaogang Song ∗ Department of Economics, Cornell University, United States
Article info

Article history: Received 11 August 2009; Received in revised form 15 July 2010; Accepted 15 December 2010; Available online 21 December 2010.

JEL classification: C12; C14.

Keywords: Diffusion; Markov; Martingale problem; Semi-group; Infinitesimal operator.
Abstract

I develop an omnibus specification test for diffusion models based on the infinitesimal operator. The infinitesimal operator based identification of the diffusion process is equivalent to a "martingale hypothesis" for the processes obtained by a transformation of the original diffusion model. My test procedure is then constructed by checking the "martingale hypothesis" via a multivariate generalized spectral derivative based approach that delivers an asymptotic N(0,1) null distribution for the test statistic. The infinitesimal operator of the diffusion process is a closed-form function of the drift and diffusion terms. Consequently, my test procedure covers both univariate and multivariate diffusion models in a unified framework and is particularly convenient for the multivariate case. Moreover, different transformed martingale processes contain separate information about the drift and diffusion specifications. This motivates me to propose a separate inferential test procedure to explore the sources of rejection when a parametric form is rejected. Simulation studies show that the proposed tests have reasonable size and excellent power performance. An empirical application of my test procedure using Eurodollar interest rates finds that most popular short-rate models are rejected and that drift misspecification plays an important role in such rejections. © 2010 Elsevier B.V. All rights reserved.
1. Introduction

Diffusion models have proven largely successful in finance over the past three decades for modeling the dynamics of, for instance, interest rates, stock prices, exchange rates and option prices. Since economic theories usually do not suggest any concrete functional form for the processes, the choice of a model is somewhat arbitrary, and a great number of parametric diffusion models have been proposed in the literature; see, for example, Ait-Sahalia (1996a), Ahn and Gao (1999), Chan et al. (1992), Cox et al. (1985) and Vasicek (1977). However, model misspecification may yield misleading conclusions about the dynamics of the process and result in large errors in pricing, hedging and risk management. The development of reliable specification tests for diffusion models is therefore necessary to tackle such problems.
✩ I would like to thank the Co-Editor, Ron Gallant, Associate Editor and two anonymous referees for careful and constructive comments which improved the paper significantly. I am especially indebted to my advisor Yongmiao Hong for his invaluable guidance. I also thank Valentina Corradi, seminar participants at the 2009 International Symposium on Risk Management and Derivatives in Xiamen, and doctoral students in the economics program at Cornell University for their discussions. Special thanks go to Yacine Ait-Sahalia for providing me the Eurodollar interest rates data and his Matlab codes. All remaining errors are solely mine. ∗ Corresponding address: Uris Hall 445, Department of Economics, Cornell University, Ithaca, NY 14850, United States. Tel.: +1 6077930666. E-mail address:
[email protected].
0304-4076/$ – see front matter © 2010 Elsevier B.V. All rights reserved. doi:10.1016/j.jeconom.2010.12.005
In this study, I develop an omnibus test for the specification of diffusion models based on the infinitesimal operator, which is a complete characterization of the whole dynamics of the process, alternative to the transition density used by Ait-Sahalia et al. (2009), Chen et al. (2008) and Hong and Li (2005). Through the celebrated "martingale problem" developed by Stroock and Varadhan (1969), an infinitesimal operator based martingale characterization of a diffusion process is obtained, such that the identification of the diffusion process is equivalent to a "martingale hypothesis" for the processes transformed from the original diffusion process. I then check the "martingale hypothesis" via a multivariate generalized spectral derivative approach, which is an extension of Hong (1999) for univariate time series processes. Such a test is particularly powerful against alternatives with zero autocorrelation but a nonzero conditional mean, and has a convenient one-sided N(0,1) asymptotic distribution. The infinitesimal operator of the diffusion process enjoys the nice property of being a closed-form expression of the drift and diffusion terms. This gives my test procedure several attractive properties, discussed below. Ait-Sahalia (1996a) developed probably the first nonparametric test for univariate diffusion models by comparing the model-implied stationary density (or transition density) with a smoothed kernel density estimator based on discretely sampled data. Hong and Li (2005) observed that when a diffusion model is correctly specified, the probability integral transform (PIT) of the data via the model-implied transition density is i.i.d. U[0,1]. Then an omnibus
test is proposed by checking the joint hypothesis of i.i.d. U[0,1] via a smoothed kernel estimator of the joint density of the probability integral transform series. As by-products of the Efficient Method of Moments (EMM) algorithm, a χ² test for model misspecification and a class of appealing diagnostic t-tests to gauge possible sources of model failure are proposed in Gallant and Tauchen (1996), which are applicable to general continuous time models. The idea is to match the model-implied moments to the moments implied by a seminonparametric (SNP) transition density for the observed data. Many other tests based directly on the transition density have appeared recently for univariate diffusion models. Both Ait-Sahalia et al. (2009) and Chen et al. (2008) proposed tests comparing the model-implied parametric transition density and distribution function to their nonparametric counterparts, with the latter using a nonparametric empirical likelihood approach. Corradi and Swanson (2005) introduced two bootstrap specification tests for diffusion processes. The first, for the one-dimensional case, is a Kolmogorov type test based on a comparison of the empirical cumulative distribution function (CDF) and the model-implied parametric CDF. The second, for multidimensional or multifactor models characterized by stochastic volatility, compares the empirical distribution of the actual data and the empirical distribution of the (model) simulated data. Noticing that most tests for diffusions apply only to the univariate case, Chen and Hong (2010) considered a test for multivariate diffusion models based on the conditional characteristic function (CCF), which is the Fourier transform of the transition density. Different from all the tests above, my proposed test procedure is based on the so-called infinitesimal operator, which completely characterizes the dynamics of the underlying continuous time process. Intuitively speaking, the infinitesimal operator captures the limiting behavior of the conditional movement, and hence the whole dynamics of the process, since time evolves continuously. Several properties of the infinitesimal operator and the corresponding advantages of the proposed test are as follows. First, as an alternative to the transition density, the infinitesimal operator also completely identifies the dynamics of the diffusion process. Consequently, my infinitesimal operator based test can effectively detect misspecified models that have a correct stationary density, which Ait-Sahalia's (1996a) marginal density based test may easily miss. It hence significantly improves the size and power performance of the marginal density based test. In this sense, my test is omnibus, unlike Gallant and Tauchen's (1996) EMM tests which, as Tauchen (1997) points out, are not consistent against all model misspecifications because they are based on a seminonparametric score function rather than the transition density itself. Second, the infinitesimal operator always has a closed-form expression in terms of the drift and diffusion functions. In contrast, it is well known that the transition density of most continuous time models has no closed form. As a result, techniques to approximate the transition density are required in the transition density based tests (see Hong and Li, 2005 and Ait-Sahalia et al., 2009), for example, the simulation methods of Brandt and Santa-Clara (2002), the Hermite expansion approach of Ait-Sahalia (2002), or, for affine diffusions, the closed-form approximation of Duffie et al.
(2003) and the empirical characteristic function approach of Singleton (2001) and Jiang and Knight (2002). Although the asymptotic distribution of some tests (like Hong and Li, 2005) is not affected by the estimation uncertainty, the use of the transition density may not be computationally convenient and may affect the finite sample performance of the test. However, my infinitesimal operator based test requires nothing except the drift and diffusion terms. No approximation techniques are needed and the test is easy to implement and computationally convenient.
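For concreteness, the closed form in question is the standard second-order differential operator generated by a diffusion dXt = b(Xt)dt + σ(Xt)dWt (a textbook fact, stated here for reference rather than taken from a particular display in this paper): for sufficiently smooth f,
\[
\mathcal{A}f(x) \;=\; \sum_{i=1}^{d} b_i(x)\,\frac{\partial f(x)}{\partial x_i} \;+\; \frac{1}{2}\sum_{i=1}^{d}\sum_{j=1}^{d}\big[\sigma(x)\sigma(x)^{\top}\big]_{ij}\,\frac{\partial^{2} f(x)}{\partial x_i\,\partial x_j},
\]
so the operator is available in closed form whenever the drift b and diffusion σ are.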
Third, the closed-form expression of the infinitesimal operator for multivariate cases is similar to, and as simple as, that for univariate cases. Hence, the proposed test is particularly convenient for checking multivariate diffusion models, which is fairly difficult with other methods. For example, Hong and Li's (2005) approach cannot be extended to a multivariate context directly, because the PIT of the data with respect to a model-implied multivariate transition density is no longer i.i.d. U[0,1], even if the model is correctly specified. Although they propose to evaluate multivariate models using the PITs for each state variable, which is valid by partitioning the information set appropriately, this may fail to detect misspecification in the joint dynamics of the state variables. In particular, their test may easily overlook misspecification in the conditional correlations between state variables, which are known to be important in the term structure literature (Dai and Singleton, 2000). Chen and Hong (2010) can check multivariate diffusion models, but their test depends crucially on the availability of a closed-form CCF. For the proposed test here, univariate and multivariate diffusions are unified in the same framework and no additional steps are necessary for multivariate cases. Fourth, the infinitesimal operator based martingale characterization of diffusion models can reveal separate information about the specification of the drift and diffusion terms, or even their interactions. This is a property that no other approach enjoys so far. Although other methods are available to check the specification of the drift or diffusion terms by nonparametrically smoothing only one of them, the infinitesimal operator based martingale characterization proposed in this study brings out this type of information in an essential way. This motivates me to suggest a separate inference test to determine the sources of rejection when a parametric form is rejected. Not only is my test convenient for multivariate models, which are difficult for other methods like Li (2007) and Kristensen (2008), but it is constructed in exactly the same framework as the proposed test for the joint dynamics. In other words, a unified procedure is developed to first check the specification of the joint dynamics and then gauge the sources of rejection, in order to build a more accurate model for financial variables. This paper is also related to the literature on operator methods for continuous time processes (see Ait-Sahalia et al., 2004 for a survey), including the GMM-type and estimating equation-type estimators in Hansen and Scheinkman (1995) and Kessler and Sorensen (1996), respectively, the identification problem in Hansen et al. (1998), semi-group pricing theory in Hansen and Scheinkman (2003), and the test in Kanaya (2007). Different from the econometric studies above using operator methods, the infinitesimal operator is utilized here via the martingale characterization, which can also be extended to construct estimators of diffusion models and generic tests of whether a continuous time process is a diffusion, as in Kanaya (2007). Several advantages over the existing studies are expected to follow from the properties of the infinitesimal operator based martingale characterization, for example, the complete identification of the diffusion process, unlike Hansen and Scheinkman's (1995) identification up to scale, the closed-form expressions, unlike the eigenfunctions used in Kessler and Sorensen (1996) and Hansen et al. (1998), and convenient application to multivariate diffusions, unlike Kanaya (2007), which is applicable only to univariate cases. This research is under way and will be reported elsewhere.

The paper is organized as follows. Section 2 derives the infinitesimal operator based martingale characterization for diffusion models and obtains the specification hypothesis as a martingale property. Section 3 discusses the construction of the test by a multivariate generalized spectral derivative approach. The test procedure for doing separate inference is proposed in Section 4. Asymptotics of the tests are presented in Section 5, including the
Z. Song / Journal of Econometrics 162 (2011) 189–212
asymptotic distribution and asymptotic power. The applicability of a data-driven bandwidth is also justified. In Section 6, I examine the finite sample performances of the tests and an empirical study for spot rate models is conducted in Section 7. Section 8 concludes. All the mathematical proofs are in the Appendix. Throughout, we use C to denote a generic bounded constant, ‖ ‖ for the Euclidean norm, and A∗ for the complex conjugate of A. 2. Infinitesimal operator based martingale characterization Consider a multivariate diffusion model defined by the following stochastic differential equation (SDE): dXt = b0 (Xt )dt + σ 0 (Xt )dWt
(2.1)
where Wt is a d × 1 standard Brownian motion in Rd , b : E ⊂ Rd → Rd is a drift function (i.e., instantaneous conditional mean) and σ : E → Rd×d is a diffusion function (i.e., instantaneous conditional standard deviation). We will call (2.1) a SDE-diffusion process. What we are interested in is to test the parametric form of a SDE-diffusion, i.e., b0 ∈ Mb , {b(·, θ ), θ ∈ Θ }
σ 0 ∈ Mσ , {σ (·, θ ), θ ∈ Θ }
(2.2)
where Θ is a finite-dimensional parameter space. We say that the model {Mb , Mσ } is correctly specified for (2.1) if H0 : P [b(Xt , θ0 ) = b0 (Xt ), σ (Xt , θ0 ) = σ 0 (Xt )] = 1, for some θ0 ∈ Θ .
(2.3)
The alternative hypothesis is that there exists no parameter value θ ∈ Θ such that b(·, θ ) and σ (·, θ ) coincide with b0 (·) and σ 0 (·) simultaneously: HA : P [b(Xt , θ ) = b0 (Xt ), σ (Xt , θ ) = σ 0 (Xt )] < 1, for all θ ∈ Θ .
Definition 1. A Markov process X = (Ω , {Ft }, {Xt }, {Pt }, {P x , x ∈ E })t =0 with state space (E , ε) is an E-valued stochastic process adapted to the sequence of σ -algebras {Ft } such that for 0 5 s 5 t and x ∈ E , E x [f (Xs+t )|Fs ] = (Pt f )(Xs ), P x -a.s., where {Pt } is a transition function on (E , ε), i.e., a family of kernels Pt : E × ε → [0, 1] such that (i) for t = 0 and x ∈ E , Pt (x, ·) is a measure on ε with Pt (x, E ) 5 1 (ii) for t = 0 and Γ ∈ ε, Pt (·, Γ ) is ε -measurable (iii) for s, t = 0, x ∈ E and Γ ∈ ε , Pt +s (x, Γ ) =
In this case, the Markov property is expressed as the following semi-group property equivalent to the Chapman–Kolmogorov equation: Ps Pt = Ps+t ,
∫
Ps (x, dy)Pt (y, Γ ).
(2.5)
E
In this definition, the Markov property is characterized by the transition function (or transition density when the density of transition function exists) and (2.5) is the so-called Chapman–Kolmogorov equation. An alternative and equivalent characterization is the induced family {Pt } which is a set of positive bounded operators with norm less than or equal to 1 on bε (bounded and ε -measurable functions) and which is defined by: Pt f (x) ≡ (Pt f )(x) =
∫
Pt (x, dy)f (y). E
(2.6)
for any s, t = 0.
(2.7)
Both transition function and the semi-group of operators characterize the Markov process and interact with the sample-path property of the process. However, since the general Markov process consists of too many processes and is too broad, we choose to focus on the more interesting subclass, Feller process. By Rogers and Williams (2000, Ch. III.6), Feller process is defined as follows: Definition 2. The transition function {Pt }t =0 of a Markov process is called a Feller transition function if (i) Pt C0 ⊂ C0 for all t = 0 (ii) for any f ∈ C0 and x ∈ E , Pt f (x) → f (x) as t ↓ 0, where C0 = C0 (E ) is the space of real-valued, continuous functions on E which vanish at infinity and C0 is endowed with the sup-norm. Feller process has good path properties1 and is also general enough to contain most processes we are interested in, for example, Feller diffusion which will be defined below and has been extensively used in finance, and Levy process including Poisson process and Compound process which has received more and more attention in finance recently (see Schoutens, 2003). For Feller processes, we will consider another characterization, the infinitesimal operator, other than the transition function and semigroup of operators introduced above which are for the general Markov process. Definition 3. A function f ∈ C0 is said to belong to the domain D(A) of the infinitesimal operator of a Feller process X if the limit
(2.4)
Since in this study I am relying on a characterization of a continuous time Markov process alternative to transition density, i.e., the infinitesimal operator, and in finance a diffusion process is usually specified as a SDE-diffusion, I will discuss first some related mathematical concepts and clarify their relationship. By Rogers and Williams (2000, Ch. III.1), a continuous time Markov process is defined as follows:
191
Af = lim
Pt f − f
(2.8)
t
t ↓0
exists in C0 . The linear transformation A : D(A) → C0 is called the infinitesimal operator of the process. Immediately from Definition 3, we see that for f ∈ D(A), it holds P-a.s. that
E
f (Xt +h ) − f (Xt ) h
Ft = Af (Xt ) + o(h),
as h ↓ 0.
(2.9)
In this sense, the infinitesimal operator indeed describes the movement of the process in an infinitesimally small amount of time. Therefore, intuitively the infinitesimal operator characterizes the whole dynamics of a Feller process because the time is continuous here.2 So far we have had Feller process for which three complete characterization of the dynamics are available: transition function (or transition density), semi-group of operators and infinitesimal operator. The most important Feller process in continuous time finance is the diffusion process. By Rogers and Williams (2000, Ch. III.13).
1 By Rogers and Williams (2000, Ch. III.7–9), the canonical Feller process always admits a Cadlag (the path of the process is right continuous and has left limits) modification and satisfies the strong Markov property. 2 Rigorously, it can be proved that the infinitesimal operator is equivalent to the semi-group of operators in characterizing a Feller process (see the Hill–Yoshida theorem in Dynkin (1965)). Therefore, infinitesimal operator does determine the whole dynamics of the process.
192
Z. Song / Journal of Econometrics 162 (2011) 189–212
Definition 4. A Feller process with state space E ⊂ Rd is called a Feller diffusion if it has continuous sample paths and the domain of its infinitesimal operator contains the function space Cc∞ (int(E )) which is the space of infinitely differentiable functions with compact support contained in the interior of the state space E. We can see that the Feller diffusion is defined through the combination of the sample-path properties and the restrictions imposed on the infinitesimal operator. A very convenient property of Feller diffusion is that its infinitesimal operator has an explicit form. According to Kallenberg (2002, Thm. 19.24) and Rogers and Williams (2000, Vol. 1, Thm. III.13.3 and Vol. 2, Ch. V.2), for a Feller diffusion {Xt }, there exist some functions ai,j and bi ∈ C (Rd ) for
d
i, j = 1, . . . , d where ai,j i,j=1 forms a symmetric nonnegative definite matrix such that the infinitesimal operator is
Af (x) =
d −
bi (x)fi′ (x) +
i=1
d 1−
2 i,j=1
ai,j (x)fi,′′j (x)
(2.10)
for f ∈ D(A) and x ∈ Rd . Now we have arrived at the Feller diffusion and its infinitesimal operator which has a closed form. Then what is the relationship between Feller diffusion and SDE-diffusion (2.1)? By Rogers and Williams (2000, Ch. V.2 and V.22), under some regularity conditions, they are equivalent. That is, for a Feller diffusion as in Definition 4, there is a corresponding SDE-diffusion and also a SDE-diffusion like (2.1) is a Feller diffusion, where the function ∑d b(·) are the same and a = σ σ T , i.e., aij (x) = k=1 σi,k (x)σj,k (x). Therefore, the SDE-diffusion which has been analyzed extensively in continuous time finance and which belongs to the class of Feller process also has (2.10) as the closed-form infinitesimal operator. To illustrate the relationship between infinitesimal operator and drift and diffusion terms, let us consider the univariate diffusion defined as dXt = b(Xt )dt + σ (Xt )dWt with Wt a 1-dimensional standard Brownian motion in R, b : E ⊂ R → R a drift function and σ : E → R a diffusion function. Then by (2.10) and the discussion above, the infinitesimal operator for this univariate diffusion is
Af (x) = b(x)f ′ (x) +
1
σ 2 (x)f ′′ (x). (2.11) 2 Clearly the first term involving the first derivative of function f (·) is related to the dynamics of drift and the second term involving the second derivative of function f (·) to the dynamics of diffusion function. This is consistent with the intuition that drift describes the dynamics of mean and the diffusion describes that of variance of the process (see Nelson, 1990) for more discussion which proves that the diffusion process is the approximation of an ARCH process). However, we have to point out that it is not absolutely right to simply think of drift and diffusion terms as the continuous time counterparts of conditional mean and variance respectively. Consider the infinitesimal changes of this univariate diffusion process. By (2.9) and (2.11), for any f ∈ D(A), it holds P-a.s. that E
f (Xt +h ) − f (Xt ) Ft = b(Xt )f ′ (Xt ) + 1 σ 2 (Xt )f ′′ (Xt ) h 2 + o(h),
as h ↓ 0.
(2.12)
Therefore, the dynamics of {Xt } are characterized completely by the drift and diffusion coefficients, including the conditional probability law. But in discrete time series models, the mean and variance solely cannot determine the complete conditional probability law unless it is Gaussian. In fact, the conditional mean of the process {Xt }, E [Xt +h |Xt ] for a fixed h > 0 is in general a function of both the drift b(·) and diffusion σ (·) instead of the drift solely (see Ait-Sahalia, 1996a).
From the discussions above, for a SDE-diffusion which is also a Feller diffusion, there are at least two characterizations we can use to identify the whole dynamics of the process: the transition function and the closed-form infinitesimal operator which is also the generator of the third characterization, semi-group of operators. The former, also well known as transition density when the density of transition function exists, has been the primary tool to analyze the diffusion process, not only in estimation (see Ait-Sahalia, 2002) but also in the construction of specification tests (see Hong and Li, 2005; Ait-Sahalia et al., 2009). However, as we discussed in Section 1, specification of drift and diffusion terms rarely give a closed-form transition density. In contrast, (2.10) and (2.11) tell us that the infinitesimal operator does have a direct and explicit expression and this nice property makes it a convenient tool for analyzing the diffusion process. It has already been used in identification and estimation problems as discussed above and the idea of constructing a specification test for diffusion models comes up naturally. To construct a test of diffusion based on infinitesimal operator, I consider a transformation based on the celebrated ‘‘martingale problems’’. This transformation gives us a martingale characterization for diffusion processes which is not only a complete identification but also very simple and convenient to check. Let me first define the martingale problem (see, Karatzas and Shreve, 1991, Ch. 5.4): Definition 5. A probability measure P on C [0,∞)d , B (C [0,∞)d ) under which
f
Mt = f (Xt ) − f (X0 ) −
t
∫
(Af )(Xs )ds 0
is a martingale for every f ∈ D(A)
(2.13)
is called a solution to the martingale problem associated with the operator A. How is the martingale problem related to the SDE-diffusion? As we know, SDE has two types of solutions: strong solutions and weak solutions (see Karatzas and Shreve (1991), Ch. 5.2-3 or Rogers and Williams (2000), Ch. V.2-3 for details). Intuitively, the strong solution is a solution to SDE with a.s. properties and a weak solution is the solution to SDE with in law properties. When the drift and diffusion terms of a SDE satisfy the Lipschitz and linear growth conditions, there is a strong solution to the SDE. But for general drift and diffusion terms, a strong solution may not exist; in this case, probabilists usually attempt to solve the SDE in the ‘‘weak’’ sense of finding a solution with the right probability law. The martingale problem is a variation of this ‘‘weak solution approach’’ developed by Stroock and Varadhan (1969) and is in fact equivalent to the weak solution of a SDE as shown by the following: Theorem 1. The process {Xt } is a weak solution to the SDE (2.1) if and only if it satisfies the martingale problem of Definition 5 with A as the infinitesimal operator of {Xt } defined in (2.10). Now we have shown that the weak solution of a SDE is equivalent to the martingale problem. When strong solution exists the weak solution will coincide with it. Hence it is enough to consider the weak solution identification for doing econometric inference because regularity conditions for the existence of strong solution are usually satisfied and thus imposed in analysis (see Protter, 2005 for some regularity Lipschitz conditions for the existence and uniqueness of a strong solution to a SDE). By Theorem 1 and (2.13), the correct specification of a SDEdiffusion is equivalent to whether the martingale problem is
Z. Song / Journal of Econometrics 162 (2011) 189–212
satisfied, implying that the hypotheses of interest H0 in (2.3) versus HA in (2.4) can be equivalently written as: t f H0 : For some θ0 ∈ Θ , Mt (θ0 ) = f (Xt ) − f (X0 ) − 0 (Aθ0 f )(Xs )ds is a martingale for every f ∈ D(A), where
Aθ0 f (x) =
d −
bi (x; θ0 )fi′ (x) +
i=1
aij (x; θ0 ) =
d −
d 1−
2 i,j=1
ai,j (x; θ0 )fi,′′j (x) and (2.14)
σi,k (x; θ0 )σj,k (x; θ0 ).
same and do not pay much attention to their difference in this study.5 Actually, by imposing certain regularity conditions, a local martingale can become a martingale (see Protter, 2005, for such technical conditions). I do not explore them here since it is not the focus of this study and could certainly distract the attention. To sum up, Theorem 2 implies that the hypotheses of interest H0 in (2.14) can be equivalently written as: H0 : For some θ0 ∈ Θ x Mt i
(θ0 ) =
k =1 f
d −
bi (x; θ )fi′ (x) +
i =1
aij (x; θ ) =
d −
Xti
−
t
∫
bi (Xs ; θ0 )ds
−
d 1−
2 i,j=1
ai,j (x; θ )fi,′′j (x)
t 0
(Aθ f )(Xs )ds
x ,x
Mt i i (θ0 ) =
− X0 ∫ t d − i 2 − 2bi (Xs ; θ0 )Xs + σi,k (Xs ; θ0 ) ds Xt
0
(2.15)
xi ,xj
Mt
x ,x j
(θ0 ) = Xt i
x ,x j
− X0 i
∫ t −
bi (Xs ; θ0 )Xsj + bj (Xs ; θ0 )Xsi
0
+
Now we have transformed the correct specification hypothesis of a multivariate time-homogeneous diffusion into a martingale hypothesis for some new processes based on the infinitesimal operator and martingale problems which is very convenient to check. Observe from (2.14) that what we have to do is only to f check the martingale property for the transformed processes Mt for every f ∈ D(A). However, there are usually an infinite number of functions f (·) in the domain D(A) which are usually called test functions (note that D(A) contains the function space Cc∞ (int(E )) as a subset for Feller diffusion defined in Definition 4). Hence we unfortunately have to check the martingale property for infinitely f many processes {Mt } for test function f ∈ D(A). It is definitely impossible in practice and we need a subclass of D(A) which not only consists of finitely many function forms but also plays the same role as D(A) does. Luckily, the following celebrated theorem gives an equivalent subclass by which a practical test procedure can be constructed easily.3 Theorem 2. The process {Xt } is a weak solution to the SDE in (2.1) if it satisfies the martingale problem of Definition 5 with A as the infinitesimal operator of {Xt } for the choices f (x) = xi and f (x) = xi xj with 1 ≤ i, j ≤ d. At first glance, this result may appear confusing because f (x) = xi and f (x) = xi xj do not belong to D(A) which is a subset of C0 (Rd ). To get an intuition for this important result, let me choose (K ) (K ) ∞ d sequences {gi }∞ K =1 and {gij }K =1 in function space C0 (R ) such (K )
g
(K )
that gi (x) = xi and gij = xi xj for ‖x‖ ≤ K . If M gi and M ij are martingales, then M xi and M xij are local martingales. A similar result to Theorem 1 with local martingale replacing martingale then tells us that {Xt } is a weak solution to the SDE in (2.1). Of course, the converse of Theorem 2 only holds with local martingale replacing martingale. However, since examples which are local martingales but not martingales are few and too artificial in certain sense even when they exist,4 I regard them as almost the
3 We can also reduce the space of test functions to an equivalent subclass by the method considered in Kanaya (2007) which is based on the concept of a core and ‘‘approximation’’ theory. Since my reduced space of test functions constructed by Theorem 2 is much more simple and intuitive than that in Kanaya (2007), I do not use that method here. Also see Hansen and Scheinkman (1995) and Conley et al. (1997) for more discussions about choices of test functions. 4 See Karatzas and Shreve (1991, p. 168 and 200–201) for some examples which are local martingales but not martingales.
k=1
and
σi,k (x; θ )σj,k (x; θ ).
(K )
i 2
i 2
k=1
(K )
X0i
0
Versus HA : For all θ ∈ Θ , Mt (θ ) = f (Xt ) − f (X0 ) − is not a martingale for some f ∈ D(A), where
Aθ f (x) =
193
d 1−
2 k=1
σi,k (Xs ; θ0 )σj,k (Xs ; θ0 ) ds for i ̸= j. (2.16)
This greatly simplifies the hypothesis and makes the testing of the specification completely practical. Note that my hypothesis of correct specification can be expressed explicitly by the drift and diffusion terms. Therefore, any specification of the diffusion model can be tested directly without computation of transition density and the asymptotic distribution is completely free of estimation √ uncertainty as long as the estimator is n-consistent. In contrast, the transition density based methods like Hong and Li (2005) or Ait-Sahalia et al. (2009) have to approximate the model-implied transition density because the transition density hardly has a closed form. To see the information contained in different transformed processes and prepare for the discussions of separate inference in Section 4 illustrated using univariate diffusion models, we state the specification hypothesis corresponding to (2.16) for univariate models with infinitesimal operator defined by (2.11): H0 : For some θ0 ∈ Θ Mtx (θ0 ) = Xt − X0 −
t
∫
b(Xs ; θ0 )ds 0
2
Mtx (θ0 ) = Xt2 − X02 −
∫ t
2b(Xs ; θ0 )Xs + σ 2 (Xs ; θ0 ) ds
0
are both martingales.
(2.17)
For the convenience of constructing a test procedure, I further state the following equivalent hypotheses of correct specification in terms of the m.d.s. property for the transformed processes. H0 : For some θ0 ∈ Θ , E [Zt (θ0 )|It ′ ] = 0 for any t ′ < t, where It ′ = σ {Xt ′′ }t ′′
x
i Zti (θ0 ) = Mt i (θ0 ) − Mt − ∆ (θ0 )
=
Xti
−
Xti−∆
∫
t
− t −∆
bi (Xs ; θ0 )ds
5 When the difference really matters, the local martingale property can be used in the specification testing of diffusion models. The idea is to use the fact that the timechanged continuous local martingale by quadratic variation is a standard Brownian Motion (see Andersen et al., 2007 and Park, 2008 for details). Since this approach is closely related to time-dependent diffusion models and the test procedure will be very different, I do not pursue it here. But the research on it is being investigated and will be reported soon.
194
Z. Song / Journal of Econometrics 162 (2011) 189–212
i,i
xx
i 2
xx
i i Zt (θ0 ) = Mt i i (θ0 ) − Mt − ∆ (θ0 ) = Xt
t
∫
− t −∆ i,j Zt
xi xj Mt
2bi (Xs ; θ0 )Xsi +
+
2 k=1
σi,k (Xs ; θ0 )2 ds
k=1
(θ0 ) −
d 1−
2
xi xj Mt −∆
(θ0 ) ∫ = Xti Xtj − Xti−∆ Xtj−∆ −
(θ0 ) =
d −
− Xti−∆
t t −∆
bi (Xs ; θ0 )Xsj + bj (Xs ; θ0 )Xsi
σi,k (Xs ; θ0 )σj,k (Xs ; θ0 ) ds,
i ̸= j.
(2.18)
3. Test procedure based on multivariate generalized spectral derivative In this section, I shall construct a test procedure of the correct specification hypotheses H0 versus HA in (2.14) and (2.15) for the multivariate diffusion process. The sample data is discrete in time, i.e., {Xτ ∆ }nτ =1 observed over a time span T with sampling interval ∆ and sample size n = T /∆. Therefore, the process is in continuous time but the data sample is discrete. This is a general problem in continuous time series econometrics not only for testing but also for estimation (see Lo, 1988 and Ait-Sahalia, 1996a,b) for discussions about the estimation of the discretized version of a continuous time model). Like Ait-Sahalia (1996b) and Hong and Li (2005), I will consider the discrete time implications of the m.d.s. property which is derived in continuous time.6 The asymptotic scheme I use here is n = T /∆ → ∞. It can be obtained by either infill (∆ → 0) or long span (T → ∞)7 instead of both and this implies that my test procedure can be applied to both highfrequency and low-frequency data. In contrast, many other papers like Stanton (1997), Bandi and Phillips (2003) and Kanaya (2007) assume ∆ → 0 and hence can only be used for high-frequency data. The null hypothesis is that E [Zt (θ0 )|It ′ ] = 0 for any t ′ < t, where It ′ = σ {Xt ′′ }t ′′
Z
for any t ′ < t , where ItZ′ = σ {Zt ′′ (θ0 )}t ′′
where ItZ′ is the sigma-field generated by past information of {Zt (θ0 )}.8 Since the sample data we have is {Xt , t = τ ∆}nτ =0 with n = T /∆, an application of the Law of Iterated expectation as well as (3.1) implies that E Zτ ∆ (θ0 )|IτZ −1 = 0,
where
Iτ −1 = σ {Z(τ −1)∆ (θ0 ),
Z(τ −2)∆ (θ0 ), . . . , Z∆ (θ0 ), Z0 (θ0 )}.
Z
(3.2)
Observe that (3.2) is a m.d.s. property for discrete time process {Zτ ∆ (θ0 )}nτ =1 and it is derived as an implication of the m.d.s. property in continuous time instead of a result from the
6 The discretization can be justified by Zähle (2008) who proves rigorously that the discrete time processes solving the discrete analogue of the martingale problem approximate weakly the solution of the stochastic differential equation under additional assumption on the moments of the increments. 7 Bandi and Phillips (2003) argued that both the infill and long-span assumptions are needed to estimate continuous time (diffusion) process fully nonparametrically. 8 I ′ can still be used here and this actually simplifies the test statistic M (p ) t
0
in (3.9) because {Xt } is only d-dimensional while {Zt } is a d′ -dimensional process. The test procedure constructed this way can be called a generalized cross-spectral derivative approach. I do not follow this direction here since the performances of the tests are expected to be very similar due to the close relationship between {Xt } and {Zt } which can be seen from (2.20).
discretization of the continuous time process. In this respect, it is similar to the approaches of Ait-Sahalia (1996a,b) and Lo (1988) which deal with estimation problems and therefore is free of the discretization errors which are discussed in Lo (1988) in the context of estimation. Moreover, this property is also the reason why my test procedure based on (3.2) is only assuming n = T /∆ → ∞ for asymptotic theory and applicable to both low- and high-frequency data. As discussed above, my test procedure will be based on checking whether (3.2) is true or not. However, it is not a trivial task to check this. First, the conditioning information set IτZ −1 has an infinite dimension as τ → ∞ and then there is a ‘‘curse of dimensionality’’ difficulty associated with testing the m.d.s. property. Second, {Zτ ∆ (θ0 )} may display serial dependence in its higher order conditional moments. Any test should be robust to time-varying conditional heteroskedasticity and higher order moments of unknown form in {Zτ ∆ (θ0 )}. To check the m.d.s. property of {Zτ ∆ (θ0 )}, I extend Hong’s (1999) generalized spectral approach to a multivariate generalized spectral derivative method. The idea is similar to Hong and Lee (2005) which considers testing time series conditional mean models with no prior knowledge of possible alternatives. The difference is that here the process I check for m.d.s. property is transformed explicitly from the original process while the process Hong and Lee (2005) check is the estimated residuals from a conditional mean model. Furthermore, the process {Zτ ∆ (θ0 )} is multivariate but that in Hong and Lee (2005) is only univariate. Therefore, the problem here is more complicated and we need to extend the generalized spectral approach to an multivariate one while keeping the property of being free of ‘‘curse of dimensionality’’. This can be regarded as another contribution of this paper. Suppose {Zτ } is a strictly stationary process with marginal char′ acteristic function ϕ(u) = E (eiu Zτ ) and pairwise joint character√ iu′ Zτ +iv ′ Zτ −|m| istic function ϕm (u, v) = E (e ), where i = −1, u, ′ v ∈ Rd , and m = 0, ±1, . . . . The basic idea of the generalized spectrum is to consider the spectrum of the transformed series ′ {eiu Zτ }. It is defined as f (ω, u, v) ≡
1
∞ −
2π
m=−∞
σm (u, v)e−imω ,
ω ∈ [−π , π]
where ω is the frequency, and σm (u, v) is the covariance function of the transformed series:
σm (u, v) ≡ cov(eiu Zτ , eiv Zτ −|m| ), ′
′
m = 0, ±1, . . . .
(3.3)
Note that the function f (ω, u, v) is a complex-valued scalar function although Zτ is a d′ × 1 vector. It can capture any type of pairwise serial dependence in {Zτ }, i.e., dependence between Zτ and Zτ −m for any nonzero lag m, including that with zero autocorrelation. First, this is analogous to the higher order spectra (Brillinger and Rosenblatt, 1967) in the sense that f (ω, u, v) can capture the serial dependence in higher order moments. However, unlike the higher order spectra, f (ω, u, v) does not require existence of any moment of {Zτ }. This is important in economics and finance because it has been argued that the higher order moments of many financial time series may not exist. Second, this can capture nonlinear dynamics while maintaining the nice features of spectral analysis, especially its appealing property to accommodate information in all lags. In the present context, it can check the m.d.s. property over many lags in a pairwise manner, avoiding the ‘‘curse of dimensionality’’ difficulty. This is not achievable by other existing tests in the literature which only check a fixed lag order. The generalized spectrum f (ω, u, v) itself cannot be applied directly for testing H0 , because it will capture the serial dependence not only in mean but also in higher order moments. However,
Z. Song / Journal of Econometrics 162 (2011) 189–212
just as the characteristic function can be differentiated to generate various moments of {Zτ }, f (ω, u, v) can be differentiated to capture the serial dependence in various moments. To capture (and only capture) the serial dependence in conditional mean, one can consider the derivative:
∂ f (ω, u, v)|u=0 f (0,1,0) (ω, 0, v) ≡ ∂u ∞ 1 − (1,0) = σm (0, v)e−imω , 2π m=−∞
which can be consistently estimated by 1 (1,0) (0,1,0) σ (0, v), f0 (ω, 0, v) = 2π 0
ω ∈ [−π , π]
∂ ′ σm (u, v)|u=0 = cov(iZτ , eiv Zτ −|m| ) ∂u
(3.4)
(3.7) The estimators f (0,1,0) (ω, 0, v) and f0 (ω, 0, v) converge to the same limit under H0 and generally converge to different limits under HA . Thus, any significant divergence between them is evidence of the violation of the MDS property and hence of the misspecification of the process. We can measure the distance (0,1,0) between f (0,1,0) (ω, 0, v) and f0 (ω, 0, v) by quadratic form:
′
n −1 1 − f (0,1,0) (ω, 0, v) ≡ (1 − |m|/n)1/2 k(m/p) 2π m=1−n
× σm(1,0) (0, v)e−imω , (1,0)
0
′
where σm (0, v) ≡ ϕm (u, 0) ϕm (0, v), and
ϕm (u, v) =
n −
1
= ϕm (u, v) −
eiu Zτ +iv Zτ −|m| . ′
n − |m| τ =|m|+1
′
(3.5)
Here, p = p(n) is a bandwidth, and k : R → [−1, 1] is a symmetric kernel. Examples of k(·) include Bartlett, Daniell, Parzen and Quadratic spectral kernels (e.g. Priestley, 1981, p. 442). The factor (1 − |m|/n)1/2 is a finite sample correction and could be replaced by unity. Under certain conditions, f (0,1,0) (ω, 0, v) is (0,1,0) consistent for f (ω, 0, v). See Theorem 3. ′ (1,0) Under H0 , we have σm (0, v) = 0 for all v ∈ Rd and all m ̸= 0. Consequently, the generalized spectral derivative f (0,1,0) (ω, 0, v) becomes a ‘‘flat spectrum’’ as a function of frequency ω: (0,1,0)
f0
(ω, 0, v) ≡
1 2π
σ0(1,0) (0, v) = ′
ω ∈ [−π , π] and v ∈ Rd
1 2π
=
n −1 −
k2 (m/p)(1 − m/n)
∫
(1,0) 2 σm (0, v) dW (v)
(3.8)
m=1
where the second equality follows by Parseval’s identity and
∏d′
+ W (v) = a nondecreasing c =1 W0 (vc ) with W0 : R → R weighting function that weighs sets symmetric about the origin equally. Examples of W0 (·) include the CDF of any symmetric probability distribution, either discrete or continuous. My proposed omnibus test statistic for correct specification hypothesis is an appropriately standardized version of Q,
0 (p) = M
n−1 −
k2 (m/p)(n − m)
∫
(1,0) 2 σm (0, v)
m=1
× dW (v) − C0 (p) D0 (p) where
C0 (p) =
n−1 −
k (m/p) 2
m=1
D0 (p) = 2
n−2 − n−2 −
∫ n −1 − 2 2 ψτ −m (v) dW (v) Zτ
1
n − m τ =m+1
k2 (m/p)k2 (l/p)
m=1 l=1
d′ − d′ ∫ ∫ n − − 1 τ −m (v) × Zaτ Za′ τ ψ n − max(m, l) τ =max(m,l)+1 a=1 ′ a =1
ω ∈ [−π , π] and v ∈ Rd ∂ σ (u, v)|u=0 , σm (u, v) ∂u m
2 (0,1,0) (0,1,0) (ω, 0, v) − f (ω, 0, v) dωdW (v) f
−π
′
exp iv1′ Zτ −m + iv2′ Zτ −l , where v = (v1 , v2 ) ∈ Rd × Rd . In the present context, I suppress ∆ and θ0 and then let Zτ ≡ Zτ ∆ (θ0 ) for the simplification of notations. Obviously, Zτ cannot be observed. We can first estimate the parameter θ0 by the √ random sample {Xτ ∆ }nτ =1 to get a n-consistent estimator. Then the estimated processes Zτ = Zτ ∆ ( θ ) is obtained. Examples of θ are approximated transition density based estimator in Ait-Sahalia (2002), simulated MLE in Brandt and Santa-Clara (2002) and so on. Then we can estimate f (0,1,0) (ω, 0, v) for process {Zτ (θ )} by the following smoothed kernel estimator:
π
∫∫ Q ≡
(1,0)
is a d′ × 1 vector. The measure σm (0, v) checks whether the autoregression function E [Zτ | Zτ −m ] at lag order m is zero. Under ′ (1,0) some regularity conditions, σm (0, v) = 0 for all v ∈ Rd if and only if E [Zτ | Zτ −m ] = 0, a.s. It should be noted that the hypothesis of E [Zτ (θ ) | Iτ −1 ] = 0 a.s. is not exactly the same as the hypothesis of E [Zτ | Zτ −m ] = 0 a.s. for all τ > 0. The former implies the latter but not vice versa. There exists a gap between them. This is the price we have to pay to deal with the difficulty of the ‘‘curse of dimensionality’’. Nevertheless, the examples for which E [Zτ | Zτ −m ] = 0 a.s. for all τ > 0 but E [Zτ (θ ) | Iτ −1 ] ̸= 0 a.s. may be rare in practice and are thus pathological. Even in cases for which the gap does matter, it can be further narrowed down by using the function E [Zτ | Zτ −m , Zτ −l ] which may be called the bi-autoregression function of Zτ at lags (m, l). An equivalent measure is the general (1,0) ized third order central cumulant function σm,l (0, v) = cov Zτ ,
′
ω ∈ [−π , π] and v ∈ Rd . (0,1,0)
where
σm(1,0) (0, v) ≡
195
cov iZτ , eiv Zτ ′
, (3.6)
2 ∗ × ψτ −l (u) dW (u)dW (v)
(3.9)
iv Zτ τ (v) = eiv Zτ − n−1 and ψ . Throughout, all unspecified τ =1 e integrals are taken on the support of W (·). The factors C0 (p) and D0 (p) are approximately the mean and the variance of quadratic form n Q . The impact of conditional heteroskedasticity and other time-varying higher order conditional moments has already 0 (p) involves d′ - and 2d′ been taken into account. Note that M dimensional numerical integrations which can be computationally cumbersome when d′ is large. In practice, one may choose a finite number of grid points symmetric about zero or generate a finite ′ number of points drawn from a uniform distribution on [−1, 1]d . The asymptotic theory allows for both discrete and continuous weighting function for W0 (·) which weigh sets symmetric about zero equally. A continuous weighting function for W0 (·) will 0 (p), but there is a trade-off between ensure good power for M computational cost and power gains when choosing a discrete or continuous weighting function. One may expect that the power of 0 (p) will be ensured if sufficiently fine grid points are used. M ′
∑n
′
196
Z. Song / Journal of Econometrics 162 (2011) 189–212
b(Xt , θ )dt + σ (Xt )dWt with θ ∈ Θ and Θ a finite-dimensional parameter space, then the identification assumption is
4. Separate inference When a model is rejected using the test procedure above which checks the joint specification of both drift and diffusion terms, it would be interesting to explore possible sources of the rejection. Specifically, is the rejection due to the misspecified form of drift function or the diffusion function? Having this information in hand, one can try other parametric forms of drift, diffusion or both. This is particularly important when economic theory provides little guidance about the specification of the drift and diffusion, which is usually the case in practice. While the separate specification testing of drift or diffusion is so important, surprisingly only several papers are available in this respect and most of them are just for univariate models (Li, 2007; Corradi and White, 1999; Kristensen, 2008; Bandi and Phillips, 2007). Although some can be extended to test multivariate models, they depend on a nonparametrically smoothed drift or diffusion function, which is subject to the ‘‘curse of dimensionality’’. Chapman and Pearson (2000), by a striking simulation study, find that the finite sample bias of the truncation of a distribution makes the smoothed nonparametric kernel estimation of drift highly unreliable. Consequently, the test procedure by comparing the parametric drift and the nonparametric smoothing counterpart is not reliable at all. Moreover, to do the nonparametric estimation of drift or diffusion terms, high-frequency data with sampling interval going to zero is needed (see Stanton, 1997), which may not be a valid assumption for applications to daily interest rates. Since the infinitesimal operator has a closed form in terms of drift and diffusion terms, the martingale based identification of diffusion process proposed here has the potential to do the separate inference in order to explore possible sources of the rejection. For simplicity, I first consider the univariate diffusion model with infinitesimal operator defined by (2.11). The extension to multivariate cases is straightforward and presented later in this section. By (2.17), the identification of the model is equivalent to the martingale property: Mtx = Xt − X0 −
∫
t
b(Xs )ds
(4.1)
0
and 2 Mtx
=
Xt2
−
X02
t
∫
b(Xs )Xs ds −
−2 0
∫
t
σ 2 (Xs )ds
(4.2)
0
are both martingales. Observe that the first transformed process 2
Mtx only involves the drift term and the second Mtx has both the drift and diffusion terms as inputs. Intuitively, Mtx characterizes the dynamics of the drift term solely and this characterization is robust t to the dynamics of diffusion term. Note also that 0 σ 2 (Xs )ds is the so-called ‘‘integrated volatility’’ or the quadratic variation [X , X ]t of the process {Xt } which has received extremely intensive attention in recent years (see Andersen et al., 2003; Barndorff-Nielsen and 2
Shephard, 2004; Ait-Sahalia et al., 2005). Therefore Mtx contains the dynamics of diffusion term, i.e., the volatility of the process
t
2
illustrated by 0 σ 2 (Xs )ds. Furthermore, Mtx also characterizes the interaction between drift and diffusion terms which is represented t by 0 b(Xs )Xs ds because b(Xs )Xs will raise the power of Xs at least to 2 and hence variance will also appear in this term. Since the characterization for the dynamics of the drift term by the martingale property of Mtx is robust to the dynamics of diffusion term, it is conceivable that we can check the specification of the drift term robust to diffusion misspecification if we further assume the drift term is identified by this characterization (4.1). Explicitly, suppose {Xt } follows a univariate diffusion model given by dXt =
Assumption 4.1. There exists a unique θ0 ∈ Θ such that Mtx (θ ) = Xt − X0 −
t 0
b(Xs ; θ )ds is a martingale.
Actually, under this assumption (this is equivalent to Assumption 2.1 in Park (2008)), Park (2008) proposes a so-called ‘‘conditional mean model of instantaneous change for a given stochastic process’’9 which is exactly the same as Mtx here.10 Such a model framework, with no diffusion term involved, is actually a very general setup, covering diffusion, jump diffusion and even stochastic volatility models.11 By assuming the drift term is identified by the martingale property of Mtx , i.e., Assumption 4.1, a specification test for drift term can be constructed which is robust to diffusion term misspecification. The null hypothesis is the correct specification of drift term: H0 : P [b(Xt , θ0 ) = b0 (Xt )] = 1,
for some θ0 ∈ Θ where b0 (·)
is the true drift function which is equivalent to H0 : For some θ0 ∈ Θ , Mtx t
∫
b(Xs ; θ )ds is a martingale.
= Xt − X0 −
(4.3)
0
Following the same reasoning as that for (3.2), I can test H0 in (4.3) by checking the following m.d.s. property: E Yτ ∆ (θ0 )|IτY −1 = 0,
where
Iτ −1 = σ {Y(τ −1)∆ (θ0 ), Y(τ −2)∆ (θ0 ), . . . , Y∆ (θ0 ), Y0 (θ0 )} Y
9 Caution is needed for these terminologies. The instantaneous conditional mean for continuous time stochastic processes are different from the conditional mean for discrete time models. As discussed earlier, for instance, in a general diffusion process, the conditional mean of Xt +∆ given Xt is usually a function not of drift solely but of both drift and diffusion terms jointly. See Ait-Sahalia (1996a) and discussions below (2.12) in Section 2. 10 Park (2008) only proposes the instantaneous conditional mean model and claims that his model covers the diffusion process as a special case without giving the corresponding conditions. My infinitesimal operator based martingale characterization, however, provides the mathematical conditions that the instantaneous conditional mean in Park’s (2008) model is equal to the drift a diffusion process. They can further illustrate how much information of model dynamics is lost by only considering the Mtx (θ) process in a full diffusion model. Moreover Park’s (2008) identification of drift can be regarded as a special case of the infinitesimal operator based martingale characterization in the case of diffusion processes. The reason is that (4.1) which is also Park’s (2008) identification assumption is derived using a special choice of function forms (see Theorem 2 and discussion therein for details). If interesting function forms other than f (x) = xi and xi xj in Theorem 2 are suitably chosen, we may get other convenient and intuitive characterizations of diffusion processes. 11 A drawback of such a general framework is that it does not utilize all the information available when the underlying data generating process is known to be a diffusion model. If we are interested in testing the specification of the drift term and we know the DGP is a diffusion model, a test based on checking the martingale property of Mtx (θ) is not omnibus. Since the identification not only involves Mtx (θ) 2
in (4.1) but also involves Mtx in (4.2), it could be the case that Mtx (θ) is a martingale 2 Mtx
but (θ, σ (·)) is not. In this case, the test which only checks the martingale property of Mtx (θ) cannot reject the null hypothesis although it should be rejected. This under-rejection may lead to misleading conclusion about the specification of diffusion models. In other words, Assumption 4.1 may be too restricted as an identification assumption and may not hold in many cases if a diffusion model is considered as the underlying process. To deal with this under-rejection problem, 2
Song (2009b) designs a test to check the martingale property of both Mtx and Mtx (θ)
t
by integrating the diffusion term using the integral 0 σ (Xs )ds. Such a test is free of the under-rejection problem; see Song (2009b) for details. 2
Z. Song / Journal of Econometrics 162 (2011) 189–212
where
and
Yτ ∆ (θ0 ) = Xτ ∆ − X(τ −1)∆ −
∫
τ∆ (τ −1)∆
b(Xs ; θ0 )ds.
(4.4)
Let Yτ (θ0 ) ≡ Yτ ∆ (θ0 ) for the simplification of notations. Obviously, Yτ (θ0 ) cannot be observed. We first estimate the parameter θ0 by √ the random sample {Xτ ∆ }nτ =1 to get a n-consistent estimator and then the estimated processes Yτ = Yτ ( θ ) is obtained. Since we are only interested in the specification of drift, it is better for us to use an estimation method which can estimate the parameters in the drift consistently while being robust to the diffusion misspecification. This essentially requires the estimation of a semi-parametric diffusion model with diffusion term unrestricted. Ait-Sahalia (1996a), Kristensen (2008) and Song (2009a,b) are several examples. The test for checking (4.4) is a univariate special case of (3.9), i.e.,
1 (p) = M
n −1 −
k2 (m/p)(n − m)
∫
m=1
1
n −1 −
n − m τ =m+1
Yτ2
∫
2 ψτ −m (v) dW (v)
n −2 n −2
D1 (p) = 2
−−
2 k=1
Then we have i i Zτi (θ ) = Xτi ∆ − X(τ −1)∆ + g (τ , θ )
(5.3)
i i ,j and Zτi,j (θ ) = Xτi ∆ Xτ ∆ − X(τ −1)∆ X(τ −1)∆ + g (τ , θ ). j
j
(5.4)
Assumption A.2. For each sufficiently large q, there exists a strictly stationary process {Zq,τ } measurable with respect to the sigma-field generated by {Zτ −1 , Zτ −2 , . . . , Zτ −q } such that as q → ∞, Zq,τ is independent of {Zτ −q−1 , Zτ −q−2 , . . .} for each τ , E [Zq,τ | Iτ −1 ] = 0, a.s. where Iτ −1 is the information set at time (τ − 1)∆ that may contain lagged random variables {X(τ −m)∆ , m > 0} from original process and lagged random variables {Z(τ −m)∆ , m > 0}
where k2 (m/p)
(5.2)
Assumption A.1. {Xt } is a strictly stationary time series such that µ = E [Xt ] exists a.s., and E [‖Zτ ‖4 ] 5 C .
× dW (v) − C1 (p) D1 (p)
n −1 −
∫ t d − i 2 2bi (Xs ; θ0 )Xs + σi,k (Xs ; θ0 ) ds − t −∆ k=1 ∫ τ∆ i ,j g (τ , θ ) = − bi (Xs ; θ )Xsj + bj (Xs ; θ )Xsi (τ −1)∆ d + 1 −[σ (X ; θ )σ (X ; θ )] ds. i ,k s j ,k s
To derive the null asymptotic distribution the test statistic 0 (p) in Eq. (3.9), the following regularity conditions are imposed. M
(1,0) 2 σm (0, v)
j =1
C1 (p) =
197
k2 (m/p)k2 (l/p)
2 5 Cq−κ for some 4 constant κ = 1, and E Zq,τ 5 C for all large q.
m=1 l=1
from the transformed process, E Zτ − Zq,τ
2 ∫ ∫ n − 1 τ −m (v)ψ τ −l (u) × Yτ2 ψ n − max(m, l) τ =max(m,l)+1 × dW (v)dW (u)
(4.5)
Assumption A.3. With probability one, both g i (τ , ·) and g i,j (τ , ·) are continuously twice differentiable with respect to θ ∈ Θ
τ (v) = eivYτ − and ψ ϕ (v), and ϕ (v) = n−1 τ =1 eiuYτ and all the terms are defined correspondingly for univariate case similar to multivariate case. Throughout, all unspecified integrals are taken on the support of W (·). The test of the drift in a multivariate diffusion model is a straightforward extension by x replacing Mtx (θ ) with Mt i (θ0 ) with i = 1, . . . , d in Assumption 4.1, the transformation (4.4) and the test (4.5). 1 (p) test is very convenient for It can be observed that the M multivariate models which are difficult for other methods like Li (2007) and Kristensen (2008). It is constructed in exactly the same framework as the proposed test for joint dynamics in (3.9). In 0 (p) and M 1 (p), is developed to other words, a unified procedure, M first check the specification for the joint dynamics and then gauge sources of rejection in order to build a more accurate model for financial variables. Although Chen and Hong (2010) also design tests for specifications of various conditional moments, we note that they are only for the resulting discrete time series rather than the continuous time models. Specifically, the first two conditional moments they consider are different from the drift and diffusion terms unless ∆ → 0 which is not a safe assumption for such applications as term structure of interest rates.
4 ∂ i 2 2 and E supθ∈Θ ∂θ g (τ , θ ) 5 C , E supθ∈Θ ∂θ∂∂θ ′ g i (τ , θ ) 5 C ,
5. Asymptotic theory
(a)
∑n
5.1. Asymptotic distribution Let g (τ , θ ) = − i
∫
τ∆ (τ −1)∆
bi (Xs ; θ )ds
(5.1)
∂ i ,j 4 2 2 E supθ∈Θ ∂θ g (τ , θ ) 5 C , and E supθ ∈Θ ∂θ∂∂θ ′ g i,j (τ , θ ) 5 C .
θ − θ0 = Op (n−1/2 ), where θ0 = p lim( θ) ∈ Θ. Assumption A.4. Assumption A.5. k : R → [−1, 1] is symmetric and is continuous at (0, 0) and all but a finite number of points, with k(0) = 1 and |k(z )| 5 C |z |−b for large z and some b > 1. ′
Assumption A.6. W : Rd → R+ is nondecreasing and weighs sets symmetric about zero equally, with ‖v‖4 dW (v) 5 C . iv Zτ − ϕ(v) with ϕ(v) = E [eiv Zτ ] Assumption A.7. Put ψτ (v) =e ′
a a′
and σ (a, a′ ) = E Zτ Zτ
′
for a, a′ = i, ij and i, j = 1, . . . , d.
(Note here ij does not denote the product between i and j but an index equivalent ∂ i to (i, j). This notation ∂ i,j applies to the whole g (τ , θ0 ), Zτ and ∂θ g (τ , θ0 ), Zτ are strictly paper.) Then ∂θ stationary processes such that:
∂ a ∑∞ ∂ a 5 C for a = m=1 Cov ∂θ g (τ , θ0 ), ∂θ g (τ − m, θ0 ) i, (i, j) and i, j = 1, . . . , d; ∑∞ (b) =1 sup(u,v)∈R2d′ |σm (u, v)| 5 C ; ∂ a ∑m ∞ ′ Cov (c) g (τ , θ0 ), ψτ −m (v) 5 C for a = m=1 supv∈Rd ∂θ i, (i, j) and i, j = 1, . . . , d; ∑∞ ′ ′ ′ (d) m,l=1 sup(u,v)∈Rd E [((Zτ ,a Zτ ,a )−σ (a, a ))ψτ −m (u)ψτ −l (v)] 5 C for a, a′ = i, (i, j) and i, j = 1, . . . , d;
198
(e)
Z. Song / Journal of Econometrics 162 (2011) 189–212
κm,l,r (v) 5 C , where κm,l,r (v) is the
∑∞
′ m,l,r =−∞ supv∈Rd fourth order cumulant of the joint distribution of the process
∂ a ∂ a g (τ , θ0 ), ψτ −m (v), g (τ − l, θ0 ), ψτ∗−r (v) ∂θ ∂θ
(5.5)
for a = i, ij and i, j = 1, . . . , d. Assumptions A.1 and A.2 are regularity conditions on the data generating process (DGP). The strict stationarity on {Xt } is imposed and the existence of the first order moment µ can be ensured by assuming E ‖Xt ‖2 < ∞. Assumption A.2 is required only under H0 . It assumes that the martingale difference sequence (m.d.s.) {Zτ } can be approximated by a q-dependent m.d.s. process {Zq,τ } arbitrarily well when q is sufficiently large. Because {Zτ } is a m.d.s., Assumption A.2 essentially imposes restrictions on the serial dependence in higher order moments of Xτ . Besides, it implies ergodicity for {Zτ }. It holds trivially when {Zτ } is a q-dependent process with an arbitrarily large but finite order q. In fact, this is general enough to cover many interesting processes, for example, a stochastic volatility model with short memory (see Hong and Lee, 2005 for details). Although Assumption A.3 appears in terms of restrictions on g i (τ , ·) and g i,j (τ , ·), it is actually imposing moment regularity conditions on the drift and diffusion terms b(Xτ ; θ0 ) and σ (Xτ ; θ0 ) which can be seen from (5.1) and (5.2). It covers most of the popular univariate and multivariate diffusion processes in both time-homogeneous and time-inhomogeneous cases, for example, Ait-Sahalia (1996a), Ahn and Gao (1999), Chan et al. (1992), Cox et al. (1985) and Vasicek (1977).√ Assumption A.4 requires a n-consistent estimator θ , which may not be asymptotically most efficient. We do not need to know the asymptotic expansion of θ , because the sampling variation in 0 (p). This delivers θ does not affect the limit distributions of M a convenient and generally applicable procedure in practice, because asymptotically most efficient estimators such as MLE or approximated MLE may be difficult to obtain in practice. One could choose a suboptimal, but convenient, estimator in implementing our procedure. Assumption A.5 is a regularity condition on the kernel k(·). It contains all commonly used kernels in practice. The condition of k(0) = 1 ensures that the asymptotic bias of the smoothed kernel estimator f (0,1,0) (ω, 0, v) in (3.5) vanishes as n → ∞. The tail condition on k(·) requires that k(z ) decays to zero sufficiently ∞ fast as |z | → ∞. It implies 0 (1 + z )k2 (z )dz < ∞. For kernels with bounded support, such as the Bartlett and Parzen kernels, b = ∞. For the Daniell and quadratic spectral kernels, b = 1 and 2, respectively. These two kernels have unbounded support, and thus all (n − 1) lags contained in the sample are used in constructing our test statistics. Assumption A.6 is a condition on the weighting function W (·) for the transform parameter v . It is satisfied by the CDF of any symmetric continuous distribution with a finite fourth moment. Finally, Assumption A.7 provides ∂ i some co- variance and fourth order cumulant conditions on ∂θ g (τ , θ0 ), Zτ ∂ i ,j g (τ , θ0 ), Zτ , which restrict the degree of the serial de∂θ ∂ i ∂ i,j pendence in ∂θ g (τ , θ0 ), Zτ and ∂θ g (τ , θ0 ), Zτ . These con-
and
ditions can be ensured by imposing more restrictive mixing and moment conditions on these two processes. However, to cover a sufficiently large class of DGPs, I choose not to do so. 0 (p) I now state the asymptotic distribution of the test statistic M under H0 . Theorem 3. Suppose that hold, and p = cnλ Assumptions A.1–A.7 for c ∈ (0, ∞) and λ ∈ 0, 3 +
0 (p) →d N (0, 1) as n → ∞. M
−1 1 4b−2
. Then under H0 ,
0 (p), the use of the estimated As an important feature of M processes { Zτ } in place of the true processes {Zτ } has no impact 0 (p). One can proceed as if the true on the limit distribution of M parameter value θ0 were known and equal to θ . The reason, as pointed out by Hong and Lee (2005), is that the convergence rate of the parametric parameter estimator θ to θ is faster than that of the nonparametric kernel estimator to f (0,1,0) (ω, 0, v) to 0 (p) is f (0,1,0) (ω, 0, v). As a result, the limiting distribution of M (0,1,0) solely determined by f (ω, 0, v) and replacing θ0 by θ has no impact asymptotically. This delivers a convenient procedure, because no specific estimation method for θ0 is required.12 Of course, parameter estimation uncertainty in θ may have impact 0 (p). In small samples, one on the small sample distribution of M can use a bootstrap procedure to obtain more accurate levels of the tests. 5.2. Asymptotic power My tests are derived without assuming an alternative model to H0 . To gain insight into the nature of the alternatives that my tests 0 (p) are able to detect, I now examine the asymptotic behavior of M under HA . For this purpose, a condition on the serial dependence in {Zτ } is imposed: Assumption A.8.
∑∞
m=1
(1,0)
supv∈Rd′ σm
(0, v) 5 C .
Theorem 4. Suppose Assumptions A.1 and A.3–A.8 hold, and p = cnλ for c ∈ (0, ∞) and λ ∈ (0, 1/2). Then under HA and as n → ∞,
0 (p) (p1/2 /n)M [ ∫ ∞ ]−1/2 − ∞ ∫ p 4 σ (1,0) (0, v)2 dW (v) → 2D k (z )dz m 0
m=1
where D = 2
d′ − d′ −
∫∫∫
π
E |Zaτ Za′ τ |
a=1 a′ =1
× dωdW (u)dW (v).
|f (ω, u, v)|2
−π
(5.6)
The constant D takes into account the impact of the serial ′ dependence in conditioning variables {eiv Zτ −m }, which generally exists even under H0 , due to the presence of the serial dependence in the conditional variance and higher order moments of {Zτ }. Suppose the autoregression function E [Zτ |Zτ −m ] ̸= 0 at some lag m > 0. Then we have
2 (1,0) σm (0, v) dW (v) > 0 for any
weighting function W (·) that is positive, monotonically increasing ′ and continuous, with unbounded support on Rd . As a consequence, limn→∞ P [M0 (p) > C (n)] = 1 for any constant C (n) = o(n/p1/2 ) 0 (p) has asymptotic unit power at any given significance and M level, whenever E [Zτ |Zτ −m ] ̸= 0 at some lag m > 0. 0 (p) diverges to infinity at the rate of Note that under HA , M np−1/2 , which is faster than both the rate npd of a nonparametric transition density based tests like Ait-Sahalia et al. (2009) and Hong and Li (2005) for d = 1 and the rate npd/2 of the characteristic function based test in Chen and Hong (2010). The differences in the divergence rates can actually lead to the conclusion, by a
12 M 0 (p) can actually be used to test the m.d.s. hypothesis for multivariate processes with conditional heteroscedasticity of unknown form. Although other tests like Park and Whang (2003) are available for extensions, the limit distributions depend on the DGP and cannot be tabulated; resampling methods have to be applied to obtain critical values on a case-by-case basis. That is also why I choose the generalized multivariate derivative approach for constructing the test procedure.
Z. Song / Journal of Econometrics 162 (2011) 189–212
0 (p) test standard proof (see Serfling (1980) for details), that the M is asymptotically more powerful than the tests cited above in terms of the Bahadur (1960) asymptotic relative efficiency (ARE) under fixed alternatives.13 Such an advantage is due to the reduction of the dimension from d to 1 by the infinitesimal operator based martingale characterization and the spectral approach for testing the m.d.s. But it should be warned that such a comparison only makes sense when the process is a diffusion under both the null and alternative hypothesis. The transition density based tests of Ait-Sahalia et al. (2009) and Hong and Li (2005) are actually applicable to cases where the process is not a diffusion under HA and hence more general than the proposed test in this sense. Moreover, it should also be emphasized that the power property 0 (p) test is more powerful than any does not mean that the M other existing test against every alternative. In fact, it may be less powerful against certain specific alternatives in finite samples since a wide range of possible alternatives are incorporated. The power performances in the simulation studies of Section 6 show that my test is more powerful in many cases but less powerful against certain alternatives than other tests.
f
(q,1,0)
=
199
(ω, 0, v)
1
n −1 −
2π
m=1−n
(1 − |m|/n)1/2 k(m/p) σm(1,0) (0, v) |m|q e−imω
(5.7)
where the kernel k(·) needs not be the same as the kernel k(·) used in (3.5). Note that f
(0,1,0)
(q,1,0)
(ω, 0, v) is an estimator for f (0,1,0) (ω, 0, v)
and f (ω, 0, v) is an estimator for the generalized spectral derivative f (q,1,0) (ω, 0, v) ≡
1
∞ −
2π
m=−∞
σm(1,0) (0, v) |m|q e−imω .
(5.8)
For the kernel k(·), suppose there exists some q ∈ (0, ∞) such that 1 − k(z ) 0 < k(q) = lim . z →0 |z |q
(5.9)
Then I define the plug-in bandwidth as 1
p0 = c0 n 2q+1
5.3. Data-driven lag order
(5.10)
where the turning parameter estimator A practical issue in implementing our tests is the choice of the lag order p. As an advantage, the smoothing generalized spectral approach can provide a data-driven method to choose p, which, to some extent, lets data themselves speak for a proper p. Before discussing any specific method, I first justify the use of a datadriven lag order, p, say. Here, we impose a Lipschitz continuity condition on k(·).
2q1+1 2 π (q,1,0) 2q(k(q) )2 f (ω 0 , v) d ω dW (v) , −π c0 = ∞ 2 2 −∞ k (z )dz π (0,1,0) (ω, v, −v)dW (v) dω −π f 2q(k(q) )2 = ∞ 2 −∞ k (z )dz
Assumption A.9. |k(z1 ) − k(z2 )| 5 C |z1 − z2 | for any (z1 , z2 ) in R2 and some constant C < ∞. This condition rules out the truncated kernel k(z ) = 1 (|z | 5 1), but it still contains most commonly used nonuniform kernels.
n−1
∑ ×
2
(n − |m|)k (m/p) |m|2q
m=1−n n−1
∑
2 1 (1,0) σm (0, v) dW (v) 2q+1
2 (n − |m|)k (m/p) R(m) ‖ σm (v, −v)‖ dW (v)
m=1−n
Theorem 5. Suppose that Assumptions A.1–A.7 and A.9 and p hold, is a data-driven bandwidth such that p/p = 1 + Op for some β >
2b−1/2 , 2b−1
−
p
3 β−1 2
where b is as in Assumption A.5, and p is
λ a nonstochastic bandwidth with p = cn for c ∈ (0, ∞) and
−1 λ ∈ 0, 3 + d 4b1−2 . Then under H0 , 0 ( 0 (p) →p 0 M p) − M
0 ( and M p) →d N (0, 1).
Hence, the use of p has no impact on the limit distribution 0 ( of M p) as long as p converges to p sufficiently fast and my test procedure enjoys an additional ‘‘nuisance parameter-free’’ property. Theorem 5 allows for a wide range of admissible rates for p. One possible choice is the nonparametric plug-in method similar to Hong (1999, Theorem 2.2) which minimizes an asymptotic integrated mean square error (IMSE) criterion for the estimator f (0,1,0) (ω, 0, v) in (3.5). Consider some ‘‘pilot’’ generalized spectral derivative estimators based on a preliminary bandwidth p: f
(0,1,0)
(ω, 0, v) n−1 1 − = (1 − |m|/n)1/2 k(m/p) σm(1,0) (0, v)e−imω 2π m=1−n
13 The Bahadur ARE is defined as the limiting ratio of the sample sizes required by the two competing tests to attain the same asymptotic significance level under the fixed alternative models. See Serfling (1980) for details.
(5.11)
∑n and R(m) = (n − |m|)−1 τ =|m|+1 Zτ′ Zτ −|m| . The data-driven p0 in (5.10) involves the choice of a preliminary bandwidth p, which can be fixed or grow with the sample size 1 n. If it is fixed, p0 still generally grows at rate n 2q+1 under HA , but c0 does not converge to the optimal tuning constant c0 (say) that minimizes the IMSE of f (0,1,0) (ω, 0, v) in (3.5). This is a parametric plug-in method. Alternatively, following Hong (1999), we can show that when p grows with n properly, the data-driven bandwidth p0 in (5.10) will minimize an asymptotic IMSE of f (0,1,0) (ω, 0, v). Simulation experiences show that the choice of p has little impact on the finite sample performances of the test; see the next section for simulation results. 6. Monte Carlo simulations In this section, I shall investigate the finite sample performances 0 (p) and M 1 (p) for joint and separate of the proposed tests M specifications respectively, with a comparison to the Hong and Li (2005) test. Since my test is constructed by a mathematical transformation and then a multivariate generalized spectral derivative approach, which pose a bit complication, I first give a clear documentation of the steps for the numerical realization to make the computation easy to follow. Then the empirical size and power performances will be studied for both univariate and bivariate models. Last, I shall illustrate the impact of numerical approximation for the integral involved in computing the test statistics on the test performances.
200
Z. Song / Journal of Econometrics 162 (2011) 189–212
6.1. Numerical computation of the tests14
0 ( 1 ( The computation of the tests M p0 ) and M p0 ) can be done by the following steps: √
1. Estimate the model parameters to obtain a n-consistent 0 ( estimator θ for θ0 . For computing M p0 ), a full parametric diffusion model needs to be estimated by such methods as the simulated MLE in Brandt and Santa-Clara (2002) and approximated MLE in Ait-Sahalia (2002, 2008). But for 1 ( computing M p0 ), only drift parameters need to be estimated in a semi-parametric diffusion model and consistent estimators can be obtained by Ait-Sahalia’s (1996a) OLS for univariate case with linear drift, Kristensen’s (2008) pseudo-MLE for univariate case with general drift specification and Song’s (2009a) conditional GMM for general multivariate models. 2. Compute the model-implied processes {Zτ ( θ)} by plugging the 0 ( estimator θ obtained in Step 1 into (2.20) for M p0 ) and 1 ( (4.4) for M p0 ). Note that to obtain the numerical value of t the sequence {Zτ ( θ )}, an integral of the t −∆ f (Xs )ds type has to be computed. Similar to Pan (2002), I approximate t these [f ] integrals by t −∆ f (Xs )ds = ∆ X + f X + O ∆2 . It ( ) ( ) t t − ∆ P 2 is expected that the approximation errors should be negligible when ∆ is small enough. However, it may affect the finite sample performances of the tests when the data is sampled at very low frequency, e.g., quarterly and yearly with ∆ = 1/4 and 1 respectively. This is a price we need to pay by employing the infinitesimal operator which delivers many nice properties discussed above. The impact of this numerical approximation on the finite sample performances of the tests is investigated in Section 6.4. 3. With the estimated sequence {Zτ ( θ )}, the data-driven bandwidth p0 can be computed according to (5.10) and (5.11). 0 ( 1 ( Then the test statistics M p0 ) and M p0 ) are calculated as in (3.9) and (4.5). Since an arbitrarily preliminary bandwidth p is needed to compute p0 , I shall consider different choices of p for computing the test statistics. Simulation studies in the following show that finite sample performances of the tests do not vary much for different p. 0 ( 1 ( 4. Finally, the test statistics M p0 ) and M p0 ) will be compared with the upper-tailed N (0, 1) critical value Cα at level α (the asymptotic critical value under the null hypothesis). If 0 ( M p0 ) > Cα then reject the joint parametric specification of 1 ( the model at the significant level α while for M p0 ) > Cα , reject the parametric form of the drift function.
1 (p) is for the linear drift hypothesis, i.e., b0 (Xt ) = κ(α − statistic M Xt ) for some κ and α . To examine the size of the tests for multivariate models, I generate data from a Bivariate Uncorrelated Ornstein–Uhlenbeck (O–U) model (DGP B0), which is also the A0 (2) affine diffusion term structure model in Dai and Singleton (2000): [
X d 1t X2t
] =
κ22
0
κ11 X1t b (Xt ) = κ22 X2t
0
σ11 X1t dt + X2t 0
][
]
[
0
σ22
] [
W1t d W2t
] (6.2)
(6.3)
for some κ11 and κ22 . For each parameterization, we simulate 1000 data sets of a random sample {Xτ ∆ }nτ =1 at the monthly frequency (∆ = 1/22) for n = 250, 500, and 1000 respectively. Each simulated sample path is generated using 40 intervals per month with 39 discarded out of every 40 observations, obtaining discrete observations at the monthly frequency. The simulation is carried out based on the transition density of {Xt } which is known to be normal for both DGPs A0 and B0. These sample sizes correspond to about 20–100 years of monthly data. For each data set, we estimate the 0 (p) and model parameters via the MLE and then compute both M 1 (p) following the steps in Section 6.1. The Bartlett kernel is M used both in computing the data-dependent optimal bandwidth p0 by the plug-in method for some preliminary bandwidth p 0 ( 1 ( and in computing the test statistics M p0 ) and M p0 ). The standard multivariate normal CDF is chosen for W (·). Simulation experiences indicate that choices of the kernel function k(·) and weighting function W (·) have no substantial impact on the size performances of tests. I consider the empirical rejection rates using the asymptotic critical values (1.28 and 1.65) at the 10% and 5% significance levels respectively. For comparison, I describe the construction of Hong and Li (2005) test, which is based on g (x, t |Xs , θ), the model-implied transition density of Xt = x given Xs for s < t and the true correspondent g0 (x, t |Xs ). For the univariate Xt , the Hong and Li (2005) test is constructed by checking the probability integral transform of the transition density
∫
Xt
g (x, t |Xt −∆ , θ0 ) dx ∼ i.i.d.U [0, 1] under H0 .
−∞
6.2. Empirical size of the test I now study the size performances of the test procedures. To examine the size of the tests for univariate models, I simulate data from Vasicek’s (1977) model (DGP A0): (6.1)
where α is the long run mean and κ is the speed of mean reversion. To illustrate the possible impact of dependent persistence in {Xt } on the size of the test, I follow Hong and Li (2005) and Pritsker (1998) to choose two sets of parameter values, κ, α, σ 2 = (0.85837, 0.089102, 0.002185) and (0.214592, 0.089102, 0.000546), for the low and high persistent dependence 0 (p) is to check whether cases respectively. The test statistic M the DGP is a Vasicek model in (6.1) while the separate inference
14 I am very grateful to an anonymous referee for suggesting this section.
0
where W1t and W2t are two independent Brownian Motions and (κ11 , κ21 , κ22 , σ11 , σ22 ) = (−0.1117, −1.1637, 1, 1). For this case, 0 (p) is to check whether the DGP is a Bivariate the test statistic M Uncorrelated O–U model in (6.2) while the separate inference 1 (p) is for the special drift specification: statistic M
Qt (θ0 ) =
dXt = κ(α − Xt )dt + σ dWt
[ κ11
Their test is pretty robust to the persistent dependence in {Xt } due to the transformation. However, as discussed earlier, the model-implied transition density g (x, t |Xs , θ) is not in closed form for most cases and approximation techniques are needed. More seriously, the multivariate version of the probability integral transform Qt (θ0 ) as defined above is no longer i.i.d. U [0, 1] even under H0 . Although Hong and Li (2005) propose to check the multivariate diffusion models by applying the univariate Qt (θ0 ) for each state variable through a suitable partitioning, the resulting procedure does not make full use of the information for the joint dynamics of different component processes in Xt . Specifically, it may miss the misspecification in the joint dynamics of Xt for the following DGP
[
X d 1t X2t
]
[ κ = 11 κ21
0
κ22
X1t σ11 dt + X2t 0
][
]
[
0
σ22
] [
W1t d W2t
]
with W1t and W2t two independent Brownian Motions, when the model (6.2) is fit for the data. The reason is that the
Z. Song / Journal of Econometrics 162 (2011) 189–212
201
Table 1 Empirical sizes under DGPs A0 and B0. n = 250 10%
n = 500 5%
n = 1000
n = 1500
10%
5%
10%
5%
10%
5%
0.163 0.202 0.187 0.168 0.030 0.030 0.027 0.030
0.152 0.184 0.175 0.146 0.029 0.029 0.027 0.028
0.165 0.142 0.151 0.162 0.097 0.097 0.094 0.097
0.143 0.132 0.138 0.145 0.088 0.086 0.085 0.088
0.114 0.114 0.114 0.114 0.099 0.099 0.099 0.102
0.085 0.085 0.085 0.085 0.057 0.057 0.057 0.057
0.162 0.210 0.193 0.170 0.038 0.038 0.038 0.044
0.155 0.189 0.180 0.152 0.038 0.035 0.033 0.038
0.166 0.145 0.150 0.167 0.089 0.082 0.080 0.086
0.145 0.133 0.134 0.150 0.066 0.064 0.060 0.062
0.092 0.092 0.092 0.092 0.098 0.098 0.098 0.098
0.074 0.074 0.074 0.074 0.061 0.061 0.061 0.061
0.113 0.116 0.120 0.112 0.102 0.102 0.100 0.104
0.133 0.132 0.132 0.133 0.116 0.116 0.114 0.114
0.096 0.094 0.094 0.094 0.083 0.083 0.080 0.080
0.106 0.106 0.106 0.106 0.103 0.103 0.101 0.101
0.072 0.072 0.070 0.072 0.066 0.066 0.066 0.066
DGP A0: high persistent Vasicek model
0 (5) M 0 (10) M 0 (15) M 0 (20) M 1 (5) M 1 (10) M 1 (15) M 1 (20) M
0.313 0.362 0.372 0.325 0.228 0.258 0.252 0.243
0 (5) M 0 (10) M 0 (15) M 0 (20) M 1 (5) M 1 (10) M 1 (15) M 1 (20) M
0.343 0.380 0.396 0.346 0.225 0.225 0.216 0.216
0 (5) M 0 (10) M 0 (15) M 0 (20) M 1 (5) M 1 (10) M 1 (15) M 1 (20) M
0.221 0.218 0.206 0.204 0.184 0.180 0.182 0.182
0.298 0.350 0.341 0.274 0.234 0.234 0.220 0.213
DGP A0: low persistent Vasicek model 0.312 0.373 0.356 0.303 0.166 0.166 0.158 0.167
DGP B0: bivariate Ornstein–Uhlenbeck model 0.164 0.166 0.165 0.164 0.158 0.158 0.157 0.157
0.152 0.150 0.152 0.152 0.137 0.137 0.135 0.135
Notes: (i) 1000 iterations; (ii) DGP A0 is the Vasicek model in (6.1) with parameter values (κ, α, σ 2 ) = (0.214592, 0.089102, 0.000546) and (0.85837, 0.089102, 0.002185) corresponding to high and low persistence cases respectively. DGP B0 is the bivariate Ornstein–Uhlenbeck model in (6.2). p0 with the Bartlett kernel used. The Bartlett kernel is also used for (iii) p, the preliminary bandwidth, is used in a plug-in method to choose the data-dependent bandwidth 0 ( 1 ( computing M p0 ) and M p 0 ).
probability integral transforms Qt1 (θ0 ) and Qt2 (θ0) for individual conditional densities g X1,t , t |Xt −∆ , X2,t , θ and g X2,t , t |Xt −∆ , θ respectively, where are employed by Hong and Li (2005), are both i.i.d. U [0, 1] sequences while the joint dynamics are obviously misspecified due to the misspecification of the drift. Hence, the Hong and Li (2005) test has no power against such alternatives. 0 ( Table 1 reports the empirical sizes of M p0 ) at the 10% and 5% levels under the correct Vasicek and Bivariate Uncorrelated O–U models. Both of the cases with low and high persistence of dependence are considered for the former. It can be observed that there is over-rejection at both 10% and 5% levels, but the performances are improving as n increases for all three cases. Since the over-rejection is still serious especially at the 5% level when n = 1000, I increase the sample size to n = 1500 (only for size performances) to check the empirical sizes of the tests. Obviously, when the sample size is large enough, the over-rejection is not very serious, with rejection rates around 7% at the 5% level. Furthermore, the tests display more over-rejections under strong mean reversion than under weak mean reversion. For comparison, Table 2 reports the empirical sizes of the Hong and Li (2005) test under the same DGPs. Similarly, the Hong and Li test has some 0 ( over-rejection which is close to that of the M p0 ) tests at 10% level but much less serious at 5% level for the Vasicek model. For the Bivariate Ornstein–Uhlenbeck model, however, the Hong and Li test has more serious over-rejection than my test at both 5% and 10% levels. For the separate inference, the drift is correctly specified as a linear function for Vasicek models and as that in (6.3) for the Bivariate Uncorrelated O–U model. It can be seen that the 1 ( test M p0 ) has also nice performances for all three cases, with
rejection rates around 6% at the 5% significance level when n = 1000, which is actually better than the performances of 0 ( M p0 ). Therefore, the separate inference test features nice size performances. Another observation worth pointing out is that the 0 ( 1 ( rejection rates of both M p0 ) and M p0 ) do not vary much for different choices of preliminary bandwidths. This can be seen as a robust property of the optimal bandwidth based on plug-in methods. 6.3. Empirical power of the test To investigate the power of the test for univariate diffusion models, I simulate data from the following four popular diffusion models:
• DGP A1 (CIR (Cox et al., 1985) Model): dXt = κ(α − Xt )dt + σ Xt dWt
(6.4)
where (κ, α, σ ) = (0.89218, 0.090495, 0.032742). • DGP A2 (Ahn and Gao’s (1999) Inverse-Feller Model): 3/2 dXt = Xt [κ − σ 2 − κα Xt ]dt + σ Xt dWt
(6.5)
2
where (κ, α, σ ) = (3.4387, 0.0828, 1.420864). • DGP A3 CKLS (Chan et al., 1992) Model: 2
ρ
dXt = κ(α − Xt )dt + σ Xt dWt
(6.6)
where (κ, α, σ , ρ) = (0.0972, 0.0808, 0.52186, 1.46). • DGP A4 (Ait-Sahalia’s (1996a) Nonlinear Drift Model): 2
ρ
dXt = (α−1 Xt−1 + α0 + α1 Xt + α2 Xt2 )dt + σ Xt dWt
(6.7)
202
Z. Song / Journal of Econometrics 162 (2011) 189–212
Table 2 Empirical sizes and powers of the Hong and Li (2005) test. n = 250
Models
10%
n = 500
n = 1000
5%
10%
5%
10%
5%
0.102 0.104 0.114
0.150 0.142 0.188
0.097 0.092 0.120
0.136 0.153 0.153
0.094 0.096 0.098
0.126 0.766 0.583 0.861 0.064 0.060 0.643
0.303 0.922 0.880 0.988 0.109 0.089 0.906
0.264 0.908 0.867 0.982 0.076 0.065 0.878
0.595 1.000 0.988 1.000 0.133 0.101 1.000
0.501 1.000 0.975 1.000 0.080 0.077 1.000
Size performances DGP A0: high persistence DGP A0: low persistence DGP B0
0.157 0.150 0.175 Power performances
DGP A1 DGP A2 DGP A3 DGP A4 DGP B1 DGP B2 DGP B3
0.182 0.794 0.628 0.874 0.083 0.081 0.685
Notes: (i) 1000 iterations; (ii) DGP A0 is the Vasicek model in (6.1) with parameter values (κ, α, σ 2 ) = (0.214592, 0.089102, 0.000546) and (0.85837, 0.089102, 0.002185) corresponding to high and low persistence cases respectively. DGP B0 is the bivariate Ornstein–Uhlenbeck model in (6.2). (iii) p, the preliminary bandwidth, is used in a plug-in method to choose the data-dependent bandwidth p0 with the Bartlett kernel used. The Bartlett kernel is also used for 0 ( 1 ( computing M p0 ) and M p0 ).
where (α−1 , α0 , α1 , α2 , σ 2 , ρ) = (0.00107, −0.0517, 0.877, −4.604, 0.64754, 1.50). Following Hong and Li (2005), the parameter values for the CIR model are taken from Pritsker (1998), and those for Ahn and Gao’s model from Ahn and Gao (1999).15 For DGPs A3 and A4, the parameter values are taken from Ait-Sahalia’s (1999) estimates of real interest rate data. For each of univariate diffusion models 0 (p) is to check whether the DGP is a above, the test statistic M 1 (p) Vasicek model in (6.1) while the separate inference statistic M is for the linear drift hypothesis, i.e., b0 (Xt ) = κ(α − Xt ) for some κ and α . Obviously, both of these two hypotheses should be rejected. To investigate the power of the test for multivariate diffusion models, sample data will be simulated from the following three bivariate models:
• DGP B1 (Bivariate Correlated O–U Model, with constant correlation in diffusion)
[
X d 1t X2t
] =
[ −0.1117 0 1 0.25
0 W1t d . 1 W2t
[ +
][
0 −1.1637
] [
]
X1t dt X2t
]
(6.8)
• DGP A2 (Bivariate Correlated O–U Model, with constant correlation in drift):
[
X d 1t X2t
]
[ ][ ] −0.1117 0 X1t = dt 0.4 −1.1637 X2t [ ] 1 0 W1t + d . 0
1
W2t
(6.9)
• DGP A3 (Bivariate Correlated A2 (2) model in Dai and Singleton (2000)):
[
X d 1t X2t
]
[ ][ ] −0.7 0.3 X1t = dt 0.4 −0.8 X2t [ ] X1t 0 W1t + d . 0
X2t
W2t
(6.10)
15 Chen and Hong (2010) found some typos in the parameter values of Ahn and Gao’s (1999) inverse-Feller model by private correspondence and corrected therm. Here I choose the parameter values used by them.
For each of bivariate diffusion models above, the test statistic 0 (p) is to check whether the DGP is a Bivariate Uncorrelated O–U M 1 (p) is for the model in (6.2) while the separate inference statistic M special drift specification in (6.3). The perfect measure for the distances between the alternative univariate DGPs A1–A4 to A0 and between the alternative bivariate DGPs B1–B3 to B0 is the Kullback–Leibler information criterion since all the diffusion models have a transition density. But as discussed above, the transition density is usually not available in closed form and hence difficult to use here for capturing the distance. Alternatively, I shall measure the distance of the model under HA to that under H0 by whether the drift, diffusion or both are misspecified. I admit that this approach may not be able to measure the exactly precise distance. However, it can give heuristic estimates for the distance between two models and moreover is very informative about separate specifications of the process dynamics. For each of the DGPs above, I generate 1000 data sets of the random sample for {Xτ }nτ ∆ =∆ where n = 250, 500, and 1000 at the monthly frequency, either via the transition density or Euler–Milstein scheme depending on whether the closed-form transition density is available. For DGPs A1–A4, the Vasicek model implied by the null hypothesis is estimated by MLE and by OLS when the separate inference statistic is computed while for DGPs B1–B3, the Bivariate Uncorrelated O–U model in (6.2) is estimated by MLE and Song’s (2009a) conditional GMM when computing the separate inference test statistic for each generated sample path. 0 (p) and M 1 (p) are computed following Then the test statistics M the steps in Section 6.1. 0 ( 1 ( Table 3 reports the rejection rates of M p0 ) and M p0 ) at the 10% and 5% levels for DGPs A1–A4 and B1–B4 and for comparison, those of the Hong and Li (2005) test are reported in Table 2. Under DGP A1, model (6.1) is correctly specified for the drift but is misspecified for the diffusion function because it fails to capture 0 ( the ‘‘level effect’’. The test M p0 ) has good power in this case, with rejection rates around over 96% at the 5% level when n = 1000. The 0 ( Hong and Li (2005) test is less powerful than the M p0 ) test, with rejection rates around 50% at the 5% level when n = 1000. The 1 ( separate inference test M p0 ) has also good performances with rejection rates about 9% at the 5% level when n = 1000, revealing that the rejection of the model is due to the misspecification of the diffusion term instead of the drift term. Under DGP A2, model (6.1) is misspecified for both the instantaneous conditional mean and variance because it ignores 0 ( the nonlinear drift and diffusion. As expected, both M p0 ) and
Z. Song / Journal of Econometrics 162 (2011) 189–212 Table 3 (continued)
Table 3 Empirical powers under DGPs A1–A4 and B1–B4. n = 250 10%
n = 250
n = 500 5%
n = 1000
10%
5%
10%
5%
0.778 0.773 0.765 0.784 0.139 0.134 0.087 0.114
0.764 0.760 0.737 0.773 0.123 0.106 0.077 0.102
1.000 0.987 0.964 1.000 0.152 0.155 0.116 0.108
1.000 0.984 0.960 1.000 0.120 0.122 0.085 0.092
0.638 0.635 0.609 0.627 0.730 0.734 0.682 0.744
0.588 0.583 0.552 0.573 0.699 0.692 0.683 0.722
0.965 0.967 0.978 0.975 0.789 0.763 0.780 0.780
0.954 0.962 0.970 0.968 0.755 0.730 0.724 0.782
0.798 0.782 0.764 0.758 0.068 0.068 0.060 0.047
0.763 0.754 0.752 0.747 0.030 0.030 0.030 0.029
1.000 1.000 1.000 1.000 0.120 0.109 0.109 0.110
1.000 1.000 1.000 1.000 0.072 0.089 0.060 0.058
0.425 0.414 0.422 0.425 0.403 0.405 0.388 0.366
0.838 0.835 0.852 0.867 0.798 0.766 0.750 0.802
0.802 0.798 0.814 0.832 0.754 0.732 0.690 0.791
0.967 0.978 0.962 0.957 1.000 1.000 1.000 1.000
0.962 0.973 0.958 0.942 1.000 1.000 1.000 1.000
0.250 0.252 0.260 0.259 0.052 0.052 0.057 0.057
0.571 0.570 0.575 0.580 0.104 0.104 0.104 0.105
0.338 0.340 0.334 0.334 0.060 0.060 0.063 0.060
0.832 0.830 0.830 0.837 0.110 0.109 0.109 0.110
0.690 0.696 0.702 0.696 0.072 0.075 0.070 0.075
0.372 0.372 0.374 0.374 0.425 0.426 0.426 0.426
0.880 0.884 0.884 0.882 0.901 0.900 0.906 0.906
0.643 0.647 0.648 0.640 0.883 0.890 0.894 0.897
1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
0.994 0.995 0.993 0.993 1.000 1.000 1.000 1.000
0.808 0.810 0.805 0.802 0.908 0.908
0.994 0.994 0.990 0.988 1.000 1.000
0.981 0.980 0.977 0.980 1.000 1.000
1.000 1.000 1.000 1.000 1.000 1.000
1.000 1.000 1.000 1.000 1.000 1.000
DGP A1: CIR
0 (5) M 0 (10) M 0 (15) M 0 (20) M 1 (5) M 1 (10) M 1 (15) M 1 (20) M
0.656 0.693 0.722 0.683 0.483 0.433 0.301 0.325
0.634 0.662 0.679 0.646 0.426 0.395 0.236 0.220
DGP A2: Ahn and Gao
0 (5) M 0 (10) M 0 (15) M 0 (20) M 1 (5) M 1 (10) M 1 (15) M 1 (20) M
0.268 0.252 0.244 0.237 0.654 0.632 0.611 0.617
0.262 0.247 0.240 0.235 0.617 0.580 0.574 0.633
DGP A3: CKLS
0 (5) M 0 (10) M 0 (15) M 0 (20) M 1 (5) M 1 (10) M 1 (15) M 1 (20) M
0.684 0.690 0.716 0.689 0.056 0.050 0.043 0.042
0.659 0.667 0.701 0.673 0.022 0.022 0.018 0.018
DGP A4: Ait-Sahalia
0 (5) M 0 (10) M 0 (15) M 0 (20) M 1 (5) M 1 (10) M 1 (15) M 1 (20) M
0.475 0.454 0.435 0.449 0.432 0.443 0.421 0.364 DGP B1:
0 (5) M 0 (10) M 0 (15) M 0 (20) M 1 (5) M 1 (10) M 1 (15) M 1 (20) M
0.366 0.368 0.373 0.368 0.096 0.094 0.094 0.094 DGP B2:
0 (5) M 0 (10) M 0 (15) M 0 (20) M 1 (5) M 1 (10) M 1 (15) M 1 (20) M
0.483 0.480 0.480 0.482 0.504 0.506 0.504 0.504 DGP B3:
0 (5) M 0 (10) M 0 (15) M 0 (20) M 1 (5) M 1 (10) M
0.873 0.870 0.866 0.870 0.922 0.920
203
1 (15) M 1 (20) M
n = 500
n = 1000
10%
5%
10%
5%
10%
5%
0.915 0.912
0.904 0.901
1.000 1.000
1.000 1.000
1.000 1.000
1.000 1.000
Notes: (i) 1000 iterations; (ii) DGP1-4 are CIR model, Ahn and Gao’s (1999) inverse-Feller model, CKLS model and Ait-Sahalia’s (1996a) nonlinear drift model, given in Eqs. (6.2)–(6.4). (iii) p, the preliminary bandwidth, is used in a plug-in method to choose a datadependent bandwidth p0 with the Bartlett kernel is used. The Bartlett kernel is also 0 ( 1 ( used for computing M p0 ) and M p 0 ).
1 ( M p0 ) have excellent power when the Vasicek model (6.1) is used 0 ( to fit the DGP A2. The power of M p0 ) increases substantially with the sample size n and approaches unity when n = 1000 while the 1 ( rejection rates of M p0 ) are around 75% at 5% level, implying the misspecification of the drift function. The Hong and Li (2005) test 0 ( is more powerful than the M p0 ) tests for small sample sizes but the difference becomes negligible when n is increased to 1000. Similar to DGP A1, DGP A3 is only misspecified for the diffusion term, with the only difference that the coefficient of elasticity for volatility is equal to 1.46 rather than 0.5. The rejection rates of 0 ( M p0 ) increases very quickly from around 65% when n = 250 to over 100% when n = 1000 at the 5% level. The Hong and Li 0 ( (2005) test is very comparable to the M p0 ) tests in the power performances and slightly less powerful when n = 1000. For the 1 ( separate inference, the rejection rates of M p0 ) is around 7% at 5% level when n = 1000, indicating the true source of rejection is the misspecification of the diffusion function. Under DGP A4, model (6.1) is misspecified for both the drift and diffusion terms because it ignores the nonlinearity in both terms. The rejection rates are 0 ( 1 ( already over 80% when n = 500 for M p0 ) and 100% for M p0 ) when n = 1000 both at the 5% level. The Hong and Li (2005) test 0 ( is more powerful than M p0 ) when n = 250 but the difference becomes smaller as n increases. The results above for univariate diffusion models show that the 0 ( 1 ( combination of the proposed tests M p0 ) and M p0 ) not only have good power in detecting various model misspecifications but also are excellent in uncovering the sources of misspecification. In the following, I shall check their power performances for the multivariate diffusion models under DGPs B1–B3. Under DGP B1, model (6.2) is correctly specified for the drift but misspecified for the diffusion function since it misses the nonzero constant 0 ( correlation in the state variables. The test M p0 ) has good power in detecting the misspecification in the joint dynamics, with rejection rate around 70% when n = 1000. In contrast, the Hong and Li (2005) test has no power and the rejection rate is only 13% at the 10% level when n = 1000. This is not surprising since the conditional densities of individual state variables are correctly specified although the joint dynamics are not. The separate 1 ( inference test M p0 ) has also good performances with rejection rate about 7% at the 5% level when n = 1000, indicating that the diffusion rather than the drift function is misspecified. Under DGP B2, model (6.2) is correctly specified for the diffusion 0 ( but misspecified for the drift function. The rejection rate of M p0 ) increases very quickly and approaches unity when n is rising to 1000 while that of the Hong and Li (2005) test is only about 7% at the 5% level when n = 1000. This confirms again that my test is powerful against misspecifications in the joint dynamics which the Hong and Li (2005) test would miss. Moreover, the 1 ( separate inference test M p0 ) has also good power against the drift misspecification, with 100% rejection rate at the 5% level when n = 1000. Under DGP B3, model (6.2) is misspecified for 0 ( both the drift and diffusion functions. Both M p0 ) and the Hong and Li (2005) test have nice power performances with the former more powerful when the sample size is only 500. The separate
204
Z. Song / Journal of Econometrics 162 (2011) 189–212
Table 4 The impact of numerical integral approximation. Daily (∆ = 1/252) 10%
Monthly (∆ = 1/22)
Quarterly (∆ = 1/4)
Yearly (∆ = 1)
10%
5%
10%
5%
10%
5%
0.058 0.058 0.060 0.060
0.114 0.114 0.114 0.114
0.085 0.085 0.085 0.085
0.152 0.152 0.152 0.150
0.107 0.107 0.110 0.110
0.260 0.264 0.268 0.268
0.228 0.230 0.222 0.225
0.061 0.057 0.054 0.053
0.132 0.132 0.132 0.132
0.092 0.090 0.086 0.086
0.175 0.175 0.170 0.169
0.123 0.120 0.126 0.126
0.304 0.298 0.298 0.300
0.256 0.254 0.250 0.250
5%
DGP A0: high persistence
0 (5) M 0 (10) M 0 (15) M 0 (20) M
0.103 0.103 0.106 0.106 DGP B0
0 (5) M 0 (10) M 0 (15) M 0 (20) M
0.105 0.105 0.107 0.105
Notes: (i) The iteration number is 1000 while the sample size is 1500. (ii) DGP A0 is the Vasicek model in (6.1) with high persistence and DGP B0 is the Bivariate Ornstein–Uhlenbeck model in (6.2). (iii) p, the preliminary bandwidth, is used in a plug-in method to choose the data-dependent bandwidth p0 with the Bartlett kernel used. The Bartlett kernel is also used for 0 ( 1 ( computing M p0 ) and M p0 ). Table 5 Testing spot rate models (∆ = 1/252).
0 (5) M 0 (10) M 0 (15) M 0 (20) M 1 (5) M 1 (10) M 1 (15) M 1 (20) M
Vasicek
CIR
CKLS
Ahn and Gao
Ait-Sahalia
731.7 728.2 720.9 725.5 422.4 430.3 433.6 424.7
503.2 509.3 510.2 502.4 – – – –
481.0 476.8 472.4 480.3 – – – –
238.6 230.3 215.2 207.2 158.3 150.6 147.9 144.8
173.4 180.0 169.7 165.5 140.0 142.2 133.6 130.3
Notes: (i) The model parameters are estimated by MLE for a full parametric model and by OLS and Song’s (2009a) estimator when only drift parameters are estimated. (ii) The sample period for the daily Eurodollar interest rates is from 6/01/1973 to 2/25/1995. (iii) p, the preliminary bandwidth, is used in a plug-in method to choose a datadependent bandwidth p0 with the Bartlett kernel employed. The Bartlett kernel is 0 ( 1 ( also used for computing M p0 ) and M p 0 ).
1 ( inference test M p0 ) has also excellent power against the drift misspecification, with 100% rejection rate at the 5% level when n = 500 (see Table 5). In summary, the following observations are made: (1), For 0 ( 1 ( both univariate and bivariate models, the M p0 ) and M p0 ) 0 ( tests have reasonable sizes in finite samples. (2), The M p0 ) test has nice power against various model misspecifications. Particularly, it has excellent power in identifying misspecifications of the joint dynamics for multivariate diffusion models even when the individual component processes are correctly specified. This feature cannot be attained by the Hong and Li (2005) test which though performs well for univariate cases. (3), The separate 1 ( inference test M p0 ) has nice performances in revealing the sources of rejection, i.e., whether the rejection is due to drift or diffusion misspecification. 6.4. The impact of numerical integral approximation As discussed earlier, the computation of my tests involves a numerical approximation for some integrals which come from the infinitesimal operator based martingale characterization. This may affect the finite sample performances of the tests when the sampling interval ∆ is not small enough, which is a price we need to pay by enjoying many nice properties such as being convenient for multivariate cases and able to check the separate specifications. To investigate the impact of the numerical integral approximation and under which frequency of the data sampling my tests are robust to the approximation errors, I shall check the
0 ( p0 ) by changing the sampling size performances of the test16 M interval ∆. I consider the univariate Vasicek model in (6.1) with high persistence and Bivariate Uncorrelated O–U model in (6.2). The data generating schemes are exactly the same as those in Section 6.2 with the exception that only n = 1500 is considered and the sampling interval is set at daily (∆ = 1/252), monthly (∆ = 1/22), quarterly (∆ = 1/4), and yearly (∆ = 1) respectively. 0 ( Table 4 reports the rejection rates of M p0 ) at the 10% and 5% 0 ( levels. It can be observed that for the Vasicek model, the test M p0 ) has excellent size performances when the sampling frequencies are daily and monthly, with rejection rates around 6% and 8% at the 5% significance level. When the data is sampled quarterly, the test exhibits a bit over-rejection but not very excessive with rejection rate about 10% at the 5% level. However, when the sampling frequency is increased to yearly, the test has serious over-rejection with the rejection rate around 22% at the 5% level. 0 ( Similarly for the Bivariate Uncorrelated O–U model, M p0 ) has excellent size performances when the sampling frequencies are daily and monthly, a bit but not very serious over-rejection at quarterly frequency and serious over-rejection when the data is sampled yearly. When the data is sampled quarterly, the test exhibits a bit over-rejection but not very excessive with rejection rate about 10% at the 5% level. To sum up, the approximation errors for the numerical integral involved in the test statistic have serious impact on the test performances only when the sampling frequency is yearly. This is the price paid by employing the infinitesimal operator based martingale characterization which delivers many nice properties for my test procedures. As far as the data are as frequent as or higher than monthly, the approximation has little impact and the tests have nice finite sample performances. Therefore, it seems that this is not very empirically relevant since in the fields where diffusion models are used, monthly data and even data sampled higher than monthly are usually available. For example, daily or even intra-daily data can be obtained for stocks, options, and bonds in finance research. Even in the case only very low frequent data are available, this problem can be circumvented by generating higher frequent data in the sampling interval ∆ similar to Brandt and Santa-Clara (2002) according to the estimated models and then compute the integrals by taking the average of the generated sample paths.
16 The performances of the separate inference test M 1 ( p0 ) are also checked following the same simulation design. The performance patters are very similar to those for the M0 ( p0 ) test.
Z. Song / Journal of Econometrics 162 (2011) 189–212
7. Empirical application: short-rate dynamics In this section, I shall apply the proposed test procedure to investigate the dynamics of short-term interest rates as an empirical application.17 The data set is the same as that in Ait-Sahalia (1996a), i.e., daily Eurodollar rates from June 1, 1973 to February 25, 1995, with a total of 5505 observations. See Ait-Sahalia (1996a) for detailed summary statistics for the data. Five popular models are considered: the Vasicek, CIR, Ahn and Gao, CKLS, and Ait-Sahalia’s nonlinear drift models, as given in (6.1)–(6.5). For each model, I estimate parameters via MLE for a full parametric model and OLS and Song’s (2009a) conditional GMM when only drift parameters are estimated. For the Vasicek, CIR, and Ahn and Gao’s models, the model likelihood function has a closed form. For the CKLS and Ait-Sahalia’s nonlinear drift models, Ait-Sahalia’s (2002) closed-form approximations for the model likelihood are used. With the parameter estimates in hand, the test statistic is computed following the computation procedure in Section 6.1. The empirical results are reported in Table 3. It shows that 0 ( the M p0 ) statistics with the four choices (5, 10, 15, and 20) of preliminary bandwidths for the five models range from 165.5 to 731.7. Compared to upper-tailed N (0, 1) critical values (e.g., 2.33 at 0 ( the 1% level), these huge values of M p0 ) statistics implies strong evidence that all five models are severely misspecified. Similar to Hong and Li (2005), the Vasicek model performs the worst, with the test values around 720 for all preliminary bandwidths, probably due to its restrictive assumption of constant volatility. The CIR and CKLS models dramatically reduces the test statistics values to about 500, obviously because of the more flexible diffusion specifications. The goodness of fit is further improved substantially by Ahn and Gao and Ait-Sahalia’s nonlinear drift models. The latter performs the best, with the test statistic values around 170, which is the most flexible model among the five for both drift and diffusion specifications. These findings demonstrate the power of my test: they overwhelmingly reject all parametric forms, including the CKLS and Ait-Sahalia models, which Ait-Sahalia’s (1996a) marginal density based test fails to reject. To explore the sources of rejection for the spot rate models 0 ( above, I report, in Table 3, the separate inference statistics M p0 ) defined in (4.5) for each model. For the Vasicek, CIR and CKLS 0 ( models, the statistic M p0 ) is to check whether the drift is linear, i.e., H0,1 : b(Xt ) = κ(α − Xt ) for some κ and α while for the Ahn and Gao’s and Ait-Sahalia’s models, it is testing whether the drift follows two specific nonlinear forms, respectively: H0,2 : b(Xt ) = Xt κ − α Xt2 for some κ and α and H0,3 : b(Xt ) = α−1 Xt−1 + α0 + α1 Xt + α2 Xt2 for some (α−1 , α0 , α1 , α2 ). These tests are actually related to the literature of the debate about whether the drift of the interest rate process is linear or not. The early studies AitSahalia (1996a) and Stanton (1997) use smoothed nonparametric kernel methods to estimate the drift of the short rate and find nonlinearity, Chapman and Pearson (2000), in a striking simulation study, find that the evidence of nonlinearity documented may be spurious due to the nature of smoothed nonparametric kernel estimation. 
Since then, many research have appeared exploring this issue, but most of them only estimate the drift term either parametrically or nonparametrically and check if the estimated drift is linear (see, e.g., Sam and Jiang, 2009 and Takamizawa, 2008), which cannot lead to a rigorous econometric procedure. In contrast, the test proposed in this study is able to check the whole dynamics of a spot rate model, reveal sources of rejection and point to the direction of a better model.
17 I am grateful to an anonymous referee for suggesting this empirical study.
205
It can be seen from Table 3 that the linear drift hypothesis is rejected strongly with the test statistic around 420. The quadratic drift specification in H0,2 and general nonlinear drift reduce the test statistic value dramatically to around 150 but are still rejected. These findings tell us that the drift misspecifications do play an important role in the rejection of all five spot rate models and a potential direction for more accurate models is to consider models with nonlinear drift specifications. The latter conclusion is in sharp contrast with Hong and Li (2005) who claim that the nonlinear drift model underperforms the linear drift models based on their separate inference statistics. However, as discussed in Section 4, their test statistics for separate inference are only for the conditional mean for a fixed sampling interval ∆ instead of for the instantaneous conditional mean or the drift with ∆ → 0. Therefore, their conclusion is only valid when a discrete time model is employed to fit the short-term interest rate. In contrast, 0 ( my separate inference test statistic M p0 ) is able to check the dynamics of the drift as the instantaneous conditional mean with ∆ → 0. The results in Table 3 show that, different from Hong and Li (2005), nonlinear drift outperforms linear drift substantially and should be an important consideration in building more accurate models for the spot rate. 8. Conclusion I develop an omnibus specification test for diffusion models based on the infinitesimal operator instead of the already extensively used transition density. The infinitesimal operator based identification of the diffusion process is equivalent to a ‘‘martingale hypothesis’’ for the new processes transformed from the original diffusion process by the celebrated ‘‘martingale problems’’. My test procedure is to check the ‘‘martingale hypothesis’’ by a multivariate generalized spectral derivative approach which has many good properties. The infinitesimal operator of the diffusion process enjoys the nice property of being a closed-form expression of drift and diffusion terms. This makes my test procedure capable of checking both univariate and multivariate diffusion models and particularly powerful and convenient for the multivariate case while in contrast checking the multivariate diffusion models is very difficult by transition density based methods because transition density does not have a closed form in general. Moreover, the transformed martingale processes via the infinitesimal operator based martingale characterization contain separate information about the drift and diffusion terms and their interactions. Consequently, a separate inference test is proposed to explore the sources when rejection of a parametric form happens. Finally, simulation studies show that the proposed tests have reasonable size performances and excellent power performances in finite sample. Since the infinitesimal operator is a general tool to characterize continuous time stochastic processes and enjoys the nice property of being a closed-form function of the model components, there are many future extensions to explore. First, although the test procedure in this study is for diffusion models specifically, the infinitesimal operator based martingale characterization actually remains valid for general Markov processes, including jump diffusion and Levy jump models which have received great attention recently (see Johannes, 2004, Pan, 2002, and Schoutens, 2003). 
The proposed tests can be extended to cover these models and the only complication is that Theorem 2 does not hold for general Markov processes and suitable functions at which the infinitesimal operator is evaluated need be chosen to make the test powerful. Second, it is known that the m.d.s. property is equivalent to a conditional moment condition. Therefore, a GMM-type estimator can be constructed for general Markov processes and is expected to have many nice properties such as
206
Z. Song / Journal of Econometrics 162 (2011) 189–212
being convenient for multivariate models due to the closed-form infinitesimal operator. Appendix. Mathematical appendix Throughout the Appendix, let gi (τ , θ ), gij (τ , θ ), Zτi (θ ), and Zτ (θ) be defined as in (5.1)–(5.4). I let M0 (p) be defined in the 0 (p) in (3.5) with the unobservable sample {Zτ = same way as M Zτ ∆ (θ0 )}nτ =1 , where θ0 = p lim θ , replacing the estimated processes samples { Zτ = Zτ ∆ ( θ )}nτ =1 . Also, C ∈ (1, ∞) denotes a generic bounded constant.
(1,0)
Now put nm = n − |m|, and let σm (0, v) be defined in the same way as σm(1,0) (0, v) in (3.5), with {Zτ }nτ =1 replacing { Zτ }nτ =1 . p To show M0 (p) − M0 (p) → 0, it is sufficient to prove − 12
D
0
(p)
∫ − n −1 m=1
ij
Proof of Theorem 1. See ChV.19–20 of Rogers and Williams (2000), or Theorem 21.7 of Kallenberg (2002), or Proposition 2.4 of Ch. VII in Revuz and Yor (2005). Proof of Theorem 2. See Proposition 4.6 of Karatzas and Shreve (1991, Ch. 5.4). Proof of Theorem 3. It suffices to show Theorems A.1–A.3. Theorem A.1 implies that replacing {Zτ }nτ =1 by { Zτ }nτ =1 has no impact on 0 (p); Theorem A.2 says that the use of the limit distribution of M truncated process {Zq,τ }nτ =1 rather than the original {Zτ }nτ =1 does 0 (p) for q sufficiently large. not affect the limit distribution of M The assumption that Zq,τ is independent of {Zτ −m }∞ m=q+1 when q is large greatly simplifies the derivation of asymptotic normality 0 (p). of M
(1,0) 2 − σm (0, v) dW (v) →p 0
1+ 4b1−2
the conditions of Theorem 3 and q = p M0 (p) →p 0.
ln2 n
2b1−1
, M0q (p) −
Since
(1,0) 2 (1,0) 2 σm (0, v) − σm (0, v) d [ 2 ] − (1,0) σm(1,,i 0) (0, v) σm,i (0, v) − = i =1
d − d [ 2 ] − (1,0) (1,0) σm,ij (0, v) − σm,ij (0, v) +
p
1+ 4b1−2
ln2 n
2b1−1
=
(1,0)
Proof of Theorem A.1. Note that Zτ (θ ) has components Zτi (θ ) = j j i i i ij Xτi ∆ − X(τ −1)∆ + gi (τ , θ ) and Zτ (θ ) = Xτ ∆ Xτ ∆ − X(τ −1)∆ X(τ −1)∆ + i gij (τ , θ) and similarly Zτ has components Zτi = Xτi ∆ − X(τ −1)∆ + j j i i ij gi (τ , θ) and Zτ = Xτ ∆ Xτ ∆ − X(τ −1)∆ X(τ −1)∆ + gij (τ , θ ) respectively.
By the mean value theorem, we have Zτi = Zτi − gi′ (τ , θ )′ ( θ − θ0 ) for some θ between θ and θ0 , where gi′ (τ , θ ) = ∂θ∂ gi (τ , θ ). By the Cauchy–Schwarz inequality and Assumptions A.3 and A.4, n n − − 2 2 i 2 Zτ − Zτi 5 n θ − θ0 n−1 sup gi′ (τ , θ )
σm(1,,ij0) (0, v) for i, j = 1, . . . , d are the components of σm(1,0) (0, v). By (A.5), it is sufficient for (A.4) to show that
= Op (1),
for any i = 1, . . . , d
(A.1)
where Θ0 is a neighborhood of θ0 . By similar reasoning, we have n n − − 2 2 ij 2 Zτ − Zτij 5 n θ − θ 0 n− 1 sup gi′ (τ , θ )
τ =1 θ∈Θ0
τ =1
= Op (1),
for any i, j = 1, . . . , d
2 ]
(1,0)
for i = 1, . . . , d
(A.6)
1
− D 2 (p)
∫ − n −1
0
m=1
[
2
(1,0)
2 ]
(1,0)
σm,ij (0, v) σm,ij (0, v) − k2 (m/p)nm
× dW (v) →p 0,
for i, j = 1, . . . , d.
(A.7)
We will only show (A.6) here and the proof of (A.7) is similar. To show (A.6), I first decompose
m=1
[
2
(1,0)
(1,0)
2 ]
k2 (m/p)nm σm,i (0, v) − σm,i (0, v)
× dW (v) = A1 + 2 Re( A2 )
(A.8)
where
∫ − n−1 m=1
(1,0)
(1,0)
2
k2 (m/p)nm σm,i (0, v) − σm,i (0, v) dW (v)
∫ − n −1
(1,0)
(1,0)
k2 (m/p)nm σm,i (0, v) − σm,i (0, v)
m=1
× σm(1,,i 0) (0, v)∗ dW (v) (1,0)
n −− − ij ij 2 + Zτ − Zτ = Op (1).
where Re( A2 ) denote the real part of A2 and σm,i (0, v)∗ denote
d
i=1 j=1
2
(1,0)
and
A2 =
τ =1
d
[
σm,i (0, v) k2 (m/p)nm σm,i (0, v) −
× dW (v) →p 0,
(A.2)
n d n − − − 2 i 2 i Zτ − Zτ = Zτ − Zτ i=1
∫ − n −1
0
A1 =
(A.1) and (A.2) together imply that
τ =1
− D 2 (p)
∫ − n −1
τ =1 θ ∈Θ0
τ =1
(1,0)
where σm,i (0, v) and σm,ij (0, v) for i, j = 1, . . . , d are the (1,0) σm(1,,i 0) (0, v) and components of σm (0, v) and correspondingly
m=1
, M0q (p) →d N (0, 1).
(A.5)
i =1 j =1
1
Theorem A.3. Under the conditions of Theorem 3 and q
(A.4)
C0 (p) − C0 (p) = Op (n−1/2 ), and D0 (p) − D0 (p) = op (1), where C0 (p) and D0 (p) are defined in the same way as C0 (p) and D0 (p) n n in (3.9) with {Zτ }τ =1 replacing {Zτ }τ =1 . To save space, I focus on the proof of (A.4); the proofs for C0 (p) − C0 (p) = Op (n−1/2 ), and D0 (p) − D0 (p) = op (1) are routine. Note that it is necessary to achieve the convergence rate Op (n−1/2 ) for C0 (p) − C0 (p) to make sure that replacing C0 (p) with C0 (p) has asymptotically negligible impact given p/n → 0.
0 (p) − M0 (p) Theorem A.1. Under the conditions of Theorem 3, M →p 0. Theorem A.2. Let M0q (p) be defined as M0 (p) with {Zq,τ }nτ =1 replacing {Zτ }nτ =1 ,where {Zq,τ } is as in Assumption A.2. Then under
(1,0) 2 k (m/p)nm σm (0, v) 2
τ =1
(1,0)
(A.3)
the complex conjugate of σm,i (0, v). Then (A.6) follows from the following Propositions A.1 and A.2 and p → ∞ as n → ∞.
Z. Song / Journal of Econometrics 162 (2011) 189–212
Proposition A.1. Under the conditions of Theorem 3, A1 = Op (1). Proposition A.2. Under the conditions of Theorem 3, p−1/2 A2 = op (1). Proof of Proposition A.1. Put δτ (v) = eiv Zτ − eiv Zτ and ψτ (v) = ′ ′ eiv Zτ − ϕ(v), where ϕ(v) = Eeiv Zτ . Then straightforward algebra yields that for m > 0, ′
1 σm(1,,i 0) (0, v) − σm(1,,i 0) (0, v) = in− m
( Zτ ,i − Zτ ,i )
1 n− m
τ =m+1 n −
−1
+ inm
( Zτ ,i − Zτ ,i ) δτ −m (v) n −
It follows from (A.1) and Assumptions A.5 and A.6 that
∫ − n −1
2
1 k2 (m/p)n− B1m (v) dW m
m=1
′
5
n−1 −
2 ∫ n − 2 ‖v‖2 dW (v) an (m) ( Zτ ,i − Zτ ,i ) τ =1
m=1
= Op (p/n)
(A.10)
where I made use of the fact that
τ =m+1
n −
1 − i n− m
n −
207
n−1 −
δτ −m (v)
an (m) =
m=1
τ =m+1
n−1 −
1 2 n− m k (m/p) = O(p/n)
(A.11)
m=1
given p = cnλ for λ ∈ (0, 1/2), as shown in Hong (1999, A.15, page 1213).
Zτ ,i δτ −m (v)
τ =m+1
n −
1 − i n− m
1 n− m
Zτ ,i
τ =m+1 n −
1 + in− m
n −
δτ −m (v)
τ =m+1
2 2 n n − − 2 −1 −1 B2m (v) 5 nm Zτ ,i − Zτ ,i nm v Zτ ,i − v Zτ ,i
( Zτ ,i − Zτ ,i )ψτ −m (v)
τ =1
τ =m+1
n −
1 − i n− m
( Zτ ,i − Zτ ,i )
Proof of Lemma A.2. By the inequality that eiz1 − eiz2 5 |z1 − z2 | for any real-valued variables z1 and z2 , I have
n −
1 n− m
τ =m+1
5 ‖v‖2
ψτ −m (v)
1 n− m
τ =1 n −
2 ( Zτ ,i − Zτ ,i )2
.
τ =1
τ =m+1
=i B1m (v) − B2m (v) + B3m (v)
By the same reasoning as that of Lemma A.1, the desired result follows.
− B4m (v) + B5m (v) − B6m (v) ,
say.
(A.9)
Then it follows that A1 5 8 a′ =1 m=1 k2 (m/p)nm | Ba′ m (v)|2 dW (v). Proposition A.1 follows from Lemmas A.1–A.6 and p/n → 0.
∑n−1
∑6
Proof of Lemma A.3. Using the inequality that eiz − 1 − iz 5
|z |2 for any real-valued variables z, I have ′ ′ ′ iv Zτ −m − eiv Zτ −m − iv Zτ −m,i − Zτ −m,i eiv Zτ −m e 5 ‖v‖2 Zτ −m,i − Zτ −m,i
2
.
(A.12)
A second order Taylor series expansion yields 2 m=1 k (m/p)nm
2 B1m (v) dW (v) = Op (p/n).
k2 (m/p)nm
2 B2m (v) dW (v) = Op (p/n).
Lemma A.1.
∑n−1
Lemma A.2.
∑n−1
m=1
′ Zτ −m,i = Zτ −m,i − gi′ (τ − m, θ0 ) ( θ − θ0 ) 1 − ( θ − θ0 )′ gi′′ (τ − m, θ )( θ − θ0 )
(A.13)
2
Lemma A.3.
∑n−1
k2 (m/p)nm
2 B3m (v) dW (v) = Op (p/n).
Lemma A.4.
∑n−1
k2 (m/p)nm
2 B4m (v) dW (v) = Op (p/n).
m=1
m=1
2 ∑n−1 2 Lemma A.5. B5m (v) dW (v) = Op (1). m=1 k (m/p)nm Lemma A.6.
∑n−1
m=1
k2 (m/p)nm
2 B6m (v) dW (v) = Op (p/n).
1 2 Now let an (m) = n− m k (m/p). In the following, I will show these lemmas above.
Proof of Lemma A.1. By the Cauchy–Schwarz inequality and the inequality that eiz1 − eiz2 5 |z1 − z2 | for any real-valued variables z1 and z2 , I have
n − 2 1 B1m (v) 5 n− ( Zτ ,i − Zτ ,i )2 m
1 n− m
n − 2 δτ (v)
τ =1
5 ‖v‖2
τ =1 n
1 n− m
− ( Zτ ,i − Zτ ,i )2 τ =1
2 .
for some θ between θ and θ0 , where gi′′ (τ , θ ) ≡ ∂θ∂∂θ ′ g (τ , θ ). Put ′ iv ′ Zτ . Then (A.12) and (A.3) imply that ξτ (v) = gi (τ , θ0 )e 2
′ ′ iv Zτ −m − eiv Zτ −m − ivξτ −m (v)( θ − θ0 ) e 2 2 5 ‖v‖2 Zτ −m,i − Zτ −m,i + ‖v‖ θ − θ0 sup gi′′ (τ − m, θ ) θ∈Θ0
where Θ0 is a neighborhood of θ0 . Henceforth, by (A.9), I obtain
n − nm B3m (v) 5 ‖v‖ θ − θ0 Zτ ,i ξτ −m (v) τ =m+1 + ‖v‖2
n − 2 Zτ ,i Zτ −m,i − Zτ −m,i
τ =m+1 n 2 − Zτ ,i sup g ′′ (τ − m, θ ) . + ‖v‖ θ − θ0 i
τ =m+1
θ ∈Θ0
Then it follows from Assumptions A.1–A.7 and (A.11) that
208
Z. Song / Journal of Econometrics 162 (2011) 189–212
n−1 ∫ −
2 ∫ n −1 − ′ × nm gi (τ , θ0 )ψτ −m (v) dW (v) τ =m+1 2 n √ 4 −1 − ′′ + 2 n( θ − θ0 ) n sup g (τ , θ )
2
k2 (m/p)nm B3m (v) dW (v)
m=1 n −1 2 −
√
n( θ − θ 0 )
5 4
k2 (m/p)
τ =1 θ∈Θ0
m=1
2 ∫ n −1 − × nm Zτ ,i ξτ −m (v) τ =m+1
×
n − √ 4 × ‖v‖2 dW (v) + 4 n( θ − θ0 ) n−1 Zτ2,i
n− 1
×
τ =1
sup gi′ (τ , θ )
]4 − n−1
×
θ ∈Θ0
an (m)
m=1
sup
τ =1
×
n
n [ −
−1
τ =1
∫ ×
sup gi′′ (τ , θ )
]2 − n −1
θ ∈Θ0
(A.14)
2
by the fact that E τ =m+1 Zτ ,i ξτ −m (v) 5 Cnm given E (Zτ ,i | Iτ −1 ) = 0 a.s. under H0 and Assumptions A.1 and A.3. Proof of Lemma A.4. By the Cauchy–Schwarz inequality,
2
− 2 1 B4m (v) 5 n− Zτ ,i m τ =m+1
5
− δτ (v) . τ =m+1
5
k (m/p) nm
(A.16)
2 ∫ n −1 − ′ k (m/p)E gi (τ , θ0 )ψτ −m (v) dW (v) nm m=1 τ =m+1 ∫ n − 1 n − 1 − − ‖ηm (v)‖2 dW (v) + C 5C an (m)
Zτ ,i
τ =m+1
m=1
r =−nm
5C
n −1 −
2
n −
nm nm − − ηm+|r | (−v) · ηm−|r | (v) + κm,|r |,m+|r | (v)
given Assumption A.7, where κm,l,r (v) is as in Assumption A.7. As a result, from (A.11) and (A.16), |k(·)| 5 1, and p/n → 0, I get
m=1
−1
i
r =−nm
′ v Zτ − Zτ , n−1 ∫ − 2 B4m (v) dW (v) k2 (m/p)nm
i
+
2
nm − Cov[g ′ (τ , θ0 ), g ′ (−r , θ0 )′ ] · |σr (v, −v)| r =−nm
δτ (v) 5 Then by this inequality, Cauchy–Schwarz again, and
n −1 −
2 n −1 − ′ gi (τ , θ0 )ψτ −m (v) − ηm (v) nm E nm τ =m+1
n 1 n− m
‖ηm (v)‖ 5 C
by Assumption A.7. Then expressing the moments in terms of cumulants by the well-known formulas (see Hannan, 1970, (5.1), page 23 for real-valued processes and also Stratonovich, 1963, Chapter 1 and Leonov and Shiryaev, 1959 for more details), I can obtain
an (m)
= Op (p/n)
n
∞ −
′
m=1
∑n
(A.15)
v∈Rd m=1
‖v‖2 dW (v)
dW (v)
where the last term is Op (p/T ) given (A.11) and the first term is Op (1), as is shown below: Put ηm (v) = E gi′ (τ , θ0 )ψτ −m (v) = Cov[gi′ (τ , θ0 ), ψτ −m (v)]. Then
n − √ 4 ‖v‖4 dW (v) + 4 n( θ − θ 0 ) n− 1 Zτ2,i
∫
an (m)
= Op (1) + Op (p/T )
τ =1
n [ −
∫
m=1
n −1 −
i
∫ n − 2 ‖v‖2 dW (v) Zτ − Zτ ×
2
m=1
m=1
= O(1) + O(p/n) = O(1).
τ =1
= Op (p/n)
Therefore the first term in (A.10) is Op (1).
2
= σ 2 nm by H0 , the
Proof of Lemma A.6. The proof is analogous to that of Lemma A.4.
Proof of Lemma A.5. By the second order Taylor series expansion in (A.13),
Proof of Proposition A.2. Given the decomposition in (A.9), I have
given (A.3) and (A.11), and E m.d.s. hypothesis of {Zτ }.
∑n
n −
1 − B5m (v) = ( θ − θ 0 ) ′ n− m
τ =m+1 Zτ ,i
gi′ (τ , θ0 )ψτ −m (v)
τ =m+1
1
+ ( θ − θ0 )
′
2
−1
nm
n −
gi (τ , θ )ψτ −m (v) ( θ − θ0 )
k (m/p)nm 2
∫
τ =m+1
2 B5m (v) dW (v)
5
6 − (1,0) Ba′ m (v) σm,i (0, v)
where Ba′ m (v) is defined in (A.9). By the Cauchy–Schwarz inequality, n −1 −
k (m/p)nm 2
m=1
m=1 n −1 √ 2 − 5 2 n( θ − θ 0 ) k2 (m/p) m=1
(A.17)
a′ =1
′′
for some θ between θ and θ0 . Then I have n−1 −
(1,0) σm,i (0, v) − σm(1,,i 0) (0, v) σm(1,,i 0) (0, v)∗
5
n −1 −
m=1
∫
(1,0) Ba′ m (v) · σ (0, v) dW (v)
k (m/p)nm 2
m,i
∫
1/2 2 Ba′ m (v) dW (v)
Z. Song / Journal of Econometrics 162 (2011) 189–212
×
n−1 −
1/2
k2 (m/p)nm
m=1
Let δq,τ = e
n−1 −
k2 (m/p)nm
m=1
a′ = 1, 2, 3, 4, 6,
∫ 2 (1,0) σm,i (0, v) dW (v) = Op (1)
k2 (m/p)nm
+ inm
1 + in− m
Zq,τ ,i δq,τ −m (v) − i nm
n −
δq,τ −m (v)
(Zτ ,i − Zq,τ ,i )ψq,τ −m (v) (Zτ ,i − Zq,τ ,i )
(A.18)
2
p
n −1 − ′ (1,0) (0, v) E nm gi (τ , θ0 )ψτ −m (v) σ m,i τ =m+1 2 1/2 [ n 2 ]1/2 − (1,0) −1 ′ E σm,i (0, v) 5 E nm gi (τ , θ0 )ψτ −m (v) τ =m+1 −1/2 1/2 5 C ‖ηm (v)‖ + Cn− nm m
− 12
A1,q 5 8p
6 − n−1 −
− 12
k (m/p)nm 2
∫
∫ n − 1 n k2 (m/p)nm E gi′ (τ , θ0 )ψτ −m (v) n− m m=1 τ =m+1 (1,0) × σm,i (0, v) dW (v) n −1 ∫ − ‖ηm (v)‖ dW (v) 5C m =1
a ′ =1 τ =1
= Op (p /qκ ) = op (1) given Assumption A.2, q/p → ∞, and κ Cauchy–Schwarz inequality, 1
1
p− 2 A2,q = 2p− 2
6 − n−1 −
k2 (m/p)nm Re
to show that p
A1,q → 0 and p
Ba′ mq (v)
× σq(,1m,0,i) (0, v)∗ dW (v) = Op (p 2 /qκ ) = op (1). 1
This completes the proof of Theorem A.2.
Proof of Theorem A.3. It is sufficient to show Propositions A.3 and A.4. (1,0)
(1,0)
Proposition A.3. Let σq,m (0, v) be defined as σm (0, v), and let C0q (p) be defined as C0 (p), with {Zq,τ }nτ =1 replacing {Zτ }nτ =1 . Then under the conditions of Theorem 1, n−1 −
k2 (m/p)nm
∫
(1,0) 2 σq,m (0, v) dW (v)
= p−1/2 C0q (p) + p−1/2 Vq + op (1)
Proof of Theorem A.2. The proof is similar to Theorem A.1. By the same reasoning as that of (A.4)–(A.7), we will consider only the case i = 1, . . . , d. Let A1,q and A2,q be defined in the same way as A1 and A2 in (A.8), with {Zq,τ }nτ =1 replacing { Zτ }nτ =1 . It is enough − 21
1. Further, by
m=1
m=1
p
∫
=
a ′ =1 τ =1
p−1/2
k2 (m/p) = O(1 + p/n1/2 )
2 Ba′ mq (v) dW (v)
1 2
and consequently n −1 −
say.
Following the same reasoning as that of Theorem A.1 and noting that E [Zτ | Iτ −1 ] = 0 a.s. and E [Zq,τ | Iτ −1 ] = 0 a.s., we have
by the m.d.s. property of {Zτ } under H0 and the fact that the first term in (A.18) is Op (1 + p/n1/2 ), as shown below: By (A.16) and Cauchy–Schwarz inequality, I have
− 21
ψq,τ −m (v)
τ =m+1
− B4mq (v) + B5mq (v) − B6mq (v) ,
σm,i (0, v) ≤ C nm E
given |k(·)| 5 1 and Assumption A.7.
B1mq (v) − B2mq (v) + B3mq (v) =i
given p → ∞ and p/n → 0, where I have used the fact that
+ Cn
n −
1 n− m
τ =m+1
i
= Op (1 + p/n1/2 ) + Op (p/n1/2 ) = op (p1/2 )
−
Zq,τ ,i
τ =m+1
n −
1 − i n− m
m=1
− 12
τ =m+1
∫ − (1,0) σm,i (0, v) dW (v) × k2 (m/p)
n −1
δq,τ −m (v)
τ =m+1
n−1
− 12
n −
−1
n −
× nm
k2 (m/p)nm
m=1
(1,0)
q,τ
τ =m+1
τ =m+1
∫ n −1 − ′ (1,0) × nm gi (τ , θ0 )ψτ −m (v) σm,i (0, v) dW (v) τ =m+1 n−1 2 −1 − ′′ + n θ − θ0 n sup g (τ , θ )
nm
n −
−1
m=1 θ ∈Θ0
(Zτ ,i − Zq,τ ,i )
n −
−1
τ =m+1
m,i
n −1 −
− i nm
(1,0) B5m (v) · σ (0, v) dW (v)
5 θ − θ0
n −
−1
−1
m=1
q,τ
σq(,1m,0) (0, v)
τ =m+1
by Markov’s inequality, the m.d.s. property of {Zτ } under H0 , and (A.9). Then consider the case a′ = 5. By Assumptions A.1–A.7, n −1 −
−e
and ψq,τ (v) = e
iv ′ Z
σm(1,,i 0) (0, v) − σq(,1m,0,i) (0, v) n − 1 = in− (Zτ ,i − Zq,τ ,i )δq,τ −m (v) m
∫
τ
iv ′ Zq,τ
given Lemmas A.1–A.4 and A.6, and p/n → 0, where p−1
209 iv ′ Z
− ϕq (v), where ϕq (v) = E [e ]. Let be defined as σm(1,0) (0, v) with {Zq,τ }nτ =1 replacing {Zτ }nτ =1 . Then similar to (A.9), I have
∫ 2 (1,0) σm,i (0, v) dW (v)
= Op (p1/2 /n1/2 )Op (p1/2 ) = op (p1/2 ),
iv ′ Z
p
A2,q → 0.
where
−
Vq =
Vq,a and
a=i and (i,j) with i,j=1,...,d
C0q (p) =
− a=i and (i,j) with i,j=1,...,d
C0q,a (p)
(A.19)
210
Z. Song / Journal of Econometrics 162 (2011) 189–212
and
then it suffices to prove that n −
V q ,a =
q −
Zq,τ ,a
an (m)
∫
m=1 τ =2q+2 τ− 2q−1 −
ψq,τ −m (v)
n
→ 0,
k2 (m/p)
m=1
n −1 −
1
n − m τ =m+1
∫
Zq2,τ ,a
|ψτ −m (v)|2 dW (v).
n− 1
Proposition A.4. Let D0q (p) be defined as D0 (p) with {Zq,τ }nτ =1 replacing {Zτ }nτ =1 . Then
−1/2 Vq →d N (0, 1). D0q (p) Proof of Proposition A.3. See the proof of Proposition A.3 in Song (2010). Proof of Proposition A.4. See the proof of Proposition A.4 in Song (2010). Proof of Theorem 4. It is sufficient to prove the following Theorems A.4 and A.5. 1
0 (p) − Theorem A.4. Under the conditions of Theorem 4, (p 2 n)[M M0 (p)] →p 0. Theorem A.5. Under the conditions of Theorem 4 and for a = i and ij, i, j = 1, . . . , d,
1
[
p 2 /n M0 (p) →p 2D
∞
∫
k4 (z )dz
∫ − n−1 m=1
(A.23)
By (A.11), (A.23), (A.8) and Cauchy–Schwarz inequality, it is sufficient for (A.22) to prove that n−1 A1 = op (1), where A1 is defined as in (A.8). Then (A.9) implies further that it is enough to show n
−1
∫ − n−1
m=1
for h = 1, . . . , 6. I first consider the case h = 1. By Cauchy–Schwarz inequality δτ (v) 5 2, and
n − 2 2 − 1 B1m (v) 5 nm Zτ ,i − Zτ ,i τ =m+1
n n − − 2 2 1 δτ (v) 5 n− Zτ ,i − Zτ ,i . m
× nm
τ =m+1
(A.20)
[
π
× −π
∫ − n
∫ − n−1
∫
∞
k4 (z )dz
]−1/2
τ =1
0
2 (0,1,0) (ω, 0, v) − f0(0,1,0) (ω, 0, v) dωdW (v). f
τ =m+1
The proof for case h = 2 is similar, noting that
2 n n − 2 −1 − 1 Zτ ,i − Zτ ,i . Zτ ,i − Zτ ,i 5 n− nm m τ =m+1 τ =m+1 Next consider the case h = 3. Still by the Cauchy–Schwarz inequality, I have
2 k2 (m/p)nm σm(1,0) (0, v)
m=1
(1,0) 2 − σm (0, v) dW (v) →p 0 (A.21) p −1 C0 (p) − C0 (p) = Op (1), and p−1 D0 (p) − D0 (p) = op (1), where C0 (p) and D0 (p) are defined in the same way as C0 (p) and D0 (p) in (3.5) with {Zτ }nτ =1 replacing { Zτ }nτ =1 . I focus on the proof of (A.21) to save space; the proofs for p−1 C0 (p) = Op (1), C0 (p) − D0 (p) − D0 (p) = op (1) are straightforward. Because and p−1
2 B3m (v) 5
(A.5) implies that
n− 1
n− 1
∫ − n
a=i and (i,j),i,j=1,...,d
n− 1
∫ − n
k2 (m/p)nm
m=1
2 (1,0) 2 × σm(1,,a0) (0, v) − σm,a (0, v) dW (v),
nm
n −
2
Zτ ,i
τ =1
5 ‖v‖2
1 n− m
n −
τ =1
n − 2 δτ −m (v)
1 n− m
τ =m+1
Zτ2,i
1 n− m
n − 2 Zτ ,i − Zτ ,i .
τ =m+1
It then follows that
∫ − n−1
2
k2 (m/p)nm B3m (v) dW (v)
m=1
5
−
−1
2 2 k2 (m/p)nm σm(1,0) (0, v) − σm(1,0) (0, v) dW (v)
m=1
=
2
B1m (v) dW (v) k2 (m/p)nm
[∫ ]2 n n − 2 − 5 Zτ ,i − Zτ ,i an (m) dW (v) = Op (p/n).
Proof of Theorem A.4. It suffices to show that n− 1
n− 1
m=1
p 2 /n M0 (p) →p 2D
∫∫
τ =1
Then it follows from (A.3) and (A.11) and Assumption A.6 that
π
1
2
Bhm (v) dW (v) = op (1), k2 (m/p)nm
−1
and therefore
(A.22)
2
(1,0)
for i = 1, . . . , d.
]−1/2
2 (0,1,0) (0,1,0) (ω, 0, v) − f0,a (ω, 0, v) dωdW (v) fa −π
×
k2 (m/p)nm σm,i (0, v) dW (v) = Op (1)
0
∫∫
We shall show this only for the case a = i with i = 1, . . . , d; the proofs for all other cases are similar. Since the proof of Theorem A.5 does not depend on Theorem A.4, it follows from (A.20) that
s=1
n −1 −
2 2 σm(1,,a0) (0, v) − k2 (m/p)nm σm(1,,a0) (0, v) dW (v)
for a = i and (i, j), i, j = 1, . . . , d.
p
C0q,a (p) =
∫ − n m=1
Zq,s,a ψq∗,s−m (v) dW (v)
×
−1
n
−1
n −
τ =1
× (m/p)
Zτ ,i
∫
= Op (p/n).
2
n
−1
n n−1 − 2 − Zτ ,i − Zτ ,i k2 τ =1
‖v‖ dW (v) 2
τ =1
Z. Song / Journal of Econometrics 162 (2011) 189–212
The proof for the cases h = 4, 5, 6 is similar to the case h = 3, noting that
2 n n − 2 −1 − 1 δτ (v) 5 n− δτ (v) . nm m τ =m+1 τ =m+1 This completes the proof of Theorem A.4.
Proof of Theorem 5. It is sufficient to prove the Theorems A.6 and A.7.
0 ( Theorem A.6. Under the conditions of Theorem 5, M p)− M0 ( p) = op (1). Theorem A.7. Under the conditions of Theorem 5, M0 ( p)− M0 (p) = op (1). Proof of Theorem A.6. Define n −1 −
∫ (1,0) 2 k (m/ p)nm σm (0, v) 2
m=1
(1,0) 2 − σm (0, v) dW (v). 1
1
Then it suffices to show that p− 2 B = op (1), p− 2 C0 ( p) − C0 ( p) =
1
op (1), and p−1 B = op (1) D0 ( p) − D0 ( p) = op (1). I only show p− 2 here to save space; the proof of the other two is similar. Since
−
n −1 −
a=i and (i,j),i,j=1,...,d
m=1
B =
k2 (m/ p)nm
∫ (1,0) 2 (1,0) 2 × σm,a (0, v) − σm,a (0, v) dW (v) −
≡
Ba
a=i and (i,j),i,j=1,...,d 1
then it is sufficient to show p− 2 Ba = op (1) for a = i and ij, i, j = 1, . . . , d. We shall only show this holds for a = i; the proofs for all other cases are similar. By the conditions on k(·) implied by Assumption A.5, there exists a symmetric monotonic decreasing function k0 (z ) for z > 0 such that |k(z )| 5 k0 (z ) for all z > 0 and k0 (·) satisfies Assumption A.5. Then for any constants ϵ, η > 0,
1
P p− 2 Bi > ϵ
1 ≤ P p− 2 Bi > ϵ, | p/p − 1| ≤ η + P (| p/p − 1| > η)
where the second term vanishes asymptotically for all η > 0 and given p/p − 1 →p 0. Therefore, by the definition of convergence in probability, it remains to show that the first terms also vanishes as n → ∞. Because | p/p − 1| ≤ η implies p ≤ (1 + η)p, for | p/p − 1| ≤ η 1
1
1
p− 2 Bi ≤ (1 + η) 2 [(1 + η)p]− 2
n −1 −
Proof of Theorem A.7. The proof is a straightforward extension of that for Theorem A.7 of Hong and Lee (2005) which follows a reasoning analogous to the proof of Hong (1999, Theorem 4). Note that the latter uses an assumption which is exactly the same as Assumption A.9. That is, Assumption A.9 is also used in this proof.
Proof of Theorem A.5. The proof is a straightforward extension for that of Hong (1999, Proof of Theorem 5), for the case (m, l) = (1, 0) and W1 (·) = δ(·), the Dirac delta function. I omit it here to save space. Note that Assumption A.8 is needed here.
B ≡
211
k20 [m/(1 + η)p]
m=1
[ 2 2 ] (1,0) (1,0) × nm σm,i (0, v) − σm,i (0, v) →p 0 for any η > 0 given (A.9), where the inequality follows from |k(z )| 5 k0 (z ) for all z > 0. This completes the proof of Theorem A.6.
References Ahn, D., Gao, B., 1999. A parametric nonlinear model of term structure dynamics. Review of Financial Studies 12, 721–762. Ait-Sahalia, Y., 1996a. Testing continuous-time models of the spot interest rate. Review of Financial Studies 9, 385–426. Ait-Sahalia, Y., 1996b. Nonparametric pricing of interest rate derivative securities. Econometrica 64, 527–560. Ait-Sahalia, Y., 1999. Transition densities for interest rate and other nonlinear diffusions. Journal of Finance 54, 1361–1395. Ait-Sahalia, Y., 2002. Maximum-likelihood estimation of discretely sampled diffusions: a closed-form approach. Econometrica 70, 223–262. Ait-Sahalia, Y., 2008. Closed-form likelihood expansions for multivariate diffusions. Annals of Statistics 36, 906–937. Ait-Sahalia, Y., Fan, J., Peng, H., 2009. Nonparametric transition-based tests for diffusions. Journal of the American Statistical Association 104, 1102–1116. Ait-Sahalia, Y., Hansen, L., Scheinkman, J., 2004. Operator methods for continuoustime Markov processes. In: Handbook of Financial Econometrics. Ait-Sahalia, Y., Mykland, P., Zhang, L., 2005. A tale of two time scales: determining integrated volatility with noisy high-frequency data. Journal of the American Statistical Association 100, 1394–1411. Andersen, T.G., Bollerslev, T., Diebold, F.X., Labys, P., 2003. Modeling and forecasting realized volatility. Econometrica 71, 579–625. Andersen, T.G., Bollerslev, T., Dobrev, D., 2007. No-arbitrage semi-martingale restrictions for continuous-time volatility models subject to leverage effects, jumps and i.i.d. noise: Theory and testable distributional implications. Journal of Econometrics 138 (1), 125–180. Bahadur, R.R., 1960. Stochastic comparison of tests. Annals of Mathematical Statistics 31, 276–295. Bandi, F.M., Phillips, P.C.B., 2003. Fully nonparametric estimation of scholar diffusion models. Econometrica 71, 241–283. Bandi, F.M., Phillips, P.C.B., 2007. A simple approach to the parametric estimation of potentially nonstationary diffusions. Journal of Econometrics 137 (2), 354–395. Barndorff-Nielsen, O.E., Shephard, N., 2004. Power and bipower variation with stochastic volatility and jumps. Journal of Financial Econometrics 2, 1–37. Brandt, M., Santa-Clara, P., 2002. Simulated likelihood estimation of diffusions with an application to exchange rate dynamics in incomplete markets. Journal of Financial Economics 63, 161–210. Brillinger, D.R., Rosenblatt, M., 1967. Asymptotic theory of estimates of kth order spectra. In: Harris, B. (Ed.), Spectral Analysis of Time Series. Wiley, New York. Chan, K.C., Karolyi, G.A., Longstaff, F.A., Sanders, A.B., 1992. An empirical comparison of alternative models of the short-term interest rate. Journal of Finance 47, 1209–1227. Chapman, D., Pearson, N., 2000. Is the short rate drift actually nonlinear? Journal of Finance 55, 355–388. Chen, S.X., Gao, J., Tang, C., 2008. A test for model specification of diffusion processes. The Annals of Statistics 36, 167–198. Chen, B., Hong, Y., 2010. Characteristic function-based testing for multifactor continuous-time models via nonparametric regression. Econometric Theory 26, 1115–1179. Conley, T.G., Hansen, L.P., Luttmer, E.G.J., Scheinkman, J.A., 1997. Short-term interest rates as subordinated diffusions. Review of Financial Studies 10, 525–578. Corradi, V., Swanson, N.R., 2005. Bootstrap specification tests for diffusion process. Journal of Econometrics 124, 117–148. Corradi, V., White, H., 1999. Specification tests for the variance of a diffusion. 
Journal of Time Series Analysis 20, 253–270. Cox, J.C., Ingersoll, J.E., Ross, S.A., 1985. A theory of the term structure of interest rates. Econometrica 53, 385–407. Dai, Q., Singleton, K.J., 2000. Specification analysis of affine term structure models. Journal of Finance 55, 1943–1978. Duffie, D., Pedersen, L., Singleton, K., 2003. Modeling credit spreads on sovereign debt: a case study of Russian bonds. Journal of Finance 55, 119–159. Dynkin, E.B., 1965. Markov Processes, vol. I. Springer-Verlag, 32. Gallant, A.R., Tauchen, G., 1996. Which moments to match? Econometric Theory 12, 657–681. Hannan, E., 1970. Multiple Time Series. Wiley, New York. Hansen, L.P., Scheinkman, J.A., 1995. Back to the future: generating moment implications for continuous time Markov processes. Econometrica 63, 767–804. Hansen, L.P., Scheinkman, J.A., 2003. Semigroup pricing. Manuscript. Hansen, L.P., Scheinkman, J.A., Touzi, Nizar, 1998. Spectral methods for identifying scalar diffusions. Journal of Econometrics 86 (1), 1–32. Hong, Y., 1999. Hypothesis testing in time series via the empirical characteristic function: a generalized spectral density approach. Journal of the American Statistical Association 94, 1201–1220. Hong, Y., Lee, Y., 2005. Generalized spectral tests for conditional mean specification in time series with conditional heteroskedasticity of unknown form. Review of Economic Studies 72, 499–541.
212
Z. Song / Journal of Econometrics 162 (2011) 189–212
Hong, Y., Li, H., 2005. Nonparametric specification testing for continuous-time models with applications to term structure of interest rates. Review of Financial Studies 18, 37–84. Jiang, G., Knight, J., 2002. Estimation of continuous-time processes via the empirical characteristic function. Journal of Business and Economic Statistics 20, 198–212. Johannes, M., 2004. The statistical and economic role of jumps in interest rates. Journal of Finance 59, 227–260. Kallenberg, O., 2002. Foundations of Modern Probability, 2nd ed. Springer-Verlag. Kanaya, S., 2007. Non-parametric specification testing for continuous-time Markov processes: do the processes follow diffusions? Manuscript, University of Wisconsin. Karatzas, I., Shreve, St.E., 1991. Brownian motion and stochastic calculus. In: Graduate Texts in Mathematics, vol. 113. Springer-Verlag, New York. Kessler, M., Sorensen, M., 1996. Estimating equations based on eigenfunctions for a discretely observed diffusion process. Bernoulli 5 (2), 299–314. Kristensen, D., 2008. Pseudo-maximum-likelihood estimation in two classes of semiparametric diffusion models. Working Paper, Columbia University. Leonov, V.P., Shiryaev, A.N., 1959. On a method of calculation of semi-invariants. Theory of Probability and its Applications 4, 319–329. (English translation). Li, F., 2007. Testing the parametric specification of the diffusion function in a diffusion process. Econometric Theory 23, 221–250. Lo, A.W., 1988. Maximum likelihood estimation of generalized ito processes with discretely sampled data. Econometric Theory 4, 231–247. Nelson, D.B., 1990. ARCH models as diffusion approximations. Journal of Econometrics 45, 7–38. Pan, J., 2002. The jump-risk premia implicit in options: evidence from an integrated time series study. Journal of Financial Economics 63, 3–50. Park, J., 2008. Martingale regression and time change. Working Paper. Park, J., Whang, Y.J., 2003. Testing for the martingale hypothesis. Working Paper. Priestley, M.B., 1981. Spectral Analysis and Time Series. Academic Press, London. Pritsker, M., 1998. Nonparametric density estimation and tests of continuous time interest rate models. Review of Financial Studies 11, 449–487. Protter, P., 2005. Stochastic Integration and Differential Equations, second ed. Springer-Verlag, Heidelberg, Version 2.
Revuz, D., Yor, M., 2005. Continuous Martingales and Brownian Motion. In: Grundlehren der Mathematischen Wissenschaften, vol. 293. SpringerVerlag, Berlin. Rogers, L.C.G., Williams, D., 2000. Diffusions, Markov Processes and Martingales, 2nd ed. vol. 1. Cambridge University Press. Sam, G., Jiang, G., 2009. Nonparametric estimation of the short rate diffusion from a panel of yields. Journal of Financial and Quantitative Analysis 44 (5), 1197–1230. Schoutens, W., 2003. Levy Processes in Finance. Wiley, New York. Serfling, R.J., 1980. Approximation Theorems of Mathematical Statistics. John Wiley. Singleton, K., 2001. Estimation of affine asset pricing models using the empirical characteristic function. Journal of Econometrics 102, 111–141. Song, Z., 2009a. Estimating semi-parametric diffusion models with unrestricted volatility via infinitesimal operator based characterization. Working Paper, Cornell University. Song, Z., 2009b. An infinitesimal operator based specification test for the drift function in multivariate diffusion models using realized volatility and covariance. Working Paper, Cornell University. Song, Z., 2010. A martingale approach for testing diffusion models based on infinitesimal operator: techniqual appendix. Manuscript. Stanton, R., 1997. A nonparametric model of term structure dynamics and the market price of interest rate risk. Journal of Finance 52, 1973–2002. Stratonovich, R.L., 1963. Topics in the Theory of Random Noise, vol. 1. Gordon and Breach, New York., (English translation). Stroock, D.W., Varadhan, S.R.S., 1969. Diffusion processes with continuous coefficients I & II. Communications on Pure and Applied Mathematics 22, 345–400. 479–530. Takamizawa, H., 2008. Is nonlinear drift implied by the short end of the term structure? Review of Financial Studies 21 (1), 311–346. Tauchen, G., 1997. New minimum chi-square methods in empirical finance. In: Kreps, D., Wallis, K. (Eds.), Advances in Econometrics: Seventh World Congress. Cambridge University Press, Cambridge, UK. Vasicek, O., 1977. An equilibrium characterization of the term structure. Journal of Financial Economics 5, 177–188. Zähle, H., 2008. Weak approximations of SDEs by discrete-time processes. Journal of Applied Mathematics and Stochastic Analysis.
Journal of Econometrics 162 (2011) 213–224
Contents lists available at ScienceDirect
Journal of Econometrics journal homepage: www.elsevier.com/locate/jeconom
A bootstrap-assisted spectral test of white noise under unknown dependence Xiaofeng Shao ∗ Department of Statistics, University of Illinois at Urbana-Champaign, 725 South Wright St, Champaign, IL, 61820, United States
article
abstract
info
Article history: Received 16 October 2008 Received in revised form 24 December 2010 Accepted 20 January 2011 Available online 27 January 2011
To test for the white noise null hypothesis, we study the Cramér–von Mises test statistic that is based on the sample spectral distribution function. Since the critical values of the test statistic are difficult to obtain, we propose a blockwise wild bootstrap procedure to approximate its asymptotic null distribution. Using a Hilbert space approach, we establish the weak convergence of the difference between the sample spectral distribution function and the true spectral distribution function, as well as the consistency of bootstrap approximation under mild assumptions. Finite sample results from a simulation study and an empirical data analysis are also reported. © 2011 Elsevier B.V. All rights reserved.
JEL classification: C12 C20 Keywords: Hypothesis testing Spectral distribution function Time series White noise Wild bootstrap
1. Introduction This article is concerned with testing if a sequence of observations is generated from a white noise process. Testing for white noise, or lack of serial correlation, is a classical problem in time series analysis and it is an integral part of the Box–Jenkins linear modeling framework. For example, if the series is white noise, then there is no need to fit an ARMA-type linear time series model to the data, which is mainly used to model nonzero auto-correlations. We consider a stationary real-valued time series Xt , t ∈ Z. Denote by E(Xt ) = µ and γ (k) = cov(Xt , Xt +k ). Then the null and alternative hypothesis are H0 : γ (k) = 0,
k∈N
versus H1 : γ (k) ̸= 0
for some k ∈ N. Let f (w) = (2π )−1
∞ −
γ (k)e−ikw ,
w ∈ [−π , π] and
k=−∞
F (λ) =
λ
∫
f (w)dw,
λ ∈ [0, π]
0
be the spectral density function and spectral distribution function respectively. Under the null hypothesis, f (λ) = γ (0)/(2π ) is a
∗
Tel.: +1 217 2447285; fax: +1 217 2447190. E-mail address:
[email protected].
0304-4076/$ – see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.jeconom.2011.01.001
constant over [−π , π]. There is a huge literature on the white noise testing problem and the existing tests can be roughly categorized into two types: time domain correlation-based tests and frequency domain periodogram-based tests. One of the challenges for the existing methods is that the asymptotic null distributions of the test statistics have been obtained under the independent and identically distributed (iid) assumption, and they may be invalid when the series is uncorrelated but dependent. The distinction between iid and uncorrelated dependence is important in view of the popularity of conditional heteroscedastic models (e.g. GARCH models) as used in econometric and financial time series modeling. Also there are other nonlinear processes that are uncorrelated but not independent; see Remark 2.2 for more examples. The discrepancy in asymptotic null distributions can be attributed to the unknown dependence of the series, which leads to severe size distortion if we use the critical values of the asymptotic null distribution obtained under iid assumptions. This phenomenon was first discovered for the popular Box and Pierce’s (1970) (BP, hereafter) portmanteau test by Romano and Thombs (1996), who advocated the use of nonparametric bootstrap and subsampling methods to approximate the sampling distribution of the BP test statistic. Also see Lobato (2001), Lobato et al. (2002) and Horowitz et al. (2006) for more recent contributions along this line. The BP test statistic and its variants only test for zero correlations up to lag m, where m is a prespecified integer, so they have trivial power against correlations at lags beyond m. Recently, Escanciano and Lobato (2009) proposed a data-driven BP test for serial correlation. Their work is interesting as it avoids
214
X. Shao / Journal of Econometrics 162 (2011) 213–224
any user-chosen number (say, m), does not involve the use of bootstrap and is shown to have good size and power performance. However, the asymptotic null distribution of their test statistic was established under the martingale difference assumption and it is unclear whether the limiting theory for their test still holds under the weaker white noise hypothesis. For time series models that are white noise but are not martingale difference, see Remark 2.2. The main focus of this paper is on the test statistic that is based λ on the sample spectral distribution function Fn (λ) = 0 In (w)dw , where In (w) = (2π n)−1
n −
|(Xt − X¯ )eit w |2
t =1
2. Test statistics and asymptotic theory illustrate the idea, we note the relationship In (w) = (2π ) ∑nTo ∑ −1 ˆ (k)e−ikw , where γˆ (j) = n−1 nt=1+|j| (Xt −X¯ )(Xt −|j| −X¯ ) is k=1−n γ the sample auto-covariance function at lag j = 0, ±1, . . . , ±(n − ∑n−1 ∑∞ 1). Write Fn (λ) = ˆ (j)Ψj (λ) and F (λ) = j=0 γ j=0 γ (j)Ψj (λ), where Ψj (λ) = sin(jλ)/(jπ )1 (j ̸= 0) + λ/(2π )1 (j = 0). Under H0 , F (λ) = γ (0)Ψ0 (λ), so we can construct a test based on the distance between Fn (λ) and γˆ (0)Ψ0 (λ), the latter of which is a natural estimator of F (λ) under the null hypothesis. This leads to −1
the process Sn (λ) =
√
n{Fn (λ) − γˆ (0)Ψ0 (λ)} =
n−1 − √
nγˆ (j)Ψj (λ).
j =1
∑n
is the periodogram and X¯ = n−1 t =1 Xt is the sample mean. The early work on spectra-based tests dates back to Bartlett (1955), who proposed famous Up and Tp processes in the frequency domain. A rigorous asymptotic theory was obtained by Grenander and Rosenblatt (1957) for the iid processes. More recent developments can be found in Durlauf (1991), Anderson (1993) and Deo (2000), to name a few. On one hand, it has been well-established in the literature that the spectra-based test belongs to the test of omnibus type, and it has nontrivial power against local alternatives that are within n−1/2 -neighborhoods of the null hypothesis. On the other hand, as demonstrated in Deo (2000), the spectrabased test no longer has the usual asymptotic distribution under iid assumption when dependence (e.g. conditional heteroscedasticity) is present. To account for unknown conditional heteroscedasticity, the latter author proposed a modified test statistic and justified its use for a subclass of GARCH processes and stochastic volatility models. However, the validity of Deo’s modification heavily relies on certain assumptions on the data generating process, which exclude some other commonly used GARCH models (e.g. EGARCH model) and uncorrelated nonlinear processes. Also those assumptions are difficult to check in practice. In this paper, we propose a blockwise wild bootstrap approach to approximating the asymptotic null distribution of the spectrabased test statistic. The blockwise wild bootstrap is a simple variant of the well-known wild bootstrap (Wu, 1986) in that the auxiliary variables are generated in a blockwise fashion. We establish the consistency of the bootstrap approximation using the Hilbert space approach, which allows us to get the limiting distribution of the Cramér–von Mises statistic. Assisted with the blockwise wild bootstrap, our test is asymptotically valid for a very wide class of processes, including GARCH processes of various forms and other uncorrelated linear or nonlinear processes. We now introduce some notation. ∑ For a column vector x = (x1 , . . . , xq )′ ∈ Rq , let |x| = ( qj=1 x2j )1/2 . Let ξ be a random vector. Write ξ ∈ Lp (p > 0) if ‖ξ ‖p := [E(|ξ |p )]1/p < ∞ and let ‖ · ‖ = ‖ · ‖2 . The positive constant C is generic and it may vary from line to line. Denote by ‘‘→D ’’ and ‘‘→p ’’ convergence in distribution and in probability, respectively. The symbols OP (1) and op (1) signify being bounded in probability and convergence to zero in probability, respectively. The paper is structured as follows. In Section 2, we present the spectra-based test and establish the weak convergence of the difference between the sample spectral distribution function and its population counterpart. Section 3 proposes a blockwise wild bootstrap approach and proves its consistency. We illustrate the finite sample performance of the test using a Monte Carlo experiment and real data sets in Section 4. Section 5 concludes. Proofs are gathered in the Appendix.
Let Yt = Xt − µ. Under the null hypothesis, we expect that Sn (λ) weakly converges to a Gaussian process S (λ) in C [0, π] (the space of continuous functions on [0, π]), where S (λ) has mean zero and covariance function cov{S (λ), S (λ′ )}
=
∞ − ∞ − ∞ −
cov(Yt Yt −j , Yt −d Yt −d−k )Ψj (λ)Ψk (λ′ ).
(1)
j=1 k=1 d=−∞
The continuous mapping theorem yields the limiting distributions of the following two well-known statistics: Kolmogorov–Smirnov (K–S) statistic sup |Sn (λ)| →D sup |S (λ)|,
λ∈[0,π]
λ∈[0,π]
(2)
Cramér von–Mises (C–M) statistic π
∫ CMn =
Sn2 (λ)dλ →D
π
∫
0
S 2 (λ)dλ.
(3)
0
The covariance structure of S (λ) can be simplified considerably if the series is iid Gaussian under H0 , and one may be able to exploit the Gaussian assumption to consistently approximate the limiting distributions of the K–S and C–M statistics. But for dependent white noise, such as a GARCH process, the critical values π of supλ∈[0,π] |S (λ)| and 0 S 2 (λ)dλ are typically hard to obtain. To circumvent the difficulty, Deo (2000) proposed a modified version of K–S and C–M statistics for the standardized version of Sn (λ), where γˆ (j) is replaced by its autocorrelation counterpart ρ( ˆ j) = γˆ (j)/γˆ (0). He established the validity of the modified test statistic under the assumptions that ensure asymptotic independence of ρ( ˆ j) and ρ( ˆ k), j ̸= k. His method seems no longer (asymptotically) valid if the asymptotic independence does not hold and also his assumptions are hard to check in practice. Therefore, it is desirable to come up with a general solution that is capable of approximating the limiting null distributions of K–S and C–M statistics. To facilitate asymptotic analysis, we shall adopt the Hilbert space approach to establish the weak convergence of Sn . As demonstrated in Escanciano and Velasco (2006), the Hilbert space approach is superior to the usual ‘‘sup’’ norm approach in proving the weak convergence due to its lack of theoretical complication in general settings. In particular, the proof of the tightness using the Hilbert space approach is simpler than that for the ‘‘sup’’ norm approach. Here, we regard Sn as a random element in the Hilbert space L2 [0, π] of all square integrable functions (with respect to the Lebesgue measure) with the inner product
⟨f , g ⟩ =
∫ [0,π]
f (λ)g c (λ)dλ,
where g c (λ) denotes the complex conjugate of g (λ). Note that L2 [0, π] is endowed with the natural Borel σ -field induced by the norm ‖f ‖ = ⟨f , f ⟩1/2 ; see Parthasarathy (1967). The Hilbert
X. Shao / Journal of Econometrics 162 (2011) 213–224
space approach has been utilized in time series analysis by Politis and Romano (1994), Chen and White (1996, 1998) and Escanciano and Velasco (2006), among others. From both methodological and technical perspectives, this paper is closely related to Escanciano and Velasco (2006), who proposed a generalized spectral distribution function type test for the null hypothesis of martingale difference and adopt the wild bootstrap to get a consistent approximation of the asymptotic null distribution. In our testing problem, a blockwise wild bootstrap is necessary to yield consistent approximation and our technical developments are considerably different from theirs since our white noise null hypothesis is weaker than their martingale difference hypothesis. To obtain the weak convergence of Sn in L2 [0, π], we need to impose some structural assumption on the process Xt . Throughout, we assume that Xt = G(. . . , εt −1 , εt ),
t ∈ Z,
(4)
where G is a measurable function for which Xt is a well-defined random variable. The class of processes that (4) represents is huge; see Rosenblatt (1971), Kallianpur (1981), and Tong (1990, p. 204). Let (εk′ )k∈Z be an iid copy of (εk )k∈Z ; define the physical dependence measure (Wu, 2005) δq (t ) = ‖Xt − Xt∗ ‖q , q ≥ 1, where Xt∗ = G(. . . , ε−1 , ε0′ , ε1 , . . . , εt ), t ∈ N. For ξ ∈ L1 define projection operators Pk ξ = E(ξ |Fk ) − E(ξ |Fk−1 ), k ∈ Z, where Fk = (. . . , εk−1 , εk ). Theorem 2.1. Assume the process Xt admits the representation (4), Xt ∈ L4 and ∞ −
δ4 (k) < ∞.
(5)
k=0
Further we assume that ∞ −
|cum(X0 , Xk1 , Xk2 , Xk3 )| < ∞.
(6)
k1 ,k2 ,k3 =−∞
Then Sn (λ)− E{Sn (λ)} ⇒ S (λ) in L2 [0, π], where S (λ) is a mean zero Gaussian process with the covariance specified in (1) and ‘‘ ⇒’’ stands for weak convergence in the Hilbert space L2 [0, π] endowed with the norm metric. Corollary 2.1. Under the assumptions of Theorem 2.1, (3) holds under H0 . Under H1 , we have that CMn n
→p
∞ − j=1
γ 2 (j)
π
∫
Ψj2 (λ)dλ.
(7)
0
Thus the C–M test is consistent since at least one γ (j) ̸= 0 for j = 1, 2, . . . under the alternative. The Hilbert space approach is very natural for studying the asymptotic properties of the statistic of Cramér–von Mises type. Unfortunately, it is not applicable to the statistic of Kolmogorov–Smirnov type, since the ‘‘sup’’ functional is no longer a continuous mapping from L2 [0, π] to R. This is a price we pay for the reduced technicality of the Hilbert space approach as compared to the ‘‘sup’’ norm approach.
γ (k) = cov(X0 , Xk ) = E
− j∈Z
=
− j∈Z
By the Cauchy–Schwarz inequality, we have
−
|γ (k)| ≤
E(Pj X0 Pj Xk ).
Pj X0
− j′ ∈Z
Pj′ Xk
−
|E(Pj X0 Pj Xk )|
j,k∈Z
k∈Z
2
≤
−
‖Pj X0 ‖‖Pj Xk ‖ =
−
j,k∈Z
‖P0 Xj ‖
.
j∈Z
By Theorem 1 of Wu (2005), ‖P0 Xj ‖ ≤ δ2 (∑ j) ≤ δ4 (j) for j ≥ 0. ∞ Therefore, the assumption (5) implies that k=−∞ |γ (k)| < ∞, i.e., Xt is a short memory process. Remark 2.2. A natural question is how to verify the conditions (5) and (6) for time series models. It turns out that both (5) and (6) are implied by the GMC(4) condition; see Wu (2005) and Wu and Shao (2004). Let (εk′ )k∈Z be an iid copy of (εk )k∈Z ; let Xn′ = ′ ′ G(. . . , ε− 1 , ε0 , ε1 , . . . , εn ) be a coupled version of Xn . We say that Xn is GMC(α), α > 0, if there exist C > 0 and ρ = ρ(α) ∈ (0, 1) such that
E(|Xn − Xn′ |α ) ≤ C ρ n ,
n ∈ N.
(8)
The property (8) indicates that the process {Xn } forgets its past exponentially fast, and it can be verified for many nonlinear time series models, such as threshold model (Tong, 1990), bilinear model (Subba Rao and Gabr, 1984), various forms of GARCH models (Bollerslev, 1986; Ding et al., 1993); see Wu and Min (2005) and Shao and Wu (2007) for more details. Recently, there is a surge of interest in conditional heteroscedastic models (e.g. GARCH models) due to its empirical relevance in modeling financial time series. The GARCH process is a martingale difference with time varying conditional variance, so it is uncorrelated but dependent. In addition, there are a few commonly used models that are uncorrelated but are not martingale differences (MDS). Examples include all-pass ARMA models (Breidt et al., 2001), certain forms of bilinear models (Granger and Anderson, 1978) and nonlinear moving average models (Granger and Teräsvirta, 1993); see Lobato et al. (2002). These models also satisfy GMC conditions, as demonstrated in Shao (in press). To study the local power, we define the spectral density under the local alternative as
√
H1n : fn (w) = (2π )−1 γ (0)(1 + g (w)/ n),
w ∈ [−π , π],
(9)
where g is a symmetric and 2π -periodic function that satisfies π g (w) dw = 0, which ensures that fn is a valid spectral −π density function for large n. In this case, we have γn (j) √ = π ijw −π fn (w)e dw = γ (0)1 (j = 0) + 1 (j ̸= 0)γ (0)/(2π n)
π
−π
g (w)eijw dw . So under the local alternative,
Sn (λ) =
n −1 − √
n −1 − √
j =1
j=1
n{γˆ (j) − γn (j)}Ψj (λ) +
nγn (j)Ψj (λ),
where the first term converges to S (λ) in L2 [0, π] and the λ second term approaches γ (0)/(2π ) 0 g (w)dw . This implies
π
Remark 2.1. Note that
215
λ
CMn →D 0 {S (λ) + γ (0)/(2π ) 0 g (w)dw}2 dλ. It is not hard to show that limn→∞ P (CMn rejects H0 |H1n ), which is the limiting λ power of the C–M test, approaches 1 as ‖ 0 g (w)dw‖ goes to ∞. This suggests that the test has nontrivial power against local alternative that is within n−1/2 -neighborhood of the null hypothesis.
216
X. Shao / Journal of Econometrics 162 (2011) 213–224
the bootstrapped test statistic CM∗n converges to distribution in probability.
3. Nonparametric blockwise wild bootstrap In this section, we shall use a blockwise wild bootstrap method to approximate the limiting null distribution of CMn . We describe the steps involved in forming bootstrapped statistic CM∗n as follows: 1. Set a block size bn , s.t. 1 ≤ bn < n. Denote the blocks by Bs = {(s − 1)bn + 1, . . . , sbn }, s = 1, . . . , Ln , where the number of blocks Ln = n/bn is assumed to be an integer for the convenience of presentation. 2. Take iid random draws δs , s = 1, 2, . . . , Ln , independent of the data, from a common distribution W , where E(W ) = 0, E(W 2 ) = 1 and E(W 4 ) < ∞. Define the auxiliary variables wt = δs , if t ∈ Bs , for ∑nt = 1, . . . , n. 3. Let γˆ ∗ (j) = n−1 t =j+1 {(Xt − X¯ )(Xt −j − X¯ ) − γˆ (j)}wt for j = 1, . . . , n − 1. Define the bootstrapped process Sn∗ (λ) =
n −1 √ −
n
γˆ ∗ (j)Ψj (λ).
j =1
π
4. Compute the bootstrapped test statistic CM∗n = 0 {Sn∗ (λ)}2 dλ. 5. Repeat steps 2 and 3 M times and denote by CM∗n,α the empirical 100(1 −α)% sample percentile of CM∗n based on M bootstrapped values. Then we reject the null hypothesis at the significance level α if CMn > CM∗n,α . 1 Throughout the paper, we assume that b− + bn /n → 0 n as n → ∞. As seen from the above procedure, the blockwise wild bootstrap is a simple variant of the traditional wild bootstrap (Wu, 1986; Liu, 1988; Mammen, 1993), which was originally proposed to deal with independent and heteroscedastic errors in the regression problems. The main difference from the traditional wild bootstrap is that the auxiliary variables {wt }nt=1 are the same within each block of size bn and iid across different blocks. This helps to capture the dependence between {(Xt − X¯ )(Xt −j − X¯ )} and {(Xt ′ − X¯ )(Xt ′ −k − X¯ )} for j, k = 1, . . . , n − 1 when t , t ′ belong to the same block. Here we mention that the wild bootstrap has been widely used in the context of specification testing to achieve a consistent approximation of the limiting null distribution of the test statistics that depends on the data generating mechanism; see Stute et al. (1998), Li et al. (2003) and Escanciano and Velasco (2006), among others. No centering such as the one appeared in γˆ ∗ (j) has been used in the above-mentioned papers. Let dw be any metric that metricizes weak convergence in L2 [0, π]; see Politis and Romano (1994). To describe the convergence of CM∗n , conditional on the sample, we adopt the notion ‘‘in distribution in probability’’ introduced by Li et al. (2003). Given the observations {Xt }nt=1 and a statistic ξn , L(ξn |X1 , X2 , . . .) (i.e., the distribution of ξn given X1 , X2 , . . .) is said to converge to L(ξ ) in distribution in probability if for any subsequence ξn′ , there exists a further subsequence ξn′′ such that L(ξn′′ |X1 , X2 , . . .) converges to L(ξ ) for almost every sequence (X1 , X2 , . . .). Denote by P ∗ , E∗ , var∗ and cov∗ the probability, expectation, variance and covariance conditional on the data.
Theorem 3.1. Assume (5) and ∞ −
|kj ||cum(X0 , Xk1 , . . . , XkJ )| < ∞,
j = 1, . . . , J ,
(10)
k1 ,...,kJ =−∞
for J = 1, . . . , 7. Under the null hypothesis, under any fixed alternative hypothesis, or under the local alternatives (9), we have that dw [L{Sn∗ (λ)|Xn }, L{S (λ)}] → 0
(11)
in probability as n → ∞, where L(Sn (λ)|Xn ) stands for the distribution of Sn∗ (λ) given the sample Xn . Consequently, ∗
π 0
S 2 (λ)dλ in
Summability conditions on joint cumulants (e.g. (10)) are common in spectral analysis; see Brillinger (1975). ∑ For a linear ∑ process Xt = j∈Z aj εt −j with εj being iid, (10) holds if j∈Z |jaj | <
∞ and ε1 ∈ L8 . If Xt admits the form (4), then (10) is true provided that Xt satisfies GMC(8) (see Wu and Shao, 2004, Proposition 2). The assumption of eighth moment condition is a bit strong, but seems necessary in our technical argument. Remark 3.1. If the process Xt is a sequence of martingale differences satisfying the conditions of Theorem 3.1, then cov{S (λ), S (λ′ )} =
∞ −
E(Yt2 Yt −j Yt −k )Ψj (λ)Ψk (λ′ )
j,k=1
(compare (1)) and taking bn = 1 (i.e., using traditional wild bootstrap) still leads to a consistent bootstrap approximation to the limiting null distribution of CMn . Since the proof is basically a repetition of the argument in the proof of Theorem 3.1, we omit the details. Therefore, in view of Theorem 2.1, Corollary 2.1 and the discussions at the end of Section 2, Theorem 3.1 suggests that the bootstrap-assisted test (i.e. use bootstrapped critical values) has a correct asymptotic level, is consistent, and has nontrivial power to detect the alternatives tending to the null hypothesis at n−1/2 rate. 4. Finite sample performance In the section, we shall investigate the finite sample performance of our bootstrap-assisted test in comparison with three other methods: (i) Deo’s corrected C–M test; (ii) The QK test as proposed in Lobato et al. (2002) to account for dependence; (iii) Subsampling-based test (Politis et al., 1999). Deo (2000) reported through simulation studies that the uncorrected standardized C–M test statistic has severe size distortion when conditional heteroscedasticity is present, the corrected standardized C–M test statistic retains the nominal size even when the correction is unnecessary, and using the corrected statistic does not result in a loss of power compared to its uncorrected counterpart. So we do not include the uncorrected standardized C–M test statistic in the comparison. For the QK test, a bandwidth parameter is involved in the consistent estimation of asymptotic covariance matrix of the first K sample correlations. Here we adopt an automatic procedure as used in Lobato et al. (2002), i.e., we employ the AR(1) prewhitening on each series and select the bandwidth using formula (2.2) of Newey and West (1994) with weights equal to one and lag truncation equal to 2(n/100)2/9 . We consider K = 1, 5, 10. For the subsampling method, the key idea is to approximate the sampling distribution of CMn with the subsampling ana−l +1 logue {CMl (t )}tn= 1 , where l is the subsample window width and CMl (t ) is the subsampled counterpart of CMn and is based on the t-th subsample . , Xt +l−1√ ) for t = 1, . . . , n − l + 1. We let √ √(Xt , . . √ l = ⌊ n/2⌋, ⌊ n⌋, ⌊2 n⌋, ⌊4 n⌋, where ⌊a⌋ stands for the integer part of a. Among these three methods, the theoretical validity under general dependence assumption (including non-martingale difference sequence) has been proved only for QK test. Presumably one can show that the subsampling method also works under certain mixing assumptions in view of its wide applicability. Three series lengths (n = 100, 400 and 1000) are√investigated. √ For our√bootstrap-assisted test, we let bn = 1, ⌊ n/2⌋, ⌊ n⌋ and ⌊2 n⌋. These choices deliver bn = 1, 5, 10, 20 for n = 100, bn = 1, 10, 20, 40 for n = 400 and bn = 1, 15, 31, 63 for
X. Shao / Journal of Econometrics 162 (2011) 213–224
n = 1000. The following Bernoulli distribution is employed to generate δt : P (δt = 0.5(1 − P (δt = 0.5(1 +
√
5)) =
√
√
1+
5
√
2 5
5)) = 1 −
and
√
1+
√
2 5
5
.
500 bootstrap samples were taken in the calculation of empirical rejection percentage. 4.1. Size Let εt ∼ iid N (0, 1) unless specified and B denotes the backward shift operator. To examine the empirical size, we consider the following models: M1 : IID normal, Xt = εt . M2 : GARCH(1, 1), Xt = εt σt , where σt2 = 0.001 + 0.09σt2−1 + 0.89Xt2−1 . M3 : IGARCH(1, 1), Xt = εt σt , where σt2 = 0.001 + 0.09σt2−1 + 0.91Xt2−1 . M4 : EGARCH, Xt = εt σt , where log(σt2 ) = 1 + (1 + 0.8B)[0.5εt + 2{|εt | − E|εt |}]. M5 : Non-MDS 1 (non-martingale difference): Xt = εt2 εt −1 . M6 : Non-MDS 2: Xt = εt2 εt2−1 εt2−2 εt −3 . M7 : All-pass ARMA(1, 1) model: Xt = 0.8Xt −1 + εt − (1/0.8)εt −1 , where εt ∼ t (10). M8 : Bilinear model: Xt = εt + 0.5εt −1 Xt −2 . The models M1 –M4 are martingale differences, whereas the models M5 –M8 are uncorrelated but are not martingale differences. The models M2 and M3 have been used in Lobato et al. (2001) in the examination of the finite sample performance of a modified Box–Pierce test. Among the non-MDS examples, M6 exhibits stronger dependence than M5 . The models M7 and M8 have been investigated in Lobato et al. (2002) for QK test. Note that the IGARCH model M3 does not admit a finite second moment and the GARCH model M2 admits a finite second moment but the fourth moment is infinite; see Davidson (2004) and Ling and McAleer (2002) for sufficient and necessary conditions on the existence of higher order moments for GARCH models. According to examples 2.1 and 2.2 in Shao (in press), the models M7 and M8 both satisfy GMC(8), so all the moment and weak dependence assumptions in Theorems 2.1 and 3.1 hold. The GMC(8) property also holds for the models M1 , M5 and M6 , as can be easily seen. For the EGARCH model M4 , Min (2004) showed that it is GMC(q) for any q ∈ N if εt ∼ iid N (0, 1). Table 1(a)–(c) shows the proportion (in percentage) of 5000 replications in which the null hypothesis was rejected at 1%, 5% and 10% nominal significance levels for n = 100, n = 400 and n = 1000 respectively. For all sample sizes, Deo’s test delivers accurate size for the models M1 , M2 , M3 and M7 but the size distortion is noticeable for other models. Note that the asymptotic validity of Deo’s test statistic requires the asymptotic independence of ρ( ˆ j) and ρ( ˆ k), j ̸= k and that the distorted size for the model M4 might be due to the fact that this assumption does not hold for M4 although it is a MDS. For the bootstrap-assisted test, the size is accurate for the models M1 –M3 at bn = 1 since as commented in Remark 3.1, taking bn = 1 leads to consistent bootstrap approximation when the series is a MDS. When bn gets large, the size is more distorted at all levels, but the size √ distortion √ decreases as n increases. For the model M4 , taking bn = n (2 n) corresponds to least size distortion when n = 100 (n = 1000). For non-MDS models, the optimal bn in terms of size depends on the model and sample size; compare the models M5 and M6 . The
217
subsampling based test seems √ quite sensitive to the choice of the subsampling width, with n/2 a √ suboptimal choice for almost all models at all sample sizes and 2 n corresponding to accurate size for most models. The QK test performs reasonably well when K = 1 for the models M1 , M2 , M3 , M7 and M8 , but less satisfactory for other models. A plausible explanation is that the automatic bandwidth selection procedure may not perform well in all the situations. The QK test performs very poorly for K = 5, 10 when n = 100, which indicates that the asymptotic approximation is less accurate when K is relatively large. The size distortion of QK at K = 5, 10 improves for some models as sample size increases, but not uniformly. It is a little surprising to see that for some models (e.g. M6 ) the size gets more distorted as sample size gets larger. This might be related to the finite sample bias in the datadriven bandwidth selection algorithm. In particular, the use of the truncation lag 2(n/100)2/9 is somewhat arbitrary, and it may not work well for all the models under examination. Hence, for some models, the bias in the estimation of optimal bandwidth could get large as sample size increases, thus affects the size negatively. It is also worth noting that our bootstrap-based test performs reasonably well for the model M3 , especially at n = 1000, although the second moment is infinite. 4.2. Power We report the empirical powers of the above-mentioned tests in a small Monte Carlo study. The following alternatives are considered: M9 : MA(1), Xt = ut + ρ ut −1 , where ut follows the GARCH model M2 and ρ = 0.05, 0.1, . . . , 0.3. M10 : MA(1), Xt = ut + ρ ut −1 , where ut follows the bilinear model M8 and ρ = 0.05, 0.1, . . . , 0.3. M11 : ARFIMA(0, d, 0), (1 − B)d Xt = εt , where d = −0.2, −0.15, . . . , 0.2. We present the size-adjusted power in Table 2(a), (b) and (c) for the models M9 , M10 and M11 at sample size n = 400. For bootstrap and subsampling based tests, the calculation of size-adjusted power does not appear to as obvious as that for an asymptotic test. Here we use the technique described in Dominguez and Lobato (2001), who proposed an alternative way of calculating size-corrected power for bootstrap tests. The extension to subsampling based test is straightforward, so we omit the details. As shown in Table 2, the power for the bootstrap-assisted test seems not very sensitive to the choice of bn . It is comparable to the power for Deo’s test and Q1 test, and is superior to that for Q5 and Q10 for all the examined alternatives. In a few cases, the bootstrap-based test outperforms all the other alternative methods in power. In particular, our bootstrap test is more powerful than the Q˜ K for K = 1, 5, 10 for the FARIMA(0, d, 0) alternative when d = 0.05, 0.1, 0.15 and 0.2 (i.e., in the long memory region). For the subsampling method, larger √ subsampling width corresponds to less power. The power for l = n/2 is comparable to those delivered by Deo’s corrected C–M test and our bootstrap test, and the power corresponding to the other three widths are substantially lower. However, the size distortion reaches the √ largest when l = n/2, so the small width is also not preferred. In view of the overall size and power performance, it is fair to conclude that the subsampling method does not perform as well as the bootstrap-based test, which has very good size and power properties provided that the block size is suitably chosen.
218
X. Shao / Journal of Econometrics 162 (2011) 213–224
Table 1 Rejection rates in percentage under the null hypothesis for sample sizes (a) n = 100, (b) n = 400 and (c) n = 1000. The symbols BOOT(l) and SUB(l) stand for the bootstrap-assisted test and the subsampling based test with the bandwidth parameter equal to l. The largest standard error is 0.7%. (a) n = 100
M1
M2
M3
M4
1%
5%
10%
1%
5%
10%
1%
5%
10%
1%
5%
10%
Deo
0.6
5
9.6
0.8
4.8
10
0.8
4.8
10.1
0.2
2.1
6.9
BOOT(1) √ BOOT(√n/2) BOOT( √ n) BOOT(2 n)
1.1 2.1 4.2 11.8
5.8 7.3 9.3 15.2
10.8 12.7 15.0 21.1
1.3 1.8 3.5 10.0
5.4 7.1 8.5 13.7
10.9 12.7 14.6 19.1
1.4 1.7 3.2 10.2
5.6 6.8 8.6 13.6
10.9 12.5 14.6 19.0
0.4 0.4 1.0 5.2
3.4 3.3 4.5 9.7
11.0 9.4 10.4 13.2
SUB(√n/2) n) SUB( √ SUB(2√n) SUB(4 n)
3.7 2.4 2.8 3.9
12.6 7.1 6.6 8.1
26.2 15.4 11.7 12.5
3.6 2.1 1.9 3.2
11.6 6.3 5.3 7.1
27.5 14.6 11.3 12.2
3.7 2.0 2.0 2.9
11.9 6.1 5.5 6.9
28.0 14.9 11.2 12.4
2.9 0.9 1.2 2.5
26.5 5.8 6.1 8.1
60.8 30.3 12.0 13.3
Q1 Q5 Q10
0.3 0.1 0.1
4.0 1.8 0.6
9.3 5.3 1.5
0.4 0.3 0.5
3.8 1.7 1.4
8.9 5.2 2.6
0.4 0.2 0.8
3.6 2.0 2.0
9.2 5.4 3.3
0.6 3.0 6.4
1.6 3.9 7.6
3.8 4.6 8.7
n = 100
M5 1%
5%
10%
1%
5%
10%
1%
5%
10%
5%
10%
Deo
0.2
2.4
7.6
0.1
1.1
4.8
0.6
3.5
8.3
1.8
7.9
13.9
BOOT(1) √ BOOT(√n/2) BOOT( √ n) BOOT(2 n)
0.3 0.6 1.5 7.0
4.1 4.0 5.9 11.3
10.6 11.0 12.6 16.1
0.2 0.1 0.2 2.3
2.5 1.3 1.9 6.5
11.0 7.2 6.8 8.7
0.9 1.7 3.8 11.1
4.5 6.6 9.1 15.2
9.7 12.1 14.6 20.6
2.6 3.3 4.6 12.9
9.0 9.7 10.8 16.8
15.2 16.6 17.7 21.8
SUB(√n/2) SUB( √ n) SUB(2√n) SUB(4 n)
1.8 1.1 1.6 3.5
15.5 4.8 5.8 8.0
42.0 22.8 11.4 13.5
1.8 0.8 1.1 2.6
31.9 6.4 6.9 8.3
73.8 40.3 14.0 13.7
3.0 2.3 2.9 4.2
11.1 6.4 6.1 8.2
26.3 14.6 11.5 12.7
5.4 2.7 2.7 4.2
15.4 8.3 7.1 9.0
35.0 18.8 12.1 14.2
Q1 Q5 Q10
0.3 1.8 5.5
1.6 2.7 7.1
5.5 3.4 8.2
0.9 5.9 14.6
1.1 6.7 16.0
2.7 7.1 17.1
0.3 0.1 0.1
3.6 1.7 0.6
8.4 4.9 1.8
1.1 0.3 0.6
6.5 2.2 1.6
13.3 5.9 2.7
(b) n = 400
M1 1%
5%
10%
1%
5%
10%
1%
5%
10%
1%
5%
10%
Deo
0.8
4.9
10.4
0.8
5
10.3
0.8
5
10.3
0.2
2.7
7.7
BOOT(1) √ BOOT(√n/2) BOOT( √ n) BOOT(2 n)
1.1 1.6 2.4 4.4
5.2 6.2 7.4 9.8
10.7 12.2 13.2 15.2
1.2 1.3 1.7 3.3
5.2 5.6 6.2 7.9
10.6 11.2 11.8 13.3
1.1 1.4 1.6 2.6
5.3 5.6 5.7 7.0
10.6 11.4 11.1 12.3
0.4 0.5 0.6 1.7
3.6 3.6 3.8 5.9
9.9 10.1 10.5 11.5
SUB(√n/2) SUB( √ n) SUB(2√n) SUB(4 n)
1.0 1.2 1.8 2.2
6.6 5.1 4.9 5.5
16.9 12.7 10.4 9.5
0.5 0.3 0.5 1.0
5.1 2.7 2.2 3.2
17.6 11.3 6.8 7.7
0.5 0.4 0.4 0.7
4.8 2.1 2.1 3.1
18.0 11.0 6.9 7.5
0.5 0.6 0.8 1.6
32.2 12.5 3.9 5.4
61.8 40.2 8.5 9.9
Q1 Q5 Q10
0.7 0.5 0.3
4.5 3.5 1.9
10.4 8.2 4.9
0.5 0.3 0.2
4.4 2.3 1.0
10.0 6.0 2.8
0.5 0.2 0.2
4.1 2.2 1.0
9.6 5.6 2.9
0.2 0.5 1.0
1.9 0.7 1.3
6.3 1.1 1.7
n = 400
M5 1%
5%
10%
1%
5%
10%
1%
5%
10%
5%
10%
Deo
0.4
4.1
8.8
0.1
1.6
5.7
0.7
4.1
9.1
2
7.4
13.9
BOOT(1) √ BOOT(√n/2) BOOT( √ n) BOOT(2 n)
0.6 1.0 1.2 2.6
4.9 4.6 5.9 7.7
10.4 10.6 11.5 13.8
0.1 0.1 0.1 0.4
2.7 1.8 1.9 3.4
11.3 8.8 8.4 8.5
0.9 1.5 2.2 4.0
4.5 6.2 7.0 8.9
9.2 11.6 12.3 14.3
2.5 2.1 2.6 4.2
7.7 6.9 7.3 9.1
14.3 12.5 13.2 14.6
SUB(√n/2) SUB( √ n) SUB(2√n) SUB(4 n)
0.2 0.2 0.6 1.3
11.6 4.9 2.4 3.9
33.9 21.6 6.7 8.2
0.5 0.6 0.8 1.3
47.1 15.7 4.0 5.2
81.6 53.3 9.5 11.3
0.8 0.6 1.3 1.8
6.0 4.4 4.2 4.5
16.3 11.9 9.1 9.1
0.6 0.6 0.9 1.7
7.8 4.1 3.4 4.5
21.9 14.2 8.4 8.4
Q1 Q5 Q10
0.2 0.1 0.2
3.0 0.4 0.6
8.0 1.2 1.0
0.2 1.0 2.8
0.9 1.2 3.3
3.6 1.4 3.4
0.6 0.6 0.2
4.4 3.3 1.6
10.0 7.2 4.6
1.2 0.5 0.2
6.4 3.2 1.9
12.9 7.2 4.2
√
√
√
√
M6
M7
M2
M8
M3
M6
1%
M4
M7
M8 1%
(continued on next page)
4.3. Empirical illustration
weighted portfolios from January 1926 to December 1990. This data set has been analyzed by Deo (2000) and Durlauf (1991) using
In this section, we apply all the aforementioned tests to the 780 monthly returns on the CRSP-NYSE equal-weighted and value-
the corrected and uncorrected standardized C–M test statistic respectively. Table 3 reports the p-values for a range of block sizes
X. Shao / Journal of Econometrics 162 (2011) 213–224
219
Table 1 (continued) (c) n = 1000
M1
M2
M3
M4
1%
5%
10%
1%
5%
10%
1%
5%
10%
1%
5%
10%
Deo
1.1
5.1
10
0.8
4.6
10.4
1.1
5.3
10.6
0.2
2.8
7
BOOT(1) √ BOOT(√n/2) n) BOOT( √ BOOT(2 n)
1.2 1.7 1.9 3.0
5.3 5.7 6.8 7.7
10.4 10.9 12.0 13.0
1.1 1.3 1.4 1.9
4.8 5.5 5.8 6.5
10.4 10.7 10.8 12.1
1.3 1.3 1.4 1.5
5.7 5.5 5.5 5.3
10.7 10.5 10.4 11.1
0.4 0.5 0.6 1.0
3.4 3.4 3.4 4.5
9.4 9.5 9.7 10.1
SUB(√n/2) SUB( √ n) SUB(2√n) SUB(4 n)
0.9 0.9 1.2 1.4
5.6 4.9 4.8 4.7
13.3 11.3 10.2 9.1
0.1 0.0 0.2 0.4
3.8 2.3 1.6 2.0
16.3 11.7 8.2 5.7
0.0 0.1 0.1 0.2
3.6 1.7 1.2 1.7
17.4 11.8 7.7 5.6
0.1 0.2 0.4 0.9
34.4 17.3 2.6 3.9
62.2 45.7 28.7 8.1
Q1 Q5 Q10
1.0 0.7 0.5
5.1 4.4 3.3
10.3 9.4 7.3
0.7 0.3 0.1
4.7 2.4 1.4
9.8 6.2 3.6
0.8 0.3 0.1
4.9 2.0 1.2
9.9 4.7 3.1
0.2 0.1 0.3
1.8 0.2 0.4
6.0 0.5 0.6
n = 1000
M5 5%
10%
1%
5%
10%
1%
5%
10%
1%
5%
10%
√
M6
1%
M7
M8
Deo
0.7
4
9
0.1
2.1
6.8
0.8
4.6
9.4
2
8.4
14.6
BOOT(1) √ BOOT(√n/2) BOOT( √ n) BOOT(2 n)
0.9 1.2 1.3 2.1
4.6 4.7 5.2 6.6
9.7 9.8 10.6 12.0
0.2 0.1 0.1 0.4
3.4 2.1 2.3 2.6
10.8 9.2 8.6 8.6
1.1 1.5 2.0 2.9
4.8 6.4 7.1 8.7
9.4 11.6 12.2 14.0
2.4 1.7 2.0 3.0
8.4 7.1 7.6 8.7
14.6 12.9 12.9 13.8
SUB(√n/2) SUB( √ n) SUB(2√n) SUB(4 n)
0.0 0.1 0.3 0.8
8.4 4.8 2.0 3.0
29.3 18.8 12.5 7.1
0.3 0.4 0.6 0.8
51.5 25.2 3.3 3.9
82.7 63.5 38.1 9.1
0.6 0.8 1.2 1.4
5.2 5.0 4.7 4.2
14.1 11.7 10.5 9.3
0.3 0.5 0.7 1.2
6.4 4.6 3.6 4.0
18.6 13.9 11.5 8.5
Q1 Q5 Q10
0.4 0.0 0.0
3.4 0.6 0.3
8.5 2.1 1.0
0.1 0.3 0.7
1.0 0.4 0.8
4.4 0.4 0.9
0.9 0.6 0.6
5.2 4.1 2.7
10.7 8.4 6.2
0.9 0.9 0.3
5.9 4.2 3.1
11.7 8.8 6.7
√
Table 2 Rejection rates in percentage under the alternative hypothesis for (a) M9 , (b) M10 and (c) M11 . The symbols BOOT(l) and SUB(l) stand for the bootstrap-assisted test and the subsampling based test with the bandwidth parameter equal to l. The significance level is 5%. n = 400
(a)
ρ
0.05
0.1
0.15
0.2
0.25
0.3
0.05
0.1
0.15
0.2
0.25
0.3
Deo
10.4
28.5
55.4
80
94
98.6
12.9
37.3
69.6
90.4
98.4
99.8
Boot(1) √ Boot(√n/2) Boot( √ n) Boot(2 n)
11.4 10.1 10.9 13.7
30.7 29.5 30.6 33.0
57.6 58.3 59.1 60.0
81.9 82.4 82.4 82.9
94.6 95.3 95.1 94.9
98.9 99.0 98.9 98.8
12.7 13.0 13.9 15.4
36.1 37.1 37.8 39.3
69.1 69.2 68.8 69.4
90.0 90.5 89.6 89.1
98.3 98.2 97.8 97.1
99.7 99.6 99.5 98.9
SUB(√n/2) SUB( √ n) SUB(2√n) SUB(4 n)
12.5 8.4 7.4 7.5
32.8 24.3 20.6 20.0
58.8 48.7 42.7 40.1
82.4 70.9 63.6 60.7
94.5 86.4 79.2 75.9
98.8 93.2 87.4 84.8
10.6 7.7 6.6 7.9
31.5 24.6 20.4 21.4
62.1 50.9 41.4 39.3
84.7 74.3 61.5 56.7
95.5 88.1 74.7 67.4
99.0 93.6 82.0 74.3
Q1 Q5 Q10
8.1 6.3 6.3
24.9 13.9 11.8
53.0 32.6 24.7
78.6 56.9 42.5
93.3 78.0 62.5
98.0 90.4 77.7
13.1 7.1 6.7
37.4 19.0 13.3
69.7 40.4 28.0
90.7 66.4 48.0
98.2 84.9 65.3
99.7 93.2 77.9
√
(b)
n = 400
(c)
d
−0.2
−0.15
−0.1
−0.05
0.05
0.1
0.15
0.2
Deo
95.2
78.5
46.9
16.2
17.6
56.2
90.4
99.4
Boot(1) √ Boot(√n/2) Boot( √ n) Boot(2 n)
95.5 96.1 96.0 95.3
78.7 81.1 81.8 82.2
47.6 50.9 51.4 55.1
16.9 18.3 19.5 22.4
17.6 17.2 18.9 21.0
56.4 52.5 52.8 54.6
90.4 87.7 87.3 87.0
99.4 98.6 98.6 97.9
SUB(√n/2) SUB( √ n) SUB(2√n) SUB(4 n)
91.2 88.7 83.9 78.6
69.0 66.8 62.6 58.3
37.2 35.6 33.4 32.4
11.2 10.9 11.2 11.2
22.6 20.7 18.2 15.8
62.8 59.4 52.3 44.6
92.6 90.8 84.9 75.0
99.4 98.9 97.0 91.1
Q1 Q5 Q10
94.1 96.3 92.4
78.4 78.1 67.8
48.0 42.5 33.4
17.4 13.4 11.4
15.2 9.1 8.3
49.0 26.8 19.6
84.4 58.5 42.2
98.0 84.3 64.7
√
bn = 1, 13, 26, 52, 104, with the latter four values approximately
√
n/2,
√
√
√
of our bootstrap-assisted test are quite sensitive to the choice of
n, 2 n and 4 n respectively. We use 3000
block size, with smaller p-values at bn = 1. A possible explanation
Bootstrap replications. For the equal-weighted return, the p-values
for this sensitivity is that the series may not be stationary, which
equal to
220
X. Shao / Journal of Econometrics 162 (2011) 213–224
Table 3 The p-values of four types of tests (Deo’s corrected C–M test, our bootstrap-assisted test, the subsampling-based test and QK test) applied to the equal-weighted and value-weighted returns for different block sizes. The symbols BOOT(l) and SUB(l) stand for the bootstrap-assisted test and the subsampling based test with the bandwidth parameter equal to l.
Deo BOOT(1) BOOT(13) BOOT(26) BOOT(52) SUB(13) SUB(26) SUB(52) SUB(104) Q1 Q5 Q10
Equal-weighted
Value-weighted
(2.5%, 5%) 2.23% 19.2% 14.7% 16.7% 3.8% 5.6% 8.9% 11.7% 5.9% 12.2% 86.1%
(5%, 10%) 9.5% 26.2% 15.8% 20.8% 7.3% 6.9% 10.8% 11.5% 15.0% 23.1% 73.5%
is required in our theory. The sensitivity of p-values with respect to block size can also be seen for the subsampling-based test. QK test suggest that the lag-1 autocorrelation is The p-values of marginally significant, and the autocorrelation at other lags (up to lag 10) are all zero. For the value-weighted returns, we cannot reject the null hypothesis at 5% level for all the tests at all the block sizes, which is consistent with the result obtained by Deo (2000). 5. Conclusion In summary, we propose a blockwise wild bootstrap procedure to approximate the limiting null distribution of the C–M test statistic. Since in practice we do not know the data generating process of the series at hand, it seems preferable to use the bootstrap-based test proposed here, which has been justified to be widely applicable to a large class of uncorrelated nonlinear processes. The comparison of our test with three other methods in terms of the finite sample performance shows that our test has overall good size and power with superiority ability to deal with unknown dependence. In addition, it is encouraging to see that the size and power of our bootstrap-based test is not very sensitive to the choice of block size for large sample size. Through an empirical illustration, we show that the evidence against the white noise null hypothesis for the two series, which were tested by Durlauf (1991) and Deo (2000), is considerably weakened after accounting for unknown dependence. The problem of selecting optimal block size remains open, and more work is needed in that respect. It would be interesting to investigate if the block size selection methodology proposed in the block bootstrap and subsampling literature (cf. Hall et al., 1995; Bühlmann and Künsch, 1999; Politis et al., 1999) can be extended to our setting. We leave it for future research. Acknowledgements I would like to thank Rohit Deo for providing me with the data sets used in the paper. I am also grateful to an associate editor and two referees for constructive comments that led to substantial improvement of the paper. The research is supported in part by NSF grant DMS-0804937.
Note that Pj ≤ Cj−2 uniformly in j ∈ N. Further we note that Ψ (λ)Ψk (λ)dλ = 0 when j ̸= k, j, k ∈ N. Π j Proof of Theorem 2.1. By Lemma 5.1, it suffices to show that S˜n (λ) − E{S˜n (λ)} ⇒ S (λ). For each fixed K , 1 ≤ K ≤ n − 1, S˜n (λ) =
K − √
nγ˜ (j)Ψj (λ) +
Let γ˜ (j) = n−1
∑n =1+|j| Yt Yt −|j| for j = 0, ±1, . . . , ±(n − 1). ∑n−1t√ ˜ Define Sn (λ) = nγ˜ (j)Ψj (λ). Denote by a ∧ b = min(a, b) j=1 and a ∨ b = max(a, b); by Wh (j) = Π h(λ)Ψj (λ)dλ for any h ∈ L2 (Π ), where Π = [0, π]; by Pj = Π Ψj2 (λ)dλ, j ∈ N.
nγ˜ (j)Ψj (λ)
j =K +1
j =1
= S˜nK (λ) + RKn (λ). Then we want to show (a). For an arbitrary but fixed integer K the finite dimensional distributions of S˜nK (λ) − E{S˜nK (λ)}, ⟨S˜nK − E{S˜nK }, h⟩, converge to those of S K (λ), ⟨S K (λ), h⟩, for any h ∈ L2 (Π ), where S K (λ) is a Gaussian process with zero mean and asymptotic projected variances
σh2,K = var[⟨S K , h⟩] =
K ∞ − −
cov(Yt Yt −j , Yt −d Yt −d−k )Wh (j)Wh (k).
j,k=1 d=−∞
Note that
⟨S˜nK − E{S˜nK }, h⟩ = n−1/2
K n − −
{Yt Yt −j − γ (j)}Wh (j)
j=1 t =j+1
= n−1/2
K +1 − t −1 − {Yt Yt −j − γ (j)}Wh (j) t =2 j =1
+ n−1/2
n K − − {Yt Yt −j − γ (j)}Wh (j), t =K +2 j=1
where the first summand above is op (1) since K is finite. For the ∑∞ second summand, since k=0 δ4 (k) < ∞ and
‖P0 (Yt Yt −j )‖ ≤ ‖Yt Yt −j − Yt′ Yt′−j ‖ ≤ C {δ4 (t ) + δ4 (t − j)}, we have that k=0 ‖P0 (Yk Yk−j )‖ < ∞ for any j = 1, . . . , K . Consequently, the distributional convergence follows from Theorem 1 in Hannan (1973) or Lemma 1 in Wu and Min (2005).
∑∞
(b). For an arbitrary but fixed integer K the sequence {S˜nK (λ)} is tight. S˜nK (λ) − E{S˜nK (λ)} = n−1/2
K n − −
{Yt Yt −j − γ (j)}Ψj (λ)
j =1 t =j +1
= n−1/2
1)∧K n (t − − − t =2
= n−1/2 Appendix
n−1 − √
K +1 − t =2
{Yt Yt −j − γ (j)}Ψj (λ)
j=1
GKn,t + n−1/2
n −
GKn,t ,
t =K +2
where GKn,t is implicitly defined. The tightness of n−1/2 t =2 GKn,t follows from the fact that it is a finite sum and each summand ∑n is tight. For the second term n−1/2 t =K +2 GKn,t , we note that it is a sum of stationary, mean zero random elements such that
∑K +1
X. Shao / Journal of Econometrics 162 (2011) 213–224
E‖ j=1 {Yt Yt −j − γ (j)}Ψj (λ)‖2 < ∞. Thus it suffices to verify the assumptions in Politis and Romano (1994) listed below: (i) E‖GKn,t ‖2 < ∞.
∑K
(ii) For each integer m ≥ 2, the random elements (GKn,K +2 ,
, . . . , GKn,K +m ) converge in distribution to (GKK +2 , GKK +3 , . . . ,
GKn,K +3 GKK +m .
)
(iii) For each integer m ≥ 2, E[⟨ , ⟩] → E[⟨GKK +2 , GKK +m ⟩] as n → ∞. ∑n ∑∞ K (iv) limn→∞ t =K +2 E[⟨GKn,K +2 , GKn,t ⟩] = t =K +2 E[⟨GK +2 , GKt ⟩] < ∞, and the last series converges absolutely. GKn,K +2
GKn,K +m
(v) var[⟨S˜nK − E(S˜nK ), h⟩] → σh2,K . Let GKt = GKn,t = j=1 {Yt Yt −j − γ (j)}Ψj (λ). Then (i), (ii) and (iii) are trivially satisfied. For (iv), we have
∑K
∞ −
|E[⟨GKK +2 , GKt ⟩]|
Define S˜n∗ (λ) =
∑n
t =j+1
≤
{Yt Yt −j − γ (j)}wt .
√ ∑K
n−1/2
∑(K +1)
GKn,∗t + n−1/2
t =2
n −
t =K +2 j =1 Ln
=
−
n−1/2 δs
=
∗ Hsn ,
s=1
∗ where Hsn is implicitly defined. Then
Yt Yt −j Wh (j)
cov(Yt Yt −j , Yt ′ Yt ′ −k )Wh (j)Wh (k)
j,k=1 t =j+1 t ′ =k+1
= n− 1
K −
(n−j)∨( n−k)−1 −
cov(Yt Yt −j , Yt −d Yt −d−k )Wh (j)Wh (k)
j,k=1 d=(j−n)∧(k−n)+1
{(n − j) ∨ (n − k) − |d|} → σ
2 h,K
var ( ∗
JnK ∗
)=n
−1
Ln −
.
K →∞ n→∞
Under the assumptions (5) and (6), we have
≤ n−1
n
−
−
2 .
t ∈Bs ∩[K +2,n] j=1
In the sequel, we shall show that var∗ (JnK ∗ ) →p σh2,K . To this end, we calculate E{var∗ (JnK ∗ )} and var{var∗ (JnK ∗ )} as follows. First, we note that
E{var∗ (JnK ∗ )} Ln −
−
K −
cov(Yt Yt −j , Yt ′ Yt ′ −j′ )Wh (j′ )Wh (j)
→ σh2,K . Second, var{var∗ (JnK ∗ )} = n−2
×
Pj {|cum(X0 , X−j , Xt ′ −t , Xt ′ −t −j )|
j = K + 1 t ,t ′ = j + 1
+ |γ 2 (t ′ − t )| + |γ (t ′ − t + j)γ (t ′ − t − j)|} ∞ − ≤C j−2 → 0 as K → ∞. j =K +1
Proof of Corollary 2.1. Under the assumptions of Theorem 2.1, it follows from√Lemma ∑ 5.1 and a straightforward calculation that ‖E{Sn (λ)}− n ∞ j=1 γ (j)Ψj (λ)‖ = o(1). Therefore, Sn (λ) ⇒ S (λ) under H0 and (3) follows from the continuous mapping theorem. Similarly, the assertion (7) also follows.
Ln −
−
K −
−
s,s′ =1 t1 ,t2 ∈Bs ∩[K +2,n] t ′ ,t ′ ∈Bs′ ∩[K +2,n] j1 ,j2 =1 1 2
var{γ˜ (j)}Pj
j =K +1 n−1
K − {Yt Yt −j − γ (j)}Wh (j)
−
s=1 t ,t ′ ∈Bs ∩[K +2,n] j,j′ =1
lim lim P (‖RKn (λ) − E{RKn (λ)}‖ > ϵ) = 0.
n −1 −
s=1
= n−1
Thus (v) is shown and this completes the proof of part (b). (c). The process RKn (λ) satisfies that for all ϵ > 0,
E‖RKn (λ) − E{RKn (λ)}‖2 = n
K − {Yt Yt −j − γ (j)}Wh (j)
−
t ∈Bs ∩[K +2,n] j=1
s=1 Ln −
j=1 t =j+1 K n n − − −
⟨GKn,∗t , h⟩
n K − − {Yt Yt −j − γ (j)}wt Wh (j)
= n−1/2
in view of the assumptions (5), (6) and Remark 2.1. Further we note that
= n− 1
t =K +2
t =K +2
+ γ 2 (t − K − 2) + |γ (t − K − 2 + j)γ (t − K − 2 − j)|} <∞
var[⟨S˜nK − E(S˜nK ), h⟩] = n−1 var
∑(t −1)∧K
∑n
GKn,∗t , where GKn,∗t = j =1 ∑ [Yt Yt −j −γ (j)]wt Ψj (λ). Note that E[E∗ {n−1/2 (t K=+2 1) GKn,∗t }2 ] = o(1) since it only involves finite number ∑n of terms. It suffices to show the asymptotic normality of ⟨n−1/2 t =K +2 GKn,∗t , h⟩. Write
t =K +2 j=1
K n − −
γ˜ ∗ (j)Ψj (λ), where γ˜ ∗ (j) = n−1
S˜nK ∗ (λ) + RKn ∗ (λ), where S˜nK ∗ (λ) = n j=1 γ˜ ∗ (j)Ψj (λ). We shall show that (i). The finite dimensional distributions of S˜nK ∗ (λ), ⟨S˜nK ∗ , h⟩, converge to those of S K (λ), ⟨S K (λ), h⟩ for any h ∈ L2 (Π ), in probability conditional on the original sample. Write S˜nK ∗ (λ) =
Pj {|cum(Xt , XK +2 , Xt −j , XK +2−j )|
j=1
as the process Sn (λ). To this end, we decompose S˜n∗ (λ) as S˜n∗ (λ) =
∞ − K − = cov(Yt Yt −j , YK +2 YK +2−j )Pj t =K +2 j=1 ∞ − K −
n
Proof of Theorem 3.1. By Lemma 5.2, it suffices to show that (a) The process S˜n∗ (λ) has the same asymptotic finite projections
JnK ∗ := n−1/2
t =K +2
221
√ ∑n−1
×
K −
C (t1 , t2 , t1′ , t2′ , j1 , j2 , j′1 , j′2 )
j′1 ,j′2 =1
× Wh (j1 )Wh (j2 )Wh (j′1 )Wh (j′2 ),
(12)
where C (t1 , t2 , t1 , t2 , j1 , j2 , j1 , j2 ) equals to ′
′
′
′
cov[{Yt1 Yt1 −j1 − γ (j1 )}{Yt2 Yt2 −j2 − γ (j2 )},
{Yt1′ Yt1′ −j′1 − γ (j′1 )}{Yt2′ Yt2′ −j′2 − γ (j′2 )}] = cum(Yt1 Yt1 −j1 , Yt2 Yt2 −j2 , Yt1′ Yt1′ −j′1 , Yt2′ Yt2′ −j′2 ) + cov(Yt1 Yt1 −j1 , Yt1′ Yt1′ −j′1 ) × cov(Yt2 Yt2 −j2 , Yt2′ Yt2′ −j′2 ) + cov(Yt1 Yt1 −j1 , Yt2′ Yt2′ −j′2 ) × cov(Yt2 Yt2 −j2 , Yt1′ Yt1′ −j′1 ) = T1 + T2 + T3 .
222
X. Shao / Journal of Econometrics 162 (2011) 213–224
So as n → ∞,
By Theorem II.2 in Rosenblatt (1985), we have T1 =
− v
E(E∗ ‖RKn ∗ (λ)‖2 )
cum(Xij , ij ∈ v1 ) · · · cum(Xij , ij ∈ vp ),
where the summation is over all indecomposable partitions v = v1 ∪ · · · ∪ vp of the two-way table t1 t2 t1′ t2′
t1 t2 t1′ t2′
=n
− j1 − j2 − j′1 − j′2 .
→
{Yt Yt −j − γ (j)}
Pj
t ∈Bs ∩[j+1,n]
Ln n −1 − −
−
cov(Yt Yt −j , Yt ′ Yt ′ −j )Pj
∞ −
Pj σY2 (j),
where σY2 (j) = 2π fYt Yt −j (0). Note that
|σY2 (j)| ≤ ≤
′
(b) The process S˜n∗ (λ) is tight in probability conditional on the sample. Write S˜n∗ (λ) = n−1/2
s =1
−
∗ 4 E{E∗ |Hsn | }
E
K − {Yt Yt −j − γ (j)}Wh (j)
4
t ∈Bs ∩[K +2,n] j=1
Ln −
K −
−
|Wh (j1 )Wh (j2 )
= Ln−1/2
1/2 b− δs n
= Ln−1/2
−
t −1 − {Yt Yt −j − γ (j)}Ψj (λ)
t ∈Bs ∩[2,n] j=1
Ln
−
∗ Msn (λ),
s =1
E{E ‖Msn (λ)‖ } = bn ∗
2
−1
t ′ −1 − t ∧− t ,t ′ ∈Bs
× Wh (j3 )Wh (j4 )||E[{Yt1 Yt1 −j1 − γ (j1 )}{Yt2 Yt2 −j2 − γ (j2 )} × {Yt3 Yt3 −j3 − γ (j3 )}{Yt4 Yt4 −j4 − γ (j4 )}]|. Following a similar argument as used in proving var{var ( )} = o(1), we can show that the above expression is of order O(n−2 Ln b2n ) = o(1). We omit the details. ∗
JnK ∗
(ii). For any ϵ > 0, limK →∞ limn→∞ P ∗ (‖RKn ∗ (λ)‖ ≥ ϵ) = 0 in probability conditional on the sample. Write n−1 −
Ln − s =1
∗
s=1 t1 ,t2 ,t3 ,t4 ∈Bs ∩[K +2,n] j1 ,j2 ,j3 ,j4 =1
E∗ ‖RKn ∗ (λ)‖2 = n
{Yt Yt −j − γ (j)}wt Ψj (λ)
∗ ∗ where Msn (λ) is implicitly defined. Note that Msn (λ) and Ms∗′ n (λ), s ̸= s′ are independent given the sample. In view of the argument in the proof of Theorem 4 of Escanciano and Velasco (2006), it ∗ suffices to verify that E∗ ‖Msn (λ)‖2 < ∞ almost surely; also see Example 1.8.5 of Van der Vaart and Wellner (1996). To this end, write
s=1 Ln −
t −1 n − − t =2 j =1
∗ Further, we note that Hsn , s = 1, . . . , Ln are independent conditional on the data. We verify Lindeberg–Feller’s condition as follows:
∗ 2 ∗ E[E∗ {|Hsn | 1(|Hsn | > ϵ)}] ≤ C
{|cum(Yt , Yt −j , Yt +k , Yt +k−j )|
uniformly in j. Then we get limK →∞ limn→∞ E(E∗ ‖RKn ∗ (λ)‖2 ) = 0.
When s = s′ in the summation of (12), the contribution of T2 to var{var∗ (JnK ∗ )} is O(n−2 Ln b2n ) = o(1). In the case of s ̸= s′ , then (10) implies that the magnitude of all such terms is bounded by Cn−2 L2n = o(1). The same argument applies to T3 . Hence, var{var∗ (JnK ∗ )} = o(1).
s =1
∞ −
+ |γ 2 (k) + γ (k + j)γ (k − j)|} < ∞
′
Ln −
|cov(Yt Yt −j , Yt +k Yt +k−j )|
k=−∞
+ γ (t2 − t2 )γ (t2 − t2 − j2 + j2 ) + γ (t2′ − j′2 − t2 )γ (t2′ − t2 + j2 )}. ′
∞ − k=−∞
1
1
+ γ (t1′ − t1 )γ (t1′ − t1 − j′1 + j1 ) + γ (t1′ − j′1 − t1 )γ (t1′ − t1 + j1 )} × {cum(X0 , X−j2 , Xt2′ −t2 , Xt2′ −t2 −j′2 )
≤ Cn−2
E
j =K +1
1
≤ Cn
−
j=K +1 s=1 t ,t ′ ∈Bs ∩[j+1,n]
T2 = {cum(X0 , X−j1 , Xt ′ −t1 , Xt ′ −t1 −j′ )
−2
2
j=K +1 s=1
= n− 1
For example, one such term in T1 is γ (t2 − j2 − t1 )γ (t1′ − j′1 − t2 )γ (t2′ − t1′ − j′2 )γ (t2′ − t1 + j1 ). Under (10), it is easily seen that its contribution to the sum (12) is of order o(1). The contributions of other terms in T1 are also o(1) due to (10). For T2 , we can write
Ln −
Ln n −1 − −
−1
cov(Yt Yt −j , Yt ′ Yt ′ −j )Pj .
j =1
Under (5) and (6), the above expression is bounded by 1 b− n
t ′ −1 − t ∧− {|cum(Xt , Xt −j , Xt ′ , Xt ′ −j )| + |γ 2 (t ′ − t )| t ,t ′ ∈Bs
j =1
+ |γ (t ′ − t − j)γ (t ′ − t + j)|}Pj < ∞, ∗ which implies that E∗ ‖Msn (λ)‖2 is finite almost surely. The conclusion is established.
The lemma below shows that the difference between Sn (λ) − E{Sn (λ)} and S˜n (λ) − E{S˜n (λ)} is negligible.
E∗ {γ˜ ∗ (j)2 }Pj
j =K +1
= n− 1
n −1 −
n −
{Yt Yt −j − γ (j)}{Yt ′ Yt ′ −j − γ (j)}
j=K +1 t ,t ′ =j+1
× E∗ (wt wt ′ )Pj Ln n −1 − −
= n− 1
j=K +1 s=1
t ∈Bs ∩[j+1,n]
{Yt Yt −j − γ (j)}
‖Sn (λ) − E{Sn (λ)} − [S˜n (λ) − E{S˜n (λ)}]‖2 →p 0. Proof of Lemma 5.1. Let Y¯ = X¯ − µ. Since for j = 0, 1, . . . , n − 1,
2 −
Lemma 5.1. Under the assumptions (5) and (6), we have
Pj .
γˆ (j) = γ˜ (j) − Y¯ n−1
n − t =j+1
(Yt + Yt −j ) +
n−j n
Y¯ 2 ,
X. Shao / Journal of Econometrics 162 (2011) 213–224
∗ For M1n (λ), we have
we have Sn (λ) = S˜n (λ) + Mn (λ), where n−1 − n −
Mn (λ) = −n−1/2 Y¯
223
(Yt + Yt −j )Ψj (λ)
E ‖M1n (λ)‖ = n ∗
∗
2
−1
j =1 t =j +1
+ n−1/2
n−1 −
{γˆ (j) − γ (j)} E
n −1 − (n − j)Y¯ 2 Ψj (λ).
∗ E{E∗ ‖M1n (λ)‖2 } ≤ Cn−1
‖Mn (λ)‖ ≤
n
+
+
j =1
n −
n−1 3Y¯ 2 −
n
j =1
n −
Pj
Yt −j
Pj
E{E ‖I1 (λ)‖ } = n ∗
t =j +1
≤ Cn−2
∗
2
(n − j)2 Pj = I1 + I2 + I3 . =n
j =1
Under (5) and (6), it is easy to show E(Y¯ 4 ) < Cn−2 , which ∑n−1 implies that E(I3 ) ≤ Cn−1 E(Y¯ 4 ) j=1 (n − j)2 Pj ≤ C /n. By the Cauchy–Schwarz inequality, we have that
n−1 −
Pj Pk
j,k=1
n n−1 − − j =1
t =j+1
n −
n −
Yt
2 2
−1
E{Yt1 Yt2 Yt3 Yt4 }
.
Lemma 5.2. Under the assumptions of (5) and (6), E∗ ‖Sn∗ (λ) − S˜n∗ (λ)‖2 = op (1).
Proof of Lemma 5.2. According to the definitions of γˆ ∗ (j) and ∗ ∗ (λ) + M2n (λ), where γ˜ ∗ (j), we can write Sn∗ (λ) − S˜n∗ (λ) = M1n
t =j +1
Ψj (λ)
j =1
n −
{(Xt − X¯ )
t =j +1
× (Xt −j − X¯ ) − Yt Yt −j }wt n −1 − n − = −n−1/2 Y¯ Yt wt Ψj (λ) j =1 t =j +1
− n−1/2 Y¯
n −1 − n −
Yt −j wt Ψj (λ)
j =1 t =j +1
+ Y¯ 2 n−1/2
E
n−1 − n −
wt Ψj (λ)
j=1 t =j+1
= Y¯ I1∗ (λ) + Y¯ I2∗ (λ) + Y¯ 2 I3∗ (λ).
∗
E
Ln n −1 − −
n −
Y t wt
2 Pj
t =j +1
2
−
E
Yt
Pj = O(1),
t ∈Bs ∩[j+1,n]
2
−1
n −1 −
Pj E
∗
n −
2 wt
≤ Cbn .
t =j+1
By Corollary 2(ii) (ii) of Wu (2007), t =1 Yt = o(n1/2 log n) almost ∗ surely under the assumption (5). So E∗ ‖M2n (λ)‖2 = op (1) and then
∑n
1/2
n −1 n − − {γˆ (j) − γ (j)}Ψj (λ) wt ,
−1
j =1
t1 ,t2 =j+1 t3 ,t4 =k+1
j=1
∗
and S˜n∗ (λ) is negligible.
∗ M2n (λ) = n−1/2
n −1 −
which implies that E∗ ‖I1∗ (λ)‖2 = Op (1). By a similar argument, E∗ ‖I2∗ (λ)‖2 = Op (1). Concerning I3∗ (λ), we have that
E ‖I3 (λ)‖ = n
Pj
The following lemma shows that the difference between Sn∗ (λ)
n−1 −
Pj E{γˆ (j) − γ (j)}2 (n − j)bn
j=1 s=1
∗
Expressing E{Yt1 Yt2 Yt3 Yt4 } = γ (t1 − t2 )γ (t3 − t4 )+γ (t1 − t3 )γ (t2 − t4 ) + γ (t1 − t4 )γ (t2 − t3 ) + cum(Yt1 , Yt2 , Yt3 , Yt4 ), we can derive that E(I1 ) ≤ C /n under (5) and (6). By a similar argument, E(I2 ) ≤ C /n. Subsequently, E‖Mn (λ)‖2 = o(1). By the Cauchy–Schwarz inequality, we have ‖E{Mn (λ)}‖2 = o(1). The conclusion follows.
∗ M1n (λ) = −n−1/2
n −1 −
j =1
E(I1 ) ≤ Cn−1 E1/2 (Y¯ 4 )E1/2
Pj .
∗ ∗ Hence, E∗ ‖M1n (λ)‖2 = op (1). Regarding M2n (λ), we have
2
n−1 3Y¯ 4 −
n
wt
t =j +1
≤ Cbn /n = o(1).
t =j +1
2
j=1
2 Yt
n −
Following the same argument as presented in the proof of Lemma 5.1, we can show
Therefore, n−1 3Y¯ 2 −
∗
j =1
j =1
2
2
∗ E∗ ‖S1n (λ) − S˜n∗ (λ)‖2 = op (1). This completes the proof.
References Anderson, T.W., 1993. Goodness of fit tests for spectral distributions. Annals of Statistics 21, 830–847. Bartlett, M.S., 1955. An Introduction to Stochastic Processes. Cambridge University Press, Cambridge. Bollerslev, T., 1986. Generalized autoregressive conditional heteroscedasticity. Journal of Econometrics 31, 307–327. Box, G., Pierce, D., 1970. Distribution of residual autocorrelations in autoregressiveintegrated moving average time series models. Journal of the American Statistical Association 65, 1509–1526. Breidt, F.J., Davis, R.A., Trindade, A.A., 2001. Least absolute deviation estimation for all-pass time series models. Annals of Statistics 29, 919–946. Brillinger, D.R., 1975. Time Series: Data Analysis and Theory. Holden-Day, SanFrancisco. Bühlmann, P., Künsch, H.R., 1999. Block length selection in the bootstrap for time series. Computational Statistics and Data Analysis 31, 295–310. Chen, X., White, H., 1996. Laws of large numbers for Hilbert space-valued mixingales with applications. Econometric Theory 12, 284–304. Chen, X., White, H., 1998. Central limit and functional central limit theorems for Hilbert-valued dependent heterogeneous arrays with applications. Econometric Theory 14, 260–284. Davidson, J., 2004. Moment and memory properties of linear conditional heteroscedasticity models and a new model. Journal of Business & Economic Statistics 22, 1629. Deo, R.S., 2000. Spectral tests for the martingale hypothesis under conditional heteroscedasticity. Journal of Econometrics 99, 291–315. Ding, Z., Granger, C., Engle, R., 1993. A long memory property of stock market returns and a new model. Journal of Empirical Finance 1, 83–106. Dominguez, M.A., Lobato, I.N., 2001. Size corrected power for bootstrap tests. Working Paper. Centro de Investigacion Economica, CIE, Instituto Tecnologico Autonomo de Mexico, ITAM. Downloaded from http://ftp.itam.mx/pub/academico/inves/lobato/01-02.pdf. Durlauf, S., 1991. Spectral-based test for the martingale hypothesis. Journal of Econometrics 50, 1–19. Escanciano, J., Velasco, C., 2006. Generalized spectral tests for the martingale difference hypothesis. Journal of Econometrics 134, 151–185. Escanciano, J., Lobato, I.N., 2009. An automatic data-driven portmanteau test for testing for serial correlation. Journal of Econometrics 151, 140–149. Granger, C.W.J., Anderson, A.P., 1978. An Introduction to Bilinear Time Series Models. Vandenhoek and Ruprecht, Gottinger. Granger, C.W.J., Teräsvirta, T., 1993. Modelling Nonlinear Economic Relationships. Oxford University Press, New York.
224
X. Shao / Journal of Econometrics 162 (2011) 213–224
Grenander, U., Rosenblatt, M., 1957. Statistical Analysis of Stationary Time Series. Wiley, New York. Hall, P., Horowitz, J.L., Jing, B.-Y., 1995. On blocking rules for the bootstrap with dependent data. Biometrika 82, 561–574. Hannan, E.J., 1973. Central limit theorems for time series regression. Zeitschrift fur Wahrscheinlichkeitstheorie und Verwandte Gebiete 26, 157–170. Horowitz, J.L., Lobato, I.N., Nankervis, J.C., Savin, N.E., 2006. Bootstrapping the Box–Pierce Q test: a robust test of uncorrelatedness. Journal of Econometrics 133, 841–862. Kallianpur, G., 1981. Some ramifications of Wiener’s ideas on nonlinear prediction. In: Wiener, N., Masani, P. (Eds.), Norbert Wiener, Collected Works with Commentaries. MIT Press, Cambridge, MA, pp. 402–424. Li, Q., Hsiao, C., Zinn, J., 2003. Consistent specification tests for semiparametric/nonparametric models based on series estimation methods. Journal of Econometrics 112, 295–325. Ling, S., McAleer, M., 2002. Necessary and sufficient moment conditions for the GARCH(r , s) and asymmetric power GARCH(r , s) models. Econometric Theory 18, 722–729. Liu, R.Y., 1988. Bootstrap procedures under some non-i.i.d. models. Annals of Statistics 16, 1696–1708. Lobato, I.N., 2001. Testing that a dependent process is uncorrelated. Journal of the American Statistical Association 96, 1066–1076. Lobato, I.N., Nankervis, J.C., Savin, N.E., 2001. Testing for autocorrelation using a modified Box–Pierce Q test. International Economic Review 42, 187–205. Lobato, I.N., Nankervis, J.C., Savin, N.E., 2002. Testing for zero autocorrelation in the presence of statistical dependence. Econometric Theory 18, 730–743. Mammen, E., 1993. Bootstrap and wild bootstrap for high dimensional linear models. Annals of Statistics 21, 255–285. Min, W., 2004. Inference on time series driven by dependent innovations. Ph.D. Dissertation, University of Chicago. Newey, W.K., West, K.D., 1994. Automatic lag selection in covariance matrix estimation. Review of Economic Studies 61, 631–653. Parthasarathy, K.R., 1967. Probability Measures on Metric Spaces. Academic Press, New York.
Politis, D.N., Romano, J., 1994. Limit theorems for weakly dependent Hilbert space valued random variables with application to the stationary bootstrap. Statistica Sinica 4, 461–476. Politis, D.N., Romano, J.P., Wolf, M., 1999. Subsampling. Springer-Verlag, New York. Romano, J.L., Thombs, L.A., 1996. Inference for autocorrelations under weak assumptions. Journal of the American Statistical Association 91, 590–600. Rosenblatt, M., 1971. Markov Processes, Structure and Asymptotic Behavior. Springer, New York. Rosenblatt, M., 1985. Stationary Sequences and Random Fields. Birkhäuser, Boston. Shao, X., 2010. Testing for white noise under unknown dependence and its applications to diagnostic checking for time series models. Econometric Theory (in press). Shao, X., Wu, W.B., 2007. Asymptotic spectral theory for nonlinear time series. Annals of Statistics 4, 1773–1801. Stute, W., Gonzalez-Manteiga, W., Presedo-Quindimil, M., 1998. Bootstrap approximations in model checks for regression. Journal of the American Statistical Association 93, 141–149. Subba Rao, T., Gabr, M.M., 1984. An Introduction to Bispectral Analysis and Bilinear Time Series Models. In: Lecture Notes in Statistics, vol. 24. Springer-Verlag, New York. Tong, H., 1990. Non-linear Time Series: A Dynamical System Approach. Oxford University Press. Van der Vaart, A.W., Wellner, J.A., 1996. Weak Convergence and Empirical Processes. Springer, New York. Wu, C.F.J., 1986. Jackknife, bootstrap and other resampling methods in regression analysis (with discussion). Annals of Statistics 14, 1261–1350. Wu, W.B., 2005. Nonlinear system theory: another look at dependence. Proceedings of National Academy of Science, USA 102, 14150–14154. Wu, W.B., 2007. Strong invariance principles for dependent random variables. Annals of Probability 35, 2294–2320. Wu, W.B., Min, M., 2005. On linear processes with dependent innovations. Stochastic Processes and Their Applications 115, 939–958. Wu, W.B., Shao, X., 2004. Limit theorems for iterated random functions. Journal of Applied Probability 41, 425–436.
Journal of Econometrics 162 (2011) 225–239
Contents lists available at ScienceDirect
Journal of Econometrics journal homepage: www.elsevier.com/locate/jeconom
Nonparametric model validations for hidden Markov models with applications in financial econometrics Zhibiao Zhao ∗ Department of Statistics, Penn State University, University Park, PA 16802, United States
article
info
Article history: Received 22 February 2009 Received in revised form 9 January 2011 Accepted 21 January 2011 Available online 28 January 2011 JEL classification: C12 C14 C22
abstract We address the nonparametric model validation problem for hidden Markov models with partially observable variables and hidden states. We achieve this goal by constructing a nonparametric simultaneous confidence envelope for transition density function of the observable variables and checking whether the parametric density estimate is contained within such an envelope. Our specification test procedure is motivated by a functional connection between the transition density of the observable variables and the Markov transition kernel of the hidden states. Our approach is applicable for continuoustime diffusion models, stochastic volatility models, nonlinear time series models, and models with market microstructure noise. © 2011 Elsevier B.V. All rights reserved.
Keywords: Confidence envelope Diffusion model Hidden Markov model Market microstructure noise Model validation Nonlinear time series Transition density Stochastic volatility
1. Introduction
In finance, an important Markov chain example is the continuous-time diffusion model
Let {Xt }t ∈T be a stationary process with time index t ∈ T = {0, 1, 2, . . .} in discrete-time setting or the interval T = [0, ∞)
dXt = µ(Xt )dt + σ (Xt )dWt ,
in continuous-time setting. Examples include stock prices, interest rates, temperature series, rainfall measurements, and unemployment rates over a certain period of time among others. The datagenerating mechanism underlying the process {Xt }t ∈T usually involves some unknown parameter Q which could be finite dimensional real-valued parameters in parametric settings or nonparametric functions in nonparametric inference problems. To conduct statistical inference about Q, one often needs to impose a certain dependence structure on the underlying process {Xt }t ∈T . For the latter purpose, Markov chains are widely used in virtually every scientific subjects, such as biology, engineering, queueing theory, physics, finance, econometrics, and statistics.
∗ Corresponding address: 326 Thomas Building, University Park, PA 16802, United States. Tel.: +1 814 865 6552; fax: +1 814 863 7114. E-mail address:
[email protected]. 0304-4076/$ – see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.jeconom.2011.01.002
t ≥ 0,
(1)
where {Wt }t ≥0 is a standard Brownian motion, and µ and σ are the drift and volatility functions. Then {Xt }t ≥0 is a Markov chain, and so is the discrete sample {Xi∆ }i∈N with ∆ being the sampling interval. In nonparametric setting, no parametric forms are imposed on (µ, σ ), and we may take Q = (µ, σ ) to be a vector of two functions. On the other hand, if we know (µ, σ ) = (µ(·; θ ), σ (·; θ )) for some known parametric forms µ(·; θ ) and σ (·; θ ) with unknown parameter θ , then we may take Q = θ . The survey paper Zhao (2008) reviewed different specifications of (1). In econometrics, another useful example is the discrete-time version of (1): Xi = µ(Xi−1 ) + σ (Xi−1 )εi , where εi , i ∈ Z, are independent and identically distributed (i.i.d.) random variables. See Section 3.4 for different specifications. Despite the popularity of Markov chains, the Markovian assumption seems too restrictive in many situations. One distinctive example violating the Markovian assumption is the class of stochastic volatility models. Stochastic volatility model has emerged as a useful alternative to the traditional deterministic
226
Z. Zhao / Journal of Econometrics 162 (2011) 225–239
volatility model (e.g. Hull and White, 1987; Taylor, 1994; Kim et al., 1998; Ball and Torous, 1999). Consider the continuous-time stochastic volatility model d log(St ) = σt dW1 (t ) and dσt2 = r (σt2 )dt + s(σt2 )dW2 (t ), (2) where {W1 (t )}t ≥0 and {W2 (t )}t ≥0 are two independent standard Brownian motions. Then the return series {Xi = log(Si∆ ) − log(S(i−1)∆ )}i∈N is not a Markov chain. Similarly, for the discretetime stochastic volatility model Xi = σi εi
and σi2 = r (σi2−1 ) + s(σi2−1 )ηi ,
(3)
where {εi } and {ηi } are two independent i.i.d. sequences. Then {Xi } from (3) is not a Markov chain. Volatility plays an important role in risk analysis and options pricing. To study the evolving dynamics of volatilities, we can let Q = (r , s) be the parameter of interest in (2) and (3). Due to the unobservable volatilities, it is more challenging to conduct statistical inferences for (2) and (3) than (1); see Broto and Ruiz (2004). In this article we study inferences for hidden Markov models (HMM). Given discrete observations {Xi } from a stationary process {Xt }t ∈T , we assume that {(Xi , Yi )} is a HMM with observable variables {Xi } and some hidden or unobservable states {Yi }, where {Yi } is a Markov chain; see the definition in Section 2. While Markov chain can be viewed as a special example of HMM, HMM includes many models that otherwise cannot be studied under the Markov chain framework. For example, while the returns {Xi = log(Si∆ ) − log(S(i−1)∆ )}i∈N in (2) are non-Markovian, they form HMM with the hidden states {Yi = (σs+(i−1)∆ )s∈[0,∆] }i∈N or
i∆
Yi = σi∆ , (i−1)∆ σs2 ds ; see Section 3.2. Similarly, for (3), i∈N {Xi } is not a Markov chain but a HMM with the hidden states {Yi = σi }; see Section 3.5. For a third example, consider Xi = Yi + εi ,
(4)
where {Xi } is the observation sequence, {Yi } is the underlying true but unknown process, and {εi } is the measurement or contamination error independent of {Yi }. Several researchers have used (4) to model financial markets in the presence of market microstructure noise (e.g. Aït-Sahalia et al., 2005; Zhang et al., 2005). If {Yi } is a Markov chain, then {Xi } is a HMM with hidden chain {Yi }; see Section 3.6. For other applications of HMM, see the monograph by MacDonald and Zucchini (1997). Due to the unobservable states, statistical inference for HMM is more challenging than that for Markov chains. Given observations {Xi }, researchers want to draw statistical inferences about Q which generates the HMM with unobservable states {Yi }. If we know an a priori parametric family {Qθ , θ ∈ Θ } for Q, then a parametric setting would be reasonable and the main focus becomes the estimation of parameter θ . In many situations, however, researchers have no or little prior information, and a mis-specification of the underlying model could lead to wrong conclusions. For example, specifying the correct model of the price process of the underlying assets plays a key role in the pricing of derivatives. In such circumstances, it is essential to test the null hypothesis H0 : Q = Qθ for some unknown parameter θ ∈ Θ before using any parametric model Qθ . In different contexts in the literature, the latter hypothesis testing problem is often called model validation, model checking, goodness-of-fit, or specification testing. This is the primary goal of this article. Nonparametric model validation under dependence has been an important yet difficult problem. A few model-specific model validation approaches have been proposed. Azzalini and Bowman (1993) studied model checking by using pseudo-likelihood ratio test. Härdle and Mammen (1993) proposed measuring the discrepancy between parametric and nonparametric estimates of the mean regression function. For residuals based tests, see
Fan and Li (1996) and Hong and White (1995). Fan and Yao (2003) dealt with model validation problem for time series data by using generalized likelihood ratio test in Fan et al. (2001), which has been developed for independent data. Zhao and Wu (2008) studied model validations for time series models through simultaneous confidence bands. Most aforementioned approaches rely on nonparametric regression estimation. Recently, there has been considerable interest in density based specification tests. AïtSahalia (1996) introduced a density based test by comparing the parametric and nonparametric density estimates for model (1); see also Hong and Li (2005), Bosq (1998), and Gao and King (2004) for density based approaches. For Markov models, Aït-Sahalia et al. (2009) proposed specification tests based on transition densities. Aït-Sahalia et al. (2010) also used transition densities to test the Markov hypothesis. In this article we consider model validations for HMM {(Xi , Yi )} with observable variables {Xi } and unobservable states {Yi }, without imposing any specific model structure. We achieve this goal by constructing a nonparametric simultaneous confidence envelope (SCE) for the transition density or conditional density function of Xi given Xi−1 . Our specification test procedure is motivated by a functional connection between the transition density of the observable variables {Xi } and the Markov transition kernel of the unobservable states {Yi }; see Section 2.1 for more discussions. The proposed method constructs nonparametric SCE for transition density and checks whether the parametrically implied density estimate is contained within such an envelope. As demonstrated in Section 3, the proposed method works for a variety of models widely used in financial econometrics, including continuous-time diffusion models, stochastic volatility models, nonlinear time series models, and models with market microstructure noise among others. The rest of this article is organized as follows. In Section 2, we first discuss the motivation of our method and then address the model validation problem by constructing nonparametric SCE for transition density function. In Section 3, we demonstrate the applicability of our methods for several widely used models. The finite sample performance is studied in Section 4. We defer the proofs to Section 5. 2. Transition density specification test for HMM Given discrete samples {Xi } from a stationary process {Xt }t ∈T whose data-generating mechanism involves some unobservable states {Yi } and unknown characteristics Q, we are interested in the hypothesis testing problem H0 : Q = Qθ , where Qθ is some parametric form, θ ∈ Θ is an unknown parameter, and Θ is the parameter space. We shall address this model validation problem for hidden Markov models (HMM). First we give a formal definition of HMM by following Bickel and Ritov (1996) with minor modifications. For a set S , we denote its Borel set by B (S ). Definition 1. A real-valued stochastic process {Xi }i∈Z on (R, B (R)) is a hidden Markov model with the hidden chain or states {Yi }i∈Z on a polish space (Y, B (Y)) if (i) {Yi }i∈Z is a strictly stationary Markov chain. (ii) For all i, given {Yj }j≤i , {Xj }j≤i are conditionally independent, and the conditional distribution of Xi depends only on Yi . (iii) The conditional distribution of Xi given Yi does not depend on i. 
In the definition, the observable process {Xi } is real-valued whereas the hidden states process {Yi } can take values in a general polish space (Y, B (Y)). For example, as we will show in Section 3.2, for the continuous-time stochastic volatility model (2),
Z. Zhao / Journal of Econometrics 162 (2011) 225–239
Yi itself is a stochastic process with continuous sample path and Y is the set of continuous functions. If {Xi } is a stationary Markov chain, then it is also a HMM with {Yi = Xi }. In Section 3, we show that many continuous and discrete time models are special examples of HMM.
invariant probability measure ΠY (S ) = P(Yi ∈ S ) with S ∈ B (Y) and the Markov transition kernel or generator QY (S |y) = P(Yi ∈ S |Yi−1 = y) with y ∈ Y and S ∈ B (Y). The identity (7) then becomes
Y
qX (x |x) = ′
2.1. Transition density: connecting the observable {Xi } and unobservable {Yi } To motivate the idea, we first focus on the simple case that the unobservable stationary Markov chain {Yi } is real-valued, and we will deal with the general case at the end of this section. For the Markov chain {Yi }, the transition density qY (y′ |y) of Yi at y′ given Yi−1 = y plays a key role in characterizing the probabilistic properties of {Yi }. In many applications, qY (y′ |y) can fully characterize the underlying process. For example, consider Yi = µ(Yi−1 ) + σ (Yi−1 )εi , where εi are i.i.d. standard normal random variables. Denote by φ the standard normal density function. Then qY (y′ |y) = φ{[y′ − µ(y)]/σ (y)}/σ (y). It can be easily shown that, if two specifications (µ, σ ) and (µ∗ , σ ∗ ) result in the same transition density qY (y′ |y), then (µ, σ ) = (µ∗ , σ ∗ ), and hence qY (y′ |y) completely determines the underlying datagenerating mechanism. In fact, several researchers have used marginal and transition densities to address specification testing problems for Markov models (e.g. Aït-Sahalia, 1996; Aït-Sahalia et al., 2009). Aït-Sahalia et al. (2010) also used transition densities to test the Markov hypothesis. Unfortunately, since {Yi } is not observable, qY (y′ |y) is not directly applicable, and we need to seek a reasonable ‘‘proxy’’ of it based on observations {Xi }. Let πX (x), pX (x, x′ ) and qX (x′ |x) be the marginal density of Xi , the joint density function of (Xi−1 , Xi ) at point (x, x′ ), and the transition density or conditional density of Xi at x′ given Xi−1 = x, respectively. Similarly, we define πY (y) and pY (y, y′ ) as the marginal density of Yi and joint density of (Yi−1 , Yi ), respectively. Denote by qX |Y (x|y) the conditional density of Xi at x given Yi = y. By the definition of HMM, conditioning on {Yj }j≤i , Xi−1 and Xi are independent and their conditional distributions depend only on Yi−1 and Yi , respectively. So, by the conditioning argument, we have pX (x, x′ ) =
∫∫
qX |Y (x|y)qX |Y (x′ |y′ )pY (y, y′ )dydy′ .
(5)
227
Y
qX |Y (x|y)qX |Y (x′ |y′ )ΠY (dy)QY (dy′ |y)
Y
qX |Y (x|y)ΠY (dy)
.
(8)
In some applications, it is more convenient to express (8) through expectations
E[qX |Y (x|Yi−1 )qX |Y (x′ |Yi )]
qX (x′ |x) =
E[qX |Y (x|Yi−1 )]
.
(9)
In particular, the expectation motivates us to estimate the numerator and denominator by their corresponding empirical versions; see Section 2.4. 2.2. Confidence envelope and specification testing In this section we propose a specification testing procedure based on qX (x′ |x). We achieve this by constructing a simultaneous confidence envelope (SCE) for qX (x′ |x) over a compact set X ⊂ R2 . For a significance level α ∈ (0, 1) and a pair of bivariate functions ℓn (·, ·) and un (·, ·) based on observations, we say that [ln (·, ·), un (·, ·)] is an asymptotic (1 − α) nonparametric SCE for qX (x′ |x) on a compact set X ⊂ R2 if lim P{ln (x, x′ ) ≤ qX (x′ |x) ≤ un (x, x′ )
n→∞
for all (x, x′ ) ∈ X} = 1 − α.
(10)
With asymptotic probability (1 − α), the unknown true density qX (x′ |x) is contained within the envelope [ln (·, ·), un (·, ·)]. The concept of confidence envelope for a bivariate function is a natural extension of the confidence interval for one-dimensional parameter. Now we relate the idea of nonparametric SCE to our model validation problem. Consider the popular nonparametric kernel density estimate for the transition density qX (x′ |x) pˆ X (x, x′ )
qˆ X (x′ |x) =
πˆ X (x)
,
(11)
Kbn (x − Xi−1 )Kbn (x′ − Xi ),
(12)
where Conditioning on Yi , we have
πX (x) =
∫
qX |Y (x|y)πY (y)dy.
(6)
Since qX (x′ |x) = pX (x, x′ )/πX (x) and qY (y′ |y) = pY (y, y′ )/πY (y), by (5) and (6), qX (x′ |x) =
qX |Y (x|y)qX |Y (x′ |y′ )πY (y)qY (y′ |y)dydy′
qX |Y (x|y)πY (y)dy
.
(7)
The identity (7) establishes a functional connection between qY (y′ |y) of the hidden states {Yi } and qX (x′ |x) of the observable variables {Xi }. In particular, qX (x′ |x) is uniquely determined by qY (y′ |y) together with qX |Y (x|y) and πY (y). On the other hand, {qX (x′ |x)}x,x′ ∈R also contains rich information about qY (y′ |y) through the functional relationship (7). Therefore, we can view the process from the observable qX (x′ |x) to the unobservable qY (y′ |y) as a functional inverse problem. In practice, qX |Y is often specified through model assumptions. For example, for the HMM {(Xi , Yi ) = (Xi , σi )} in (3), qX |Y is simply the normal density if we assume that εi therein are normal errors. In summary, (7) motivates us to use qX (x′ |x) as a ‘‘proxy’’ for qY (y′ |y). Now we consider the case of a general polish space (Y, B (Y)). The probabilistic properties of {Yi } are characterized by the
pˆ X (x, x′ ) =
πˆ X (x) =
n 1 −
nb2n i=1
n 1 −
nbn i=1
Kbn (x − Xi−1 ).
(13)
Here and hereafter Kbn (u) = K (u/bn ) for a kernel function K satisfying R K (u)du = 1 and bandwidth bn > 0. Under mild conditions, qˆ X (x′ |x) is a consistent estimate of qX (x′ |x), regardless of the underlying model structure, and hence it can be used as the ‘‘true’’ reference density. Under H0 : Q = Qθ , denote by qˆ X (x′ |x; Qθˆ ) a parametric estimate of qX (x′ |x), where θˆ is an estimator of θ . To test H0 , we can measure the discrepancy between the parametric estimate qˆ X (x′ |x; Qθˆ ) under H0 and the nonparametric estimate qˆ X (x′ |x). A general test statistic has the form Tn = d(ˆqX (·|·; Qθˆ ), qˆ X (·|·)) with a proper choice of the distance measure d(·, ·). Aït-Sahalia (1996) studied model (1) based on square distance. Following Bickel and Rosenblatt (1973), we consider maximal deviation:
|ˆqX (x′ |x) − qˆ X (x′ |x; Qθˆ )| , (x,x′ )∈X ω(x, x′ )
Tn = max
(14)
228
Z. Zhao / Journal of Econometrics 162 (2011) 225–239
where ω(·, ·) is a weight function that may depend on observations. Bickel and Rosenblatt (1973) considered marginal density for i.i.d. observations. Classical result asserts the asymptotic normal ity of nb2n [ˆqX (x′ |x) − qX (x′ |x)]. On the other hand,√under H0 , the parametric density estimate qˆ X (x′ |x; Qθˆ ) is usually n-consistent. Therefore, under H0 : Q = Qθ , as bn → 0, qˆ X (x′ |x; Qθˆ ) − qX (x′ |x) is negligible compared to qˆ X (x′ |x) − qX (x′ |x), and we have
|[ˆqX (x′ |x) − qX (x′ |x)] − [ˆqX (x′ |x; Qθˆ ) − qX (x′ |x)]| (x,x′ )∈X ω(x, x′ )
Tn = max
≈ max
(x,x′ )∈X
|ˆqX (x′ |x) − qX (x′ |x)| := Tn∗ . ω(x, x′ )
(15)
The quantity Tn∗ is closely related to nonparametric SCE for qX (x′ |x). To see this, we assume that there exist normalizing sequences (γn , βn ) such that γn Tn∗ − βn has a limiting distribution G. Denote by Gα the (1 − α)-quantile of G. Based on Tn∗ , a nonparametric SCE for qX (x′ |x) can be constructed as qˆ X (x′ |x) ± γn−1 (Gα + βn )ω(x, x′ ) in view of
P{|ˆqX (x′ |x) − qX (x′ |x)| ≤ γn−1 (Gα + βn )ω(x, x′ ), (x, x′ ) ∈ X}
= P{γn Tn∗ − βn ≤ Gα }. The above argument has a number of implications. First, the test statistic Tn is equivalent to constructing nonparametric SCE for qX (x′ |x). Second, under H0 , the parametric estimate qˆ X (x′ |x; Qθˆ ) is contained within the nonparametric SCE with asymptotic probability (1 − α). Third, it is essential to construct SCE based on the nonparametric estimate qˆ X (x′ |x), otherwise the method might fail. Consider, for example, the case where the SCE were constructed based on the parametric estimate qˆ X (x′ |x; Qθˆ ) under √ H0 . Due to the n-consistence of qˆ X (x′ |x; Qθˆ ), the constructed SCE would have a width proportional to O(n−1/2 ) and therefore cannot 2 cover the more volatile nbn -consistent nonparametric estimate qˆ X (x′ |x). We now summarize our specification testing procedure: (i) Construct (1 − α) nonparametric SCE for qX (x′ |x): [ℓn (·, ·), un (·, ·)]. (ii) Under H0 , apply parametric methods to obtain an estimate θˆ of θ . (iii) Under H0 , obtain a parametric estimate qˆ X (x′ |x; Qθˆ ) of qX (x′ |x). (iv) Check whether ln (x, x′ ) ≤ qˆ X (x′ |x; Qθˆ ) ≤ un (x, x′ ) holds for all (x, x′ ) ∈ X. If so, then we accept H0 at level α . Otherwise H0 is rejected. To implement the above idea, two tasks remain. The first is to establish some maximal deviation result of the form (15) for the nonparametric density estimate qˆ X (x′ |x). The second is to construct parametrically implied density estimate qˆ X (x′ |x; Qθˆ ) of qX (x′ |x) under H0 . We shall address these two issues in Sections 2.3 and 2.4. 2.3. Construct confidence envelope for transition density For maximal deviations of nonparametric density estimate, Bickel and Rosenblatt (1973) dealt with i.i.d. data. Under the independence assumption, the problem of constructing simultaneous confidence bands (SCB, the term ‘‘band’’ is used for univariate function in contrast to ‘‘envelope’’ for bivariate function) have been studied previously under various settings (e.g. Johnston, 1982; Knafl et al., 1985; Eubank and Speckman, 1993; Fan and Zhang, 2000). Zhao and Wu (2008) considered SCB construction for time series models. Here we shall extend Bickel and Rosenblatt (1973)’s result to the transition density of hidden Markov models (HMM). Recall that X ⊂ R2 is a two-dimensional compact set.
Condition 1 (Dependence Assumption). Suppose that {Xi } is a HMM j with respect to a stationary hidden chain {Yi }. Denote by Gi the sigmafield generated by {Yt , i ≤ t ≤ j}. Define the α -mixing coefficient of {Yi } by
α(k) = sup{|P(A)P(B) − P(A ∩ B)|, A ∈ G0−∞ , B ∈ G∞ k }, ∑∞ Assume that k=1 α(k) < ∞.
k ∈ N.
Condition 2 (Kernel Assumption). Assume that the kernel K is bounded, symmetric, and has bounded derivative and bounded support [−ω, ω ω]. Further assume that K is a 4-th order ω kernel in the sense that −ω uj K (u)du = 0, j = 1, 2, 3. Let ϕK = −ω K 2 (u)du. Condition 3 (Regularity Assumption). Without loss of generality write X = [−T , T ] × [−T , T ] for some T > 0. Denote by Xϵ = [−T − ϵ, T + ϵ] × [−T − ϵ, T + ϵ] the ϵ -neighborhood of X. Assume that there exists some ϵ > 0 such that πX (x) has bounded fourth order derivatives on [−T − ϵ, T + ϵ] and qX (x′ |x) has bounded fourth order derivatives with respect to both x and x′ on Xϵ , infx∈[−T +ϵ,T −ϵ] πX (x) > 0, inf(x,x′ )∈Xϵ qX (x′ |x) > 0, and supx∈[−T −ϵ,T +ϵ],y∈Y [|qX |Y (x|y)| + |∂ qX |Y (x|y)/∂ x|] < ∞. We briefly comment on Conditions 1–3. Condition 1 is frequently used in statistical inferences involving dependent data. Condition 2 is a standard assumption on the kernel function in nonparametric inference problems. Condition 3 imposes smoothness assumptions. Theorem 1. Recall qˆ X (x′ |x) in (11). Assume that Conditions 1–3 hold. 2 ′ Further assume that nb10 n → 0, nbn → ∞. Then for every (x, x ) ∈ X, as n → ∞, Zn (x, x′ ) :=
bn
√
n qˆ X (x′ |x) − qX (x′ |x)
√
ϕK
qX (x′ |x)/πX (x)
H⇒ N (0, 1).
(16)
Moreover, for distinct points (x1 , x′1 ), . . . , (xk , x′k ) ∈ X, Zn (x1 , x′1 ), . . . , Zn (xk , x′k ) are asymptotically independent. Theorem 1 can be used to construct a point-wise confidence envelope for qX (x′ |x). Theorem 2 provides a maximal deviation result for qˆ X (x′ |x). Theorem 2. Let qˆ X (x′ |x) be as in (11). Recall that, in Condition 2, K has support [−ω, ω] and X is a compact subset of R2 . For mn → ∞, let Xn = {(xj , x′j ) ∈ X, j = 1, . . . , mn } be a subset of X consisting of mn points such that max{|xj − xj′ |, |x′j − x′j′ |} ≥ 2ωbn for all 1 ≤ j ̸= j′ ≤ mn . Assume that Conditions 1–3 hold. Further assume that 2 3 3 2 −1 nb10 ] → 0. n log n + mn (log mn ) [bn + (nbn )
(17)
For k ≥ 2 define Bk (z ) =
2 log k − √
1 2 log k
1 2
√
log log k + log(2 π ) − z ,
k ∈ N, z ∈ R.
(18)
Then for every z ∈ R,
lim P
n→∞
sup (x,x′ )∈Xn −z
= e−2e .
bn
√
ϕK
n |ˆqX (x′ |x) − qX (x′ |x)|
√
qX (x′ |x)/πX (x)
≤ Bmn (z ) (19)
In (17), the first term nb10 n log n → 0 is needed to control the bias while the second term m2n (log mn )3 [b3n +(nb2n )−1 ] → 0 ensures the validity of the moderate deviation (cf. Theorem 3). In particular, if mn ≍ 1/bn , then (17) holds if bn = n−β for β ∈ (1/10, 1/4).
Z. Zhao / Journal of Econometrics 162 (2011) 225–239
We can use Theorem 2 to construct asymptotic (1 − α) SCE for qX (x′ |x) as qˆ X (x′ |x) ±
ϕK Bmn (zα ) qˆ X (x′ |x)/πˆ X (x), √ bn n
(20)
1 zα = − log log √ , 1−α
in the following approximated version of (10) (note that zα is the (1 − α)-quantile of the limiting distribution on the right-hand side of (19)): lim P{ln (x, x′ ) ≤ qX (x′ |x)
Here we include θ to mean that the estimate is based on samples from Qθ . Under mild conditions, for example mixing condition, by the Strong Law of Large Numbers for both the numerator and denominator in (22), qˆ X (x′ |x; Qθ ) → qX (x′ |x; Qθ ) as m → ∞. To implement the above idea, we need a consistent estimate θˆ of θ . Parameter estimation for HMM is usually a difficult task. Under parametric specification Qθ , a natural choice is the likelihood method. By the conditional independence, the likelihood for observations x := {Xi = xi }1≤i≤n is L(x; θ ) =
n→∞
approximation to (10) for large n. Remark 1. In Condition 3, we have assumed that the conditional density qX |Y of Xi given Yi exists. If {Xi } itself is a Markov chain, then we can let {Yi = Xi } be the observable states. However, the conditional density qX |Y does not exist. A careful examination of the proof reveals that, Theorems 1 and 2 still hold if we replace qX |Y in Condition 3 by the conditional density of Xi given Xi−1 and apply the new filtration Fi = σ (Xj , j ≤ i) in the proofs. Thus, the theoretical results developed also apply to ordinary Markov chains. 2.4. Construct parametric transition density estimate After constructing nonparametric SCE for qX (x′ |x), in order to test H0 : Q = Qθ , we need to obtain a parametric density estimate qˆ X (x′ |x; Qθˆ ) of qX (x′ |x) under H0 and check whether it is contained within SCE. Parametric estimation of qˆ X (x′ |x; Qθˆ ) involves estimating parameter θ through various parametric methods, such as least-squares, maximum likelihood, generalized method of moments, and M-estimation methods. Under H0 : Q = Qθ , let us assume at the outset that θ is known. In practice, the conditional density qX |Y of Xi given Yi is often specified through model assumptions. For example, for the HMM {(Xi , Yi ) = (Xi , σi )} in (3), qX |Y is the normal density if we assume that εi therein are normal errors. Therefore, under H0 : Q = Qθ with known θ , the underlying model is completely specified. Theoretically speaking, depending on whether {Yi } takes value in R or a general polish space Y, one can use either (7) or (8) to obtain theoretical parametric density qX (x′ |x; Qθ ) by replacing (πY , qY ) and (ΠY , QY ) therein by their theoretical expressions. In many applications, however, the above naive method fails. First, there is no closed-form (πY , qY ) in (7) or (ΠY , QY ) in (8). Even for the simplest ARCH(1) model Yi = (a2 + b2 Yi2−1 )1/2 εi , there is no closed-form stationary density. For (2) with hidden states {Yi = (σs+(i−1)∆ )s∈[0,∆] }i≥1 , Y is the set of continuous function on [0, ∆], and it is infeasible to compute (ΠY , QY ). Second, it is not easy to evaluate the integrals in (7) and (8), especially when {Yi } takes value in a general polish space Y. To attenuate these issues, we propose a Markov Chain Monte Carlo simulation based method. Under H0 : Q = Qθ with known θ so that the model structure is completely specified, we can simulate the hidden states {Yj∗ }1≤j≤m from the underlying model Qθ . In (9), replacing the numerator and denominator by their corresponding empirical versions, we propose the following estimate of qX (x′ |x): m−1 qˆ X (x′ |x; Qθ ) =
m ∑
qX |Y (x|Yi∗−1 ; θ )qX |Y (x′ |Yi∗ ; θ )
i=1
m−1
∫
∫
m ∑ i=1
. qX |Y (x|Yi∗−1 ; θ )
(22)
qX |Y (x1 |y1 ) · · · qX |Y (xn |yn )
··· Y
≤ un (x, x′ ) for all (x, x′ ) ∈ Xn } = 1 − α. (21) As mn → ∞, we can let Xn become asymptotically dense in X. Therefore, for smooth function qX (x′ |x), (21) provides a good
229
Y
× PY (dy1 , . . . , dyn ),
(23)
where PY (S1 , . . . , Sn ) is the joint probability measure of {Yi }1≤i≤n . Therefore, we can in principle obtain an estimate θˆ of θ by maximizing the likelihood L(x; θ ). However, L(x; θ ) is not directly computable even for simple models. A possible solution is to use the same idea in (22) by approximating L(x; θ ) as Lˆ (x; θ ) =
m 1 −
m i =1
(i)
(i)
qX |Y (x1 |Y1 ) · · · qX |Y (xn |Yn(i) ),
(24)
(i)
where Y1 , . . . , Yn is the ith sample path. This method requires a large number of sample paths (m sample paths of size n each) and is computationally expensive. In general, there are no universally efficient parameter estimation methods for HMM. In next section we discuss related references for some specific models. Now we summarize our procedure to obtain parametric density estimate qˆ X (x′ |x; Qθˆ ). (i) Under H0 : Q = Qθ , obtain an estimate θˆ . (ii) Under the estimated parametric model Qθˆ , simulate Y1∗ , . . . , Ym∗ . (iii) Plug θˆ and simulated samples Y1∗ , . . . , Ym∗ into (22) to obtain m −1 qˆ X (x |x; Qθˆ ) = ′
m ∑
qX |Y (x|Yi∗−1 ; θˆ )qX |Y (x′ |Yi∗ ; θˆ )
i =1
m−1
m ∑
.
(25)
qX |Y (x|Yi∗−1 ; θˆ )
i=1
3. Examples In this section, we show that many popular continuous-time and discrete-time models in financial econometrics can be viewed as HMMs with properly chosen {(Xi , Yi )}. Under H0 : Q = Qθ , since various parameter estimation methods, such as maximum likelihood estimate (MLE), generalized least-squares method, Mestimation, and generalized method of moments (GMM), are available in a vast literature, we only briefly discuss some related references and focus our attention on constructing parametric density estimate under H0 . Throughout the rest of this article, we denote by φ(x) and Φ (x) the standard normal density and distribution functions, respectively. For p > 0 and a random variable X write X ∈ Lp if ‖X ‖p := [E(|X |p )]1/p < ∞. 3.1. Continuous-time diffusion models Let Xi ≡ Xi∆ be discrete samples from the diffusion model (1). Here ∆ > 0 is a small but fixed number representing sampling interval. In practice, for daily or weekly data, ∆ is one day or one week, respectively. The process {Xi } is a Markov chain. Denote by
230
Z. Zhao / Journal of Econometrics 162 (2011) 225–239
πX be the marginal density function of the stationary solution Xt on D = (Dl , Du ) with −∞ ≤ Dl < Du ≤ +∞. By Aït-Sahalia (1996), ∫ x 2µ(y) c (x0 ) πX (x) = 2 exp dy , (26) 2 σ (x) x0 σ (y) where the choice of the lower bound point x0 ∈ D is irrelevant, and c (x0 ) is a normalizing constant. Similar versions of Condition 4 below have been discussed in Hansen and Scheinkman (1995), AïtSahalia (1996), and Genon-Catalot et al. (2000).
And the transition density is given by qˆ X (x |x; Qθˆ ) = ′
1
√ φ σ (x; θˆ ) ∆
x′ − x − µ(x; θˆ )∆
√ σ (x; θˆ ) ∆
.
(31)
3.2. Continuous-time stochastic volatility models
Condition 4. Assume that: (i) Both µ and σ are twice continuously differentiable and σ 2 > 0 on D; (ii) The integral of m(x) = exp{Λ(x)}/σ 2 (x) converges at both boundaries of D, and the integral of s(x) = x exp{−Λ(x)} diverges at both boundaries of D, where Λ(x) = x [2µ(y)/σ 2 (y)]dy; (iii) limx→Dl ,Du σ (x)πX (x) = 0 and
Consider the continuous-time stochastic volatility model (2). Let Xi = log(Si∆ ) − log(S(i−1)∆ ) be the log returns during time period [(i − 1)∆, i∆]. The unobserved volatility process {σt2 }t ≥0 is a stationary Markov process. Volatility plays an important role in risk analysis and options pricing. To study volatilities, we can take Q = (r , s). Part (i) of Proposition 2 is adopted from Genon-Catalot et al. (2000).
Proposition 1. Under Condition 4, Condition 1 holds with Yi = Xi .
Proposition 2. Assume that (r , s) in model (2) governing σt2 satisfy Condition 4. Then
0
limx→Dl ,Du σ (x)/|2µ(x) − σ (x)σ ′ (x)| < ∞.
Proof. Under Condition 4, Hansen and Scheinkman (1995) proved that, the operator Ht defined by Ht g (y) = E[g (Xt )|X0 = y] for g (X0 ) ∈ L2 is a strong contraction in the sense that there exists λ ∈ (0, 1) such that
‖Ht g (X0 )‖2 ≤ λt ‖g (X0 )‖2 for g (X0 ) ∈ L2 and E[g (X0 )] = 0.
(27)
Therefore, for g1 (X0 ), g2 (X0 ) ∈ L satisfying E[g1 (X0 )] = E[g2 (X0 )] = 0, by the Cauchy–Schwarz inequality,
|E[g1 (X0 )g2 (Xt )]| = |E{g1 (X0 )E[g2 (Xt )|X0 ]}| = |E[g1 (X0 )Ht g2 (X0 )]| (28)
t
Denote by Gt and G the sigma fields generated by {Xs }s≤t and {Xs }s≥t , respectively. The ρ -mixing coefficient of {Xt }t ≥0 is defined as where ρ(G, G′ ) = sup
|Cov(X , X )| , ‖X ‖2 ‖X ′ ‖2 ′
where the supermum is taken over all G (respectively G )measurable random variable X (respectively X ′ ) satisfying X , X ′ ∈ L2 , E(X ) = E(X ′ ) = 0. Since {Xt }t ≥0 is a stationary Markov process, by Theorem 4.1 in Bradley (1986) and (28),
|Cov[g1 (X0 ), g2 (Xt )]| ρt = sup ‖g1 (X0 )‖2 ‖g2 (Xt )‖2
2 t dt
Proof. (i): We shall only consider {Yi
}; see Genon-Catalot et al.
(2)
(2000) for the proof of {Yi }. Let Gt be the sigma field generated by {σs2 }s≤t . By the independence of {W1 (t )} and {W2 (t )}, conditional
j∆
σt dW1 (t ), j ≤ i, are independent normal i∆ random variables with zero mean and variance Σi2 = (i−1)∆ σt2 dt. Notice that the random variables Σj2 , j ≤ i, are measurable with (1) respect to the sigma field σ (Yj , j ≤ i) ⊂ Gi∆ . Therefore, we have for all λ1 , . . . , λi ∈ R, (j−1)∆
(1)
(1)
E[eλ1 X1 +···+λi Xi |Y1 , . . . , Yi ]
= E{E[eλ1 X1 +···+λi Xi |Gi∆ ]|Y1(1) , . . . , Yi(1) } ′
= (σt2+(i−1)∆ )t ∈[0,∆] }
or {Yi = (σ , (i−1)∆ σ )}; (ii) Condition 1 holds; (iii) Condition 3 holds provided that with probability one c1 < inft ≥0 σt2 ≤ supt ≥0 σt2 < c2 for some constants 0 < c1 < c2 < ∞. 2 i∆
on Gi∆ , Xj =
≤ ‖g1 (X0 )‖2 ‖Ht [g2 (X0 )]‖2 ≤ ‖g1 (X0 )‖2 ‖g2 (X0 )‖2 λt .
ρt = sup ρ(Gs , Gs+t ),
(2)
i∆
(1)
2
(1)
(i) {Xi } is a HMM with the hidden chain {Yi
= O(λt ).
Hence, {Xt }t ≥0 is ρ -mixing with mixing coefficient ρ(t ) = O(λt ), which completes the proof since α -mixing coefficient is less than the corresponding ρ -mixing coefficient.
= exp{(λ21 Σ12 + · · · + λ2i Σi2 )/2}, entailing the conditional independence of Xj , j ≤ i. The Markovian (1)
property of {σt2 }t ≥0 implies that {Yi }i≥0 is Markovian, completing the proof. (ii) See Proposition 1. (1) (iii) Let Yi = Yi . Then the conditional density qX |Y of Xi given Yi is uniformly bounded in view of qX |Y (x|Yi ) = φ(x/Σi )/Σi = (2π Σi2 )−1/2
× exp[−x2 /(2Σi2 )] ≤ (2π c ∆)−1/2 ,
(32)
i∆
After we construct SCE for qX (x′ |x) as in Section 2.3, we need to obtain a parametric estimate qˆ X (x′ |x; Qθˆ ) of qX (x′ |x) under H0 : Q = Qθ = (µ(·; θ ), σ (·; θ )) for a parametric specification Qθ . To estimate θ under H0 , a natural choice is the maximum likelihood estimate (MLE). Consider the following Euler discretization scheme:
and Σi2 = (i−1)∆ σt2 dt > c ∆. Similarly, |∂ qX |Y (x|Yi )| is uniformly bounded. Clearly, qX |Y (x|Yi ) is also uniformly bounded away from zero on any compact set. By (9) and the Dominated Convergence Theorem, it is easy to verify that qX (x′ |x) has bounded derivatives of all orders on any compact set. So, Condition 3 holds.
Xt +∆ − Xt = µ(Xt )∆ + σ (Xt )(Wt +∆ − Wt ).
Q = Qθ = (r (·; θ ), s(·; θ )). First, we need to estimate θ . Under parametric specification Qθ , a natural choice is the likelihood method. Let fΣ (Σ02 , . . . , Σn2 ; θ ) be the joint density of {Σi2 }0≤i≤n under H0 : Q = Qθ . By the argument in the proof of Proposition 2,
(29)
Then θ can be estimated by maximizing the approximate conditional log-likelihood
θˆ = argmax θ
×
n −
the likelihood for given observations x = {Xi = xi }0≤i≤n from (2) is
log
i=1
1
√ φ σ (Xi−1 ; θ ) ∆
We now discuss parametric density estimate under and H0 :
Xi − Xi−1 − µ(Xi−1 ; θ )∆
√ σ (Xi−1 ; θ ) ∆
L(x; θ ) =
.
(30)
∫ ∏ n
Σi−1 φ(xi /Σi )fΣ (Σ02 , . . . , Σn2 ; θ )dΣ02 · · · dΣn2 .
i=0
(33)
Z. Zhao / Journal of Econometrics 162 (2011) 225–239
Under H0 , fΣ is completely determined up to unknown parameter θ . Therefore, we can in principle obtain an estimate θˆ of θ by maximizing the likelihood L(x; θ ). However, L(x; θ ) is not directly even ∏n computable for very simple models. Since L(x; θ ) = −1 E i=0 Σi φ(xi /Σi ) , a possible solution is Markov chain Monte Carlo method by simulating a large number of sample paths of {σt2 }t ≥0 from the parametric model (2), which, unfortunately, involves extensive computation. Possible alternatives have been proposed in, for example, Kim et al. (1998) for Bayesian method, and Andersen and Sørensen (1996) for moments based method. For other contributions, see the survey paper by Broto and Ruiz (2004). We point out that, most existing estimation methods deal with simple (say, lognormal autoregressive) stochastic volatility models. It still remains open how to develop efficient estimation techniques for general stochastic volatility models. In our simulation studies, we use a moment based method. Recall qX |Y in (32). Once we have a consistent estimate of θ , we can apply (25) to estimate qX (x′ |x) parametrically as m−1
m ∑
φ(x/Σi∗−1 )φ(x′ /Σi∗ )/(Σi∗−1 Σi∗ )
i=1
qˆ X (x |x; Qθˆ ) = ′
m−1
m ∑
,
(34)
φ(x/Σi∗−1 )/Σi∗−1
Under mild conditions, Zhao (2010) derived a Bahadur represen√ tation for θˆ and established its n-consistency. Assume that εi are standard normal random variables. By the plug-in method, the parametrically estimated transition density is qˆ X (x |x; Qθˆ ) = ′
1
σ (x; θˆ )
φ
x′ − µ(x; θˆ )
σ (x; θˆ )
.
(37)
3.5. Discrete-time stochastic volatility models Let {Xi } be samples from the discrete-time stochastic volatility model (3). Special examples have been studied in Ruiz (1994), Jacquier et al. (1994) and Kim et al. (1998). If we are interested in the data-generating mechanism of the unobservable volatility process {σi }, then we can let Q = (r , s). Clearly, {Xi } is a HMM with the hidden chain {Yi = σi }. Now we consider parametric density construction under H0 : Q = Qθ = (r (·; θ ), s(·; θ )). Let θˆ be an estimate of θ ; see Section 3.2 for various estimation methods. Assume that {εi } and {ηi } are i.i.d. standard normal random variables. Then the conditional density qX |Y of Xi given Yi = y is qX |Y (x|y) = y−1 φ(x/y). By (25), we propose
i=1
i∆
231
m−1
m ∑
φ(x/σi∗−1 )φ(x′ /σi∗ )/(σi∗−1 σi∗ )
∗2 ∗2 (i−1)∆ σt dt and {σt } is a simulated realization from the estimated null model dσt∗2 = r (σt∗2 ; θˆ )dt + s(σt∗2 ; θˆ )dWt .
qˆ X (x |x; Qθˆ ) =
3.3. Stochastic volatility models driven by stable Lévy process
where {σi∗2 } are simulated samples from the estimated null model
where Σi∗2 =
i=1
′
m −1
m ∑
, φ(x/σi∗−1 )/σi∗−1
i =1
σi∗2 = r (σi∗−21 ; θˆ ) + s(σi∗−21 ; θˆ )ηi .
Model (2) can be extended to non-Gaussian stable Lévy processes. A process {Zt } is said to be a α -stable Lévy process if it has independent and stationary increments, and Z1 has a stable distribution with index α ∈ (0, 2]. The special case of α = 2 corresponds to Brownian motion. Consider d log(St ) = σt dZ (t ) and
dσtα = r (σtα )dt + s(σtα )dW (t ),
t ≥ 0,
where {Z (t )}t ≥0 is a α -stable Lévy process with index α ∈ (0, 2] independent of the Brownian motion {W (t )}t ≥0 . Using the scaling property of stable Lévy process, the same argument in Section 3.2 (1) shows that Proposition 2 still holds with the hidden chain {Yi =
(σtα+(i−1)∆ )t ∈[0,∆] } or {Yi(2) = (σiα∆ ,
i∆
details.
(i−1)∆
σtα dt )}. We omit the
Consider the nonlinear autoregressive conditional heteroscedastic model
θˆ = argmin θ
i=1
σ (Xi−1 ; θ )
estimate qˆ X (x′ |x; Qθˆ )
(35)
where εi are i.i.d. random variables. Special cases of (35) include linear AR Xi = aXi−1 + εi , ARCH (Engle, 1982) Xi = (a2 + b2 Xi2−1 )1/2 εi , TAR (Tong, 1990) Xi = a max(Xi , 0)+ b min(Xi , 0)+εi , and EAR (Haggan and Ozaki, 1981) Xi = [a + b exp(−cXi−1 )]Xi−1 + εi among others. Clearly, {Xi } is a Markov chain. We refer the reader to Bradley (2005) for discussions on mixing conditions for Markov chains. Let Q = (µ, σ ). Under H0 : Q = Qθ for a specification Qθ = (µ(·; θ ), σ (·; θ )), Zhao (2010) considered the following estimate
2 n − Xi − µ(Xi−1 ; θ )
Let {Yt }t ∈T be the true process. In practice, we often do not observe Yt directly but a contaminated version Xt of it. For example, if Yi is the actual stock price at the i-th sampling time point, then we may only observe Xi = Yi + εi with i.i.d. errors εi . Assume that the errors {εi } are independent of {Yi }. This framework has been proposed as models with market microstructure noise (Aït-Sahalia et al., 2005; Zhang et al., 2005). If {Yi } is a Markov chain, then {Xi } is a HMM with the unobservable chain {Yi }. To construct parametric density estimate, assume that {Yi } is a Markov chain with data-generating mechanism Q of interest. Further assume that the contamination errors εi are centered normal random variables with variance σ 2 . Denote by qX |Y the conditional density of Xi given Yi . Then qX |Y (x|y) =
σ −1 φ{(x − y)/σ }. Under H0 : Q = Qθ , let (σˆ , θˆ ) be a consistent estimate of (σ , θ ). By (25), we can obtain the parametric density
3.4. Nonlinear time series
Xi = µ(Xi−1 ) + σ (Xi−1 )εi ,
3.6. Models with market microstructure noise
+ 2 log σ (Xi−1 ; θ ) . (36)
m −1
=
m ∑
σˆ −1 φ{(x − Yi∗−1 )/σˆ }σˆ −1 φ{(x′ − Yi∗ )/σˆ }
i =1
m−1
m ∑
,
(38)
σˆ −1 φ{(x − Yi∗−1 )/σˆ }
i=1
where {Yi∗ } are simulated from estimated null model Qθˆ . 3.7. Extension to higher order Markov models The proposed transition density based test can be extended to deal with higher order Markov models. Consider Xi = µ(Xi−1 , . . . , Xi−p ) + σ (Xi−1 , . . . , Xi−p )εi ,
(39)
where µ and σ are p-dimensional functions. Let Q = (µ, σ ). Suppose that we wish to test H0 : Q = Qθ for a specification Qθ .
232
Z. Zhao / Journal of Econometrics 162 (2011) 225–239
3.7. Extension to higher order Markov models

The proposed transition density based test can be extended to deal with higher order Markov models. Consider

Xi = µ(Xi−1, . . . , Xi−p) + σ(Xi−1, . . . , Xi−p)εi, (39)

where µ and σ are p-dimensional functions. Let Q = (µ, σ). Suppose that we wish to test H0: Q = Qθ for a specification Qθ. For p ≥ 3, due to the ''curse of dimensionality'', it is practically infeasible to obtain a nonparametric estimate of the conditional density of Xi given Xi−1, . . . , Xi−p. Here we shall give a partial solution based on transition densities. In Section 2, our specification testing procedure is based on the one-step transition density qX(x′|x). Here, due to the pth order Markovian structure in (39), we need to consider the k-step transition densities, k = 1, 2, . . . , p, simultaneously. For k ∈ N, denote by qk(x′|x) the conditional density of Xi given Xi−k = x. Let qˆk(x′|x) and qˆk(x′|x; Qθˆ) be the nonparametric estimate and the parametric estimate under H0 of qk(x′|x), respectively. For an appropriate distance measure d(·,·) and weights ωk > 0, we may construct a test statistic of the form

Tn = Σ_{k=1}^p ωk d(qˆk(·|·), qˆk(·|·; Qθˆ))  or  Tn = max_{1≤k≤p} d(qˆk(·|·), qˆk(·|·; Qθˆ)). (40)

It is beyond the scope of the present work to explore this approach, and further research will be conducted in the future.
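Although (40) is deferred to future research, its mechanics are simple to sketch. The fragment below combines k-step discrepancies by a weighted sum or by a maximum; the sup-distance is a placeholder for a generic d(·,·), and the random arrays stand in for actual density estimates.

```python
import numpy as np

def multi_step_stat(q_hat, q_null, weights=None, use_max=False):
    """T_n of (40): q_hat and q_null hold k-step transition density
    estimates on a common grid, one row per step k = 1..p."""
    d = np.max(np.abs(q_hat - q_null), axis=1)   # d(q^_k, q^_k(.;Q)) per step k
    if use_max:
        return float(d.max())                    # T_n = max_k d_k
    w = np.ones(len(d)) if weights is None else np.asarray(weights)
    return float(w @ d)                          # T_n = sum_k omega_k d_k

rng = np.random.default_rng(1)
qh, qn = rng.random((3, 50)), rng.random((3, 50))   # hypothetical p = 3 example
print(multi_step_stat(qh, qn), multi_step_stat(qh, qn, use_max=True))
```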
4. Finite sample performance

4.1. Kernel function, bandwidth selection, and Xn

In our data analysis we use the 4th-order kernel K(u) = 2φ(u) − φ(u/√2)/√2, where φ(u) is the standard Gaussian kernel. For nonparametric problems, the choice of bandwidth is usually more important than that of the kernel function. For the bandwidth bn, we adopt the likelihood cross-validation bandwidth selection method. Denote by qˆ−i(x′|x; bn) the estimate in (11) based on all samples but the leave-one-out pair (Xi, Xi+1) and bandwidth bn. The likelihood cross-validation method then selects the optimal bandwidth

bn^opt = argmax_{bn} Σ_{i=0}^{n−1} log qˆ−i(Xi+1|Xi; bn). (41)

See Section 5.2.2 in Li and Racine (2007) for more discussion.

In Theorem 2, we need to select a set Xn of grid points. For a realization {Xi}, let ℓ0.15 and ℓ0.85 be the 15th and 85th percentiles, respectively. Let s be the sample standard deviation of the differences {Xi+1 − Xi}. Then the region X = {(x, x′): x ∈ [ℓ0.15, ℓ0.85], |x′ − x| ≤ s} contains a large proportion of the points (Xi, Xi+1), i = 0, . . . , n. Partition [ℓ0.15, ℓ0.85] into 10 intervals of equal length, and denote the grid points by xj = ℓ0.15 + j(ℓ0.85 − ℓ0.15)/10, j = 0, . . . , 10. For each xj, divide [xj − s, xj + s] into four intervals of equal length, with the five grid points xj ± s, xj ± 0.5s, xj. We then take Xn = {(xj, xj ± τs), τ = 1.0, 0.5, 0, j = 0, . . . , 10}.
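As a rough guide, the cross-validation rule (41) and the grid Xn can be sketched as follows. This is a direct O(n²) illustration, assuming the kernel estimator of (11) takes the ratio form below; it is not the code behind the reported results, and a bandwidth search would call loo_cv_score over a grid of candidate bn.

```python
import numpy as np

phi = lambda u: np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
K = lambda u: 2 * phi(u) - phi(u / np.sqrt(2)) / np.sqrt(2)   # 4th-order kernel

def loo_cv_score(x, b):
    """Likelihood cross-validation criterion (41) for the transition
    density estimator, leaving out one pair (X_i, X_{i+1}) at a time."""
    x0, x1 = x[:-1], x[1:]
    n = len(x0)
    score = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        w = K((x0[i] - x0[keep]) / b)             # kernel weights in x
        num = np.sum(w * K((x1[i] - x1[keep]) / b)) / b
        q = num / max(np.sum(w), 1e-12)           # \hat q_{-i}(X_{i+1}|X_i; b)
        score += np.log(max(q, 1e-300))
    return score

def grid_Xn(x):
    """Grid X_n of Section 4.1: 11 x-points between the 15th and 85th
    percentiles, paired with x' = x + tau*s, tau in {-1, -0.5, 0, 0.5, 1}."""
    lo, hi = np.percentile(x, [15, 85])
    s = np.std(np.diff(x), ddof=1)
    xs = np.linspace(lo, hi, 11)
    return [(xj, xj + t * s) for xj in xs for t in (-1.0, -0.5, 0.0, 0.5, 1.0)]
```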
4.2. Accuracies of the asymptotic null distribution

Compared to marginal density estimation, transition density estimation requires larger sample sizes. With the development of modern technology, data sets with sizes of the order of tens of thousands have become available. We simulate n = 10,000 daily observations with ∆ = 1/252 (one year has approximately 252 trading days) from the Ornstein–Uhlenbeck process (Vasicek, 1977):

dXt = β(ν − Xt)dt + σ dWt,  ν > 0, β > 0, σ > 0. (42)

We set θ = (β, ν, σ) = (0.2, 0.06, 0.013). For simplicity write Xi = Xi∆. By Euler's discretization scheme (29), the true transition density of (42) is

qX(x′|x; θ) = (σ√∆)^{−1} φ{[x′ − x − β(ν − x)∆]/(σ√∆)}.

We simulate 1000 realizations of size n from the null model (42). For each realization, we compute the test statistic

Tn = sup_{(x,x′)∈Xn} [bn√n/ϕK] |qˆX(x′|x) − qX(x′|x; θˆ)| / √(qˆX(x′|x)/πˆX(x)), (43)

where θˆ is the linear regression estimate of θ based on the approximation (29). Using the cross-validation method (41), we find that most realizations from (42) give the optimal bandwidth bn ≈ 0.0005. Since (41) is computationally expensive, instead of applying (41) for each realization, we set bn = 0.0005 for all realizations to limit computation; the same technique is also used in Sections 4.3–4.6. We compare the empirical quantiles of these 1000 realized Tn with the asymptotic quantiles derived from Theorem 2. In particular, the asymptotic (1 − α)-quantile is Bmn(zα), where Bk(z) and zα are defined as in (18) and (20), respectively. Table 1 presents the empirical and asymptotic quantiles of Tn for different values of 1 − α. We see that the asymptotic quantiles approximate the empirical quantiles reasonably well.

4.3. Power study for Markov diffusion process

Let λ ∈ [0, 1]. Consider the true underlying process

dXt = β(ν − Xt)dt + [(1 − λ)σ + λ · 0.07|Xt|^{0.7}]dWt,  t > 0. (44)

If λ = 0, then (44) reduces to the Vasicek (1977) model (42); if λ = 1, then it becomes a special example of the CKLS model (Chan et al., 1992):

dXt = β(ν − Xt)dt + σXt^γ dWt. (45)

For λ ∈ (0, 1), the volatility term (1 − λ)σ + λ · 0.07|Xt|^{0.7} is a weighted version of the volatilities in the two models (42) and (45). The latter two models are among the most widely used interest rate models. In (44), we use the same settings for ν, β, σ, n, ∆ as in Section 4.2. We use (44) as the true data generating process and (42) as the null hypothesis H0. To study the power of testing H0, we use the asymptotic quantile from Table 1 with significance level α = 0.05. For a fixed λ ∈ [0, 1], the power is the proportion of realizations among 1000 realizations from (44) that reject H0. The size and power for (42) against (44) are 0.034, 0.068, 0.165, 0.428, 0.709, 0.885, corresponding to λ = 0.0, 0.2, 0.4, 0.6, 0.8, 1.0.

4.4. Power study for jump–diffusion models

We now consider the power when testing (42) against the jump–diffusion model
dXt = β(ν − Xt−)dt + σ(Xt−)dWt + Jt− dNt−, (46)

where Nt is a Poisson process with intensity λ(Xt−), and Jt ∼ N(0, η²) is the independent jump size. As in Aït-Sahalia et al. (2009), consider the specification λ(x) = λ, σ(x) = ξ, such that

ξ² + λη² = σ²,  λη²/(ξ² + λη²) = τ/1.1,  η² = 2ξ²,

where σ = 0.013 as in (42) and τ ∈ [0, 1], with τ = 0 being the null model (42). For all cases, we use bn = 0.0005 to limit computations. The size and power for (42) against (46) are 0.036, 0.032, 0.096, 0.424, 0.896, 0.994, corresponding to τ = 0.0, 0.2, 0.4, 0.6, 0.8, 1.0.
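The power designs of Sections 4.3 and 4.4 rest on simulating the alternatives by the Euler scheme and recording rejection frequencies. A minimal sketch for the mixed-volatility model (44) is given below; test_stat is a placeholder for any function returning Tn of (43) for one simulated path, and the parameter values follow Section 4.2.

```python
import numpy as np

def simulate_mixture_diffusion(lam, n=10_000, dt=1 / 252, beta=0.2, nu=0.06,
                               sigma=0.013, x0=0.06, seed=0):
    """Euler scheme for the diffusion (44); lam = 0 gives the Vasicek
    null model (42), lam = 1 a CKLS-type alternative."""
    rng = np.random.default_rng(seed)
    x = np.empty(n + 1)
    x[0] = x0
    dw = rng.standard_normal(n) * np.sqrt(dt)
    for i in range(n):
        vol = (1 - lam) * sigma + lam * 0.07 * abs(x[i]) ** 0.7
        x[i + 1] = x[i] + beta * (nu - x[i]) * dt + vol * dw[i]
    return x

def empirical_power(test_stat, critical_value, lam, n_rep=1000):
    """Proportion of realizations rejecting H0, as in Sections 4.3-4.4."""
    rejections = sum(test_stat(simulate_mixture_diffusion(lam, seed=r))
                     > critical_value for r in range(n_rep))
    return rejections / n_rep
```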
Table 1
Comparison of empirical and asymptotic quantiles of Tn in (43).

1 − α        0.05   0.10   0.15   0.20   0.25   0.30   0.35   0.40   0.45   0.50
Empirical    1.86   1.98   2.06   2.13   2.21   2.26   2.31   2.36   2.41   2.46
Asymptotic   2.00   2.09   2.16   2.22   2.27   2.32   2.37   2.41   2.46   2.51

1 − α        0.55   0.60   0.65   0.70   0.75   0.80   0.85   0.90   0.95   0.99
Empirical    2.52   2.57   2.63   2.71   2.78   2.87   2.98   3.11   3.33   3.73
Asymptotic   2.57   2.63   2.68   2.75   2.82   2.91   3.03   3.18   3.43   4.01
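For completeness, here is a sketch of how the sup-statistic (43), used throughout Sections 4.2–4.6, can be evaluated on the grid Xn of Section 4.1. The constant ϕK is obtained numerically for the 4th-order kernel, and q_null is any parametric transition density under H0, for instance the Euler density of Section 4.2; this is an illustration under those assumptions, not the exact code used for Table 1.

```python
import numpy as np

phi = lambda u: np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
K = lambda u: 2 * phi(u) - phi(u / np.sqrt(2)) / np.sqrt(2)   # 4th-order kernel

u = np.linspace(-10, 10, 20001)
phi_K = np.sum(K(u) ** 2) * (u[1] - u[0])   # phi_K = integral of K(u)^2

def T_n(x, b, q_null, grid):
    """Sup-statistic (43) over the grid X_n; q_null(x, xp) is the
    parametric transition density under H0."""
    x0, x1 = x[:-1], x[1:]
    n = len(x0)
    t_max = 0.0
    for (xg, xpg) in grid:
        w = K((xg - x0) / b)
        pi_hat = np.sum(w) / (n * b)                         # \hat pi_X(x)
        p_hat = np.sum(w * K((xpg - x1) / b)) / (n * b ** 2) # \hat p_X(x, x')
        q_hat = max(p_hat / pi_hat, 1e-12)                   # \hat q_X(x'|x)
        t = (b * np.sqrt(n) / phi_K) * abs(q_hat - q_null(xg, xpg)) \
            / np.sqrt(q_hat / pi_hat)
        t_max = max(t_max, t)
    return t_max
```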
4.5. Power study for models with market microstructure noise

Consider the following model with market microstructure noise:

Xi = Yi + εi,  Yi = θ1[(1 − λ)Yi−1 + λ|Yi−1|] + ηi,  |θ1| < 1, λ ∈ [0, 1], (47)

where εi ∼ N(0, θ3²) and ηi ∼ N(0, θ2²) are independent noises. Based on the contaminated observations {Xi}, we wish to test the null hypothesis H0: Yi = θ1Yi−1 + ηi about the unobservable process {Yi}. In (47), the parameter λ measures the deviation from the null model, with λ = 0 being the null model and λ = 1 being the TAR model Yi = θ1|Yi−1| + ηi. Under H0, elementary calculations show that

E(Xi²) = θ2²/(1 − θ1²) + θ3²,  Cov(Xi−1, Xi) = θ1θ2²/(1 − θ1²),  Cov(Xi−2, Xi) = θ1²θ2²/(1 − θ1²).

So, we can use a moments based estimation method by replacing the theoretical expectations with their corresponding empirical versions. Using sample size n = 2000 and true parameters (θ1, θ2, θ3) = (0.6, 0.2, 0.2), we find that most realizations give the optimal bandwidth bn ≈ 0.12 for λ ∈ [0, 1]. As in Section 4.2, we use bn = 0.12 for all realizations to limit computation. The size and power are 0.027, 0.212, 0.675, 0.943, 0.991, 1.000, corresponding to λ = 0.0, 0.2, 0.4, 0.6, 0.8, 1.0.
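The three moment conditions above invert in closed form: θ1 = Cov(Xi−2, Xi)/Cov(Xi−1, Xi), θ2² = Cov(Xi−1, Xi)(1 − θ1²)/θ1 and θ3² = E(Xi²) − Cov(Xi−1, Xi)/θ1. A sketch with a simulation sanity check under H0 (parameter values as in the text):

```python
import numpy as np

def moment_estimates(x):
    """Method of moments for (theta1, theta2, theta3) in (47) under H0."""
    x = x - x.mean()
    m0 = np.mean(x * x)              # E(X_i^2)
    g1 = np.mean(x[1:] * x[:-1])     # Cov(X_{i-1}, X_i)
    g2 = np.mean(x[2:] * x[:-2])     # Cov(X_{i-2}, X_i)
    th1 = g2 / g1
    th2 = np.sqrt(g1 * (1 - th1 ** 2) / th1)
    th3 = np.sqrt(m0 - g1 / th1)     # theta2^2/(1-theta1^2) = g1/theta1
    return th1, th2, th3

rng = np.random.default_rng(0)
n, (t1, t2, t3) = 2000, (0.6, 0.2, 0.2)
y = np.zeros(n)
for i in range(1, n):
    y[i] = t1 * y[i - 1] + t2 * rng.standard_normal()
x = y + t3 * rng.standard_normal(n)
print(moment_estimates(x))           # roughly (0.6, 0.2, 0.2)
```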
4.6. Power study for stochastic volatility models

Let εi be i.i.d. standard normal random variables. Consider the stochastic volatility model

Xi = σi εi,

with {σi} being a stochastic process given by

σi² = 0.01 exp(vi)  and  vi = θ1 + θ2vi−1 + [1 − λ + λ(0.1 + vi−1²)^{1/2}]ηi, (48)

where ηi are i.i.d. centered normal random variables with variance θ3² > 0. In (48), for λ ≠ 0, {vi} satisfy an autoregressive conditional heteroscedastic model. As in Section 3.5, {Xi} form an HMM with respect to the unobservable volatilities {σi}. We test the simple linear autoregressive null hypothesis H0: vi = θ1 + θ2vi−1 + ηi. Under H0, it is easy to check that

log[E(Xi²)] = 2 log(0.1) + θ1/(1 − θ2) + θ3²/[2(1 − θ2²)],
log[E(Xi⁴)] = 4 log(0.1) + log(3) + 2θ1/(1 − θ2) + 2θ3²/(1 − θ2²),
log[E(Xi−1²Xi²)] = 4 log(0.1) + 2θ1/(1 − θ2) + θ3²/(1 − θ2).

So, we use a moments based method to estimate (θ1, θ2, θ3) by replacing the theoretical expectations with their corresponding empirical versions. Using sample size n = 20,000 and true parameters (θ1, θ2, θ3) = (0.3, 0.6, 0.4), we find that most realizations give the optimal bandwidth bn ≈ 0.02 for λ ∈ [0, 1]. As in Section 4.2, we use bn = 0.02 for all realizations to limit computation. The size and power are 0.074, 0.080, 0.124, 0.350, 0.778, 0.968, corresponding to λ = 0.0, 0.2, 0.4, 0.6, 0.8, 1.0.
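The three log-moment equations above also invert in closed form. Writing A, B, C for the left-hand sides after subtracting the constants 2 log(0.1), 4 log(0.1) + log(3) and 4 log(0.1) respectively, one has B − 2A = θ3²/(1 − θ2²) and C − 2A = θ2θ3²/(1 − θ2²), which identifies all three parameters; the sketch below is one way to code this.

```python
import numpy as np

def sv_moment_estimates(x):
    """Invert the three log-moment equations of Section 4.6 under H0."""
    A = np.log(np.mean(x ** 2)) - 2 * np.log(0.1)
    B = np.log(np.mean(x ** 4)) - 4 * np.log(0.1) - np.log(3.0)
    C = np.log(np.mean(x[:-1] ** 2 * x[1:] ** 2)) - 4 * np.log(0.1)
    s2 = B - 2 * A                    # sigma_v^2 = theta3^2 / (1 - theta2^2)
    th2 = (C - 2 * A) / s2
    th3 = np.sqrt(s2 * (1 - th2 ** 2))
    th1 = (A - s2 / 2) * (1 - th2)    # since A = theta1/(1-theta2) + s2/2
    return th1, th2, th3
```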
5. Proofs

Recall that {Xi} is an HMM with respect to the Markov chain {Yi}. Also recall that πX(x), pX(x, x′) and qX(x′|x) are the marginal density of Xi, the joint density of (Xi−1, Xi) and the transition density (conditional density) of Xi given Xi−1 = x, respectively, and qX|Y(x|y) is the conditional density of Xi given Yi = y. Throughout the proofs we let c1, c2, . . . , be constants that may vary from place to place.

5.1. A martingale decomposition argument

The key idea of our proofs is based on a martingale decomposition argument. First, we illustrate the basic idea. Let Fi = σ(Xj, Yj+1, j ≤ i) be the sigma field generated by Xj, Yj+1, j ≤ i. For i ∈ Z, define the projection operator PiZ = E(Z|Fi) − E(Z|Fi−1) for Z ∈ L¹. By the property of conditional expectation, it can be easily checked that

E(PiZ|Fi−1) = E{[E(Z|Fi) − E(Z|Fi−1)]|Fi−1} = E[E(Z|Fi)|Fi−1] − E(Z|Fi−1) = 0.

Thus, {Pi}i∈Z form a sequence of martingale difference operators with respect to the filtration {Fi}i∈Z, in the sense that they can transform a sequence of random variables into martingale differences through projection. See Wu (2005) for more discussion. Let g be any function satisfying g(X0, X1) ∈ L¹ and define

Sn = Σ_{i=1}^n [ωi − E(ωi)],  where ωi = g(Xi−1, Xi).

Suppose that we want to study the asymptotic behavior of Sn. By definition, it is easily seen that ωi − E(ωi) = Piωi + Pi−1ωi + [E(ωi|Fi−2) − E(ωi)]. Thus, we can write Sn as

Sn = Σ_{i=1}^n Piωi + Σ_{i=1}^n Pi−1ωi + Σ_{i=1}^n [E(ωi|Fi−2) − E(ωi)] := Mn + Nn + Rn. (49)

In the above expression, since {Pi}i∈Z are martingale difference operators with respect to the filtration {Fi}i∈Z, Mn is a martingale with respect to Fn and Nn is a martingale with respect to Fn−1, and standard tools for martingales are applicable. The relative order of magnitude of Mn and Nn depends on the dimensionality of the nonparametric inference problem involved. Generally speaking, Mn and Nn are of the same order of magnitude for univariate nonparametric problems, but Mn dominates Nn for bivariate or multivariate nonparametric problems. An intuitive explanation is that, due to the conditional expectation, Pi−1ωi = E(ωi|Fi−1) − E(ωi|Fi−2) is smoother than Piωi = ωi − E(ωi|Fi−1) and hence has smaller variance. Thanks to the highest amount of smoothing, Rn is often negligible under mild dependence conditions. We generally call (49) the martingale decomposition.
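The decomposition (49) is an exact algebraic identity, as the telescoping of the three terms shows. The toy computation below verifies it on a two-state Markov chain, using the simpler filtration Fi = σ(Xj, j ≤ i) for illustration only; all numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[0.9, 0.1], [0.3, 0.7]])    # transition matrix of a toy chain
g = np.array([[0.0, 1.0], [2.0, -1.0]])   # omega_i = g(X_{i-1}, X_i)
pi0 = np.array([0.75, 0.25])              # stationary distribution of P

n = 500
x = [0]
for _ in range(n + 1):
    x.append(rng.choice(2, p=P[x[-1]]))
x = np.array(x)

cond1 = (P * g).sum(axis=1)               # E(omega_i | F_{i-1}), function of X_{i-1}
cond2 = P @ cond1                         # E(omega_i | F_{i-2}), function of X_{i-2}
mu = pi0 @ cond1                          # E(omega_i) under stationarity

w = g[x[1:-1], x[2:]]                     # omega_i, i = 1, ..., n
Mn = np.sum(w - cond1[x[1:-1]])           # sum of P_i omega_i
Nn = np.sum(cond1[x[1:-1]] - cond2[x[:-2]])  # sum of P_{i-1} omega_i
Rn = np.sum(cond2[x[:-2]] - mu)           # remainder term of (49)
print(np.isclose(Mn + Nn + Rn, np.sum(w - mu)))  # identity holds exactly
print(Mn, Nn, Rn)   # in this univariate toy example Mn and Nn are comparable
```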
In Lemma 1 we establish a representation for Rn in (49). To this end we define

Wn(x, x′) = Σ_{i=1}^n {E[qX|Y(x|Yi−1)qX|Y(x′|Yi)|Yi−1] − E[qX|Y(x|Yi−1)qX|Y(x′|Yi)]}. (50)

It turns out that Rn is closely related to Wn(x, x′). In our subsequent proofs, we shall frequently use the following properties of conditional expectation.

(i) For a random variable X ∈ L¹ and any sigma-field F, we have E(X) = E[E(X|F)].
(ii) For a random variable X ∈ L¹ and any two sigma-fields F ⊂ G, we have E(X|F) = E[E(X|G)|F].

Lemma 1. Let g(·,·) be a bivariate measurable function such that g(X0, X1) ∈ L¹. Define

Rn = Σ_{i=1}^n {E[g(Xi−1, Xi)|Fi−2] − E[g(Xi−1, Xi)]}. (51)

Assume for simplicity that g(·,·) is a bounded function so that expectation and integral can be exchanged. Let Wn(x, x′) be as in (50). Then

Rn = ∫∫ g(x, x′)Wn(x, x′)dxdx′. (52)
Proof. Since {Xi} is an HMM with respect to {Yi}, given Fi−1, the conditional distribution of Xi depends only on Yi. We have

E[g(Xi−1, Xi)|Fi−1] = ∫ g(Xi−1, x′)qX|Y(x′|Yi)dx′. (53)

So, by the property of conditional expectation,

E[g(Xi−1, Xi)|Fi−2] = E{E[g(Xi−1, Xi)|Fi−1]|Fi−2} = ∫ E[g(Xi−1, x′)qX|Y(x′|Yi)|Fi−2]dx′. (54)

Notice that, by the HMM property,

E[g(Xi−1, x′)qX|Y(x′|Yi)|Fi−2, Yi] = qX|Y(x′|Yi)E[g(Xi−1, x′)|Yi−1].

Thus, by the property of conditional expectation and the Markovian property of {Yi},

E[g(Xi−1, x′)qX|Y(x′|Yi)|Fi−2] = E{E[g(Xi−1, x′)qX|Y(x′|Yi)|Fi−2, Yi]|Fi−2}
= E[g(Xi−1, x′)|Yi−1] × E[qX|Y(x′|Yi)|Yi−1]
= ∫ g(x, x′)qX|Y(x|Yi−1)dx × E[qX|Y(x′|Yi)|Yi−1]
= ∫ g(x, x′)E[qX|Y(x|Yi−1)qX|Y(x′|Yi)|Yi−1]dx.

Also, notice that

E[g(Xi−1, Xi)] = E{E[g(Xi−1, Xi)|Fi−2]} = ∫∫ g(x, x′)E[qX|Y(x|Yi−1)qX|Y(x′|Yi)]dxdx′, (55)

completing the proof.

Lemma 2 is needed to establish a uniform bound for Wn(x, x′) in Lemma 3.

Lemma 2. Let g(·,·) be a differentiable bivariate function. For fixed δ1, δ2 > 0 and a, b ∈ R, denote by Ia,b the set {(u, v) ∈ R²: u ∈ [a, a + δ1], v ∈ [b, b + δ2]}. Then there exists a constant C < ∞, depending only on δ1 and δ2, such that

sup_{(u,v)∈Ia,b} g²(u, v) ≤ C ∫_{Ia,b} [g²(u, v) + (∂g(u, v)/∂u)² + (∂g(u, v)/∂v)² + (∂²g(u, v)/∂u∂v)²] dudv.

Proof. For (u, v), (u0, v0) ∈ Ia,b, let ḡ(u, v, u0, v0) = g(u, v) − g(u0, v) − g(u, v0) + g(u0, v0). By the Cauchy–Schwarz inequality,

g²(u, v) ≤ 4[ḡ(u, v, u0, v0)² + g²(u0, v) + g²(u, v0) + g²(u0, v0)]. (56)

Notice that

|ḡ(u, v, u0, v0)| = |∫_{u0}^u ∫_{v0}^v (∂²g(u, v)/∂u∂v) dudv| ≤ ∫_{Ia,b} |∂²g(u, v)/∂u∂v| dudv,

uniformly over (u, v), (u0, v0) ∈ Ia,b. Again, by the Cauchy–Schwarz inequality,

sup_{(u,v)∈Ia,b} ḡ(u, v, u0, v0)² ≤ δ1δ2 ∫_{Ia,b} (∂²g(u, v)/∂u∂v)² dudv. (57)

By Lemma 4 in Wu (2003), we have

sup_{u∈[a,a+δ1]} g²(u, v0) ≤ 2δ1^{−1} ∫_a^{a+δ1} g²(u, v0)du + 2δ1 ∫_a^{a+δ1} (∂g(u, v0)/∂u)² du, (58)

sup_{v∈[b,b+δ2]} g²(u0, v) ≤ 2δ2^{−1} ∫_b^{b+δ2} g²(u0, v)dv + 2δ2 ∫_b^{b+δ2} (∂g(u0, v)/∂v)² dv. (59)

Inserting (57)–(59) into the right-hand side of (56) and then taking an integral on both sides of the resulting inequality over (u0, v0) ∈ Ia,b, we obtain the desired result.

Lemma 3. Recall Wn(x, x′) in (50). Assume that Conditions 1 and 3 hold. Then

sup_{(x,x′)∈X²} |Wn(x, x′)| = OP(√n).

Proof. By Lemma 2, it suffices to prove

sup_{(x,x′)∈X²} [‖Wn(x, x′)‖₂² + ‖∂Wn(x, x′)/∂x‖₂² + ‖∂Wn(x, x′)/∂x′‖₂² + ‖∂²Wn(x, x′)/∂x∂x′‖₂²] = O(n).

First, we consider sup_{(x,x′)∈X²} ‖Wn(x, x′)‖₂². Under Conditions 1 and 3, the summands in Wn(x, x′) are α-mixing and uniformly
bounded. By Theorem 2.20 in Fan and Yao (2003), ‖Wn(x, x′)‖₂² = O(n), uniformly over (x, x′) ∈ X². For the other terms, by the boundedness of |∂qX|Y(x|y)/∂x|, we can exchange the order of differentiation and expectation in Wn(x, x′) in view of the Dominated Convergence Theorem. Thus, they can be treated using the same argument as above.

5.2. Proof of Theorems 1 and 2

Recall qˆX(x′|x), pˆX(x, x′) and πˆX(x) in (11)–(13), respectively. Define

ξi(x, x′) = Kbn(x − Xi−1)Kbn(x′ − Xi), (60)

Sn(x, x′) = (1/(nbn²)) Σ_{i=1}^n {ξi(x, x′) − E[ξi(x, x′)]}, (61)

Un(x, x′) = bn^{−2} E[ξ1(x, x′)] − pX(x, x′), (62)

Vn(x, x′) = pX(x, x′)[1 − πˆX(x)/πX(x)]. (63)

Then we have

πˆX(x)[qˆX(x′|x) − qX(x′|x)] = Sn(x, x′) + Un(x, x′) + Vn(x, x′), (64)

where Sn(x, x′) is the stochastic part determining the asymptotic distribution of qˆX(x′|x), Un(x, x′) is the bias part due to pˆX(x, x′), and Vn(x, x′) is the bias part due to πˆX(x). We shall treat them separately.

For Un(x, x′), by Conditions 2 and 3 and the Taylor expansion of pX(x − ubn, x′ − vbn) around (x, x′), elementary calculations show that

bn^{−2} E[ξ1(x, x′)] = bn^{−2} ∫∫ K((x − x0)/bn) K((x′ − x1)/bn) pX(x0, x1)dx0dx1 = ∫∫ K(u)K(v)pX(x − ubn, x′ − vbn)dudv = pX(x, x′) + O(bn⁴),

uniformly over (x, x′) ∈ X. Thus, Un(x, x′) = O(bn⁴) uniformly over (x, x′) ∈ X.

Now we consider Sn(x, x′). By the martingale decomposition technique in (49), we have

Sn(x, x′) = Mn(x, x′) + Nn(x, x′) + Rn(x, x′), (65)

where

Mn(x, x′) = (1/(nbn²)) Σ_{i=1}^n Pi ξi(x, x′), (66)

Nn(x, x′) = (1/(nbn²)) Σ_{i=1}^n Pi−1 ξi(x, x′), (67)

Rn(x, x′) = (1/(nbn²)) Σ_{i=1}^n {E[ξi(x, x′)|Fi−2] − E[ξi(x, x′)]}. (68)

To establish a uniform bound for Nn(x, x′), we need the following Lemma 4.

Lemma 4. Let Conditions 2 and 3 hold. Define

Hn(x, x′) = Σ_{i=1}^n Pi−1 ζi(x, x′),  where ζi(x, x′) = Kbn(x − Xi−1)qX|Y(x′|Yi). (69)

Assume bn → 0 and supn log n/(nbn) < ∞. Then

sup_{(x,x′)∈X} |Hn(x, x′)| = OP(√(nbn log n)). (70)

Proof. We shall use a chain argument. The basic idea is to approximate Hn(x, x′) by discrete versions over finer grid points. For simplicity write the compact set X = [0, T] × [0, T] for some 0 < T < ∞. Let N = n², ωn = T/N, xj = jωn, 0 ≤ j ≤ N. Then x0, . . . , xN partition [0, T] into uniformly spaced intervals with equal length ωn. For any x, x′ ∈ [0, T], there exist j and k such that x ∈ [xj, xj+1) and x′ ∈ [xk, xk+1). Then we have

|Kbn(x − Xi−1) − Kbn(xj − Xi−1)| ≤ c1(x − xj)/bn ≤ c1ωn/bn, (71)

where c1 = sup|K′(u)|. Similarly,

|qX|Y(x′|Yi) − qX|Y(xk|Yi)| ≤ c2(x′ − xk) ≤ c2ωn, (72)

where c2 = sup_{x,y}|∂qX|Y(x|y)/∂x|. Thus, by the boundedness of K(u) and qX|Y(x|y), when bn → 0, there exists a constant c3 such that

|ζi(x, x′) − dijk| ≤ (c1ωn/bn + c2ωn) × [sup_u |K(u)| + sup_{x,y} qX|Y(x|y)] ≤ c3ωn/bn, (73)

where dijk is defined on the discrete grid point (xj, xk) as dijk = ζi(xj, xk) = Kbn(xj − Xi−1)qX|Y(xk|Yi). Notice that c3 does not depend on the choices of i, j, k. So, since

|Pi−1ζi(x, x′) − Pi−1dijk| ≤ |E{[ζi(x, x′) − dijk]|Fi−1}| + |E{[ζi(x, x′) − dijk]|Fi−2}| ≤ 2c3ωn/bn,

Hn(x, x′) can be uniformly bounded as

sup_{(x,x′)∈X} |Hn(x, x′)| ≤ max_{0≤j,k≤N} |Dn(j, k)| + 2c3nωn/bn, (74)

where Dn(j, k) = Σ_{i=1}^n Pi−1dijk. Clearly, in (74), nωn/bn = O[1/(nbn)] → 0 with ωn = O(1/n²).

Now we consider max_{0≤j,k≤N} |Dn(j, k)|. Notice that for fixed j and k, {Pi−1dijk}i∈Z form martingale differences with respect to the filtration {Fi−1}i∈Z. Elementary calculations show that there exists some constant c4 < ∞ such that

E[(Pi−1dijk)²|Fi−2] ≤ E(dijk²|Fi−2) ≤ c4bn,

uniformly over i, j, k. So, Σ_{i=1}^n E[(Pi−1dijk)²|Fi−2] ≤ c4nbn. Assume without loss of generality that sup_{u,x,y} |K(u)qX|Y(x|y)| ≤ 1. By Freedman's exponential inequality (Freedman, 1975) for bounded martingale differences, for any c > 0,

P(|Dn(j, k)| ≥ c√(nbn log n)) ≤ 2 exp[−c²nbn log n / (2c√(nbn log n) + 2c4nbn)] = 2 exp[−c² log n / (2c√(log n/(nbn)) + 2c4)] = O(n^{−λc}),
uniformly over j, k, where λc = c²/(2cc5 + 2c4) and c5 = supn √(log n/(nbn)) < ∞. Thus,

P(max_{0≤j,k≤N} |Dn(j, k)| ≥ c√(nbn log n)) ≤ Σ_{0≤j,k≤N} P(|Dn(j, k)| ≥ c√(nbn log n)) = O(N²n^{−λc}) = O[n^{−(λc−4)}].

Thus, by (74), the proof is completed by choosing a sufficiently large c so that λc > 4.

Lemmas 5 and 6 give uniform bounds for Nn(x, x′) and Rn(x, x′), respectively.

Lemma 5. Recall Nn(x, x′) in (67). Assume that the conditions in Lemma 4 hold. Then

sup_{(x,x′)∈X} |Nn(x, x′)| = OP[√(log n/(nbn))]. (75)

Proof. Applying the identity (53), we obtain

E[ξi(x, x′)|Fi−1] = Kbn(x − Xi−1) ∫ Kbn(x′ − z)qX|Y(z|Yi)dz = bn ∫ K(u)Kbn(x − Xi−1)qX|Y(x′ − ubn|Yi)du = bn ∫ K(u)ζi(x, x′ − ubn)du, (76)

where ζi(x, x′) is defined as in (69). Thus, we have the identity

E[ξi(x, x′)|Fi−2] = E{E[ξi(x, x′)|Fi−1]|Fi−2} = bn ∫ K(u)E[ζi(x, x′ − ubn)|Fi−2]du.

Combining the above two identities, we have Pi−1ξi(x, x′) = bn ∫ K(u)Pi−1ζi(x, x′ − ubn)du. Recall Hn(x, x′) in (69). Now we have

Nn(x, x′) = (1/(nbn)) ∫ K(u)Hn(x, x′ − ubn)du = OP[√(log n/(nbn))]

in view of Lemma 4.

Lemma 6. Recall Rn(x, x′) in (68). Assume that Conditions 1–3 hold. Then

sup_{(x,x′)∈X} |Rn(x, x′)| = OP(n^{−1/2}). (77)

Proof. Recall Wn(x, x′) in (50). Applying Lemma 1 with g(z, z′) = Kbn(x − z)Kbn(x′ − z′), we obtain

Rn(x, x′) = (nbn²)^{−1} ∫∫ K((x − z)/bn) K((x′ − z′)/bn) Wn(z, z′)dzdz′ = n^{−1} ∫∫ K(u)K(v)Wn(x − ubn, x′ − vbn)dudv, (78)

in view of the change-of-variables u = (x − z)/bn, v = (x′ − z′)/bn. Thus, by Lemma 3, the proof is completed.

Lemma 7. Recall πˆX(x) in (13). Assume that Conditions 1–3 hold. Then

sup_{x∈[−T,T]} |πˆX(x) − πX(x)| = OP[bn⁴ + √(log n/(nbn))]. (79)

Proof. Applying the martingale decomposition technique in Section 5.1, we write

πˆX(x) − πX(x) = Zn,1(x) + Zn,2(x) + Zn,3(x),

where

Zn,1(x) = (1/(nbn)) Σ_{i=1}^n Pi−1 Kbn(x − Xi−1),
Zn,2(x) = (1/(nbn)) Σ_{i=1}^n {E[Kbn(x − Xi−1)|Fi−2] − E Kbn(x − Xi−1)},
Zn,3(x) = bn^{−1} E Kbn(x − Xi−1) − πX(x).

We treat Zn,1(x), Zn,2(x) and Zn,3(x) separately. By the same argument as in Lemma 4, we can show that sup_{x∈[−T,T]} |Zn,1(x)| = OP[√(log n/(nbn))]. For Zn,2(x), notice that

E[Kbn(x − Xi−1)|Fi−2] = ∫ Kbn(x − z)qX|Y(z|Yi−1)dz = bn ∫ K(u)qX|Y(x − ubn|Yi−1)du.

Define

In(x) = Σ_{i=1}^n {qX|Y(x|Yi−1) − E[qX|Y(x|Yi−1)]}.

By a similar argument as in Lemma 3, it can be shown that sup_{x∈[−T,T]} |In(x)| = OP(√n). Thus

Zn,2(x) = (1/n) ∫ K(u)In(x − ubn)du = OP(n^{−1/2}),

uniformly over x ∈ [−T, T]. By Taylor's expansion and Condition 2, it is easily seen that Zn,3(x) = O(bn⁴) uniformly over x ∈ [−T, T]. Thus, the desired result follows.

The following Lemma 8 is needed to study the conditional variance in Proposition 3 and the quadratic characteristic of the multidimensional martingale in Theorem 3.

Lemma 8. Let ζi(x, x′) be as in (69). Define

In(x, x′, z, z′) = Σ_{i=1}^n ζi(x, x′)ζi(z, z′). (80)

Assume nbn → ∞. Then

sup_{(x,x′),(z,z′)∈X} ‖In(x, x′, z, z′)‖₂ = O(nbn).

Proof. Let di(x, x′, z, z′) = ζi(x, x′)ζi(z, z′) − E[ζi(x, x′)ζi(z, z′)|Fi−2]. Then {di(x, x′, z, z′)}i∈Z form martingale differences with respect to {Fi−1}i∈Z. By the orthogonality of martingale differences, it is easy to see that

‖Σ_{i=1}^n di(x, x′, z, z′)‖₂² = Σ_{i=1}^n ‖di(x, x′, z, z′)‖₂² ≤ Σ_{i=1}^n ‖ζi(x, x′)ζi(z, z′)‖₂² = O(nbn),
uniformly over x, x′, z, z′. Also, E[ζi(x, x′)ζi(z, z′)|Fi−2] = O(bn) uniformly over i, x, x′, z, z′. By the triangle inequality, the desired result then follows from

‖In(x, x′, z, z′)‖₂ ≤ ‖Σ_{i=1}^n di(x, x′, z, z′)‖₂ + ‖Σ_{i=1}^n E[ζi(x, x′)ζi(z, z′)|Fi−2]‖₂.

Lemma 9. Let Conditions 2 and 3 hold. Define

Jn(x, x′) = Σ_{i=1}^n {βi(x, x′) − E[βi(x, x′)]},  where βi(x, x′) = [Kbn(x − Xi−1)]² qX|Y(x′|Yi).

Assume bn → 0 and nbn → ∞. Then

sup_{(x,x′)∈X} ‖Jn(x, x′)‖₂ = O(√(nbn)).

Proof. Apply the martingale decomposition technique in Section 5.1 and write Jn(x, x′) as Jn(x, x′) = J̄n(x, x′) + J̃n(x, x′), where

J̄n(x, x′) = Σ_{i=1}^n Pi−1 βi(x, x′),  J̃n(x, x′) = Σ_{i=1}^n {E[βi(x, x′)|Fi−2] − E[βi(x, x′)]}.

Since {Pi−1βi(x, x′)}i∈Z form martingale differences with respect to {Fi−1}, by the orthogonality of martingale differences, we have

‖J̄n(x, x′)‖₂² = Σ_{i=1}^n ‖Pi−1βi(x, x′)‖₂² ≤ Σ_{i=1}^n ‖βi(x, x′)‖₂² = O(nbn),

uniformly over x, x′. By the same argument as in (54) and (55), we can write

J̃n(x, x′) = ∫ [Kbn(x − z)]² Wn(z, x′)dz = bn ∫ K²(u)Wn(x − ubn, x′)du,

where Wn(x, x′) is defined as in (50). By Lemma 3, ‖J̃n(x, x′)‖₂ = O(bn√n) uniformly over x, x′. The desired result then follows from the triangle inequality ‖Jn(x, x′)‖₂ ≤ ‖J̄n(x, x′)‖₂ + ‖J̃n(x, x′)‖₂ = O(√(nbn)) + O(bn√n) = O(√(nbn)).

Proposition 3 and Theorem 3 present CLT and maximal deviation results for Mn(x, x′) in (66).

Proposition 3. Recall Mn(x, x′) in (66). Assume that Conditions 1–3 hold. Further assume that bn → 0 and nbn² → ∞. Let (x, x′) ∈ X. Then as n → ∞,

M̃n(x, x′) := [bn√n/ϕK] Mn(x, x′)/√(pX(x, x′)) ⇒ N(0, 1). (81)

Moreover, for distinct points (xj, x′j), j = 1, . . . , k, in X, M̃n(xj, x′j), j = 1, . . . , k, are asymptotically independent.

Proof. We shall only show the convergence (81) for any fixed point (x, x′) ∈ X, since the asymptotic independence can be similarly treated by considering linear combinations of M̃n(xj, x′j), j = 1, . . . , k, via the Cramér–Wold device. By Section 5.1, M̃n(x, x′) is a martingale with respect to Fn, so it suffices to verify the convergence of the conditional variance and the Lindeberg condition in order to prove (81). The convergence of the conditional variance is verified in the proof of Theorem 3; see qrs in (85) with r = s. It remains to verify the Lindeberg condition.

Recall ξi(x, x′) in (60). Let γi(x, x′) be defined as in (83). Then M̃n(x, x′) = Σ_{i=1}^n γi(x, x′). Since (x, x′) is fixed, we suppress the dependence on (x, x′) and write ξi = ξi(x, x′) and γi = γi(x, x′). By the boundedness of qX|Y(x|y), it is easy to see that |E(ξi|Fi−1)| ≤ c1bn for some constant c1. For any c2 > 0, let c3 = c2[ϕK²pX(x, x′)]^{1/2}. Because bn → 0 and nbn² → ∞, c3√(nbn²) − c1bn ≥ c3√(nbn²)/2 for large enough n. Therefore,

Σ_{i=1}^n E(γi²1_{|γi|≥c2}) = [ϕK²pX(x, x′)bn²]^{−1} E{[ξ1 − E(ξ1|F0)]² 1_{|ξ1−E(ξ1|F0)|≥c3√(nbn²)}}
≤ [ϕK²pX(x, x′)bn²]^{−1} E{(ξ1² + c1²bn²) 1_{|ξ1|≥c3√(nbn²)/2}}
= O(bn^{−2}) E(ξ1² 1_{|ξ1|≥c3√(nbn²)/2}) + O(1) P(|ξ1| ≥ c3√(nbn²)/2).

Applying the inequality E(ξ1²1_{|ξ1|≥λ}) ≤ E(|ξ1|³)/λ = O(bn²/λ) = o(bn²) with λ = c3√(nbn²)/2 → ∞, we obtain O(bn^{−2})E(ξ1²1_{|ξ1|≥c3√(nbn²)/2}) → 0. By the boundedness of ξ1, P(|ξ1| ≥ c3√(nbn²)/2) → 0. Thus, we conclude that Σ_{i=1}^n E(γi²1_{|γi|≥c2}) → 0, completing the proof.

Theorem 3. Recall Mn(x, x′) in (66). Let mn, Bn(z) and Xn be as in Theorem 2. Assume that Conditions 1–3 hold. Further assume that

mn²(log mn)³[bn³ + (nbn²)^{−1}] → 0. (82)

Then for every z ∈ R,

lim_{n→∞} P( sup_{(x,x′)∈Xn} [bn√n/ϕK] |Mn(x, x′)|/√(pX(x, x′)) ≤ Bmn(z) ) = e^{−2e^{−z}}.

Proof. We follow the argument in Zhao and Wu (2008). Recall ξi(x, x′) = Kbn(x − Xi−1)Kbn(x′ − Xi) in (60). Let

γi(x, x′) = [nbn²ϕK²pX(x, x′)]^{−1/2} {ξi(x, x′) − E[ξi(x, x′)|Fi−1]}. (83)

For fixed k ∈ N and distinct integers 0 ≤ j1, j2, . . . , jk ≤ mn − 1, define the k-dimensional vector ζi = [γi(xj1, x′j1), . . . , γi(xjk, x′jk)]ᵀ and

Mn,k = Σ_{i=1}^n ζi = [M̃n(xj1, x′j1), . . . , M̃n(xjk, x′jk)]ᵀ.

Here ᵀ denotes the transpose and M̃n(x, x′) is defined as in (81). Then {ζi}i∈Z are k-dimensional vectors of martingale differences with respect to {Fi}i∈Z. Denote by Qn the k × k quadratic characteristic matrix of Mn,k. That is,

Qn = Σ_{i=1}^n E(ζiζiᵀ|Fi−1) := (qrs)_{1≤r,s≤k}. (84)
Let τrs = ϕK²[pX(xjr, x′jr)pX(xjs, x′js)]^{1/2}. Then we can write qrs as

qrs = Σ_{i=1}^n E[γi(xjr, x′jr)γi(xjs, x′js)|Fi−1] = qrs(1) − qrs(2), (85)

where

qrs(1) = (nbn²τrs)^{−1} Σ_{i=1}^n E[ξi(xjr, x′jr)ξi(xjs, x′js)|Fi−1],
qrs(2) = (nbn²τrs)^{−1} Σ_{i=1}^n E[ξi(xjr, x′jr)|Fi−1] E[ξi(xjs, x′js)|Fi−1].

By (76), for all r, s,

‖qrs(2)‖₂ = (nbn²τrs)^{−1} ‖bn² ∫∫ K(u)K(v) Σ_{i=1}^n ζi(xjr, x′jr − ubn)ζi(xjs, x′js − vbn) dudv‖₂ = O(bn),

in view of the Cauchy–Schwarz inequality and Lemma 8.

For qrs(1), we consider the two cases r ≠ s and r = s separately. For r ≠ s, since max{|xjr − xjs|, |x′jr − x′js|} ≥ 2ωbn and K has support [−ω, ω], we have ξi(xjr, x′jr)ξi(xjs, x′js) = 0 and qrs(1) = 0. For r = s, notice that

E[ξi²(xjr, x′jr)|Fi−1] = bn ∫ K²(u)βi(xjr, x′jr − ubn)du,

where βi(x, x′) is defined as in Lemma 9. Also, E[ξi²(xjr, x′jr)] = E{E[ξi²(xjr, x′jr)|Fi−1]} = bn ∫ K²(u)E[βi(xjr, x′jr − ubn)]du. Recall Jn(x, x′) in Lemma 9. Therefore, by Lemma 9,

‖qrr(1) − (nbn²τrr)^{−1} Σ_{i=1}^n E[ξi²(xjr, x′jr)]‖₂ = ‖(nbnτrr)^{−1} ∫ K²(u)Jn(xjr, x′jr − ubn)du‖₂ = O[(nbn)^{−1/2}].

It is easily seen that E[ξi²(xjr, x′jr)] = bn²τrr[1 + O(bn²)]. Thus, by the triangle inequality, ‖qrr(1) − 1‖₂ = O[(nbn)^{−1/2} + bn²]. In summary, by (85), we have ‖qrs − Irs‖_{3/2} ≤ ‖qrs − Irs‖₂ = O[(nbn)^{−1/2} + bn] uniformly over 1 ≤ r, s ≤ k. Here Irs is the (r, s)-element of the k × k identity matrix. It is easily seen that Σ_{i=1}^n E|γi(xjr, x′jr)|³ = O[(nbn²)^{−1/2}] uniformly over 1 ≤ r ≤ k. Consequently,

Σ_{i=1}^n E|γi(xjr, x′jr)|³ + E(|qrs − Irs|^{3/2}) = O(Ωn) uniformly,

where Ωn = (nbn²)^{−1/2} + bn^{3/2}. Under (82), it is easily shown that [1 + Bmn(z)]⁴ exp[Bmn²(z)/2] Ωn → 0 for every fixed z. For j = 1, . . . , mn, define the events

Aj = {|M̃n(xj, x′j)| > Bmn(z)},  Emn = ∪_{j=1}^{mn} Aj = { sup_{(x,x′)∈Xn} |M̃n(x, x′)| > Bmn(z) }.

Let N1, N2, . . . , be independent standard normals. By Theorem 1 (a multivariate version of it also holds) in Grama and Haeusler (2006),

P(∩_{r=1}^k Ajr) = P(∩_{r=1}^k {|Nr| > Bmn(z)}) [1 + o(1)] = [2e^{−z}/mn]^k [1 + o(1)]. (86)

Here the first equality agrees with the intuition that M̃n(xj1, x′j1), . . . , M̃n(xjk, x′jk) are asymptotically independent standard normals, since the limiting covariance matrix is the identity matrix; the second equality follows from P(N1 > x) = [1 + o(1)]φ(x)/x as x → ∞, where φ is the standard normal density.

Notice that Emn = ∪_{j=1}^{mn} Aj. Therefore, by (86) and the inclusion–exclusion inequality, we have, for any fixed k and large enough n,

P(Emn) ≥ Σ_{j=1}^{mn} P(Aj) − Σ_{j1<j2} P(Aj1 ∩ Aj2) + · · · − Σ_{j1<···<j2k} P(Aj1 ∩ · · · ∩ Aj2k)
= Σ_{r=1}^{2k} (−1)^{r−1} (mn choose r)[2e^{−z}/mn]^r [1 + o(1)] = −Σ_{r=1}^{2k} (−2e^{−z})^r/r! [1 + o(1)].

Thus, lim inf_{n→∞} P(Emn) ≥ −Σ_{r=1}^{2k} (−2e^{−z})^r/r!. Similarly, applying the inclusion–exclusion inequality with an odd number of terms in the expansion, we have lim sup_{n→∞} P(Emn) ≤ −Σ_{r=1}^{2k−1} (−2e^{−z})^r/r!. The result then follows by letting k → ∞.
Proof of Theorems 1 and 2. Theorems 1 and 2 follow from Slutsky's theorem in view of (64), (65), Lemmas 5–7, Proposition 3, and Theorem 3. We omit the details.

Acknowledgements

I am grateful to Editor Professor Ronald Gallant, an Associate Editor, three anonymous referees, and Professor Wei Biao Wu for their constructive comments. This work was partially supported by NIDA grant P50-DA10075. The content is solely the responsibility of the author and does not necessarily represent the official views of the NIDA or the NIH.

References

Aït-Sahalia, Y., 1996. Testing continuous-time models of the spot interest rate. Review of Financial Studies 9, 385–426.
Aït-Sahalia, Y., Fan, J., Jiang, J., 2010. Nonparametric tests of the Markov hypothesis in continuous-time models. The Annals of Statistics 38, 3129–3163.
Aït-Sahalia, Y., Fan, J., Peng, H., 2009. Nonparametric transition-based tests for jump–diffusions. Journal of the American Statistical Association 104, 1102–1116.
Aït-Sahalia, Y., Mykland, P.A., Zhang, L., 2005. How often to sample a continuous-time process in the presence of market microstructure noise. Review of Financial Studies 18, 351–416.
Andersen, T.G., Sørensen, B.E., 1996. GMM estimation of a stochastic volatility model: a Monte Carlo study. Journal of Business & Economic Statistics 14, 328–352.
Azzalini, A., Bowman, A., 1993. On the use of nonparametric regression for checking linear relationships. Journal of the Royal Statistical Society. Series B 55, 549–557.
Ball, C.A., Torous, W.N., 1999. The stochastic volatility of short-term interest rates: some international evidence. The Journal of Finance 54, 2339–2359.
Bickel, P.J., Ritov, Y., 1996. Inference in hidden Markov models I: local asymptotic normality in the stationary case. Bernoulli 2, 199–228.
Bickel, P.J., Rosenblatt, M., 1973. On some global measures of the deviations of density function estimates. The Annals of Statistics 1, 1071–1095.
Bosq, D., 1998. Nonparametric Statistics for Stochastic Processes: Estimation and Prediction. Springer, New York.
Bradley, R.C., 1986. Basic properties of strong mixing conditions. In: Eberlein, E., Taqqu, M.S. (Eds.), Dependence in Probability and Statistics. Birkhauser, Boston, pp. 165–192.
Bradley, R.C., 2005. Basic properties of strong mixing conditions. A survey and some open questions. Probability Surveys 2, 107–144.
Broto, C., Ruiz, E., 2004. Estimation methods for stochastic volatility models: a survey. Journal of Economic Surveys 18, 613–649.
Chan, K.C., Karolyi, A.G., Longstaff, F.A., Sanders, A.B., 1992. An empirical comparison of alternative models of the short-term interest rate. The Journal of Finance 47, 1209–1227.
Engle, R.F., 1982. Autoregressive conditional heteroscedasticity with estimates of the variance of UK inflation. Econometrica 50, 987–1008.
Eubank, R.L., Speckman, P.L., 1993. Confidence bands in nonparametric regression. Journal of the American Statistical Association 88, 1287–1301.
Fan, Y., Li, Q., 1996. Consistent model specification tests: omitted variables and semiparametric functional forms. Econometrica 64, 865–890.
Fan, J., Yao, Q., 2003. Nonlinear Time Series: Nonparametric and Parametric Methods. Springer, New York.
Fan, J., Zhang, W., 2000. Simultaneous confidence bands and hypothesis testing in varying-coefficient models. Scandinavian Journal of Statistics 27, 715–731.
Fan, J., Zhang, C., Zhang, J., 2001. Generalized likelihood ratio statistics and Wilks phenomenon. The Annals of Statistics 29, 153–193.
Freedman, D.A., 1975. On tail probabilities for martingales. The Annals of Probability 3, 100–118.
Gao, J., King, M., 2004. Adaptive testing in continuous-time diffusion models. Econometric Theory 20, 844–882.
Genon-Catalot, V., Jeantheau, T., Larédo, C., 2000. Stochastic volatility models as hidden Markov models and statistical applications. Bernoulli 6, 1051–1079.
Grama, I.G., Haeusler, E., 2006. An asymptotic expansion for probabilities of moderate deviations for multivariate martingales. Journal of Theoretical Probability 19, 1–44.
Haggan, V., Ozaki, T., 1981. Modelling nonlinear random vibrations using an amplitude dependent autoregressive time series model. Biometrika 68, 189–196.
Hansen, L.P., Scheinkman, J.A., 1995. Back to the future: generating moment implications for continuous time Markov processes. Econometrica 63, 767–804.
Härdle, W., Mammen, E., 1993. Comparing nonparametric versus parametric regression fits. The Annals of Statistics 21, 1926–1947.
Hong, Y., Li, H., 2005. Nonparametric specification testing for continuous-time models with applications to term structure of interest rates. Review of Financial Studies 18, 37–84.
Hong, Y., White, H., 1995. Consistent specification testing via nonparametric series regression. Econometrica 63, 1133–1159.
Hull, J., White, A., 1987. The pricing of options on assets with stochastic volatilities. The Journal of Finance 42, 281–300.
Jacquier, E., Polson, N.G., Rossi, P.E., 1994. Bayesian analysis of stochastic volatility models (with discussion). Journal of Business & Economic Statistics 12, 371–417.
Johnston, G.J., 1982. Probabilities of maximal deviations for nonparametric regression function estimates. Journal of Multivariate Analysis 12, 402–414.
Kim, S., Shephard, N., Chib, S., 1998. Stochastic volatility: likelihood inference and comparison with ARCH models. The Review of Economic Studies 65, 361–393.
Knafl, G., Sacks, J., Ylvisaker, D., 1985. Confidence bands for regression functions. Journal of the American Statistical Association 80, 683–691.
Li, Q., Racine, J., 2007. Nonparametric Econometrics. Princeton University Press, Princeton, New Jersey.
MacDonald, I.L., Zucchini, W., 1997. Hidden Markov and Other Models for Discrete-Valued Time Series. Chapman & Hall, London.
Ruiz, E., 1994. Quasi-maximum likelihood estimation of stochastic volatility models. Journal of Econometrics 63, 289–306.
Taylor, S.J., 1994. Modeling stochastic volatility: a review and comparative study. Mathematical Finance 4, 183–204.
Tong, H., 1990. Nonlinear Time Series Analysis: A Dynamical System Approach. Oxford University Press, Oxford.
Vasicek, O.A., 1977. An equilibrium characterization of the term structure. Journal of Financial Economics 5, 177–188.
Wu, W.B., 2003. Empirical processes of long-memory sequences. Bernoulli 9, 809–831.
Wu, W.B., 2005. Nonlinear system theory: another look at dependence. Proceedings of the National Academy of Sciences of the USA 102, 14150–14154.
Zhang, L., Mykland, P.A., Aït-Sahalia, Y., 2005. A tale of two time scales: determining integrated volatility with noisy high-frequency data. Journal of the American Statistical Association 100, 1394–1411.
Zhao, Z., 2008. Parametric and nonparametric models and methods in financial econometrics. Statistics Surveys 2, 1–42.
Zhao, Z., 2010. Density estimation for nonlinear parametric models with conditional heteroscedasticity. Journal of Econometrics 155, 71–82.
Zhao, Z., Wu, W.B., 2008. Confidence bands in nonparametric time series regression. The Annals of Statistics 36, 1854–1878.
Journal of Econometrics 162 (2011) 240–247
Estimation of fractional integration under temporal aggregation

Uwe Hassler
Goethe University Frankfurt, RuW, Grueneburgplatz 1, 60323 Frankfurt, Germany

Article history: Received 17 March 2010; received in revised form 22 January 2011; accepted 24 January 2011; available online 11 March 2011.
JEL classification: C14; C22; C82.
Keywords: Long memory; Difference stationarity; Cumulating time series; Skip sampling; Closedness of assumptions.

Abstract: A result characterizing the effect of temporal aggregation in the frequency domain is known for arbitrary stationary processes and is generalized to difference-stationary processes here. Temporal aggregation includes cumulation of flow variables as well as systematic (or skip) sampling of stock variables. Next, the aggregation result is applied to fractionally integrated processes. In particular, it is investigated whether typical frequency domain assumptions made for semiparametric estimation and inference are closed with respect to aggregation. With these findings it is spelled out which estimators remain valid upon aggregation, under which conditions on bandwidth selection.
1. Introduction

Determining inflation persistence is a prominent issue when it comes to forecasting (Stock and Watson, 2007), or when monetary policy recommendations are at stake; see e.g. Mishkin (2007). The effect of temporal aggregation on inflation persistence has recently been studied by Paya et al. (2007). Fractional integration is one model for inflation persistence that can be traced back to Hassler and Wolters (1995) or Baillie et al. (1996). The question of how aggregation and persistence interact is of interest beyond inflation, and has troubled applied economists for a long time; see Christiano et al. (1991) for empirical evidence in the context of the permanent income hypothesis and Rossana and Seater (1995) for a representative set of economic time series. Using fractionally integrated models, Chambers (1998) found with macroeconomic series that the empirical degree of integration may depend on the level of temporal aggregation; see also Diebold and Rudebusch (1989) or Tschernig (1995). In empirical finance, too, one of the core issues with respect to realized volatility is optimal sampling; see e.g. Ait-Sahalia et al. (2005) and the results by Drost and Nijman (1993). In this paper we understand by temporal aggregation both systematic sampling (or skip sampling) of stock variables where only
every pth data point is observed, and summation of flow variables where neighboring observations are cumulated to determine the total flow. Econometricians have devoted their attention to both types of temporal aggregation for decades; see Silvestrini and Veredas (2008) for a recent survey. Early results for autoregressive moving average (ARMA) models were obtained by Brewer (1973) and Weiss (1984). A treatment of integrated (of order one) ARIMA models was provided by Wei (1981) and Stram and Wei (1986), for skip sampling and cumulating, respectively. In particular, skip sampling can be embedded in the more general problem of missing observations; see Palm and Nijman (1984) for an investigation of dynamic regression models. Aspects of forecasting have been addressed by Lütkepohl (1987) and Lütkepohl (2009), while Marcellino (1999) deals with cointegration and causality under aggregation. Moreover, the potential interaction of seasonal integration and unit roots at frequency zero due to temporal aggregation was studied by Granger and Siklos (1995); see also Pons (2006). In fact, there is a literature on ‘‘span versus frequency’’ when it comes to testing the null hypothesis of a unit root, which started with Shiller and Perron (1985) and came to a preliminary end with Chambers (2004). Notwithstanding the vast amount of papers on temporal aggregation, little attention has been paid to effects in the frequency domain, notable exceptions being Drost (1994) and Souza (2003). In the frequency domain, temporal aggregation is accompanied by the so-called aliasing effect, which is well known under discretetime sampling from a continuous-time process; see e.g. Hansen
and Sargent (1983). For the special case of fractional integration, spectral results have been obtained by Chambers (1998), Hwang (2000), Tsai and Chan (2005b), and Souza (2005). Further, Chambers (1996) and Tsai and Chan (2005a) cover the related case of discrete-time sampling from a continuous-time long memory process, while Souza (2007, 2008) focusses on the effect of temporal aggregation on widely used memory estimators. We add two aspects to this literature: a general characterization of time aggregation in the frequency domain for processes that become stationary only after differencing r times for some natural number r, and an investigation of which semiparametric estimators of fractionally integrated models retain their consistency and limiting normality under aggregation.

In greater detail our contributions are the following. We draw from the literature results on aliasing and moving-averaging in case of temporal aggregation of arbitrary stationary processes (Lemmas 1 and 2), and we combine these lemmas to characterize the frequency domain effect of temporal aggregation for processes that become stationary only after integer differencing r times, r = 0, 1, 2, . . . (Proposition 1). Next, the aggregation results are applied to fractionally integrated processes. In particular, we investigate whether typical assumptions on fractionally integrated processes, which are made in the literature to obtain consistency or limiting normality of semiparametric estimators, are closed with respect to aggregation. In other words: if {zt} satisfies a set of assumptions used to prove properties of some estimator or test, does the temporal aggregate fulfill them, too? Differing findings are obtained for cumulating of flow data (Proposition 2), skip sampling of stocks (Proposition 3), and for the case of generalized fractional integration where the singularity may occur at frequencies different from zero (Proposition 4). In a couple of remarks we discuss, as consequences for applied work, which estimators remain valid upon aggregation (under which conditions on the bandwidth choice).

The rest of this paper is organized as follows. Section 2 treats the general aggregation effect in terms of spectral densities. In Section 3, the aggregation results are applied to the semiparametric estimation of the memory parameter of fractional integration. The last section contains a more detailed non-technical summary. Proofs are relegated to the Appendix.

2. Aggregation in the frequency domain

For sequences {aj} and {bj}, let aj ∼ bj denote aj/bj → 1 as j → ∞, while for functions, a(x) ∼ b(x) is short for a(x)/b(x) → 1 as x → 0. Further, a(x) = O(x^c) means that a(x)x^{−c} is bounded as x → 0, while a(x) = o(x^c) signifies a(x)x^{−c} → 0. First-order derivatives are given as a′(x). Finally, let Z stand for the set of all integers.

2.1. Notation and assumptions

Let {zt}, t = 1, 2, . . . , T, denote some time series to be aggregated over p periods. The aggregate is constructed for the new time scale τ. In case of stock variables, aggregation or systematic sampling means skip sampling where only every pth data point is observed,

z˙τ := zpτ,  τ = 1, 2, . . . , (1)
where for the rest of the paper p ≥ 2 is a finite integer. Flow variables are aggregated by cumulating p neighboring observations that do not overlap to determine the total flow over p sub-periods,

z̃τ := zpτ + zpτ−1 + · · · + zp(τ−1)+1 = Sp(L)zpτ,  τ = 1, 2, . . . , (2)

where Sp(L) := 1 + L + · · · + L^{p−1} is the moving average polynomial of degree p − 1 in the usual lag operator L. Hence, {z̃τ} is obtained by skip sampling the overlapping moving average process {Sp(L)zt}.¹

Clearly, many economic variables are not stationary. It is often assumed that the basic variable {zt} is given by integration over stationary increments,

zt = z0 + Σ_{i=1}^t yi,  t = 1, 2, . . . , T.

If {yt} is a stationary fractionally integrated process of order d, d < 0.5, as defined in a subsequent section, then the partial sum process {zt} is sometimes called fractionally integrated (of order δ = 1 + d) of ''type I''; see Marinucci and Robinson (1999) and Robinson (2005). Some economic variables are even considered as integrated of order 2. Therefore, we allow for stationarity and different degrees of nonstationarity at the same time. It is maintained for some natural number r ∈ {0, 1, 2, . . .} that the process {zt} solves the following difference equation with ∆ = 1 − L:

∆^r zt = yt,  t = 1, 2, . . . , T. (3)

Note that differencing changes the status of stock series: while log-prices pt = log Pt are stocks, the inflation rate πt = ∆pt is a flow variable. To fully specify the potentially nonstationary processes from (3), we have to add assumptions on {yt}. Our results will hold for any stationary process {yt} with integrable spectral density fy. Since fy is an even and 2π-periodic function, the definition of the spectral density can be extended to the whole real range, and we focus on the interval [0, π] in the following assumption.

Assumption 1. The process {yt}, t ∈ Z, is covariance stationary with integrable spectral density fy(λ) on Π, where Π = [0, π] if fy is well defined on the whole interval, or Π = [0, π] \ {λ∗} if fy has a singularity at some frequency λ∗ ∈ [0, π].

Note that fy does not have to exist everywhere. A singularity at λ∗ might come from (generalized) fractional integration with long memory; see (12) below. In fact, we might allow for k singularities (having e.g. so-called k-factor Gegenbauer processes in mind; see Woodward et al., 1998). Further, we stress that fy(0) = 0 is not excluded. This covers the particular case of over-differencing. Assume e.g. that no differencing is required to obtain stationarity, but {zt} is differenced in practice. This case is dealt with by r = 1 in (3) with the assumption that {yt} is over-differenced.

To set the scene for the next subsection, we define the lag operator L operating on the aggregate time scale τ, such that L = L^p with L operating on t (see e.g. Wei, 1990, Ch. 16). Let ∇ = 1 − L stand for the differences of the new time scale τ. In case that r ≥ 1 in (3), we will study the effect of first aggregating and then differencing. The spectral densities of the differenced aggregates {∇^r z˙τ} and {∇^r z̃τ} are denoted as f˙∇rz(λ) and f̃∇rz(λ), respectively. For r = 0, we have zt = yt, and f˙y(λ) or f̃y(λ) represent the spectra of the stationary aggregates {y˙τ} and {ỹτ}.

¹ Sometimes stock variables are aggregated by averaging over p non-overlapping observations, {z̄τ}, such that p sub-periods are replaced by the mean of p values. Obviously this is directly connected to cumulation from (2), z̄τ := z̃τ/p. Let the spectrum of the differenced aggregate {∇^r z̄τ} be denoted as f̄∇rz(λ). There is no need to address the case of averaging separately since it holds f̄∇rz(λ) = f̃∇rz(λ)/p².
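For illustration, the two aggregation schemes (1) and (2) in code; a toy example with z indexed from 1 as in the text:

```python
import numpy as np

def skip_sample(z, p):
    """Stock aggregation (1): observe every p-th point, z_dot_tau = z_{p*tau}."""
    return z[p - 1::p]            # zero-based: picks z_p, z_{2p}, ... of a 1-based series

def cumulate(z, p):
    """Flow aggregation (2): non-overlapping sums of p neighboring values."""
    m = len(z) // p
    return z[:m * p].reshape(m, p).sum(axis=1)

z = np.arange(1.0, 13.0)          # z_1, ..., z_12
print(skip_sample(z, 3))          # [ 3.  6.  9. 12.]
print(cumulate(z, 3))             # [ 6. 15. 24. 33.]
```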
2.2. Result and discussion

The main effect in the frequency domain is the so-called aliasing effect that arises from skip sampling. Since cumulation of non-overlapping data can be reduced to skip sampling a moving average, the effect will be present also with flow data. Therefore, we first pin down the aliasing effect. The following finding for
stationary processes is essentially due to Drost (1994, Lemma 2.1). We highlight his result as a lemma, since many authors seem to be not aware of it, see e.g. Chambers (1998), Hwang (2000), Souza (2005), and Tsai and Chan (2005b), although an equivalent representation can be found in Souza (2003, Theo. 1).

Lemma 1 (Aliasing). Let {zt} from (3) with r = 0 equal {yt} with Assumption 1, and assume that its spectral density fy is bounded at (λ + 2πj)/p, j = 1, . . . , (p − 1). It then holds for the spectral density of the skip sampled aggregate over p periods, {y˙τ}:

f˙y(λ) = (1/p) Σ_{j=0}^{p−1} fy((λ + 2πj)/p).

The summation over the frequencies (λ + 2πj)/p, j = 0, 1, . . . , p − 1, in Lemma 1 corresponds to the well-known aliasing effect that occurs when observing a continuous-time process at discrete points in time, see e.g. Hansen and Sargent (1983), or the discussion in Priestley (1981, p. 224, p. 506): cycles of frequency (λ + 2πj)/p in the basic data become cycles of frequency λ + 2πj upon skip sampling, and are hence indistinguishable from λ.

A second effect that will be present in case of cumulation on top of aliasing is the transfer function of the moving average filter Sp(L); see (2). This effect also shows up when considering differenced aggregates with ∇ = (1 − L^p) = Sp(L)(1 − L), and it is characterized in the following lemma. The required transfer function is given e.g. in Priestley (1981, p. 270), where Tj(λ) is proportional to the so-called Fejér kernel; see e.g. Priestley (1981, p. 401, p. 418) for a discussion.

Lemma 2 (Transfer Function of Sp(L)). The transfer function |Sp(e^{i·})|² evaluated at (λ + 2πj)/p for j = 0, . . . , p − 1 is equal to

Tj(λ) := sin²(λ/2) / sin²((λ + 2πj)/(2p)),  λ > 0,

where Tj(λ) is continuously differentiable with

T0(λ) = p² + O(λ²),  Tj(λ) = O(λ²), j = 1, . . . , p − 1,  T′j(λ) = O(λ), j = 0, . . . , p − 1,

as λ → 0.

Now, it is straightforward to prove the general result.

Proposition 1. Let {yt} be from Lemma 1, and let {∆^r zt} equal {yt}, r = 0, 1, 2, . . .. It then holds for the spectral densities of the differences of the aggregates of {zt}

(a) in case of skip sampling (∇^r z˙τ):

f˙∇rz(λ) = (1/p) Σ_{j=0}^{p−1} fy((λ + 2πj)/p) [Tj(λ)]^r,

(b) and in case of cumulating (∇^r z̃τ):

f̃∇rz(λ) = (1/p) Σ_{j=0}^{p−1} fy((λ + 2πj)/p) [Tj(λ)]^{r+1},

where Tj(λ), j = 0, 1, . . . , (p − 1), are from Lemma 2.

Proof. See Appendix.

It seems advisable to discuss the proposition with a couple of comments. First, the cumulated stationary aggregate, f̃∇0z(λ) = f̃y(λ), is subject to aliasing, too, simply because {ỹτ} is constructed from skip sampling a moving average. In this case, however, aliasing is superimposed by the factors Tj(λ) due to the moving average filter Sp(L). Consequently, at frequency zero the aliased frequencies are squelched out, and it holds in case of cumulation (λ → 0)

f̃y(λ) ∼ p fy(λ/p),  f̃y′(λ) ∼ fy′(λ/p). (4)

In particular, the slope of fy(λ) around frequency zero is inherited by f̃y. A similar effect shows up for spectra from differences, r ≥ 1.

Second, an immediate consequence of Proposition 1 is that differencing and temporal aggregation are not exchangeable without required modification. Below Eq. (3), we noted that differencing stock variables yields flow data. Consequently, for r = 1, when comparing the spectral densities of the differenced aggregates (∇z) with the aggregates of the stationary differences (∆z), we find that differencing skip sampled stock data has the same effect as cumulating the differences (flows):

f˙∇z(λ) = f̃∆z(λ) ≠ f˙∆z(λ),  and  f̃∇z(λ) ≠ f̃∆z(λ). (5)

Third, Proposition 1 contains a unifying framework for several familiar results. The result (a) for r = 0 of course reproduces the original Lemma 1. The result (b) for r = 0 is from Drost (1994, Lemma 2.2), while an equivalent representation can be found again in Souza (2003). For the special case of fractionally integrated ARMA processes, Tsai and Chan (2005b, Theo. 1(a)) provide equivalent results under cumulation (Proposition 1(b)). Notice that they have to spend more than two pages of technically involved derivations to establish their special case, while our more general result follows in a very straightforward manner from Lemmas 1 and 2.

Proposition 1 will enable us to investigate systematically which properties of the basic process are inherited by the aggregates. Such properties are called closed in the following sense.

Definition 1. A set of assumptions on some process {zt} is called closed with respect to temporal aggregation (skip sampling or cumulating), if {z˙τ} or {z̃τ}, respectively, satisfy the same set of assumptions for any finite positive integer p ≥ 2, too.

For practical purposes, procedures with properties established under assumptions that are closed with respect to aggregation are desirable, because in most practical situations a ''true'' frequency of the DGP is not known or does not exist. Most economic and financial time series have to be considered as aggregates. And a statistical procedure relying on a set of assumptions A cannot be safely applied to an aggregate, unless A is closed with respect to temporal aggregation. With Proposition 1 at hand we will now discuss closedness and lack thereof of certain general assumptions about fractionally integrated processes.
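Lemma 1 is easy to verify numerically. The sketch below checks the aliasing formula for a hypothetical AR(1) process: the skip-sampled series has autocovariances γ(ph), whose spectrum is compared with the right-hand side of the lemma.

```python
import numpy as np

a, sig2, p = 0.8, 1.0, 3          # hypothetical AR(1) and aggregation level
fy = lambda lam: sig2 / (2 * np.pi * (1 - 2 * a * np.cos(lam) + a ** 2))

lam = 0.7                          # any frequency in (0, pi)
# spectrum of the skip-sampled series from its autocovariances
# gamma(ph) = sig2 * a^{p|h|} / (1 - a^2)
gamma = lambda h: sig2 * a ** abs(h) / (1 - a ** 2)
lhs = (gamma(0) + 2 * sum(gamma(p * h) * np.cos(lam * h)
                          for h in range(1, 400))) / (2 * np.pi)
# the aliasing formula of Lemma 1
rhs = np.mean([fy((lam + 2 * np.pi * j) / p) for j in range(p)])
print(lhs, rhs)                    # the two values agree to high accuracy
```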
3. Fractional integration

3.1. Assumptions

Let us consider the fractionally integrated process {yt} constructed from the filter (1 − L)^{−d} with the usual expansion,

yt = (1 − L)^{−d} et,  with |d| < 0.5,

where the short memory component {et} is a stationary process with spectral density fe. For {yt} it holds fy(λ) = |1 − e^{iλ}|^{−2d} fe(λ). Equivalently (because |1 − e^{iλ}|^{−2d} = λ^{−2d}(1 + o(1))), fractional integration is characterized through the assumption

fy(λ) = λ^{−2d} fe(λ),  |d| < 0.5. (6)

Papers on semiparametric inference of long memory typically assume that the observed process has a spectral density like in (6) where the short memory component fe is characterized by assumptions A as weak as possible. We consider typical spectral assumptions next.
Assumption 2. Let A be a set of assumptions for fy(λ) = λ^{−2d} fe(λ), |d| < 0.5, including

(A0) fe is bounded and bounded away from zero at frequency λ = 0;
(A1) for some β ∈ (0, 2] it holds fe(λ) = fe(0) + O(λ^β), λ → 0;
(A2) fe has a finite first derivative fe′ in a neighborhood (0, ϵ) of zero, and fe′(λ) = O(λ^{−1}), λ → 0;
(A3) fe has a finite first derivative fe′ at λ = 0.

The first assumption (A0) that fe(0) is bounded and positive is minimal and common to all papers in order to identify d from (6). Next, assumption (A1) imposes a rate of convergence on (6) characterizing the smoothness of the short memory component fe around zero. If {et} is ARMA, then β = 2. With m denoting the bandwidth of semiparametric estimators and T standing for the sample size, the parameter β controls the rate at which the bandwidth has to grow through the following condition:

1/m + m^{1+2β}(log m)²/T^{2β} → 0, (7)

implying m = o(T^{2β/(1+2β)}). Assumption (A1) is widely used to establish not only consistency, but also limiting normality of semiparametric memory estimators, see e.g. Robinson (1995a, Ass. 1′), Robinson (1995b, Ass. 1), Velasco (1999a, Ass. 2), Velasco (1999b, Ass. 1), Shimotsu and Phillips (2005, Ass. 1′), and Shimotsu (2010, Ass. 1′).² While this assumption implies that fe is continuous on (0, ϵ), some results require that the derivative fe′ exists in a neighborhood of the origin, even if it may diverge at an appropriate rate as getting close to zero; see Assumption (A2). Although put slightly differently, such an assumption is found again in Robinson (1995a, Ass. 2), Velasco (1999a, Ass. 3), and Shimotsu and Phillips (2005, Ass. 2) or Shimotsu (2010, Ass. 2) when establishing consistency of the local Whittle (LW) estimator and the so-called exact LW estimator, respectively.³ A related but slightly weaker condition is employed in Robinson (1994, Ass. 4) and Lobato and Robinson (1996, (C2)) to determine optimal spectral bandwidth rates and limiting properties of the averaged periodogram estimator, respectively. Other papers assume a stronger degree of smoothness of fe at frequency zero in that they demand the first derivative fe′(0) to be finite (or even zero), which is our assumption (A3). Hurvich et al. (1998) for instance assume fe′(0) = 0 when deriving the asymptotic mean squared error and limiting distribution of the log-periodogram regression (LPR) by Geweke and Porter-Hudak (1983), while Andrews and Guggenberger (2003) discuss properties of a bias-reduced version under a smoothness assumption requiring fe′(0) to exist; see also Guggenberger and Sun (2006). Under similar assumptions Andrews and Sun (2004) improved on the LW estimator.

Since the following results are obtained under temporal aggregation, we need spectral assumptions for λ > 0 due to the aliasing effect. We require that the spectral density is ''well behaved'' at multiples of the so-called Nyquist frequency 2π/p; see Proposition 1. The usual long memory literature not addressing the aggregation issue does not need Assumption 3. Souza (2007, Cond. 3 and 9), however, when addressing memory estimation under cumulation, formulates very similar assumptions.

Assumption 3. The process {yt} from Assumption 1 has a spectral density fy(λ), which at frequencies 2πj/p, j = 1, . . . , (p − 1), is bounded, bounded away from zero and continuously differentiable with derivative fy′.

² Allowing for tapered data, a slightly stronger, parametric version of assumption (A1) is required, fe(λ) = b0 + b1λ^β + o(λ^β), see e.g. Velasco (1999a, Ass. 8), Velasco (1999b, Ass. 2), Hurvich and Chen (2000, Ass. 1), and also Abadir et al. (2007, eq. (2.23)).
³ See also the assumption |fe′(λ)| ≤ cλ^{−1} for λ > 0 in Moulines and Soulier (1999, Ass. 2), and similar although slightly weaker in Soulier (2001, Ass. 1).
It has been documented empirically that cumulation of flow variables will affect memory estimation in finite samples; see e.g. Diebold and Rudebusch (1989), Tschernig (1995), and Chambers (1998). Experimentally, a finite sample bias due to cumulation has been reported by Teles et al. (1999) and Souza (2007). In this subsection, we address the asymptotic properties of some well-known semiparametric memory estimators for finite p; the effect of increasing aggregation level (p → ∞) on cumulation has been investigated by Man and Tiao (2006) in the time domain and by Tsai and Chan (2005b) with spectral methods. Let us briefly discuss the cumulation of stationary flow variables, zt = yt . From (4) it is obvious that a zero or just as well a singularity of fy at frequency zero is inherited by fy , and the spectral slope of {yt } at frequency zero is carried over to the aggregate { yτ }, or in other words: assumptions about the spectral slope of stationary processes at frequency zero are closed with respect to cumulating. This confirms the finding by Chambers (1998), Hwang (2000), and Souza (2005) that the order of fractional integration at the origin is maintained under cumulated aggregation of flow variables. More formally, it holds the following result for the stationary and nonstationary case at the same time; the result for r = 0 was obtained as part of the proof in Souza (2007, p. 721). Proposition 2. Let {1r zt } with r = 0, 1, . . . equal {yt } with spectral density as in (6) satisfying Assumptions 1 and 3. It then holds for the spectral density of the differences ∇ r of { zτ }
f∇ r z (λ) = λ−2d ϕr (λ) with
ϕr (λ) = fe
λ p
(p2d+2r +1 + O(λ2 )) + λ2d Rr (λ),
where Rr (λ) is differentiable in a neighborhood of λ = 0 with
Rr (λ) = O(λ2r +2 ) and R′r (λ) = O(λ2r +1 ), Proof. See Appendix.
λ → 0.
We want to spell out explicitly the closedness of the conditions from Assumption 2. From Proposition 2 it follows for |d| < 0.5: under (A0). ϕr (0) = p2d+2r +1 fe (0);
under (A1). ϕr (λ) = p2d+2r +1 (fe (0) + O(λmin(β,2d+2r +2) )); under (A2). ϕr′ (λ) = O(λ−1 );
under (A3). ϕr′ (0) = p2d+2r fe′ (0).
For r ≥ 1, the smoothness parameter β from (A1) of fe carries over to ϕr , and this holds true for r = 0 with d ≥ 0, too. For r = 0 with d < 0, Assumption (A1) is still closed in that there exists a new smoothness parameter min(β, 2d + 2) ∈ (0, 2]. Note that the parametric version of (A1) given in footnote 2 is closed as well. We want to discuss consequences with respect to statistical inference in two remarks.
244
U. Hassler / Journal of Econometrics 162 (2011) 240–247
Remark A. For r = 0, Souza (2007) proved that the LW and the LPR estimators retain the limiting normal distribution under cumulation of stationary series. To that end he showed Proposition 1 for r = 0 and established the closedness of some further sufficient conditions ({et } is a linear sequence with certain moment and regularity conditions). In addition, we want to highlight Assumption (A1) for the stationary case:
ϕ0 (λ) = p2d+1 fe (0) + O(λmin(β,2d+2) ),
λ → 0.
Hence, for d < 0 it may happen that min(β, 2d + 2) < β , implying a slower rate for the bandwidth according to (7) after cumulation: m = o(T (4d+4)/(5+4d) ). Remark B. Velasco (1999a, Theo. 3) and Velasco (1999b, Theo. 3) prove the limiting normal distribution of the LW and the LPR estimators, respectively, when applied to nonstationary levels integrated of order 0.5 < δ < 0.75. More generally, Abadir et al. (2007, Coro. 2.1) showed that the so-called fully extended LW has a limiting normal distribution when applied to nonstationary levels integrated of any order δ > 0.5. In all three papers the main assumption is (A1), which turns out to be closed with respect to cumulation of difference-stationary series (r ≥ 1). Further assumptions they require (again, {et } is a linear sequence with certain moment and regularity conditions) have been established in Souza (2007); see Remark A. Hence, the asymptotic results by Velasco (1999a,b) or Abadir et al. (2007) for nonstationary series remain valid after cumulation. 3.3. Skip sampling Souza and Smith (2002) provide bias approximations for some semiparametric estimators that are well supported experimentally. Considerable finite sample biases are found due to skip sampling. Here, we add asymptotic insights by discussing closedness and lack thereof of Assumption 2 under skip sampling. We start with nonstationary processes because we know from Proposition 1 that f˙∇ r z = f∇ r −1 z . Consequently, the results for skip sampling under r ≥ 1 are contained in Proposition 2 already! Therefore, Remark B carries over to skip sampling as follows. Remark C. The limiting normality established in Velasco (1999a, Theo. 3), Velasco (1999b, Theo. 3), and Abadir et al. (2007, Coro. 2.1) continues to hold when applied to skip sampled nonstationary levels integrated of order δ as in Remark B. Now, we turn to the stationary case, r = 0. Before showing a further proposition, we recollect some findings with respect to Assumption (A0) from the literature. Let us consider a stationary process {yt } with fy (0) = 0.
∑p−1
Proposition 1(a) yields f˙y (0) = p−1 j=0 fy (2π j/p). Hence, the assumption fy (0) = 0 is not closed with respect to skip sampling except for the unlikely case where fy (2π j/p) = 0 for j = 1, . . . , p − 1. This has first been observed by Drost (1994, p. 16), and it corrects differing claims made in Chambers (1998) and Hwang (2000), see also the elucidating discussion by Souza (2005): Integration of order d in the sense of (6) is not closed under skip sampling for d < 0. This is a puzzling result at first glance, since fractional processes are known to be self-similar in that stretching the time scale leaves distributional properties unchanged upon rescaling the process; see e.g. Mandelbrot and van Ness (1968). In fact, for ARFIMA processes it holds for |d| < 0.5 that E (˙yτ y˙ τ +h ) = E (yt yt +ph ) ∼ C (ph)2d−1 ,
h→∞
for some constant C . Hence, the hyperbolic decay of the autocovariance is inherited by the skip sampled process irrespective of the sign of d, while the power law in (6) is lost for d < 0. However,
this lack of closedness is of little practical concern. Note that negative orders of integration typically arise only after differencing, and differencing a stock variable results in a flow series, which should be aggregated by cumulating, not by skip sampling. Next, we provide a formal discussion of the effect of skip sampling stationary stock variables. Proposition 3. Let {yt } be I (d) with spectral density as in (6) satisfying Assumptions 1 and 3. It then holds for the spectral density of the skip sampled process f˙y (λ) = λ−2d ϕ˙ y (λ) with ϕ˙ y (λ) = p2d−1 fe
λ p
+ λ2d R˙ y (λ),
(8)
where R˙ y (λ) = ϕ1 + O(λ), 0 < ϕ1 < ∞, and R˙ ′y (λ) = O(1) as λ → 0. Proof. See Appendix.
Remark D. The above discussion illustrates that the case d < 0 may be ignored when talking about skip sampling. We now assume d ≥ 0. From Proposition 3 it follows under (A1) with ϕ0 = p2d−1 fe (0):
ϕ˙ y (λ) = ϕ0 + O(λmin(β,2d) ),
d > 0.
Hence, Assumption (A1) is closed with α = min(β, 2d) only as long as d ≥ 0.4 Similarly, Assumption (A2) implies ϕ˙ y′ (λ) = O(λ−1 ) as long as d ≥ 0. Therefore, conditions by Robinson (1995a) used to prove consistency and limiting normality of the local Whittle estimator continue to hold after systematic sampling for d ≥ 0. However, the order of integration d may affect the required rate of divergence of the bandwidth m (see (7)): m = o(T 2α/(1+2α) ),
α = min(β, 2d).
(9)
For values of d close to zero with α = 2d, this implies a very slow divergence of m, and hence a very slow convergence of some semiparametric estimator d to the limiting distribution since the variance of d is proportional to 1/m. Remark E. Note that Assumption (A3) is never closed with respect to skip sampling. The aggregated spectral density in (8) displays an unbounded derivative at the origin for all d < 0.5:
ϕ˙ y′ (λ) = p2d−2 fe′
λ p
+ O(λ2d−1 ).
This means that sufficient conditions for consistency or limiting normality of the log-periodogram regression made by Hurvich et al. (1998) or Andrews and Guggenberger (2003) do not hold upon systematic sampling, which sheds some doubt on the use of the LPR in applied work. Notice, however, there is a trimmed version of the LPR by Robinson (1995b), where trimming means that the first ℓ harmonic frequencies are omitted from the regression. Robinson (1995b) assumes Assumptions (A1) and (A2), which are closed under skip sampling for d ≥ 0. To ensure limiting normality of the trimmed LPR, Robinson (1995b, Ass. 6) requires with α from Remark D
4 Strictly speaking, the case d = 0 requires separate consideration with
ϕ˙ y (λ) = p−1 fe
λ p
+ R˙ y (λ) = ϕ0 + ϕ1 + O(λβ ) + O(λ) = ϕ0 + ϕ1 + O(λmin(β,1) ).
U. Hassler / Journal of Econometrics 162 (2011) 240–247
m
1/2
log m
ℓ
+
ℓ(log T ) m
1+1/2α
2
+
m
T
Proposition 4. Let {yt } be I (d) with spectral density as in (12) satisfying Assumptions 1 and 3. It then holds for the spectral density
→ 0,
which obviously implies (9). While m has again to diverge very slowly for small values of d, the trimming parameter ℓ has to √ diverge faster than m, which makes appropriate choices of ℓ and m a delicate matter in practice. To shed further light on the effect of skip sampling it is elucidating to relate to a different strand of the literature. Let {xt } be a fractionally integrated process {yt } perturbed by some I (0) process {ut }, xt = yt + ut ,
(10)
where we assume that {ut } is independent of the unobservable process {yt }. Given {yt } is fractionally integrated with (6) it holds in the frequency domain fx (λ) = λ−2d fe (λ) + fu (λ) = λ−2d ϕ(λ) where the short memory component of the observable {xt } becomes
with c0 = fe (0) and c1 = fu (0). For 0 < d, the perturbed process {xt } is fractionally integrated of order d where the short memory component ϕ(λ) behaves like in case of skip sampling, cf. (8): skip sampling has in the frequency domain the same effect on long memory as adding noise. Therefore, methods tailored to the estimation of d from {xt } in (10) are candidates for the estimation of d from skip sampled long memory series. For that reason, a short and informal review of related work is provided to close down this subsection. Most papers dealing with perturbed fractional integration (also called ‘‘long memory plus noise’’) are related to the so-called long memory stochastic volatility model (LMSV) introduced by Breidt et al. (1998) or the FIEGARCH model by Bollerslev and Mikkelsen (1996). Such volatility models assume for return processes {rt } that (11)
where the perturbation term {εt } is white noise. Sun and Phillips (2003) considered the more general model (10) under Gaussianity. They proposed an improved nonlinear version of the LPR estimator that accounts explicitly for the effect of perturbation. The bandwidth m has to obey m = o(T 8d/(8d+1) ), which is less stringent than our condition (9) only if min(β, 2d) < 4d. Hurvich and Ray (2003) proposed a modification of the LW estimator adjusting explicitly for the noise effect of model (11); further refinements are provided by Hurvich et al. (2005) in that correlation between yt and εt is allowed for. Finally, it should be noted that the so-called broadband log-periodogram regression by Moulines and Soulier (1999) remains valid for a Gaussian LMSV model; see Iouditsky et al. (1999). 3.4. General fractional integration We now briefly touch the case where a singularity may occur at a frequency λ∗ different from zero: fy (λ) = |λ∗ − λ|−2d fe (λ),
|d| < 0.5, λ∗ ∈ [0, π].
(a) of the skip sampled process f˙y (λ) = |pλ∗ − λ|−2d ϕ˙ y∗ (λ),
ϕ˙ y (λ) = p ∗
2d−1
fe
λ p
+ |pλ∗ − λ|2d R˙ y (λ),
where R˙ y is from Proposition 3; (b) of the cumulated process
fy (λ) = |pλ∗ − λ|−2d ϕy∗ (λ), λ ϕy∗ (λ) = p2d−1 fe T0 (λ) + |pλ∗ − λ|2d R0 (λ), p
where R0 is from Proposition 2, and T0 (λ) from Lemma 2. Proof. Follows the lines of the proofs of Propositions 2 and 3 and is therefore omitted. The final remark collects three comments.
ϕ(λ) = fe (λ) + fu (λ)λ2d ∼ c0 + c1 λ2d , λ → 0,
log rt2 = µ + yt + εt ,
245
(12)
A parametric model for such a spectral behavior has been proposed by Gray et al. (1989), while the more recent literature focusses on a semiparametric approach only assuming that fe (λ) is bounded, bounded away from zero and twice continuously differentiable on (0, π); see for instance Giraitis et al. (2001, Ass. A.1, A.1′ ), Hidalgo (2005, Cond. C .1), and Dalla and Hidalgo (2005, C 1). The following proposition focusses on the stationary case for convenience; the extension to r ≥ 1 is obvious. It thus extends Propositions 2 and 3 in case that r = 0.
Remark F. First, from Proposition 4 we observe that the singularity is shifted from frequency λ∗ to pλ∗ due to aggregation. If pλ∗ exceeds π , one may of course rescale due to periodicity and symmetry, and replace pλ∗ by λ0 ∈ [0, π]:
λ0 =
2kπ − pλ∗ , pλ∗ − 2kπ ,
if pλ∗ ∈ ((2k − 1)π , 2kπ ] if pλ∗ ∈ (2kπ , (2k + 1)π ],
k = 1, 2, . . . .
Second, contrary to (4), the moving average filter no longer annihilates the aliasing effect in case of cumulation (as λ → pλ∗ ):
fy (λ) ∼ |pλ∗ − λ|−2d p2d−1 fe (λ∗ )T0 (pλ∗ ) + R0 (pλ∗ ), where R0 (pλ∗ ) is positive in general. Hence, if d < 0, a zero at ∗ λ of fy is not reproduced by fy at pλ∗ , such that generalized fractional integration with d < 0 is no longer closed under cumulation. This is not true, however, in the special case where pλ∗ is a multiple of 2π , because R0 is periodic, as one can see from its definition in the proof of Proposition 2. Third and most interestingly, in case d ≥ 0, one may investigate whether assumptions about fe are inherited by the short memory components ϕy∗ (λ) and ϕ˙ y∗ (λ) of the aggregates. Both short memory components are not differentiable at frequency pλ∗ , thus violating the typical assumption mentioned above made by Giraitis et al. (2001), Hidalgo (2005), and Dalla and Hidalgo (2005), and indeed by most other papers working under model (12). 4. Concluding remarks We characterize effects of cumulating flow variables as well as systematic sampling stock data of arbitrary stationary or difference-stationary processes in the frequency domain (Proposition 1). The results are applied to fractionally integrated processes of order d. In particular, we investigate whether typical assumptions on fractionally integrated processes, which are made in the literature to justify statistical semiparametric inference about d, are closed with respect to aggregation. That is we study whether assumptions that hold for basic data continue to hold for temporal aggregates, such that semiparametric methods like the logperiodogram regression (LPR) or the local Whittle (LW) estimator are justified for aggregates, too. It turns out (Proposition 2) that typical spectral assumptions made for semiparametric estimation are closed with respect to cumulating flow variables (or averaging non-overlapping stocks; see footnote 1). Hence, the semiparametric procedures discussed in Remark A for stationary data and in Remark B for the nonstationary case may be safely applied to those aggregates.
246
U. Hassler / Journal of Econometrics 162 (2011) 240–247
In case of skip sampling fractionally integrated stock variables matters are more complicated. The methods applied to nonstationary aggregates continue to hold (Remark C). In the stationary case, we conclude from Proposition 3 that certain properties that hold for the basic data cannot be maintained for the aggregate, while other assumptions are closed as long as d ≥ 0. More precisely, it turns out (Remark D) that the LW estimator can be applied to skip sampled aggregates, although the rate of divergence of the bandwidth may be influenced by d, such that the bandwidth selection may become a delicate problem in practice. A similar comment holds for the trimmed LPR (Remark E), while sufficient conditions for the conventional LPR do actually not carry over from the basic series to the aggregate. Note, that in practice such difficulties can be circumvented by aggregating stocks through averaging non-overlapping observations (see again footnote 1). Further, we reveal that skip sampling has the same spectral effect as adding noise to the data. Hence we suggest that shortcomings when estimating long memory under skip sampling may be alleviated using approaches tailored to cope with perturbed fractional integration or so-called long memory stochastic volatility models. An investigation how fruitful the many different procedures are in the presence of skip sampled long memory is beyond the present paper and has to be left for future research. For the model of general fractional integration, where a spectral singularity may occur at a frequency different from zero, we treat cumulating and skip sampling in one go (Proposition 4). In Remark F it is concluded that an assumption that is ever-present in related literature is not closed with respect to aggregation. Consequently, when applying a model of general fractional integration one should always stick to the basic data and not work with aggregated time series. Acknowledgements An earlier version of this paper was written while visiting the University of California San Diego and was presented at Texas A&M University, Universidad Carlos III de Madrid, Institute for Advanced Studies, Vienna, and the 3rd ETSERN Meeting, Nottingham. The author is grateful to Patrik Guggenberger, Joon Park, Benedikt Pötscher, Philippe Soulier, Jim Stock, Yixiao Sun, and Carlos Velasco for support and many insights. Moreover, the author thanks two anonymous referees and Peter Robinson for very helpful comments. Appendix Proof of Proposition 1. We start to prove (a). The case r = 0 is covered by Lemma 1. For r ≥ 1 we observe:
∇ r z˙τ = [(1 − L)Sp (L)]r zpτ = [Sp (L)]r ypτ , i.e. {∇ r z˙τ } is obtained by skip sampling {[Sp (L)]r yt }. Consequently, by Lemma 1, f˙∇ r z (λ) =
p−1 1−
p j =0
fy
λ + 2π j λ + 2π j 2r S exp i p , p p
which is the required result with Tj (λ) from Lemma 2. For cumulated series with r ≥ 0 it holds in (b):
∇ r zτ = [(1 − L)Sp (L)]r Sp (L)zpτ = [Sp (L)]r +1 ypτ , i.e. {∇ r zτ } is obtained by skip sampling {[Sp (L)]r +1 yt }. The result follows as in (a), which completes the proof. Proof of Proposition 2. By Proposition 1 and Lemma 2, it holds under (6),
λ −2d 2d−1 [T0 (λ)]r +1 + Rr (λ) f∇ r z (λ) = λ p fe ] [ p λ −2d 2d+2r +1 2 2d = λ fe (p + O(λ )) + λ Rr (λ) , p
p−1
Rr (λ) =
λ + 2π j 1− p j =1
fy
p
[Tj (λ)]r +1 .
With Assumption 3 it follows from Lemma 2 that Rr (λ) = O(λ2r +2 ) ′ 2r +1 and Rr (λ) = O(λ ), as required. This completes the proof. Proof of Proposition 3. Lemma 1 yields under (6)
λ −2d λ
1 f˙y (λ) = p
fe
p
p
+
p−1 − λ + 2π j fy
p
j=1
,
such that
ϕ˙ y (λ) = p
2d−1
fe
λ p
+
p−1 λ + 2π j λ2d − p
fy
j =1
p
,
which defines R˙ y with 0 < ϕ1 = p−1
p−1 − 2π j fy
j=1
p
< ∞.
With Assumption 3 one obtains the required rates for R˙ y and R˙ ′y , which completes the proof. References Abadir, K.M., Distaso, W., Giraitis, L., 2007. Nonstationarity-extended local Whittle estimation. Journal of Econometrics 141, 1353–1384. Ait-Sahalia, Y., Mykland, P.A., Zhang, P.A., 2005. How often to sample a continuoustime process in the presence of market microstructure noise. Review of Financial Studies 18, 351–416. Andrews, D.W.K., Guggenberger, P., 2003. A bias-reduced log-periodogram regression estimator for the long-memory parameter. Econometrica 71, 675–712. Andrews, D.W.K., Sun, Y., 2004. Adaptive local polynomial Whittle estimation of long-range dependence. Econometrica 72, 569–614. Baillie, R.T., Chung, C.-F., Tieslau, M.A., 1996. Analysing inflation by the fractionally integrated ARFIMA–GARCH model. Journal of Applied Econometrics 11, 23–40. Bollerslev, T., Mikkelsen, H.O., 1996. Modeling and pricing long memory in stock market volatility. Journal of Econometrics 73, 151–184. Breidt, F.J., Crato, N., de Lima, P., 1998. The detection and estimation of long memory in stochastic volatility. Journal of Econometrics 83, 325–348. Brewer, K.R.W., 1973. Some consequences of temporal aggregation and systematic sampling for ARMA and ARMAX models. Journal of Econometrics 1, 133–154. Chambers, M.J., 1998. Long memory and aggregation in macroeconomic time series. International Economic Review 39, 1053–1072. Chambers, M.J., 2004. Testing for unit roots with flow data and varying sampling frequency. Journal of Econometrics 119, 1–18. Chambers, M.J., 1996. The estimation of continuous parameter long-memory time series models. Econometric Theory 12, 374–390. Christiano, L.J., Eichenbaum, M., Marshall, D., 1991. The permanent income hypothesis revisited. Econometrica 59, 397–423. Dalla, V., Hidalgo, J., 2005. A parametric bootstrap test for cycles. Journal of Econometrics 129, 219–261. Diebold, F.X., Rudebusch, G.D., 1989. Long memory and persistence in aggregate output. Journal of Monetary Economics 24, 189–209. Drost, F.C., 1994. Temporal aggregation of time-series. In: Kaehler, J., Kugler, P. (Eds.), Econometric Analysis of Financial Markets. Physica, Heidelberg, pp. 11–21. Drost, F.C., Nijman, T.E., 1993. Temporal aggregation of GARCH processes. Econometrica 61, 909–927. Geweke, J., Porter-Hudak, S., 1983. The estimation and application of long memory time series models. Journal of Time Series Analysis 4, 221–238. Giraitis, L., Hidalgo, J., Robinson, P.M., 2001. Gaussian estimation of parametric spectral density with unknown pole. The Annals of Statistics 29, 987–1023. Granger, C.W.J., Siklos, L., 1995. Systematic sampling, temporal aggregation, seasonal adjustment, and cointegration: theory and evidence. Journal of Econometrics 66, 357–370. Gray, H.L., Zhang, N.-F., Woodward, W.A., 1989. On generalized fractional processes. Journal of Time Series Analysis 10, 233–257.
U. Hassler / Journal of Econometrics 162 (2011) 240–247 Guggenberger, P., Sun, Y., 2006. Bias-reduced log-periodogram and Whittle estimation of the long-memory parameter without variance inflation. Econometric Theory 22, 863–912. Hansen, L.P., Sargent, T.J., 1983. The dimensionality of the aliasing problem in models with rational spectral densities. Econometrica 51, 377–387. Hassler, U., Wolters, J., 1995. Long memory in inflation rates: international evidence. Journal of Business & Economic Statistics 13, 37–45. Hidalgo, J., 2005. Semiparametric estimation for stationary processes whose spectra have an unknown pole. The Annals of Statistics 33, 1843–1889. Hurvich, C.M., Chen, W.W., 2000. An efficient taper for overdifferenced series. Journal of Time Series Analysis 21, 155–180. Hurvich, C.M., Deo, R., Brodsky, J., 1998. The mean squared error of Geweke and Porter–Hudak’s estimator of the memory parameter of a long-memory time series. Journal of Time Series Analysis 19, 19–46. Hurvich, C.M., Moulines, E., Soulier, P., 2005. Estimating long memory in volatility. Econometrica 73, 1283–1328. Hurvich, C.M., Ray, B., 2003. The local Whittle estimator of long-memory stochastic volatility. Journal of Financial Econometrics 1, 445–470. Hwang, S., 2000. The effects of systematic sampling and temporal aggregation on discrete time long memory processes and their finite sample properties. Econometric Theory 16, 347–372. Iouditsky, A., Moulines, E., Soulier, P., 1999. Adaptive estimation of the fracional differencing coefficient. Bernoulli 7, 699–731. Lobato, I.N., Robinson, P.M., 1996. Averaged periodogram estimation of long memory. Journal of Econometrics 73, 303–324. Lütkepohl, H., 1987. Forecasting Aggregated Vector ARMA Processes. Springer. Lütkepohl, H., 2009. Forecasting aggregated time series variables: a survey. EUI Working Papers 17. Mandelbrot, B.B., van Ness, J.W., 1968. Fractional Brownian motion, fractional noises and applications. SIAM Review 10, 422–437. Man, K.S., Tiao, G.C., 2006. Aggregation effect and forecasting temporal aggregates of long memory process. International Journal of Forecasting 22, 267–281. Marcellino, M., 1999. Some consequences of temporal aggregation in empirical analysis. Journal of Business & Economic Statistics 17, 129–136. Marinucci, D., Robinson, P.M., 1999. Alternative forms of fractional Brownian motion. Journal of Statistical Planning and Inference 80, 111–122. Mishkin, F.S., 2007. Inflation dynamics. International Finance 10, 317–334. Moulines, E., Soulier, P., 1999. Broadband log-periodogram regression of time series with long-range dependence. The Annals of Statistics 27, 1415–1439. Palm, F.C., Nijman, T.E., 1984. Missing observations in the dynamic regression model. Econometrica 52, 1415–1435. Paya, I., Duarte, A., Holden, K., 2007. On the relationship between inflation persistence and temporal aggregation. Journal of Money, Credit and Banking 39, 1521–1531. Pons, G., 2006. Testing monthly seasonal unit roots with monthly and quarterly information. Journal of Time Series Analysis 27, 191–210. Priestley, M.B., 1981. Spectral Analysis and Time Series, Vol. 1. Academic Press. Robinson, P.M., 2005. The distance between rival nonstationary fractional processes. Journal of Econometrics 128, 283–399. Robinson, P.M., 1995a. Gaussian semiparametric estimation of long range dependence. Annals of Statistics 23, 1630–1661. Robinson, P.M., 1995b. Log-periodogram regression of time series with long range dependence. Annals of Statistics 23, 1048–1072.
247
Robinson, P.M., 1994. Rates of convergence and optimal bandwidth in spectral analysis of processes with long range dependence. Probability Theory and Related Fields 99, 443–473. Rossana, R.J., Seater, J.J., 1995. Temporal aggregation and economic time series. Journal of Business & Economic Statistics 13, 441–451. Shiller, R.J., Perron, P., 1985. Testing the random walk hypothesis. Economics Letters 18, 381–386. Shimotsu, K., 2010. Exact local Whittle estimation of fractional integration with unknown mean and trend. Econometric Theory 26, 501–540. Shimotsu, K., Phillips, P.C.B., 2005. Exact local Whittle estimation of fractional integration. The Annals of Statistics 33, 1890–1933. Silvestrini, A., Veredas, D., 2008. Temporal aggregation of univariate and multivariate time series models: a survey. Journal of Economic Surveys 22, 458–497. Soulier, P., 2001. Moment bounds and central limit theorem for functions of Gaussian vectors. Statistics & Probability Letters 54, 193–203. Souza, L.R., 2003. A note on Chambers’s ‘‘long memory and aggregation in macroeconomic time series’’. EPGE Ensaios Econômicos Working Paper 503. Souza, L.R., 2005. A note on Chambers’s long memory and aggregation in macroeconomic time series. International Economic Review 46, 1059–1062. Souza, L.R., 2007. Temporal aggregation and bandwidth selection in estimating long memory. Journal of Time Series Analysis 28, 701–722. Souza, L.R., 2008. Why aggregate long memory time series? Econometric Reviews 27, 298–316. Souza, L.R., Smith, J., 2002. Bias in the memory parameter for different sampling rates. International Journal of Forecasting 18, 299–313. Stock, J.H., Watson, M.W., 2007. Why has US inflation become harder to forecast? Journal of Money, Credit and Banking 39, 3–33. Stram, D.O., Wei, W.W.S., 1986. Temporal aggregation in the ARIMA process. Journal of Time Series Analysis 7, 279–292. Sun, Y., Phillips, P.C.B., 2003. Nonlinear log-periodogram regression for perturbed fractional processes. Journal of Econometrics 115, 355–389. Teles, P., Wei, W.W.S., Crato, N., 1999. The use of aggregate time series in testing for long memory. Bulletin of the International Statistical Institute, 52nd Session 3, 341–342. Tsai, H., Chan, K.S., 2005a. Temporal aggregation of stationary and nonstationary continuous-time processes. Scandinavian Journal of Statistics 32, 583–597. Tsai, H., Chan, K.S., 2005b. Temporal aggregation of stationary and nonstationary discrete-time processes. Journal of Time Series Analysis 26, 613–624. Tschernig, R., 1995. Long memory in foreign exchange rates revisited. Journal of International Financial Markets, Institutions & Money 5, 53–78. Velasco, C., 1999a. Gaussian semiparametric estimation of non-stationary time series. Journal of Time Series Analysis 20, 87–127. Velasco, C., 1999b. Non-stationary log-periodogram regression. Journal of Econometrics 91, 325–372. Wei, W.W.S., 1981. Effect of systematic sampling on ARIMA models. Communications in Statistics — Theory & Methods 23 (A10), 2389–2398. Wei, W.W.S., 1990. Time Series Analysis: Univariate and Multivariate Methods. Addison-Wesley. Weiss, A.A., 1984. Systematic sampling and temporal aggragation in time series models. Journal of Econometrics 26, 271–282. Woodward, W.A., Cheng, Q.C., Gray, H.L., 1998. A k-factor GARMA long-memory model. Journal of Time Series Analysis 19, 485–504.
Journal of Econometrics 162 (2011) 248–267
Contents lists available at ScienceDirect
Journal of Econometrics journal homepage: www.elsevier.com/locate/jeconom
Estimating structural changes in regression quantiles✩ Tatsushi Oka a , Zhongjun Qu b,∗ a
Department of Economics, National University of Singapore, AS2 Level 6, 1 Arts Link, Singapore 117570, Singapore
b
Department of Economics, Boston University, 270 Bay State Rd., Boston, MA, 02215, United States
article
info
Article history: Received 17 May 2010 Received in revised form 29 December 2010 Accepted 27 January 2011 Available online 16 February 2011 JEL classification: C14 C21 C22
abstract This paper considers the estimation of multiple structural changes occurring at unknown dates in one or multiple conditional quantile functions. The analysis covers time series models as well as models with repeated cross-sections. We estimate the break dates and other parameters jointly by minimizing the check function over all permissible break dates. The limiting distribution of the estimator is derived and the coverage property of the resulting confidence interval is assessed via simulations. A procedure to determine the number of breaks is also discussed. Empirical applications to the quarterly US real GDP growth rate and the underage drunk driving data suggest that the method can deliver more informative results than the analysis of the conditional mean function alone. © 2011 Elsevier B.V. All rights reserved.
Keywords: Structural breaks Change-point Conditional distribution Quantile regression Policy evaluation
1. Introduction The issue of structural change has been extensively studied in a variety of applications. Recent contributions have substantially broadened the scope of the related literature. For example, Bai and Perron (1998, 2003) provided a unified treatment of estimation, inference and computation in linear multiple regression with unknown breaks. Bai et al. (1998), Bai (2000) and Qu and Perron (2007) extended the analysis to a system of equations. Hansen (1992), Bai et al. (1998) and Kejriwal and Perron (2008) considered regressions with integrated variables. Andrews (1993), Hall and Sen (1999) and Li and Müller (2009) considered nonlinear models estimated by Generalized Method of Moments. Kokoszka and Leipus (1999, 2000) and Berkes et al. (2004) studied parameter change in GARCH processes. One may refer to Csörgő and Horváth (1998) and Perron (2006) for a comprehensive review of the literature.
✩ We thank the Editor, Takeshi Amemiya, Associate Editor and two referees for constructive comments. We also thank Roger Koenker, Pierre Perron, Barbara Rossi, Ivan Fernandez-Val and participants at 2010 Econometric Society winter meeting for useful suggestions, Denis Tkachenko for research assistance and Tzuchun Kuo for suggesting the underage drunk driving data used in Section 8. ∗ Corresponding author. Tel.: +1 617 353 3184; fax: +1 617 353 4449. E-mail addresses:
[email protected] (T. Oka),
[email protected] (Z. Qu).
0304-4076/$ – see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.jeconom.2011.01.005
A main focus in the literature has been the conditional mean function, while, under many circumstances, structural change in the conditional quantile function is of key importance. For example, when studying income inequality, it is important to examine whether (and how) the wage differential between different racial groups, conditional on observable characteristics, has changed over time. An increase in inequality may increase the conditional dispersion of the differential, while leaving the mean unchanged. Thus, the conditional mean ceases to be informative and the conditional quantiles should be considered. As another example, consider a policy reform that aims at helping students with low test performance. In this case, attention should clearly be focused on the lower quantiles of the conditional distribution in order to understand the effect of such a policy. In both examples, it can be desirable to allow the break dates to be unknown and estimate them from the data. The reason is that in the former case it is often difficult to identify the source of the change a priori, while in the latter the policy effect may occur with an unknown time lag due to various reasons. To address such issues, Qu (2008) and Su and Xiao (2008) considered Wald and subgradient-based tests for structural change in regression quantiles, allowing for unknown break dates. However, they did not consider the issue of estimation and inference regarding break dates and other coefficients. This is the subject of the current paper. Specifically, we consider the estimation of multiple structural changes occurring at unknown dates in conditional quantile
T. Oka, Z. Qu / Journal of Econometrics 162 (2011) 248–267
functions. The basic framework is that of Koenker and Bassett (1978), with the conditional quantile function being linear in parameters in each individual regime. The analysis covers two types of models. The first is a time series model (e.g., the quantile autoregressive (QAR) model of Koenker and Xiao, 2006), which can be useful for studying structural change in a macroeconomic variable. The second model considers repeated cross-sections, which can be useful for the analysis of effects of social programs, laws and economic policy. For each model, we consider both structural change in a single quantile and in multiple quantiles. The joint analysis of multiple quantiles requires imposing stronger restrictions on the model, but is important, because it can potentially increase the efficiency of the break estimator and, more importantly, reveal the heterogeneity in the change, thus delivering a richer set of information. We first assume that the number of breaks is known and construct estimators for unknown break dates and other coefficients. The resulting estimator is the global minimizer of the check function over all permissible break dates. When multiple quantiles are considered, the check function is integrated over a set of quantiles of interest. The underlying assumptions are mild, allowing for dynamic models. Also, they restrict only a neighborhood surrounding quantiles of interest. Other quantiles are left unspecified, thus being allowed to change or remain stable. The latter feature allows us to look at slices of the conditional distribution without relying on global distributional assumptions. Under these assumptions, we derive asymptotic distributions of the estimator following the methodology of Picard (1985) and Yao (1987). The distribution of the break date estimates depend on a two-sided Brownian motion, which is often encountered in the literature and the analytical properties of which have been studied by Bai (1997). It involves parameters that can be consistently estimated, thus the confidence interval can be constructed without relying on simulation. We then discuss a testing procedure that allows one to determine the number of breaks. It builds upon the subgradient-based tests proposed in Qu (2008). These tests do not require estimation of the variance (more precisely, the sparsity) parameter, thus having monotonic power even when multiple changes are present. This feature makes them suitable for our purposes. We consider two empirical applications, one for each type of models studied in the paper. The first application is to the quarterly US GDP growth, whose volatility has been documented to exhibit a substantial decline since the early to mid-1980s (the so-called ‘‘Great Moderation’’), see for example McConnell and Perez-Quiros (2000). The paper revisits this issue using a quantile regression framework. The result suggests that the moderation mainly affected the upper tail of the conditional distribution, with the conditional median and the lower quantiles remaining stable. Hence, it suggests that a major change in the GDP growth should be attributed to the fact that the growth was less rapid during expansions, while the recessions have remained just as severe when they occurred. The second application considers structural changes in young drivers’ blood alcohol levels using a data set for the state of California over the period 1983–2007. 
Two structural changes are detected, which closely coincide with the National Minimum Drinking Ages Act of 1984 and a beer tax hike in 1991. Interestingly, the changes are smaller in higher quantiles, suggesting that the policies are more effective for ‘‘light drinkers’’ than for ‘‘heavy drinkers’’ in the population. The technical development in the paper relies heavily on Bai (1995, 1998). Bai (1995) developed asymptotic theory for least absolute deviation estimation of a shift in linear regressions, while Bai (1998) extended the analysis to allow for multiple changes. Recently, Chen (2008) further extended Bai’s work to study structural changes in a single conditional quantile function.
249
There, the regressors are assumed to be strictly exogenous. This paper is different from their studies in three important aspects. First, the assumptions allow for dynamic models. Therefore, the result has wider applicability. Secondly, we consider models with repeated cross sections, which is important for policy related applications. Finally, we consider structural change in multiple quantiles and a testing procedure for determining the number of breaks. From a methodological perspective, this paper is related to the literature of functional coefficient quantile regression models, see Cai and Xu (2008) and Kim (2007). Their model is suitable for modeling smooth changes and our model is more suitable for modeling sudden shifts. Finally, the paper is related to robust estimation of a change-point: see Hušková (1997), Fiteni (2002) and the references therein. The paper is organized as follows. Section 2 discusses the setup and three simple examples to motivate the study. Section 3 discusses multiple structural changes in a pre-specified quantile in a time series model, together with the method of estimation and the limiting distributions of the estimates. Section 4 considers structural change in multiple quantiles. Section 5 considers models with repeated cross-sections. In Sections 2–5, the number of breaks is assumed to be known. The issue of estimating the number of breaks is addressed in Section 6. Section 7 contains simulation results and Section 8 presents the empirical applications. All proofs are included in the Appendix. The following notation is used. The superscript 0 indicates the true value of a parameter. For a real valued vector z , ‖z ‖ denotes its Euclidean norm. [z] is the integer part of z. 1(·) is the indicator function. D[0,1] stands for the set of functions on [0, 1] that are right continuous and have left limits, equipped with the Skorohod metric. The symbols ‘‘⇒’’, ‘‘→p ’’ and ‘‘→a.s. ’’ denote weak convergence under Skorohod topology, convergence in probability and convergence almost surely, and Op (·) and op (·) is the usual notation for the orders of stochastic convergence. 2. Setup and examples Let yt be a real-valued random variable, xt a p by 1 random vector, and Qyt (τ |xt ) the conditional quantile function of yt given xt , where t corresponds to the time index or an ordering according to some other criterion. Note that if t is the index for time, then Qyt (τ |xt ) is interpreted as the quantile function of yt conditional on the σ -algebra generated by (xt , yt −1 , xt −1, yt −2 , . . .). Let T denote the sample size. We assume the conditional quantile function is linear in parameters and is affected by m structural changes:
′ 0 xt β1 (τ ), x′ β 0 (τ ), t 2 Qyt (τ |xt ) = . . . ′ 0 xt βm+1 (τ ),
t = 1, . . . , T10 , t = T10 + 1, . . . , T20 ,
.. .
(1)
t = Tm0 + 1, . . . , T ,
where τ ∈ (0, 1) denotes a quantile of interest, βj0 (τ ) (j = 1, . . . , m + 1) are the unknown parameters that are quantile dependent, and Tj0 (j = 1, . . . , m) are the unknown break dates. A
subset of βj0 (τ ) may be restricted to be constant over t to allow for partial structural changes. The regressors xt can include discrete as well as continuous variables. We now give three examples to illustrate our framework. Example 1. Cox et al. (1985) considered the following model for the short-term riskless interest rate: 1/2
drt = (α + β rt ) dt + σ rt
dWt ,
(2)
where rt is the riskless rate and Wt is the Wiener process. The process (2) can be approximated by the following discrete-time
250
T. Oka, Z. Qu / Journal of Econometrics 162 (2011) 248–267
model if the sampling intervals are small (see Chan et al., 1992, p. 1213 and the references therein for discussions on the issue of discretization): 1/2
rt +1 − rt = α + β rt + (σ rt
)ut +1
with ut +1 ∼ i.i.d. N (0, 1), implying that the quantiles of rt +1 given rt are linear in parameters and satisfy 1/2
Qrt +1 (τ |rt ) = α + (1 + β)rt + (σ rt
)Fu−1 (τ ),
where Fu (τ ) is the τ th quantile of a standard normal random variable. The procedure developed in this paper can be used to estimate structural changes in some or all of the parameters (α, β and σ ). More complicated models than (2) can be analyzed in the same way provided that, upon discretization, the conditional quantile function can be approximated well by a linear function. −1
Example 2. Chernozhukov and Umantsev (2001) studied the following model for Value-at-Risk1 : Qyt (τ |xt ) = x′t β(τ ),
(3)
where yt is return on some asset, xt is a vector of information variables affecting the distribution of yt . For example, xt may include returns on other securities, lagged values of yt , and proxies to volatility (such as exponentially weighted squaredreturns). They documented that such variables affect various quantiles of yt in a very differential and nontrivial manner. Here, an interesting open issue is whether the risk relationship (3) undergoes substantial structural changes. Our method can be applied to address this issue without having to specify the dates of the changes a priori. Example 3. Piehl et al. (2003) applied the structural change methodology to evaluate the effect of the Boston Gun Project on youth homicide incidents. They allowed the break date to be unknown in order to capture an unknown time lag in policy implementation. Their focus was on structural change in the conditional mean. However, in many cases, the aim of a policy is to induce a distribution change rather than a pure location shift. For example, consider a public school reform aimed at improving the performance of students with low test scores. In this case, the lower quantiles of the conditional distribution are the targets of the reform. Another example is a public policy to reduce income inequality. In this case, the target is the dispersion of the distribution. In these cases, the standard structural change in mean methodology ceases to be relevant, while the methods developed in this paper can prove useful. For now, assume that the number of structural changes is known. Later in the paper (Section 6), we will discuss a testing procedure that can be used to estimate the number of breaks. We first consider structural changes in a single conditional quantile function in a time series model. 3. Structural changes in a given quantile τ ∈ (0, 1) In the absence of structural change, the model (1) can be estimated by solving min b∈Rp
T −
ρτ (yt − x′t b),
(4)
t =1
where ρτ (u) is the check function given by ρτ (u) = u(τ − 1(u < 0)); see Koenker (2005) for a comprehensive treatment of related
issues. Now suppose that the τ th quantile is affected by m structural changes, occurring at unknown dates (T10 , . . . , Tm0 ). Then, define the following function for a set of candidate break dates T b = (T1 , . . . , Tm ): ST (τ , β(τ ), T b ) =
Tj+1 m − −
ρτ (yt − x′t βj+1 (τ )),
where β(τ ) = (β1 (τ )′ , . . . , βm+1 (τ )′ )′ , T0 = 0 and Tm+1 = T .2 Motivated by Bai (1995, 1998), we estimate the break dates and coefficients β(τ ) jointly by solving
ˆ ), Tˆ b ) = arg (β(τ
min
β(τ ),T b ∈Λε
ST (τ , β(τ ), T b ),
(6)
ˆ ) = (βˆ 1 (τ )′ , . . . , βˆ m+1 (τ )′ )′ and Tˆ b = (Tˆ1 , . . . , Tˆm ). where β(τ Specifically, for a given partition of the sample, we estimate the coefficients β(τ ) by minimizing ST (τ , β(τ ), T b ). Then, we search over all permissible partitions to find the break dates that achieve the global minimum. These break dates, along with the corresponding estimates for β(τ ), are taken as final estimates. Note that in (6), Λε denotes the set of possible partitions. It ensures that each estimated regime is a positive fraction of the sample. For example, it can be specified as Λε = {(T1 , . . . , Tm ) : Tj − Tj−1 ≥ ε T (j = 2, . . . , m), T1 ≥ ε T , Tm ≤ (1 − ε)T },
(7)
where ε is a positive small number. The precise assumptions on Λε will be stated later in the paper. Let ft (·), Ft (·) and Ft−1 (·) denote the conditional density, conditional distribution and conditional quantile function of yt given xt . Let Ft −1 be the σ -algebra generated by (xt , yt −1 , xt −1, yt −2 , . . .) and u0t (τ ) be the difference between yt and its τ th conditional quantile, i.e., u0t (τ ) = yt − x′t βj0 (τ )
for Tj0−1 + 1 ≤ t ≤ Tj0 (j = 1, . . . , m + 1).
We now state the assumptions needed for the derivation of asymptotic properties of the estimates. Assumption 1. {1(u0t (τ ) < 0) − τ } is a martingale difference sequence with respect to Ft −1 . Assumption 2. The distribution functions {Ft (·)} are absolutely continuous, with continuous densities {ft (·)} satisfying 0 < Lf ≤ ft (Ft−1 (τ )) ≤ Uf < ∞ for all t. Assumption 3. For any ϵ > 0 there exists a σ (ϵ) > 0 such that |ft (Ft−1 (τ ) + s) − ft (Ft−1 (τ ))| < ϵ for all |s| < σ (ϵ) and all 1 ≤ t ≤ T. Assumptions 1 and 2 are familiar in the quantile regression literature; heteroskedasticity is allowed. Assumption 2 requires the densities to be uniformly bounded away from 0 and ∞ at the quantile of interest. This assumption is local and therefore does not require the full density function to be bounded. Assumption 3 implies that the conditional densities are uniformly continuous in some neighborhood of the τ th quantile. This assumption, along with Assumption 2, implies that ft (u) is uniformly bounded away from 0 and ∞ for all t and all u in some open neighborhood of Ft−1 (τ ). Note that these two assumptions entail stronger restrictions in dynamic models than in ordinary regression applications since the support of the explanatory variables is
2 A more complete notation for β (τ ) should be β τ , T b . We have omitted T b in order not to overburden the notation.
1 See Taylor (1999) and Engle and Manganelli (2004) for related works.
(5)
j=0 t =Tj +1
T. Oka, Z. Qu / Journal of Econometrics 162 (2011) 248–267
determined within the model. However, they are often not difficult to verify given a particular specification. To illustrate these three assumptions, we can consider the following location-scale model (with no structural change): yt = zt′ β + (wt′ η)ut ,
(8)
where {ut } is a sequence of i.i.d. errors independent of zt and wt (ut can be correlated with zt +k and wt +k for k > 0, thus allowing for dynamic models). Then, Assumption 1 is satisfied due to the independence. Let fu (·) and Fu−1 (τ ) denote the density and the τ th quantile of the errors ut . Then, ft (Ft−1 (τ )) = fu (Fu−1 (τ ))/(wt′ η). Thus, Assumption 2 is satisfied if Fu (·) is absolutely continuous with continuous density fu (·) satisfying δu < fu (Fu−1 (τ )) < ∞ and δw < wt′ η < ∞ for all t for some arbitrary strictly positive constants δu and δw . Note that fu (Fu−1 (·)) can be unbounded at quantile different from τ . Assumption 3 is satisfied if, in addition, the density fu (·) is continuous over an open interval containing the τ th quantile. Assumption 4. Tj0 = [λ0j T ] (j = 1, . . . , m) with 0 < λ01 < · · · <
λ0m < 1.
Assumption 5. (a) an intercept is included in xt ; (b) for each j = 1, . . . , m + 1, 1 T 1 T
Tj0−1 +[sT ]
− t =Tj0−1 +1
(9)
Tj0−1 +[sT ]
−
′
xt xt →
p
sJj0
hold uniformly in 0 ≤ s ≤ λ0j − λ0j−1 as T → ∞, where Jj0 and
Hj0 (τ ) are non-random positive definite matrices; (c) E ‖xt ‖4+ϕ < L holds with some ϕ > 0 and L < ∞ for all t = 1, . . . , T ; (d) there ∑T exist M < ∞ and γ > 2 such that T −1 t =1 E ‖xt ‖2γ +1 < M and
∑T
3 t =1 ‖ x t ‖
Lemma 1. Under Assumptions 1–6, we have vT2 (Tˆj − Tj0 ) = Op (1) for j = 1, . . . , m and
√
T (βˆ j (τ )−βj0 (τ )) = Op (1) for j = 1, . . . , m + 1.
The next result presents the limiting distributions of the estimates. Theorem 1. Let Assumptions 1–6 hold. Then, for j = 1, . . . , m,
πj σj
2
vT2 (Tˆj − Tj0 ) W (s) − |s|/2 →d arg max (σj+1 /σj )W (s) − (πj+1 /πj )|s|/2 s
s≤0 s > 0,
where πj = ∆j (τ )′ Hj0 (τ )∆j (τ ), πj+1 = ∆j (τ )′ Hj0+1 (τ )∆j (τ ), σj2 =
t =Tj0−1 +1
E T −1
functional central limit theorem then applies, which permits us to obtain a limiting distribution invariant to the exact distribution of xt and yt . The setup is well suited to provide an adequate approximation to the exact distribution when the change is moderate but the resulting confidence interval can be liberal when the change is small. The quality of the resulting approximation will be subsequently evaluated using simulations. To summarize, the assumptions have two important features. First, they allow for dynamic models, for example, the quantile autoregression (QAR) model of Koenker and Xiao (2006): yt = ρ0 (Ut )+ρ1 (Ut )yt −1 +· · ·+ρq (Ut )yt −q , where {Ut } is a sequence of i.i.d. standard uniform random variables. Second, the assumptions are local, in the sense that they impose restrictions only on the τ th quantile and a small neighborhood surrounding it. This allows us to look at slices of the distribution without making global distributional assumptions. The following result establishes the convergence rates of the parameter estimates.
ft (Ft−1 (τ ))xt x′t →p sHj0 (τ ) and
251
γ
< M hold when T is large; (e) there exists ∑l+j ′ −1
j0 > 0, such that the eigenvalues of j t =l xt xt are bounded from above and below by λmax and λmin for all j ≥ j0 and 1 ≤ l ≤ T − j with 0 < λmin ≤ λmax < ∞. Assumption 4 states that each regime occupies a non-vanishing proportion of the sample. Assumption 5 imposes some structure on the regressors xt . The first part in (9) imposes some restriction on possible heteroskedasticity. Assumption 5(c) ensures weak ∑[sT ] convergence of the process T −1/2 t =1 xt (1 (Ft (yt ) ≤ τ ) − τ ). Assumption 5(d) is needed for stochastic equicontinuity of sequential empirical processes based on the estimated quantile regression residuals. It also implies max1≤t ≤T ‖xt ‖ = op (T 1/2 ), which is familiar in the literature on M-estimators. Assumption 5 rules out trending regressors under which the rates of convergence can be verified, although the limiting distribution of the break estimates will be different. This situation requires a separate treatment. Assumption 6. Let ∆T ,j (τ ) = β
0 j+1
(τ ) − β (τ ) (j = 1, . . . , m). 0 j
Assume ∆T ,j (τ ) = vT ∆j (τ ) for some ∆j (τ ) > 0, where ∆j (τ ) is a vector independent of T , vT > 0 is a scalar satisfying vT → 0 and T (1/2)−ϑ vT → ∞ for some ϑ ∈ (0, 1/2). Assumption 6 follows Picard (1985) and Yao (1987). The magnitudes of the shifts converge to zero as the sample size increases. Consequently, the breaks will be estimated at a slower convergence rate than under ‘‘fixed-break’’ asymptotics. A
τ (1 − τ ) ∆j (τ )′ Jj0 ∆j (τ ), σj2+1 = τ (1 − τ ) ∆j (τ )′ Jj0+1 ∆j (τ ) and W (s) is the standard two-sided Brownian motion. Also, √ T (βˆ j (τ ) − βj0 (τ )) →d N 0, Vj = τ (1 − τ )Ωj0 (τ )/(λ0j − λ0j−1 )2 and Ωj0 (τ ) (Hj0 (τ ))−1 Jj0 (Hj0 (τ ))−1 for j = 1, . . . , m + 1. with Vj
=
The limiting distribution has the same structure as that of Bai (1995). An analytical expression for its cumulative distribution function is provided in Bai (1997). To construct confidence intervals, we need to replace Hj0 (τ ) and Jj0 by consistent estimates. These can be obtained by conditioning on the estimated break date Tˆj . For example,
ˆ 1 (τ ) = Tˆ1−1 H
Tˆ1 −
fˆt (Ft−1 (τ ))xt x′t ,
Jˆ1 = Tˆ1−1
t =1
Tˆ1 −
xt x′t .
t =1
The estimates for the densities, fˆt (Ft−1 (τ )), can be obtained using the difference quotient, as considered by Siddiqui (1960) and Hendricks and Koenker (1992). A detailed discussion can be found in Qu (2008, pp. 176–177). 4. Structural changes in multiple quantiles Structural changes can be heterogeneous in the sense that different quantiles can change by different magnitudes. In such a context, it can be more informative to consider a range of quantiles as opposed to a single one. Suppose that quantiles in the interval Tω = [ω1 , ω2 ] with 0 < ω1 < ω2 < 1 are affected by structural change. Then, a natural approach is to consider a partition of this interval and examine a set of quantiles denoted by τh , h = 1, . . . , q. After such a set of
252
T. Oka, Z. Qu / Journal of Econometrics 162 (2011) 248–267
quantiles is specified, the estimation can be carried out in a similar way as in Section 3. Specifically, we define the following objective function for a set of candidate break dates T b = (T1 , . . . , Tm ) and parameter values β(Tω ) = (β(τ1 )′ , . . . , β(τq )′ )′ : ST (Tω , β(Tω ), T b ) =
Tj+1 q − m − −
ρτh (yt − x′t βj+1 (τh )),
β(Tω ),T b ∈Λε
ST (Tω , β(Tω ), T b ),
πj∗
2 vT2 (Tˆj − Tj0 )
σj∗
and solve min
Corollary 1. Under Assumption 7, for j = 1, . . . , m,
h=1 j=0 t =Tj +1
ˆ Tω ), Tˆ b ) = arg (β(
The next corollary gives the limiting distribution of estimated ˆ h ) (h = 1, . . . , q) are the break dates. The distributions for β(τ same as in Theorem 1, and thus are not repeated here.
W (s) − |s|/2 → arg max (σ ∗ /σ ∗ )W (s) − (π ∗ /π ∗ )|s|/2 s j +1 j j +1 j
d
(10)
where Λε has the same definition as in (7). The minimization problem (10) consists of three steps. First, we minimize ST (τh , β(τh ), T b ) for a given partition of the sample and a given quantile τh . Then, still conditioning on the same partition, we repeat the minimization for all quantiles {τh : h = 1, . . . , q} to ˆ Tω ), T b ). Finally, we search over all possible partiobtain ST (Tω , β( tions T b ∈ Λε to find the break dates that achieve the global minimum of ST (Tω , β(Tω ), T b ). In practice, we need to choose Tω and τh (h = 1, . . . , q). We view this as an empirical issue. The interval Tω is often easy to determine given the question of interest. The choice of τh is more delicate and will require some judgement. Evidence from the empirical literature suggests that considering a coarse grid of quantiles often suffices to deliver the desired information. The spacing can be between 5% and 15% depending on the question of interest. For example, Chamberlain (1994) studied a variety of issues including the changes in returns to education from 1979 to 1987. He considered quantile regression models of log weekly earnings for five quantiles τ = 0.10, 0.25, 0.50, 0.75 and 0.90. There, the quantiles 0.10 and 0.90 were used to examine the tails of the distribution, the quantile 0.50 was used to measure the central tendency while 0.25 and 0.75 captured the intermediate cases. He showed that the returns to education are different across quantiles. They also appear to be different for 1979 and 1987. This choice of quantiles has now become a useful rule of thumb in the empirical labor literature. For example, Angrist et al. (2006) considered the same quantiles when analyzing change in US wage inequality; see also Buchinsky (1994). Similar choices are also often made in other applied micro areas. For example, Eide and Showalter (1998) and Levin (2001) considered the effect of school quality and class size on students’ performance and Poterba and Rueben (1995) considered public-private wage differentials in the United States. Intuitively, a coarse grid often suffices because adjacent quantiles typically exhibit similar properties. Therefore, the incremental information gained from considering a finer grid is typically small. Once the grid is chosen, we suggest carrying out estimation using both multiple quantiles and individual quantiles and providing a full disclosure of the results. In practice, there will be some arbitrariness associated with a particular grid choice. Therefore it is useful to experiment with different grids to examine result sensitivity. For each grid choice, the proposed method can be used to obtain point estimates and confidence intervals, which can then be compared to examine result sensitivity. This point will be illustrated using two empirical applications in Section 8. We impose the following assumption on the q quantiles entering the estimation. Assumption 7. The conditional quantile functions at τh (h = 1, . . . , q) satisfy Assumptions 1–5. There exists at least one quantile τj (1 ≤ j ≤ q) satisfying Assumption 6 (other quantiles can remain stable or satisfy Assumption 6). Also, q is fixed as T → ∞.
s≤0 s > 0,
where W (s) is the standard two-sided Brownian motion, πj∗
∑q
Hj0
(τh )∆j (τh ), πj∗+1 ∑ ∑
∑q
=
(1/q) h=1 ∆j (τh ) = (1/q) (τ ) (τh )∆j (τh ), σj∗2 = (1/q2 ) qh=1 qg =1 (τh ∧ τg − τ τ ) (τ ) ∑ ∑ (τg ) and σj∗+21 = (1/q2 ) qh=1 qg =1 (τh ∧ τg − τ τ ) (τ ) ∆j (τg ). ′
0 h=1 ∆j h Hj+1 ′ 0 h g ∆j h Jj ∆j ′ 0 h g ∆j h Jj+1
′
5. Models with repeated cross-sections Suppose the data set contains observations (x′it , yit ), where i is the index for individual and t for time. Assume i = 1, . . . , N and t = 1, . . . , T . First, consider structural changes in a single conditional quantile function. Suppose the data generating process is Qyit (τ |xit ) = x′it βj0 (τ ) for t = Tj0−1 + 1, . . . , Tj0 ,
(11)
where the break dates (j = 1, . . . , m) are common to all individuals. The estimation procedure is similar to that in Section 3. Specifically, for a set of candidate break dates T b = (T1 , . . . , Tm ), define the following function Tj0
SNT (τ , β(τ ), T ) = b
Tj+1 m N − − −
ρτ (yit − x′it βj+1 (τ )),
j=0 t =Tj +1 i=1
∑N
where an additional summation ‘‘ i=1 ’’ is present to incorporate the cross-sectional observations. Then, solve the following minimization problem to obtain the estimates:
ˆ ), Tˆ b ) = arg (β(τ
min
β(τ ),T b ∈Λε
SNT (τ , β(τ ), T b ).
Let Ft −1 denote the σ -algebra generated by {xit , yi,t −1 , xi,t −1, yi,t −2 , . . .}Ni=1 . Let u0it (τ ) denote the difference between yit and its τ th conditional quantile, i.e., u0it (τ ) = yit − x′it βj0 (τ )
for Tj0−1 + 1 ≤ t ≤ Tj0 (j = 1, . . . , m + 1).
We make the following assumptions, which closely parallel the ones in Section 3. Assumption B1. For a given i, 1(u0it (τ ) < 0) − τ is a martingale difference sequence with respect to Ft −1 . Also, u0it (τ ) and u0jt (τ ) are independent conditional on Ft −1 for all i ̸= j.
Assumption B2. Assumption 2 holds for {Fit (·)} and {fit (·)}. Assumption B3. For any ϵ > 0 there exists a σ (ϵ) > 0 such that |fit (Fit−1 (τ ) + s) − fit (Fit−1 (τ ))| < ϵ for all |s| < σ (ϵ) and all 1 ≤ i ≤ N and 1 ≤ t ≤ T . Assumption B4. The break dates are common for all i; Tj0 = [λ0j T ] with 0 < λ01 < · · · < λ0m < 1.
Assumption B5. (a) an intercept is included in xit ; (b) for every j = 1, . . . , m + 1,
T. Oka, Z. Qu / Journal of Econometrics 162 (2011) 248–267 N 1 −
−
NT i=1 t =Tj0−1 +1 N 1 −
fit (Fit
−1
(τ ))xit xit → sH¯ j0 (τ ) and ′
p
τ (1 − τ ) ∆j (τ )′ J¯j0 ∆j (τ ), σ¯ j2+1 = τ (1 − τ ) ∆j (τ )′ J¯j0+1 ∆j (τ ) and W (s) is the standard two-sided Brownian motion. Also, √ NT (βˆ j (τ ) − βj0 (τ )) →d N 0, V¯ j
Tj0−1 +[sT ]
−
NT i=1 t =Tj0−1 +1
xit x′it →p sJ¯j0
∑T
t =1
∑N
i=1
‖xit ‖3
γ
< M hold for all N and sufficiently
large T ; (e) there exists j0 > 0 such that the eigenvalues of ∑ ∑ (jN )−1 lt+=jl Ni=1 xit x′it are bounded from above and below by λmax and λmin for all N, all j ≥ j0 and 1 ≤ l ≤ T − j; 0 < λmin ≤ λmax < ∞. Assumption B1 allows for dynamic models but rules out cross-sectional dependence in u0it (τ ). B2 and B3 strengthen Assumptions 2 and 3, requiring them to hold for all i and t. B4 assumes that the individuals are affected by common breaks. B5 is essentially a re-statement of Assumption 5. Assumption B6. Assume N and T satisfy one of the following two conditions: (1) N is fixed as T → ∞, or (2) (N , T ) → ∞ but log N /T ϑ/2 → 0, where ϑ is defined in Assumption holds, then assume the process 6. If0 the second condition xit 1 uit (τ ) < 0 − τ forms a stationary ergodic random field indexed by i and t. Assumption B6 encompasses both small N and large N asymptotics. The condition log N /T ϑ/2 → 0 allows for both N /T → 0 and N /T → ∞. The restriction is therefore quite mild. The second is used to ensure that part of this assumption the process xit (1 u0it (τ ) < 0 − τ ) satisfies a functional central limit theorem as (N , T ) → ∞. It can be replaced by other conditions that serve the same purpose. Assumption B7. Let ∆NT ,j (τ ) = βj0+1 (τ ) − βj0 (τ ). Assume ∆NT ,j (τ ) = N −1/2 vT ∆j (τ ) for some ∆j (τ ) > 0 independent of T and N, where vT > 0 is a scalar satisfying vT → 0 and T (1/2)−ϑ vT → ∞ with ϑ defined in Assumption 6. Assumption B7 implies that, with the added cross-sectional dimension, the model can now handle break sizes that are of order equal to N −1/2 vT , which is smaller than in the pure time series case given by O(vT ). This assumption ensures that the estimated break dates will converge at the rate vT−2 . If the breaks were of higher order than N −1/2 vT , then the estimated breaks would converge faster than vT−2 . In those cases, the confidence interval reported below will tend to be conservative. Thus, this assumption, as in the pure time series case, can be viewed as a strategy to deliver a confidence interval that has good coverage when the break size is moderate while being conservative when the break size is large. The next two results present the rates of convergence and limiting distributions of the estimates. Lemma 2. Under Assumptions B1–B7, we have vT2 (Tˆj − Tj0 ) = Op (1) for j = 1, . . . , m and 1, . . . , m + 1.
√
NT (βˆ j (τ ) − βj0 (τ )) = Op (1) for j =
Theorem 2. Let Assumptions B1–B7 hold. Then, for j = 1, . . . , m,
π¯ j σ¯ j
¯ j0 (τ )/(λ0j − λ0j−1 )2 and Ω ¯ j0 (τ ) = τ (1 − τ )Ω 0 0 ( ¯ (τ ))−1 J¯j (H¯ j (τ ))−1 for j = 1, . . . , m + 1. with V¯ j
¯ j0 (τ ) and J¯j0 are uniformly in N and 0 ≤ s ≤ λ0j − λ0j−1 , where H non-random positive definite matrices; (c) E ‖xit ‖4+ϕ < L with some ϕ > 0 and L < ∞ for all i and t; (d) there exist M < ∞ ∑T ∑N and γ > 2 such that (NT )−1 t =1 i=1 E ‖xit ‖2γ +1 < M and E (NT )−1
253
¯ j0 (τ )∆j (τ ), π¯ j+1 = ∆j (τ )′ H¯ j0+1 (τ )∆j (τ ), σ¯ j2 = where π¯ j = ∆j (τ )′ H
Tj0−1 +[sT ]
2
vT2 (Tˆj − Tj0 ) W (s) − |s|/2 →d arg max (σ¯ j+1 /σ¯ j )W (s) − (π¯ j+1 /π¯ j )|s|/2 s
=
Hj0
An equivalent way to express the limiting distribution in Theorem 2 is as follows
2 ∆NT ,j (τ )′ H¯ j0 (τ )∆NT ,j (τ ) N (Tˆj − Tj0 ) τ (1 − τ ) ∆NT ,j (τ )′ J¯j0 ∆NT ,j (τ ) W (s) − |s|/2 →d arg max (σ¯ j+1 /σ¯ j )W (s) − (π¯ j+1 /π¯ j )|s|/2 s
s≤0 s > 0,
where ∆NT ,j (τ ) denotes the magnitude of the jth break for a given finite sample of size (N , T ). This representation clearly illustrates the effect of the cross-section sample size N on the precision of the break estimates. Namely, if everything in the parentheses stays the same, increasing N will proportionally decrease the width of the confidence interval for break dates. Such a finding was first reported by Bai et al. (1998) when considering the estimation of a common break in multivariate time series regressions. We now extend the analysis to consider structural breaks in multiple quantiles. Define, for a given Tω and T b = (T1 , . . . , Tm ), SNT (Tω , β(Tω ), T b ) =
Tj+1 q − m N − − −
ρ(yit − x′it βj+1 (τh ))
h=1 j=0 t =Tj +1 i=1
and
ˆ Tω ), Tˆ b ) = arg (β(
min
β(Tω ),T b ∈Λε
SNT (Tω , β(Tω ), T b ).
Assumption B8. The conditional quantiles for τh (h = 1, . . . , q) satisfy Assumptions B1–B6. There exists at least one quantile τj (1 ≤ j ≤ q) satisfying Assumption B7. Also, q is fixed as T → ∞. Corollary 2. Under Assumption B8, for j = 1, . . . , m,
π¯ j∗ σ¯ j∗
2 vT2 (Tˆj − Tj0 )
W (s) − |s|/2 →d arg max (σ¯ ∗ /σ¯ ∗ )W (s) − (π¯ ∗ /π¯ ∗ )|s|/2 s j +1 j j +1 j
s≤0 s > 0,
where W (s) is the standard two-sided Brownian motion, π¯ j∗
= ∑ ∑ (1/q) qh=1 ∆j (τh )′ H¯ j0 (τh )∆j (τh ), π¯ j∗+1 = (1/q) qh=1 ∆j (τh )′ H¯ j0+1 ∑ ∑ (τh )∆j (τh ), σ¯ j∗2 = (1/q2 ) qh=1 qg =1 (τh ∧ τg − τh τg )∆j (τh )′ J¯j0 ∆j ∑ ∑ (τg ) and σ¯ j∗+21 = (1/q2 ) qh=1 qg =1 (τh ∧ τg − τh τg )∆j (τh )′ J¯j0+1 ∆j (τg ). In summary, the method discussed in this section permits us to estimate structural breaks using individual level data. In this aspect, a closely related paper is Bai (2010), who considers common breaks in a linear panel data regression. A key difference is that Bai (2010) studies change in the mean or the variance while here we consider change in the conditional distribution. Hence, the results complement each other. 6. A procedure to determine the number of breaks
s≤0 s > 0,
The following procedure is motivated by Bai and Perron (1998). It is built upon two test statistics, SQτ and DQ , proposed in Qu (2008). We first give a brief review of these two tests.
254
T. Oka, Z. Qu / Journal of Econometrics 162 (2011) 248–267
The SQτ test is designed to detect structural change in a given quantile τ :
ˆ )) − λH1,T (β(τ ˆ )) SQτ = sup (τ (1 − τ ))−1/2 Hλ,T (β(τ λ∈[0,1]
∞
,
if we have repeated cross-sections, βˆ j (τ ) is the estimate using the jth regime. Then, SQτ (l + 1|l) and DQ (l + 1|l) equals to the maximum of the SQτ ,j and DQj over the l + 1 segments, i.e., SQτ (l + 1|l) = max SQτ ,j , 1≤j≤l+1
DQ (l + 1|l) = max DQj .
where
ˆ )) = Hλ,T (β(τ
T −
−1/2 xt x′t
t =1
1≤j≤l+1
[λT ] −
ˆ )) xt ψτ (yt − x′t β(τ
and
t =1
ψτ (u) = τ − 1(u < 0) if we have a single time series, and
ˆ )) = Hλ,T (β(τ
T − N −
−1/2 ′
xit xit
t =1 i=1
[λT ] − N −
ˆ )) xit ψτ (yit − x′it β(τ
t =1 i=1
ˆ ) is the estimate using the full with repeated cross-sections, β(τ sample assuming no structural change, and ‖.‖∞ is the sup norm, i.e. for a generic vector z = (z1 , . . . , zk ), ‖z ‖∞ = max(z1 , . . . , zk ). The DQ test is designed to detect structural changes in quantiles in an interval Tω :
ˆ )) − λH1,T (β(τ ˆ )) DQ = sup sup Hλ,T (β(τ τ ∈Tω λ∈[0,1]
∞
.
These tests are asymptotically nuisance parameter free and tables for critical values are provided in Qu (2008). They do not require estimating the variance parameter (more specifically, the sparsity), thus having monotonic power even when multiple breaks are present. Qu (2008) provided a simple simulation study. The results show that these two tests compare favorably with Wald-based tests (c.f. Figure 1 in Qu, 2008). We also need the following tests for the purpose of testing l against l + 1 breaks, labeled as SQτ (l + 1|l) test and DQ (l + 1|l) test. The construction follows Bai and Perron (1998). Suppose a model with l breaks has been estimated with the estimates denoted by Tˆ1 , . . . , Tˆl . These values partition the sample into (l + 1) segments, with the jth segment being [Tˆj−1 + 1, Tˆj ]. The strategy proceeds by testing each of the (l+1) segments for the presence of an additional break. We let SQτ ,j and DQj denote the SQτ and DQ test applied to the jth segment, i.e.,
−1/2 SQτ ,j = sup (τ (1 − τ )) Hλ,Tˆ ,Tˆ (βˆ j (τ )) j−1 j λ∈[0,1] − λH1,Tˆj−1 ,Tˆj (βˆ j (τ )) , ∞ ˆ DQj = sup sup Hλ,Tˆ ,Tˆ (βj (τ )) − λH1,Tˆ ,Tˆ (βˆ j (τ )) , j−1 j j−1 j τ ∈Tω λ∈[0,1]
∞
where Hλ,Tj−1 ,Tj (βˆ j (τ ))
=
Tj −
−1/2 xt x′t
[λ(T− j −Tj−1 )]
t =Tj−1 +1
xt ψτ (yt − x′t βˆ j (τ ))
t =Tj−1 +1
if we have a single time series, Hλ,Tj−1 ,Tj (βˆ j (τ ))
=
Tj N − −
t =Tj−1 +1 i=1
−1/2 ′
xit xit
[λ(T− j −Tj−1 )] − N t =Tj−1 +1
i=1
xit ψτ (yit − x′it βˆ j (τ ))
We reject in favor of a model with (l + 1) breaks if the resulting value is sufficiently large. Some additional notation is needed to present the limiting distributions of the SQτ (l + 1|l) and DQ (l + 1|l) tests. Let Bp (s) be a vector of p independent Brownian bridge processes on [0, 1]. Also, let Bp (u, v) = (B(1) (u, v), . . . , B(p) (u, v))′ be a p-vector of independent Gaussian processes with each component defined on [0, 1]2 having zero mean and covariance function E (B(i) (r , u)B(i) (s, v)) = (r ∧ s − rs) (u ∧ v − uv) . The process B(i) (r , u) is often referred to as the Brownian Pillow or tucked Brownian Sheet. Theorem 3. Suppose m = l and that the model is given by (1) or (11) with Assumptions 1–6 or B1–B7 satisfied. Then, P (SQτ (l + 1|l) ≤ x) → Gp (x)l+1 with Gp (x) the distribution function of sups∈[0,1] Bp (s)∞ . If these assumptions hold uniformly in Tω , then
l +1 ¯ ¯ P (DQ (l + 1|l) ≤ x) → Gp (x) with Gp (x) the distribution function of supτ ∈Tω sups∈[0,1] Bp (λ, τ ) ∞ .
The above limiting distributions depend on the number of parameters in the model (p), the number of changes under the alternative hypothesis (l + 1) and the trimming proportion (ω) in the case of the DQ test (note that we assume Tω = [ω, 1 − ω]). Instead of reporting critical values for each case, we conduct extensive simulations and provide relevant information via response surface regressions. Specifically, we first simulate critical values for specifications with 1 ≤ p ≤ 20, 0 ≤ l ≤ 4 and 0.05 ≤ ω ≤ 0.30, with the increment being 0.01. Then, we estimate a class of nonlinear regression of the form: ′ CVi (α) = (z1i β1 ) exp(z2i′ β2 ) + ei ,
where CVi is a simulated critical value for a particular specification ′ i, z1i and z2i indicate the corresponding p, l and ω, ei is an error term, and α is the nominal size. Regressors are selected such that the R2 is not smaller than 0.9999. The selected regressors are
• SQτ (l + 1|l) test: z1 = 1, p, l + 1, 1/p, (l + 1)p, and z2 = 1/(l + 1) ; • DQ (l + 1|l) test: z1 = 1, p, l + 1, 1/p, (l + 1)p, (l + 1)ω and z2 = 1/(l + 1), 1/(l + 1)ω, ω . The estimated coefficients are reported in Table 1, which can then be used for a quick calculation of the relevant critical values for a particular application. We now discuss a procedure that can be used to determine the number of breaks (we consider the interval Tω and focus on the quantiles τ1 , . . . , τq ∈ Tω ).
• Step 1. Apply the DQ test. If the test does not reject, conclude that there is no break and terminate the procedure. If it rejects, then estimate the model allowing one break. Save the estimated break date and proceed to Step 2. • Step 2. Apply the DQ (l + 1|l) tests starting with l = 1. Increase the value of l if the test rejects the null hypothesis. In each stage, the model is re-estimated and the break dates are the global minimizers of the objective function allowing l breaks. Continue the process until the test fails to reject the null.
T. Oka, Z. Qu / Journal of Econometrics 162 (2011) 248–267
255
Table 1 Estimated response surface regression. Test
Size (%)
Level regressors (x1 )
Exponentiated regressors (x2 )
1
p
l+1
1 p
(l + 1)p
SQ (l + 1|l)
10 5 1
1.5432 1.6523 1.8703
0.0265 0.0242 0.0215
0.0388 0.0369 0.0331
−0.2371 −0.2226 −0.1742
−0.0021 −0.0019 −0.0017
DQ (l + 1|l)
10 5 1
0.9481 0.9944 1.0929
0.0062 0.0058 0.0050
0.0166 0.0157 0.0134
−0.1386 −0.1284 −0.1134
−0.0004 −0.0004 −0.0002
(l + 1)ω
1 l+1
1
(l+1)ω
ω
−0.1168 −0.0999 −0.0770 0.0018 0.0017 0.0010
−0.0801 −0.0716 −0.0565
−0.0004 −0.0005 0.0000
−0.0254 −0.0203 −0.0062
Note. p denotes the number of parameters allowed to change, l is the number of breaks under the null hypothesis, and ω is a trimming parameter determining the interval of quantiles being tested: [ω, 1 − ω]. Table 2 Coverage rates for the break date.
(N , T ) (1, 100)
(50, 100)
(100, 100)
√ N times break size
1.0 2.0 3.0 1.0 2.0 3.0 1.0 2.0 3.0
Quantile 0.2
0.3
0.4
0.5
0.6
0.7
0.8
All
0.876 0.895 0.934 0.897 0.905 0.926 0.892 0.892 0.929
0.901 0.930 0.951 0.904 0.902 0.944 0.900 0.897 0.928
0.917 0.933 0.964 0.900 0.915 0.943 0.890 0.905 0.937
0.920 0.934 0.970 0.899 0.908 0.948 0.890 0.900 0.949
0.911 0.923 0.969 0.901 0.908 0.940 0.888 0.910 0.947
0.902 0.916 0.959 0.904 0.902 0.934 0.893 0.901 0.946
0.866 0.882 0.938 0.903 0.911 0.941 0.889 0.892 0.932
0.856 0.908 0.933 0.869 0.912 0.953 0.857 0.904 0.950
Note. The nominal size is 95%. Columns indicated by 0.2–0.8 include results based on a single quantile function. In the last column, the break date is estimated using all seven quantiles.
• Step 3. Let ˆl denote the first value for which the test fails to reject. Estimate the model allowing ˆl breaks. Save the estimated break dates and confidence intervals.
• Step 4. This step treats the q quantiles separately and can be viewed as a robustness check. Specifically, for every quantile τh (h = 1, . . . , q), apply the SQτ and SQτ (l + 1|l) tests. Carry out the same operations as in Steps 1–3. Examine whether the estimated breaks are in agreement with those from Step 3. Since this is a sequential procedure, it is important to consider its rejection error. Suppose there is no break and a 5% significance level is used. Then, there is a 95% chance that the procedure will be terminated in Step 1, implying the probability of finding one or more breaks is 5% in large samples. If there are m breaks with m > 0, then, similarly, the probability of finding more than m breaks will be at most 5%. Of course, the probability of finding fewer than m breaks in finite samples will vary from case to case depending on the magnitude of the breaks. 7. Monte Carlo experiments We focus on the following location-scale model with a single structural change: yit = 1 + xit + δN xit 1(t > T /2) + (1 + xit )uit ,
(12)
where xit ∼ i.i.d. χ 2 (3)/3, uit ∼ i.i.d. N (0, 1), i = 1, . . . , N and t = 1, . . . , T . √ We set T√= 100 and√consider N = 1, 50 and 100. √ δN = 1.0/ N , 2.0/ N and 3.0/ N, where the scaling factor N makes the break sizes comparable across different N’s. Note that the powers of the DQ test against these three alternatives (at 5% nominal level, constructed with Tω = [0.2, 0.8]) are about 27%, 83% and 99% respectively. The powers √ of the sup-Wald test (Andrews, 1993) are√similar. Thus, 1.0/ N can be viewed as a small break and 3.0/ N as a large break. The computational detail is as follows. All parameters are allowed to change when estimating the model. The break date is searched over [0.15T , 0.85T ]. The Bofinger bandwidth is used for obtaining the quantile density function. Finally, all simulation results are based on 2000 replications.
7.1. Coverage rates We examine the coverage property of the asymptotic confidence intervals at 95% nominal level. Seven evenly spaced quantiles (0.2, 0.3, . . . , 0.8) are considered in the analysis. Table 2 presents coverage rates for the break date. The first seven columns are based on a single quantile function. The empirical √ coverage rates are between 86.6% and 92.0% when √ δN = 1.0/ N, between 88.6% and 93.4% when δN√ = 2.0/ N and between 92.6% and 97.0% when δN = 3.0/ N. The values are quite stable across different N’s, suggesting that the framework developed in Section 5 provides a useful approximation. The last column in the table is based on all seven quantiles. The result is quite similar to the single quantile case. Table 3 reports coverage rates for δN . When the break size is √ small (δN = 1.0/ N ) and the break date is estimated using a single quantile function, the confidence interval shows undercoverage, particularly for quantiles near the tail of the distribution. In contrast, when the break date is estimated using the seven quantile functions, the coverage rates are uniformly closer to the nominal rate, with the improvement being particularly important for more extreme quantiles. This suggests that even if one is only interested in a single quantile, say the 20th percentile, it may still be advantageous to borrow information from other quantiles when estimating √ the break date. Note that once the break size reaches 2.0/ N, the coverage rate is satisfactory, being robust to different cross-section sample sizes and to whether the break date is estimated based on a single quantile or multiple quantiles. The above result suggests that the asymptotic framework delivers a useful approximation. A shortcoming to the shrinking break framework is that the confidence intervals are liberal when the true break is small. This problem can be alleviated to some extent by borrowing information across quantiles. It should be noted that a few studies have addressed this under-coverage issue in other contexts and alternative inferential frameworks have been proposed, see Bai (1995) and Elliott and Muller (2007, 2010). A method that allows for multiple breaks has yet to be developed.
256
T. Oka, Z. Qu / Journal of Econometrics 162 (2011) 248–267
Table 3 Coverage rates for the break size parameter.
√
(N , T ) (1, 100)
N times break size
Single quantile
Multiple quantile
(50, 100)
Single quantile
Multiple quantile
(100, 100)
Single quantile
Multiple quantile
Quantile 0.2
0.3
0.4
0.5
0.6
0.7
0.8
1.0 2.0 3.0 1.0 2.0 3.0
0.815 0.928 0.926 0.879 0.939 0.936
0.836 0.943 0.930 0.879 0.947 0.932
0.862 0.944 0.933 0.888 0.946 0.932
0.864 0.950 0.941 0.880 0.954 0.942
0.853 0.937 0.937 0.884 0.941 0.940
0.848 0.935 0.939 0.888 0.936 0.942
0.821 0.921 0.933 0.889 0.926 0.942
1.0 2.0 3.0 1.0 2.0 3.0
0.794 0.948 0.960 0.902 0.956 0.953
0.829 0.946 0.957 0.887 0.952 0.955
0.846 0.951 0.957 0.887 0.958 0.955
0.834 0.952 0.954 0.884 0.952 0.953
0.842 0.948 0.950 0.881 0.949 0.944
0.826 0.948 0.951 0.888 0.951 0.949
0.797 0.935 0.951 0.902 0.945 0.950
1.0 2.0 3.0 1.0 2.0 3.0
0.799 0.935 0.950 0.903 0.939 0.948
0.841 0.946 0.961 0.898 0.949 0.954
0.845 0.942 0.959 0.887 0.947 0.957
0.848 0.947 0.956 0.891 0.947 0.953
0.833 0.940 0.952 0.878 0.941 0.951
0.820 0.949 0.955 0.887 0.951 0.948
0.805 0.940 0.960 0.898 0.953 0.955
Note. The nominal coverage rate is 95%. ‘‘Single quantile’’: the break date is estimated based on one quantile function. ‘‘Multiple quantiles’’: the break date is estimated using all seven quantiles. Conditioning on the estimated break date, no other restrictions are imposed across quantiles. Table 4 Comparisons of different break estimators. Error (uit ) N (0, 1)
(N , T ) (1, 100)
(50, 100)
(100, 100)
t (2.5)
(1, 100)
(50, 100)
(100, 100)
√ N times break size
Multiple quantiles
Mean
MAD
Median IQR90
MAD
IQR90
MAD
IQR90
1.0 2.0 3.0 1.0 2.0 3.0 1.0 2.0 3.0
13.37 5.55 2.12 14.29 6.69 2.88 14.74 6.87 2.85
63 31 10 64 39 15 66 39 15
12.50 4.93 1.90 12.53 4.93 2.14 13.32 5.09 2.00
61 27 10 63 27 11 64 28 11
13.93 5.69 2.26 14.31 6.36 2.64 14.60 6.44 2.85
64 33 13 66 36 15 65 35 16
1.0 2.0 3.0 1.0 2.0 3.0 1.0 2.0 3.0
14.91 7.43 3.58 15.37 7.88 3.46 15.16 7.68 3.68
65 42 20 66 46 19 65 43 20
14.89 7.43 3.69 15.02 6.95 3.16 14.81 7.11 3.22
66 43 19 65 39 17 65 39 17
17.88 11.48 6.97 19.03 14.25 9.82 19.34 13.94 10.06
68 61 43 68 65 55 68 65 57
Note. MAD: Mean Absolute Deviation. IQR90: the distance between 90% and 10% quantiles of the empirical distribution. ‘‘Multiple quantiles’’: the estimation is based jointly on quantiles 0.2, 0.3, . . . , 0.8.
7.2. Empirical distribution of break date estimates We compare estimates based on the median regression, the joint analysis of seven quantiles and the conditional mean regression. In addition to letting uit being N (0, 1), we also consider a tdistribution with 2.5 degrees of freedom with other specifications unchanged. Table 4 reports the mean absolute deviation (MAD) and the 90% inter-quantile range (IQR90) of the estimates. The upper panel corresponds to uit being N (0, 1). It illustrates that the estimates based on the median and mean regression have similar properties, while the estimates based on multiple quantile functions have noticeably higher precision. The lower panel corresponds to uit being t (2.5). It shows that the estimates based on the median and multiple quantile functions are similar and are often substantially more precise than the conditional mean regression. Thus, there can be an important efficiency gain from considering quantilebased procedures in the presence of fat-tailed error distributions, even if the goal is to detect changes in the central tendency. Even though this is documented using a very simple model, the result
should carry through to more general settings. Similar findings are reported in Bai (1998) in a median regression framework with i.i.d. errors. 8. Empirical applications 8.1. US real GDP growth It is widely documented that the volatility of the US real GDP growth has declined substantially since the early to mid-1980s. For example, McConnell and Perez-Quiros (2000) considered an AR(1) model for the GDP growth and found a large break in the residual variance occurring in the first quarter of 1984. We revisit this issue using a quantile regression framework. The data set we use contains quarterly real GDP growth rates for the period 1947:2 to 2009:2. It is obtained from the web page of the St. Louis Fed (the GDPC96 series) and corresponds to the maximum sample period available at the time of writing our paper. We consider the following model for the annualized quarterly growth rate yt :
T. Oka, Z. Qu / Journal of Econometrics 162 (2011) 248–267 Table 5a Structural breaks in the US real GDP growth rate. Panel (a). Joint analysis of multiple quantiles DQ (1|0) DQ (2|1) Break date 95% C.I.
0.994* 0.612 84:1 [77:3, 84:2]
Panel (b). Separate treatment of individual quantiles Quantile SQ (1|0) SQ (2|1) Break date 95% C. I.
0.20 1.423 – – –
0.35 1.310 – – –
0.50 1.019 – – –
0.65 1.818** 0.964 84:2 [68:2, 87:4]
0.80 2.078** 1.116 84:1 [78:4,90:1]
Note. The sample period is 1947:2–2009:2. The model is a quantile autoregressive model with all parameters allowed to change. C.I. denotes the 95% confidence interval. * Statistical significance at 5% level. ** Statistical significance at 1% level.
Qyt (τ |yt −1 , . . . , yt −p )
= µj (τ ) +
p −
αi,j (τ )yt −i (t = Tj0−1 + 1, . . . , Tj0 ),
i=1
where j is the index for the regimes and Tj0 corresponds to the last observation from the jth regime. The break dates and the number of breaks are assumed to be unknown. We consider five equally spaced quantiles, τ = 0.20, 0.35, 0.50, 0.65, 0.80, which are chosen to examine both the central tendency and the dispersion of the conditional distribution. The Bayesian Information Criterion is applied to determine the lag order of the quantile autoregressions, with the maximum lag order set to int[12(T /100)1/4 ], where T is the sample size. The criterion selects 2 lags for the quantile τ = 0.20 and one lag for the other quantiles. We take a conservative approach and set p = 2 for all five quantiles under consideration. First, we study the five quantiles jointly. The results are summarized in Panel (a) of Table 5a. The DQ test, applied to the interval [0.2, 0.8], equals 0.994. This exceeds the 5% critical value, which is 0.906, suggesting at least one break is present. The DQ (2|1) test equals 0.612 and is below the 10% critical value. Therefore, we conclude that only one break is present. The break date estimated using all five quantiles is 1984:1 with a 95% confidence interval [77:3, 84:2]. This finding is consistent with McConnell and Perez-Quiros (2000). Next, we analyze the quantiles separately. The results are summarized in Panel (b) of Table 5a. The SQτ test detects structural change only in the upper quantiles (τ = 0.65, 0.80), but not the median and the lower quantiles (τ = 0.20, 0.35). For τ = 0.65, the estimated break date is 1984:2 and for τ = 0.80 the date is 1984:1. This confirms that the break is common to both quantiles. Table 6 reports coefficient estimates conditional on the break date 1984:1. For both quantiles (τ = 0.65, 0.80), the structural
257
change is characterized by a large decrease in the intercept and a small change in the sum of the autoregressive coefficients. Thus, overall the dispersion of the upper tail has decreased substantially. Overall, the results suggest that a major change in the GDP growth should be attributed to the fact that the growth was less rapid during expansions, while the recessions have remained just as severe when they occurred. We also considered a finer partition including nine evenly spaced quantiles from 0.20 to 0.80. The results are reported in Table 5b. The application of the DQ test reports one break, estimated at 1984:2, with the 95% confidence interval being [1980:1, 1986:4]. The consideration of individual quantiles shows that only τ = 0.65, 0.725 and 0.80 are affected by the break, with the dates being 1984:2, 1984:2 and 1984:1, respectively. Overall, the results are consistent with the five-quantile specification considered earlier. To further examine the robustness of the result, we repeated the analysis excluding observations from 2008–2009. The DQ (l + 1|l) test detects one break and the estimated break date is 1984:2, confirming our findings using the full sample. We also studied the quantiles separately with the results summarized in Table 7. It shows that the upper quantiles are affected by structural change while the median and the 35th percentile are stable. The only difference from the full sample case is that the 20th percentile also exhibits a break. However, the break date is 1958:1 and there is no statistically significant break during the 1980s. Thus, the general picture still holds. 8.2. Underage drunk driving Motor vehicle crash is the leading cause of death among youth ages 15–20, a high proportion of which involves drunk driving. Here we study the structural change in the blood alcohol concentration (BAC) among young drivers involved in traffic accidents. The study is motivated by the fact that the BAC level is an important measure of alcohol impairment, whose changes deliver important information on whether and how young driver’s drinking behavior has changed over time. The data set contains information on young drivers (less than 21 years old) involved in motor vehicle accidents for the state of California over the period 1983–2007. It is obtained from National Highway Traffic Safety Administration (NHTSA), which reports the BAC level of the driver, his/her age, gender and whether the crash is fatal. For some observations, the BAC levels were not measured at the accident and their values were reconstructed using multiple imputations. They constitute about 26% of the sample. In these cases the first imputed value is used in our analysis. The numbers of observations in each quarter vary between 108 and 314 with the median being 191. We start by constructing a representative random subsample, containing 108 observations in
Table 5b Structural breaks in the US real GDP growth rate (nine quantiles). Panel (a). Joint analysis of multiple quantiles DQ (1|0) DQ (2|1) Break date 95% C. I.
0.994∗ 0.608 84:2 [80:1, 86:4]
Panel (b). Separate treatment of individual quantiles Quantile SQ (1|0) SQ (2|1) Break date 95% C.I. Note. See Table 5a.
0.20 1.423 – – –
0.275 1.409 – – –
0.35 1.310 – – –
0.425 1.006 – – –
0.50 1.019 – – –
0.575 1.097 – – –
0.65 1.818∗∗ 0.964 84:2 [68:2,87:4]
0.725 2.110∗∗ 1.165 84:2 [77.4, 86.4]
0.80 2.078∗∗ 1.116 84:1 [78:4, 90:1]
258
T. Oka, Z. Qu / Journal of Econometrics 162 (2011) 248–267
Table 6 Coefficient estimates for the US real GDP growth rate. Quantile Break date
0.20 NA µ1 (τ ) −0.928 (0.586) α1,1 (τ ) 0.288∗∗ (0.075) α2,1 (τ ) 0.236∗∗ (0.078) µ2 (τ ) − µ1 (τ ) – – α1,2 (τ ) − α1,1 (τ ) – – α2,2 (τ ) − α2,1 (τ ) – –
0.35 NA 0.697 (0.366) 0.282∗∗ (0.052) 0.159∗∗ (0.023) – – – – – –
0.50 NA 1.938∗∗ (0.334) 0.335∗∗ (0.054) 0.049 (0.035) – – – – – –
0.65 84:1 4.507∗∗ (0.665) 0.411 ∗∗ (0.105) −0.105 (0.104) −2.431∗∗ (0.739) −0.250 (0.148) 0.425∗∗ (0.154)
0.80 84:1 6.129∗∗ (0.493) 0.374** (0.074) −0.091 (0.073) −3.089∗∗ (0.772) −0.211 (0.116) 0.405∗∗ (0.142)
Note. The sample period is 1947:2–2009:2. The model is Qyt (τ |yt −1 , yt −2 ) = µ1 (τ ) + α1,1 (τ )yt −1 + α2,1 (τ )yt −2 , if t ≤ T1 Standard errors arein parentheµ2 (τ ) + α1,2 (τ )yt −1 + α2,2 (τ )yt −2 , if t > T1 . ses. * and ** denote Statistical significance at 5% level 1% level, respectively.
each quarter with 10,800 observations in total.3 It should be noted that such a procedure does not introduce bias into our estimates. However, it does involve some arbitrariness and later in the paper we will report relevant results using the full sample to address this issue. We consider the following model Qyit (τ |xit ) = µj (τ ) + x′it γj (τ )
(t = Tj0−1 + 1, . . . , Tj0 ),
where yit is the BAC level. The BAC levels below the 62nd percentile are identically zero. Thus, we consider only the upper quantiles
3 We used the ‘‘surveyselect’’ procedure (SAS) with ‘‘Method’’ set to srs (simple random sampling without replacement) and ‘‘Seed’’ set to 2009.
0.70, 0.75, 0.80 and 0.85, all of which have a positive BAC in the aggregate. The consequence of such an action will be examined later in the paper. A ‘‘general to specific’’ approach is adopted to determine which variables to include in the regression. We start with a regression that includes age, gender, and quarterly dummies. The dummy variable for a fatal crash is not included to avoid possible endogeneity. The model is estimated assuming there is no structural change, with insignificant regressors sequentially eliminated until the remaining ones are significant at 10% level. This leaves the age, gender and a dummy variable for the fourth quarter (labeled as the winter dummy) in the regression. We first analyze the four quantiles jointly. The results are summarized in Table 8a. The DQ (l + 1|l) test, applied to the interval [0.70, 0.85], reports two breaks. Their dates are 1985:1 and 1992:2 with the 95% confidence intervals being [84:1, 86:1] and [91:2,92:3]. We then consider the quantiles separately. The test suggests that the 70th, 75th and 80th percentiles are affected by two breaks, while the 85th percentile is only affected by the second break. The first estimated break is 1985:2 and the second is either 1992:2 or 1993:2. Although there is some local variation, overall these estimates are consistent with the ones based on multiple quantiles. It is interesting to point out that the confidence intervals for these two breaks include two historically important policy changes. Specifically, the National Minimum Drinking Age Act (MDA) was passed on July 17, 1984. The federal beer tax was doubled in 1991, while the California state beer tax experienced a four-fold increase in the same year. To examine the result’s sensitivity, we consider a finer partition including seven evenly spaced quantiles from 0.70 to 0.85. The results are reported in Table 8b. The application of the DQ test reports two breaks, estimated at the same dates with identical 95%
Table 7 Results for the US real GDP growth rate based on a subsample. Quantile Number of breaks Break date 95% C.I.
µ1 (τ )
α1,1 (τ ) α2,1 (τ ) µ2 (τ ) − µ1 (τ ) α1,2 (τ ) − α1,1 (τ ) α2,2 (τ ) − α2,1 (τ )
0.20 1 58:1 [57:4, 61:3] −3.616 (1.928) 0.879∗∗ (0.259) −0.191 (0.274) 2.927 (1.992) −0.606∗ (0.271) 0.424 (0.282)
0.35 0 NA NA 0.954∗ (0.363) 0.242∗∗ (0.042) 0.153∗∗ (0.024) – – – – – –
0.50 0 NA NA 2.213∗∗ (0.336) 0.283∗∗ (0.055) 0.033 (0.039) – – – – – –
0.65 1 84:2 [72:3, 87:2] 4.517∗∗ (0.665) 0.420 ∗∗ (0.105) −0.105 (0.104) −1.305∗∗ (0.765) −0.453 (0.141) 0.287∗ (0.141)
0.80 1 84:2 [83:3, 86:1] 6.129∗∗ (0.491) 0.374∗∗ (0.074) −0.091 (0.073) −2.378∗∗ (0.537) −0.441∗∗ (0.130) 0.427∗∗ (0.132)
Note. See Table 6. The sample period is 1947:2–2007:4. Table 8a Structural breaks in the blood alcohol concentration. Single quantile
Tests (H0 vs H1 ) 0 vs 1 1 vs 2 2 vs 3 Number of breaks 1st Break date 95% C.I. 2nd Break date 95% C.I.
Multiple quantiles
0.70
0.75
0.80
0.85
5.176** 2.085** 1.339 2 85:2 [84:2, 86:3] 92:2 [91:3, 92:3]
4.613** 2.036** 1.259 2 85:2 [84:3, 86:2] 93:2 [92:1, 95.3]
3.503** 1.842* 1.303 2 85.2 [84:1, 86:2] 93:2 [91:4, 95:4]
3.258** 1.027 – 1 – – 92:2 [91:1, 94:4]
2.372** 1.048** 0.616 2 85:1 [84:1, 86:1] 92:2 [91:2, 92:3]
Note. The columns under ‘‘Single Quantile’’ report the SQ (l + 1|l) test and the estimated break dates based on individual quantiles. The last column reports the DQ (l + 1|l) test over the interval [0.70, 0.85] and break dates estimates based on four quantiles: 0.70, 0.75, 0.80 and 0.85. * Statistical significance at 5% level. ** Statistical significance at 1% level.
T. Oka, Z. Qu / Journal of Econometrics 162 (2011) 248–267
259
Table 8b Structural breaks in the blood alcohol concentration (seven quantiles). Single quantile
Tests (H0 vs H1 ) 0 vs 1 1 vs 2 2 vs 3 Number of breaks 1st Break Date 95% C.I. 2nd Break Date 95% C.I.
Multiple quantiles
0.70
0.725
0.75
0.775
0.80
0.825
0.85
5.176∗∗ 2.085∗∗ 1.339 2 85:2 [84:2, 86:3] 92:2 [91:3, 92:3]
4.999∗∗ 2.088∗∗ 1.318 2 85:2 [84:3, 86:2] 93.2 [92:1, 95:1]
4.613∗∗ 2.036∗∗ 1.259 2 85:2 [84:3, 86:2] 93:2 [92:1, 95.3]
4.309∗∗ 1.958∗∗ 1.314 2 85:2 [84:2, 86:3] 93:2 [91:2, 96.1]
3.503∗∗ 1.842∗ 1.303 2 85.2 [84:1, 86:2] 93:2 [91:4, 95:4]
3.610∗∗ 1.230 – 1 – – 92.2 [91:1, 94:1]
3.258∗∗ 1.027 – 1 – – 92:2 [91:1, 94:4]
2.372∗∗ 1.048∗∗ 0.616 2 85:1 [84:1, 86:1] 92:2 [91:2, 92:3]
Note. See Table 8a. The last column reports the DQ (l + 1|l) test over [0.70, 0.85] and break dates estimates based on seven quantiles.
confidence intervals as the four-quantile case. The consideration of individual quantiles shows that the quantiles τ = 0.70 to 0.80 are affected by two breaks and τ = 0.825 and 0.85 are affected by one break. The estimated break dates and confidence intervals are consistent with the four-quantile case. Overall, the findings remain qualitatively similar. Figs. 1 and 2 report the changes in the quantile functions for representative values of xit conditioning on break dates 1985:1 and 1992:2. They cover three age groups: 17, 18 and 19, which correspond to the 25th, 50th and 75th percentile of the unconditional distribution of xit . Males and females are reported separately. The winter dummy is set to zero (setting it to one produces similar results and is omitted to save space). Fig. 1 presents results for males. Each panel contains the changes and their pointwise 95% confidence intervals. Three interesting patterns emerge. First, the changes are all negative. They are also economically meaningful because BAC levels as low as 0.02 can affect a person’s driving ability with the probability of a crash increasing significantly after 0.05 according to studies by NHTSA. Second, for the first break, the change becomes smaller as age increases while for the second the opposite is true. Finally and most importantly, the change is smaller for higher quantiles, suggesting that the policies are more effective for ‘‘light drinkers’’ than for ‘‘heavy drinkers’’ in the sample. This is unfortunate since heavy drinkers are more likely to cause an accident. Fig. 2 presents results for females. The findings are qualitatively similar, except for the second break the change is more homogeneous across quantiles. It should also be noted that for the first break, the confidence intervals at the 85th percentile typically include zero (except for the first figure in 1(a)). This is consistent with the findings in Table 8a, where for this quantile only one break is detected. 8.2.1. Further robustness analysis We focus on the following two issues: (1) the distribution of BAC levels has a mass at zero, and (2) the analysis has been conducted on a subsample. To address the first issue, we apply the censored quantile regression of Powell (1986) conditioning on the break dates 1985:1 and 1992:2. The model is estimated using the crq procedure in quantreg. The estimated changes are reported in Figs. 1 and 2 (the solid line with triangle). The results are very similar. To address the second issue, the break dates are re-estimated using the full sample. The estimates are 1985:2 and 1992:3 using the four quantile functions. Conditioning on the break dates, the model is re-estimated using both quantile regression and censored quantile regression. The estimated changes are found to be very similar to the ones reported in Figs. 1 and 2. Most importantly, the three patterns discussed above still hold. The details are not reported here to save space. Thus, the results remain qualitatively the same after accounting for these two issues. In summary, this empirical application, although quite simple, illustrates that rich information can be extracted from considering structural change in the conditional quantile function.
9. Conclusions We have considered the estimation of structural changes in regression quantiles, allowing for both time series models and repeated cross-sections. The proposed method can be used to determine the number of breaks, estimate the break locations and other parameters, and obtain corresponding confidence intervals. A simple simulation study suggests that the asymptotic theory provides a useful approximation in finite samples. The two empirical applications, to the ‘‘Great Moderation’’ and underage drunk driving, suggest that our framework can potentially deliver richer information than simply considering structural change in the conditional mean function. Appendix We provide detailed proofs for results in Section 5. They imply Lemma 1, Theorem 1 and Corollary 1 as special cases (N = 1). All limiting results are derived using T → ∞ with N fixed, or (N , T ) → ∞. For a given τ and φ ∈ Rp , define qτ ,it (φ) = ρτ (u0it (τ ) − x′it φ) − ρτ (u0it (τ )) and Qk (τ , φ) =
k − N −
qτ ,it (φ).
t =1 i =1
The following decomposition, due to Knight (1998), will be used repeatedly in the analysis: Qk (τ , φ) = Wk (τ , φ) + Zk (τ , φ),
(A.1)
where Wk (τ , φ) = −
k − N −
ψτ (u0it (τ ))x′it φ with ψτ (u)
t =1 i=1
= τ − 1(u < 0), k − N ∫ x′ φ − it Zk (τ , φ) = (1(u0it (τ ) < s) − 1(u0it (τ ) < 0))ds. (A.2) t =1 i=1
0
Wk (τ , φ) is a zero-mean process for fixed τ and φ , while Zk (τ , φ) is in general not. Define bit (τ , φ) = x′it 1(u0it (τ ) < x′t φ)
ξit (τ , φ) = {bit (τ , φ) − bit (τ , 0)} − Et −1 {bit (τ , φ) − b(τ , 0)} , where Et −1 is taken with respect to the σ -algebra generated by
N
xit , yi,t −1 , xi,t −1, yi,t −2 , . . . i=1 . Note that ξit (τ , φ) forms an array of martingale differences for given τ and φ . The following lemma provides upper and lower bounds for Zk (τ , φ) in terms of bit (τ , φ).
260
T. Oka, Z. Qu / Journal of Econometrics 162 (2011) 248–267
Fig. 1. Structural changes in young drivers’ blood alcohol concentration (male).
Lemma A.1. Suppose there is no structural change. Then, for every k = 1, . . . , T , 0 ≤ (1/2)
k − N − t =1 i =1
{bit (τ , φ/2) − bit (τ , 0)} φ ≤ Zk (τ , φ)
≤
k − N −
{bit (τ , φ) − bit (τ , 0)} φ.
t =1 i =1
Proof of Lemma A.1. We consider the (i, t )th term in the summation (A.2). If x′it φ ≥ 0, then it is bounded from below by
T. Oka, Z. Qu / Journal of Econometrics 162 (2011) 248–267
261
(a) Results for the 17-year-old
(b) Results for the 18-year-old
(c) Results for the 19-year-old
Fig. 2. Structural changes in young drivers’ blood alcohol concentration (female).
∫
x′it φ x′it φ/2
{1( (τ ) < s) − 1( (τ ) < 0)}ds u0it
∫
x′it φ
≥ x′it φ/2
u0it
∫
0
−|x′it φ|
∫ {1( (τ ) < xit φ/2) − 1( (τ ) < 0)}ds u0it
′
u0it
= (1/2){bit (τ , φ/2) − bit (τ , 0)}φ ≥ 0. If x′it φ < 0, then this term is equal to
{1(u0it (τ ) < 0) − 1(u0it (τ ) < s)}ds −|x′it φ|/2
≥ −|x′it φ|
{1(u0it (τ ) < 0) − 1(u0it (τ ) < −|x′it φ|/2)}ds
= (1/2){bit (τ , φ/2) − bit (τ , 0)}φ ≥ 0. Taking the summation yields the lower bound. The upper bound can be proved similarly.
262
T. Oka, Z. Qu / Journal of Econometrics 162 (2011) 248–267
The next lemma will be used to study the asymptotic properties of the bounds derived in the previous lemma.
brackets, after applying (A.3), is bounded by
−γ /2
Lemma A.2. Suppose there is no structural change and Assumptions B1–B3 and B5–B6 hold.
= φ ∈ Rp : ‖φ‖ = A(NT )−1/2 with A being some
1. Let Φ1 arbitrary constant, then
[Ts] − N − −1/2 sup sup (NT ) ξit (τ , φ) = op (1). s∈[0,1] φ∈Φ1 t =1 i =1 2. Let Φ2 = φ ∈ Rp : ‖φ‖ = A(Nk)−1/2 (log NT )1/2 with A being some arbitrary constant. Then, for any ϵ > 0 and η > 0, there exists a TL < ∞, such that for all T ≥ TL , k − N − P sup sup (Nk)−1/2 (log NT )−1/2 ξit (τ , φ) 1≤k≤T φ∈Φ2 t =1 i=1
(NT )
E
AUf (NT )
−1
γ
T − N −
‖xit ‖
3
t =1 i =1
γ ≤ (NT )−γ /2 AUf M → 0 as T → ∞
(A.4)
due to Assumption B5(d). The second term can be rewritten as
−γ +1/2
(NT )
E (NT )−1/2
T − N −
E t −1
t =1 i =1
× ‖ξit (τ , φ)‖
2 γ −2
‖ξit (τ , φ)‖
2
.
Because ‖ξit (τ , φ)‖ ≤ ‖xit ‖, the preceding quantity is less than or equal to
(NT )−γ +1/2 E (NT )−1/2
> η < ϵ.
T − N −
‖xit ‖2γ −2 Et −1 ‖ξit (τ , φ)‖2
t =1 i =1
−γ +1/2
3. Let hT and dT be positive sequences such that as T → ∞, hT is nondecreasing, hT → ∞, ∞ and (hT d2T )/T → h with dT → p 0 < h < ∞. Let Φ3 = φ ∈ R : ‖φ‖ = (NT )−1/2 dT . Then, for any ϵ > 0, D > 0 and B > 0, there exists TL < ∞, such that for all T ≥ TL ,
k N −1 −1/2 − − P sup sup k N ξit (τ , φ) BhT ≤k≤T φ∈Φ3 t =1 i=1 √ > DdT / T < ϵ.
AUf (NT )
E
T − N −
‖xit ‖
2γ +1
t =1 i=1
≤ (NT )−γ +1/2 AUf M → 0 as T → ∞,
(A.5)
where the first inequality uses (A.3) and the second uses Assumption B5(d). (A.4) and (A.5) imply that Lemma A.2(1) holds for any given φ ∈ Φ1 . Consider the second result. Note that
k ∑ N ∑ ξ (τ , φ) it t =1 i=1 P sup > η 1 / 2 1≤k≤T (Nk log(NT )) k ∑ N ∑ ξ (τ , φ) it T − t =1 i =1 ≤ P > η . 1 / 2 Nk log ( NT )) ( k=1
Proof of Lemma A.2. Note that ξit (τ , φ) has the same dimension as xit . Without loss of generality, we assume that it is a scalar. Otherwise, the subsequent can be applied to each of its elements. Further, we assume xit is nonnegative. Otherwise, we can write − + − xit = x+ it − xit ≡ xit 1(xit ≥ 0) − (−xit )1(xit ≤ 0). Then, xit and xit are nonnegative and satisfy the assumptions stated in the paper. We will only prove the result for a fixed φ , because the uniformity over Φ1 , Φ2 or Φ3 follows from the compactness of these sets and the monotonicity of bit (τ , φ) in x′it φ , which can be verified using the same argument as in Theorem A3(ii) in Bai (1996). Consider the first result. For any φ ∈ Φ1 , ξit (τ , φ) satisfies Et −1 ‖ξit (τ , φ)‖2
≤ ‖xit ‖2 Fit (x′it β 0 (τ ) + (NT )−1/2 A ‖xit ‖) − τ ≤ (NT )−1/2 AUf ‖xit ‖3 ,
≤ (NT )
−1
(A.3)
where Uf is defined in Assumption B2. Applying the Doob inequality and the Rosenthal inequality as in Bai (1996, p. 618), we have, for any N and T ,
[Ts] − N − −1/2 P sup (NT ) ξit (τ , φ) > ϵ s∈[0,1] t =1 i =1 γ T − N − M1 −γ 2 ≤ 2γ (NT ) E Et −1 ‖ξit (τ , φ)‖ ϵ t =1 i =1 T − N − 2γ −γ + (NT ) E Et −1 ‖ξit (τ , φ)‖
The rest of the proof is similar to the first result. Applying the Markov inequality followed by the Rosenthal inequality, we have, for any η > 0 and some γ > 2,
k ∑ N ∑ ξit (τ , φ) T − t =1 i=1 > η P 1/2 Nk log ( NT )) ( k=1 γ k − N T − − M2 −1 3 ‖xit ‖ ≤ E (Nk) (Nk log(NT ))γ /2 η2γ k=1 t =1 i=1 k − N − −1 2 γ +1 ‖xit ‖ + (Nk) E
t =1 i=1
where γ is defined in Assumption B5(d) and M1 is some constant that depends only on p and γ . The first term inside of the curly
t =1 i=1
≤
2M2 M
(N log(NT ))γ /2 η2γ
− T
k−γ /2
k=1
where M2 is a constant that only depends on p, Uf , γ and A, and the second inequality uses Assumption B5(d). Because γ > 2, the ∑T −γ /2 summation is finite. The term inside the parentheses k=1 k converges to zero. Thus, Lemma A.2(2) holds. For the third result, applying the same argument as above and using Et −1 ‖ξit (τ , φ)‖2 ≤ (NT )−1/2 dT Uf ‖xit ‖3
T. Oka, Z. Qu / Journal of Econometrics 162 (2011) 248–267
for any φ ∈ Φ3 , we have
k N √ −1 −1/2 − − P sup k N ξit (τ , φ) > DdT / T BhT ≤k≤T t =1 i=1 γ γ − T 1/2 Uf ≤ M1 1/2 2
kdT
k≥BhT
× E (Nk)−1
N
= (a) − ‖(b)‖ .
D
(log NT )−1 (a) ≥
‖xit ‖3 ≥
t =1 i =1
+
T
1/2 2γ −1
Uf
E
D2γ
kdT
≤ M3
− T 1/2 γ kdT
k≥BhT
where M3 = M1 M max preceding line as
k
(Nk)−1
−
‖xit ‖2γ +1
+
T
1/2
2 γ −1
kdT
γ
Uf /(N 1/2 D2 )
,
(A.6)
, Uf /D2γ . Rewrite the
γ
Bγ −1
+
hT d2T M3 B2γ −2
T
k≥BhT
2γ −2
T hT d2T
d2T
kγ
γ − 32 − (BhT )2γ −2
T
γ −1
k≥BhT
k2γ −1
.
γ /2−1
For the first term, T /(hT d2T ) → h1−γ < ∞, d2T /T → 0 because γ > 2, and the summation inside the curly brackets is finite by the Euler-Maclaurin formula. Thus, this term converges to zero. The second term also converges to zero by the same argument.
The next lemma provides convergence rates for parameter estimates using subsamples. Lemma A.3. Suppose that there is no structural change and that Assumptions B1–B3 and B5–B6 hold. Let βˆ k be the quantile regression estimate of β 0 (τ ) using observations t = 1, . . . , k, and φk = βˆ k − β 0 (τ ). Then: 1. (log NT )−1/2 (Nk)1/2 ‖φk ‖ = Op (1) uniformly over k ∈ [1, T ]. 2. For any 0 < δ < 1, (NT )1/2 ‖φk ‖ = Op (1) uniformly over k ∈ [δ T , T ]. Proof of Lemma A.3. We only prove the first result; the proof of the second is similar and simpler. The proof is by contradiction, i.e., showing that otherwise the objective function Qk (τ , φk ) will be strictly positive with probability close to 1, implying that βˆ k cannot be its minimizer. Due to the convexity of Qk (τ , φk ),4 it suffices to consider its property over φk satisfying
(log NT )−1/2 (Nk)1/2 ‖φk ‖ = B, where B is an arbitrary positive constant. Apply the Knight identity (A.1) and study the terms Zk (τ , φk ) and Wk (τ , φk ) separately. For Zk (τ , φk ), by Lemma A.1, Zk (τ , φk ) ≥ (1/2)
k − N −
{bit (τ , φk /2) − bit (τ , 0)} φk
t =1 i =1
≥
k N 1 −−
2 t =1 i =1
4 1 4
(log NT )−1 Lf
k − N −
φk′ xit x′it φk
t =1 i =1
Lf B2 λmin ,
where λmin is the minimum eigenvalue of (Nk)−1 t =1 i=1 xit x′it , the first inequality is due to the mean value theorem, and the second inequality uses (log NT )−1/2 (Nk)1/2 ‖φk ‖ = B. Term (b), after dividing by (log NT ), is of order op (B) uniformly in k ∈ [1, T ] by Lemma A.2(2). Thus,
(log NT )−1 ((a) − ‖(b)‖) ≥
1 8
∑N
Lf B2 λmin
(A.8)
with probability close to 1 uniformly in k ∈ [1, T ] for large T . Now consider Wk (τ , φk ).
γ −1 2 2 −1 − T dT (BhT )γ −1
M3
1
∑k
t =1
(A.7)
Term (a) satisfies
γ
k − N −
263
k − N 1 − − ξit (τ , φk /2)φk 2 t =1 i=1
(log NT )−1 |Wk (τ , φk )| k − N − ψτ (u0it (τ ))x′it ‖φk ‖ ≤ (log NT )−1 t =1 i=1 k − N − −1/2 −1/2 0 ′ (log NT ) ψτ (uit (τ ))xit . = B (Nk) t =1 i=1
Applying the Hájek-Rényi inequality for martingales (see, e.g. Chow and Teicher, 2003, p. 255),
k − N − −1/2 −1/2 0 ′ (Nk) P sup (log NT ) ψτ (uit (τ ))xit > C 1≤k≤T t =1 i =1
≤
N T 1 −−
C2
i=1 t =1
E ‖xit ‖2 Nt log(NT )
,
where C is an arbitrary constant. Because E ‖xit ‖2 < ∞ and ∑T −1 = (log T ), the left-hand side can be made arbitrarily t =1 t small by choosing a large C . Therefore, if B is large, the term (A.9) will be dominated by (A.8) asymptotically, implying Qk (τ , φk ) will be strictly positive with probability close to 1 for large T . This contradicts the fact that φk minimizes Qk (τ , φ), thus proving the first result. The second result can be proved along the same lines, by applying Lemma A.2(1) to term (b) and the Hájek-Rényi inequality Wk (τ , φk ). The next lemma shows that the objective function Qk (τ , φ) can be bounded in various ways when the model is estimated using subsamples of various sizes. It is an extension of Lemma A.1 in Bai (1995) along the two directions, i.e., by allowing for time series dynamics and a cross-sectional dimension. Lemma A.4. Suppose there is no structural change and Assumptions B1–B3 and B5–B6 hold. 1. For any δ ∈ (0, 1), supδ T ≤k≤T infφ Qk (τ , φ) = Op (1).
Et −1 {bit (τ , φk /2) − bit (τ , 0)} φk
2. sup1≤k≤T infφ Qk (τ , φ) = Op (log NT ). 3. For any δ ∈ (0, 1), ϵ > 0 and D > 0 and T sufficiently large
4 If g (θ) is convex, then for any γ ≥ 1, g (γ θ) − g (0) ≥ γ (g (θ) − g (0)).
(A.9)
P
inf
inf
δ T ≤k≤T ‖φ‖≥(NT )−1/2 log NT
Qk (τ , φ) < D log NT
< ϵ.
264
T. Oka, Z. Qu / Journal of Econometrics 162 (2011) 248–267
4. For any δ ∈ (0, 1), ϵ > 0 and D > 0, there exists A > 0 such that when T is sufficiently large
P
inf
inf
δ T ≤k≤T ‖φ‖≥A(NT )−1/2
Qk (τ , φ) < D
(log NT )−1 Zk (τ , φ) = Op (1).
< ϵ.
5. Let hT and dT be positive sequences such that hT is nondecreasing, hT → ∞, dT → ∞ and (hT d2T )/T → h with 0 < h < ∞. Then for each ϵ > 0 and D > 0 there exists an A > 0, such that when T is large enough,
P
inf
inf
AhT ≤k≤T ‖φ‖≥dT (NT )−1/2
Qk (τ , φ) < D
For Zk (τ , φ), applying the same argument as in Lemma A.4(1). (cf. (A.11)), we have
< ϵ.
Proof of A.4(3). Due to convexity, it suffices to consider ‖φ‖ = (NT )−1/2 log NT and show
P
inf
inf
≥ inf
− sup
Proof of A.4(1). By Lemma A.3, uniformly over k ∈ [δ T , T ]. Thus, it suffices to prove
|Qk (τ , φ)| = Op (1) for any A > 0.
sup
sup
(A.10)
|Qk (τ , φ)| = Op (1) for any A > 0.
inf
sup
1≤k≤T ‖φ‖=A(NT )−1/2
sup
≤ sup (log NT )
1
(log NT )−2 Zk (τ , φ) ≥
8
δ Lf λmin
(log NT )−2 |Wk (τ , φ)|
k − N − −1/2 0 ′ ψτ (uit (τ ))xit (NT ) t =1 i=1
= op (1). The result follows by combining the above two results. Proof of A.4(4). It is similar to A.4(3) and is omitted.
Proof of A.4(5). Due to convexity, it is sufficient to show
P
inf
inf
AhT ≤k≤T ‖φ‖=dT (NT )−1/2
Qk (τ , φ) < D
< ϵ.
First consider k−1 Wk (τ , φ). Let C be an arbitrary constant, we have
(A.11)
The first term on the right-hand side is bounded from above ∑T ∑N ′ ′ by Uf t =1 i=1 φ xi xi φ uniformly in φ because of the mean value theorem, which is further bounded by A2 Uf λmax because φ = A(NT )−1/2 . The second term is op (1) by Lemma A.2(1). Thus, sup1≤k≤T sup‖φ‖=A(NT )−1/2 |Zk (τ , φ)| = Op (1). Proof of A.4(2). Let Dk = B(Nk)−1/2 (log NT )1/2 with B an arbitrary constant. Because of the first result of Lemma A.3 and the convexity of Qk (τ , φ), it suffices to show
P
sup
1≤k≤T ‖φ‖=Dk
Apply the decomposition (A.1). For Wk (τ , φ),
|Wk (τ , φ)| k − N − −1 0 ′ ≤ (log NT ) ψτ (uit (τ ))xit ‖φ‖ t =1 i =1 k − N − −1/2 −1/2 0 ′ ψτ (uit (τ ))xit = Op (1). ≤ B (Nk) (log NT ) t =1 i=1
sup
AhT ≤k≤T ‖φ‖=dT (NT )−1/2
−1 k Wk (τ , φ) > Cd2 /T T
k N √ −1/2 −1 − − 0 ′ ≤ P sup N k ψτ (ut (τ ))xit > CdT / T AhT ≤k≤T t =1 i =1 Ah N T − − T ≤ 2 2 N −1 (AhT )−2 E ‖xit ‖2
C dT
+N
t =1 i =1 T N − −
−1
t
−2
E ‖xit ‖
2
t =AhT +1 i=1
sup (log NT )−1 Qk (τ , φ) = Op (1) for each B > 0.
(log NT )
−1
δ T ≤k≤T
where the last inequality is due to the functional central limit theorem (or alternatively applying the Hájek-Rényi inequality). For Zk (τ , φ), apply Lemma A.1,
−1
sup
δ T ≤k≤T ‖φ‖=(NT )−1/2 log NT
N k − − 0 −1/2 ′ ψτ (uit (τ ))xit = Op (1), ≤ sup A (NT ) 1≤k≤T t =1 i =1
sup
|Wk (τ , φ)| .
in probability for large T , and
|Wk (τ , φ)|
k − N − Et −1 {bit (τ , φ) − bit (τ , 0)}φ 0 ≤ Zk (τ , φ) ≤ t =1 i=1 k − N − + A(NT )−1/2 ξit (τ , φ) . t =1 i =1
inf
δ T ≤k≤T ‖φ‖=(NT )−1/2 log NT
Apply the decomposition (A.1). For Wk (τ , φ), we have sup
sup
Applying similar arguments as in Lemma A.3, we can show
Note that the sup is taken over 1 ≤ k ≤ T instead of k ∈ [δ T , T ]. Due to the convexity of Qk (τ , φ), it then suffices to show 1≤k≤T ‖φ‖=A(NT )−1/2
Zk (τ , φ)
δ T ≤k≤T ‖φ‖=(NT )−1/2 log NT
NT ‖βˆ k − β 0 (τ )‖ = Op (1)
sup
inf
δ T ≤k≤T ‖φ‖=(NT )−1/2 log NT
T
√
sup
< ϵ.
Qk (τ , φ)
inf
δ T ≤k≤T ‖φ‖=(NT )−1/2 log NT
sup inf Qk (τ , φ) = Op (1). − 1 / 2 k≤Ah ‖φ‖≤d (NT )
1≤k≤T ‖φ‖≤A(NT )−1/2
Qk (τ , φ) < D log NT
We have
6. Suppose the same conditions as in part (5) hold. Then, for any A > 0,
T
inf
δ T ≤k≤T ‖φ‖=(NT )−1/2 log NT
≤
TK C 2 d2T
(AhT )
−1
+
T − t =AhT +1
t
−2
≤
3K C2
AhT d2T
−1
T
where the second inequality is due to the Hájek–Rényi inequality, the third inequality is because of E ‖xit ‖2 < ∞ by Assump∑T −2 tion B5(c), and the last inequality is because of ≤ t =AhT +1 t
−1
2(AhT )−1 . Because hT d2T /T → h > 0, the quantity AhT d2T /T , and thus the preceding display, can be made arbitrarily close to zero by choosing a large A.
T. Oka, Z. Qu / Journal of Econometrics 162 (2011) 248–267
Now, consider k−1 Zk (τ , φ). Applying the same argument as in the proof of Lemma A.3 (the discussion between the display (A.7) and (A.8)) but using Lemma A.2(3) instead of Lemma A.2(2), we have
P
inf
inf
AhT ≤k≤T ‖φ‖=dT (NT )−1/2
k−1 Zk (τ , φ) < 2Cd2T /T
inf
inf
AhT ≤k≤T ‖φ‖=dT (NT )−1/2
k−1 Qk (τ , φ) < Cd2T /T
ˆ ) − β10 (τ )‖ ≤ (Nk1 )−1/2 log For (c), notice that the restriction ‖β(τ
ˆ ) − β10 (τ ) = op (N −1/2 vT ). (Nk1 ) and Assumption B6 imply β(τ This implies (c) = op (1) using the same argument as in the proof of Lemma A.2. Similarly, term (d) = Op (1). Term (e) = Op (1) after applying the mean value theorem and the strong law of large numbers. Term (f) = op (1) because of the functional central limit theorem and k2 /T → 0. The above results all hold uniformly in 0 ≤ k2 ≤ T 1/2 vT−1 . Thus, by choosing a large A, the first term in (A.13) will dominate the second term. Consequently, the subgradient will be strictly positive with probability 1.
<ϵ
for sufficiently large T and A. Thus, P
265
<ϵ
for sufficiently large T and A. Because k ≥ AhT , this implies Qk (τ , φ) is greater than AC hT d2T /T with probability arbitrarily close because to 1 for sufficiently large T and A. 2However, AC hT d2T /T → ACh, the quantity AC hT dT /T can be made greater than D by choosing a large A. Proof of A.4(6). Uses the same argument as Bai (1995, p. 432). The detail is omitted.
Lemma A.6. Suppose there is no structural change. Under Assumptions B1–B3 and B5–B7, for any A < ∞, B < ∞, and ∆ ∈ Rp , we have sup
sup
−2 −1/2 B k≤AvT ‖φ‖≤(NT )
Qk (τ , N −1/2 vT ∆ + φ) − Qk (τ , N −1/2 vT ∆)
= op (1).
The next lemma is similar to Lemma 9 in Bai (2000). It shows that when pooling data from two regimes, the estimated parameters are close to those of the dominating regime.
The proof is similar to that for Lemma A.4(1) and is omitted.
Lemma A.5. Consider a sample of size T = k1 + k2 with a structural change occurring at k1 :
their estimates, βˆ j (τ ) be the coefficients estimates correspond-
Qyit (τ |xit ) =
x′it β10 (τ ), x′it β20 (τ ),
t = 1, . . . , k1 , t = k1 + 1, . . . , T .
Assume k2 ≤ T 1/2 vT−1 and suppose Assumptions B1–B3 and B5–
ˆ ) be the quantile regression estimate using the pooled B7 hold. Let β(τ sample ignoring the break and under the additional restriction that ˆ ) − β10 (τ )‖ ≤ (Nk1 )−1/2 log(Nk1 ). Then, ‖β(τ ˆ ) − β10 (τ ) = Op ((NT )−1/2 ). β(τ
(A.12)
Proof of Lemma A.5. The proof is based on analyzing the subgradient normalized by (NT )−1/2 :
(NT )−1/2
k1 − N −
ˆ ) − β10 (τ ))′ − τ xit } {bit (τ , β(τ
t =1 i=1
+ (NT )−1/2
T N − − ˆ ) − β20 (τ ))′ − τ xit }. {bit (τ , β(τ
(A.13)
t =k1 +1 i=1
(NT )−1/2
(2) |Tˆ2 − T20 | > K and |Tˆ1 − T20 | ≤ K ,
(3) |Tˆ2 − T20 | ≤ K and |Tˆ1 − T10 | > K . ∗
First consider Case (1). Let T b be the ordered version of ∗ ˆ {T1 , Tˆ2 , T10 , T20 − K , T20 + K }. Values in T b partition the full sample into (at most) six segments, with the lth segment containing the portion of the sample that falls between the (l − 1)th and lth ∗ largest values in T b . For any such partition, we always have
ˆ ), Tˆ b ) − SNT (τ , β 0 (τ ), T 0 )} {SNT (τ , β(τ ≥ inf {SNT (τ , β ∗ (τ ), T b ) − SNT (τ , β 0 (τ ), T 0 )}. ∗ β (τ )
ˆ ) − β20 (τ ))′ {bit (τ , β(τ
lth segment in the partition with coefficient estimates being βˆ l∗ (τ ). It includes observations from both regimes 2 and 3. Note that
xit Fit x′it β10 (τ ) − τ
(e)
t =k1 +1 i=1 T N − − t =k1 +1 i=1
xit 1(u0it (τ ) ≤ 0) − τ
≥ N 1/2 K 1/2 β20 (τ ) − β30 (τ ) /2 ≥ T v/2 ‖∆2 (τ )‖ /4 ≥ log(NK )
t =k1 +1 i=1 T N − −
(A.14)
Therefore, to reach a contradiction, it suffices to show that the right-hand side is strictly positive. Suppose the subsample T20 − K , T20 + K corresponds to the
− bit (τ , β10 (τ ) − β20 (τ ))′ } (c) T N − − +(NT )−1/2 ξit (τ , β10 (τ ) − β20 (τ ))′ (d)
+(NT )−1/2
(1) |Tˆ2 − T20 | > K and |Tˆ1 − T20 | > K ,
max N 1/2 K 1/2 βˆ l∗ (τ ) − β20 (τ ) , N 1/2 K 1/2 βˆ l∗ (τ ) − β30 (τ )
t =k1 +1 i=1
+(NT )−1/2
ˆ ) = (βˆ 1 (τ )′ , ing to the jth segment (i.e., [Tˆj−1 + 1, Tˆj ]), and β(τ ˆβ2 (τ )′ , βˆ 3 (τ )′ )′ . The proof consists of four steps that successively ˆ ) and Tˆ b . tighten the bounds on β(τ ˆ Step 1. (Prove P (|Tj − Tj0 | ≤ T 1/2 vT−1 ) → 1 for j = 1, 2 as T → ∞.) The proof is by contradiction. Suppose the result does not hold and let K = [T 1/2 vT−1 ]. Then, it suffices to consider the following three cases:
∗
We will proceed by contradiction, showing that the subgradient will be strictly positive with probability 1 if the condition (A.12) is violated. ˆ ) − β10 (τ )‖ > A(NT )−1/2 and study the two terms Suppose ‖β(τ separately. Following the same argument as in Lemma 2 in Qu (2008, p. 182), the first term in (A.13) can be made arbitrary large by choosing a large A. For the second term, rewrite it as T N − −
Proof of Lemma 2. For the ease of exposition, assume there are only two breaks, occurring at T10 and T20 . Let Tˆ b = (Tˆ1 , Tˆ2 ) be
(f).
for large T , where the second inequality is due to Assumption B7 and K = [T 1/2 vT−1 ], and the last inequality holds because Assumption B6 implies log(NK )/T ϑ/2 → 0 as T → ∞. Therefore, without loss of generality, we can assume (NK )1/2 ‖βˆ l∗ (τ ) − β20 (τ )‖ ≥ log(NK ). By Lemma A.4(3) (applied with T replaced by K ), the con tribution of the sub-segment T20 − K , T20 to the right side of (A.14) is greater than D log(NK ) withprobability approaching 1. The con tribution of the sub-segment T20 + 1, T20 + K is either nonnegative, or of order Op (log NK ) when it is negative by Lemma A.4(2). Other segments are of order Op (log(NT )) by Lemma A.4(2). By
266
T. Oka, Z. Qu / Journal of Econometrics 162 (2011) 248–267
choosing D large enough, the term D log(NK ) dominates the rest and thus (A.14) is positive with probability approaching 1 as T → ∞. Thus, we have reached a contradiction. Next consider Case (2). By asmrefasmB7, K /T → 0 and there∗ fore |Tˆ1 − T10 | > K for large T . Let T b be the ordered version of
{T10 − K , T10 + K , T20 , Tˆ1 , Tˆ2 }. Then, [T10 − K , T10 + K ], which forms
a single segment in the partition, contains observations from both regimes 1 and 2. Repeat the same argument as in Case 1, starting at (A.14) but with T20 , β20 (τ ) and β30 (τ ) replaced by T10 , β10 (τ ) and β20 (τ ). We can reach the same contradiction. The analysis of Case (3) proceeds in the same way as Case (2), by considering the ordered version of {Tˆ1 , T10 − K , T10 + K , T20 , Tˆ2 } and noticing that [T10 − K , T10 + K ] again forms a single segment. Step 2. (Prove P (‖βˆ j (τ ) − βj0 (τ )‖ ≤ (NT )−1/2 log NT ) → 1 for j = 1, 2 and 3.) Suppose βˆ 2 (τ ) does not satisfy this condition. Then |βˆ 2 (τ ) − β20 (τ )| > (NT )−1/2 log NT with positive probability for any T . Consider a subset of the second segment, with boundary points Tˆ1,1 = max(Tˆ1 , T10 ) and Tˆ1,2 = min(Tˆ2 , T20 ). Consider a new partition of the sample using the ordered version of {Tˆ1 , T10 , Tˆ2 , T20 , Tˆ1,1 , Tˆ1,2 }. Then, Step 1 implies that the seg-
ment [Tˆ1,1 , Tˆ1,2 ] contains a positive fraction of the sample. Its contribution to (A.14) is positive and greater than D log(NT ) by Lemma A.4(3). Contributions from other segments are of order Op (log(NT )) by Lemma A.4(2). Thus, the objective function (A.14) will be positive with probability approaching 1 as T → ∞, and ˆ ) is its minimizer. this contradicts the fact that β(τ Step 3. (Prove βˆ j (τ ) − βj0 (τ ) = Op ((NT )−1/2 ) for j = 1, 2 and 3.) The results from Steps 1 and 2 imply that we can restrict our attention to the following set:
Φ = {‖βˆ j (τ ) − βj0 (τ )‖ ≤ (NT )−1/2 log NT (j = 1, 2, 3) and
consider (β(τ ), T b ) ∈ Φ . In what follows, we shall prove that for any ϵ > 0 and D > 0, there exists an A > 0 such that
SNT (τ , β(τ ), T (β(τ ),T b )∈Φ ,T1 −T10 >AvT−2
b
SNT (τ , β(τ ), T ) < D
inf
< ϵ for j = 1 and 2
with large T . We only prove (A.16) as the proof for (A.15) is similar. Withˆ ), Tˆ b ) and out loss of generality, assume Tˆ2 > T20 . Let SNT (τ , β(τ ∗
SNT (τ , βˆ ∗ (τ ), Tˆ b ) denote the minimized values of the two terms inside the parentheses, respectively. Then, ∗
ˆ ), Tˆ b ) − SNT (τ , βˆ ∗ (τ ), Tˆ b ) SNT (τ , β(τ ˆ ), Tˆ b ) − SNT (τ , β(τ ˆ ), Tˆ b∗ ) ≥ SNT (τ , β(τ Tˆ2 N − −
=
qτ ,it (βˆ 2 (τ ) − β30 (τ ))
t =T20 +1 i=1 Tˆ2 N − −
−
qτ ,it (βˆ 3 (τ ) − β30 (τ )).
t =T20 +1 i=1
From Step 3, ‖βˆ j (τ ) − βj0 (τ )‖ = Op ((NT )−1/2 ), implying
‖βˆ 2 (τ ) − β30 (τ )‖ ≥ N −1/2 vT ‖∆2 (τ )‖ /2. By Lemma A.4(5) and (A.10), the above difference is greater than D with probability greater than 1 − ϵ when A is large. Note that Lemma A.4(5) is applied with hT = vT−2 and dT = T 1/2 vT . Therefore (A.16) holds. Proof of Theorem 2. By Lemma 2, we can restrict our attention to the set K × Θ , where
√
NT ‖βj − βj0 (τ )‖ ≤ M , j = 1, . . . , m + 1}.
Adding and subtracting terms, inf SNT (τ , β(τ ), T b )
inf
T b ∈K β(τ )∈Θ
= inf
inf {SNT (τ , β(τ ), T 0 )
T b ∈K β(τ )∈Θ
+ SNT (τ , β(τ ), T b ) − SNT (τ , β(τ ), T 0 ) }.
(A.17)
First, assume Tˆj < Tj0 for all j = 1, . . . , m. The second term inside the curly brackets is equal to 0
Tj m N − − − {ρτ (yit − x′it βj0+1 (τ )) − ρτ (yit − x′it βj0 (τ ))} + op (1)
(A.15)
inf {ST (τ , β(τ ), T 0 )}
β(τ )∈Θ
0
and
+ inf
P
uniformly in T b ∈ K and β(τ ) ∈ Θ by Lemma A.6. Thus, minimizing (A.17) is asymptotically equivalent to solving
<ϵ
b
(β(τ ),T b )∈Φ ,T1 =T10
j=1 t =Tˆ +1 i=1 j
)
−
SNT (τ , β(τ ), T b ) for j = 1, 2,
(A.15) and (A.16) imply P vT2 |Tˆj − Tj0 | > A
Θ = {βj :
Consider a partition of the sample using break dates {Tˆ1 , Tˆ2 }. Then, all segments are non-vanishing fragments of the sample. Consider the first segment. Then, it either: (1) only contains observations from the first regime, or (2) contains observations from both regimes but with less than T 1/2 vT−1 observations from the second regime. In the first case, apply the second result of Lemma A.3, and for the second case, apply Lemma A.5, leading to βˆ 1 (τ ) − β10 (τ ) = Op ((NT )−1/2 ). The parameter estimates corresponding to other segments can be analyzed similarly. Step 4. (Prove vT2 (Tˆj − Tj0 ) = Op (1).) From Step 3, it suffices to
inf
inf
(β(τ ),T b )∈Φ
K = {Tj : Tj = Tj0 + [svT−2 ] and |s| ≤ A < ∞, j = 1, . . . , m},
|Tˆi − Ti0 | ≤ T 1/2 vT−1 (i = 1, 2)}.
P
≥
inf SNT (τ , β(τ ), T (β(τ ),T b )∈Φ ,T2 −T20 >AvT−2
b
T b ∈K
)
−
inf
SNT (τ , β(τ ), T ) < D
(β(τ ),T b )∈Φ ,T2 =T20
hold for all sufficiently large T . Because inf
(β(τ ),T b )∈Φ ,Tj =Tj0
SNT (τ , β(τ ), T b )
{ρτ (yit − x′it βj0+1 (τ ))
j=1 t =Tj +1 i=1
− ρτ (yit − x′it βj0 (τ ))}.
b
Tj m N − − −
<ϵ
(A.16)
The first term depends only on β(τ ) but not on T b , which delivers ˆ ) as stated in the theorem. The the asymptotic distribution of β(τ second term only depends on T b but not on β(τ ), which delivers the limiting distribution for the break date estimate. Consider the jth break and rewrite the summation involving Tˆj as Hj,2 (s)−Hj,1 (s), where
T. Oka, Z. Qu / Journal of Econometrics 162 (2011) 248–267 Tj0
−
Hj,1 (s) = (βj0+1 (τ ) − βj0 (τ ))′
N −
xit ψτ (u0it (τ )),
−2 t =Tj0 +[svT ]+1 i=1 0
Tj −
Hj,2 (s) =
x′it (βj0+1 (τ )−βj0 (τ ))
N ∫ −
−2 t =Tj0 +[svT ]+1 i=1
{1(u0it (τ ) ≤ u)
0
− 1(u0it (τ ) ≤ 0)}du. First consider Hj,1 (s). If N is fixed, then we can apply a FCLT for martingale differences. If (N , T ) → ∞, we can apply a FCLT for random fields (e.g., Theorem 3 in Poghosyan and Roelly, 1998). In both cases, Hj,1 (s) ⇒ σ¯ j W (s), where σj2 = τ (1 − τ )∆j (τ )′ J¯j0 ∆j (τ ) and W (s) is a two-sided Wiener process satisfying W (0) = 0. Consider Hj,2 (s), its mean, for a given s, is equal to 1 2
∆j (τ )′ H¯ j0 (τ )∆j (τ )|s| + op (1) =
π¯ j 2
|s| + op (1),
and the deviation from the mean is uniformly small. Similar arguments can be applied to analyze the case Tˆj > Tj0 , leading to
vT2 (Tˆj − Tj0 ) ⇒ arg max
s
σ¯ j W (s) − π¯ j |s|/2 σ¯ j+1 W (s) − π¯ j+1 |s|/2
s≤0 s > 0.
Then, by a change of variables,
π¯ j σ¯ j
2
vT2 (Tˆj − Tj0 ) W (s) − |s|/2 ⇒ arg max (σ¯ j+1 /σ¯ j )W (s) − (π¯ j+1 /π¯ j )|s|/2 s
This completes the proof.
s≤0 0 < s.
References Andrews, D.W.K., 1993. Tests for parameter instability and structural change with unknown change point. Econometrica 61, 821–856. Angrist, J., Chernozhukov, V., Fernández-Val, I., 2006. Quantile regression under misspecification, with an application to the US wage structure. Econometrica 74, 539–563. Bai, J., 1995. Least absolute deviation estimation of a shift. Econometric Theory 11, 403–436. Bai, J., 1996. Testing for parameter constancy in linear regressions: an empirical distribution function approach. Econometrica 64, 597–622. Bai, J., 1997. Estimation of a change point in multiple regression models. The Review of Economics and Statistics 794, 551–563. Bai, J., 1998. Estimation of multiple-regime regressions with least absolutes deviation. Journal of Statistical Planning and Inference 74, 103–134. Bai, J., 2000. Vector autoregressive models with structural changes in regression coefficients and in variance. Annals of Economics and Finance 1, 303–339. Bai, J., 2010. Common breaks in means and variances for panel data. Journal of Econometrics 157, 78–92. Bai, J., Lumsdaine, R.L., Stock, J.H., 1998. Testing for and dating common breaks in multivariate time series. Review of Economic Studies 65, 395–432. Bai, J., Perron, P., 1998. Estimating and testing linear models with multiple structural changes. Econometrica 66, 47–78. Bai, J., Perron, P., 2003. Computation and analysis of multiple structural change models. Journal of Applied Econometrics 18, 1–22. Berkes, I., Horváth, L., Kokoszka, P., 2004. Testing for parameter constancy in GARCH(p; q) models. Statistics & Probability Letters 70, 263–273. Buchinsky, M., 1994. Changes in US wage structure 1963–1987: an application of quantile regression. Econometrica 62, 405–458. Cai, Z.W., Xu, X.P., 2008. Nonparametric quantile estimations for dynamic smooth coefficient models. Journal of the American Statistical Association 1595–1608. Chamberlain, G., 1994. Quantile regression, censoring and the structure of wages. In: Sims, Christopher (Ed.), Advances in Econometrics. Elsevier, New York, pp. 171–209. Chan, K.C., Karolyi, G.A., Longstaff, F.A., Sanders, A.B., 1992. An empirical comparison of alternative models of the short-term interest rate. The Journal of Finance 47, 1209–1227.
267
Chen, J., 2008. Estimating and testing quantile regression with structural changes. Working Paper. NYU Dept. of Economics. Chernozhukov, V., Umantsev, L., 2001. Conditional value-at-risk: aspects of modeling and estimation. Empirical Economics 26, 271–292. Chow, Y.S., Teicher, H., 2003. Probability Theory: Independence, Interchangeability, Martingales, third ed. Springer. Cox, J.C., Ingersoll, J.E., Ross, S.A., 1985. A theory of the term structure of interest rates. Econometrica 53, 385–407. Csörgő, M., Horváth, L., 1998. Limit Theorems in Change-Point Analysis. Wiley. Eide, E., Showalter, M., 1998. The effect of school quality on student performance: a quantile regression approach. Economics Letters 58, 345–350. Elliott, G., Muller, UK, 2007. Confidence sets for the date of a single break in linear time series regressions. Journal of Econometrics 141, 1196–1218. Elliott, G., Muller, UK, 2010. Pre and post break parameter inference. Working Paper. Department of Economics, Princeton University. Engle, R.F., Manganelli, S., 2004. CAViaR: conditional autoregressive value at risk by regression quantiles. Journal of Business and Economic Statistics 22, 367–381. Fiteni, I., 2002. Robust estimation of structural break points. Econometric Theory 18, 349–386. Hall, A.R., Sen, A., 1999. Structural stability testing in models estimated by generalized method of moments. Journal of Business and Economic Statistics 17, 335–348. Hansen, B.E., 1992. Tests for parameter instability in regressions with I (1) processes. Journal of Business and Economic Statistics 10, 321–335. Hendricks, B., Koenker, R., 1992. Hierarchical spline models for conditional quantiles and the demand for electricity. Journal of the American Statistical Association 87, 58–68. Hušková, M., 1997. Limit theorems for rank statistics. Statistics and Probability Letters 32, 45–55. Kejriwal, M., Perron, P., 2008. The limit distribution of the estimates in cointegrated regression models with multiple structural changes. Journal of Econometrics 146, 59–73. Kim, M.O., 2007. Quantile regression with varying coefficients. The Annals of Statistics 35, 92–108. Knight, K., 1998. Limiting distributions for L1 regression estimators under general conditions. Annals of Statistics 26, 755–770. Koenker, R., 2005. Quantile Regression. Cambridge University Press. Koenker, R., Bassett Jr., G., 1978. Regression quantiles. Econometrica 46, 33–50. Koenker, R., Xiao, Z., 2006. Quantile autoregression. Journal of the American Statistical Association 101, 980–990. Kokoszka, P., Leipus, R., 1999. Testing for parameter changes in ARCH models. Lithuanian Mathematical Journal 39, 182–195. Kokoszka, P., Leipus, R., 2000. Change-point estimation in ARCH models. Bernoulli 6, 513–539. Levin, J., 2001. For whom the reductions count: a quantile regression analysis of class size on scholastic achievement. Empirical Economics 26, 221–246. Li, H., Müller, UK, 2009. Valid inference in partially unstable generalized method of moments models. Review of Economic Studies 76, 343–365. McConnell, M.M., Perez-Quiros, G., 2000. Output fluctuations in the united states: what has changed since the early 1980’s? American Economic Review 90, 1464–1476. Perron, P., 2006. Dealing with structural breaks. In: Palgrave Handbook of Econometrics. In: Patterson, K., Mills, T.C. (Eds.), Econometric Theory, vol. 1. Palgrave Macmillan, pp. 278–352. Picard, D., 1985. Testing and estimating change-points in time series. Advances in Applied Probability 17, 841–867. 
Piehl, A.M., Cooper, S.J., Braga, A.A., Kennedy, D.M., 2003. Testing for structural breaks in the evaluation of programs. The Review of Economics and Statistics 85, 550–558. Poghosyan, S., Roelly, S., 1998. Invariance principle for martingale-difference random fields. Statistics and Probability Letters 38, 235–245. Poterba, J., Rueben, K., 1995. The Distribution of public sector wage premia: new evidence using quantile regression methods. NBER Working Paper. No. 4734. Powell, J.L., 1986. Censored regression quantiles. Journal of Econometrics 32, 143–155. Qu, Z., 2008. Testing for structural change in regression quantiles. Journal of Econometrics 146, 170–184. Qu, Z., Perron, P., 2007. Estimating and testing structural changes in multivariate regressions. Econometrica 75, 459–502. Siddiqui, M., 1960. Distribution of quantiles from a bivariate population. Journal of Research of the National Bureau of Standards Section B 64, 145–150. Su, L., Xiao, Z., 2008. Testing for parameter stability in quantile regression models. Statistics and Probability Letters 78, 2768–2775. Taylor, J.W., 1999. A quantile regression approach to estimating the distribution of multiperiod returns. The Journal of Derivatives 7, 64–78. Yao, Y.C., 1987. Approximating the distribution of the maximum likelihood estimate of the change-point in a sequence of independent random variables. The Annals of Statistics 15, 1321–1328.
Journal of Econometrics 162 (2011) 268–277
Contents lists available at ScienceDirect
Journal of Econometrics journal homepage: www.elsevier.com/locate/jeconom
A new class of asymptotically efficient estimators for moment condition models✩ Yanqin Fan, Matthew Gentry, Tong Li ∗ Department of Economics, Vanderbilt University, VU Station B #351819, 2301 Vanderbilt Place, Nashville, TN 37235-1819, United States
article
info
Article history: Received 4 March 2009 Received in revised form 23 July 2010 Accepted 28 January 2011 Available online 3 March 2011
abstract In this paper, we propose a new class of asymptotically efficient estimators for moment condition models. These estimators share the same higher order bias properties as the generalized empirical likelihood estimators and once bias corrected, have the same higher order efficiency properties as the bias corrected generalized empirical likelihood estimators. Unlike the generalized empirical likelihood estimators, our new estimators are much easier to compute. A simulation study finds that our estimators have better finite sample performance than the two-step GMM, and compare well to several potential alternatives in terms of both computational stability and overall performance. © 2011 Elsevier B.V. All rights reserved.
1. Introduction Hansen’s (1982) seminal paper has provided a unified framework within which econometric inference can be made using the Generalized Method of Moments (GMM), as most of the econometric models can be viewed as moment condition models. Over the past two decades, the GMM approach has made a profound impact on the development of econometric theory and seen wide applications. The GMM estimation, on the other hand, could have undesirable small sample bias properties, particularly in cases where there are many moments or instruments are weak, as demonstrated in Altonji and Segal (1996) among others. In view of this, various alternative estimators have been proposed, such as the continuous updating estimator (CUE) (Hansen et al., 1996), the empirical likelihood (EL) estimator (Owen, 1988; Qin and Lawless, 1994; Imbens, 1997), and the exponential tilting (ET) estimator (Kitamura and Stutzer, 1997; Imbens et al., 1998). See Kitamura (2005) for a comprehensive survey on the development of the EL and related methods. Newey and Smith (2004) showed that all these estimators belong to the class of generalized empirical likelihood (GEL) estimators, first introduced in Smith (1997). Newey and Smith (2004) also showed that GEL estimators have no asymptotic bias due to correlation of the moment conditions with their Jacobian and that EL has no asymptotic bias from estimating the optimal weighting matrix. Antoine et al. (2007) provide some related discussions. Thus from an asymptotic viewpoint, GEL and particularly EL estimators have superior bias properties relative to GMM. ✩ We thank Co-Editor, Takeshi Amemiya, an associate editor, and two referees for their constructive comments that have greatly improved the paper. We also thank Eric Renault for drawing our attention to important references and participants at the 2008 Far Eastern Meeting of the Econometric Society, Singapore, and the seminars at various universities for helpful comments. Li gratefully acknowledges support from the National Science Foundation (SES-0922109). ∗ Corresponding author. Tel.: +1 615 322 3582; fax: +1 615 343 8495. E-mail address:
[email protected] (T. Li).
0304-4076/$ – see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.jeconom.2011.01.006
Despite the nice theoretical properties of the GEL and in particular EL, they have been rarely applied in empirical applications, and their finite sample properties have not been thoroughly studied and understood.1 This could be attributed to the computational difficulty arising from the saddle point characterization of the GEL, as discussed in Guggenberger and Hahn (2005). Alternatively, GEL can be implemented through a nested optimization algorithm, as discussed in Kitamura (2005), using the first-order conditions established in Newey and Smith (2004). While the inner loop optimization with respect to the auxiliary parameters is usually a well defined convex programing problem, the outer loop optimization is generally complicated by the highly nonlinear nature of the firstorder condition within the loop.2 In this paper, we propose a new class of estimators that are computationally convenient, asymptotically efficient, and share the same higher order properties as GEL to any given order, provided the number of iterations is large enough. Noting that the implementation of GEL can be considered as a nested optimization routine with the inner loop a well defined convex problem, at the j-th step we exploit the first-order condition within the outer loop, and obtain our estimator by modifying the first-order condition so that the part on the left-hand side of the equation is the score of the two-step GMM, and the part on the right-hand side is evaluated at the estimator obtained in the previous step. As a result, the 1 Exceptions are Hansen et al. (1996) who study finite sample performance of CUE and related inference, Mittelhammer et al. (2005), and Guggenberger and Hahn (2005) who study small sample performance of EL estimators when instruments are strong and weak, respectively. Kunitomo and Matsushita (2003) find that EL estimators may have no moments. In addition, Kitamura (2005) provides some Monte Carlo evidence indicating good finite sample performance of EL estimators in a panel data setting. 2 Computation of CUE, if not viewed as a special case of the GEL, is also complicated due to the fact that the optimal weighting matrix in the objective function is a function of the parameters.
Y. Fan et al. / Journal of Econometrics 162 (2011) 268–277
optimization problem within the outer loop is no more difficult than the GMM optimization problem. At the same time, since we maintain the structure of the first-order condition of the GEL, higher order properties of the GEL are preserved. In fact, we show that (i) with any root-n consistent estimator as an initial estimator, our estimators at the j-th step are equivalent to the GEL estimator up to the order of n−(j+1)/2 ; (ii) if we use the two-step GMM as our initial estimator, then our estimators at the j-th step are equivalent to the GEL estimator up to the order of n−(j/2+1) . As a result, as long as j is large enough, our new estimators will have the same higher order properties as GEL to any given order. In particular, once bias corrected, our estimators will have the same higher order efficiency properties as the bias corrected GEL estimators discussed in Newey and Smith (2004). We also conduct Monte Carlo experiments to study the finite sample performance of our new class of estimators, and find that they have better finite sample performance than the two-step GMM, and compare well to several potential alternatives in terms of both computational stability and overall performance. Thus, our estimators have the advantages of being computationally tractable, sharing the same higher order properties as the GEL estimators, and also having desirable finite sample performance. The rest of this paper is organized as follows. Section 2 sets up the moment condition models and reviews the GMM and GEL estimators. Section 3 proposes a new class of estimators and Section 4 establishes their asymptotic efficiency and higher order properties. Section 5 is devoted to a simulation study to investigate the finite sample performance of the proposed new estimators. Section 6 concludes and discusses several possible extensions. All technical proofs are included in the Appendix.
′
(β)−1 g (β). βCUE = arg min g (β)′ Ω β∈B
As shown in Newey and Smith (2004), βCUE belongs to the class of GEL estimators. To describe GEL, let ρ (v) be a function of a scalar v that is concave in its domain, an open interval V containing zero. n (β) = λ : λ′ gi (β) ∈ V , i = 1, . . . , n . The GEL estimator Let Λ is the solution to a saddle point problem:
β = arg min sup
The EL estimator is a special case of GEL with ρ (v) = ln (1 − v) and V = (−∞, 1). The ET estimator is a special case of GEL with ρ (v) = − exp (v). As shown by Newey and Smith (2004), the CUE is also a GEL estimator with a quadratic ρ (v). In contrast to the two-step GMM, computation of GEL is much more involved. Let ρj (v) = ∂ j ρ (v) /∂v j and ρj = ρj (0) (j = 0, 1, 2, . . .). As in Newey and Smith (2004), we normalize so that ρ1 = ρ2 = −1. For a given function ρ (v), an associated GEL estimator β , and gi = gi β , let
πi ≡ πi β, λ =
E [g (z , β0 )] = 0, where E [·] denotes expectation taken with respect to the distribution of z. Let gi (β) = g (zi , β). The sample first and second moments of gi (β) are given by: n i =1
gi (β),
(β) = Ω
′ λ gi ρ1 , n ∑ ρ1 λ′ gj
(2)
j=1
We use the notation in Newey and Smith (2004). Let zi (i = 1, . . . , n) denote i.i.d. observations on a data vector z. Let β denote a p × 1 parameter vector and g (z , β) be an m × 1 vector of functions of the data observation z and the parameter β , where m ≥ p. The model has a true parameter β0 satisfying the moment condition:
n 1−
n − ρ λ′ gi (β) .
β∈B λ∈Λ n (β) i=1
2. GMM and GEL estimators
g (β) =
269
E gi (β)gi (β) by the corresponding unweighted sample averages. This finding is consistent with the small sample bias problem of βGMM reported in existing Monte-Carlo experiments. Alternative estimators have been proposed to alleviate the potential small sample bias problem of βGMM . One such estimator is the CUE of Hansen et al. (1996). It is defined as the solution to a minimization problem as follows:
n 1−
n i=1
gi (β)gi (β)′ .
The two-step optimal GMM estimator of Hansen (1982) is obtained as the solution to the following minimization problem:
( βGMM = arg min g (β)′ Ω β)−1 g (β) β∈B
where β is a preliminary consistent estimator of β0 and the minimum is taken over some compact parameter space B . It can be computed via the first-order condition:
( G( βGMM )′ Ω β)−1 g ( βGMM ) = 0, (1) ∑n −1 where G (β) ≡ n i=1 Gi (β) and Gi (β) = ∂ gi (β) /∂β . Andrews (2002) showed that a k-step estimator obtained by iterating the Newton–Raphson algorithm (NR) applied to (1) k times has the same higher order asymptotic efficiency as βGMM , to any given order, provided k is large enough and some regularity conditions hold. Via stochastic expansions, Newey and Smith (2004) showed that several sources contribute to the asymptotic bias of βGMM including the inefficient estimation of the Jacobian of the moment condition E [Gi (β)] and of the optimal weighting matrix
where n − λ≡λ β = arg max ρ λ′ gi /n. λ∈Λn (β ) i=1
(3)
The πi ’s are the implied probabilities for the observations. They ∑n sum to one, satisfy the sample moment condition πi gi = 0 i=1 when the first-order conditions for λ hold, and are positive when λ′ gi is small uniformly in i. Let k(v) = [ρ1 (v) + 1] /v, v ̸= 0, and ∑k(0) = −1. Also, let vi = λ′ gi , ki ≡ ki β, λ = k ( vi ) / nj=1 k vj . Theorem 2.3 in Newey and Smith (2004) showed that the GEL first-order conditions imply:
n − i =1
′ −1 n − ′ πi G i β ki gi β gi β g β = 0,
(4)
i=1
where ki = πi for EL and ki = 1/n for CUE. Comparing (1) with (4) reveals the main difference between GEL estimators and the two-step GMM estimator: instead of using the unweighted sample average to estimate the Jacobian of the moment condition, GEL estimators employ an efficient estimator of the Jacobian of the moment condition by using the implied probabilities πi . In addition, the EL estimator also makes use of an efficient estimator of the optimal weighting matrix. As a result, GEL estimators do not suffer from the asymptotic bias due to the estimation of the Jacobian and EL also does not suffer from the asymptotic bias due to the estimation of the optimal weighting matrix. Moreover, Newey and Smith (2004) showed that the bias corrected EL inherits the higher order property of maximum likelihood and that it is higher order asymptotically efficient relative to the other bias corrected estimators. Based on (3) and (4), computation of β can be done via a nested optimization algorithm, see Kitamura (2005). In the inner loop, the
270
Y. Fan et al. / Journal of Econometrics 162 (2011) 268–277
auxiliary parameter λ is computed via (3) for a given β and in the outer loop, β is found via (4). While the computation of λ through (3) is often easy to implement, the solution to (4) is more difficult to find due to the highly nonlinear nature of (4). In contrast to the two-step GMM, numerical algorithms such as the NR algorithm are difficult to apply because of the highly nonlinear dependence of πi and/or ki on β . 3. New estimators To alleviate the computational difficulty of GEL, we propose a new class of estimators corresponding to each GEL. Like GEL, computation of our estimators requires solving (3) in the inner loop, but instead of solving (4) in the outer loop, we propose to solve an equation that is no more difficult to solve than the first-order condition for GMM (1). To introduce the idea, we first propose our estimators corresponding to CUE and then extend them to GEL.
Newey and Smith (2004) showed that CUE is a member of GEL with ρ (v) = a − v − v 2 /2 for a constant a. In this case, the inner loop maximization (3) implies: 1 + λ′CUE gi βCUE
βCUE = 0 gi
i =1
βCUE , where resulting in λCUE = λCUE
−1 λCUE β = −Ω β g β .
(5)
The first-order condition (4) for the outer loop optimization implies that β = βCUE satisfies
( G( β)′ Ω β)−1 g ( β) n 1− ( ( − Gi ( β)′ gi ( β)′ Ω β)−1 g ( β) Ω β)−1 g ( β) = 0.
(6) n i =1 Donald and Newey (2000) first established (6) for a scalar β . CUE was first proposed in Hansen et al. (1996) and shown to have desirable small sample bias properties compared with the two-step GMM. They also proposed an Iterative Estimator (IE). Let βIEj denote the IE from iteration j, j = 1, 2, . . .. The iteration starts from the two-step GMM estimator by reestimating the weighting ( matrix in (1) using Ω βIEj−1 ) and constructing a new estimator
βIEj . Note that βIEj is the solution to the following minimization problem3 :
( βIEj = arg min g (β)′ Ω βIEj−1 )−1 g (β) β∈B
or equivalently, it is the solution to the following equation:
( βIEj−1 )−1 g ( βIEj ) = 0 G( βIEj )′ Ω where j = 1, 2, . . . with β0 = β GMM .
(7)
IE
Inspecting (1) and (7) reveals that any improvement of the IE βIEj over the two-step GMM βGMM comes from the improved estimation of the weighting matrix. The large small sample bias of the two-step GMM βGMM compared with the CUE βCUE is due to the omission of the second term in (6). Because of this second term, as shown in Donald and Newey (2000), the CUE can be interpreted as a jackknife estimator, and the first-order conditions for the CUE are centered at zero, while the first-order conditions of the two-step
3 We note that βIE is an example of the backfitting estimator proposed by Pastorello et al. (2003) in the context of general extremum estimation. In general, the backfitting estimator may not be asymptotically efficient. However, as stated in their Rejoinder, in the case of CUE, βIEj is asymptotically as efficient as βCUE , because ′ the partial derivative of E gi (β) Ω (β1 ) gi (β) with respect to β1 is zero. j
( β j−1 )−1 g ( β j) G( β j )′ Ω n 1− ( β j−1 )−1 g ( β j −1 ) β j −1 ) ′ Ω = Gi ( β j−1 )′ gi ( n i =1
( ×Ω β j−1 )−1 g ( β j−1 ), where β 0 is a root-n consistent estimator of β0 . It is instructive to look at β 1 when β0 = β: 1 ′ −1 1 G(β ) Ω (β) g (β ) n 1− ( ( β)−1 g ( β) Ω β)−1 g ( β). β)′ Ω = Gi ( β)′ gi (
(8)
n i =1
3.1. Exploiting the first-order conditions of the CUE
n −
GMM are not.4 On the other hand, it is exactly the second term in (6) that makes the computation of βCUE more involved than that of the two-step GMM or of the IE βIEj . To make use of the second term in (6) and at the same time to alleviate the computational burden associated with β CUE , we propose the following j-step estimator denoted by β j , j = 1, 2, . . .. It solves the following equation:
Comparing this with (1), we observe that β 1 makes use of the ‘score function’ for βGMM . Instead of setting it to zero like βGMM , β 1 sets it equal to the value of the second term in (6) evaluated at the initial estimator β . Computationally solving β 1 is no more difficult than solving βGMM from (1), as the right hand side of the above equation is evaluated at β and hence fixed. However as we show in the next section, β 1 is not only asymptotically efficient like βGMM but also equivalent to βCUE up to order n−1 due to the use of the information in the second term in the score function for ∑ ( ( βCUE : n−1 ni=1 Gi ( β)′ gi ( β)′ Ω β)−1 g ( β) Ω β)−1 g ( β). Thus, it shares the same higher order bias properties as βCUE , up to order n− 1 . We can iterate further using (8) to obtain β j . We show in the next section that β j is asymptotically equivalent to βCUE up to order n−(j+1)/2 with any root-n consistent estimator of β0 as the initial estimator and is asymptotically equivalent to βCUE up to order n−(j+2)/2 with an asymptotically efficient estimator such as βGMM as the initial estimator. Compared with βIEj proposed in Hansen et al. (1996), at each iteration, β j makes use of not only an improved estimator of the weighting matrix, but also an improved estimator of the second term in (6). 3.2. Exploiting the first-order conditions of the GEL — the Outer loop The estimator β j in (8) makes use of the natural decomposition of the ‘score function’ of βCUE into the ‘score function’ of βGMM and an additional term. It is computationally comparable to βGMM and asymptotically equivalent to βCUE up to a higher order. For a general GEL, we define β j as the solution to:
j ′ j−1 −1 j G β Ω β g β j−1 ′ j−1 −1 G β Ω β − ′ −1 n n − − = ′ j −1 πij−1 Gi β j −1 ki gi β j−1 gi β j −1 i =1
i =1
j−1 × g β , j −1 i
(9) j −1 ki
where π and are respectively πi and ki with β replaced by β j−1 and λ replaced by λ j −1 : n − j −1 λj−1 = arg max ρ λ′ gi /n. n ( λ∈Λ β j−1 ) i=1 4 The CUE and its first-order conditions have been recently used in Kleibergen (2005) and Kleibergen and Mavroeidis (2008) for testing full/sub-sets of parameters in GMM without identification.
Y. Fan et al. / Journal of Econometrics 162 (2011) 268–277
The initial estimator β 0 can be any root-n consistent estimator of β0 .5 For CUE, ki = 1/n and (9) reduces to
j ′ j−1 −1 j g β Ω β G β ′ n − j−1 −1 j−1 j −1 −1 j −1 g β , (10) Ω β = n − πi Gi β i=1
j−1 j−1 −1 j−1 1 − g′ β Ω β gi β = −1 . n 1 − g′ β j −1 Ω β j −1 g β j −1
πij−1
1 It is interesting to compare βM2 with the three-step estimator6 of Antoine et al. (2007) and the quasi-empirical likelihood estimator of Sherlund (2004). Both make use of the efficient estimators of the Jacobian and the optimal weighting matrix obtained from weighted averages with weights being the implied probabilities from CUE. To illustrate, let βABR denote the three-step estimator of Antoine et al. (2007). In terms of our notation, βABR is defined as
n −
where
Under Assumptions 1 and 2 to be introduced in the next section, it is easy to show:
j−1 j−1 −1 j−1 g β = Op (n−1 ) g′ β Ω β implying that (10) is equivalent to (8). For EL, ρ (v) = ln (1 − v) and ki = πi . We have:
j ′ j−1 −1 j g β G β Ω β j−1 ′ j−1 −1 G β Ω β ′ −1 n n − − = j −1 j−1 j−1 ′ β β β πij−1 Gi πij−1 gi gi − i =1
i =1
π
0 iCUE Gi
0 β
′ n −
i =1
−1 π
0 iCUE gi
0 0 ′ β gi β
g βABR = 0,
i =1
0 where πiCUE is the implied probability πi corresponding to CUE evaluated at the initial estimator β0 ≡ βGMM . When we use 1 1 β0 ≡ βGMM to compute βM2 , the two estimators βM2 and βABR share similar features. In particular, both are easy to compute — no optimization is required, and both are asymptotically equivalent to βEL up to order n−3/2 . The third modification makes use of the first-order condition for (3). To illustrate, consider EL. The first-order condition for λ denoted as λEL is: n − i =1
gi βEL
= 0. β EL 1 + λ′EL gi j
This motivates the following algorithm for computing λEL : n [ −
j −1 × g β .
271
λjEL 1+
′
gi β0
]
gi β0
i =1
3.3. Alternative estimators corresponding to GEL — the inner loop In contrast to (8), to apply (9) to compute β j for GEL, we need j−1 j − 1 to compute πi and ki which depend on λj−1 , where λj−1 needs to be computed via the optimization in (3). When the dimension of λ or g is large, this can be time consuming in particular when j is large, as we need to repeat the above optimization j times. In this section, we propose several modifications of (9) which employ different versions of the Lagrange multiplier λ in computing πij−1 j−1
and ki . In the first modification, instead of updating λ at each iteration, we fix λj−1 at λ0 so that only one inner-loop maximization is j required. Let βM1 denote the resulting estimator at j-th iteration. The second and third modifications make use of the fact that for CUE, λ denoted as λCUE satisfies: n −
βCUE gi βCUE = 0 1 + λCUE gi ′
′ ]2 j−1 β0 gi n 1 − 1 + λEL − 0 =− gi β j −1 0 1 + λEL gi β i=1
[
yielding
j − 1 0 λEL = −Ω β β0 g ′ ]2 j −1 0 gi β n 1 − 1 + λEL 1 − 0 + . gi β n i =1 1+ λjEL−1 gi β0
[
(11)
i=1
−1 resulting in λCUE = λCUE βCUE , where λCUE β = −Ω β g β .
j−1 In the second modification, we choose either λj−1 = λCUE β j or λj−1 = λCUE β 0 . Let βM2 denote this estimator. Consider j = 1. 1 Corresponding to EL, βM2 satisfies: 1 ′ 0 −1 1 G βM2 Ω β g βM2 ′ n − 0 ′ 0 −1 0 0 0 Ω β − πi β , λCUE β Gi β G β i=1 −1 = n − 0 0 0 0 ′ × πi β , λCUE β gi β gi β i=1
0 × g β . 5 As pointed out by an anonymous referee, one of the essential problems of the computation of GEL is that (4) may have multiple solutions due to its highly nonlinear nature. The less complicated equation for β j in (9) is expected to alleviate this problem.
j
We denote this estimator by βM3 . In addition to simplifying the computation of β j , (11) can also be used in efficient estimation of an expectation taking into account the moment conditions, see Brown and Newey (1998). 4. Asymptotic efficiency and higher order properties We adopt as Newey and Smith (2004). the same assumptions Let Ω = E gi (β0 ) gi (β0 )′ and G = E [∂ gi (β0 ) /∂β ]. Assumption 1. (a) β0 ∈ B is the unique solution to E [g (z , β)] = 0; (b) B is compact; (c) g (z , β) is continuous at each β ∈ B with probability one; (d) E supβ∈B ‖g (z , β) ‖α < ∞ for some α > 2; (e) Ω is nonsingular; (f) ρ (v) is twice continuously differentiable in a neighborhood of zero. 6 We thank Eric Renault for drawing our attention to the three-step estimator of Antoine et al. (2007), and an anonymous associate editor for bringing Sherlund (2004) to our attention.
272
Y. Fan et al. / Journal of Econometrics 162 (2011) 268–277
Assumption 2. (a) β0 ∈int(B ); (b) g (z , β) is continuously differ- entiable in a neighborhood N of β0 and E supβ∈N ‖∂ gi (β) /∂β ′ ‖ < ∞; (c) rank(G)= p. Under Assumptions 1 and 2, Newey and Smith (2004) showed that the GEL estimator β is consistent and asymptotically normally distributed. Theorem 4.1. Under Assumptions 1 and 2, if β 0 is root-n consistent, then for any finite j, our iterative estimator β j is asymptotically efficient. Remark 4.1. In the context of maximum likelihood estimation (MLE), Song et al. (2005) proposed maximization by parts (MBP) algorithms to overcome computational difficulties when the score function is of an additive form, where one part is much simpler than the other part. The MBP algorithms start from a consistent estimator, evaluate the more complicated part at this estimator, and solve for an updated estimator by letting the sum of the simpler part and the already evaluated more complicated term equal to zero. This procedure is repeated until convergence, producing MLE. Fan et al. (2007) extended the MBP algorithms in Song et al. (2005) to more general extremum problems. The MBP algorithms of Song et al. (2005) and Fan et al. (2007) produce efficient estimators starting from a consistent estimator at convergence. This generally requires a certain ‘dominance condition’ to hold in order for the algorithms to converge. In addition, for any finite iteration, the estimator is only consistent under certain conditions and only the final estimator upon convergence is guaranteed to be asymptotically efficient. In contrast, our estimator β j is asymptotically efficient for any finite j. To prove higher order properties of β j , we adopt Assumption 3 in Newey and Smith (2004). In fact, Assumption 3 is more than what we need. However, given that the higher order bias properties of GEL are established under Assumptions 1–3 in Newey and Smith (2004), we will maintain the same set of assumptions as Newey and Smith (2004). Let ∇ k denote a vector of all distinct partial derivatives with respect to β of order k. Assumption 3. There is b (z ) with E b (zi )6 < ∞ such that for 0 ≤ k ≤ 4 and all z, ∇ k g (z , β) exists on a neighborhood N of β0 , supβ∈N ‖∇ k g (z , β) ‖ ≤ b (z ), and for each β ∈ N , ‖∇ 4 g (z , β) − ∇ 4 g (z , β0 ) ‖ ≤ b (z ) ‖β − β0 ‖, ρ (v) is four times continuously differentiable with the Lipschitz fourth derivative in a neighborhood of zero.
Theorem 4.2. Under Assumptions 1–3, (i) if β 0 is root-n consistent, then for any finite j, β j is Op n−
j+1 2
To see more clearly the difference between our estimator and that based on NR, we apply the one-step NR to (9) for j = 1 and define β 1 as:
] [ 0 ′ 0 −1 0 −1 ∂ g β Ω β G β β1 = β0 − ∂β 0 ′ 0 −1 0 β G β + G β Ω ′ −1 n n − − 0 0 0 0 ′ 0 0 × πi Gi β g β . (12) ki gi β gi β i =1
i=1
Iterating (12), we get β j at the j-th step (j ≥ 2): ] [ j−1 ′ j−1 −1 j−1 −1 ∂ β g G β Ω β βj = β j −1 − ∂β j−1 ′ j−1 −1 j−1 β + G β Ω G β ′ −1 n n − − j −1 j−1 j−1 ′ j −1 j −1 × π Gi β k gi β gi β i
i
i=1
i=1
j −1 × g β , j −1
(13)
j −1
where πi and ki are respectively πi and ki with β replaced by j − 1 β . Like NR, (13) makes use of the score function for GEL ensuring efficiency of β j or β j . In sharp contrast to NR, instead of using the Hessian for GEL, (13) makes use of the Hessian for GMM which simplifies the algorithm greatly. Remark 4.3. By following the same arguments as in the proof of Theorem 4.2, we can show that if we use β0 = βGMM , then the j modified estimators βMl , l = 1, 2, 3 in Section 3.3 are asymptotically equivalent to the GEL estimator β up to order n−3/2 . Remark 4.4. As j → ∞, β j should converge to the corresponding GEL. Heuristically, this is because if the respective algorithms converge, then β j will converge to the solution of the first-order condition for the GEL. 5. A simulation study
order equivalent to the GEL
estimator β ; (ii) if β0 − β = Op n−1 , then βj − β = Op n
can conclude that iterative estimators from the NR algorithm and its variants applied to (4) are higher order equivalent to β . However, given the highly nonlinear nature of (4), direct application of NR and its variants would be difficult, if not impossible.7 Instead, our estimator β j is much easier to compute and shares the same higher properties as β to any order provided j is large enough.
j+2 − 2
.
As discussed in Bonnal and Renault (2004), most asymptotically efficient estimators of β share the same dominant term in their asymptotic expansions and hence satisfy the condition in Theorem 4.2 (ii). Thus, with an asymptotically efficient estimator as our initial estimator, Theorem 4.2 (ii) implies that β j has the j+2
same stochastic expansion as β up to order n− 2 and has the same higher order bias properties as β discussed in Newey and Smith (2004) even for j = 1. The same bias correction as in Newey and Smith (2004) can be applied to β j and the bias corrected β j has the same higher order properties as the bias corrected β . Remark 4.2. Robinson (1988) showed the higher order equivalence of iterative estimators obtained from zeros of recursive linear approximations to the function defining the extremum estimator. Applying Robinson’s result to the GEL estimator β , we
In this section we conduct a series of Monte Carlo experiments designed to compare the finite-sample performance of our proposed class of estimators with several established estimation techniques: standard two-step GMM, standard EL, and the threestep EL estimator proposed by Antoine et al. (2007). We explore several variants of our proposed class of estimators, including CUE-analog estimators based on Eq. (8), EL-analog estimators based on Eq. (9), and EL-analog estimators corrected for NeweySmith bias. We also compare the results generated by our multistep estimators starting from one-step GMM to those generated 7 Noting that the EL can be viewed as the method of moments estimator for an exactly identified model with the number of moment conditions equal to p + m, Guggenberger and Hahn (2005) propose a two-step estimator based on NR to approximate the EL estimator as the latter can be difficult to compute. Their NR based two-step estimator, however, can be unstable because of the large number of the moment conditions in the exactly identified model as well as the highly nonlinear feature of these moment conditions.
Y. Fan et al. / Journal of Econometrics 162 (2011) 268–277
starting from two-step GMM. We use John Zedlewski’s matElike MATLAB package to obtain baseline GMM and EL estimates, and a numeric zero-search algorithm to compute our remaining estimates.8 Our Monte Carlo design is based on work by Zedlewski (2008), who runs a simulation study using a simple dynamic panel model discussed in Imbens (2002).9 The econometrician is assumed to observe data on n individuals for T periods. The dependent variable Yit is serially dependent and includes an unobservable individual-level fixed effect ηi that is identically and independently distributed across individuals. The model is: Yit = ηi + β0 Yi(t −1) + ϵit ,
t = 1, . . . , T
(14)
where ϵit is identically and independently distributed across individuals and periods. To complete our Monte Carlo specification, we assume (following Zedlewski (2008)) that ηi ∼ N (0, ση2 ) and
ϵit ∼ N (0, σϵ2 ), where ση = 1 and σϵ = 0.3. We consider three possible values for β0 : β0 = 0.5, β0 = 0.9, and β0 = 0.95. The goal of estimation is to recover the (scalar) parameter β0 . To eliminate the unobservable fixed effect ηi , we estimate the model in differences. The equation to estimate thus becomes:
∆Yit = β0 ∆Yi(t −1) + ∆ϵit ,
t = 2, . . . , T .
As usual in such models, differencing induces a correlation between ∆Yi(t −1) and ∆ϵit . Consequently, we must instrument for ∆Yi(t −1) in the estimation. This leads us to consider moments. As outlined in Imbens (2002), the structure of the model enables us to employ two types of moment conditions. First, for any t ≥ 3, we can form a set of t − 2 Type 1 moment functions: g1(t ,τ ) (Yi , β) = Yi(t −τ ) [∆Yit − β ∆Yi(t −1) ],
τ = 1, . . . , t − 2.
Second, if we assume the data come from a long-run steady state of the model, we can form a set of T − 2 Type 2 moment functions: g2t (Yi , β) = ∆Yi(t −1) [Yit − β Yi(t −1) ],
t = 2, . . . , T .
Pooling all possible Type 1 moment functions across all T periods, we obtain (T − 1) · (T − 2)/2 moments, which we stack into a [(T − 1) · (T − 2)/2] × 1 vector g1 (yi , β). Similarly, stacking Type 2 moment functions across periods, we obtain a (T − 2) × 1 vector g2 (yi , β). Estimation is then based on (a subset of) the moment conditions: E [g1 (Yi , β0 )] = 0 E [g2 (yi , β0 )] = 0. By pooling all Type 1 and Type 2 moments, we can thus obtain up to (T + 1) · (T − 2)/2 estimating equations for a single (scalar) parameter β . The model is therefore greatly overidentified. In what follows, we consider both estimators based only on Type 1 conditions and estimators based on pooling Type 1 and Type 2 conditions. 8 In particular, for GMM and EL, we implement matElike using the optional Zipsolver package provided by Zedlewski. For our multi-step estimators, we use MATLAB’s built-in fzero subroutine to solve Eqs. (8) and (9) at each estimation iteration. The fzero subroutine uses a combination of bisection, secant, and inverse quadratic interpolation methods to find points where a continuous function crosses the y-axis. Brent (1973) and Forsythe et al. (1976) provide more detailed descriptions of the underlying algorithm. For current purposes (a scalar unknown), the fzero algorithm seems to work quite well. For more complicated multiparameter problems, the Newton–Raphson algorithm or one of its variants could be used instead. 9 We are grateful to the associate editor for drawing our attention to Zedlewski (2008) and for suggesting that we re-design our Monte Carlo experiments following Zedlewski (2008). As a result, the Monte Carlo study in the paper is different from the one in the previous version of the paper Fan and Li (2009) that follows the design in Guggenberger and Hahn (2005) and Hahn and Hausman (2002).
273
For each Monte Carlo simulation, we draw a sample of n = 1500 individuals over T = 12 periods. From this sample, we then estimate β0 using a variety of benchmark and experimental estimators. In particular, we consider standard two-step GMM, standard EL, and Antoine et al. (2007) three-step Euclidean EL estimators as benchmarks, and our proposed multi-step CUE and multi-step EL analogs as experimental estimators. We explore estimates based on both Type 1 moments only (our baseline) and pooled Type 1 and Type 2 moments. Our primary specification for our multi-step estimators uses two-step GMM to obtain βˆ 0 and does not correct for potential higher-order bias of the resulting estimates. In the Type 1 moments case, however, we also explore using one-step GMM initial estimates and correcting for Newey–Smith higher-order bias at each estimation step. Our simulation design thus enables us to compare the finitesample performance of several plausible variants of our proposed multi-step estimators with that of three established benchmark estimators under two alternative sets of moment conditions. Tables 1–4 report results from 100 repetitions of the procedure outlined above for each value of β0 considered.10 Table 1 gives results for our baseline case: only Type 1 moments, a two-step GMM initial estimate β 0 , and no bias correction. Table 2 reports results obtained by adding Type 2 moments to the estimating equations and recalculating the estimators given in Table 1. Table 3 compares multi-step EL estimates obtained using one-step GMM for the initial estimate β 0 to those obtained using two-step GMM in the Type 1 moments case. Finally, Table 4 explores the effect of correcting for higher-order bias at each step of our multi-step EL estimator in the Type 1 moments case following Newey and Smith (2004). For our multi-step estimators, we report β 1, β 5 , and β 10 , the results from our first, fifth and tenth estimation iterations. In one β0 = 0.9 simulation and one β0 = 0.95 simulation, the baseline matElike EL algorithm failed to converge at all, and in four β0 = 0.9 simulations it converged to nonsensical (negative) results. Results from these simulations are omitted when calculating the relevant lines of the tables reported.11 Our simulation results suggests several preliminary conclusions. First, consistent with the findings of Zedlewski (2008), twostep GMM performs very well for small β0 , but its performance decreases dramatically as β0 approaches 1, and for β0 = 0.95 twostep GMM is severely biased downward (a median bias of −0.2403 in the best-case scenario). Meanwhile, EL performs consistently well for all three values of β0 considered. Thus, somewhat trivially, our Monte Carlo design highlights a situation where an econometrician might prefer using EL to using GMM. Second, as expected, our multi-step EL–analog estimators converge to their corresponding EL estimators. This convergence can be seen by comparing means and medians in Tables 1 and 2, but (importantly) it is not only an average or median phenomenon. For 10 instance, when β0 = 0.95, βEL was within 0.01 of βEL for all but 4 of our 100 baseline simulations, and within 0.001 of βEL for all but 9. When β0 = 0.5, the corresponding numbers were 100 of 100 within 0.01, and 97 of 100 within 0.001. In our baseline case, the median number of iterations required for the multi-step estimator to converge at the 0.0001 level was 2 when β0 = 0.5, 3 when β0 = 0.9, and 7 when β0 = 0.95. 
Not surprisingly, the number of iterations required to achieve convergence depends on the initial estimate β 0 ; when β0 = 0.95, the correlation between first-step error and number of iterations to convergence was 0.55. 10 Given the concern that the GEL estimators may have no finite moments as illustrated in Kunitomo and Matsushita (2003) and Guggenberger (2005, 2008), in addition to reporting mean and standard deviation, we also report the median. 11 Notably, our multi-step EL estimator did not encounter similar computational
problems: when β0 = 0.9, our 10th-iteration EL-analog estimates fell between a minimum of 0.8034 and a maximum of 1.0282. In the current application, at least, our algorithm may therefore be somewhat more stable.
274
Y. Fan et al. / Journal of Econometrics 162 (2011) 268–277
Table 1 j
j
Finite sample performance of benchmarks, βEL , and βCUE , (j = 1, 5, 10), 2SGMM initial estimator, Type 1 moments.
β0 = 0.5 β0 = 0.9 β0 = 0.95
Median bias Mean bias SD Median bias Mean bias SD Median bias Mean bias SD
βEL
3S βEL
βGMM
1 βEL
5 βEL
10 βEL
1 βCUE
5 βCUE
10 βCUE
0.0014 0.0015 0.0140 0.0038 0.0045 0.0411 0.0097 0.0168 0.0877
0.0020 0.0017 0.0140 0.0064 0.0068 0.0406 −0.0244 −0.0331 0.1438
−0.0020 −0.0024
0.0023 0.0014 0.0139 −0.0038 −0.0036 0.0446 −0.1195 −0.1678 0.2055
0.0027 0.0016 0.0139 0.0038 0.0050 0.0415 −0.0062 −0.0127 0.1312
0.0023 0.0015 0.0139 0.0038 0.0049 0.0415 0.0062 0.0079 0.1027
0.0020 0.0012 0.0138 −0.0091 −0.0106 0.0447 −0.1235 −0.1750 0.1965
0.0024 0.0017 0.0138 0.0030 0.0059 0.0409 −0.0070 −0.0107 0.1109
0.0024 0.0017 0.0138 0.0030 0.0060 0.0409 0.0094 0.0151 0.0831
0.0136
−0.0571 −0.0630 0.0581
−0.2403 −0.2893 0.2047
Table 2 j
j
Finite sample performance of benchmarks, βEL , and βCUE , (j = 1, 5, 10), 2SGMM initial estimator, Type 1 and Type 2 moments.
β0 = 0.5 β0 = 0.9 β0 = 0.95
a
Median bias Mean bias SD Median bias Mean bias SD Median bias Mean bias SD
βEL
3S βEL
βGMM
1 βEL
5 βEL
10 βEL
1 βCUE
5 βCUE
10 βCUE
0.0016 0.0014 0.0100 0.0038 0.0022 0.0156 0.0033 0.0007 0.0192
0.0019 0.0014 0.0100 0.0097 0.0169 0.0899 −0.0269a −0.0237a 0.5397a
0.0000 0.0006 0.0101 −0.0682 −0.0830 0.0681 −0.2771 −0.3149 0.1863
0.0015 0.0015 0.0100 −0.0044 0.0020 0.0295 −0.0177 0.0393 0.2032
0.0016 0.0015 0.0100 0.0033 0.0021 0.0148 −0.0112 −0.0027 0.0697
0.0016 0.0015 0.0100 0.0038 0.0023 0.0155 −0.0047 0.0077 0.0382
0.0015 0.0016 0.0099 −0.0293 −0.0398 0.0489 −0.2147 −0.2591 0.1896
0.0015 0.0018 0.0099 −0.0012 −0.0021 0.0175 −0.0616 −0.1250 0.1521
0.0015 0.0018 0.0099 0.0034 0.0020 0.0153 −0.0277 −0.0544 0.0960
3S After dropping 9 negative outliers (e.g. βˆ EL = −40.32). Summary measures over all 100 iterations: median = 0.9185, mean = 0.2739, SD = 4.3034.
Table 3 j , 1SGMM vs 2SGMM initial estimators, (j = 1, 5, 10), Type 1 moments. βEL
β0 = 0.5 β0 = 0.9 β0 = 0.95
Median bias Mean bias SD Median bias Mean bias SD Median bias Mean bias SD
1S βGMM
2S βGMM
1 βEL (1S )
5 βEL (1S )
10 βEL (1S )
1 βEL (2S )
5 βEL (2S )
10 βEL (2S )
−0.0281 −0.0306
−0.0020 −0.0024
0.0016 0.0016 0.0137 −0.1001 −0.1264 0.1414 −0.4102 −0.4587 0.2974
0.0027 0.0016 0.0139 0.0038 0.0041 0.0428 −0.0544 −0.1286 0.2422
0.0023 0.0015 0.0139 0.0038 0.0049 0.0415 −0.0061 −0.0186 0.1498
0.0023 0.014 0.0139 −0.0038 −0.0036 0.0446 −0.1195 −0.1678 0.2055
0.0027 0.0016 0.0139 0.0038 0.0050 0.0415 −0.0062 −0.0127 0.1312
0.0023 0.0015 0.0139 0.0038 0.0049 0.0415 0.0062 0.0079 0.1027
0.0240
0.0136
−0.0571 −0.3055
−0.2841 −0.0630
0.1774
0.0581
−0.5424 −0.5854
−0.2403 −0.2893
0.2853
0.2047
Table 4 j
Uncorrected vs Newey–Smith bias-corrected βEL , (j = 1, 5, 10), Type 1 moments.
β0 = 0.5 β0 = 0.9 β0 = 0.95
Median bias Mean bias SD Median bias Mean bias SD Median bias Mean bias SD
βEL
1 βEL
5 βEL
10 βEL
1 βEL (BC )
5 βEL (BC )
10 βEL (BC )
0.0014 0.0015 0.0140 0.0038 0.0045 0.0411 0.0097 0.0168 0.0877
0.0023 0.0014 0.0139 −0.0038 −0.0036 0.0446 −0.1195 −0.1678 0.2055
0.0027 0.0016 0.0139 0.0038 0.0050 0.0415 −0.0062 −0.0127 0.1312
0.0023 0.0015 0.0139 0.0038 0.0049 0.0415 0.0062 0.0079 0.1027
0.0024 0.0015 0.0139 −0.0034 −0.0034 0.0443 −0.1177 −0.1650 0.2027
0.0028 0.0017 0.0139 0.0038 0.0052 0.0412 −0.0068 −0.0105 0.1246
0.0024 0.0016 0.0139 0.0038 0.0052 0.0412 0.0053 0.0095 0.0949
Third, as can be seen in Table 3, multi-step EL–analog estimates based on two-step GMM tend to converge to standard EL estimates more quickly than those based on one-step GMM. This is also not surprising — since one-step GMM estimates are less precise, iteration based on Eq. (9) on average has further to go. When β0 = 0.5, this difference is probably not important: by the fifth iteration, the multi-step estimators starting from one-stage GMM and two-step GMM are indistinguishable. As β0 approaches 1, however, the performance gap increases, and when β0 = 0.95 the fifth iteration starting from two-step GMM outperforms the tenth iteration starting from one-step GMM. Thus, when practical, our proposed procedure should probably be implemented starting from two-step GMM. Fourth, based on Tables 1–3, we can compare several variants of our multi-step estimators to each other. In particular, we
explore variations in type of estimator (CUE versus EL), the number of moment conditions used in estimation, and whether or not to correct for higher-order bias in estimation. On balance, our multi-step CUE and EL estimators perform very similarly: median bias of the 5th- and 10th-step CUE estimators is slightly lower in most cases, but greater when β0 = 0.95. Inclusion of Type 2 moments tends to improve the performance of EL-based estimators: comparing Tables 1 and 2, median bias, mean bias and standard deviation are all lower for both standard EL and multi-step EL when Type 2 moments are used. Effects on other estimators are ambiguous, though when β0 = 0.95 both GMM and our multi-step CUE analog perform substantially worse when Type 2 moments are included. Lastly, in Table 4, we explore adapting our multi-step EL estimator to correct for higher-order bias by subtracting the GEL bias estimator given in Section 5 of Newey
Y. Fan et al. / Journal of Econometrics 162 (2011) 268–277
and Smith (2004) at each iteration. When β0 = 0.5 and β0 = 0.9, estimated higher order bias is very small, but when β0 = 0.95 correcting for Newey–Smith bias generates perceptible finitesample improvements. Finally, our Monte Carlo design enables us to compare our proposed multi-step estimators to several other benchmark estimators. We have seen that our proposed multi-step EL estimator converges to standard EL, and therefore tend to perform much better than two-step GMM when β0 is close to 1. We also consider the three-step Euclidean EL estimator proposed by Antoine et al. (2007), which builds on two-step GMM (or another efficient esti3S mator) using a control variate approach to obtain an estimator βEL that is asymptotically equivalent to standard EL. We calculate this estimator using the formula given in Section 4.1 of Antoine et al. (2007), and report the results in Tables 1 and 2.12 As can be seen from these tables, our multi-step estimator performs comparably to three-step EEL when β0 = 0.5, slightly better by the first iteration when β0 = 0.9, and considerably better by the fifth iteration when β0 = 0.95.13 Thus, our proposed multi-step estimators compare well to several potential alternatives in terms of both computational stability and overall performance.14 6. Conclusion and some extensions In this paper, we have proposed a new class of estimators for unconditional moment restriction models. They are obtained by exploiting the first-order conditions of the GEL estimators. They share the same asymptotic efficiency and higher order properties as the GEL estimators, but are computationally much more attractive than the GEL estimators. Given the attractive theoretical properties of our estimators, and their computational advantage, as well as their nice finite sample performances as demonstrated in our Monte Carlo experiments, they are potentially useful in practice as an alternative to the GMM and GEL estimators, in particular, when there are a large number of moment conditions and/or a large number of parameters, or when instruments are relatively weak. Kitamura et al. (2004), Smith (2007a,b), and Antoine et al. (2007), among others, have extended members of GEL estimators for unconditional moment restriction models to conditional moment restriction models. Antoine et al. (2007) also proposed an extension of their three step estimator to the conditional moment restriction models. We expect that our approach, suitably modified, would work well for conditional moment restriction models as well and are currently exploring details of such an extension to conditional moment restriction models without unknown functions (e.g. Newey (1990)) and with unknown functions (e.g. Ai and Chen (2003)). Other properties and applications of the proposed estimators may be explored in future work.15 First, the proposed estimators in this paper may be used to replace the GEL estimators in tests 12 It should be noted, however, that we consider a very simple version of the threestep EEL estimator: we implement the procedure outlined in Section 4.1 of Antoine et al. (2007), but do not adjust for the fact that estimated EEL probabilities can be negative in finite samples. Antoine et al. (2007) suggest that a shrinkage procedure could be used to account for this possibility, which might help to improve finitesample performance. 13 When β = 0.95 and Type 2 moments are included, the version of the 0
three-step EEL estimator we implement yields several large negative estimates 3S (e.g. βEL = −40.42). This may stem from our failure to adjust for negative empirical probabilities, or from some other source. The results we report for the three-step EEL in the β0 = 0.95 case were obtained after dropping 9 negative estimates. 14 We have also increased the number of repetitions to 500 for β = 0.95 and got 0
quite similar results, suggesting that the conclusions reported in this section with 100 repetitions are robust. 15 We thank an anonymous associate editor for raising these issues.
275
for the over-identifying moment conditions and for the parametric constraints in Smith (1997) and Ramalho and Smith (2005). In view of Theorem 4.2, we expect tests using our proposed estimators and the corresponding ones in Smith (1997) and Ramalho and Smith (2005) to have similar properties at least when the number of iterations j is large enough. Second, consider the minimum distance estimator β defined as
β = arg n −
min
β∈B ,π1 ,...,πn
πi gi (β) = 0,
i =1
n −
h (πi ) ,
subject to
i =1 n −
πi = 1,
(15)
i=1
where h (π ) is a member of the Cressie and Read of (1984) family discrepancies in which h (π) = [γ (γ + 1)]−1 (nπ )γ +1 − 1 /n. Theorem 2.2 in Newey and Smith (2004) shows that under certain conditions, the first-order conditions for GEL with ρ (v) defined as
ρ (v) = −
(1 + γ v)(γ +1)/γ , γ +1
coincide with the first-order conditions for (15). As a result, for discrete data, the implied probabilities πi in (2) can be interpreted as the best approximation of the probabilities of the observations incorporating the moment conditions. Now consider the implied probabilities corresponding to our proposed estimator β j , i.e., πij =
j j j πi β , λ , where λj = λ β . Since λj still solves (3), πij ’s still ∑n j sum to one, satisfy the sample moment condition πij gi = 0 i=1 when the first-order conditions for λj hold, and are positive when λ′ gi is small uniformly in i. Third, it is known that certain
minimum distance estimators are robust against violations of model restrictions. Kitamura et al. (2009) show that the minimum distance estimator β corresponding to γ = −1/2 enjoys an optimal minimax robust property when the model restriction does not hold and is asymptotically efficient when the model assumption holds. Moreover, they demonstrate via simulations that EL (corresponding to γ = −1) and ET (corresponding to γ = 0) are robust, but GMM is not. In view of the higher order equivalence of our proposed estimators and the corresponding GEL estimators, we expect them to have similar robustness properties. Appendix
(β1 )−1 Proof of Theorem 4.1. Let S (β, β1 ) = G(β)′ Ω g (β) and j−1 ′ j−1 −1 G β Ω β − −1 j−1 ′ j −1 n n − − j−1 g β . A β = j−1 j−1 ′ j − 1 j − 1 gi β πi Gi β ki gi β i=1
i=1
Then for any fixed j, we have
j j −1 j−1 S β , β = A β . First, we show that A β j −1
−1/2
= op n−1/2 . Given that β j −1 −
β0 = Op n , we get from Lemma A2 in Newey and Smith (2004) that λj−1 = Op n−1/2 . Then Lemma A1 in Newey and j−1 Smith (2004) implies that maxi≤n | λj−1′ gi β | = op (1). As a j−1′ j−1 result, ρ1 λ gi β − ρ1 (0) = op (1), leading to j−1′ j−1 ρ1 λ gi β 1 πij−1 = n = 1 + op (1) uniformly in ∑ n ρ1 λj−1′ gl β j −1
l =1
i = 1, . . . , n.
(A.1)
276
Y. Fan et al. / Journal of Econometrics 162 (2011) 268–277
Similarly, we get
j−1′ j−1 j−1′ j−1 ρ1 λ gi β +1 = ρ2 (0) + op (1) k λ gi β = j − 1 ′ j − 1 λ gi β uniformly in i = 1, . . . , n,
Proof of obtain
1 0 ∂ S β ,β ∂β1
1 0 ∂ S β ,β
1 = Op n−1/2 : Since β − β0 = Op n−1/2 , we
1 1 (β 0 )−1 ⊗ (β 0 )−1 = − g (β )′ Ω G(β )′ Ω = Op n−1/2 .
∂β1′
implying
j −1 k i
Proof of
j−1′ j−1 k λ gi β 1 = n 1 + op (1) = ∑ n k λj−1′ gl β j −1 l=1
uniformly in i = 1, . . . , n.
(A.2)
∗ ∂ A β ∂β
= Op n−1/2 : Let
′ (β) −1 D (β) = G (β) Ω ′ −1 n n − − ′ − πi (β) Gi (β) ki (β) gi (β) gi (β) . i=1
i=1
It follows from (A.1) and (A.2) that
Since A (β) = D (β) g (β), we have
j−1 ′ j−1 −1 G Ω β β ′ −1 n n − j −1 − j−1 j−1 ′ j − 1 j − 1 β β β − πi Gi ki gi gi
∗ ∗ ∗ ∗ ∗ ∂ A β = ∂ D β g β + D β ∂ g β . ∗ ∗ So we need to show: (i) D β = Op n−1/2 and (ii) ∂ D β = Op (1). (i) is implied by
i =1
i =1
= op (1) .
n −
= op n−1/2 is completed by noting that j −1 −1/2 g β = Op n . The proof of A β j −1
Consequently, β j satisfies the first-order conditions for the twostep GMM to order op n−1/2 , i.e.,
( G( β j )′ Ω β j−1 )−1 g ( β j ) = op n−1/2 . This implies that β j has the same first-order asymptotic properties as βGMM .
∗ ∗ π ∗i Gi β − G β = Op n−1/2 ,
i=1 n −
∗
ki gi β
∗
gi β
∗ ′
β ∗ = Op n−1/2 . −Ω
i=1
Now,
∗′ ∗ ∗′ ∗ = ρ1 (0) + ρ2 (0) λ gi β ρ1 λ gi β ∗′ ∗ 2 , + ρ3 vi∗ λ gi β ∗ ∗′ where vi∗ lies between λ gi β and zero. Hence, ∗′ ∗ 2 + ρ3 vi∗ λ gi β ] [ n ] [n π ∗i = ∗ ∗ ∗ ∗′ ∑ ∗ ∗′ ∑ +λ n+λ gl β ρ3 vi∗ gi β gi′ β λ
∗′
Proof of Theorem 4.2. (i). Let us consider j = 1. We apply a mean 1 0 0 value expansion to S β , β = A β and obtain
l =1
1 0 1 0 S β ,β S β ,β ∂ ∂ 1 0 S β, β + β − β + β − β ∂β ∂β1 ∗ A β ∂ 0 = A β + β − β , ∂β
Now, since λ
1 0 1 0 ∗ β , β lies between β , β and β, β , and β lies between β 0 and β . Since S β, β = A β , we obtain 1 0 1 0 ∂ S β ,β S β ,β 1 ∂ 0 β − β + β − β ∂β ∂β1 ∗ ∂A β = β0 − β . ∂β We need to show that
−1/2
= Op n −1
1 ∂β1 which imply β − β = Op n
n j follows suit.
−1/2
and
∗
l =1
∗′ ∗ 2 1 + λ gi β + ρ3 vi∗ λ gi β ∗ = ∗′ . ∗′ ∗′ n 1+λ g β + λ ρ3 (0) Ω + op (1) λ ∗′
where
1 0 ∂ S β ,β
1 + λ gi β
n −
∗′
∗
= Op n−1/2 , we get
∗ π ∗i Gi β
i=1 n ∑
∑ n n ∗′ ∗ 2 ∗ ∗′ ∑ ∗ ∗ +λ g i β Gi β + ρ3 vi∗ λ gi β Gi β i=1 i=1 i=1 = ∗′ ∗ ∗′ ∗ n 1+λ g β + λ ρ3 (0) Ω + op (1) λ n ∗ ∗ ∗ ∗ ∑ ∗ ∗′ ∗ ∗′ G β +λ Λ β +λ n−1 ρ3 vi∗ gi β Gi β gi′ β λ =
Gi β
∗
i=1 ∗ ∗′ ∗ ∗′ 1+λ g β + λ ρ3 (0) Ω + op (1) λ
∗ = G β + Op n−1/2 . ∗ ∂ A β ∂β
Likewise, we have
= Op
. The result for a general
∗′ ∗ ∗′ ∗ ρ1 λ gi β +1 k λ gi β = ∗′ ∗ λ gi β
Y. Fan et al. / Journal of Econometrics 162 (2011) 268–277
∗ ∗′ ∗ 2 ∗′ ρ2 (0) λ gi β + ρ3 vi∗ λ gi β ∗ = ∗′ λ gi β ∗′ ∗ = ρ2 (0) + ρ3 vi∗ λ gi β .
Hence
∗′ ∗ ρ2 (0) + ρ3 vi∗ λ gi β ki = n ∗ = n ∗ ∑ ∗′ ∗′ ∑ k λ gl β nρ2 (0) + λ ρ3 vl∗ gl β ∗′
k λ gi β
∗
∗
l =1
l=1
which implies n −
∗
ki gi β
∗
gi β
∗ ′
i=1
[ ] n ′ ∑ β ∗ + λ∗′ n−1 ρ3 vi∗ gi β ∗ gi β ∗ gi β ∗ ρ2 (0) Ω [i=1 n ] = ∑ ∗ ∗′ ρ2 (0) + λ n−1 ρ3 vl∗ gl β l=1
β ∗ + Op n−1/2 . =Ω
It remains to show (ii) ∂ D β
∂ D β
∗
∗
= Op (1). This follows, as
′ = E ∂ 2 gi (β0 ) Ω −1 − G′ Ω −1 E gi (β0 ) Gi (β0 )′ n ∗ ∗ ′ − −1 ′ πi β Gi β + Gi (β0 ) gi (β0 ) Ω − ∂ Ω −1 i=1
− G′ Ω −1 ∂
n −
ki β
∗
gi β
∗
gi β
∗ ′
Ω −1 + op (1)
i=1
= −2G′ Ω −1 E gi (β0 ) Gi (β0 )′ + Gi (β0 ) gi (β0 )′ Ω −1 + op (1) = Op (1) . Theorem 4.2 (ii) follows immediately.
References Ai, C., Chen, X., 2003. Efficient estimation of models with conditional moment restrictions containing unknown functions. Econometrica 71, 1795–1843. Altonji, J.G., Segal, L.M., 1996. Small-sample bias in GMM estimation of covariance structures. Journal of Business and Economic Statistics 14, 353–366. Andrews, D.W.K., 2002. Equivalence of the higher order asymptotic efficiency of k-step and extremum statistics. Econometric Theory 18, 1040–1085. Antoine, B., Bonnal, H., Renault, E., 2007. On the efficient use of the information content of estimating equations: implied probabilities and euclidean empirical likelihood. Journal of Econometrics 138, 461–487. Bonnal, H., Renault, E., 2004. On the efficient use of the information content of estimating equations: implied probabilities and euclidean empirical likelihood, Working Paper. Brent, R., 1973. Algorithms for Minimization Without Derivatives. Prentice-Hall. Brown, B.W., Newey, W.K., 1998. Efficient semiparametric estimation of expectations. Econometrica 66, 453–464. Cressie, N., Read, T., 1984. Multinomial goodness-of-fit tests. Journal of the Royal Statistical Society Series B 46, 440–464. Donald, S.G., Newey, W.K., 2000. A jackknife interpretation of the continuous updating estimator. Economics Letters 67, 239–243. Fan, Y., Pastorello, S., Renault, E., 2007. Maximization by parts in extremum estimation, Manuscript.
277
Fan, Y., Li, T., 2009. A new class of asymptotically efficient estimators for moment condition models, Working Paper, Vanderbilt University. Forsythe, G.E., Malcolm, M.A., Moler, C.B., 1976. Computer Methods for Mathematical Computations. Prentice-Hall. Guggenberger, P., 2005. Monte-Carlo evidence suggesting a no moment problem of the continuous updating estimator. Economics Bulletin 3, 1–6. Guggenberger, P., 2008. Finite sample evidence suggesting a heavy tail problem of the generalized empirical likelihood estimator. Econometric Reviews 27, 526–541. Guggenberger, P., Hahn, J., 2005. Finite sample properties of the two-step empirical likelihood estimator. Econometric Reviews 24, 247–263. Hahn, J., Hausman, J., 2002. A new specification test for the validity of instrumental variables. Econometrica 70, 163–189. Hansen, L.P., 1982. Large sample properties of generalized method of moments estimators. Econometrica 50, 1029–1054. Hansen, L.P., Heaton, J., Yaron, A., 1996. Finite-sample properties of some alternative GMM estimators. Journal of Business and Economic Statistics 14, 262–280. Imbens, G., 1997. One-step estimators for over-identified generalized method of moments models. Review of Economic Studies 64, 359–383. Imbens, G., 2002. Generalized method of moments and empirical likelihood. Journal of Business and Economic Statistics 20, 493–506. Imbens, G., Spady, R.H., Johnson, P., 1998. Information theoretic approaches to inference in moment condition models. Econometrica 66, 333–357. Kitamura, Y., 2005. Empirical likelihood methods in econometrics: theory and practice, Invited Talk at the World Congress of the Econometric Society. Kitamura, Y., Otsu, T., Evdokimov, K., 2009. Robustness, infinitesimal neighborhoods, and moment restrictions, Cowles Foundation Discussion Paper No. 1720. Kitamura, Y., Stutzer, M., 1997. An information-theoretic alternative to generalized method of moments estimation. Econometrica 65, 861–874. Kitamura, Y., Tripathi, G., Ahn, H., 2004. Empirical likelihood-based inference in conditional moment restriction models. Econometrica 72, 1667–1714. Kleibergen, F., 2005. Testing parameters in GMM without assuming that they are identified. Econometrica 73, 1103–1123. Kleibergen, F., Mavroeidis, S., 2008. Inference on subsets of parameters in GMM without assuming identification, Working Paper, Brown University. Kunitomo, N., Matsushita, Y., 2003. Finite sample distributions of the empirical likelihood estimator and the GMM Estimator, Working Paper, University of Tokyo. Mittelhammer, R., Judge, G., Schoenberg, R., 2005. Empirical evidence concerning the finite sample performance of el-type structural equation estimation and inference methods. In: Identification and Inference for Econometric Models: Essays in Honor of Thomas Rothenberg. Cambridge University Press, pp. 282–305. Newey, W.K., 1990. Efficient instrumental variables estimation of nonlinear models. Econometrica 58, 809–837. Newey, W.K., Smith, R.J., 2004. Higher order properties of GMM and generalized empirical likelihood estimators. Econometrica 72, 219–255. Owen, A., 1988. Empirical likelihood ratio confidence intervals for a single functional. Biometrika 75, 237–249. Pastorello, S., Patilea, V., Renault, E., 2003. Iterative and recursive estimation in structural nonadaptive models. Journal of Business and Economic Statistics 21, 449–482. Qin, J., Lawless, J., 1994. Empirical likelihood and general estimating equations. Annals of Statistics 22, 300–325. Ramalho, J.J.S., Smith, R.J., 2005. 
Goodness of fit tests for moment condition models, Working Paper, UNIVERSIDADE DE ÉVORA. Robinson, P.M., 1988. The stochastic difference between econometric statistics. Econometrica 56, 531–548. Sherlund, S.M., 2004. Quasi empirical likelihood estimation of moment condition models, Working Paper, Federal Reserve Board. Smith, R.J., 1997. Alternative semi-parametric likelihood approaches to generalized method of moments estimation. Economic Journal 107, 503–519. Smith, R.J., 2007a. Efficient information theoretic inference for conditional moment restrictions. Journal of Econometrics 430–460. Smith, R.J., 2007b. Local GEL estimation with conditional moment restrictions. In: Phillips, G.D.A., Tzavalis, E. (Eds.), The Refinement of Econometric Estimation and Test Procedures: Finite Sample and Asymptotic Analysis. Cambridge University Press, Cambridge, pp. 100–122 (Chapter 4). Song, P., Fan, Y., Kalbfleisch, J., 2005. Maximization by parts in likelihood inference. Journal of the American Statistical Association 100, 1145–1158. Zedlewski, J., 2008. Practical empirical likelihood estimation using matElike, Manuscript, Harvard University.
Journal of Econometrics 162 (2011) 278–293
Contents lists available at ScienceDirect
Journal of Econometrics journal homepage: www.elsevier.com/locate/jeconom
Fourth order pseudo maximum likelihood methods Alberto Holly a,b , Alain Monfort c,∗ , Michael Rockinger d a
Institute of Health Economics and Management (IEMS), University of Lausanne, Centre administratif de Vidy, 1015 Lausanne, Switzerland
b
Nova School of Business and Economics, Nova University, Lisbon, Portugal
c
CREST, Banque de France, and University of Maastricht, CREST, 15 Boulevard Péri, 92245 Malakoff Cédex, France
d
Swiss Finance Institute and University of Lausanne, Faculty of Business and Economics, Extranef Building, CH-1015 Lausanne, Switzerland
article
info
Article history: Received 31 May 2009 Received in revised form 22 November 2010 Accepted 28 January 2011 Available online 16 February 2011 JEL classification: C01 C13 C16 C22
abstract We extend PML theory to account for information on the conditional moments up to order four, but without assuming a parametric model, to avoid a risk of misspecification of the conditional distribution. The key statistical tool is the quartic exponential family, which allows us to generalize the PML2 and QGPML1 methods proposed in Gourieroux et al. (1984) to PML4 and QGPML2 methods, respectively. An asymptotic theory is developed. The key numerical tool that we use is the Gauss–Freud integration scheme that solves a computational problem that has previously been raised in several fields. Simulation exercises demonstrate the feasibility and robustness of the methods. © 2011 Elsevier B.V. All rights reserved.
Keywords: Quartic exponential family Pseudo maximum likelihood Skewness Kurtosis
1. Introduction It is well known that the Maximum Likelihood estimator may not only be inefficient but also inconsistent under misspecification, that is when the parametric model providing the likelihood function does not contain the true distribution. The study of the relations between Maximum Likelihood Theory and misspecification now has a long history. Hood and Koopmans (1953) demonstrated that the conditionally Gaussian ML estimator is consistent and asymptotically Gaussian, even if the true distribution is not conditionally Gaussian, as soon as the first two conditional moments are well specified. They coined the label ‘‘quasi ML estimator’’ for this kind of estimator. White (1982) showed that, under misspecification, the ML estimator is in fact a CAN (consistent asymptotically normal) estimator of the pseudo-true value (as defined for instance in Sawa (1978)). Gourieroux et al. (1984) characterized the parametric families leading to CAN estimators of the parameters appearing in the first two conditional moments, even if the
∗
Corresponding author. E-mail addresses:
[email protected] (A. Holly),
[email protected] (A. Monfort),
[email protected] (M. Rockinger). 0304-4076/$ – see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.jeconom.2011.01.004
true distribution does not belong to this parametric family. These families are the linear exponential families (when only the first conditional moment is specified) and the quadratic exponential families (when the first two conditional moments are specified). The estimators thus obtained were called PML1 and PML2 estimators, respectively. Bollerslev and Wooldridge (1992)generalized the properties of the quasi generalized estimator, i.e. the Gaussian PML estimator, to the dynamic case. The PML theories described above only consider inference on the parameters appearing in the first two conditional moments. More recently, however, many econometric fields are paying greater attention to higher order conditional moments. This is particularly the case in Financial Econometrics and in Health Econometrics. Very often, the approach used to account for higher moments is the ML method based on a choice of a parametric family, which allows for asymmetry and fat tails. A few such examples, occurring in finance are: Generalized Hyperbolic distribution (Eberlein and Keller, 1995; Barndorff-Nielsen, 1997), the noncentral Student t distribution (Harvey and Siddique, 1999), and the skewed-t distribution (Hansen, 1994; Jondeau and Rockinger, 2003). In Health Econometrics, some examples are: The Generalized Gamma distribution proposed by Stacy (1962) and Stacy and Mihram (1965), (Manning et al., 2005), and the
A. Holly et al. / Journal of Econometrics 162 (2011) 278–293
Pearson Type IV distribution, which may be considered a skewedt distribution (Holly and Pentsak, 2004). More generally, examples can be found in various fields, including physics, astronomy, image processing, and in the biomedical sciences (see Genton, 2004 and Arellano-Valle and Genton, 2005). Current ML approaches have two types of drawbacks. First, some families may not be flexible enough to span the whole set of possible skewness (s) and kurtosis (k), namely the domain k ≥ s2 +1. Second, as mentioned above, the risk of misspecification may lead to inconsistent estimators. If we are interested in the first four conditional moments, a natural method is the Generalized Method of Moments (GMM). There is now a large body of studies, however, suggesting that GMM estimators can have poor finite sample properties (see e.g. Tauchen, 1986, Andersen and Sorenson, 1996, Altonji and Segal, 1996, Ziliak, 1997 and Doran and Schmidt, 2006). These difficulties led Kitamura and Stutzer (1997), to propose an alternative estimation procedure based on the Kullback–Leibler Information Criterion. The objective of this paper is to propose another alternative to GMM. It is an extension of the PML method developed by Gourieroux et al. (1984), henceforth referred to as GMT. This work extends GMT to a situation where the first four moments (centered or not) are known functions depending on unknown parameters. Specifically, we show that the PML estimator is consistent for any specification of the first four conditional moments, any true conditional distribution of the endogenous variable, and any marginal distribution of the exogenous variables, if and only if the PML is based on a quartic exponential family. We shall refer to this extension as PML4. We also propose an extension of the Quasi-Generalized Pseudo Maximum Likelihood (QGPML) estimator proposed by Gourieroux et al. (1984) based on the quartic exponential family and called QGPML2. Beyond the robustness and nice asymptotic properties of the estimates resulting from the quartic exponential families, two additional features should be noted. First, the quartic exponential family spans the whole set of possible values for the mean and the variance as well as the pairs (s, k), except for a set of measure zero, namely {(s, k), s.t. s = 0, k > 3}. This is not necessarily the case for other parametric families of distributions such as those mentioned earlier. Second, we show how the parameters of the quartic exponential family may be obtained from a given set of moments. Thereby, we solve a numerical problem which had been already encountered, both in the econometric literature (Zellner and Highfield, 1988; Ormoneit and White, 1999) and other fields (Agmon et al., 1979 or Mead and Papanicolaou, 1984), where the exponential family arises as an Entropy Maximizing density and which were considered as difficult. The key issue, from the computational point of view, is the use of the Gauss–Freud quadrature scheme which seems very promising for computing the numerical integrations needed in this framework. It is also important to note that, since the quartic exponential family will be used as a tool providing a convenient set of auxiliary probability distributions generating the whole set of pairs (s, k) except the set of measure zero {(s, k), s = 0, k > 3}, it is also possible to exclude the point (s = 0, k = 3) without any practical consequence; this will be done for the sake of technical simplicity. 
Of course, as usual in PML theories, this choice of auxiliary probability distribution does not imply any restriction on the true distribution. The rest of this paper is organized as follows. Preliminary results are given in Section 2, where some properties of exponential families are briefly reviewed and the notion of a quartic exponential family is defined. This section also contains a brief presentation of the properties of M-Estimators. These preliminary results are then used to derive important properties of the exponential quartic family in Section 3. The PML4 method is defined in Section 4, and the asymptotic properties of the PML4 estimators are derived. In Section 5, we perform a similar analysis as in
279
Section 4 but for the QGPML2 method. In Section 6, we discuss the numerical issues and describe numerical algorithms for implementing of the PML4 and QGPML2 methods. Several MonteCarlo exercises demonstrating the usefulness of the methods proposed in our paper are presented in Section 7. This section contains a discussion on computational issues linked with the quartic exponential distribution, and it also presents four Monte-Carlo experiments, each of which numerically demonstrates a different property of PML4 or QGPML2. Conclusions are presented in Section 8. Finally, to not interrupt our discussion of the essential ideas of this paper, some proofs and other technical details are presented in Appendices. 2. Preliminaries 2.1. Exponential families Let us consider a measure space (Y, A, ν) where A is a σ field and ν a σ -finite measure. An exponential family is a family of probability distributions on (Y, A) which are equivalent to ν and with pdfs of the form:
ℓ(y, λ) = exp λ′ T (y) − ψ(λ) ,
λ ∈ Λ ⊂ Rp ,
where T (y) is a p-dimensional vector defined on Y, and ψ(λ) is a normalizing constant, equal to the Log–Laplace transform of ν T , equivalent to the image of ν by T . Such families have many well-known properties. Some of them will be useful in the rest of the paper, and they are summarized below (the proofs can be found for instance in Barndorff-Nielsen, 1978, Monfort, 1982, or Brown, 1986). (1) Λ can be taken as the convex set where the Laplace transform of ν T is defined. ˚ , interior of Λ, all the moments of the statistic T (2) For any λ ∈ Λ exist, and in particular, we have: Eλ (T ) =
∂ ψ(λ), ∂λ
Vλ ( T ) =
∂2 ψ(λ), ∂λ∂λ′
which implies:
∂ Eλ (T ) ∂2 = ψ(λ) = Vλ (T ). ∂λ′ ∂λ∂λ′ (3) The Fisher information matrix IF (λ) is equal to Vλ (T ) = ∂ 2 ψ(λ)/∂λ∂λ′ . (4) The model is identifiable if, and only if, IF (λ) is invertible for any λ ∈ Λ. (5) If the model is identifiable, then the mapping λ → Eλ (T ) is injective. 2.2. Quartic exponential family We consider the particular case where Y = R, A = BR (the Borelian σ -field of R), ν is the Lebesgue measure on R, and T (y) = (y, y2 , y3 , y4 )′ . In other words, we consider the pdfs on R defined by:
ℓ(y, λ) = exp
4 −
λi y − ψ(λ) , i
i =1
with λ = (λ1 , λ2 , λ3 , λ4 )′ .
(2.1)
We will also use the notation:
ℓ(y, λ) = exp λ0 +
4 − i=1
λi y
i
,
with λ0 = −ψ(λ).
(2.2)
280
A. Holly et al. / Journal of Econometrics 162 (2011) 278–293
This type of density has been extensively used in the entropy literature, e.g. Golan et al. (1996), since it is obtained by maximizing, with respect to f , the entropy − R f (y) log f (y) dy, under a set of data moment-consistency constraints R yi f (y) dy = mj , for j = 1, . . . , 4, where the mj are given, as well as a normalization constraint R f (y) dy = 1. The set Λ where ℓ(y, λ) is defined is easily obtained. If λ4 < 0, ℓ(y, λ) is always integrable. If λ4 > 0, ℓ(y, λ) is never integrable. Finally, if λ4 = 0, ℓ(y, λ) is integrable if λ3 = 0 and λ2 < 0 and we get the Gaussian family. In other words, Λ is defined by:
Λ=R×R×R×R
−∗
+R × R
−∗
×{0} × {0},
−∗
where R is the set of strictly negative numbers. This family will be called the quartic exponential family and denoted by {Q (λ), λ ∈ Λ}. One should note that the variance–covariance matrix of T (Y ) = (Y , Y 2 , Y 3 , Y 4 )′ is invertible everywhere, since otherwise there would exist a linear relation between Y , Y 2 , Y 3 , Y 4 , i.e. the support of the distribution of Y would be made of at most four points, which is impossible since this distribution is absolutely continuous with respect to the Lebesgue measure. Therefore, using the general properties (3) and (4) we see that the model is identifiable. Moreover, using (5) we conclude that the mapping λ → [mi (λ), i = 1, . . . , 4], where mi (λ) = Eλ (Y i ), is injective. Denoting by s(λ) and k(λ) the skewness and kurtosis [s(λ) = Eλ [Y − E (Y )]3 /[Vλ (Y )]3/2 , k(λ) = Eλ [Y − E (Y )]4 /[Vλ (Y )]2 ], respectively, it is also clear that the mapping:
λ → [m1 (λ), m2 (λ), s(λ), k(λ)], λ → [m(λ), σ (λ), s(λ), k(λ)], 2
where m(λ) = m1 (λ), and σ 2 (λ) = m2 (λ) − m21 (λ). It is important to check whether the previous mapping is also surjective, which is to say that it can reach any admissible value of (m1 , m2 , s, k). It is well known that the set D of admissible values of (m, σ 2 , s, k) is defined by: m ∈ R, σ ≥ 0, s ∈ R,
k ≥ s + 1. 2
1 s
s
k−1
,
and therefore that k − 1 − s2 ≥ 0. Moreover, the boundary k = s2 + 1 is reached if Y and Y 2 are linked linearly, that is, if the support is made of at most two points. Therefore, this boundary clearly cannot be reached by the quartic family, and the boundary point σ 2 = 0 cannot be reached either (for the same reason). Therefore, the natural question is now the following: is the range of the mapping λ → [m(λ), σ 2 (λ), s(λ), k(λ)] defined by: D = m ∈ R, σ 2 > 0, s ∈ R, k > s2 + 1 ?
There is a one-to-one relationship between Λ∗ and D∗ , and, moreover, all the points of D not belonging to D∗ can be approached as closely as wished by a distribution of {Q (λ), λ ∈ Λ∗ }. In the sequel, we will take the quartic exponential family {Q (λ), λ ∈ Λ∗ } as the auxiliary family on which the semi-parametric estimator of the parameters of interest will be based.1 2.3. M-estimators and Quasi-Generalized M-estimators Let us consider an endogenous variable Yi and a vector of exogenous variables Xi . For simplicity, we assume that (Yi , Xi ) for i = 1, . . . , n are i.i.d. Standard extensions can be found in Gallant (1987), Holly (1993), or White (1994). To each possible conditional distribution of the Yi ’s given the Xi ’s, we associate a parameter θ ∈ Θ ⊂ RK . In particular, the value of the parameter corresponding to the true conditional distribution of the Yi ’s given the Xi ’s is called the true value of the parameter, and it is denoted by θ0 . The true distribution of the sequence (Yi , Xi , i ∈ N) is denoted by P0 . Throughout this paper, we adopt the notation corresponding to a conditional static model, but the results could be extended to a stationary conditional dynamic model by replacing Yi by Yt and Xi by (Yt −1 , . . . , Y1 , Xt , . . . , X1 ). An M-estimator of θ0 is an estimator θˆn obtained by maximizing, with respect to θ , an objective function of the form: n −
ϕ(Yi , Xi , θ ).
(2.3)
Under standard regularity conditions (see e.g. Chamberlain, 1987, Newey, 1990, White, 1994 and Gourieroux and Monfort, 1995a), it can be shown that θˆn is a consistent estimator of θ0 , for any θ0 , if the limit function
ϕ∞ (θ , P0 ) = P0 lim
n 1−
n i =1
ϕ(Yi , Xi , θ )
has a unique maximum at θ = θ0 . Moreover, the limit function can be written:
ϕ∞ (θ , P0 ) = EX E0 ϕ(Y , X , θ ),
The latter inequality is obtained, for instance, by noting that the variance–covariance matrix of (Y , Y 2 ) where E (Y ) = 0, V (Y ) = 1, is given by
D∗ = D − (m, σ 2 , s, k), s = 0, k ≥ 3 .
i=1
is injective. The same is true for the mapping
2
D except the points corresponding to s = 0, and k ≥ 3. We will denote by Λ∗ the set R × R × R × R−∗ , and by D∗ the set
The answer is no, but it can be shown (see Junk, 2000) that the range is almost equal to this set, in the sense that all the admissible values of (m, σ 2 , s, k) can be reached except for those corresponding to the set of measure zero, defined by s = 0, k > 3. Moreover, if we exclude the case λ4 = 0 (and therefore λ3 = 0), in other words, if we restrict the Q (λ) family to the case where λ4 < 0, then the only probability distributions excluded are the normal distributions (corresponding to s = 0 and k = 3) and, therefore, the range of the m reached by this restricted family is
(2.4)
where E0 is the conditional expectation operator associated to the true conditional distribution of Yi , given that Xi = x (independent of i) and EX is the expectation with respect to the distribution PX , of any Xi . Let us now consider two subvectors θ ∗ and θ ∗∗ of θ . These subvectors are not necessarily disjoint, and in particular, we can have θ ∗ = θ ∗∗ = θ . We assume that θ0∗ , the true value of θ ∗ , can be consistently estimated by θˆn∗ defined by:
θˆn∗ = Argmax θ∗
n − ϕ Yi , Xi , θ ∗ , a Xi , θˆn∗∗ ,
(2.5)
i=1
where a is some function, and θˆn∗∗ is a consistent estimator of θ0∗∗ , the true value of θ ∗∗ . In other words, θ0∗ gives the unique maximum in θ ∗ of:
P0 lim
n 1−
n i =1
ϕ Yi , Xi , θ , a Xi , θn ∗
ˆ ∗∗
= EX E0 ϕ Y , X , θ ∗ , a X , θ0∗∗ .
(2.6) (2.7)
1 Note that in particular the moments generated by standard exponential families like the binomial, gamma, and Poisson are reached, since their skewness is non zero.
A. Holly et al. / Journal of Econometrics 162 (2011) 278–293
281
Such an estimator is called a Quasi-Generalized M-estimator of θ0∗
Proposition 2. We have ∂ m/∂λ′ = Σ , and therefore, ∂λ/∂ m′ = Σ −1 .
defined by:
Proof. This is a direct consequence of the general property (2) of Section 2.1.
∗ (QGM estimator). The corresponding unfeasible M-estimator θˆ0n is
∗ θˆ0n
n − = Argmax ϕ Yi , Xi , θ ∗ , a Xi , θ0∗∗ ,
θ∗
(2.8)
i=1
and is also consistent. As far as the asymptotic normality of the M and QGM estimators is concerned, we have the following properties. √ The asymptotic distribution of n(θˆn − θ0 ) is N [0, J −1 (θ0 )I (θ0 ) J −1 (θ0 )] where J (θ0 ) = −EX E0
[
] ∂ 2 ϕ(Y , X , θ0 ) , ∂θ ∂θ ′
(2.9)
]
3. Properties of the exponential quartic family Let us denote by:
Proposition 1. We have:
∂λ0 (m) ∂λ′ (m) + m = 0. ∂m ∂m
(3.1)
Proof. We have: 4 −
and the equality holds if and only if m = m0 . Proof. From Kullback’s inequality, we know that: Em0 [Logℓ [y, λ(m)]] ≤ Em0 [Logℓ [y, λ(m0 )]] , or Em0 λ0 (m) + λ′ (m)T (Y ) ≤ Em0 λ0 (m0 ) + λ′ (m0 )T (Y ) ,
λj (m)yj ,
j=1 4 ∂ Logℓ [y, λ(m)] ∂λ0 (m) − ∂λj (m) j = + y. ∂m ∂m ∂m j=1
The result follows by taking the expectation and using the fact that the score vector is of zero mean.
or
λ0 (m) + λ′ (m)m0 ≤ λ0 (m0 ) + λ′ (m0 )m0 , and the inequality holds. Moreover, m0 is the unique maximum of λ0 (m) + λ′ (m)m0 for the following reasons. When equality holds in Kullback’s inequality, we have, because of the strict concavity of the Log function, l[y, λ(m)] = l[y, λ(m0 )] almost everywhere, therefore λ(m) = λ(m0 ) because the quartic family is identifiable (see Section 2.2) and, finally, m = m0 since the mapping between λ and m is one-to-one. 4. PML 4 method
We adopt a semi-parametric approach based on the specification of the conditional moments up to fourth order. It is obviously equivalent to specifying (m1 , m2 , m3 , m4 ) or (m1 , σ 2 , s, k). Moreover, to satisfy the inequality k > s2 + 1, it could be convenient to specify (m1 , σ 2 , s, k∗ ), where k∗ = k − s2 − 1, which could be called the over-kurtosis, since (s, k∗ ) is only constrained to belong to R+ × R+ . We consider the latter parametrization, but the results could be adapted to other parametrizations in a straightforward manner. We therefore specify the following functions: m(xi , θ1 ), σ 2 (xi , θ2 ), s(xi , θ3 ), and k∗ (xi , θ4 ). Note that θ1 , θ2 , θ3 , and θ4 may have some components in common, and we denote by θ the union of θ1 , θ2 , θ3 , and θ4 without repetition (in particular, we could have θ1 = θ2 = θ3 = θ4 = θ ). We denote by Θ the range of θ . For a given xi and θ , we can compute the coefficients λ0 , λ1 , λ2 , λ3 , λ4 of the quartic exponential distribution having the same mean, variance, skewness and kurtosis. As mentioned in Section 2, this can always be done unless the skewness is zero and the kurtosis larger than 3, but even then these values can be closely approached. Let us denote these coefficients by λj (xi , θ ), j = 0, . . . , 4. Definition 1. The fourth order Pseudo Maximum Likelihood estimator of θ0 , called PML4 and denoted by θˆn is defined by:
θˆn = Argmax θ∈Θ
Corollary 1. We have:
∂ 2 λ0 (m) − ∂ 2 λj (m) ∂λ′ (m) + mj + = 0. ′ ′ ∂ m∂ m ∂ m∂ m ∂m j =1
4.1. Definition
ℓ(y, λ) = exp(λ0 + λ1 y + λ2 y2 + λ3 y3 + λ4 y4 ), the pdf of the exponential family, and where λ = (λ1 , λ2 , λ3 , λ4 )′ , and λ ∈ Λ∗ = R3 × R−∗ . We know that there is a one to one relationship between Λ∗ and D∗ (see Section 2.2). Let us denote by M the range of m = (m1 , m2 , m3 , m4 )′ , where mi = E (Y i ) corresponding to D∗ . The mapping m(λ) from Λ∗ to M is bijective, and we denote by λ(m) the inverse function and λ0 (m) = −ψ[λ(m)].
Logℓ [y, λ(m)] = λ0 (m) +
λ0 (m) + λ′ (m)m0 ≤ λ0 (m0 ) + λ′ (m0 )m0 ,
∂ϕ(Y , X , θ0 ) ∂ϕ(Y , X , θ0 ) . (2.10) ∂θ ∂θ ′ A nice property of the QGM-estimator θˆn∗ of θ0∗ is the following. If ∂ 2 ϕ(Y , X , θ0∗ , a X , θ0∗∗ ) | X = 0, (2.11) E0 ∂θ ∗ ∂ a′ √ ∗ √ then n(θˆn∗ −θ0∗ ) has the same asymptotic distribution as n(θˆ0n − ∗ −1 −1 ˜ ˜ ˜ ˜ ˜ θ0 ). Namely, N 0, J (θ0 )I (θ0 )J (θ0 ) , where J (θ0 ) and I (θ0 ) are of the same form as (2.9) and (2.10), θ being replaced by θ ∗ and ϕ(Y , X , θ ) by ϕ(Y , X , θ ∗ , a(X , θ0∗∗ )). I (θ0 ) = EX E0
[
Proposition 3. For any pair m, m0 ∈ M, we have:
n − 4 −
λj (xi , θ )yji .
i=1 j=0
4
(3.2)
Proof. The proof is straightforward by differentiating the identity (3.1) of Proposition 1 once more. Let us denote by Σ the variance–covariance matrix of T (Y ) = (Y , Y 2 , Y 3 , Y 4 )′ which is positive definite (since the support of Y is not reduced to point masses).
Condition 1. We assume that the semiparametric model is identifiable, i.e. that: (1) if m1 (xi , θ1 ) = m1 (xi , θ¯1 ), σ 2 (xi , θ2 ) = σ 2 (xi , θ¯2 ), s(xi , θ3 ) = s(xi , θ¯3 ), k∗ (xi , θ4 ) = k∗ (xi , θ¯4 ) (PX almost surely), then we have θ = θ¯ . Note that Condition 1 is equivalent to (2) λj (xi , θ ) = λj (xi , θ¯ ), j = 0, . . . , 4 (PX almost surely) implies θ = θ¯ .
282
A. Holly et al. / Journal of Econometrics 162 (2011) 278–293
It is important to stress that the exponential quartic family is a tool providing estimation procedures but that we do not assume that the true (conditional) pdf belongs to this family. 4.2. Asymptotic properties
Propositions 4 and 5 show that the PML4 method based on the quartic exponential family provides consistent and asymptotically normal estimators of the parameters specifying the conditional moments of order one to four. It is also important to note that the unique family with these properties is the generalized quartic family:
4 −
Proposition 4. Under standard regularity conditions, if the semiparametric model is identifiable, the PML4 estimator θˆn is consistent.
exp λ0 (m) +
Proof. From the properties of the M estimators mentioned in Section 2.3, we have to prove that the limit function (2.4) ϕ∞ (θ, P0 ) = EX E0 ϕ(Y , X , θ ) has a unique maximum at θ0 . Here we have:
where m = (m1 , m2 , m3 , m4 ).
ϕ∞ (θ, P0 ) = EX E0
4 −
λj (X , θ )Y
j
i=1
Proposition 6. Let f (y, m) be a family of pdfs on R indexed by their moments m = (m1 , m2 , m3 , m4 )′ . If the PML method based on the maximization of n −
j =0
= EX λ0 (X , θ ) +
4 −
λj (X , θ)mj0 .
is consistent for any specification of the conditional moments, any true conditional distribution satisfying the moment specification for some value θ0 of θ = (θ1 , . . . , θ4 )′ , and any distribution PX of X , then f (y, m) is of the type:
Using Proposition 3, we know that
ϕ∞ (θ, P0 ) ≤ ϕ∞ (θ0 , P0 ),
and that ϕ∞ (θ , P0 ) = ϕ∞ (θ0 , P0 ) if and only if λj (X , θ ) = λj (X , θ0 ), j = 0, . . . , 4, PX almost surely in X , and, therefore, using the identification assumption, if and only if θ = θ0 . The result follows. Proposition 5. Under standard regularity if the semi conditions, parametric model is identifiable,
√
n θˆn − θ0
is asymptotically
distributed as
f (y, m) = exp λ0 (m) +
∂ m′ (X , θ0 ) −1 ∂ m(X , θ0 ) Σ (X , θ0 ) , ∂θ ∂θ ′ [ ′ ] ∂ m (X , θ0 ) −1 ∂ m(X , θ0 ) −1 I (θ0 ) = EX Σ (X , θ0 )Ω (X )Σ (X , θ0 ) , ∂θ ∂θ ′ ]
where Σ (X , θ0 ) is the conditional variance–covariance matrix of T (Y ) = (Y , Y 2 , Y 3 , Y 4 )′ given X in the quartic conditional distribution associated with λj (X , θ0 ), j = 0, . . . , 4, and where Ω (X ) is the true conditional variance–covariance matrix of T given X. Proof. See Appendix A.
∂ m(X , θ0 ) ∂ m ∂µ(X , θ0 ) = . ′ ∂θ ∂µ′ ∂θ ′
θ1 ≥ 0; θ2
1 θ1 θ
2
θ1 θ2 θ3
θ2 θ3 ≥ 0. θ4
In other words, if for all θi (i = 1, . . . , 4) we have E (Y i − θi ) = 0, then we must have:
[ E
] ∂ Logf (Y , θ ) = 0. ∂θ
Using a version of the Farkas Lemma (see Lemma 8.1 in Gourieroux and Monfort, 1995a, p. 252), we conclude that: 4 ∂ Logf (y, θ ) − = λi (θ )(yi − θi ). ∂θ i =1
Note that the generalized quartic family can be seen as a quartic family with respect to the modified measure dν ∗ (y) = exp(a(y))dν(y), ν being the Lebesgue measure on R. 5. QGPML2 method 5.1. Alternative parametrization
Therefore: Corollary 2. We have,
] ∂µ′ (X , θ0 ) ∂ m′ −1 ∂ m ∂µ(X , θ0 ) Σ (X , θ0 ) ′ , ∂θ ∂µ ∂µ ∂θ ′ [ ′ ∂µ (X , θ0 ) ∂ m′ −1 I (θ0 ) = EX Σ (X , θ0 )Ω (X ) ∂θ ∂µ ] ∂ m ∂µ(X , θ0 ) × Σ −1 (X , θ0 ) ′ . ∂µ ∂θ ′
λi (m)yi + a(y) .
Integrating the latter equation gives the result.
Formulas giving J (θ0 ) and I (θ0 ) contain the Jacobian matrices ∂ m(X , θ0 )/∂θ ′ . If the parametrization used is not m = (m1 , m2 , m3 , m4 )′ but instead µ = (m1 , σ 2 , s, k∗ )′ , we must compute ∂ m(X , θ0 )/∂θ ′ as a function of µ = [m1 (X , θ1 ), σ 2 (X , θ2 ), s(X , θ3 ), and k∗ (X , θ4 )]′ , and we get:
J (θ0 ) = EX
Proof. Under the assumptions of Proposition 6, we must have, in particular, the consistency property in a model without exogenous variables. Furthermore, this model must possess the parametrization θi = E (Y i ), i = 1, . . . , 4, where θ = (θ1 , θ2 , θ3 , θ4 )′ belongs to the interior of the domain defined by:
where
[
4 − i=1
1 θ1
N 0, J −1 (θ0 )I (θ0 ) J −1 (θ0 ) ,
J (θ0 ) = EX
Logf [yi , m1 (xi , θ1 ), m2 (xi , θ2 ), m3 (xi , θ3 ), m4 (xi , θ4 )]
i=1
j =1
λi (m)y + a(y) , i
[
(4.1)
We have seen that the quartic exponential family can be equivalently parametrized by λ = (λ1 , λ2 , λ3 , λ4 )′ , by m = (m1 , m2 , m3 , m4 )′ , or by µ = (m1 , σ 2 , s, k∗ )′ . There is a fourth parametrization that will be of great interest, namely (m1 , m2 , λ3 , λ4 )′ or equivalently ν = (m1 , σ 2 , λ3 , λ4 )′ . First, we have to show that this is indeed a genuine parametrization. Proposition 7. There is a one-to-one relationship between (λ1 , λ2 , λ3 , λ4 ) and (m1 , m2 , λ3 , λ4 ).
(4.2)
Proof. We have to prove that for any (λ3 , λ4 ) the relationship
(λ1 , λ2 ) → [m1 (λ1 , λ2 , λ3 , λ4 ), m2 (λ1 , λ2 , λ3 , λ4 )]
A. Holly et al. / Journal of Econometrics 162 (2011) 278–293
is one-to-one. We have seen that the Jacobian matrix ∂ m/∂λ′ = Σ is symmetric and positive definite ∀m, so the same is true for the upper (2 × 2) block-diagonal submatrix. Moreover, for any given (λ3 , λ4 ) fixed, the section of Λ∗ is convex, and therefore, using Theorem 6 in Gale and Nikaido (1965), we obtain the required result. The previous result means that, starting from the quartic family Q (Λ∗ ), we can reparameterize it as Q (m1 , σ 2 , λ3 , λ4 ), and therefore, fixing (λ3 , λ4 ) at any admissible value (λ03 , λ04 ), we get a quadratic exponential family Q (m1 , σ 2 , λ03 , λ04 ) in the sense of Gourieroux et al. (1984). We denote by λ∗1 (m1 , σ 2 , λ3 , λ4 ), λ∗2 (m1 , σ 2 , λ3 , λ4 ) and λ∗0 (m1 , σ 2 , λ3 , λ4 ) the functions giving λ1 , λ2 , λ0 in terms of m1 , σ 2 , λ3 , λ4 .
and, therefore, the objective function of Definition 2 is equivalent to: n −
Logf (yi | m1 (xi , θ1 ), σ 2 (xi , θ2 ), λ3 (xi , θ˜n ), λ4 (xi , θ˜n )),
i =1
since the terms λ3 (xi , θ˜n )y3i +λ4 (xi , θ˜n )y4i do not depend on (θ1 , θ2 ). The method is called QGPML2 because only yi and y2i are involved, and it is clearly an example of a Quasi Generalized M-estimator. Moreover, we obtain the following important property: Proposition 8. The QGPML2 estimator (θˆ1n , θˆ2n ) is asymptotically equivalent to the unfeasible estimator based on the maximization of (2)
Ln0 (θ1 , θ2 ) n −
5.2. QGPML2 method
= We assume that the conditional mean and variance are specified as m1 (Xi , θ1 ) and σ 2 (Xi , θ2 ), and the conditional skewness and over-kurtosis are specified as s(Xi , θ3 ) and k∗ (Xi , θ4 ). We can first estimate (θ1 , θ2 ) by the PML2 method based on the Gaussian family, i.e. by solving the problem:
(θ˜1n , θ˜2n ) = argmin θ1 ,θ2
n −
Logσ 2 (Xi , θ2 ) +
[Yi − m1 (Xi , θ1 )]2
σ (Xi , θ2 ) 2
i=1
. (5.1)
283
λ∗0 m1 (xi , θ1 ), σ 2 (xi , θ2 ), λ3 (xi , θ0 ), λ4 (xi , θ0 )
i=1
+ λ∗1 m1 (xi , θ1 ), σ 2 (xi , θ2 ), λ3 (xi , θ0 ), λ4 (xi , θ0 ) yi + λ∗2 m1 (xi , θ1 ), σ 2 (xi , θ2 ), λ3 (xi , θ0 ), λ4 (xi , θ0 ) y2i . Proof. According to the result given in Eq. (2.11), we have to check that:
Next, we compute
∂2 ′ λ∗0 + λ∗1 Yi + λ∗2 Yi2 | X = 0. m1 λ3 ∂ ∂ λ4 σ2
Yi − m1 (Xi , θ˜1n ) uˆ i = , σ (Xi , θ˜2n )
E0
and obtain consistent estimators of θ˜3n and θ˜4n of θ3 and θ4 from the nonlinear regressions of uˆ 3i on s(Xi , θ3 ) and uˆ 4i − s(Xi , θ˜3n )2 − 1 on k∗ (Xi , θ4 ). Explicitly, this corresponds to obtaining the pair (θ˜3n , θ˜4n ), verifying:
Differentiating Logf (yi | xi ; m1 , σ 2 , λ3 , λ4 ) with respect to m1 and σ 2 and then taking the expectation we get:
θ˜3n = argmin θ3
θ˜4n = argmin θ4
N −
∂
(ˆu3i − s(Xi , θ3 ))2 ,
i=1 N −
(ˆ − s(Xi , θ˜3n ) − 1 − k (Xi , θ4 )) . u4i
2
∗
2
i=1
Then, noting θ˜n = (θ˜1n , θ˜2n , θ˜3n , θ˜4n )′ , we define:
˜ 1i = m1 (xi , θ˜1n ); m
σ˜ i2 = σ 2 (xi , θ˜2n ); s˜i = s(xi , θ˜3n ); k˜ ∗i = k∗ (xi , θ˜4n ); ˜ 1i , σ˜ i2 , s˜i , k˜ ∗i ), j = 3, 4. λ˜ ji = λj (m
∂λ∗ 0 + m1
σ2
B(θ0 ) =
i=1
Note that, using the parametrization (m1 , σ 2 , λ3 , λ4 ) the quartic family of pdf can be written:
f (yi |m1 , σ 2 , λ3 , λ4 ) = exp λ∗0 m1 , σ 2 , λ3 , λ4 + λ∗1 m1 , σ 2 , λ3 , λ4 yi + λ∗2 m1 , σ 2 , λ3 , λ4 y2i + λ3 y3i + λ4 y4i ,
∂
m1
σ2
∂ m (X , θ ) 1 10 ∂θ 1 EX 0
2
∂ m1 (X , θ10 ) m1 (X , θ10 ) ∂θ1 ∂σ 2 (X , θ20 ) ∂θ2
× Ω1−1 (X , θ0 ) ∂ m1 (X , θ10 ) ∂θ1′ × ∂ m1 (X , θ10 ) 2 m1 (X , θ10 ) ∂θ1′
λ∗0 m1 (xi , θ1 ), σ 2 (xi , θ2 ), λ˜ 3i , λ˜ 4i
+ λ∗1 m1 (xi , θ1 ), σ 2 (xi , θ2 ), λ˜ 3i , λ˜ 4i yi + λ∗2 m1 (xi , θ1 ), σ 2 (xi , θ2 ), λ˜ 3i , λ˜ 4i y2i .
σ2
Proposition 9. The QGPML2 estimator (θˆ1n , θˆ2n ) is consistent, asym√ ptotically normal, and the asymptotic distribution of n[(θˆ1n , θˆ2n ) − (θ10 , θ20 )] is N (0, B(θ0 )) with
Definition 2. The Quasi Generalized PML2 (QGPML2) estimator (θˆ1n , θˆ2n ) of (θ01 , θ02 ) is defined by maximizing with respect to (θ1 , θ2 ): n −
m1
∂λ∗ 2 m2 = 0 ,
for any (m1 , σ 2 , λ3 , λ4 ). Therefore, differentiating further with respect to λ3 and λ4 , we still get zero.
L(n2) (θ1 , θ2 ) =
∂
∂λ∗ 1 m1 +
−1 (5.2) ∂σ 2 (X , θ20 ) ∂θ2′ 0
and
Ω1 (X , θ0 ) = V0
[
]
Y |X . Y2
Proof. See Appendix B.
It is easily seen that B(θ0 ) is equal to the semi-parametric efficiency bound based on the first two conditional moments. From consistent estimators of the asymptotic variance–covariance matrix B(θ0 ), we can deduce Wald and score tests, as well as asymptotic confidence regions.
284
A. Holly et al. / Journal of Econometrics 162 (2011) 278–293
We also note that, from the proof of Proposition 9, we know that the matrices ˜I and J˜ are equal, and therefore we can also use likelihood-ratio type tests (see Gourieroux and Monfort, 1995b, 0 0 Chapter 18). More precisely, denoting by θˆ1n and θˆ2n the con(2)
strained QGMPL2 estimators obtained by maximizing Ln (θ1 , θ2 ) under the null, we can use the test statistic:
0 ˆ0 ξnR = 2 L(n2) (θˆ1n , θˆ2n ) − L(n2) (θˆ1n , θ2n ) .
(5.3)
If the null is g (θ1 , θ2 ) = 0 where g is an r-dimensional vector, then, under the null, ξnR is asymptotically distributed as a χ 2 (r ). If (θ1′ , θ2′ )′ is a p-dimensional vector and the null is of the form θ1 = h1 (γ ) and θ2 = h2 (γ ), where γ is a q-dimensional vector, then under the null, ξnR is asymptotically distributed as a χ 2 (p − q).
It is clear from the formulas given in (i) that in any equivalence class we can find an element for which λ∗4 = −1 and λ∗3 = 0, by starting from any Q (λ) of the class and taking its image by La,b (y) with b = (−λ4 )−1/4 , a = −λ3 /(4λ4 ). Moreover, such an element is unique since another one would be obtained from the first as the image by a linear mapping which is obviously the identity since for both distributions, we should have λ4 = −1 and λ3 = 0. This leads to the following definition: Definition 3. The canonical quartic family Q ∗ (α, β) is defined by the family of probability density functions: exp α0 (α, β) + α z + β z 2 − z 4 ,
(6.1)
where (α, β) ∈ R2 . The previous results immediately give the following corollary:
6. Numerical implementation The implementation of the PML4 and QGPML2 methods necessitates the numerical algorithms that will be described in this section. To this end, let us first introduce the useful notion of a canonical quartic family. 6.1. The canonical quartic family The PML4 method requires the computation of λ = (λ1 , . . . , λ4 ) ∈ Λ∗ = R3 ×R−∗ given ν = (m1 ,σ 2 , s, k) ∈ D∗ , where D∗ = D − (m, σ 2 , s, k), s = 0, k ≥ 3 has been defined in Section 2.2. Although the mapping between Λ∗ and D∗ has been shown to be one-to-one, it is well know (see Maasoumi, 1993 and Ormoneit and White, 1999) that the numerical computation of λ given ν is delicate. Our contribution to the problem is to show that the computation of this four-variate function boils down to the computation of a two variates function, thanks to the introduction of the canonical quartic family. The following results will be useful: Proposition 10. (i) The quartic family {Q (λ), λ ∈ Λ∗ } is globally invariant by any linear mapping La,b (y) = (y − a)/b, a ∈ R, b > 0. (ii) An equivalence relation is obtained in {Q (λ), λ ∈ Λ∗ } by imposing the equality of the skewness and kurtosis. (iii) An equivalent class is obtained by considering the image of any given element of the class, by all the linear mappings La,b (y), a ∈ R, b > 0. Proof. (i) The pdf of the image of Q (λ) by La,b (y) is exp[Logb + λ0 + λ1 (a + bz ) + λ2 (a + bz )2 + λ3 (a + bz )3
+ λ4 (a + bz )4 ],
Corollary 3. The canonical quartic family {Q ∗ (α, β), where (α, β) ∈ R2 } can be parametrized by (s, k) ∈ D∗ . It is now clear that the computation of λ = (λ1 , . . . , λ4 ) for a given ν = (m1 , σ 2 , s, k) can be done in two steps. First compute the appropriate (α, β) corresponding to (s, k), second find the linear mapping such that the image of Q ∗ (α, β) by this mapping has a mean and a variance equal to (m1 , σ 2 ), and the image of Q ∗ (α, β) by this mapping gives the required Q (λ). The first step necessitates solving a non-linear two dimensional system. Once Q ∗ (α, β) is obtained, the computation of its mean m∗1 and its variance σ ∗2 is straightforward as well as the computation of (a, b) defined by the system m1 = a + bm∗1 , σ = bσ ∗ . Finally, once the pair (a, b) is known, we immediately have λ4 = −b−4 and λ3 = −4λ4 a = 4ab−4 . The remaining parameters λ0, λ1 , λ2 are easily obtained from the first three equations of the system given in the proof of Proposition 10, in which λ∗0 = α0 (α, β), λ∗1 = α, λ∗2 = β , since this system is linear recursive in λ0, λ1 , and λ2 , yielding
λ2 = β/b2 − 3aλ3 − 6λ4 a2 ,
(6.2)
λ1 = α/b − 2λ2 a − 3λ3 a − 4λ4 a , 2
3
λ0 = α0 (α, β) − Logb − (λ1 a + λ2 a + λ3 a + λ4 a ).
Ij (α, β) =
λ1 = (λ1 + 2λ2 a + 3λ3 a + 4λ4 a )b, λ∗2 = (λ2 + 3λ3 a + 4λ4 a2 )b2 , λ∗3 = (λ3 + 4λ4 a)b3 , λ∗4 = λ4 b4 . (ii) is obvious and (iii) is proved by first noting that the image of a distribution Q (λ) by La,b (y) has the same skewness and kurtosis as Q (λ) and, second, that any element of a given equivalence class is obtained as the image of any other element of the same class by the linear mapping La,b (y) in which a and b have been adjusted to get the appropriate mean and variance.
(6.4)
q(z ; α, β) = exp(α z + β z 2 − z 4 ),
λ∗0 = Logb + λ0 + λ1 a + λ2 a2 + λ3 a3 + λ4 a4 , 3
4
In Section 6.1 we have seen that a key step in the construction of an exponential quartic density is the computation of α and β for a given pair of skewness and kurtosis, (s, k). Using the notation:
we have to compute the integrals:
2
3
6.2. Computation of the functions α(s, k), β(s, k) using the Gauss– Freud method
which is equal to the pdf of Q (λ∗ ) with ∗
(6.3) 2
∫
∞
z j q(z ; α, β) dz ,
j = 0, . . . , 4.
(6.5)
−∞
Once these integrals are known, we easily get the moments: mj (α, β) = Ij (α, β)/I0 (α, β),
j = 1, . . . , 4,
and, therefore, s(α, β), k(α, β). Finally, we have to minimize in (α, β) the distance:
[s − s(α, β)]2 + [k − k(α, β)]2 .
(6.6)
It is well known that the computation of the parameters for such a problem may be difficult. The basic reason for this is that the function q(z ; α, β) may have two maxima for very different values of z, and it may take very small values in a large area between these
A. Holly et al. / Journal of Econometrics 162 (2011) 278–293
two values of z. Furthermore, one of the maxima may be far out in the tails and yet contribute a relatively important probability mass. This implies that integration methods of the Newton type, based on an equidistant grid, are inadequate. More precisely, approximations of the form:
ˆIj (α, β) = δ
N −
j
zi exp(α zi + β zi2 − zi4 ),
i=0
with zi − zi−1 = δ and i = 1, . . . , N, or even improvements thereof, such as Simpson’s scheme, may necessitate very large values for N , −z0 , and zN to achieve acceptable precision. Typically, we would have to take values like N = 20, 000, and −z0 = zN = 80, which makes the optimization of (6.6) very difficult. In addition, ‘‘smarter’’ integration techniques, based on the Gauss–Lagrange scheme, may be problematic since such a scheme requires first a transform of R into (−1, 1), by the logistic map. This transform essentially varies in a neighborhood of the origin, and it tends to lose information contained in the tails. As a consequence, such schemes, even when performed with a large number of abscissas, tend to be inaccurate even for relatively low values of kurtosis. For this reason, we adopt the Gauss–Freud method (see Freud, 1986 for the seminal work and Levin and Lubinsky, 2001 for a recent research-monograph), which is designed to accurately approximate integrals of the kind:
∫_{−∞}^{+∞} f(z) exp(−z⁴) dz.
This method leads to approximations of the form

Ij∗(α, β) = Σ_{i=0}^{N} zi^j exp(α zi + β zi²) wi,   (6.7)
where the abscissas zi and the weights wi are very precisely adapted to the shape of the function to be integrated. Further details on how the zi and the wi may be computed can be found in Gautschi (2004, Part 1).² Thus, the proposed algorithm has two advantages over the other numerical methods: it uses results on numerical integration specifically tailored to the integration problem at hand and, moreover, the optimization involves only two parameters, resulting in a significant gain in time. We performed all of the numerical integrations using N = 100.³
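To make the procedure concrete, here is a minimal Python sketch of the computation just described. It assumes that Gauss–Freud abscissas `z` and weights `w` for the exp(−z⁴) weight function have already been loaded (e.g., from the Milovanović tables mentioned in footnote 2); the helper names and the zero starting values are our own illustrative choices, not the authors' code.

```python
import numpy as np
from scipy.optimize import minimize

def moments(alpha, beta, z, w):
    """Moments m_1..m_4 of q(z; alpha, beta) via the quadrature rule (6.7)."""
    core = np.exp(alpha * z + beta * z**2)           # exp(-z^4) is absorbed in w
    I = np.array([np.sum(z**j * core * w) for j in range(5)])
    return I[1:] / I[0]                              # m_j = I_j / I_0

def skew_kurt(alpha, beta, z, w):
    """Skewness and kurtosis implied by (alpha, beta) (cf. footnote 4)."""
    m1, m2, m3, m4 = moments(alpha, beta, z, w)
    var = m2 - m1**2
    s = (m3 - 3*m2*m1 + 2*m1**3) / var**1.5
    k = (m4 - 4*m3*m1 + 6*m2*m1**2 - 3*m1**4) / var**2
    return s, k

def fit_alpha_beta(s, k, z, w, start=(0.0, 0.0)):
    """Minimize the distance (6.6) with Nelder-Mead (cf. footnote 7)."""
    def dist(p):
        s_ab, k_ab = skew_kurt(p[0], p[1], z, w)
        return (s - s_ab)**2 + (k - k_ab)**2
    return minimize(dist, start, method="Nelder-Mead").x   # (alpha, beta)
```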
2. Essentially, the weights wj and abscissas xj can be obtained as eigenvectors and eigenvalues of a Jacobi matrix (see Golub and Welsch, 1969). This matrix, in turn, requires a sequence of parameters for which a stable estimation algorithm has been proposed by Noschese and Pasquini (1999) for the exp(−z⁴) weight function. Prof. Milovanović implemented this algorithm and made the resulting parameters available to the public via the website of Prof. Gautschi, in the file coefffreud4.txt under www.cs.purdue.edu/archives/2001/wxg/tables. To obtain the xj and wj, we use his routine found under www.cs.purdue.edu/archives/2002/wxg/codes.
3. The time required to compute the parameters α and β for given skewness and kurtosis currently represents a limitation for the application of these methods to large conditional models. One possibility to circumvent this computation consists of computing the α and β once and for all and then using an interpolation scheme. We leave such an optimization of the program to further research.
4. Skewness and kurtosis are given, respectively, by

s(m) = (m3 − 3m2m1 + 2m1³)/(m2 − m1²)^{3/2},   k(m) = (m4 − 4m3m1 + 6m2m1² − 3m1⁴)/(m2 − m1²)².
6.3. Implementation of the PML4 method

In this section we synthesize the previous sections by presenting an algorithm that describes the computation of λ0, λ1, λ2, λ3, λ4 corresponding to a given m = (m1, m2, m3, m4)′.

(1) Compute s(m), k(m).⁴
(2) Find the pair (α, β) corresponding to s(m), k(m), using the method of Section 6.2.
(3) Compute the mean m∗1(α, β) and the variance σ∗²(α, β) corresponding to the canonical pdf Q∗(α, β).
(4) The linear transform Y = a + bZ, b > 0, gives m1 = a + b m∗1(α, β) and σ = b σ∗(α, β), where σ² = m2 − m1². That is, b = σ/σ∗(α, β) and a = m1 − b m∗1(α, β).
(5) Compute λ4 = −b⁻⁴ and λ3 = −4λ4 a, and use Eqs. (6.2), (6.3) and (6.4) to get λ0, λ1, λ2.

We wish to emphasize that, whereas the previous literature (be it in econometrics, physics, or chemistry) involved the resolution of a non-linear system with four unknowns, our algorithm only involves the resolution of a non-linear system with two unknowns; the rest is elementary algebra. For this reason, our approach is not only numerically more stable but also significantly faster.
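A compact sketch of steps (1)–(5) follows, reusing the hypothetical `moments` and `fit_alpha_beta` helpers from the previous code block. The normalization α0(α, β) = −Log I0(α, β) is implied by the requirement that the canonical density integrate to one (the same logic gives δ0 = −Log I0 in Section 6.4 below).

```python
def lambda_from_moments(m1, m2, m3, m4, z, w):
    """Steps (1)-(5): quartic-exponential parameters for given moments.
    z, w: hypothetical Gauss-Freud nodes/weights for the exp(-z^4) kernel."""
    var = m2 - m1**2                                             # step (1)
    s = (m3 - 3*m2*m1 + 2*m1**3) / var**1.5
    k = (m4 - 4*m3*m1 + 6*m2*m1**2 - 3*m1**4) / var**2
    alpha, beta = fit_alpha_beta(s, k, z, w)                     # step (2)
    ms = moments(alpha, beta, z, w)                              # step (3)
    m1s, var_s = ms[0], ms[1] - ms[0]**2
    b = np.sqrt(var / var_s)                                     # step (4)
    a = m1 - b * m1s
    lam4 = -b**-4                                                # step (5)
    lam3 = -4 * lam4 * a
    lam2 = beta / b**2 - 3*a*lam3 - 6*lam4*a**2                  # (6.2)
    lam1 = alpha / b - 2*lam2*a - 3*lam3*a**2 - 4*lam4*a**3      # (6.3)
    alpha0 = -np.log(np.sum(np.exp(alpha*z + beta*z**2) * w))    # -Log I_0
    lam0 = alpha0 - np.log(b) - (lam1*a + lam2*a**2 + lam3*a**3 + lam4*a**4)  # (6.4)
    return lam0, lam1, lam2, lam3, lam4
```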
6.4. Implementation of the QGPML2 method

The numerical problem is the following: given any admissible value of (m1, σ², λ3, λ4), compute λ∗i(m1, σ², λ3, λ4) for i = 0, 1, 2. According to this approach, it is necessary to allow for densities of any given mean and variance. However, the numerical integration scheme uses the kernel exp(−z⁴), a symmetric kernel that weights observations in a neighborhood of 0. We expect that this may create numerical difficulties for random variables whose mean is distant from 0. For this reason, we consider a computation strategy for the λ∗i in which, in a preliminary step, observations are studentized.⁵ Thus, instead of considering the pdf

exp(λ∗0 + λ∗1 y + λ∗2 y² + λ3 y³ + λ4 y⁴),

it is useful to characterize the associated density that has a mean of zero and a variance of one. This density is related to the previous one by the linear transformation Y = m1 + σZ, where Z represents a random variable with mean 0 and variance 1. The corresponding density, which will be called the studentized exponential quartic, is written as

exp(δ0 + δ1 z + δ2 z² + δ3 z³ + δ4 z⁴).

We have the relations δ4 = σ⁴λ4 < 0 and δ3 = σ³λ3 + 4m1δ4/σ. Since, in the QGPML2 approach, m1, σ², λ3, λ4 are given, the parameters δ3, δ4 are also given. Once the δ0, δ1, δ2 corresponding to a zero mean and unit variance have been obtained using the method described below, one may revert to the initial parameters using:

λ∗0 = (σ⁴δ0 − m1σ³δ1 + m1²σ²δ2 − m1³σδ3 + m1⁴δ4)/σ⁴ − Log σ,
λ∗1 = (σ³δ1 − 2m1σ²δ2 + 3m1²σδ3 − 4m1³δ4)/σ⁴,
λ∗2 = (σ²δ2 − 3m1σδ3 + 6m1²δ4)/σ⁴.

Using the notation

q(z; δ1, δ2) = exp(δ1 z + δ2 z² + δ3 z³ + δ4 z⁴),

we have to compute the integrals

Ij(δ1, δ2) = ∫_{−∞}^{+∞} z^j q(z; δ1, δ2) dz,   j = 0, ..., 2.   (6.8)

Using the change of variable u = (−δ4)^{1/4} z, these integrals become:
5. This studentization is not required for PML4, where the mean m∗1 and the variance (σ∗)² turn out to be close to 0 and 1, respectively.
Ij(δ1, δ2) = ∫_{−∞}^{+∞} [(−δ4)^{−1/4}u]^j exp( δ1(−δ4)^{−1/4}u + δ2[(−δ4)^{−1/4}u]² + δ3[(−δ4)^{−1/4}u]³ − u⁴ ) (−δ4)^{−1/4} du.   (6.9)

Now, the kernel exp(−u⁴) appears again, and we may use the Gauss–Freud method outlined in Section 6.2. Once these integrals are efficiently evaluated, we may compute the moments

mj(δ1, δ2) = Ij(δ1, δ2)/I0(δ1, δ2),   j = 1, 2.

The parameters δ1 and δ2 are obtained by minimizing the distance

[m1(δ1, δ2)]² + [σ²(δ1, δ2) − 1]²,   (6.10)

where σ²(δ1, δ2) = m2(δ1, δ2) − m1²(δ1, δ2). Eventually, δ0 = −Log I0(δ1, δ2).
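The following sketch mirrors this computation, again assuming Gauss–Freud nodes `u` and weights `w` for exp(−u⁴) are available and that `np` and `minimize` have been imported as in the earlier sketches; the function name is hypothetical.

```python
def qgpml2_lambdas(m1, sigma, lam3, lam4, u, w):
    """Section 6.4 sketch: fit (delta1, delta2) so that the studentized
    quartic has mean 0 and variance 1, then map back to lambda*_0..2."""
    d4 = sigma**4 * lam4                         # delta_4 = sigma^4 lambda_4 < 0
    d3 = sigma**3 * lam3 + 4 * m1 * d4 / sigma   # delta_3
    c = (-d4)**-0.25                             # z = c*u under u = (-d4)^{1/4} z

    def integrals(d1, d2):
        z = c * u                                # change of variable (6.9)
        core = np.exp(d1*z + d2*z**2 + d3*z**3) * c   # exp(-u^4) sits in w
        return np.array([np.sum(z**j * core * w) for j in range(3)])

    def dist(p):
        I = integrals(p[0], p[1])
        m1_, m2_ = I[1] / I[0], I[2] / I[0]
        return m1_**2 + (m2_ - m1_**2 - 1.0)**2       # distance (6.10)

    d1, d2 = minimize(dist, (0.0, 0.0), method="Nelder-Mead").x
    d0 = -np.log(integrals(d1, d2)[0])                # delta_0 = -Log I_0
    lam0 = (sigma**4*d0 - m1*sigma**3*d1 + m1**2*sigma**2*d2
            - m1**3*sigma*d3 + m1**4*d4) / sigma**4 - np.log(sigma)
    lam1 = (sigma**3*d1 - 2*m1*sigma**2*d2 + 3*m1**2*sigma*d3
            - 4*m1**3*d4) / sigma**4
    lam2 = (sigma**2*d2 - 3*m1*sigma*d3 + 6*m1**2*d4) / sigma**4
    return lam0, lam1, lam2
```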
7. Numerical examples

In this section, after discussing the computation of the parameters of the quartic exponential for given moments of orders 1–4, we discuss several Monte Carlo exercises demonstrating the usefulness of the methods at hand. The examples we discuss are (1) a comparison between various estimation techniques for small samples, (2) a study of the performance of PML4 in the case of misspecification, and (3) an illustration of QGPML2.

7.1. Computation of the quartic exponential

As discussed in Section 6, the feasibility of the PML4 estimation hinges on the ability to efficiently compute the parameters λ0, ..., λ4 of the quartic exponential for given m1, m2, m3, m4. Ormoneit and White (1999) attribute to Agmon et al. (1979) the first attempts to compute the ‘‘correct’’ λ∗. Agmon et al. (1979) considered the maximization of entropy under moment constraints, that is, the so-called primal problem. They computed the integrals using the Gauss–Legendre scheme after mapping the domain of integration (−∞, ∞) into (−1, 1). Zellner and Highfield (1988) proposed computing the λ∗ by seeking the zeros of the first-order conditions that result from the entropy maximization (i.e., the dual approach). Maasoumi (1993) reported difficulties with this method, which Ormoneit and White (1999) corroborate. Ormoneit and White (1999) also map the domain (−∞, ∞) into (−1, 1), and they use the Gauss–Legendre scheme. Moreover, they feed intelligent starting values into their optimization and stabilize the computation of the exponential to avoid numerical overflow. Last but not least, to our knowledge, they are the first in the literature to have acknowledged numerical difficulties in the λ parameter computation along the segment s = 0, k > 3. Indeed, we know that no density exists for this segment, on the theoretical grounds discussed earlier. Our method hinges instead on obtaining the parameters α and β of the canonical form for given skewness and kurtosis. In this section, we wish to discuss the precision of these computations.⁶
6. All of the programming was performed in the MATLAB environment. We implemented the code on both Mac OS X and Windows Vista machines. All of the simulations were performed on a PC with an Intel quad-core processor running four MATLAB clones in parallel. To increase the speed of the computations, we transcribed the central part of the programs into the C language and called it via a MEX interface.
Fig. 1. Accuracy of α and β. This figure represents the skewness–kurtosis domain for which a density exists (kurtosis on the horizontal axis, from 0 to 20; skewness on the vertical axis; the domain is symmetric with respect to the horizontal axis, and the excluded region is labeled ‘‘Domain where no density exists’’). The circles represent those points for which we computed the parameters α and β. The symbol + represents those points for which the distance between the original skewness and kurtosis and the recomputed skewness and kurtosis (after evaluation of the α and β) is larger than 10⁻⁵.
In order to evaluate the algorithm which gives the parameters α and β of the canonical form, we considered a grid covering the range of values of kurtosis from 1.5 to 20. For each value of kurtosis, k, we considered a grid for skewness ranging from 0.1 to √(k − 1) − 0.1. For each point (s, k) of this grid, we computed α and β as described in Section 6.2 and recomputed the associated skewness and kurtosis, say (s̃, k̃).⁷ In Fig. 1, circles represent the points of the skewness–kurtosis grid for which we evaluated the parameters, and + symbols represent those points for which the distance D = (s − s̃)² + (k − k̃)² > 10⁻⁵. This figure demonstrates several interesting phenomena: (1) even though, on theoretical grounds, no density can exist on the segment (s = 0, k ≥ 3), it is still possible to obtain a density for parameters close to the excluded segment; (2) even for very large values of kurtosis (limited in the figure to 20), we obtain a large range of values of skewness for which a highly accurate density may be obtained. We also constructed a similar graph where kurtosis was allowed to take values up to 150. We find that, even for a kurtosis of 150, the range of skewness where D < 10⁻⁵ runs from 2.5 to 12.1, still a very respectable domain. Many of the difficulties encountered in earlier attempts at such calculations disappear in our approach. The linear transform Y = a + bZ allows us to write the exponential quartic in terms of exp(−z⁴). Then, by replacing exp(−z⁴) by well-behaved discrete weights (this leads to formula (6.7)), we integrate directly over the range (−∞, ∞), thus obviating the use of the logistic map. Next, we only optimize over two parameters, rather than four. We also feed optimized starting values for α and β into the optimizer. These starting values are the α and β corresponding to those points in the domain represented in Fig. 1 that are closest to the given values of skewness and kurtosis. The skewness and kurtosis, and their associated α and β, are stored once and for all in a file that is read into memory when the program is initialized. To further understand some of the difficulties encountered in earlier studies, we obtained the parameters λj, for j = 0, ..., 4, for extremely skewed cases, and evaluated the resulting densities at points far out in the tail (say z = 50 for a centered and reduced density) and still found a small, yet significant probability mass.
7. We perform the required minimization using the Nelder–Mead approach.
The logistic map, which transforms (−∞, ∞) into (−1, 1) and was used in the earlier work, may therefore have ‘fudged’ the behavior of the density for relatively large values of skewness and kurtosis. We also note that earlier work required many integration points (each evaluation costs time), whereas using abscissas and weights that are made specifically for the exp(−z⁴) weighting function reduces the number of points at which the integrand needs to be evaluated.⁸ As the numerical exercises that follow will demonstrate, the time necessary to compute the required α and β parameters is of an order that is suitable for applying the exponential quartic in many econometric problems. The computation of one set of α and β requires about 0.015 s, allowing for about 66 density constructions per second. It is clear that the method proposed here is not confined to econometrics and may prove useful in other fields as well. Similarly to the protocol described above, in the context of the parameter estimations related to QGPML2, we verified the precision of the computation of the δi, for i = 0, ..., 2, for given δ3, δ4, yielding a density with mean 0 and variance 1.

7.2. A first experiment

The objective of this first experiment is to demonstrate that the PML4 estimation may provide estimates which are superior to either the PML2 or the GMM estimators in an unconditional setting. We first discuss the choice of a data generating process, and then we focus on the estimation techniques. A priori, many distributions could be used for this experiment (Student-t, distributions in the Pearson family, Gamma, etc.). Preliminary work made it clear that a distribution should be chosen from which draws can be obtained very rapidly. For this reason we settled on the family of skewed Laplace distributions, denoted sLD. These distributions have been used to price options in the context of extreme return realizations, for example, by Gourieroux and Monfort (2006). This family of distributions has three parameters, b0 > 0, b1 > 0, and c, and its pdf is defined by:
f(z; b0, b1, c) = (b0 b1/(b0 + b1)) exp[b0(z − c)]    if z ≤ c,
f(z; b0, b1, c) = (b0 b1/(b0 + b1)) exp[−b1(z − c)]   if z > c.   (7.1)
The mean, variance, skewness and kurtosis of this density are given by:

m1(c, b0, b1) = E[Y] = c + 1/b1 − 1/b0,   (7.2)
σ²(c, b0, b1) = Var[Y] = 1/b0² + 1/b1²,   (7.3)
s(c, b0, b1) = (2/σ³)[1/b1³ − 1/b0³],   (7.4)
k(c, b0, b1) = (9/σ⁴)[1/b0⁴ + 1/b1⁴] + 6/(σ⁴ b0² b1²).   (7.5)
Since this density may be viewed as describing a mixture of exponentials, we use the inverse c.d.f. technique to simulate random draws from it. In this first experiment, we focus on the situation where c = 0. Indeed, without an additional assumption on one of the parameters of the sLD, we would not be able, in the following, to obtain parameter estimates based on the PML2 principle. We simulated 10,000 samples, each of length T = 25, 50, 100, or 1000 i.i.d. observations. The estimation techniques used were ML, PML2, PML4 and GMM.
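Because the two branches of (7.1) are exponential, the inverse c.d.f. is available in closed form. A minimal sampler follows (our own sketch, not the authors' code, which was written in MATLAB):

```python
import numpy as np

def sld_draws(b0, b1, c, size, rng=np.random.default_rng()):
    """Inverse-cdf draws from the skewed Laplace density (7.1).
    With probability b1/(b0+b1) the draw falls at or below c."""
    u = rng.uniform(size=size)
    p_left = b1 / (b0 + b1)                  # P(Z <= c)
    left = u <= p_left
    z = np.empty(size)
    # invert the cdf on each exponential branch
    z[left] = c + np.log(u[left] / p_left) / b0
    z[~left] = c - np.log((1.0 - u[~left]) / (1.0 - p_left)) / b1
    return z
```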
Let us describe the way in which we implemented these estimations. For ML, we maximized, for each sample, the log-likelihood obtained from (7.1):

L_ML = Σ_{i=1}^T Log f(zi; b0, b1).
For PML2, we considered the objective function:

L_PML2 = −T Log σ(b0, b1) − (1/2) Σ_{i=1}^T [(zi − m1(b0, b1))/σ(b0, b1)]².
For PML4, we formed the objective function:

L_PML4 = Σ_{i=1}^T (λ0 + λ1 yi + λ2 yi² + λ3 yi³ + λ4 yi⁴),
where the parameters λ0, ..., λ4 were computed as described in Section 6.3 for (m1(b0, b1), σ(b0, b1)², s(b0, b1), k(b0, b1)). For GMM, we defined (see Hansen, 1982) the 4 × 1 vector

Xi(b0, b1) = [ zi − m1(b0, b1),
  zi² − m1²(b0, b1) − σ²(b0, b1),
  ((zi − m1(b0, b1))/σ(b0, b1))³ − s(b0, b1),
  ((zi − m1(b0, b1))/σ(b0, b1))⁴ − k(b0, b1) ]′,
and considered the distance J = gT(b0, b1)′ S⁻¹ gT(b0, b1), where

gT(b0, b1) = (1/T) Σ_{i=1}^T Xi(b0, b1).
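For illustration, a sketch of the GMM distance under stated assumptions: the moment helper implements (7.2)–(7.5) for c = 0, and `S_inv` is the inverse weighting matrix (the identity in the first step, the inverse of the estimated asymptotic variance–covariance matrix in the second step).

```python
def sld_moments(b0, b1, c=0.0):
    """Mean, std, skewness, kurtosis of the sLD, from (7.2)-(7.5)."""
    m1 = c + 1/b1 - 1/b0
    var = 1/b0**2 + 1/b1**2
    s = 2*(1/b1**3 - 1/b0**3) / var**1.5
    k = (9*(1/b0**4 + 1/b1**4) + 6/(b0**2 * b1**2)) / var**2
    return m1, np.sqrt(var), s, k

def gmm_J(params, z, S_inv):
    """GMM distance J = g_T' S^{-1} g_T for the four moment conditions."""
    b0, b1 = params
    m1, sig, s, k = sld_moments(b0, b1)
    u = (z - m1) / sig
    X = np.column_stack([z - m1,
                         z**2 - m1**2 - sig**2,
                         u**3 - s,
                         u**4 - k])
    g = X.mean(axis=0)
    return g @ S_inv @ g
```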
The GMM estimates were obtained as those parameters minimizing the distance J. The matrix S that appears in the distance was obtained by using, as a first step, the identity matrix, and, as a second step, the asymptotic variance–covariance matrix. Thus, the GMM estimates are asymptotically optimal in the sense that they reach the semiparametric bound. However, it is well known that the preliminary estimation of the optimal matrix S may induce some biases in finite samples. We performed the simulation using (b0, b1) = (2.41, 1.30).⁹ This point corresponds to the moments (m1, m2, s, k) = (1.30, 0.35, 1.15, 6.91). All estimations were performed using the MATLAB fminsearch optimizer and John d'Errico's fminsearchbnd extension. The bounds that we impose for the ML and GMM estimations are wide and only serve to stabilize the optimization; they do not affect the final results.¹⁰ Table 1 displays several statistics for 10,000 simulations and for various sample sizes. The main result that this table conveys is that, for all sample sizes considered, the MSE of the ML estimation dominates, as expected, over all other methods. However, we find that the PML4 technique yields estimates with an MSE up to less
8. We presently do not incorporate a selection rule on the number of abscissas, N, which could further reduce the time needed to compute the α and β.
9. We also used other points, but the results are very similar to the ones reported here.
10. The constraints are 0.001 ≤ b0 ≤ 10 and 0.00001 ≤ b1 ≤ 5. In general, in nonlinear estimations, it may occur that the optimizer proposes values of skewness and over-kurtosis for which the Q(λ) density may not be computed. In such cases, one may issue a warning and force the function being optimized to return a penalty value, telling the optimizer to seek parameters in other regions.
Table 1. This table presents statistics for the estimates of a skewed Laplace distribution where the true parameters are b0 = 2.414 and b1 = 1.301. There are 10,000 simulations for sample sizes of T = 1000, 100, 50, and 25. The 5 (95) percentile of these simulations is denoted by 5-ptile (95-ptile). The mean squared error is denoted by MSE. The total MSE is the sum of the MSE of b̂0 and the MSE of b̂1.

          Mean b̂0  5-ptile b̂0  95-ptile b̂0  MSE b̂0  Mean b̂1  5-ptile b̂1  95-ptile b̂1  MSE b̂1  Total MSE
T = 1000
ML        2.420    2.220       2.633        0.011   1.304    1.217       1.398        0.002   0.013
PML4      2.429    2.174       2.697        0.018   1.306    1.215       1.403        0.002   0.020
PML2      2.426    2.179       2.706        0.018   1.306    1.208       1.409        0.003   0.021
GMM       2.458    2.228       2.723        0.018   1.342    1.231       1.493        0.006   0.024
T = 100
ML        2.440    1.477       3.278        0.166   1.317    1.050       1.651        0.024   0.190
PML4      2.504    1.470       3.494        0.231   1.338    1.053       1.694        0.028   0.260
PML2      2.508    1.492       3.669        0.276   1.332    1.035       1.700        0.030   0.306
GMM       2.776    2.021       3.798        0.341   1.620    1.169       2.267        0.181   0.522
T = 50
ML        2.498    1.487       3.742        0.314   1.346    0.976       1.861        0.054   0.367
PML4      2.606    1.461       4.062        0.454   1.385    0.988       1.946        0.068   0.522
PML2      2.647    1.459       4.585        0.683   1.375    0.963       1.929        0.068   0.751
GMM       3.057    1.956       4.642        0.897   1.860    1.161       2.885        0.525   1.422
T = 25
ML        2.540    1.389       4.131        0.504   1.397    0.882       2.225        0.128   0.632
PML4      2.709    1.386       4.528        0.739   1.466    0.907       2.408        0.173   0.912
PML2      2.860    1.380       6.058        1.675   1.456    0.879       2.365        0.169   1.844
GMM       3.551    1.974       6.008        2.730   2.374    1.202       4.697        2.317   5.047
than half of the one for PML2. As far as GMM is concerned, the total MSE is always the largest, especially when the number of observations is small, confirming the bad behavior of GMM in this setting. This suggests that, in situations where the econometrician has no prior information on the skewed distribution to use for the estimation, the PML4 technique may be a most useful one.

7.3. A second experiment

Many dynamic models are specified in the following way: yt = m(yt−1, θ) + σ(yt−1, θ)εt, where m(yt−1, θ) and σ(yt−1, θ) are functions of the past values yt−1 = {yt−1, ..., y1} of yt and of a parameter θ, {εt} being a zero-mean, unit-variance white noise process. In this kind of setting, the conditional skewness and kurtosis of yt given yt−1 are the same as those of εt and, therefore, do not depend on yt−1. This is strong information that the PML4 method is able to take into account while not assuming any specific distribution for εt. For this reason, we can expect that the PML4 method will perform better than, for instance, PML2 methods and misspecified ML methods. We consider as the setting

yt = σt εt,
σt² = ω + α yt−1² + β σt−1²,   εt ∼ MixN(µ1, µ2, σ1, σ2, p),

where MixN(µ1, µ2, σ1, σ2, p) corresponds to the mixture-of-normals distribution in which a first normal distribution, with mean µ1 and variance σ1², is drawn with probability p, and a second normal distribution, with corresponding parameters µ2 and σ2², is drawn with probability 1 − p. The family of mixtures of two normals can reach any admissible set of mean, variance, skewness, and kurtosis, and this, moreover, in a non-unique way. We choose the parameters of MixN such that its expected value, µ, is zero and its variance, σ², is equal to one. We select several values of skewness and kurtosis for MixN. Furthermore, to investigate the robustness with respect to the choice of εt, we consider two data generating processes. In the first case, we choose the MixN in such a way that it maximizes the entropy under the moment constraints. Formally, we select that density for MixN, say p, which maximizes −∫_{−∞}^{+∞} p(x) Log p(x) dx. In the second case, we impose the same standard deviation for both distributions of the mixture, that is, σ1 = σ2.¹¹ The parameters of the variance dynamic σt² are set to typical values (ω = 0.1, α = 0.05, and β = 0.9). For each simulation, we compute misspecified ML estimates as well as the PML2 and PML4 estimates of ω, α, and β. As misspecified ML we consider estimations where εt is distributed as a (symmetric) Student-t or as an (asymmetric) skewed Student-t.¹² The skewed Student-t has two parameters: λ, characterizing the asymmetry of the density, and η, characterizing the tail-fatness.¹³ The Student-t is misspecified since it is a symmetric distribution whereas the simulated innovations are skewed. The skewed Student-t density is also misspecified since it does not belong to the exponential family. From the work of Newey and Steigerwald (1997) it is known that if one performs likelihood estimations involving asymmetric distributions, it is desirable to include an additional parameter, say ζ, for the location of the innovation density.
11. More details on the construction of these processes are available upon request. The formulas on how to obtain these mixtures may be found in Titterington et al. (1985).
12. The Student-t only has one parameter, ν, describing the fatness of the tails.
13. Hansen's skewed Student-t distribution is defined by

g(z|η, λ) = bc [1 + (1/(η − 2)) ((bz + a)/(1 − λ))²]^{−(η+1)/2}   if z < −a/b,
g(z|η, λ) = bc [1 + (1/(η − 2)) ((bz + a)/(1 + λ))²]^{−(η+1)/2}   if z ≥ −a/b,   (7.6)

where

a ≡ 4λc (η − 2)/(η − 1),   b² ≡ 1 + 3λ² − a²,   c ≡ Γ((η + 1)/2) / (√(π(η − 2)) Γ(η/2)).

The use of a Student-t is a very popular assumption for modeling large outliers in a parsimonious manner; see Bollerslev and Wooldridge (1992). The skewed Student-t was developed by Hansen (1994) and independently by Fernandez and Steel (1998). It has been successfully used by Jondeau and Rockinger (2003) in the context of finance.
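As a sketch, Hansen's density (7.6) translates directly into code (our illustration; parameter-range checks are omitted):

```python
from math import gamma, pi, sqrt

def hansen_skewt_pdf(z, eta, lam):
    """Hansen's (1994) skewed Student-t density (7.6); eta > 2, |lam| < 1."""
    c = gamma((eta + 1) / 2) / (sqrt(pi * (eta - 2)) * gamma(eta / 2))
    a = 4 * lam * c * (eta - 2) / (eta - 1)
    b = sqrt(1 + 3 * lam**2 - a**2)
    s = 1 - lam if z < -a / b else 1 + lam      # branch switch at z = -a/b
    return b * c * (1 + ((b * z + a) / s)**2 / (eta - 2)) ** (-(eta + 1) / 2)
```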
Table 2. This table presents root mean squared errors for various GARCH estimations where we consider as DGP rt = σt εt. The innovation follows a mixture of two normals, εt ∼ MixN(µ1, µ2, σ1, σ2, p), where the parameters are chosen so as to yield a density with mean 0, standard deviation 1, and the skewness and kurtosis values indicated at the top of the table. In columns 1–6, this density maximizes entropy. In columns 7–8, σ1 = σ2. Conditional volatility follows a GARCH(1,1) dynamic σt² = ω + α rt−1² + β σt−1², with ω = 0.1, α = 0.05, and β = 0.9. Various misspecified models are estimated: Garch-T (-SkT, -PML2, -PML4) fits a Student-t (Hansen's (1994) skewed Student-t, a Gaussian, an exponential quartic) to the innovations.
2
3
4
5
6
7
8
Sk Ku
0.2 5
0.8 5
1.6 5
0.2 8
1 8
2 8
1 8
2 8
p
µ1 µ2 σ1 σ2
0.0003 8.2184 −0.0024 2.5758 0.9894
0.0951 1.4531 −0.1527 1.3765 0.8129
0.1597 2.0212 −0.3840 0.7138 0.4117
4.9 10−5 10.0000 −0.0005 10.0500 0.9950
0.0178 2.9795 −0.0540 2.0251 0.8831
0.1089 2.2629 −0.2765 1.1347 0.5127
0.0075 5.1161 −0.0385 0.8961 0.8961
0.0630 3.1706 −0.2133 0.5690 0.5690
Garch-T Garch-SkT Garch-PML2 Garch-PML4
0.1304 0.1463 0.1441 0.1276
0.1076 0.1123 0.1588 0.1053
0.1687 0.0715 0.1353 0.0619
0.1321 0.1449 0.1460 0.1306
0.0934 0.0952 0.1820 0.0933
0.1457 0.0706 0.1597 0.0661
0.0927 0.1017 0.1830 0.0920
0.1392 0.0770 0.1515 0.0553
Formally, we estimate a model where

yt/σt − ζ ∼ εt,   and   σt² = ω + α yt−1² + β σt−1².
In the simulations, we use M = 10, 000 replications of samples containing T = 2000 observations. Such sample sizes are typical for finance applications. A general remark is that the estimation involving the PML2 or the two estimations involving the Student-t take between 0.07 and 0.7 s. The PML4 estimation takes between 1.7 s (for kurtic and moderately skewed data) up to 5.7 s (for very lightly or very heavily skewed distributions).14 Table 2 displays the results from the various simulations. The upper part of that table contains the parameters used for MixN, whereas the lower part contains the Root-Mean-Squared errors, defined as
RMS = [ (1/M) Σ_{i=1}^M (ω̂ⁱ − ω)² + (1/M) Σ_{i=1}^M (α̂ⁱ − α)² + (1/M) Σ_{i=1}^M (β̂ⁱ − β)² ]^{1/2},
where the superscript i indexes the simulations. Columns 1 and 4 consider cases where the distributions are kurtic, yet nearly symmetric. Columns 2 and 5 correspond to cases where the DGP density is moderately skewed; the skewed Student-t can reach the given skewness. Columns 3 and 6 correspond to cases where the density is heavily skewed; the skewed Student-t cannot reach the given theoretical values. Columns 7 and 8 correspond to cases where the mixture of normals assumes the same standard deviation for both components. Inspection of the last line of Table 2 (Garch-PML4) shows that, for all the cases under consideration, the PML4 model outperforms all other estimation techniques. We also notice that, for very skewed distributions, the GARCH-SkT model does well, whereas those GARCH models based on a symmetric distribution generate much larger RMS.
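For reference, a minimal simulation of the DGP used in this experiment (a hypothetical helper of ours; the mixture parameters would be taken from the upper panel of Table 2):

```python
def simulate_garch_mixn(T, mix, omega=0.1, alpha=0.05, beta=0.9,
                        rng=np.random.default_rng()):
    """Simulate r_t = sigma_t * eps_t with GARCH(1,1) variance and
    mixture-of-normals innovations; mix = (p, mu1, mu2, s1, s2)."""
    p, mu1, mu2, s1, s2 = mix
    first = rng.uniform(size=T) < p
    eps = np.where(first, rng.normal(mu1, s1, T), rng.normal(mu2, s2, T))
    r = np.empty(T)
    sig2 = omega / (1.0 - alpha - beta)      # start at the unconditional variance
    for t in range(T):
        r[t] = np.sqrt(sig2) * eps[t]
        sig2 = omega + alpha * r[t]**2 + beta * sig2
    return r
```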
14. All estimations were performed using identical lower and upper bounds on the parameters. We used as starting values for PML2 (ω = 10⁻³, α = 0.15, and β = 0.8). We used the PML2 parameter estimates as starting values for all subsequent estimations. We also used the estimated values of skewness and over-kurtosis from PML2 as starting values for PML4. In all cases, the degrees of freedom of the Student-t are constrained to be larger than 4, which is equivalent to imposing the existence of the kurtosis of the Student-t.
Last, we wish to check the robustness of the simulation results when the innovation distribution is changed. To do so, we replace the entropy-maximizing mixture of normals with one where the standard deviations are the same for both components. The comparison of the results obtained for the last two columns with columns 5 and 6 reveals nearly identical figures (at least for the first two decimals). Again, PML4 emerges as the distributional assumption yielding the lowest RMS. We also checked the RMS at the level of the individual parameters and found that if an overall RMS was better for one model than for another, the RMS improvements came from all parameters. As this experiment suggests, in the case of skewed and kurtic data, great care needs to be exercised if one uses ML estimation with a risk of misspecification. The PML4 method appears to be much more robust.

7.4. A third experiment

In the previous examples, we demonstrated the usefulness of the PML4 technique, first in an i.i.d. case, second in a dynamic case. In both cases the skewness and kurtosis (conditional in the second example) were constant. Here, we consider a case where they depend on an exogenous variable. There are many situations where the modeling of the variation in higher moments may be of importance per se. For instance, Hansen (1994) considered a GARCH model with time-varying skewness and kurtosis. A generic model that captures variation in the higher moments is given by:

yi = µ + σ εi,
εi ∼ D(0, 1, s(xi), k∗(xi)),
s(xi) = 0.5 + a xi,   (7.7)
k∗(xi) = 2 + b xi,  b > 0,
xi ∼ i.i.d. U[1/2, 3/2].
In the second line, D stands for some distribution whose skewness and over-kurtosis depend on an exogenous variable, xi. The next two lines specify how skewness and over-kurtosis are parametrized. We recall that the kurtosis, k, is related to the over-kurtosis, k∗, by k = 1 + s² + k∗. For the simulations, we take the lower moments to be µ = 0 and σ = 1. Furthermore, we take a = 1 and b = 2. As long as the condition b > 0 is imposed in the numerical computation, the model will be well defined. The intercepts 0.5 and 2 in the specification of s(xi) and k∗(xi) guarantee that the distribution will be skewed (here s > 1) and fat-tailed (here k > 4). The D distribution that we choose is the mixture of two normal distributions with identical variances, MixN(µ1, µ2, σ, σ, p),
Table 3. This table presents the statistics of 1500 PML4 estimates of µ, σ, a and b as described in model (7.7). Each sample contained T = 100 observations. Sk and Ku represent the skewness and kurtosis of the parameter estimates. By σSk and σKu we denote the standard deviations of the skewness and kurtosis estimates.

                    µ         σ         a         b
True parameters     0         1         1         2
Average             0.006     1.015     1.109     2.510
Std                 0.071     0.136     0.641     2.058
Median              0.000     1.011     1.057     1.794
Min                 −0.277    0.695     0.000     0.000
Max                 0.304     1.950     4.000     7.000
Sk                  0.666     0.980     1.886     1.290
σSk                 0.063     0.063     0.063     0.063
Ku                  5.607     6.869     9.500     3.398
σKu                 0.126     0.126     0.126     0.126
MSE                 0.005     0.019     0.422     4.490
Table 4. This table reports the results of the QGPML2 simulation described in model (7.8). The true parameters are a = 1 and b = 1. The RMSE is defined as [(1/M) Σ_{j=1}^M (θ̂^(j) − θ)²]^{1/2}, where θ = a or b, and the superscript j = 1, ..., M denotes a simulation. We took M = 30,000. By ∆RMSE (%) we denote the percentage gain in the RMSE if one uses QGPML2 instead of PML2.

PML2                T = 25                T = 100
                    a = 1     b = 1       a = 1     b = 1
Mean                0.996     0.915       0.994     0.956
STD                 0.567     0.249       0.401     0.177
min                 0.001     0.001       0.001     0.331
max                 4.464     2.206       2.848     1.619
RMSE                0.567     0.263       0.401     0.182

QGPML2              T = 25                T = 100
                    a = 1     b = 1       a = 1     b = 1
Mean                0.997     0.917       0.998     0.957
STD                 0.552     0.247       0.393     0.176
min                 0.001     0.001       0.001     0.330
max                 3.880     2.200       2.543     1.641
∆RMSE (%)           2.606     0.937       2.193     0.728
already discussed in the previous section. In the Monte Carlo exercise, we simulate 1500 samples, each with T = 100 observations. The estimations require between about 60 and 340 s, with an average time of about 130 s.¹⁵ Table 3 contains the statistics associated with the various estimations. As this table demonstrates, even though the numerical complexity behind the PML4 computation is significant, the method can actually be implemented even in a Monte Carlo framework with many replications (here 1500). With the feasibility of the method demonstrated, we may now turn to the interpretation of the statistics. We first note that the parameters tend to be estimated rather well if one uses the average of the estimates. The MSE of the mean µ is 0.005. The MSE of the parameter b, which describes the kurtosis of the distribution, takes a higher value. We find that the parameter estimates are skewed and kurtic, and we note that the MSE of the parameter estimates increases with the order of the moment that a given parameter describes. We conclude this section by noting that our method may obviously be used in real applications, for models which may have more parameters, since the model then has to be estimated only once.

7.5. A fourth experiment: QGPML2

To validate the QGPML2 approach, we consider as DGP the observations (yi, xi) generated by:

yi = a xi + exp(b xi) εi,
ui ∼ i.i.d. U(0, 1),
xi = (1 + 29 ui)/10,   (7.8)
θi = (1 + 29 ui) π/180,
εi ∼ sLD(1/sin θi, 1/cos θi, sin θi − cos θi).

The first line specifies the mean and the variance of the model as depending on some exogenous variable xi. The second line defines the ui as uniform draws. From this basic source of exogenous randomness, we construct the xi as uniform random numbers U(1/10, 3). The fourth line specifies θi as an angle that varies between 1 and 30 degrees; the ratio π/180 converts this angle into radians. The last equation specifies that εi is distributed according to the skewed Laplace distribution with mean 0, variance 1, and known skewness and kurtosis. A similar parametrization was chosen in Section 7.3. We select as parameters a = 1 and b = 1.
A preliminary simulation revealed that, for this parametrization, the skewness of the εi ranges between 0 and 2, and the kurtosis between 6 and 9. The QGPML2 estimation is based on the following steps (a sketch follows the list):

(1) Estimate a and b via PML2. This is tantamount to obtaining a and b by maximizing the function:

Σ_{i=1}^T { −b xi − (1/2) [(yi − a xi)/exp(b xi)]² }.

(2) Compute the first-step estimators λ̃j,i for j = 1, ..., 4 by using the information contained in s(xi) and k(xi). This computation is described in Section 5.2. Notice that only λ̃3i and λ̃4i are used in the next step.

(3) Maximize, with respect to a and b, the objective function:

Σ_{i=1}^T (λ∗0,i + λ∗1,i yi + λ∗2,i yi²),

where λ∗j,i = λ∗j(a xi, exp(b xi), λ̃3i, λ̃4i). The computation of the λ∗j,i may be found in Section 6.4. The resulting a and b estimates are the QGPML2 estimates.
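As an illustration, here is a minimal Python sketch of steps (1) and (3). It relies on the hypothetical `qgpml2_lambdas` helper sketched after Section 6.4, on the imports used in the earlier sketches, and on data arrays `x`, `y` and first-step arrays `lam3`, `lam4` assumed given.

```python
def pml2_objective(params, x, y):
    """Step (1): Gaussian PML2 objective (we return its negative to minimize)."""
    a, b = params
    return -np.sum(-b * x - 0.5 * ((y - a * x) / np.exp(b * x))**2)

def qgpml2_objective(params, x, y, lam3, lam4, u, w):
    """Step (3): quartic pseudo-likelihood with lambda_3, lambda_4 frozen."""
    a, b = params
    total = 0.0
    for xi, yi, l3, l4 in zip(x, y, lam3, lam4):
        l0, l1, l2 = qgpml2_lambdas(a * xi, np.exp(b * xi), l3, l4, u, w)
        total += l0 + l1 * yi + l2 * yi**2
    return -total

# usage: first step, then second step started at the PML2 estimates
# ab1 = minimize(pml2_objective, (1.0, 1.0), args=(x, y), method="Nelder-Mead").x
# ab2 = minimize(qgpml2_objective, ab1, args=(x, y, lam3, lam4, u, w),
#                method="Nelder-Mead").x
```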
15 We also performed several estimations involving samples of size T = 1000. The time required for each estimation ranged between about 2100 and 2800 s.
Table 4 reports some statistics for the simulations. Each time we use M = 30,000, a rather large number of simulations, to ascertain that the findings are not spurious. We consider samples of size T = 25 and T = 100.¹⁶ Inspecting the table reveals that, as expected, the dispersion of the estimates obtained in the larger sample tends to be smaller. Comparing the dispersion of the parameters between the PML2 estimates and the QGPML2 estimates reveals an improvement when using QGPML2.¹⁷ For instance, for T = 25, the improvement in the RMSE of the parameter intervening in the mean, a, is 2.6%.
16. Here, we focus on a large number of simulations, each involving a relatively small sample. The time required for this simulation could alternatively be devoted to the estimation of a model with either a larger sample or a more complex model structure involving several parameters. We performed some exploratory analysis with samples of size T = 1000, which required between 500 and 1300 s.
17. We did not pursue a search for settings where the efficiency gain may be more important, as we simply wished to demonstrate the feasibility of the method here. We leave this pursuit for future research.
8. Conclusion

In this paper, we generalize the PML2 and QGPML1 methods proposed in Gourieroux et al. (1984). The main objective of these methods was to propose consistent and asymptotically normal estimators of the parameters which appear in the specification of the first two conditional moments, based on the optimization of possibly misspecified likelihood functions. Here, we extend this approach by considering the first four conditional moments. A key tool is the quartic exponential family. This family allows us to introduce the PML4 and QGPML2 estimators, respectively generalizing PML2 and QGPML1. A complete asymptotic theory is proposed. Another key issue is the numerical computation of the exponential quartic density parameters for given values of the first four moments. The solution adopted in this paper, which is based on an approach proposed by Freud (1986), appears to be very quick and stable, and it solves technical problems which had been stressed in different strands of the literature, e.g., Maasoumi (1993) and Ormoneit and White (1999). In numerical studies, we not only demonstrate the feasibility of the proposed estimation methods, but also show that PML4 may provide more efficient estimates, in particular for small samples, where GMM-based estimates may encounter difficulties. We also consider an example where an econometrician uses either a misspecified ML model or the PML4 model; in that case, the PML4 model demonstrates superior results. Lastly, we show the feasibility of the QGPML2 estimation and, in that context, again demonstrate gains in efficiency. Our estimation method may prove useful in many econometric applications that involve non-Gaussianity of some random variable. For instance, Holly (2009) has recently studied risk adjustment schemes for health care expenditures. A key feature of this problem is the large skewness and kurtosis of the conditional distributions of these expenditures given demographic characteristics and the prior health status of policy holders. The PML4 method is obviously well suited to tackle this kind of problem since it incorporates the additional information on third and fourth order conditional moments and does not necessitate parametric assumptions on the conditional distribution. Moreover, the results in Holly (2009) show that estimations based only on PML2 methods introduce a substantial amount of bias into the risk adjustment. Beyond this, the proposed numerical techniques may be of relevance in Bayesian analysis, independent component analysis, and possibly physics, i.e., in situations where non-Gaussian distributions occur.

Acknowledgements

The third author acknowledges support from the Swiss National Science Foundation through NCCR FINRISK (Financial Valuation and Risk Management). We are grateful to Prof. W. Gautschi for his advice concerning numerical integration and to Prof. G.V. Milovanović for providing a very useful sequence of parameters used for numerical integrations.

Appendix A. Computation of J(θ0) and I(θ0)
J(θ0) = −∂²ϕ∞(θ, P0)/∂θ∂θ′ = −EX [ ∂²λ0(X, θ0)/∂θ∂θ′ + Σ_{j=1}^4 mj0 ∂²λj(X, θ0)/∂θ∂θ′ ].

We have (omitting the variables X and θ0):

∂λj/∂θ = (∂m′/∂θ)(∂λj/∂m),   j = 0, ..., 4,

∂²λj/∂θ∂θ′ = (∂m′/∂θ)(∂²λj/∂m∂m′)(∂m/∂θ′) + Σ_{k=1}^4 (∂λj/∂mk)(∂²mk/∂θ∂θ′),

so that

J(θ0) = −EX [ (∂m′/∂θ)( ∂²λ0/∂m∂m′ + Σ_{j=1}^4 mj0 ∂²λj/∂m∂m′ )(∂m/∂θ′) ] − EX [ Σ_{k=1}^4 ( ∂λ0/∂mk + m10 ∂λ1/∂mk + · · · + m40 ∂λ4/∂mk )(∂²mk/∂θ∂θ′) ],

and, using Proposition 1, Corollary 1, and Proposition 2:

J(θ0) = EX [ (∂m′/∂θ)(∂λ′/∂m)(∂m/∂θ′) ] = EX [ (∂m′/∂θ) Σ⁻¹ (∂m/∂θ′) ].

Similarly,

I(θ0) = EX E0 [ (∂ϕ(Y, X, θ0)/∂θ)(∂ϕ(Y, X, θ0)/∂θ′) ] = EX E0 [ (∂λ0/∂θ + (∂λ′/∂θ) T)(∂λ0/∂θ′ + T′ (∂λ/∂θ′)) ],

where ϕ(Y, X, θ) = Σ_{j=0}^4 λj(X, θ) Y^j and T = (Y, Y², Y³, Y⁴)′. Using Proposition 1, we have (omitting the variables X and θ0):

I(θ0) = EX E0 [ (∂λ′/∂θ)(T − m)(T − m)′(∂λ/∂θ′) ] = EX [ (∂λ′/∂θ) Ω (∂λ/∂θ′) ] = EX [ (∂m′/∂θ)(∂λ′/∂m) Ω (∂λ/∂m′)(∂m/∂θ′) ] = EX [ (∂m′/∂θ) Σ⁻¹ Ω Σ⁻¹ (∂m/∂θ′) ].

Appendix B. Asymptotic behavior of the QGPML2

We focus on the proof of the asymptotic normality, since the proof of the consistency is similar to that of the PML4.

B.1. Preliminaries

Let us consider the quartic family parametrized by ξ′ = (m1, σ², λ3, λ4). If we fix λ3, λ4 at given values λ∗3, λ∗4, then we obtain a family, indexed by (m1, σ²), with log-density

Σ_{j=0}^2 λ∗j(ξ12) y^j + λ∗3 y³ + λ∗4 y⁴,

where ξ12 = (m1, σ²)′ and λ∗j(ξ12) is a notation which stands for λ∗j(m1, σ², λ∗3, λ∗4). It is the log-density of a quadratic exponential family (see GMT (1984)). Differentiating with respect to ξ12, we get
Σ_{j=0}^2 (∂λ∗j/∂ξ12) mj = 0   (with m0 = 1),   (B.1)

and

Σ_{j=0}^2 (∂²λ∗j/∂ξ12∂ξ′12) mj + (∂λ∗′/∂ξ12)(∂m12/∂ξ′12) = 0,   (B.2)

with λ∗ = (λ∗1, λ∗2)′ and m12 = (m1, m2)′.
Moreover,

∂m12/∂ξ′12 = ( 1  0 ; 2m1  1 ).   (B.3)

The variance–covariance matrix Σ1 of (Y, Y²)′ in this family is easily found. For instance, we note that the pdf with respect to the measure µ∗ defined by dµ∗(y) = exp(λ∗3 y³ + λ∗4 y⁴) dy is exp(λ∗0(ξ12) + λ∗1(ξ12) y + λ∗2(ξ12) y²). Using the parametrization (λ∗1, λ∗2) and the general property of the exponential family given in Section 2.1(2):

Σ1 = ∂m12/∂λ∗′,

which, written as a function of ξ12, leads to:

Σ1 = (∂m12/∂ξ′12)(∂ξ12/∂λ∗′).

We therefore have:

∂ξ12/∂λ∗′ = (∂m12/∂ξ′12)⁻¹ Σ1,

hence:

∂λ∗/∂ξ′12 = Σ1⁻¹ (∂m12/∂ξ′12).   (B.4)

B.2. Computation of J̃(θ0)

In the estimation based on the unfeasible equivalent estimator of the QGPML2, we have, noting θ12 = (θ1, θ2):

J̃(θ0) = −EX Σ_{j=0}^2 (∂²λ∗j/∂θ12∂θ′12)[m1(X, θ10), σ²(X, θ20), λ3(X, θ0), λ4(X, θ0)] mj(X, θ0).

But we also have (omitting X and the θ's):

∂λ∗j/∂θ12 = (∂ξ′12/∂θ12)(∂λ∗j/∂ξ12),

∂²λ∗j/∂θ12∂θ′12 = (∂ξ′12/∂θ12)(∂²λ∗j/∂ξ12∂ξ′12)(∂ξ12/∂θ′12) + Σ_{k=1}^2 (∂λ∗j/∂ξk)(∂²ξk/∂θ12∂θ′12).

Therefore,

J̃(θ0) = −EX [ (∂ξ′12/∂θ12)( Σ_{j=0}^2 (∂²λ∗j/∂ξ12∂ξ′12) mj )(∂ξ12/∂θ′12) ] − EX [ Σ_{k=1}^2 Σ_{j=0}^2 (∂λ∗j/∂ξk) mj (∂²ξk/∂θ12∂θ′12) ],

or, using (B.1) and (B.2),

J̃(θ0) = EX [ (∂ξ′12/∂θ12)(∂λ∗′/∂ξ12)(∂m12/∂ξ′12)(∂ξ12/∂θ′12) ],

or, using (B.4),

J̃(θ0) = EX [ (∂ξ′12/∂θ12)(∂m′12/∂ξ12) Σ1⁻¹ (∂m12/∂ξ′12)(∂ξ12/∂θ′12) ].   (B.5)

B.3. Computation of Ĩ(θ0)

Letting T1 = (Y, Y²)′, we have

Ĩ(θ0) = EX E0 [ (∂λ∗0/∂θ12 + (∂λ∗′/∂θ12) T1)(∂λ∗0/∂θ′12 + T′1 (∂λ∗/∂θ′12)) ],

or, using (B.1),

Ĩ(θ0) = EX E0 [ (∂λ∗′/∂θ12)(T1 − m12)(T1 − m12)′(∂λ∗/∂θ′12) ] = EX [ (∂λ∗′/∂θ12) Ω1 (∂λ∗/∂θ′12) ].

We also have

∂λ∗′/∂θ12 = (∂ξ′12/∂θ12)(∂λ∗′/∂ξ12) = (∂ξ′12/∂θ12)(∂m′12/∂ξ12) Σ1⁻¹,

which allows us to write

Ĩ(θ0) = EX [ (∂ξ′12/∂θ12)(∂m′12/∂ξ12) Σ1⁻¹ Ω1 Σ1⁻¹ (∂m12/∂ξ′12)(∂ξ12/∂θ′12) ].

However, since the moments of order one to four are well specified in the pseudo family, the true conditional variance Ω1(X, θ0) of T1 is that corresponding to the model, namely Σ1(X, θ0), and we get:

Ĩ(θ0) = EX [ (∂ξ′12/∂θ12)(∂m′12/∂ξ12) Ω1⁻¹ (∂m12/∂ξ′12)(∂ξ12/∂θ′12) ],

and hence, from (B.5):

Ĩ(θ0) = J̃(θ0),

and Ĩ(θ0)⁻¹ = J̃(θ0)⁻¹ = B(θ0), as given in (5.2). Moreover, using (B.3),

(∂m12/∂ξ′12)(∂ξ12/∂θ′12) = ( 1  0 ; 2m1  1 ) ( ∂m1(X, θ1)/∂θ1  0 ; 0  ∂σ²(X, θ2)/∂θ2 ).

For the sake of simplicity, we have assumed that m1(X, θ1) is only a function of θ1 and that σ²(X, θ2) is only a function of θ2. The result, however, is easily generalized to the case m1(X, θ12), σ²(X, θ12), with the only difference being that ∂ξ12/∂θ′12 is no longer diagonal.
References
Agmon, N., Alhassid, Y., Levine, R.D., 1979. An algorithm for finding the distribution of maximal entropy. Journal of Computational Physics 30, 250–259.
Altonji, J.G., Segal, L.M., 1996. Small-sample bias in GMM estimation of covariance structures. Journal of Business & Economic Statistics 14 (3), 353–366.
Andersen, T.G., Sorenson, B., 1996. GMM estimation of a stochastic volatility model: a Monte Carlo study. Journal of Business & Economic Statistics 14, 328–352.
Arellano-Valle, R.B., Genton, M.G., 2005. On fundamental skew distributions. Journal of Multivariate Analysis 96, 93–116.
Barndorff-Nielsen, O.E., 1978. Information and Exponential Families in Statistical Theory. Wiley, Chichester.
Barndorff-Nielsen, O.E., 1997. Normal inverse Gaussian distributions and stochastic volatility modeling. Scandinavian Journal of Statistics 24, 1–13.
Bollerslev, T., Wooldridge, J.M., 1992. Quasi-maximum likelihood estimation and inference in dynamic models with time varying covariances. Econometric Reviews 11 (2), 143–172.
Brown, L.D., 1986. Fundamentals of Statistical Exponential Families. Institute of Mathematical Statistics, Hayward, California.
Chamberlain, G., 1987. Asymptotic efficiency in estimation with conditional moment restrictions. Journal of Econometrics 34, 305–334.
Doran, H.E., Schmidt, P., 2006. GMM estimators with improved finite sample properties using principal components of the weighting matrix, with an application to the dynamic panel data model. Journal of Econometrics 133 (1), 387–409.
Eberlein, E., Keller, U., 1995. Hyperbolic distributions in finance. Bernoulli 1, 281–299.
Fernandez, C., Steel, M.F.J., 1998. On Bayesian modelling of fat tails and skewness. Journal of the American Statistical Association 93, 359–371.
Freud, G., 1986. On the greatest zero of an orthogonal polynomial, I. Journal of Approximation Theory 46, 16–24.
Gale, D., Nikaido, H., 1965. The Jacobian matrix and global univalence of mappings. Mathematische Annalen 159 (2), 81–93.
Gallant, A.R., 1987. Nonlinear Statistical Models. John Wiley & Sons, New York.
Gautschi, W., 2004. Orthogonal Polynomials: Computation and Approximation. Oxford Science Publications, Oxford University Press.
Genton, M.G., 2004. Skew-Elliptical Distributions and their Applications: A Journey Beyond Normality. Chapman & Hall/CRC, Boca Raton, Florida.
Golan, A., Judge, G., Miller, D., 1996. Maximum Entropy Econometrics: Robust Estimation with Limited Data. Wiley, Chichester.
Golub, G.H., Welsch, J.H., 1969. Calculation of Gauss quadrature rules. Mathematics of Computation 23 (106), 221–230.
Gourieroux, C., Monfort, A., Trognon, A., 1984. Pseudo maximum likelihood methods: theory. Econometrica 52 (1), 681–700.
Gourieroux, C., Monfort, A., 1995a. Statistics and Econometric Models: Volume One. Cambridge University Press, Cambridge.
Gourieroux, C., Monfort, A., 1995b. Statistics and Econometric Models: Volume Two. Cambridge University Press, Cambridge.
Gourieroux, C., Monfort, A., 2006. Pricing with splines. Annales d'Economie et de Statistique 82, 4–33.
Hansen, L.P., 1982. Large sample properties of the generalized method of moments. Econometrica 50, 1029–1054.
Hansen, B., 1994. Autoregressive conditional density estimation. International Economic Review 35, 705–730.
Harvey, C.R., Siddique, A., 1999. Autoregressive conditional skewness. Journal of Financial and Quantitative Analysis 34 (4), 465–487.
Holly, A., 1993. Asymptotic theory for nonlinear econometric models: estimation. In: de Zeeuw, Aart J. (Ed.), Advanced Lectures in Quantitative Economics II. Academic Press.
Holly, A., 2009. Modeling risk using fourth order pseudo maximum likelihood methods. Mimeo, Institute of Health Economics and Management, University of Lausanne.
Holly, A., Pentsak, Y., 2004. Maximum likelihood estimation of the conditional mean E(y | x) for skewed dependent variables in four-parameter families of distribution. Technical Report. Institute of Health Economics and Management, University of Lausanne.
Hood, W., Koopmans, T., 1953. The estimation of simultaneous linear economic relationships. In: Studies in Econometric Method. Yale University Press, New Haven.
Jondeau, E., Rockinger, M., 2003. Conditional volatility, skewness, and kurtosis: existence, persistence, and comovements. Journal of Economic Dynamics and Control 27 (10), 1699–1737.
Junk, M., 2000. Maximum entropy for reduced moment problems. Mathematical Models and Methods in Applied Sciences 10 (7), 1001–1025.
Kitamura, Y., Stutzer, M., 1997. An information-theoretic alternative to generalized method of moments estimation. Econometrica 65 (4), 861–874.
Levin, E., Lubinsky, D.S., 2001. Orthogonal Polynomials for Exponential Weights. In: CMS Books in Mathematics, Springer-Verlag, New York.
Maasoumi, E., 1993. A compendium to information theory in economics and econometrics. Econometric Reviews 12 (2), 137–181.
Manning, W., Basu, A., Mullahy, J., 2005. Generalized modeling approaches to risk adjustment of skewed outcomes data. Journal of Health Economics 24, 465–488.
Mead, L.R., Papanicolaou, N., 1984. Maximum entropy in the problem of moments. Journal of Mathematical Physics 25 (8), 2404–2417.
Monfort, A., 1982. Cours de statistique mathématique. Economica, Paris.
Newey, W.K., 1990. Semiparametric efficiency bounds. Journal of Applied Econometrics 5, 99–135.
Newey, W.K., Steigerwald, D.G., 1997. Asymptotic bias for quasi-maximum likelihood estimators in conditional heteroskedastic models. Econometrica 65, 587–599.
Noschese, S., Pasquini, L., 1999. On the nonnegative solution of a Freud three-term recurrence. Journal of Approximation Theory 99, 54–67.
Ormoneit, D., White, H., 1999. An efficient algorithm to compute maximum entropy densities. Econometric Reviews 18 (2), 127–140.
Sawa, T., 1978. Information criteria for discriminating among alternative regression models. Econometrica 46 (6), 1273–1291.
Stacy, E., 1962. A generalization of the gamma distribution. Annals of Mathematical Statistics 33, 1187–1192.
Stacy, E., Mihram, G., 1965. Parameter estimation for a generalized gamma distribution. Technometrics 7, 349–358.
Tauchen, G., 1986. Statistical properties of generalized method-of-moments estimators of structural parameters obtained from financial market data. Journal of Business & Economic Statistics 4 (4), 397–416.
Titterington, D.M., Smith, A.F.M., Makov, U.E., 1985. Statistical Analysis of Finite Mixture Distributions. John Wiley, Chichester.
White, H., 1982. Maximum likelihood estimation of misspecified models. Econometrica 50 (1), 1–25.
White, H., 1994. Estimation, Inference, and Specification Analysis. Cambridge University Press, Cambridge, UK.
Zellner, A., Highfield, R.A., 1988. Calculation of maximum entropy distributions and approximation of marginal posterior distributions. Journal of Econometrics 37 (2), 195–209.
Ziliak, J.P., 1997. Efficient estimation with panel data when instruments are predetermined: an empirical comparison of moment-condition estimators. Journal of Business and Economic Statistics 15, 419–431.
Journal of Econometrics 162 (2011) 294–311
Integrated variance forecasting: Model based vs. reduced form
Natalia Sizova∗
Department of Economics, Rice University, Box 1892, Houston, TX 77251, United States
Article info
Article history: Received 13 January 2009; received in revised form 12 May 2010; accepted 3 February 2011; available online 15 February 2011.
JEL classification: C22; C53.
Keywords: Volatility forecasting; High-frequency data; Reduced-form methods; Model misspecification.

Abstract
This paper compares model-based and reduced-form forecasts of financial volatility when high-frequency return data are available. We derive exact formulas for the forecast errors and analyze the contribution of the ‘‘wrong’’ data modeling and of errors in forecast inputs. The comparison is made for ‘‘feasible’’ forecasts, i.e., we assume that the true data generating process, latent states and parameters are unknown. As an illustration, the same comparison is carried out empirically for spot 5 min returns of DM/USD exchange rates. It is shown that the comparison between feasible reduced-form and model-based forecasts is not always in favor of the latter, in contrast to their infeasible versions. The reduced-form approach is generally better for long-horizon forecasting and for short-horizon forecasting in the presence of microstructure noise. © 2011 Elsevier B.V. All rights reserved.
1. Introduction

Traditionally, researchers who wanted to extract and forecast financial volatility had to rely on data recorded at only moderate intervals: daily, for instance, or even monthly. But recently, data at much more frequent intervals – high-frequency data – have become increasingly available. The increasing availability of high-frequency data allows researchers to improve on the techniques used to forecast volatility. Two types of these techniques are model based, which construct efficient volatility forecasts that rely on the model for returns, and reduced form, which construct simple projections of volatility on past volatility measures. Both types had been developed before high-frequency data became available. Initially, however, only model-based techniques were considered to be sufficiently reliable, as they performed more accurately for daily data. For instance, two classes of volatility models – ARCH models¹ by Bollerslev and Engle (1986) and stochastic volatility models² – were conventionally preferred for extracting the volatility series. These model-based approaches gave more accurate estimates for latent volatility and provided better ways to form forecasts of the future volatility. Reduced-form techniques, on the other hand, relied on excessively noisy proxies of the volatility, e.g., daily squared returns.
∗ Tel.: +1 713 348 5613; fax: +1 713 348 5278. E-mail address: [email protected].
1. ARCH and GARCH models are reviewed by Andersen et al. (2005).
2. Reviewed by Tauchen (2004).
doi:10.1016/j.jeconom.2011.02.004
Now that high-frequency data are available, is it still the case that model-based forecasts are better than reduced-form ones? Is it possible that when using high-frequency data, reduced-form forecasts are just as good as, or perhaps better than, model-based forecasts? Employing high-frequency data increases the efficiency of extracting latent spot variances by model-based techniques; hence, model-based techniques can perform better. On the other hand, for reduced-form techniques, high-frequency data allow researchers to define better proxies for the daily volatility. For example, one nonparametric proxy – the realized variance – is a sum of squared returns. It is observable, and hence it admits direct modeling, as in Corsi (2004) and Andersen et al. (2003). Apparently, the availability of high-frequency data starts a new chapter in the contest between reduced-form and model-based approaches as the most efficient for forecasting volatility. In this paper, we compare these two approaches for observed data and simulated data. Furthermore, we investigate this comparison analytically. The object of interest is a forecast of the integrated variance. Integrated variance is a natural descriptor of the volatility of daily returns. It is an analog of the variance of a daily return in a discrete-time model, e.g., the extracted variance in the GARCH model. Integrated variance can be used in VaR models of risk management, as an input to option pricing models and for variance hedging in trading; see Andersen et al. (2005). The primary goal of the current study is to compare analytically feasible model-based and reduced-form forecasts of integrated variance. The defining word here is ‘‘feasible’’, i.e., the comparison is carried out under ‘‘realistic’’ conditions, assuming that the
true data generating process is unknown, and forecast inputs are estimated with errors. To guarantee results that have the greatest generality, we formulate this comparison analytically within the framework of Meddahi's ESV models. The groundbreaking result of Meddahi is that any square-integrable variance process can be decomposed into the sum of simple processes. This decomposition allows us to write down many of our results in analytical form. Therefore, rather than resorting to time-consuming simulations to compare forecasts, we can plug parameters into a formula and immediately evaluate the comparative performance. The forecasts to be compared are briefly defined as follows. The first one is model based. To implement this forecast, we model returns by a stochastic volatility (SV) model. Daily data can be used to estimate the parameters of the model. We then use this model to predict the future integrated variance. For example, SV models that can be used to form the forecast include affine models by Heston (1993) and by Duffie et al. (2000), CEV models, and exponential models as in Chernov et al. (2003). The second forecast is reduced form. In this case, the predictor of the future integrated variance is a linear function of its historic values. This form of the forecast is based on theory developed by Meddahi (2003). The current study is logically connected to the paper by Andersen et al. (2004). They compared the performance of ‘‘infeasible’’ model-based forecasts to the performance of reduced-form forecasts. Their interesting finding is that the reduced-form forecast performs remarkably well even though the ‘‘infeasible’’ model-based forecast minimizes forecast error by definition. In their study, the authors assumed that the true model is known and all the inputs to the forecasts are observed. The major difference between their work and the current study is that we take things a step further and consider the case in which the model and parameters are unknown. Another paper that relates to the current study is the paper by Barndorff-Nielsen and Shephard (2002). Among other things, the authors looked at the problem of extracting the integrated variance using prior knowledge about a model. They compared this model-based estimate of the integrated variance to the reduced-form estimate, the realized variance. They found that the results of the comparison are affected by sampling frequency and volatility-of-volatility. In keeping with the aforementioned work, we show that the sampling frequency and the volatility-of-volatility parameters are major factors in the comparison between model-based and reduced-form forecasts of integrated variance. Furthermore, we present theoretical reasons for supposing that these parameters necessarily affect the comparison. The paper is organized as follows. In Section 2, the model framework is set up and the forecasts to be compared are defined. In the same section, we introduce definitions for error components. These definitions will be used throughout the paper. In Section 3, we make analytical comparisons of feasible forecasts. The analysis includes the cases of long-horizon forecasting and forecasting using high-frequency data contaminated by microstructure noise. In Section 4 the theoretical findings from the previous part are evaluated for observed data. Section 5 concludes the paper.

2. One model and two forecasts

2.1. Model

Throughout the paper we assume the following dynamics for the logarithm of prices st.
Denote by $x_t$ an n-dimensional vector of independent states. The $W_{i,t}$, $i = 1, \ldots, n$, are independent standard Brownian motions. $W_t^s$ is a standard Brownian motion which may correlate with $W_{i,t}$, $i = 1, \ldots, n$, i.e., $\mathrm{corr}(dW_t^s, dW_{i,t}) = \rho_i(x_t)$. The dynamics of the states $x_t$ and the log-prices $s_t$ are described by the following system of equations:
$$\begin{pmatrix} ds_t \\ dx_{1,t} \\ \vdots \\ dx_{n,t} \end{pmatrix} = \begin{pmatrix} \mu(x_t) \\ \kappa_1(x_{1,t}) \\ \vdots \\ \kappa_n(x_{n,t}) \end{pmatrix} dt + \begin{pmatrix} \sigma(x_t)\,dW_t^s \\ \Lambda_1(x_{1,t})\,dW_{1,t} \\ \vdots \\ \Lambda_n(x_{n,t})\,dW_{n,t} \end{pmatrix} \qquad (1)$$
where $\sigma^2(x_t)$ is referred to as the spot variance. The functions $\Lambda_i(x_{i,t})$ stand for the diffusion terms in the dynamics of the states $x_t$, and the functions $\kappa_i(x_{i,t})$ are the drifts. For notational simplicity, denote $\sigma^2(x_t) \equiv \sigma_t^2$. We are interested in forecasting the ex post variability of returns as measured by the integrated variance:
$$IV_t^{t+T} \equiv \int_t^{t+T} \sigma_s^2\, ds, \qquad (2)$$
which is assumed to be well defined. Intuitively, IV defines the variance of the $T$-period return, i.e., $\mathrm{Var}_t(s_{t+T} - s_t) = E_t(IV_t^{t+T})$, if the drift in prices $\mu(x_t)$ is a predictable process of finite variation. In general, the same relation holds approximately, since the variation in the drift $\mu(x_t)$ is negligible in comparison to the variation in the diffusion part $\sigma_t\,dW_t^s$. This implies that by forecasting the integrated variance, we seek to forecast the variability of the asset price over the next $T$ periods. For example, the variances that enter the Sharpe ratios of the assets can be taken from the values of $IV_t^{t+T}$.

Note that system (1) has a very general form. Any known Markov continuous dynamics of stochastic volatility can be represented in this form, e.g., affine, GARCH-SV, and log-volatility models, to be defined more formally below. For the purpose of later derivations, we will use another representation of the same system (1). This representation is referred to as the ESV representation and was introduced by Meddahi (2001). He showed that any square-integrable spot variance $\sigma^2(x_t)$ appearing in the SDE system (1) admits an Eigenfunction Stochastic Volatility (ESV) representation:
$$\sigma^2(x_t) = a_0 + \sum_{i=1}^{\infty} a_i P_i(x_t), \qquad (3)$$
where the processes $P_i(x_t)$ are called factors. These are square-integrable processes with the following properties: (i) zero mean: $E P_i(x_t) = 0$; (ii) uncorrelated, with unit variance:
$$\mathrm{Cov}(P_i(x_t), P_j(x_t)) = \begin{cases} 1 & i = j \\ 0 & i \neq j; \end{cases}$$
(iii) if discretely observed, each factor follows an AR(1) process:
$$E(P_i(x_{t+T}) \mid x_\tau, \tau \le t) = e^{-k_i T} P_i(x_t). \qquad (4)$$
In the following discussion, a process will be referred to as a one-factor process if $a_i = 0$, $\forall i > 1$, and a two-factor process if $a_i = 0$, $\forall i > 2$. In general, for a p-factor process the variance can be represented by the sum $\sigma^2(x_t) = a_0 + \sum_{i=1}^{p} a_i P_i(x_t)$. Note that the number of uncorrelated factors p in the ESV representation is generally not the same as the number of independent processes that drive the dynamics of volatility. We emphasize that the above representation is equivalent to the representation (1), and was introduced only because it facilitates further analytical derivations. We now define the two types of forecasts under study. The first will be called ‘‘model-based’’ or simply the ‘‘best’’ forecast. The second will be referred to as ‘‘model-free’’ or ‘‘reduced-form’’.
2.2. The model-based forecast

In this section, we define the model-based forecast. In addition, we show how the prediction error from this model-based forecast can be decomposed into three components: ‘‘genuine’’ forecast error, error from estimates of the parameters and states, and error from model misspecification. The model-based forecast is defined as the ‘‘best’’ forecast in terms of the mean squared error (MSE). That is, given the information set $\mathcal{F}_t$ at time t, the model-based forecast of the integrated variance $IV_t^{t+1}$ minimizes the squared-loss function:
$$IV_{t+1,t}^{model} = \arg\min_{IV_{t+1|t} \in \mathcal{F}_t} E\left[\left(IV_t^{t+1} - IV_{t+1|t}\right)^2 \,\middle|\, \mathcal{F}_t\right]. \qquad (5)$$
Despite a controversy surrounding the choice of the most appropriate performance measure (see Patton and Timmermann, 2007a,b), mean squared error remains a common reference point for forecast performance. The function that satisfies condition (5) is the expectation of $IV_t^{t+1}$ conditional on information at time t, i.e., $E(IV_t^{t+1} \mid \mathcal{F}_t)$. For the model described by system (1), let $\mathcal{F}_t$ be the $\sigma$-algebra generated by prices and states up to time t, i.e., $s_\tau, x_\tau, \forall \tau \le t$. Due to the Markovian dynamics of log-prices and variances given by system (1), the model-based forecast depends only on the latest realization of the state,
$$IV_{t+1|t}^{model} = E(IV_t^{t+1} \mid x_t), \qquad (6)$$
and the model specification (1). The latter classifies it as ‘‘model based’’. The ESV framework (3) yields a closed-form representation for the model-based forecast:
$$IV_{t+1|t}^{model} = \int_t^{t+1} \left[a_0 + \sum_{i=1}^{\infty} a_i e^{-k_i(s-t)} P_i(x_t)\right] ds = a_0 + \sum_{i=1}^{\infty} a_i \frac{1 - e^{-k_i}}{k_i} P_i(x_t). \qquad (7)$$
However, this forecast is infeasible, since it is based on an unknown model, parameters and states. To define the feasible version, it is common to proceed as follows. First, we choose a model and derive the corresponding formula for the model-based forecast. We then estimate the parameters and unobserved states using a general method such as MLE, exact or approximate. Finally, we substitute the recovered states and parameters into the formula. The resulting construction is termed a feasible model-based forecast. Its form can be derived from the ESV representation:
$$\widehat{IV}_{t+1|t}^{model} = \hat{a}_0 + \sum_{i=1}^{\infty} \hat{a}_i \frac{1 - e^{-\hat{k}_i}}{\hat{k}_i} \hat{P}_i(\hat{x}_t). \qquad (8)$$
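As a concrete illustration, here is a minimal sketch of the feasible forecast (8) specialized to one factor, anticipating the closed form in Eq. (18) below; the function name and inputs are hypothetical:

```python
import numpy as np

def model_based_forecast(sigma2_hat, theta_hat, k_hat):
    """Feasible one-factor model-based forecast of next-period integrated
    variance: theta + (1 - e^{-k}) / k * (filtered spot variance - theta),
    with all inputs replaced by their estimates, as in Eq. (8)."""
    return theta_hat + (1.0 - np.exp(-k_hat)) / k_hat * (sigma2_hat - theta_hat)

# Example with illustrative values: theta = 0.5, k = 0.1, filtered sigma^2 = 0.8
print(model_based_forecast(0.8, 0.5, 0.1))
```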
The feasible model-based forecast given by (8) and the corresponding forecast error are the objects of interest for the rest of this subsection. In the ideal case, the exact model, parameters and latent states are all known. In this case, the only error of the model-based forecast is the ‘‘genuine’’ forecast error:
$$GFE^{model} = E[IV_t^{t+1} - IV_{t+1|t}(\Theta, x_t)]^2, \qquad (9)$$
where $\Theta$ are the parameters in the model. However, the total forecast error of the feasible model-based forecast involves two extra components. The first is the error in the forecast $IV_{t+1|t}$ due to errors in the parameters and states:
$$F(\hat{x}_t - x_t, \hat{\Theta} - \Theta) = E[IV_{t+1|t}(\Theta, x_t) - IV_{t+1|t}(\hat{\Theta}, \hat{x}_t)]^2, \qquad (10)$$
where $\hat{\Theta}$ are the estimated parameters in the model, and $\hat{x}_t$ are the estimated states in the model. This component of the error is
uncorrelated with the ‘‘genuine’’ error, since the ‘‘genuine’’ error is unpredictable at time t, i.e., $E[IV_t^{t+1} - IV_{t+1|t} \mid \mathcal{F}_t] = 0$, while the error in parameters and states is a function of the information available at time t, i.e., $IV_{t+1|t}(\Theta, x_t) - IV_{t+1|t}(\hat{\Theta}, \hat{x}_t) \in \mathcal{F}_t$. Hence, the total error of predicting $IV_t^{t+1}$ – the total out-of-sample mean squared prediction error (MSPE) – is simply the sum of the two components.

Decomposition of the model-based forecast:
$$\text{Total MSPE} = E[IV_t^{t+1} - IV_{t+1|t}(\hat{\Theta}, \hat{x}_t)]^2 = GFE^{model} + F(\hat{x}_t - x_t, \hat{\Theta} - \Theta). \qquad (11)$$
Second, an additional part of the error comes from model misspecification. Model misspecification is practically unavoidable, since the true model is generally not known for any observed data set. Hence, any chosen model is at best an approximation of the true one. This second additional component of the error is important, since each step used to form a feasible forecast incorporates knowledge of the model. First, we use the model to define the functional form of the model-based forecast. Then we use the model to estimate the parameters and extract the spot values of the states, e.g., by MLE and particle filters, EMM and the reprojection method, or by Bayesian methods. (See Jacquier et al., 1994; Johannes and Polson, 2003; Stroud et al., 2003; Gallant and Tauchen, 1998; Pitt and Shephard, 1999.) Hence, model misspecification is a critical factor that affects all the steps above and can weaken the performance of the model-based forecast.

To summarize, although the model-based forecast minimizes the mean squared error in (5), it does so only under the assumption that the true model with its parameters is known, and all the states are observable. It may not do so under the more realistic assumption that many inputs are to be estimated. In this latter case, the total error consists of several parts, and the ‘‘feasible’’ model-based forecast as defined by (8) may not perform best under the mean squared error loss. We seek circumstances in which the above factors eliminate the advantages of model-based forecasting. As a natural competing forecast, we use a model-free reduced-form forecast, as originally advocated by Andersen et al. (2003) and formally analyzed by Meddahi (2003).

2.3. The reduced-form forecast

In this section, we define a benchmark reduced-form forecast. This forecast can be derived starting from formula (7), which gives the conditional expectation of the integrated variance $IV_t^{t+1}$ based on the ESV representation. The right-hand side of this formula is a sum of p autoregressive processes. This property has an important implication: it implies that for a p-factor ESV model, the integrated variance $IV_t^{t+1}$ is a sum of p AR(1) processes and a white noise term:
$$IV_t^{t+1} = a_0 + \sum_{i=1}^{p} a_i \frac{1 - e^{-k_i}}{k_i} P_i(x_t) + \varepsilon_{t+1},$$
where $E(\varepsilon_{t+1} \mid \mathcal{F}_t) = 0$. This decomposition suggests that the integrated variance is an ARMA(p, p) process. In general, the ARMA(p, p) representation of the integrated variance takes the form:
$$\prod_{i=1}^{p} (1 - e^{-k_i} L)(IV_t^{t+1} - \theta) = \eta_{t+1} - \sum_{i=1}^{p} \beta_i \eta_{t+1-i}, \qquad (12)$$
where $\eta_t$ is heteroscedastic white noise, and $k_1, \ldots, k_p$ are the mean reversions in the ESV representation (4). We can find the parameters of the ARMA model (12) if we know the parameters of the base model (1). Alternatively, we may also estimate the
same parameters in a ‘‘reduced-form’’ manner. Since $IV_t^{t+1}$ can be described as an ARMA process, we may simply fit this linear time-series model to $IV_t^{t+1}$. One could also use a simple autoregressive model instead of the ARMA. The resulting forecast yields a model-free predictor of $IV_t^{t+1}$ based on the past realizations of the integrated variance.

Definition. The reduced-form forecast of the integrated variance $IV_t^{t+1}$ is a linear projection of $IV_t^{t+1}$ onto the space generated by its past realizations $IV_\tau^{\tau+1}$, $\tau \le t$:
$$IV_{t+1|t}^{rf} = P(IV_t^{t+1} \mid IV_\tau^{\tau+1}, \tau \le t).$$
The above forecast is not feasible because the integrated variance is not observed. However, we may construct a feasible version of the same forecast using a close proxy of the integrated variance, the realized variance:

Definition. Suppose log-prices $s_t$ are observed at discrete times $0, h, 2h$, etc. Then the realized variance over the period $[t, t+1]$ is defined as
$$RV_t^{t+1} \equiv \sum_{i=1}^{1/h} (s_{t+ih} - s_{t+(i-1)h})^2. \qquad (13)$$
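Computing (13) from a grid of log-prices takes a few lines; the sketch below is a hedged implementation with assumed array inputs:

```python
import numpy as np

def realized_variance(log_prices, per_period):
    """Realized variance, Eq. (13): the sum of squared intra-period log
    returns. `log_prices` is sampled every h time units and `per_period`
    is 1/h, the number of returns per unit interval; one RV per period."""
    r = np.diff(log_prices)
    n_periods = r.size // per_period
    r = r[: n_periods * per_period].reshape(n_periods, per_period)
    return (r ** 2).sum(axis=1)
```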
The realized variance is a directly observable measure of the intra-day variance of the price. The asymptotic behavior of RV and its consistency as a proxy for IV are discussed by Jacod and Protter (1998), Barndorff-Nielsen and Shephard (2002), and Meddahi (2002). To construct a feasible version of the reduced-form forecast, we project $RV_t^{t+1}$ on its past values, $P(RV_t^{t+1} \mid RV_\tau^{\tau+1}, \tau \le t)$. This projection can be constructed in a strictly model-free manner using only data that are observable. For zero drift in returns, this forecast is equivalent to the projection of $IV_t^{t+1}$ on the past values of RV, i.e., $\widehat{IV}_{t+1|t}^{rf} = P(IV_t^{t+1} \mid RV_\tau^{\tau+1}, \tau \le t)$. This equivalence follows from the fact that with no drift in returns, the difference $RV_t^{t+1} - IV_t^{t+1}$ is unpredictable. (See Barndorff-Nielsen and Shephard, 2002.) The same equivalence holds approximately for intra-day data, since the drift in asset prices is negligibly small between finely sampled observations.
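In practice this projection is just an OLS regression of RV on its own lags. A minimal sketch with an illustrative fixed lag length (the empirical section of the paper selects the lag order by BIC):

```python
import numpy as np

def reduced_form_forecast(rv, n_lags=5):
    """Feasible reduced-form forecast: OLS projection of RV(t+1) on its own
    lags, a simple AR approximation to the ARMA form in Eq. (14). The lag
    length here is an illustrative placeholder."""
    y = rv[n_lags:]
    X = np.column_stack([np.ones(y.size)] +
                        [rv[n_lags - j - 1 : -j - 1] for j in range(n_lags)])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    x_last = np.concatenate(([1.0], rv[-1 : -n_lags - 1 : -1]))
    return x_last @ beta
```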
(1 − e−ki L)(RVtt +1 − θ ) = ηt +1 (h) −
i=1
p −
βi (h)ηt +1−i (h), (14)
i=1
where $\eta_t(h)$ is heteroscedastic white noise that depends on the distance between observations h. If the parameters of the ARMA representation (14) are known, then the ‘‘genuine’’ error of the reduced-form forecast is the difference between the variance of the shock $\eta_{t+1}$ in (14) and the variance of the noise term $RV_t^{t+1} - IV_t^{t+1}$:
$$GFE^{rf} = E\left[IV_t^{t+1} - P(IV_t^{t+1} \mid RV_\tau^{\tau+1}, \tau \le t)\right]^2 = E\left[RV_t^{t+1} - P(RV_t^{t+1} \mid RV_\tau^{\tau+1}, \tau \le t)\right]^2 - E\left[RV_t^{t+1} - IV_t^{t+1}\right]^2. \qquad (15)$$
The decomposition above follows from the fact that the difference between the realized and integrated variance is unpredictable. If the parameters of the ARMA representation are unknown, then the total Mean Squared Prediction Error of the reduced-form forecast includes an additional part.

Decomposition of the reduced-form forecast:
$$\text{Total MSPE}^{rf} = GFE^{rf} + F^{rf}(\hat{\Theta}^{rf} - \Theta), \qquad (16)$$
where the second part $F^{rf}(\hat{\Theta}^{rf} - \Theta)$ comes from the errors in the coefficients, and $\Theta^{rf}$ are the parameters in the regression of $RV_t^{t+1}$ on its past values. Note that the covariance term is absent from the above decomposition: since the infeasible error is orthogonal to the linear space spanned by $RV_\tau^{\tau+1}$, $\tau \le t$, it is also orthogonal to the error from parameter estimation. Observe that the total Mean Squared Prediction Error of the feasible reduced-form forecast does not include errors from estimating the instantaneous unobservable states $x_t$.

3. Analytical comparison of feasible forecasts

In this section we present a simple framework that allows us to compare feasible model-based and reduced-form forecasts analytically. Special attention will be paid to the effects of errors in the state estimates $\hat{x}_t$ and of model misspecification.

Throughout this section, we will assume that the data are generated from a multifactor model. Formally, this implies that for the ESV representation ((3) and (4)) $a_2 \neq 0$. This assumption has been tested for financial time series, and the hypothesis of a one-factor model ($H_0: a_2 = 0$) is consistently rejected. For example, Chernov et al. (2003) reject the one-factor hypothesis for stock-index data using $\chi^2$-statistics within the EMM estimation. Bollerslev and Zhou (2002) also reject the one-factor hypothesis within a GMM estimation that matches the conditional first and second moments of the realized variance for foreign exchange rates. The success of multifactor models is explained by their dual implications: they can simultaneously generate large variability in the variance and high persistence over long horizons. This combination of properties is an attribute of asset price series. Since any multifactor process is a mixture of at least two components, one of these components adds to the variability of the variance, and the other accounts for the high persistence. Using the ESV notation (3), this implies that for a two-factor model with $a_1 \neq 0$, $a_2 \neq 0$ and $a_i = 0$, $i > 2$, the mean reversion of one factor, $k_1$, is significantly higher than $k_2 \approx 0$.

Nonetheless, in the literature, models with only one component are often used instead of multi-component models. For instance, the models listed in Table 1 are often chosen due to the simplicity of their handling and estimation. Therefore, although the true generating process may be governed by several components, econometricians often assume simple one-component processes. In general, there are few research papers that focus on models with two components and even fewer that consider models with three components. (See Barndorff-Nielsen and Shephard, 2002.) We may expect that the true number of components is not limited to two or even three for real data, implying that the number of components is underestimated in many applications. Hence, the underestimation of the number of factors (components) can be a common source of forecasting error.

Table 1
Models with one component in the variance.

| Model name | Specification of the variance process |
| Square-root SV | $d\sigma_t^2 = k(\theta - \sigma_t^2)dt + \eta\sigma_t dW_t$ |
| Log-volatility | $d\ln\sigma_t^2 = k(\theta - \ln\sigma_t^2)dt + \eta dW_t$ |
| GARCH-SV | $d\sigma_t^2 = k(\theta - \sigma_t^2)dt + \eta\sigma_t^2 dW_t$ |
| Ornstein–Uhlenbeck | $d\sigma_t^2 = k(\theta - \sigma_t^2)dt + \eta dW_t$ |

In this section, we consider a simple case in which an econometrician assumes a one-factor model of the general form:
$$d\sigma_t^2 = k(\theta - \sigma_t^2)dt + \Lambda(\sigma_t)dW_t. \qquad (17)$$
In the above model, the only factor is the spot variance. Therefore, we can make a link between the model (17) and the ESV
representation (3) by defining the ESV parameters $a_0 = \theta$, $a_1^2 = \mathrm{Var}\,\sigma_t^2$, $P_1(x_t) = (\sigma_t^2 - a_0)/|a_1|$, and $k_1 = k$. Hence, the model-based forecast can be derived as a special case of the general formula (7):
$$E_t IV_t^{t+1} = \theta + \frac{1 - e^{-k}}{k}(\sigma_t^2 - \theta). \qquad (18)$$
The model-based forecast is linear in the last realized spot variance, and the slope is a function of the mean-reversion coefficient k. To implement this forecast in practice, two problems have to be solved: estimation of the parameters and estimation of the latent spot variances $\sigma_t^2$.

The first problem of implementing (18) – estimation of the parameters – will not be fully addressed in this paper. We assume that a large span of data is available, thus driving errors in the parameters to zero. In the following analysis, we are concerned only with the errors that arise from the finite number of observations per unit of time (infill asymptotics). Nevertheless, not knowing the parameters still poses a problem for the model-based approach: if the true model were known, then regardless of the estimation procedure, the estimates would converge to their true unique values. In our case, however, the model is misspecified. Consequently, the limits of the estimated parameters may depend heavily on the estimation procedure. Therefore, the estimation procedure for the model parameters also plays an important role in the feasible model-based forecast.$^3$

The second problem of implementing (18) is filtering the state $\sigma_t^2$. As a result of this step, another error is added to the total prediction error. For illustration, assume that the parameters of (18) are estimated using the moment conditions $\tilde{\theta} = E\sigma_t^2$ and
$$\frac{1 - e^{-\tilde{k}}}{\tilde{k}} = \frac{\mathrm{cov}(IV_t^{t+1}, \sigma_t^2)}{\mathrm{Var}\,\sigma_t^2}.^4$$
When we substitute the estimate $\hat{\sigma}_t^2$ into the formula (18), we get the following decomposition of the error:
$$IV_t^{t+1} - \tilde{\theta} - \frac{1 - e^{-\tilde{k}}}{\tilde{k}}(\hat{\sigma}_t^2 - \tilde{\theta}) = \left[IV_t^{t+1} - P_t IV_t^{t+1}\right] + \left[\frac{\mathrm{cov}(IV_t^{t+1}, \sigma_t^2)}{\mathrm{Var}\,\sigma_t^2}(\sigma_t^2 - \hat{\sigma}_t^2)\right], \qquad (19)$$
where $P_t IV_t^{t+1}$ is the linear projection of $IV_t^{t+1}$ on $\sigma_t^2$. The first part of the above expression is the infeasible forecast error of the ‘‘last-spot’’ forecast, i.e., the linear forecast based on $\sigma_t^2$. The second part is due to the error in the spot variance. For consistent estimates of the spot variance $\hat{\sigma}_t^2$, the error from the second part of (19) converges to zero as the sampling frequency increases to infinity, e.g., Aït-Sahalia et al. (2005). However, at very high frequencies, microstructure effects may blur the results. To avoid such microstructure effects, the minimum distance between observations is often no less than 5 min in financial time-series studies. Therefore, despite the asymptotic negligibility of the error in the spot variance, it remains a nontrivial part of the total error in (19) under usual conditions. The error in the spot variance will also depend on the particular filtering technique used to extract it. There are several ways to extract the spot variance $\sigma_t^2$ from past prices $s_{t-i}$, $i \ge 1$. In this paper, we focus on the efficient ARCH filters of Nelson and Foster (1994). ARCH filters give consistent variance estimates under very general conditions, most
$^3$ The limits of the estimated parameters $\theta$ and $k$ in the model (17) will be denoted by $\tilde{\theta}$ and $\tilde{k}$. For more details on parameter estimation see the proof of Proposition 1 in the Appendix.
$^4$ We refer to this as a ‘‘no-bias’’ case, since the forecast is unbiased conditionally on $\sigma_t^2$.
importantly even for misspecified models. (See Nelson, 1992.) For example, for GARCH-SV models with $\Lambda(\sigma_t) = \eta\sigma_t^2$ in (17) and zero leverage, the efficient ARCH filter takes the form of a discrete-time GARCH(1, 1):
$$\hat{\sigma}_{t+h}^2 = \phi_h + a_h \hat{\sigma}_t^2 + b_h \xi_{t+h}^2, \qquad (20)$$
where $\xi_{t+h} = \frac{s_{t+h} - s_t - \mu h}{\sqrt{h}}$ is the normalized innovation in prices.
The parameters of the filter, $\phi_h$, $a_h$, and $b_h$, are chosen optimally based on an estimated model. In particular, they depend on the estimates of the persistence k and the volatility-of-volatility $\lambda = \frac{\mathrm{Var}\,\sigma^2}{E\sigma^4}$.
In the Appendix, we prove the following statement:

Proposition 1. Let $\sigma_t^2$ be square integrable with the correlation function
$$\mathrm{corr}(\sigma_{t+h}^2, \sigma_t^2) = \frac{\sum_{i=1}^{p} a_i^2 e^{-k_i h}}{\sum_{i=1}^{p} a_i^2}.$$
Suppose we apply the ARCH filter of the form (20) to extract the spot variance. The log-price process is described by (1) with no leverage effect and zero drift. Then the comparison of the reduced-form forecast and the forecast based on a one-factor model depends only on the following set of parameters $\Xi$:
• Volatility-of-volatility: $\lambda = \frac{\mathrm{Var}\,\sigma_t^2}{E\sigma_t^4}$;
• Relative weights of factors: $a_i/a_1$, $i = 2, \ldots, p$;
• Persistence of factors: $k_i$, $i = 1, \ldots, p$;
• Sampling frequency: $h$.

All the steps of the analytical comparison are given and proved in the Appendix. Notably, the assumption about the correlation structure of the variance process is very general, since the ESV representation of any square-integrable process satisfies it. Hence, the comparison of the forecasts is simplified within the ESV framework.

To understand the effects of the parameters $\lambda$, $a_2/a_1, \ldots, a_p/a_1$, $k_1, \ldots, k_p$, and h on the performance of the forecasting techniques, we consider two cases: the comparison of the forecasts when the true data generating process is a one-factor model, i.e., the model is correct, and the comparison of the forecasts under model misspecification. The importance of the relevant parameters will be addressed in the corresponding subsections. To illustrate the results, this section also includes the GARCH-SV example from Andersen et al. (2004). In particular, we replicate their comparison of the infeasible forecasts and update it with a comparison of the feasible forecasts. The last two subsections extend the results in two essential directions: long-horizon forecasting and forecasting with data contaminated by microstructure noise.

3.1. Comparison if the model is correctly specified

If the model were correct, that is, if the forecast (18) were optimal, then of the two components of the error discussed above – model misspecification error and error in the latent state – only the latter would remain. Formally, the total MSPE of the model-based forecast would contain only two parts:
$$\text{Total MSPE} = GFE^{model} + \left[\frac{1 - e^{-k}}{k}\right]^2 E(\hat{\sigma}_t^2 - \sigma_t^2)^2.$$
Note that the second part is inherently related to the quality of the variance filtering. Thus only those parameters that determine the accuracy of the estimate $\hat{\sigma}_t^2$ will affect the deterioration in the accuracy of the model-based forecast. As mentioned above, the second part of the error is zero if the data are observed continuously. However, since actual data are discrete, the comparison of the feasible forecasts depends on how the model-based and reduced-form approaches are affected by the sampling frequency. The asymptotic behavior of an MLE-estimated spot
variance is discussed by Nelson and Foster (1994), and Gloter and Jacod (2001a,b). In this paper we derive the exact formula for the error in the estimate $\hat{\sigma}_t^2$ extracted by the ARCH filter (20). The formula is given by (53) in the Appendix and can be further simplified through an approximation around $h \approx 0$, i.e., for infinitely active sampling. From (53), in the case of a correct model, it follows that the error in $\hat{\sigma}_t^2$ is approximately
$$E(\hat{\sigma}_t^2 - \sigma_t^2 \mid k)^2 \approx h\,\mathrm{Var}\,\sigma^2\,\frac{k}{b_h} + b_h\,E\sigma^4. \qquad (21)$$
The formula above clarifies the trade-off in estimating $\hat{\sigma}_t^2$: the choice is between using as much data as possible to make the estimator more efficient, or using an estimation window as narrow as possible to reduce the bias. Both characteristics are functions of the same parameter $b_h$, and the error is minimized by the value$^5$ of the GARCH parameter $b_h = \sqrt{\lambda k h}$. The resulting error in the variance estimate equals
$$E(\hat{\sigma}_t^2 - \sigma_t^2)^2 \approx 2\,\mathrm{Var}\,\sigma_t^2\,\sqrt{\frac{kh}{\lambda}}. \qquad (22)$$
Therefore, an approximation of the total MSPE around $h \approx 0$ will contain the ‘‘genuine’’ forecast error plus the part coming from the term in (22):
$$\frac{\text{Total MSPE}}{\mathrm{Var}\,\sigma_t^2} = 2\left[\frac{1}{k} - \frac{1 - e^{-k}}{k^2}\right] - \left[\frac{1 - e^{-k}}{k}\right]^2 + 2\sqrt{\frac{hk}{\lambda}}\left[\frac{1 - e^{-k}}{k}\right]^2 + O(h). \qquad (23)$$
The analogous Taylor decomposition of the reduced-form forecast error given by Eq. (58) in the Appendix yields:
$$\frac{\text{Total MSPE}}{\mathrm{Var}\,\sigma_t^2} = \frac{\mathrm{Var}\left[IV_t^{t+1} - P(IV_t^{t+1} \mid IV_{t-\tau}^{t+1-\tau}, \tau = 1, 2, \ldots)\right]}{\mathrm{Var}\,\sigma_t^2} + O(h). \qquad (24)$$
The $O(h)$ term appears in the above equation because, to predict $IV_t^{t+1}$, we use past realizations of RV instead of the latent IV-series. It is worth noting that the two errors above have different convergence rates with respect to the distance between observations h. The reduced-form forecast error term is of stochastic order $O(h)$, while that of the model-based forecast is of stochastic order $O(\sqrt{h})$. Hence, we can conclude that$^6$

Corollary 1. The decrease in sampling frequency has a higher negative effect on the performance of the model-based forecasts than on the performance of the reduced-form forecasts.

Another interesting observation is that an increase in the volatility-of-volatility $\lambda$ affects the forecast comparison in favor of the model-based forecast. This observation can be explained as follows. Volatility series are filtered from noisy squared returns. The level of noise in the squared return is proportional to $E\sigma^4 = \mathrm{Var}\,\sigma^2 + (E\sigma^2)^2$. Hence, a reduction in the mean $E\sigma^2$ would reduce the noise in the returns while keeping the same variability of the spot variance. Therefore, an increase in the volatility-of-volatility makes squared returns better proxies for the spot variance.

Corollary 2. As the sampling frequency decreases, the comparative performance of the model-based forecasts deteriorates less for higher levels of volatility-of-volatility $\lambda = \frac{\mathrm{Var}\,\sigma^2}{E\sigma^4}$.

Going back to Proposition 1, we conclude that the volatility-of-volatility, $\lambda$, and the sampling frequency, h, affect the comparative performance of the forecasts if the model is correctly specified.

3.2. The effect of model misspecification

In the above discussion, model misspecification resulted from the application of one-factor models to multifactor processes. Consider the case of a two-factor process. Within the ESV framework, the true variance process includes two components with different mean reversions:
$$\sigma_t^2 = a_0 + a_1 P_1(x_t) + a_2 P_2(x_t),$$
where $a_1 > 0$ and $a_2 > 0$. However, the econometrician wrongly assumes that either $a_1 = 0$ or $a_2 = 0$. Therefore, the measure of the model misspecification is the ratio $\ln\frac{a_1}{a_2}$. This ratio is equal to $\pm\infty$ when the model is truly one factor, and thus the contribution of one of the factors is zero. At the other extreme, this ratio is zero when both factors are of equal importance, i.e., $a_1 = a_2$. In this subsection, we investigate the effect of the model misspecification as measured by $\ln\frac{a_1}{a_2}$ on the comparison
between the model-based and reduced-form forecasts. The comparison is carried out for different values of the other parameters that influence the outcome: the mean reversions $k_1$ and $k_2$, the volatility-of-volatility $\lambda = \frac{\mathrm{Var}\,\sigma_t^2}{E\sigma_t^4}$, and the sampling frequency.
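Before turning to calibrated examples, the size of the finite-sampling penalty in (23) is easy to gauge numerically. A short sketch with representative, not calibrated, parameter values:

```python
import numpy as np

# Finite-sampling penalty of the model-based forecast from Eq. (23),
# 2*sqrt(h*k/lambda)*((1-e^{-k})/k)^2, normalized by Var(sigma_t^2).
# k and lambda here are representative values, not estimates from Table 2.
k, lam = 0.9, 0.6
for minutes in (5, 15, 30):
    h = minutes / (24 * 60)          # fraction of a day, 24-hour market
    penalty = 2 * np.sqrt(h * k / lam) * ((1 - np.exp(-k)) / k) ** 2
    print(f"{minutes:2d}-min sampling: penalty = {penalty:.4f}")
```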
First, we fix the estimates for $k_1$ and $k_2$ from studies that fitted two-factor models to financial series. These studies are summarized in Table 2 and include four examples: three papers estimated the coefficients of multifactor models for foreign exchange rates, and one dealt with stock indices. We derived the parameters of interest – the mean reversions $k_i$, the model misspecification $\ln|a_1/a_2|$, and the volatility-of-volatility $\lambda$ – from the parameters reported in those papers.

$^5$ The parameter of the efficient GARCH filter is derived by Nelson and Foster (1994) for a one-factor GARCH-SV process. As follows from the Appendix, for a general ESV model the optimal choice of $b_h$ is $\sqrt{\lambda\bar{k}h}$, where $\bar{k}$ is the average mean reversion across volatility factors.
$^6$ The same result holds under model misspecification, since for the model-based forecast
$$\frac{\text{Total MSPE}}{\mathrm{Var}\,\sigma_t^2} = 2\left[\overline{\frac{1}{k}} - \overline{\frac{1-e^{-k}}{k^2}}\right] - \left[\overline{\frac{1-e^{-k}}{k}}\right]^2 + \sqrt{\frac{h}{\lambda}}\,\frac{\bar{k} + \tilde{k}}{\sqrt{\tilde{k}}}\left[\frac{1 - e^{-\tilde{k}}}{\tilde{k}}\right]^2 + O(h), \qquad (25)$$
where $\tilde{k}$ is an estimate of k and $\bar{k}$ is an average mean reversion across factors.

Fig. 1. Comparison of day-ahead forecasts.

Table 2
Parameters for two-factor models.

| Forecast: | Model type | $k_1$ | $k_2$ | $\ln|a_1/a_2|$ | $\lambda$ |
| Bollerslev and Zhou (2002), Data: 5 min DM/USD spot | Affine | 0.57 | 0.07 | 0.025 | 0.103 |
| Huang and Tauchen (2005), Data: stock indices | Log-normal | 1.386 | 0.00137 | 0.976 | 0.860 |
| Alizadeh et al. (2002), Data: daily exchange rates | Log-normal | 0.81/0.95 | 0.02/0.03 | −1.3/−0.3 | 0.66/0.68 |
| Barndorff-Nielsen and Shephard (2002), Data: 5 min DM/USD spot | CEV | 3.74 | 0.04 | 0.59 | 0.64 |

The table reports parameters of the ESV representation for multifactor models estimated on observed data sets: mean reversions ($k_1$, $k_2$), model misspecification ($\ln|a_1/a_2|$), and volatility-of-volatility $\lambda = \frac{\mathrm{Var}\,\sigma^2}{E\sigma^4}$. All the parameters are in daily units. For log-normal models with $\ln\sigma_t^2 = x_{1t} + x_{2t}$, we define $a_1^2 = \mathrm{Var}(\sigma_t^2 \mid x_{t2} = Ex_{t2})$, $a_2^2 = \mathrm{Var}(\sigma_t^2 \mid x_{t1} = Ex_{t1})$.

For each pair of $k_1$ and $k_2$ from Table 2, Fig. 1 shows the regions in the space $(\ln\frac{a_1}{a_2}, \lambda)$ where the reduced-form forecast outperforms the model-based forecast, i.e., where the total MSPE is lower for the reduced-form forecast. The regions corresponding to different sampling frequencies are shown in different intensities of gray. Small square marks inside the graph indicate representative parameter values taken from the studies in Table 2. For example, for the model of Alizadeh et al. (2002) the coefficients are $k_1 \approx 0.9$, $k_2 \approx 0.02$, $\lambda \approx 0.6$, and $\ln\frac{a_1}{a_2} \approx -1$. From the bottom left panel in Fig. 1, we see that the intersection of 0.6 on the ordinate and −1.0 on the abscissa is dark gray, which corresponds to the 15 min sampling frequency. This implies that the model-based forecast renders smaller errors for 5 min sampling, but cedes efficiency to the reduced-form forecast for 15 min and 30 min frequencies.

Fig. 1 illustrates the statements from the previous section. First, the graph shows that the decrease in sampling frequency adversely
affects the model-based forecast, making it less appealing than the simpler alternative. As the distance between observations increases from 5 to 30 min, a larger area of the parameter space yields MSPE (reduced form) < MSPE (model based).

Second, the graph confirms that the effect of finite sampling on the performance of the model-based forecast is lower for high volatility-of-volatility $\lambda$. For instance, consider the bottom left panel with $k_1 = 0.9$ and $k_2 = 0.02$ and choose the level of misspecification to be $\ln\frac{a_1}{a_2} = -1$. For $\lambda = 0.1$, the model-based forecast results in higher errors for all the chosen frequencies. For $\lambda = 0.4$, the model-based forecast is the most efficient for 5 min but not for 15 min or 30 min sampling. Finally, for $\lambda = 0.8$, the model-based forecast is more accurate for the highest frequencies (5 min and 15 min) and becomes less accurate only for 30 min sampling.

Third, a higher persistence favors the reduced-form forecast. For each of the four panels, $k_1$ is higher than $k_2$. The ‘‘average’’ mean reversion of the variance is equal to $a_1^2(a_1^2 + a_2^2)^{-1}k_1 + a_2^2(a_1^2 + a_2^2)^{-1}k_2$. Thus, the mean reversion of the variance is increasing in $\ln\frac{a_1}{a_2}$. Fig. 1 shows that the areas with a better performance of the reduced-form forecast are located to the left relative to the symmetric case $\ln\frac{a_1}{a_2} = 0$. Thus, other factors being equal, a higher persistence of the volatility process gives an advantage to the reduced-form forecast.
Finally, Fig. 1 demonstrates how the model misspecification affects the forecast comparison. Model misspecification is measured by the ratio $a_1/a_2 \in [0, +\infty)$, with the no-misspecification cases located at $\ln\frac{a_1}{a_2} = \pm\infty$. Notably, the areas where the reduced-form forecast is better (shown in gray) are centered at $\ln\frac{a_1}{a_2} \approx 0$ for three of the four graphs, implying that model misspecification acts in favor of the reduced-form forecast. It can be formally shown that the ‘‘genuine’’ error of the model-based forecast (44) is quadratic$^7$ in the ratio $\frac{a_1^2}{a_2^2}$ and achieves its global maximum at a finite value of $\ln\frac{a_1}{a_2}$. That is, the predictive power of the model-based forecast is at its minimum when the spot variance consists of two components, slow moving and fast moving, and the contributions of each of these components are nontrivial, i.e., in the case of misspecification. Thus, in reference to the results of Proposition 1, we conclude that the ratios $a_2/a_1, \ldots, a_p/a_1$ define the comparative performance of the forecasts under model misspecification, with the performance
$^7$ For comparison, the ‘‘genuine’’ error is linear in $\frac{a_1^2}{a_2^2}$ for a correctly specified two-factor model (see (45)), and achieves its maximum for $\ln\frac{a_1^2}{a_2^2} = -\infty$ if $k_1 > k_2$.
of the model-based forecast adversely affected if the values of these ratios are close to one.

These results indicate that the reduced-form forecast performs similarly to the model-based forecast due to two effects. First, the efficiency of the ‘‘best forecast’’ (model based) vanishes for finite sampling frequencies. The second effect results from model misspecification and explains why the reduced-form forecast eventually outperforms the model-based forecast for certain parameter values. (See Fig. 1.)

Note an important connection between the findings of this section and earlier findings by Nelson (1992) and Nelson and Foster (1995). The common theme is the performance of model-based techniques under model misspecification. A related result from these studies is the consistency of the estimate $\hat{\sigma}_t^2$ under a variety of misspecification types. Nelson and Foster (1995) also argue that under certain conditions ARCH filters yield consistent forecasts. However, the assumptions of Nelson and Foster (1995) are violated if the number of factors is underestimated. For this reason, contrary to Nelson and Foster (1995), we find a significant loss in predictive power for misspecified model-based forecasts.

3.3. Numerical example

In this section we continue a numerical example from Andersen et al. (2004). The example compares the model-based and reduced-form forecasts for the one-factor GARCH-SV process calibrated to the daily series of the spot DM/USD exchange rate from 1987 to 1992. This exercise assumes that the states are observable and that GARCH-SV is the true model for the data. We extend this example, first by relaxing the assumption that the states are observable, and second by relaxing the assumption that the true model is one factor. Instead, we assume that the true model is two factor, with parameters calibrated to the same DM/USD series.

Therefore, we use the same assumption made in the previous subsection: an econometrician employs a one-factor model of the type (17). Specifically, following Andersen et al. (2004), he estimates the parameters to be $k = 0.035$, $\theta = 0.636$ and $\Lambda(\sigma_t) = \sigma\sigma_t^2$, where $\sigma$ is chosen so that the volatility-of-volatility is $\lambda = 0.296$. However, the true data generating process is two factor and described by either of the following two candidates. The first candidate is the SR-SV model estimated by Bollerslev and Zhou (2002) for high-frequency 5 min spot returns on DM/USD from 1986 until 1996. The second candidate is the CEV-SV model estimated by Barndorff-Nielsen and Shephard (2002) for the same data.

Thus, we have three different scenarios. Under the first scenario, the returns follow the one-factor GARCH-SV model. This case corresponds to the first row in Table 3. The second row in Table 3 is for the version in which the true model is two factor, as suggested by Bollerslev and Zhou (2002). Finally, the third row of the table corresponds to the case when the true model is from Barndorff-Nielsen and Shephard (2002). Thus, for the first row the model is specified correctly, but in the second and third rows the model is misspecified. The data reported in Table 3 are the performances of the forecast based on the one-factor GARCH-SV model (the first two columns) and of the reduced-form forecast (the last two columns). Table 3 uncovers the error-in-latent-states effect. In particular, the first and third columns correspond to the case when the spot variance is observable, while the second and fourth columns are for the case when the spot variance is latent.
The results in the table assume a 5 min distance between observations for ‘‘feasible’’ forecasts and are calculated using formulas derived in the Appendix ((59)–(61)). Table 3 demonstrates the following results. First, the modelbased forecast is the most efficient if the model is correct and the variance is observed. This is in accordance with the definition of the
model-based forecast. In this case, the MSPE is a mere 2.3%, while the reduced-form forecast yields an error of 4.3%. Second, when the model is correct but the variance is unobserved, the quality of the model-based forecast deteriorates, but this approach remains the most efficient, with an MSPE of 6.2%.

Third, when the model is wrong, the performance of the model-based forecast is affected much more strongly than in the case of unobserved variance. For the 2F-SR-SV(II) model, for example, instead of an MSPE of 2.3% (if the model were correct), the model-based forecast delivers an MSPE of 67.2%. However, the model-based forecast keeps the leading position, as the reduced-form forecast gives 68.6%. Finally, the combination of the two effects – misspecification in the model and unobserved variance – drives the performance of the model-based forecast below that of the reduced-form forecast. Despite its simplicity, the reduced-form forecast gives 68.8% MSPE vs. 115.6% MSPE for the model-based forecast. In the case of the 2F-SR-SV(I) model, the results are qualitatively similar, and the reduced-form forecast renders smaller errors, although the difference in errors is less dramatic: a 35.4% error for the reduced-form forecast vs. a 39.3% error for the model-based forecast.

The most remarkable conclusion from Table 3 is that, irrespective of the case considered, the Mincer–Zarnowitz $R^2$s$^8$ of the feasible model-based forecast and the feasible reduced-form forecast are generally close. This finding indicates that the day-ahead reduced-form forecast is successful in capturing the same information that is available to the model-based forecast. Moreover, in terms of the MSPE, the reduced-form forecast can be significantly more accurate, since it is unbiased by construction.

To summarize, both the possible model misspecification and the finite frequency of price observations should be taken into account when comparing the model-based and reduced-form forecasts. The combination of these factors produces a different ranking of the forecasts. This example illustrates how the comparative ranking of the model-based and the reduced-form forecasts can reverse under realistic assumptions.

3.4. Long-horizon forecasting

It is of separate interest to investigate how model-based and model-free approaches perform over longer horizons. Multifactor models may exhibit long-memory-like properties, but one-factor models cannot. Therefore, different behavior can be expected from one-factor and multifactor models for longer-horizon predictions.

In this section, we consider the forecasting of the integrated variance $IV_{t+T}^{t+T+1}$ for $T > 0$. The long-horizon model-based forecast for the ESV models is a generalization of formula (7):
$$E_t IV_{t+T}^{t+T+1} = a_0 + \sum_{i=1}^{p} a_i \frac{1 - e^{-k_i}}{k_i} e^{-k_i T} P_i(x_t). \qquad (26)$$
For example, for a one-factor model, the long-horizon forecast takes the form:
$$E_t IV_{t+T}^{t+T+1} = \theta + \frac{1 - e^{-k}}{k} e^{-kT} (\sigma_t^2 - \theta). \qquad (27)$$
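A hedged one-line implementation of (27), mirroring the sketch of the one-period forecast above; the names are again hypothetical:

```python
import numpy as np

def long_horizon_forecast(sigma2, theta, k, T):
    """One-factor long-horizon forecast, Eq. (27): the deviation of the spot
    variance from its mean is discounted by e^{-kT} before entering the
    one-period formula."""
    return theta + (1.0 - np.exp(-k)) / k * np.exp(-k * T) * (sigma2 - theta)
```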
$^8$ To construct $R^2$, data on realizations of $IV_t^{t+1}$ are regressed on the forecasts $IV_{t+1|t}$. The $R^2$-statistic of this regression is a measure of the forecast efficiency. There is a link between $R^2$ and MSPE, which is the main performance measure in this paper:
$$\frac{\text{MSPE}}{\mathrm{Var}\,IV_t^{t+1}} = 1 - R^2 + \frac{[P(IV_t^{t+1} \mid IV_{t+1|t}) - IV_{t+1|t}]^2}{\mathrm{Var}\,IV_t^{t+1}},$$
where $P(IV_t^{t+1} \mid IV_{t+1|t})$ is the projection of $IV_t^{t+1}$ on its forecast $IV_{t+1|t}$. It follows that $R^2$ is a valid measure of forecast performance that does not, however, take into account the bias part: $P(IV_t^{t+1} \mid IV_{t+1|t}) - IV_{t+1|t}$.
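In code, the Mincer–Zarnowitz $R^2$ of footnote 8 amounts to one OLS regression; a minimal sketch with assumed array inputs:

```python
import numpy as np

def mincer_zarnowitz_r2(iv_realized, iv_forecast):
    """R^2 from the Mincer-Zarnowitz regression of realized integrated
    variance on its forecast (footnote 8); unlike the MSPE, this measure
    ignores the bias component of the forecast."""
    X = np.column_stack([np.ones(iv_forecast.size), iv_forecast])
    beta, *_ = np.linalg.lstsq(X, iv_realized, rcond=None)
    resid = iv_realized - X @ beta
    return 1.0 - resid.var() / iv_realized.var()
```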
Table 3
GARCH-SV example.

| Forecasting model | True model | Model-based infeasible | Model-based feasible | Reduced-form infeasible | Reduced-form feasible |
| 1F-GARCH-SV | 1F-GARCH-SV | 0.977 (0.023) | 0.938 (0.062) | 0.958 (0.043) | 0.910 (0.090) |
| 1F-GARCH-SV | 2F-SR-SV(I) | 0.819 (0.181) | 0.674 (0.393) | 0.699 (0.301) | 0.646 (0.354) |
| 1F-GARCH-SV | 2F-SR-SV(II) | 0.328 (0.672) | 0.295 (1.156) | 0.315 (0.686) | 0.312 (0.688) |

The table reports $R^2$ from the Mincer–Zarnowitz regressions and the normalized MSPE in parentheses. Feasible forecasts are based on 5 min returns. 2F-SR-SV(I) is calibrated from Andersen et al. (2004) and 2F-SR-SV(II) is calibrated from Barndorff-Nielsen and Shephard (2002). The results are calculated analytically from formulas presented in the Appendix.
The formulas above imply that, as the horizon T increases, the only error whose effect may be amplified is the error from misspecification. Indeed, the error from the estimates of the latent states remains constant, since the input $\hat{x}_t$ is the same for all T. Moreover, as $T \to \infty$, its contribution to the MSPE fades away. On the other hand, as T increases the true forecast converges to
$$\lim_{T \to \infty} \frac{E_t IV_{t+T}^{t+T+1} - a_0}{e^{-k_j T}} = a_j \frac{1 - e^{-k_j}}{k_j} P_j(x_t), \qquad (28)$$
where $k_j = \min_i k_i$. Therefore, the factors with the smallest mean reversions, $k_i$, will dominate the forecast. If the number of factors is underestimated, the components that play a dominant role for weekly and longer-horizon forecasts will not be extracted properly. For example, for one-factor models, these dominant components are hidden within the estimate of the spot variance $\hat{\sigma}_t^2$.

We extend the analysis of Section 3.2 to find the parameters for which the long-horizon reduced-form forecast outperforms the model-based forecast. Algebraic computations in the general case are given in the Appendix, and the results for T = 1 are presented in Fig. 2. As in the T = 0 case, the parameter sets for which the reduced-form forecast performs best are colored in gray (light gray, dark gray and black), starting from black for 5 min frequencies. Quite remarkably, all of the actual model estimates from Table 2, indicated by squares, now fall into this dark-shaded area, suggesting that for long-term forecasts the reduced-form forecast is more efficient even at the highest frequencies. It follows that for all these models, the underestimation of the number of factors weakens the power of the model to predict over longer horizons, while the performance of the reduced-form forecast is affected far less.
3.5. Effect of microstructure noise

We previously assumed that the observed prices are not contaminated by microstructure effects, i.e., the model (1) described the dynamics of both the observed and the efficient, ‘‘true’’, prices. For available data, however, it is fairer to model the asset price as the sum of the efficient price and a component often referred to as the microstructure noise, $n_t$, i.e.,
$$p_t = p_t^* + n_t,$$
where now only $p_t^*$ follows the dynamics (1), and $n_t$ are zero-mean i.i.d. shocks. The reasons for the presence of microstructure effects are bid–ask spreads, the discrete grid of quotes, heterogeneity of transactions, etc. To construct the total MSPE for the model-based and reduced-form forecasts, the effect of this small contamination has to be taken into account. It is known that the effect of the microstructure noise on the realized variance grows indefinitely as $h \to 0$; see Zhang et al. (2005). Therefore, we would expect the prior results to be largely distorted at high frequencies. The performance of various reduced-form forecasts under microstructure noise is studied in detail by Andersen et al. (2011). This section looks at a similar setup from a new perspective, comparing those forecasts to the model-based approach for a continuum of parameters.

To evaluate the effect of the microstructure noise on the forecasts, we make two simplifying assumptions. First, as in the mainstream literature (see Aït-Sahalia and Mancini, 2008), the microstructure noise is modeled by a normally distributed i.i.d. sequence. In the proofs, the normality assumption defines only the link between the variances of $n_t^2$ and $n_t$, and therefore this assumption can easily be relaxed by calibrating these quantities separately. The second assumption is that the biases of the model-based and reduced-form forecasts are equal. This assumption holds when $\hat{\theta}$ in the model is estimated by a sample mean of the spot variance estimates $\hat{\sigma}_t^2$. The derivations of the modified formulas for the MSPEs are outlined in the Appendix.

In addition to the RV-based reduced-form forecast, this section also reports results for the reduced-form forecast based on TSRV, a nonparametric measure of variance that is robust to the microstructure noise:
$$TSRV_t^{t+1} = \frac{\frac{1}{K}\sum_{j=1}^{K} RV_{t+(j-1)h}^{t+1+(j-1)h} - \frac{1}{K} RV_t^{all,\,t+1}}{1 - \frac{1}{K}}, \qquad (29)$$
where the realized variances $RV_{t+(j-1)h}^{t+1+(j-1)h}$ are based on observations recorded at $K \times h$ intervals starting at $t + (j-1)h$, and the ‘‘tick-by-tick’’ realized variance, $RV_t^{all,\,t+1}$, is computed using the higher-frequency observations recorded at each h-interval. Studies show the superior performance of TSRV over RV as a measure of return variability; see Zhang et al. (2005) and Aït-Sahalia and Mancini (2008). The results of these studies motivate us to include it as an alternative in the comparison of the forecasts. Similar to RV, the reduced-form forecast based on TSRV is a linear projection of the future TSRV on its past values.
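A hedged sketch of the TSRV estimator (29) for one period of finely sampled log-prices; the subsampling is done by simple slicing:

```python
import numpy as np

def tsrv(log_prices, K):
    """Two-scales realized variance, Eq. (29): the average of K subsampled
    realized variances, bias-corrected with the tick-by-tick RV and rescaled
    by 1/(1 - 1/K). `log_prices` covers one period on the finest grid."""
    rv_all = (np.diff(log_prices) ** 2).sum()
    rv_sub = np.mean([(np.diff(log_prices[j::K]) ** 2).sum() for j in range(K)])
    return (rv_sub - rv_all / K) / (1.0 - 1.0 / K)
```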
Fig. 2. Comparison of long-horizon forecasts (day after tomorrow).

Table 4
Parameter estimates of the one-factor model for DM/USD data.

| GMM based on: | $\hat{k}$ | $\hat{\lambda}$ | $\hat{\theta}$ |
| – daily RV: | 0.250 | 0.459 | 0.508 |
| – weekly RV: | 0.047 | 0.325 | 0.508 |
| – monthly RV: | 0.033 | 0.257 | 0.508 |
| – quarterly RV: | 0.018 | 0.216 | 0.509 |

The table reports the GMM parameter estimates for the model (17) with $\lambda = \frac{\mathrm{Var}\,\sigma_t^2}{E\sigma_t^4}$. Time is measured in daily units and returns are in percentage form.

Fig. 3 shows the comparison between the model-based forecast and the two reduced-form forecasts: the left panels correspond to the RV case, and the right panels to the TSRV case. To highlight the effect of the microstructure noise, we assume that the available data are finely recorded at 5 min intervals, and the forecasts use all the available data. For the TSRV forecast, we assume that K = 12. Thus, the graph fixes the frequencies but varies the intensity of the microstructure noise, measured by the noise-to-signal ratio $\frac{En_t^2}{E\,IV_t^{t+1}}$. The selected noise intensities are from the same range as in Andersen et al. (2011). Different shades of gray denote parameter sets for which the reduced-form forecasts outperform the model-based forecast for different choices of the microstructure noise intensity.

The first question addressed by Fig. 3 is whether the model-based or the reduced-form forecasts are more sensitive to the microstructure noise. Note that the parameter sets for which the reduced-form forecasts outperform the model expand as the noise intensity increases. Even at fine 5 min frequencies, the RV-based reduced-form forecast outperforms the model-based forecast for a large set of parameters if the noise-to-signal ratio is at least 1%. And the TSRV-based reduced-form forecast is almost always the best choice.

The second question addressed by Fig. 3 is whether the use of TSRV instead of RV yields a substantial improvement. Comparing the left and right panels of the graph, we can see that the parameter
sets for which TSRV outperforms the model are significantly larger than those for RV. Thus, this study confirms the superior performance of TSRV-based reduced-form forecasts in the presence of microstructure noise; see Andersen et al. (2011) for a thorough study of this topic.

Note that, as in the one-period case without noise, the result in Fig. 3 is driven mostly by the error in the estimation of the spot variance, since $E_t IV_t^{t+1} \approx \sigma_t^2$. Thus, to gain quick insight into the result in Fig. 3, we can analyze the decomposition of RV (see Aït-Sahalia and Mancini, 2008) for $h \approx 0$:
$$RV_t^{t+1} \approx IV_t^{t+1} + 2\frac{En_t^2}{h} + Z\sqrt{4En_t^4 h^{-1}} + O(h),$$
where Z is a standard normal variable. From formula (62) in the Appendix it follows that
$$\mathrm{Var}(\hat{\sigma}_t^2 - \sigma_t^{*,2}) \approx 2b_h\,\frac{En_t^4}{h^2} = 2\sqrt{\lambda\tilde{k}}\,\frac{En_t^4}{h^{3/2}}.$$
Hence, we conclude that the variance of the model-based forecasting error is $h^{-3/2}$-sensitive to the microstructure effects, while the sensitivity of the reduced-form forecasting error is of the order $h^{-1}$.

4. Empirical example

We have demonstrated that the efficiency of the model-based forecast vanishes when we consider its feasible version. This result was established theoretically for models that are calibrated to observed data. In this section we demonstrate the same effect with actual DM/USD exchange rate 5 min returns from December 2, 1986, through June 30, 1999. The chosen market is open twenty-four hours a day, yielding a total of 288 observations per day. This data set has been extensively studied before, and for a thorough description one can refer to Andersen and Bollerslev (1998). In this study, we use raw data to form the RV series. For the filtration of states in the model-based forecasts, we adjust the intra-day returns by the seasonal component.$^9$ We assess the performance of three model-based forecasts: one, two and three factor. The goal of this section is to compare these to the reduced-form approach. The object of forecasting in all these cases will be the realized variance, RV.

$^9$ We also verified that the results of this section do not change when using raw data.

4.1. One-factor model-based forecasts

To estimate the parameters in (18) we rely on a method of moments that is independent of the other parameters in (1). Specifically, we match the mean of the T-period realized variance, $\theta = \frac{1}{T} E\,RV_t^{t+T}$; the correlation structure of the realized variance,
$$e^{-\tilde{k}T} = \mathrm{corr}\left(RV_{t-T}^{t}, RV_t^{t+T}\right)^{-1} \times \mathrm{corr}\left(RV_{t-T}^{t}, RV_{t+T}^{t+2T}\right);$$
and the variance of the realized variance, to get the volatility-of-volatility $\lambda$.
304
N. Sizova / Journal of Econometrics 162 (2011) 294–311
Fig. 3. Day-ahead forecasts: Microstructure noise. Table 5 Mincer–Zarnowitz R2 (MSPE) for day-ahead forecasts.
The Mincer–Zarnowitz R2 of the model-based forecast is 35.6% for parameters calibrated to quarterly data. This result is close to the corresponding statistic for the reduced-form forecast (35.7%). Moreover, the difference is slightly in favor of the latter. For the other parameter values, the model-based forecast performs noticeably worse. In terms of the MSPE, the reduced-form forecast has a clear advantage with the error of 64.6% vs. 76.5%.
Forecast
R2 (MSPE)
Model-based 1-F: Daily kˆ = 0.250 Weekly kˆ = 0.047 Monthly kˆ = 0.033 Quarterly kˆ = 0.018
0.236 (1.233) 0.324 (0.976) 0.341 (0.866) 0.356 (0.765)
Model-Based 2-F: Model-Based 3-F:
0.370 (0.698) 0.267 (0.782)
4.2. Multifactor model-based forecasts
Reduced Form
0.357 (0.646)
The two-factor model considered here is the affine model suggested and estimated by Bollerslev and Zhou (2002) for the same data set on the DM/USD exchange rates. In contrast to the one-factor case, there are two latent states (x1t , x2t ) in this model. To extract these components we rely on the efficient particle filter following Durham (2006).10 We also include the results for a three-factor CEV-SV model with parameters estimated by Barndorff-Nielsen and Shephard (2002) for the same data set with 10 min observations. Barndorff-Nielsen and Shephard (2002) do not choose a specific family of models. The CEV-SV model is one of the possible choices.
‘‘Model-Based 1-F’’ is a forecast based on the one-factor model (17). ‘‘Model-Based 2-F’’ is a forecast based on the two-factor SR-SV model by Bollerslev and Zhou (2002). ‘‘Model-Based (3-F)’’ is a forecast based on the three-factor CEV-SV model with parameters from Barndorff-Nielsen and Shephard (2002). ‘‘Reduced Form’’ is a linear projection of the realized variance on its past values, with the number of lags chosen by the BIC-criterion. kˆ are estimated mean reversions of volatility from Table 4. Data: DM/USD spot rates. The parameter estimation period for the onefactor model is December 2, 1986, through June 30, 1999; for the two-factor model it is December 2, 1986, through December 1, 1996. The forecast evaluation period is December 2, 1987, through June 30, 1999.
the number of lags chosen by the BIC-criterion. The reduced-form forecast is purely out-of-sample, as the number of lags and values of parameters are calculated using only the data available at the time of the forecast. Forecast performances were evaluated using the whole data sample excluding the first year, i.e., for the period 1987–2007.
10 In this study, the states x , x are extracted using 5 min, 15 min and hourly 1t 2t returns. Here we report results only for 15 min frequencies, since they led to the smallest forecast errors.
N. Sizova / Journal of Econometrics 162 (2011) 294–311
305
Table 6
Mincer–Zarnowitz $R^2$ (MSPE) for long-horizon forecasts.

| Forecast: | Day-ahead, $IV_T^{T+1}$ | 1 Day, $IV_{T+1}^{T+2}$ | 1 Week, $IV_{T+5}^{T+6}$ | 2 Weeks, $IV_{T+10}^{T+11}$ |
| Reduced form | 0.357 (0.646) | 0.233 (0.767) | 0.137 (0.863) | 0.114 (0.887) |
| Model-Based 1-F | 0.356 (0.765) | 0.216 (0.996) | 0.115 (1.557) | 0.105 (1.108) |
| Model-Based 2-F | 0.370 (0.698) | 0.247 (0.815) | 0.146 (0.914) | 0.110 (0.946) |
| Model-Based 3-F | 0.267 (0.782) | 0.195 (0.849) | 0.129 (0.901) | 0.102 (0.920) |

‘‘Model-Based 1-F’’ is a forecast based on the one-factor model. ‘‘Model-Based 2-F’’ is a forecast based on the two-factor SR-SV model by Bollerslev and Zhou (2002). ‘‘Model-Based 3-F’’ is a forecast based on the three-factor CEV-SV model with parameters from Barndorff-Nielsen and Shephard (2002). ‘‘Reduced Form’’ is a linear projection of the realized variance on its past values, with the number of lags chosen by the BIC criterion. Data: DM/USD spot rates. The parameter estimation period for the one-factor model is December 2, 1986, through June 30, 1999; for the two-factor model it is December 2, 1986, through December 1, 1996. The forecast evaluation period is December 2, 1987, through June 30, 1999.
The second half of Table 5 reports the forecasting performance of the multifactor models. The table shows that in terms of $R^2$ the two-factor model-based forecast appears to be the best choice, explaining 37.0% of the variance of RV. However, the reduced-form forecast is very close, with an $R^2$ of 35.7%. Moreover, in terms of the mean squared error, the reduced-form forecast is better than the model (64.6% vs. 69.8%). The performance of the three-factor model is an example of a situation in which model-based forecasting can be extremely inaccurate. A possible explanation is that the CEV-SV model does not properly extract the volatility components, leading to an $R^2$ of 26.7%.

The conclusion from Table 5 is that, despite fitting the data better, the multifactor models produce forecasts that do not significantly outperform the one-factor model in terms of $R^2$. This could be because a large part of the variation in the next-day realized variance is explained by the short-run components, so that one-factor models are sufficient to capture the dynamics of the variance over short horizons. As indicated in the previous section, and as also follows from Table 5, the reduced-form forecast succeeds in capturing the same information and therefore renders a close $R^2$. Moreover, owing to its unbiasedness, it gives the lowest MSPE (64.6%).
4.3. Long-horizon forecasts

Table 6 extends the empirical results of the previous subsection to long-horizon forecasts: predicting the variance after one day, T = 1, after one week, T = 5, and after two weeks, T = 10. The one-factor forecast in the table is ‘‘quarterly’’-calibrated, i.e., k equals 0.018. This parametrization proved to be the most successful for day-ahead forecasting. However, for longer horizons its performance gradually decays, with the MSPE rising to 110.8%. In contrast to the one-factor model, the drop in the quality of the reduced-form forecast and of the multifactor models is less dramatic. The MSPE for the two-factor model reaches the level of 94.6% and the $R^2$ remains at the level of 11.0%. The three-factor model yields a lower MSPE of 92.0%, but also a lower $R^2$ of 10.2%. The reduced-form forecast yields an MSPE of 88.7% and an $R^2$ of 11.5% for the two-week forecast.

For the multifactor models, the long-horizon forecasts can be represented by formula (26). Just as in the case of day-ahead forecasting, their implementation requires time-consuming estimation of the parameters and states $x_t$. In return, they perform only on par with the reduced-form forecast, which is much simpler to implement: an $R^2$ of 10%–11% vs. an $R^2$ of 11.4%. Moreover, their MSPE is distinctly higher than that of the reduced-form forecast. Thus, based on the results in this section, we find that even the more sophisticated multifactor models, which are better at explaining the data, yield higher forecasting errors than the reduced-form forecasts.

5. Conclusion

In this paper, we compare the performances of model-based and reduced-form forecasts of integrated variance assuming availability of intra-day data. We show that when it comes to the feasible versions of the forecasts, reduced-form forecasts can outperform model-based forecasts. Since the model-based forecast requires knowledge of both the true model and the latent instantaneous volatility, model misspecification and errors in the instantaneous volatility estimates can combine to make the model-based forecast perform worse than the reduced-form forecast. We also confirmed these results with actual high-frequency foreign exchange rates and several popular stochastic volatility models.

This paper challenges the conventional wisdom that models always render the most efficient forecasts. It shows that simpler approaches do not necessarily fail in comparison with this benchmark. Our results lead us to challenge the belief that, though estimation and forecasting within SV models are notoriously time consuming, the resulting gain in efficiency justifies the effort. On the contrary, based on the analytical and empirical evidence in this study, we conclude that reduced-form forecasts perform similarly to the supposedly efficient forecasts and are often better. And though our analysis has been limited to particular assumptions, the latter are still quite general. We anticipate, therefore, that our results may carry over to more complex settings.
Acknowledgements I am grateful to Tim Bollerslev and George Tauchen for the valuable advise that nourished this work. I would also like to thank Ronald Gallant, Shakeeb Khan, Bjørn Eraker, Viktor Todorov, and the participants of Duke Monday Econometrics Lunch Group, seminars at UNC Kenan-Flagler Business School, and Stanford Institute for Theoretical Economics Summer Workshop. I am also thankful to the anonymous referees whose comments helped to improve this paper. Appendix A. Moments of volatility measures
• To keep the formulas in this Appendix parsimonious, we use the following notation. For multifactor models of the form (3) we denote a weighted average of an arbitrary function f (ki ), i = 1, . . . , p over all factors with a bar operator: p ∑
f (k) =
a2i f (ki )
i =1 p ∑
, a2i
i=1
where ki , i = 1, . . . , p are persistence parameters and ai , i = 1, . . . , p are factor weights. Notably any function of this type a depends only on persistence parameters ki , and ratios a i , i = 1 2, . . . , p. • The following moments for the integrated variance were derived by Andersen et al. (2004):
306
N. Sizova / Journal of Econometrics 162 (2011) 294–311
Cov(IVtt +1 , σt2 ) =
1 − e− k k
= 2 Var σ
Var IVtt +1
2 t
[
1 k
Var σt2 , 1 − e− k
−
] ,
k2
+l +1 Cov(IVtt +1 , IVtt + |l > 0) = Var σt2 l
1 − e− k
2
k
(30)
to Itô-isometry, since covariances are preserved under L2 convergence, we can show that
(31)
cov(Pi (xt ), ξ
=
e−k(l−1) .
Hence, the correlation structure of the integrated variance depends only on the persistence parameters ki and ratios ai , i = 2, . . . , p. a1 • For the no leverage case, Meddahi (2003) obtained the following moments of the realized variance: t −j+1
Cov(RVtt +1 , RVt −j
j +1 ) = Cov(IVtt +1 , IVtt − −j ),
Var(σ 2 )
= 2hVar(σ ) 2 t
h
∫
h2
+
λ
h
∫
0
1−λ
φ(s − τ )dsdτ
]
0
2 h2
h k
−
1 − e−kh k2
. (33)
• We also will need to find the second and fourth moments of the returns. Denote the h-period demeaned return by t +h ξt +h = h−1/2 t στ dWτp . From Itô-isometry for squareintegrable prices, it follows that 1
E ξt2+h =
h
στ2 dτ = E σt2 .
E t
Eξ
=
t +h
[∫
3 h2
t +h
∫
E t
]
στ σ τ ds t +h t +h 2
t
= 3E 2 σt2 + 3 Var σt2
2 s d
t
t
φ(τ , s)dτ ds
h2
,
E [ξt2+h−jh ξt2+h−ih ]i̸=j
=
1 h
t −jh
[∫
E 2
t −(j+1)h
t −ih
∫
= E 2 σt2 + Var σt2
στ σ 2
t −(i+1)h
t −jh
t −(j+1)h
2 s d
τ ds
t −ih
t −(i+1)h
φ(τ , s)dτ ds
h2
Therefore, for ESV models φ(t , s) = E ξt4+h = 3E 2 σt2 + 6
=E σ + 2 t
.
Var σt2 h h2
k
−
e−k|t −s|
1 − e−kh k2
p a2 e−ki h i=1 i ∑p a2 i=1 i
.
and
t −jh
t −(j+1)h
e−ki |t −τ | dτ =
a2i −ki jh 1 − e−ki h . e h ki
(35)
Summing up over the factors, we obtain the formula for the covariance of the spot variance and past squared returns: 1 − e−kh kh
.
(36)
Appendix B. Proof of Proposition 1 In this section we identify a complete set of parameters that affect the comparison of the model-based forecast and the reduced-form forecast. The model-based forecast employs an arbitrary stochastic volatility (SV) model with a variance dynamics of the form (17). The GMM approach is used to estimate the parameters (k, θ ), and the states are extracted by the ARCH filter. The reduced-form forecast employs the ARMA(p, p) model for realized variances RV to predict IV. For the GMM estimate, our first choice is a procedure that leaves forecast (18) unbiased irrespective of the model. This is achieved by a GMM procedure that matches the expectation E (IVtt +1 − θ˜ ) = 0,
(37)
Var σt2 1 − e−kh
[
k
]2
˜
1 − e− k
. (38) k˜ The case when parameters are defined by the conditions above will be further referred to hereafter as the ‘‘no-bias’’ case: in Mincer–Zarnowitz regressions, this forecast would yield the zero intercept and the slope equal to one. However, instead of the above moments, other moment conditions can be used to define k˜ and θ˜ . In general, estimation procedures based on those other moments may introduce a forecast bias. For the ARCH filter, σˆ t2 follows the discrete-time GARCH(1, 1): Var σ
2 t
=
,
e−k(|j−i|−1)h .
pt +h −pt −µh
√
h
(39)
. The parameters of the model are chosen
function: corr(σt2+h , σt2 ) =
∑p
a2 e−ki h i=1 i ∑p a2 i=1 i
. Suppose we apply the
ARCH filter of the form (20) to extract the spot variance. The log-price process is described by (1) with no leverage effect and zero drift. Then the comparison of the reduced-form forecast and the forecast based on a one-factor model depends only on the following set of parameters Ξ :
• Volatility-of-volatility: λ =
Var σt2 E σt4 ai i a1
;
• Relative weights of factors: , = 2, p; • Persistence of factors: ki , i = 1, p; • Sampling frequency: h.
]
h2
∫
cov[Pi (xt ), Pi (xτ )]dτ
optimally to minimize asymptotic MSE. Proposition 1. Let σt2 be square integrable with the correlation
2 2 t +h−jh t +h−ih i̸=j
2
a2i
where ξt +h =
be decomposed in the form corr(σt2+h , σt2 ) =
ξ
t −(j+1)h
σˆ t2+h = φh + ah σt2 + bh ξt +h ,
]
A variant of the above is proved by Meddahi (2002) for ESV models. From the ESV representation, it follows that the correlation function of the square-integrable process can ∑
E [ξ
h
cov(IVtt +1 , σt2 )
Similar to the proof of Itô-isometry, we can show that for the square-integrable spot variance under the no leverage condition, the forth moment of the return equals 4 t +h
cov[Pi (xt ), στ2 ]dτ
and the regression slope:
t +h
∫
t −(j+1)h
t −jh
∫
h
h
t −jh
(32)
where the last component is the variance of the noise ut +1 in RVtt +1 . Its variance has been derived by Barndorff-Nielsen and Shephard (2002):
[
=
1
)=
∫
1
cov(σt2 , ξt2−jh ) = Var σt2 e−kjh
j ̸= 0,
Var RVtt +1 = Var IVtt +1 + Var ut +1 ,
Var ut +1 = 2h E 2 σ 2 +
2 t −jh
(34)
• Finally, we will derive the covariance structure between factors Pi (xt ) and past h-period squared returns ξt2−jh . Similarly
Proof. The proof is organized as follows. First, we find the total MSPE of the ‘‘no-bias’’ model-based forecast. Second, we find the total MSPE of the reduced-form forecast. Finally, we compare the errors and consider a few extensions, namely (1) modelbased forecast with a bias and (2) long-horizon forecasts.
N. Sizova / Journal of Econometrics 162 (2011) 294–311
B.1. MSPE of the model-based forecast
Rewrite the filter for the spot volatility in the following form:
As was shown in Section 3, the total MSPE for the ‘‘no-bias’’ forecast based on a one-factor model is the sum of the ‘‘genuine’’ forecast error,
[
GFEmodel = E IVtt +1 − E σt2 −
cov(
IVtt +1
]2 ,σ ) 2 2 (σt − E σt ) , (40) 2 2 t
Var σt
and the part that is due to the error in spot variance, F (σˆ − σ ) = 2 t
2 t
[
cov(IVtt +1 , σt2 ) Var σ
]2
2 t
E σt2 − σˆ t2
2
.
(41)
If the model is correctly specified, then the ‘‘genuine’’ error is independent of all the past information and, therefore, of the error in the spot variance. However, under model misspecification this is not the case and the additional part of the error is the covariance between the above two components:
Cov(GFEmodel , F (σˆ t2 − σt2 )) = Cov IVtt +1 −
cov(IVtt +1 , σt2 )
cov(IVtt +1 , σt2 ) Var σt2
Var σt2
(σ − σˆ ) . 2 t
2 t
˜
k˜
=
1 − e−k
.
k
σˆ t2 = bh
∞ −
(42)
(43)
Further analysis of the model-based forecast error will be organized into three steps. First, we will devise the formula for the ‘‘genuine’’ error (40). Second, we will find F (σˆ t2 − σt2 ) from (41). Third, we will provide the formula for the covariance term (42).
j =0
The expression for R2 from the Mincer–Zarnowitz regression obtained by Andersen et al. (2002) can be readily converted into the expression for the corresponding mean squared error: GFEmodel = Var IVtt +1 (1 − R2 )
=
−
1
p −
Var σt2
1 a2i
− e−ki
.
Substituting the moment (31) into the above expression yields the formula for the ‘‘genuine’’ error: GFEmodel = Var σt2
[ 2
1 k
−
1−e
] −k
[ −
k2
1−e
]2 −k
GFE
= Var σ
2 t
[ 2
1 k
−
1 − e− k
k
k2
]
[ −
1 − e− k k
(46) t +h
στ dW
p
E σˆ t4 =
∞ − φh φh2 j b ah E (ξt2−j ) + 2 h (1 − ah )2 1 − ah j=0 2 ∞ − j + b2h E ah ξt2−j . j =0
2φh bh φh2 + Eσ 2 2 (1 − ah ) (1 − ah )2 t h 1−e−kh − b2h k2 E 2 σt2 + 2 Var σt2 k +3 h2 1 − a2h −kh )2 b2h ah E 2 σt2 ah 2 (1 − e +2 + Var σt . k2 h2 1 − ah e−kh 1 − a2h 1 − ah
E (σˆ t4 ) =
(47)
Analogously, the corresponding cross-product equals E (σˆ t2 σt2 ) =
∞ − φh j E σt2 + bh ah E (ξt2−jh σt2 ). 1 − ah j =0
Substituting the correlation term from Eq. (36) yields that E (σˆ t2 σt2 ) =
φh 1 − ah
E σt2 +
+ bh Var σt2
bh 1 − ah
E 2 σt2
1 − e−kh
1
kh
1 − ah e−kh
.
(48)
Thus, the mean squared error in the spot variance estimate equals E (σˆ t2 − σt2 )2 = E (σˆ t4 ) + E σ 4 − 2E (σˆ t2 σt2 ).
(49)
√ φ(h) = O(h), bh = O( h), ah + bh = 1 − O(h). Then, taking limits of the expressions (47), (48) as h → 0 yields that: lim E (σˆ t4 |k) = E σ 4 , h→0
lim E (σˆ t2 σt2 |k) = E σ 4 .
h→0
.
(44)
For comparison, we derive the ‘‘genuine’’ error of a forecast based on a multifactor model. It also follows from the expression for R2 from the Mincer–Zarnowitz regression obtained by Andersen et al. (2004) and equals p-F model
,
Under what conditions is the estimate σˆ t2 consistent? Suppose,
2
ki
i =1
1 − ah
τ where the innovation is equal to ξt +h = t √ . That is, our h estimation of the spot volatility is a weighted average of past squared demeaned returns and a constant that later will be defined to converge to zero as h → 0. After squaring the estimate σˆ t2 and taking the expectation, we find that the second moment of the estimator σˆ t2 equals
Step 1: it ‘‘Genuine’’ forecast error GFEmodel
Var IVtt +1
φh
j
ah ξt2−jh +
After substituting moments from (34) we find that
σt2 ,
Forecast is formed using the estimates k˜ and θ˜ that match the moments (37) and (38). Using formula (30), those moments imply that11 1 − e− k
307
]2
.
(45)
Hence, limh→0 E (σˆ t2 − σt2 |k)2 = 0. Therefore, any GARCH filter satisfying the assumptions outlined above is consistent, irrespective of the model. This result was first proved by Nelson (1992) for a general class of ARCH filters. What set of coefficients ah , bh , φ(h) renders the most efficient estimate σˆ t2 ? To answer this question, we minimize the error (49) approximated around h = 0. Asymptotically, the following relation holds: E (σˆ t2 − σt2 |k)2 ≈ hVar σ 2 p ∑
Step 2: Error due to volatility estimation k¯ = 11 See Meddahi (2001) for the related discussion of the misspecification case.
a2i ki
i=1 p
∑ i=1
. a2i
k¯ bh
+ bh E σ 4 ,
308
N. Sizova / Journal of Econometrics 162 (2011) 294–311
√ ¯ For any λkh. √ one-factor model, the efficient coefficients are bh = λkh, ah = 1 − bh − kh, and φh = kθ h. After substituting the limit for the The above error is minimized at bh =
Var σ 2 kh Eσ 4
¯ =
estimate of k given by (43), we find the following values for the ‘‘optimal’’ coefficients:
where: E (σˆ t2 − σt2 )2 Var σt2
1−λ
=
λ
(50)
+
Step 3: Covariance term
Cov(GFEmodel , F (σˆ t2 − σt2 )) ˜
=−
1 − e− k k˜
Cov
IVtt +1
cov(IVtt +1 , σt2 )
−
Var σt2
σ , σˆ 2 t
2 t
.
∑p
i=1
1−e−ki ki
(1 − ah )2
e−kh
1 1 − ah e−kh
kh
1 − λ 3 − ah
λ
1 − a2h
1 − ah
+
+6
(1 − e−kh )2
ah
k2 h2
1 − ah e−kh
−2
−k˜
=
bh
1 − ah
1
λ h − k
1−e−kh k2 2 h
(53)
˜ λkh
k˜
Pi (xt ).
+
˜ h ah 2khb
(54)
˜ ah = 1 − bh − kh 1−e
The covariance can be simplified further, since the integrated variance inside the covariance can be replaced by its conditional expectation Et (IVtt +1 ) = θ +
bh =
1−
b2h
+2
Since the covariance between the genuine forecast error and the spot variance is zero for the unbiased forecast, the covariance term (42) simplifies to:
k˜ 2 h2 1 − a2h
− 2bh
˜ , bh = λkh ˜ , ah = 1 − bh − kh φh = k˜ θ h.
1−
(55)
e− k
k
.
It follows from the expression above that (1) the total MSPE of the model-based forecast is proportional to the variance of the spot variance and (2) the coefficient of proportionality is a function of parameters Ξ only.
Cov(GFEmodel , F (σˆ t2 − σt2 ))
=−
1 − e− k k
Cov
p − 1 − e−ki
−
ki
i =1
1 − e− k
k
B.2. MSPE of the reduced-based forecast
Pi (xt ), σˆ
2 t
.
Combining formula (35) with the definition of filter (46) yields the expression for the above covariance: Cov(Pi (xt ), σˆ t2 ) = bh
∞ a2i 1 − e−ki h −
h
= bh a2i
ki 1−e
Total MSPE = Var ηt +1 (h) − Var u2t +1 + E 2 ut +1
j
ah e−ki jh
j =0
−ki h
1 1 − ah e−ki h
ki h
.
Hence, the covariance between the genuine error and the error in σˆ t2 is equal to Cov(GFEmodel , F (σˆ t2 − σt2 )) = Var σt2
×
1 − e− k k
−
1 − e− k
1 − e− k k bh
kh
1 − ah e−kh
.
(51)
In our final step, we assemble three parts of the total error of the unbiased model forecast and substitute for the efficient parameters ah , bh , φh from (50). Also, we replace the expectation of the spot variance which enters formulas (47) and (48) by
1−λ
λ
Var σt2 .
Total MSPE = GFEmodel + F (σˆ t2 − σt2 )
+ 2Cov(GFEmodel , F (σˆ t2 − σt2 )) [ ] [ ]2 −k −k 1 1 − e 1 − e = Var σt2 2 − − 2 k
F (σˆ t2 − σt2 ) = Cov(GFE
model
×
k
k˜ 2 t
E (σˆ t2 − σt2 )2
−
2 t
1 − e− k k
i =1
p ∑
β i Li
(RVt +1 − θ ) = ηt +1 (h). t
21 t
Slope coefficients in the ARMA representation βi are naturally functions of the correlation structure of RV. On the other hand, from formulas for the moments of IV and RV (30)–(33) the correlation structure of the realized variance depends only on the parameters Ξ . Hence, the parameters of the ARMA representation are the functions of only Ξ . In particular, Meddahi (2003) derived coefficients for the ARMA representation of realized variance for the case of one-factor and two-factor models. It then follows that the variance of the innovation is equal to
(1 − e−ki L)
i=1
Var ηt +1 (h) = Var
− e− k
1−
= Var σt2 Ψ (Ξ ). bh
kh
1 − ah e−kh
p ∑
βi Li
t +1 RV t
i =1
k
1 − e−kh
(56)
i=1
∏ p
, F (σˆ − σ )) = Var σ
1 − e− k k
k
˜ 2
1 − e− k
(1 − e−ki L)
1−
Total MSPE of the model-based forecast
GFEmodel
where the first component is an innovation in the ARMA representation for realized variance (14). The second component ut +1 is the difference between the realized variance and the integrated variance. Its variance is expressed by (33) and expectation is typically negligible, and is zero for the case of no drift in returns. To obtain the variance of the innovation in the ARMA representation for RV, convert ARMA into an infinite AR representation:
∏ p
1 − e−kh
k
In this part, we will estimate the error from reduced-form forecasting. As follows from (15) the total Mean Squared Prediction Error for the no leverage case is decomposed as follows:
(57)
(52)
The exact formulas for the above variance in the case of one-factor and two-factor models are also derived by Meddahi (2003).
N. Sizova / Journal of Econometrics 162 (2011) 294–311
Summing up the parts of the error and substituting the variance of the noise (33), the total error of the reduced-form forecast equals: E (IVtt +1 − P (IVtt +1 |RV))2
= Var σt2 Ψ (Ξ ) − 2h
1−λ
λ
2
−
h
h2
k
−
1−e
−kh
k2
.
(58)
309
Following the same steps from the previous subsection, we find that the total MSPE of the model-based forecast is equal to: Total MSPE = GFEmodel + F (σˆ t2 , Bias)
+ 2Cov(GFEmodel , F (σˆ t2 , Bias)), 2 ˜ 2 1 − e− k 1 − e− k 2 4 F (σˆ t , Bias) = E σˆ t + E σt4 k k˜
(59)
˜
−2
B.3. Forecast comparison
1 − e− k 1 − e− k k˜
k
E [σˆ t2 σt2 ], ˜
Hence, the total MSPE of both forecasts is proportional to the variance of the spot Var σt2 . The corresponding coefficients of proportionality derived in (52) and (58) are functions of the coefficients Ξ . Hence, the comparison of the forecasts depends only on the parameters from the set Ξ .
Cov(GFEmodel , F (σˆ t2 , Bias)) = Var σt2
B.4. Extension 1: bias in the model-based forecast
where
In the above proof, we defined the parameter k˜ in a way that ensures that the resulting model-based forecast is unbiased. In general, we may assume that the estimated persistence parameter k˜ satisfies another moment condition:
1 − e− k
×
k
E (σˆ t4 ) Var σ
2 t
=
+3
f (k˜ ) = f (k). For example, the method that matches n-period covariances of the ˜ defined by the condition: spot variance results in k, ˜ e−nk = e−nk .
E
=E
− θ˜ −
IVtt +1
1 − e− k k˜
(σ − θ ) 2 t
where the first term includes a linear projection of the integrated variance on the most recent spot variance Pt (IVtt +1 ). Therefore, once we allow for model misspecification, the ‘‘genuine’’ forecast error includes two terms: a forecast error from the linear forecast based on the last observed spot and the ‘‘bias’’. This term is absent if the forecast is based on the parameters defined by (38), i.e., in the ‘‘no-bias’’ case. The redefined F (σˆ t2 − σt2 ) is
F (σˆ t2 , Bias) = E E σt2 +
cov(
IVtt +1
,σ ) 2 t
Var σ
1−e k˜
−k˜
kh
1 − ah e−kh
(1 − ah )
1 − a2h
λ
b2h
1−λ
ah
λ
1 − ah
1 − a2h
=
˜ kh 1 − ah
+ bh
+2
+
h k
−
λ 1−e−kh k2 h2
+
(1 − e−kh )2
ah
k2 h2
1 − ah e−kh
bh
(61)
1−λ
2
1−λ
b2h
˜ h 2khb
,
,
1−λ
λ
1 − ah
1 − e−kh
1
kh
1 − ah e−kh
.
Since the above error takes the form Var σt2 f (Ξ ), Proposition 1 is still valid for this more general case.
(σˆ t2 − θ˜ )
,
, F (σˆ , Bias)) = Cov IVtt +1 − 2 t
cov(
IVtt +1
Var σ
cov(IVtt +1 , σt2 )
, σt2 )
2 t
Suppose there is a constant drift in returns µdt. This generalization of the base case will not change the formula for the errors in the model-based forecast, since we demeaned returns before extracting spot variances. However, the total prediction error of the reduced-form forecast will now be equal to: Total MSPE ‘‘Reduced Form’’ Var σt2
= Ψ (Ξ ) − 2h
1−λ
λ
−
2 h2
h k
−
1 − e−kh k2
+
E 2 ut + 1 Varσ 2
,
where the expectation of the noise is
The last term also affects the comparison. However, it is normally small for intra-day data and can be omitted from consideration. In general, constructing RV as a sum of the squared demeaned returns will result in the same forecast comparison as in the case of zero drift.
2
Cov(GFE
bh
k
+
1 − e−kh
Eut +1 = µ2 h.
(σt2 − E σt2 )
2 t
and the covariance term is equal to model
1−
a2h
k˜
B.5. Extension 2: drift in returns
˜ 2 2 cov(IVtt +1 , σt2 ) 1 − e− k 2 , − Pt (IV ) + Var σt − Var σt2 k˜
− θ˜ −
Var σ
2 t
k˜ 2 h2
1 − e− k
2
˜
IVtt +1
+2
−
1 − e− k
E (σˆ t2 σt2 )
Here, we derive the contribution of this bias to the total MSPE. To keep the algebra simple, we will assume that for all estimation procedures θ matches the unconditional mean of the integrated variance IVtt +1 as in (37). In this case, irrespective of the choice for the other GMM moments, the ‘‘genuine’’ forecast error will be a sum of two parts
(60)
Var σt2
σt2 −
−k˜
1−e k˜
B.6. Extension 3: long-horizon forecast
σt2 , σˆ t2 .
For the model-based forecast (27), we decompose the error in the sum of the genuine part and the error coming from the error in spot variance. The genuine part follows from the prediction error of regressing IV on the last observable spot:
310
N. Sizova / Journal of Econometrics 162 (2011) 294–311
model
=
GFE
+T +1 Var IVtt + T
= Var σ 2 t
2
− Var σ
[ 1+
k
2 t
e−kT
1−e
1 − e− k
where P is a population projection of IVtt +1 on σt2 , and Pˆ is the estimated projection. Note that GFEmodel will remain the same as in the case with no microstructure noise. Denote the original forecast
k
] −k
− e
k
2
1 − e−k −kT
2
k
based on observations without noise by P ∗ = θˆ + 1−κeˆ
.
where σt is a noise-free filter of the spot variance. Then, the MSPE can be further decomposed, total MSPE = GFEmodel + Var(P − P ∗ ) + Var(P ∗ − Pˆ )
Total MSPE = GFEmodel + F (σˆ t2 ) + 2 Cov(GFEmodel , F (σˆ t2 )), ˜
F (σˆ t2 ) =
e−kT
1−e
−k˜
2
1−
E σˆ t4 + e−kT
k˜
e− k
+ 2cov(P − P ∗ , P ∗ − Pˆ ) + 2cov(IVtt +H − P , P − P ∗ ) + 2cov(IVtt +H − P , P ∗ − Pˆ ) + BIAS.
2 E σt4
k
−κˆ H ∗,2 Since Pˆ − P ∗ = 1−eκˆ (σˆ t2 − σt ) and
˜
−k 1 − e− k 2 2 ˜ 1−e e−kT E σˆ t σt , − 2e−kT k k˜
σˆ t2 − σt∗,2 = bh
˜
Cov(GFEmodel , F (σˆ t2 )) = Var σt2 e
×
e
1 − e− k −kT
−e
k
−k ˜ 1−e −kT
1 − e− k −kT
1 − e−kh
bh
kh
1 − ah e−kh
k
.
For the reduced-form forecast based on the ARMA, the total MSPE is constructed in the same manner as for the short-term forecasts: t +T −j
+T +1 +T +1 T +1 T +1 E 2 (IVtt + − RVtt + ) + Var(RVtt + − P (RVtt + T T +T +T |RVt −j
))
T +1 T +1 − Var(IVtt + − RVtt + +T +T ).
T +1 RVtt + +T
− P(
T +1 RVtt + +T
t +T −j RVt −j
|
1−
T )= p F ∏
p ∑
βi (h)Lp
i=1
(1 − e−ki L)
ηt +1 (h).
i =1
+
Since parameters of the ARMA representation are functions of Ξ and the variance of the innovation ηt takes the form (57), the variance of the above expression is equal to: t +T −j
Var(RVtt +T − P (RVtt +T |RVt −j
×
)) = ΨT (Ξ )Var σt2 .
Var σt2
= ΨT (Ξ ) − 2h −
2
h2
h k
−
1−λ
λ
1 − e−kh k2
.
Thus, the comparison for long-term forecasts is similar to the case of short-term forecasts. As before, this comparison depends solely on the parameters in Ξ .
t −jh t −jh−h
στ dWτ
h
, (62)
covariances of the difference P ∗ − Pˆ with noise-free measures, IVtt +H , P , P ∗ are all zeros. That is, the error has only one extra term, namely the variance Var(P ∗ − Pˆ ):
Var(P − Pˆ ) = ∗
1 − e−κˆ H
2
κˆ +
b2h
2(1 + ah )Var n2t + 4E 2 n2t 1 − a2h
h2
8θ hEnt
2
1 − a2h
.
Reduced-form forecast The arguments of Meddahi (2003) apply to RVtt +1 and TSRVtt +1 under microstructure noise. These arguments remain valid since corresponding conditional means are linear in factors Pi (xt ). For example, for TSRVtt +1 it holds that: p ∑
Et TSRVtt +1
The resulting total MSPE equals Total MSPE ‘‘Reduced Form’’
j
ah
(nt −jh − nt −jh−h )2 + 2(nt −jh − nt −jh−h )
The unexpected part of the realized variance is calculated by iterating the ARMA model for RV:
∞ − j=0
k˜
(σt∗,2 − θˆ ),
∗,2
The total error equals:
−κˆ
i =1
=θ+
1−e−ki ki
( )
Pi xt K1 1−
K ∑
e
−ki (j−1) Kh
j =2
.
1 K
Thus, the dynamics of both RVtt +1 and TSRVtt +1 can be described by ARMA(p, p) models. The coefficients of the ARMA representations can be derived in the same way as in Meddahi (2003) by inserting first and second moments of RVtt +1 and TSRVtt +1 into the formulas. The moments of RVtt +1 and TSRVtt +1 are as follows. The expectation of RVtt +1 under microstructure noise is, ERVtt +1 = θ +
2E (n2t ) h
.
B.7. Extension 4: microstructure noise
That is, RVtt +1 is a biased estimator of the variance. The variance of RVtt +1 is,
Model-based forecast As before the model-based forecast is based on the assumption
Var RVtt +1 = Var RVt
Et IVtt +1 = θ +
1 − e−κ
κ
(σt2 − θ ).
2En2t . h
The total MSPE can be
total MSPE = GFEmodel + Var(P − Pˆ ) + 2 cov(IVtt +H − P , P − Pˆ )
+ BIAS,
+ (4/h − 2)Var(n2t ) + 4/hE 2 (n2t )
+ 8θ E (n2t ), t +1,∗
If θˆ is equal to the sample average of σˆ t2 , which is the same as the average of ξt2 , then the squared bias of the above forecast is equal to BIAS = (E ξt2 − θ )2 with E ξt2 = θ + decomposed in the following way,
∗,t +1
where Var RVt is the same variance under no microstructure noise. The covariances are the same as in the case with no microstructure noise. In contrast to RVtt +1 , TSRVtt +1 is unbiased, i.e., ETSRVtt +1 = θ . Its variance: ∗,t +1
Var TSRVtt +1 = Var TSRVt
2(K − 1)Var(n2t ) + 8 Kh E 2 (n2t ) + 8E (n2t )θ K − 1 + h −
+
(K − 1)2
h K
,
N. Sizova / Journal of Econometrics 162 (2011) 294–311
∗,t +1
where Var TSRVt
is a variance of TSRVtt +1 if nt ≡ 0. Second
∗,t +1
moments of TSRVt are derived by Andersen et al. (2011). Using the results of Meddahi (2003), we can find coefficients in the following ARMA representations,
∏ p
(1 − e−ki L)
2 i =1 RVt +1 − θ − 2E (nt ) = ηr v,t +1 (h) t p ∑ h 1− βr v,i Li i=1
∏ p (1 − e−ki L) i=1 TSRVt +1 − θ = ηtsr v,t +1 (h). t p ∑ 1− βtsr v,i Li i =1
Then the MSPE of the RV-reduced-form forecast is total MSPE = Var ηr v,t +1 (h) − Var(RVtt +1 − IVtt +1 ) + BIAS
BIAS =
2E (n2t )
2
h
Var(
RVtt +1
− IVtt +1 ) = (4/h − 2)Var(n2t ) + 4/hE 2 (n2t ) + 8θ E (n2t ) t +1 − IVtt +1 ), + Var(RV∗, t
and the MSPE of the TSRV-reduced-form forecast is total MSPE = Var TSRVtt +1 − Var ηtsr v,t +1 (h) + Var IVtt +1
t +1
− 2 cov(TSRV t
, IVtt +1 ),
t +1
where TSRV is a linear projection of TSRVtt +1 on past values. t From the ARMA representation of TSRV and definition (29) it follows that t +1
cov(TSRV t
, IVtt +1 ) ≈
1 K −1
Var σt2
K
×
−
1−e
j =2
k
−k 2
e
j−1 k 1+ K
∏ p
(1 − e−ki e−k )
1 − i =1 . p ∑ 1− βtsr v,i e−ik i =1
Appendix C. Tables and figures See Tables 1–6 and Figs. 1–3. References Aït-Sahalia, Y., Mancini, L., 2008. Out of sample forecasts of quadratic variation. Journal of Econometrics 147, 17–33. Aït-Sahalia, Y., Mykland, P.A., Zhang, L., 2005. How often to sample a continuoustime process in the presence of market microstructure noise. The Review of Financial Studies 18 (2), 351–416. Alizadeh, S., Brandt, M.W., Diebold, F.X., 2002. Range-Based Estimation of Stochastic Volatility Models. The Journal of Finance 57 (3), 1047–1091. Andersen, T.G., Benzoni, L., Ghysels, E., Lund, J., 2002. An Empirical Investigation of Continuous-Time Equity Return Models. The Journal of Finance 57 (3), 1239–1284. Andersen, T.G., Bollerslev, T., 1998. Deutsche Mark-Dollar Volatility: Intraday Ac-
311
tivity Patterns, Macroeconomic Announcements, and Longer Run Dependencies financial markets. The Journal of Finance 53 (1), 219–265. Andersen, T.G., Bollerslev, T., Christoffersen, P.F., Diebold, F.X., 2005. Volatility forecasting, Working Paper, NBER. Andersen, T., Bollerslev, T., Diebold, F.X., Labys, P., 2003. Modeling and Forecasting Realized Volatility. Econometrica 71 (2), 529–625. Andersen, T.G., Bollerslev, T., Meddahi, N., 2004. Analytical Evaluation of Volatility Forecasts. International Economic Review 45 (4), 1079–1110. Andersen, T.G., Bollerslev, T., Meddahi, N., 2011. Realized volatility forecasting and 40 market microstructure noise. Journal of Econometrics 160 (1), 220–234. Barndorff-Nielsen, O.E., Shephard, N., 2002. Econometrics Analysis of Realized Volatility and Its Use in Estimating Stochastic Volatility Models. Journal of the Royal Statistical Society 64 (2), 253–280. Bollerslev, T., Engle, R., 1986. Modeling the Persistence of Conditional Variances. Econometric Reviews 5, 1–50. Bollerslev, T., Zhou, H., 2002. Estimating Stochastic Volatility Diffusion Using Conditional Moments of Integrated Volatility. Journal of Econometrics 109 (1), 33–65. Chernov, M., Gallant, A.R., Ghysels, E., Tauchen, G., 2003. Alternative Models for Stock Price Dynamics. Journal of Econometrics 116 (1–2), 225–257. Corsi, F., 2004. A simple long memory model of realized volatility, Working Paper, University of Lugano and Swiss Finance Institute. Duffie, D., Pan, J., Singleton, K., 2000. Transform Analysis and Asset Pricing for Affine Jump-Diffusions. Econometrica 68 (6), 1343–1376. Durham, G.B., 2006. Monte Carlo methods for estimating, smoothing, and filtering one and two-factor stochastic volatility models. Journal of Econometrics 133 (1), 273–305. Gallant, A.R., Tauchen, G., 1998. Reprojecting Partially Observed Systems with Application to Interest Rate Diffusions. Journal of the American Statistical Association 93 (441), 10–24. Gloter, A., Jacod, J., 2001a. Diffusions with measurement errors. I local asymptotic normality. ESAIM: Probability and Statistics 5, 225–242. Gloter, A., Jacod, J., 2001b. Diffusions with measurement errors. II measurement errors. ESAIM: Probability and Statistics 5, 243–260. Heston, S., 1993. A closed-form solution for options with stochastic volatility bond and currency options. Review of Financial Studies 6 (2), 327–343. Huang, X., Tauchen, G., 2005. The Relative Contribution of Jumps to Total Price Variance. Journal of Financial Econometrics 3 (4), 456–499. Jacod, J., Protter, P., 1998. Asymptotic Error Distributions for the Euler Method for Stochastic Differential Equations. Annals of Probability 26, 267–307. Jacquier, E., Polson, N.G., Rossi, P.E., 1994. Bayesian Analysis of Stochastic Volatility Models. Journal of Business and Economic Statistics 12 (4), 371–389. Johannes, M., Polson, N., 2003. MCMC methods for continuous time financial econometrics. In: Ait-Sahalia, Y., Hansen, L.P. (Eds.), Handbook of Financial Econometrics. University of Chicago. Meddahi, N., 2001. An eigenfunction approach for volatility modeling, Working Paper, Centre de Recherche et Développement en Économique. Meddahi, N., 2002. A Theoretical Comparison Between Intergrated and Realized Volatility. Journal of Applied Econometrics 17 (5), 479–508. Meddahi, N., 2003. ARMA Representation of Integrated and Realized Variances. Econometrics Journal 6 (2), 335–356. Nelson, D.B., 1992. Filtering and Forecasting with Misspecified ARCH models I: Getting the right variance with the wrong model. 
Journal of Econometrics 52, 61–90. Nelson, D.B., Foster, D.P., 1994. Asymptotic Filtering Theory For Univariate ARCH Models. Econometrica 62 (1), 1–41. Nelson, D.B., Foster, D.P., 1995. Filtering and forecasting with misspecified ARCH models II Making the right forecast with the wrong model. Journal of Econometrics 67, 303–335. Patton, A.J., Timmermann, A., 2007a. Properties of optimal forecasts under asymmetric loss and nonlinearity. Journal of Econometrics 140 (2), 884–918. Patton, A.J., Timmermann, A., 2007b. Testing Forecast Optimality Under Unknown Loss. Journal of the American Statistical Association 102 (480), 1172–1184. Pitt, M.K., Shephard, N., 1999. Filtering Via Simulation: Auxiliary Particle Filters. Journal of the American Statistical Association 94 (446), 590–599. Stroud, J.R., Müller, P., Polson, N.G., 2003. Nonlinear State-Space Models with StateDependent Variances. Journal of the American Statistical Association 98 (462), 377–386. Tauchen, G., 2004. Recent developments in stochastic volatilility: statistical modelling and general equilibrium analysis, Working Paper, Duke University. Zhang, L., Mykland, P.A., Aït-Sahalia, Y., 2005. A Tale of Two Time Scales: Determining Integrated Volatility With Noisy High-Frequency Data. Journal of the American Statistical Association 100 (472), 1394–1411.
Journal of Econometrics 162 (2011) 312–325
Contents lists available at ScienceDirect
Journal of Econometrics journal homepage: www.elsevier.com/locate/jeconom
Modeling frailty-correlated defaults using many macroeconomic covariates Siem Jan Koopman a,b,∗ , André Lucas a,b,c , Bernd Schwaab d a
VU University Amsterdam, De Boelelaan 1105, 1081 HV Amsterdam, The Netherlands
b
Tinbergen Institute, The Netherlands
c
Duisenberg School of Finance, The Netherlands
d
European Central Bank, Directorate General Research, Germany
article
info
Article history: Received 15 December 2008 Received in revised form 15 January 2011 Accepted 3 February 2011 Available online 12 February 2011 JEL classification: G21 C33 Keywords: Systematic default risk Frailty-correlated defaults State space methods Credit risk management
abstract We propose a novel time series panel data framework for estimating and forecasting time-varying corporate default rates subject to observed and unobserved risk factors. In an empirical application for a U.S. dataset, we find a large and significant role for a dynamic frailty component even after controlling for more than 80% of the variation in more than 100 macro-financial covariates and other standard risk factors. We emphasize the need for a latent component to prevent a downward bias in estimated default rate volatility and in estimated probabilities of extreme default losses on portfolios of U.S. debt. The latent factor does not substitute for a single omitted macroeconomic variable. We argue that it captures different omitted effects at different times. We also provide empirical evidence that default and business cycle conditions partly depend on different processes. In an out-of-sample forecasting study for pointin-time default probabilities, we obtain mean absolute error reductions of more than forty percent when compared to models with observed risk factors only. The forecasts are relatively more accurate when default conditions diverge from aggregate macroeconomic conditions. © 2011 Elsevier B.V. All rights reserved.
1. Introduction Recent research indicates that observed macroeconomic variables and firm-level information are not sufficient to capture the large degree of default clustering in observed corporate default data. In an important study, Das et al. (2007) reject the joint hypothesis of (i) well-specified default intensities in terms of observed macroeconomic variables and firm-specific information and (ii) the conditional independence (or doubly stochastic default times) assumption. This is bad news for practitioners, since many textbook credit risk models build on conditional independence. Excess default clustering is often attributed to frailty and contagion. The frailty effect captures default dependence above and beyond what is already implied by observed macroeconomic and financial data. In the econometric literature frailty effects are usually modeled by an unobserved risk factor, see McNeil and Wendin (2007), Koopman et al. (2008), Koopman and Lucas (2008), Duffie et al. (2009), and Azizpour et al. (2010). When a model for non-Gaussian credit risk data contains dynamic latent components, the likelihood function is not available in closed form, and advanced econometric techniques based on
∗ Corresponding author at: VU University Amsterdam, De Boelelaan 1105, 1081 HV Amsterdam, The Netherlands. Tel.: +31 205986019. E-mail address:
[email protected] (S.J. Koopman). 0304-4076/$ – see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.jeconom.2011.02.003
simulation methods are often required. Contagion effects offer another explanation of excess default clustering. Contagion refers to the phenomenon that a defaulting firm can weaken other firms with which it has links, see, for example, Giesecke (2004), Lando and Nielsen (2008), and Giesecke and Kim (2010). Such business links are particularly relevant at the industry level through supply chain relationships, see Lang and Stulz (1992), Jorion and Zhang (2007), and Boissay and Gropp (2007). Empirically distinguishing contagion from frailty effects, however, is difficult given the limited datasets typically available. A statistical continuous-time framework is developed in Azizpour et al. (2010). In our current paper we develop an econometric framework for the measurement and forecasting of point-in-time default probabilities when excess default clustering is present. The underlying economic model allows for default correlations that originate from macroeconomic and financial conditions, frailty risk, and industry sector dynamics. The model is aimed to support credit risk management at financial institutions and stress tests at supervisory agencies. It may also have an impact on the assessment of systemic risk conditions at (macro-prudential) supervisory agencies such as the new European Systemic Risk Board for the European Union, and the Financial Stability Oversight Council for the United States. Time-varying default risk conditions contribute to overall financial systemic risk, and an assessment of the latter requires estimation of the former.
S.J. Koopman et al. / Journal of Econometrics 162 (2011) 312–325
We present three contributions to the econometric credit risk literature. First, we show how to combine a nonlinear nonGaussian panel data model for discrete default counts with an approximate dynamic factor model for macroeconomic time series data. We argue that the resulting framework inherits the best features from both strands of literature. A linear Gaussian factor model permits the use of large arrays of relevant predictor variables. Conversely, a non-Gaussian panel data model in state space form allows for unobserved frailty effects, easily accommodates the cross-sectional heterogeneity of firms, and routinely handles missing values that arise in count data at a disaggregated level. Parameter and factor estimation are achieved by adopting a maximum likelihood framework for multivariate non-Gaussian models in state space form, see Durbin and Koopman (2001, 1997) and Koopman and Lucas (2008). Our model for default counts is set in discrete-time rather than in continuous-time such as, for example, Duffie et al. (2009) and Azizpour et al. (2010). Although our model can be extended in such a direction, this is not pursued here. Our framework allows us to estimate a large dimensional model, accommodating more than 100 time series of disaggregated default counts and more than 100 macro-financial covariates, in only 20–90 min on a standard desktop PC. The computational speed and model tractability allow us to conduct repeated out-ofsample forecasting experiments, where parameters and factors are re-estimated based on expanding sets of data. As our second contribution, we conduct an empirical study of US default data from 1981Q1 to 2009Q4 and find a large and significant role for a dynamic frailty component after taking into account more than 80% of the variation from more than 100 macroeconomic and financial covariates, while controlling for standard measures of risk such as ratings, equity returns, and volatilities. The increase in likelihood from an unobserved component is large. Based on data including the recent financial crisis, and a different modeling framework and estimation methodology, we confirm and extend the findings of Duffie et al. (2009) who used a latent component to prevent a downward bias in the estimation of default rate volatility and extreme default losses on portfolios of US corporate debt. Our results indicate that the presence of a latent factor is not due to a few omitted macroeconomic covariates, but rather appears to capture different omitted effects at different times. In general, the default cycle and business cycle appear to depend on different processes. As a result, inference on the default cycle using observed risk factors only is at best suboptimal, and at worst systematically misleading. Third, we show that the three types of risk factors – common factors from observed data, a frailty factor, and simple observed industry-specific control factors – are all useful for out-of sample forecasting of default risk conditions. Reductions in the mean absolute forecasting errors are substantial, and far exceed the reductions achieved by standard models which use a limited set of observed covariates directly. For example, we find that mean absolute forecasting errors reduce about 43% through the cycle compared to a benchmark model with only observed macro-financial risk factors. Such reductions have clear practical implications for the computation of capital buffers, and for the stress testing of financial institutions’ loan books. 
Reductions in MAE are most pronounced when frailty effects are highest. Examples are the year 2002, when default rates remain high while the economy is out of recession. Also, in the period 2005–07 leading up to the 2007–09 financial crisis, default conditions are substantially more benign than what is implied by observed macro data. This paper proceeds as follows. In Section 2 we introduce the econometric framework which combines a nonlinear nonGaussian panel time series model with an approximate dynamic
313
factor model for many covariates. Section 3 shows how the proposed econometric model can be represented as a multifactor firm value model for dependent defaults. In Section 4 we discuss the estimation of the unknown parameters. Section 5 introduces the data for our empirical study, presents the major empirical findings, and discusses the out-of-sample forecasting results. Section 6 concludes. 2. The econometric framework In this section we present our reduced form econometric model for dependent defaults. We denote the default counts of cross section j at time t as yjt for j = 1, . . . , J and t = 1, . . . , T . Our modeling framework is different from, for example, Duffie et al. (2009) and Azizpour et al. (2010) since it adopts a statistical discrete-time approach rather than a continuous-time approach and it analyses default counts rather than individual default times. The index j refers to a specific combination of firm characteristics, such as industry sector, current rating class, and company age. Defaults can be cross-sectionally dependent through shared exposure to the business cycle, financing conditions, monetary and fiscal policy, and waves of optimistic or pessimistic sentiment. The macroeconomic impact at time t is summarized by exogenous factors in the R × 1 vector Ft . Other observed explanatory variables, such as trailing equity returns and respective volatilities are collected in a vector Ct . A frailty factor ftuc (where ‘uc’, refers to unobserved component) captures default clustering above and beyond what is implied by observed macro data. The roles of the covariates in Ft and Ct are different in our empirical analysis and are therefore distinguished in terms of notation. Variables in Ft are common to all firms, whereas variables in Ct may be specific to subsets of firms. Both Ft and Ct are treated as exogenous variables in our statistical analysis. The frailty component ftuc captures all default clustering that is not accounted for by the exogenous factors Ft and Ct . For example, it can also include contagion effects due to business links; see Azizpour et al. (2010). Observed and unobserved risk factors are collected in the factor path Ft = {f˜1 , f˜2 , . . . , f˜t }, where f˜t = (ftuc , Ft′ , Ct′ )′ . After conditioning on the factors, defaults yjt in cross section j are assumed to be generated as sums over independent Bernoulli trials with a common time-varying default probability πjt . We refer to Lando (2003, Chapter 9), McNeil et al. (2005, Chapter 8), McNeil and Wendin (2007), and CreditMetrics (2007) for a discussion of so-called binomial mixture models. The panel time series of defaults is given by yjt |Ft , Ct , ftuc ∼ Binomial(kjt , πjt ),
(1)
where yjt is the total number of default ‘successes’ from kjt exposures. In our model, kjt represents the number of firms in cell j that are active at the beginning of period t. We re-count observed exposures kjt at the beginning of each quarter. The observation density (1) implies that, conditional on the observed factors Ft and Ct and the unobserved dynamic frailty factor ftuc , the event counts yjt are independent over time and over the cross section (firms). The measurement and forecasting of the time-varying default probability πjt is our central focus. We specify πjt as the logistic transform of an index function θjt . Therefore θjt can be interpreted as the log-odds or logit transform of πjt . The time-varying default probabilities are specified by
πjt = (1 + e−θjt )−1 , θjt = λj + β
uc j ft
(2)
+ γj Ft + δj Ct , ′
′
(3)
where λj is a fixed effect for the jth cross section and the coefficient vectors βj , γj , and δj capture risk factor sensitivities that may depend on firm characteristics such as industry sector and
314
S.J. Koopman et al. / Journal of Econometrics 162 (2011) 312–325
rating class. The default signals θjt do not contain idiosyncratic error terms. Instead, idiosyncratic randomness is captured in (1). The log-odds of πjt may vary over time due to variation in the macroeconomic factors, Ft , other observed covariates, Ct , and the frailty component, ftuc . The frailty factor ftuc is modeled as a stationary autoregressive process of order one, ftuc = φ ftuc −1 +
1 − φ 2 ηt ,
ηt ∼ NID(0, 1), t = 1, . . . , T ,
(4)
where 0 < φ < 1 and where ηt is a serially uncorrelated, standardized Gaussian disturbance. We therefore have E(ftuc ) = 0, h Var(ftuc ) = 1, and Cor(ftuc , ftuc −h ) = φ . This specification identifies βj in (3) as the frailty factor volatility, or standard deviation. Extensions to multiple unobserved factors and to other dynamic specifications for ftuc are possible. Modeling the dependence of firm defaults on observed macrofinancial variables is an active area of current research, see Duffie et al. (2009), Azizpour et al. (2010), and the references therein. The number of macroeconomic variables in the model differs across studies but is usually small. Instead of opting for a specific selection in our study, we collect a large number of macroeconomic and financial variables denoted by xnt for n = 1, . . . , N. The panel is assumed to adhere to a factor structure as given by xnt = Λn Ft + ζnt ,
n = 1, . . . , N ,
(5)
where Ft is a vector of principal components, Λn is a row vector of loadings, and ζnt is an idiosyncratic disturbance term. The static factor representation of the approximate dynamic factor model (5) can be derived from a dynamic model specification, see Stock and Watson (2002a). The methodology of relating observed variables to a small set of factors has been employed in the forecasting of inflation and production data, see Massimiliano et al. (2003), asset returns and volatilities, see e.g. Ludvigson and Ng (2007), and the term structure of interest rates, see Exterkate et al. (2010). These studies have reported favorable results when such macro factors are used for forecasting. The factors Ft can be estimated consistently using the method of principal components. This method is expedient for at least three reasons. First, dimensionality problems do not occur even for large values of both N and T . This is particularly relevant for our empirical application, where T , N > 100 in both the macro and default datasets. Second, the method can be easily extended to account for missing observations which are present in many macro-economic time series panels. Finally, the extracted factors can be used for the forecasting of particular time series in the panel, see Forni et al. (2005). Eqs. (1)–(5) combine the approximate dynamic factor model with a non-Gaussian panel data model by inserting the elements of Ft from (5) into the signal equation (3). 3. The financial framework By relating the econometric model with the multi-factor model of CreditMetrics (2007) for dependent defaults, we can establish an economic interpretation of the parameters. In addition, we gain more intuition for the mechanisms of the model. Multi-factor models for firm default risk are widely used in risk management practice, see Lando (2003, Chapter 9). In the special case of a standard static one-factor credit risk model for dependent defaults the values of the obligors’ assets, Vi , are driven by a common random factor F , and an idiosyncratic disturbance ϵi . More specifically, the asset value of firm i, Vi , is modeled by Vi =
√ ρi f + 1 − ρi ϵi ,
where scalar 0 < ρi < 1 weights the dependence of firm i on the general economic condition factor f in relation to the idiosyncratic
factor ϵi , for i = 1, . . . , K , where K is the number of firms, and where (f , ϵi )′ has mean zero and variance matrix I2 . The conditions in this framework imply that E(Vi ) = 0,
Var(Vi ) = 1,
Cor(Vi Vj ) =
√ ρi ρj ,
for i, j = 1, . . . , K . In our multivariate dynamic model, the framework is extended into a more elaborate version for the asset value Vit of firm i at time t and is given by ′ ′ Vit = ωi0 ftuc + ωi1 Ft + ωi2 Ct
′ ′ + 1 − (ωi0 )2 − ωi1 ωi1 − ωi2 ωi2 ϵit = ωi′ f˜t + 1 − ωi′ ωi ϵit , t = 1, . . . , T ,
(6)
factor ftuc ,
where frailty macro factors Ft and firm/industry-specific covariates Ct have been introduced in (1), the associating weight vectors ωi0 , ωi1 , and ωi2 have appropriate dimensions, the factors and covariates are collected in Ft = {f˜1 , f˜2 , . . . , f˜t }, where f˜t = ′ (ftuc , Ft′ , Ct′ )′ , and all weight vectors are collected in ωi = (ωi0 , ωi1 , ′ ′ ωi2 ) with condition ωi′ ωi ≤ 1. The idiosyncratic standard normal disturbance ϵit is serially uncorrelated for t = 1, . . . , T . The unobserved component or frailty factor ftuc represents the credit cycle condition after controlling for the first M macro factors F1,t , . . . , FM ,t and the common variation in the covariates Ct . In other words, the frailty factor captures deviations of the default cycle from systematic macro-financial conditions. Without loss of generality we assume that all risk factors have zero mean and unit variance. Furthermore, we assume that the risk factors ftuc and Ft are uncorrelated with each other at all times. In a firm value model, firm i defaults at time t when its asset value Vit drops below some threshold ci , see Merton (1974) and Black and Cox (1976). In our framework, Vit is driven by systematic observed and unobserved factors as in (6). In our empirical specification, the threshold ci depends on the current rating class, the industry sector, and the time elapsed since the initial rating assignment. For firms which have not defaulted yet, a default occurs when Vit < ci or, as implied by (6), when ci − ωi′ f˜t
ϵit <
1 − ωi′ ωi
,
for a given value of f˜t . The time-varying conditional probability of a default is then given by
πit = Pr ϵit < Ft . 1 − ωi′ ωi
ci − ωi′ f˜t
(7)
Favorable credit cycle conditions are associated with a high value of ωi′ f˜t and therefore with a low default probability πit for firm i. Furthermore, Eq. (7) can be related directly to the econometric model specification in (2) and (3) where the firms (i = 1, . . . , I ) are pooled into homogenous groups (j = 1, . . . , J ) according to rating class, industry sector, and time from initial rating assignment. In particular, if ϵit is logistically distributed, we obtain ci = λj
1 − aj ,
ωi0 = −βj 1 − aj , ωi1 = −γj 1 − aj , ωi2 = −δj 1 − aj ,
where aj = (βj2 + γj′ γj + δj′ δj )/(1 + βj2 + γj′ γj + δj′ δj ) for firm i that belongs to group j. We can regard index j as a function of index i, that is j = j(i). The coefficient vectors λj , βj , and γj are defined below (2) and (3). The parameters have therefore a direct interpretation in widely used portfolio credit risk models such as CreditMetrics (2007). 4. Estimation using state space methods We next discuss parameter estimation and signal extraction of the factors for models (1)–(5). The estimation procedure for
S.J. Koopman et al. / Journal of Econometrics 162 (2011) 312–325
the macro factors is discussed in Section 4.1. The state space representation of the econometric model is provided in Section 4.2. We first estimate the parameters using a computationally efficient procedure for Monte Carlo maximum likelihood and then extract the frailty factor using a similar Monte Carlo method. A brief outline of these procedures is given in Section 4.3. All computations are implemented using the Ox programming language and the associated set of state space routines from SsfPack, see Doornik (2007) and Koopman et al. (2008). 4.1. Estimation of the macro factors The common factors Ft from the macro data are estimated by minimizing the objective function given by min V (F , Λ) = (NT )−1
{F ,Λ}
T − (Xt − ΛFt )′ (Xt − ΛFt ),
(8)
t =1
where the N × 1 vector Xt = (x1t , . . . , xNT )′ contains macroeconomic variables and F is the set F = {F1 , . . . , FT } for the R × 1 vector Ft . The observed stationary time series xnt are demeaned and standardized to have unit unconditional variance for n = 1, . . . , N. Concentrating out F and rearranging terms shows that (8) is equivalent to maximizing tr Λ′ SX ′ X Λ with respect to Λ and subject to ∑ Λ′ Λ = IR , where SX ′ X = T −1 t Xt Xt′ is the sample covariance matrix of the data, see Lawley and Maxwell (1971) and Stock and Watson (2002a). The resulting principal components estimator of ˆ , where Λ ˆ collects the normalized eigenvecFt is given by Fˆt = Xt′ Λ tors associated with the R largest eigenvalues of SX ′ X . When the variables in Xt are not completely observed for t = 1, . . . , T , we employ the Expectation Maximization (EM) procedure as devised in the Appendix of Stock and Watson (2002b). This iterative procedure takes a simple form under the assumption that xnt ∼ NID(Λn Ft , 1), where Λn denotes the nth row of Λ for n = 1, . . . , N. Here, V (F , Λ) in (8) is a linear function of the loglikelihood L(F , Λ|X m ) where X m denotes the missing parts of the dataset X1 , . . . , XT . Since V (F , Λ) is proportional to −L(F , Λ|X m ), the minimizers of V (F , Λ) are also the maximizers of L(F , Λ|X m ). This result is exploited in the EM algorithm of Stock and Watson (2002b) that we have adopted to compute Fˆt for t = 1, . . . , T . 4.2. The factor model in state space form We can formulate models (1)–(4) in state space form where Ft and Ct are treated as explanatory variables. In our implementation, Ft will be replaced by Fˆt as obtained from the previous section. The estimation framework can therefore be characterized as a two-step procedure. We first estimate the principal components to summarize the variation in macroeconomic data. We then concentrate on the estimation of the frailty factor ftuc by jointly considering Fˆt and Ct . In this way we establish a computationally feasible and relatively simple procedure. In Section 4.4 we present simulation evidence to illustrate the adequacy of our approach for parameter estimation and for uncovering the factors from the data. The Binomial log-density function of model (1) is given by
πjt log p(yjt |πjt ) = yjt log 1 − πjt kjt + log ,
yjt
+ kjt log(1 − πjt )
315
in terms of the log-odds ratio θjt = log(πjt ) − log(1 − πjt ) given by θjt
log p(yjt |θjt ) = yjt θjt + kjt log(1 + e ) + log
kjt
yjt
.
(10)
The log-odds ratio in (3) can be specified as
θjt = Zjt αt ,
Zjt = (e′j , Fˆt′ ⊗ e′j , Ct′ ⊗ e′j , βj ),
(11)
where ej denotes the jth column of the identity matrix of dimension J, the state vector αt = (λ1 , . . . , λJ , γ1,1 , . . . , γR,J , δ1′ , . . . , δJ′ , ftuc )′ consists of the fixed effects λj , the loadings γr ,j and δj together with the unobserved component ftuc . The system vector Zjt is timevarying due to the inclusion of Fˆt and Ct . The state vector αt contains all unknown coefficients that are linear in the signals θjt . The transition equation provides a model for the evolution of the state vector αt over time and is given by
αt +1 = T αt + Q ξt ,
ηt ∼ NID(0, 1),
(12)
with transition matrix T = diag(I , φ) and with vector Q = (0, . . . , 0, 1 − φ 2 )′ . All initial elements of the state vector are subject to diffuse initial conditions except for ftuc , which has zero mean and unit variance. The Eqs. (10) and (12) belong to a class of non-Gaussian state space models as discussed in Durbin and Koopman (2001, Part II) and Koopman and Lucas (2008). In our formulation, most unknown coefficients are part of the state vector αt and are estimated as part of the filtering and smoothing procedures described in Section 4.3. This formulation leads to a considerable increase in the computational efficiency of our estimation procedure. The remaining parameters are collected in a coefficient vector ψ = (φ, β1 , . . . , βJ )′ and are estimated by the Monte Carlo maximum likelihood methods that we will discuss next. 4.3. Parameter estimation and signal extraction Parameter estimation for a non-Gaussian model in state space form can be carried out by the method of Monte Carlo maximum likelihood. Once we have obtained an estimate of ψ , we can compute the conditional mean and variance estimates of the state vector αt . In both cases we make use of importance sampling methods. The details of our implementation are given next. For notational convenience we suppress the dependence of the density p(y; ψ) on ψ . The likelihood function of our models (1)–(4) can be expressed by p(y) =
∫ ∫
=
p(y, θ )dθ = p(y|θ )
∫
p(θ ) g (θ|y)
p(y|θ )p(θ )dθ
[
g (θ|y)dθ = Eg p(y|θ )
p(θ ) g (θ|y)
]
,
(13)
where y = (y_{11}, y_{21}, ..., y_{JT})′, θ = (θ_{11}, θ_{21}, ..., θ_{JT})′, p(·) is a density function, p(·,·) is a joint density, p(·|·) is a conditional density, g(θ|y) is a Gaussian importance density, and E_g denotes expectation with respect to g(θ|y). The importance density g(θ|y) is constructed as the Laplace approximation to the intractable density p(θ|y): both densities have the same mode and curvature at the mode; see Durbin and Koopman (2001) for details. Conditional on θ, we can evaluate p(y|θ) by

\[
p(y|\theta) = \prod_{j,t} p(y_{jt}|\theta_{jt}).
\]
It follows from (3) and (4) that the marginal density p(θ) is Gaussian, and therefore p(θ) = g(θ). Since g(θ|y)g(y) ≡ g(y|θ)g(θ), we obtain

\[
p(y) = \mathrm{E}_g\!\left[p(y|\theta)\,\frac{p(\theta)}{g(\theta|y)}\right]
= \mathrm{E}_g\!\left[\frac{p(y|\theta)}{g(y|\theta)}\,g(y)\,\frac{p(\theta)}{g(\theta)}\right]
= g(y)\,\mathrm{E}_g\!\left[w(y,\theta)\right], \tag{14}
\]
where w(y, θ) = p(y|θ)/g(y|θ). A Monte Carlo estimator of p(y) is therefore given by p̂(y) = g(y) w̄, with

\[
\bar w = M^{-1}\sum_{m=1}^{M} w^m = M^{-1}\sum_{m=1}^{M}\frac{p(y|\theta^m)}{g(y|\theta^m)},
\]

where w^m = w(θ^m, y) is the value of the importance weight associated with the mth draw θ^m from g(θ|y), and M is the number of Monte Carlo draws. The Gaussian importance density g(θ|y) is chosen for convenience and because a large number of draws θ^m can be generated from it in a computationally efficient manner using the simulation smoothing algorithm of de Jong and Shephard (1995) or Durbin and Koopman (2002). We estimate the log-likelihood as log p̂(y) = log ĝ(y) + log w̄, and include a bias correction term as discussed in Durbin and Koopman (1997).

The Gaussian importance density g(θ|y) is based on the approximating Gaussian model given by

\[
y_{jt} = c_{jt} + \theta_{jt} + u_{jt}, \qquad u_{jt}\sim \text{NID}(0, d_{jt}), \tag{15}
\]

where the disturbances u_jt are mutually and serially uncorrelated, for j = 1, ..., J and t = 1, ..., T. The unknown constant c_jt and variance d_jt are determined by individually matching the first and second derivatives of log p(y_jt|θ_jt) in (10) and of

\[
\log g(y_{jt}|\theta_{jt}) = -\tfrac12\log 2\pi - \tfrac12\log d_{jt} - \tfrac12 d_{jt}^{-1}(y_{jt}-c_{jt}-\theta_{jt})^2
\]

with respect to the signal θ_jt. The matching equations for c_jt and d_jt rely on θ_jt for each j, t. For an initial value of θ_jt, we compute c_jt and d_jt for all j, t. The Kalman filter and smoother then compute estimates of the signal θ_jt based on the linear Gaussian state space models (15), (11) and (12). We compute new values for c_jt and d_jt based on the new signal estimates of θ_jt, and repeat the computations for each new estimate of θ_jt. The iterations proceed until convergence, that is, until the estimates of θ_jt no longer change; typically as few as 5–10 iterations are needed. When convergence has taken place, the Kalman filter and smoother applied to the approximating model (15) compute the mode estimate of log p(θ|y); see Durbin and Koopman (1997) for further details. A new approximating model needs to be constructed for each log-likelihood evaluation when the value of the parameter vector ψ has changed. Finally, standard errors for the parameters in ψ are constructed from the numerical second derivatives of the log-likelihood function, that is,

\[
\hat\Sigma = -\left[\frac{\partial^2 \log p(y)}{\partial\psi\,\partial\psi'}\right]^{-1}_{\psi=\hat\psi}.
\]

For the estimation of the latent factor f_t^uc and the fixed coefficients in the state vector, we estimate the conditional mean of α by

\[
\bar\alpha = \mathrm{E}[\alpha|y] = \int \alpha\, p(\alpha|y)\,\mathrm{d}\alpha
= \int \alpha\,\frac{p(\alpha|y)}{g(\alpha|y)}\,g(\alpha|y)\,\mathrm{d}\alpha
= \mathrm{E}_g\!\left[\alpha\,\frac{p(\alpha|y)}{g(\alpha|y)}\right].
\]

In a similar way as the development in (14), we obtain

\[
\bar\alpha = \frac{\mathrm{E}_g[\alpha\, w(\theta,y)]}{\mathrm{E}_g[w(\theta,y)]},
\]

since p(α) = g(α), p(y|α) = p(y|θ) and g(y|α) = g(y|θ). The Monte Carlo estimator for ᾱ is then given by

\[
\hat{\bar\alpha} = \left(\sum_{m=1}^{M} w^m\right)^{-1} \sum_{m=1}^{M} \alpha^m w^m,
\]

where α^m = (α_{11}, ..., α_{JT})′ is the mth draw from g(α|y) and θ^m is computed using (11), that is, θ_{jt}^m = Z_{jt}α_{jt}^m for j = 1, ..., J and t = 1, ..., T. The associated conditional variances are given by

\[
\mathrm{Var}[\alpha_{jt}|y] = \left(\sum_{m=1}^{M} w^m\right)^{-1} \sum_{m=1}^{M} (\alpha_{jt}^m)^2 w^m - \hat{\bar\alpha}_{jt}^2,
\]

and allow the construction of standard error bands. In our empirical study we also present mode estimates for signal extraction and out-of-sample forecasting of default probabilities in (3). The mode estimates of α_jt are obtained by the Kalman filter and smoother applied to the state space models (15), (11) and (12), where c_jt and d_jt are computed using the mode estimate of θ_jt. Finally, the mode estimate of π = π(θ) is given by π̄ = π(θ̄) for any nonlinear function π(·) that is known and has continuous support. We refer to Durbin and Koopman (2001, Chapter 11) for further details.
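The following sketch illustrates the two computational ingredients just described: the derivative matching that delivers c_jt and d_jt for the approximating model (15), and the importance-sampling average w̄ behind the log-likelihood estimate. Function names are ours, and the draws θ^m below are random placeholders standing in for output of a simulation smoother (de Jong and Shephard, 1995), which we do not reproduce.

```python
import numpy as np
from scipy.special import expit, gammaln

def match_cd(y, k, theta_tilde):
    """Match first and second derivatives of the Binomial log-density (10)
    and the Gaussian log-density of model (15) at the signal theta_tilde."""
    pi = expit(theta_tilde)                 # default probability
    d = 1.0 / (k * pi * (1.0 - pi))         # d_jt = -1 / (second derivative)
    c = y - theta_tilde - d * (y - k * pi)  # c_jt from the first derivative
    return c, d

def log_weight(y, k, theta_m, c, d):
    """log w^m = log p(y | theta^m) - log g(y | theta^m), summed over (j, t)."""
    logp = (y * theta_m - k * np.logaddexp(0.0, theta_m)
            + gammaln(k + 1) - gammaln(y + 1) - gammaln(k - y + 1))
    logg = -0.5 * (np.log(2.0 * np.pi * d) + (y - c - theta_m) ** 2 / d)
    return np.sum(logp - logg)

# Placeholder draws for illustration; in practice theta_m is simulated from
# the Gaussian importance density g(theta | y) of the approximating model.
rng = np.random.default_rng(0)
y, k = np.array([2.0, 0.0]), np.array([40.0, 25.0])
c, d = match_cd(y, k, theta_tilde=np.array([-3.0, -3.0]))
logw = np.array([log_weight(y, k, rng.normal(-3.0, 0.3, 2), c, d)
                 for _ in range(500)])
log_wbar = logw.max() + np.log(np.mean(np.exp(logw - logw.max())))
# log p_hat(y) = log g_hat(y) + log_wbar, before the bias correction above.
```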
4.4. Simulation experiments

In this subsection we investigate whether the econometric methods of Sections 4.1 and 4.3 can distinguish default rate volatility due to changes in the macroeconomic environment from changes in unobserved frailty risk. The first source is captured by the principal components F_t, while the second source is estimated via the unobserved factor f_t^uc. This exercise is important since estimation by Monte Carlo maximum likelihood should not be biased towards attributing variation to a latent component when it is in fact due to an exogenous covariate. For this purpose we carry out a simulation study that is close to our empirical application in Section 5. The variables are generated by the equations

\[
F_t = \Phi_F F_{t-1} + u_{F,t}, \qquad u_{F,t}\sim \mathrm{N}(0,\, I - \Phi_F\Phi_F'),
\]
\[
X_t = \Lambda F_t + e_t, \qquad e_t = \Phi_I e_{t-1} + u_{I,t}, \qquad u_{I,t}\sim \mathrm{N}(0,\, I - \Phi_I\Phi_I'),
\]
\[
f_t^{uc} = \phi_{uc} f_{t-1}^{uc} + u_{f,t}, \qquad u_{f,t}\sim \mathrm{N}(0,\, 1 - \phi_{uc}^2),
\]

where φ_uc and the elements of the matrices Φ_F, Φ_I, and Λ are generated for each simulated dataset from the uniform distribution U[·,·], that is, φ_uc ∼ U[0.6, 0.8], Φ_F(i,j) ∼ U[0.6, 0.8], Φ_I(i,j) ∼ U[0.2, 0.4], and Λ(i,j) ∼ U[0, 2], where A(i,j) is the (i,j)th element of matrix A = Φ_F, Φ_I, Λ. For computational convenience we consider F_t to be a scalar process (R = 1) and we include no firm-specific covariates (C_t = 0). The default counts y_jt in pooling group j are generated by the equations

\[
\theta_{jt} = \lambda_j + \beta f_t^{uc} + \gamma F_t, \qquad
y_{jt} \sim \text{Binomial}\!\left(k_{jt},\, (1+\exp[-\theta_{jt}])^{-1}\right),
\]

where f_t^uc and F_t represent their simulated values, and the exposure counts k_jt come from the dataset that is explored in the next section. The parameters λ_j, β, γ are chosen similar to their maximum likelihood values reported in Section 5. Simulation results are based on 1000 simulations. Each simulation uses M = 50 importance samples during simulated maximum likelihood estimation, and M = 500 importance samples for signal extraction; see Section 4.3.
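As a concrete illustration, this sketch generates one dataset from the design above in the scalar-factor case (R = 1). Dimensions, fixed effects, loadings, and exposure counts are placeholders chosen only for illustration; in the paper the exposures k_jt come from the Moody's data of Section 5.

```python
import numpy as np

rng = np.random.default_rng(42)
T, N, J = 100, 120, 112               # periods, macro series, pooling groups

# Draw the design parameters for this replication.
phi_uc = rng.uniform(0.6, 0.8)
Phi_F = rng.uniform(0.6, 0.8)         # scalar factor: R = 1
Phi_I = np.diag(rng.uniform(0.2, 0.4, N))
Lam = rng.uniform(0.0, 2.0, N)

# Simulate the factor, the idiosyncratic AR(1) terms, and the macro panel.
F = np.zeros(T); f_uc = np.zeros(T); e = np.zeros((T, N))
for t in range(1, T):
    F[t] = Phi_F * F[t-1] + rng.normal(0.0, np.sqrt(1 - Phi_F**2))
    f_uc[t] = phi_uc * f_uc[t-1] + rng.normal(0.0, np.sqrt(1 - phi_uc**2))
    e[t] = Phi_I @ e[t-1] + rng.multivariate_normal(
        np.zeros(N), np.eye(N) - Phi_I @ Phi_I.T)
X = F[:, None] * Lam[None, :] + e

# Default counts: theta_jt = lambda_j + beta * f_uc_t + gamma * F_t.
lam_j = rng.uniform(-4.0, -2.0, J)    # placeholder fixed effects
beta, gamma = 0.5, 1.4                # placeholder loadings (cf. Table 2)
k = rng.integers(5, 50, size=(T, J))  # placeholder exposure counts
theta = lam_j[None, :] + beta * f_uc[:, None] + gamma * F[:, None]
y = rng.binomial(k, 1.0 / (1.0 + np.exp(-theta)))
```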
A selection of the graphical output from our Monte Carlo study is presented in Fig. 1. We find that the principal components estimate F̂ captures the factor space F well: the goodness-of-fit statistic R² is 0.94 on average. The conditional mean estimate of f_t^uc is close to the simulated unobserved factor, with an average R² of 0.73. The sampling distributions of φ_uc and λ_0 appear roughly symmetric and Gaussian, while the distributions of the factor sensitivities β_0 and γ_1 appear skewed to the right.
Fig. 1. Simulation analysis. The top panels contain the sampling distributions of R-squared goodness-of-fit statistics in regressions of Fˆ on simulated factors F , and conditional mean estimates Eˆ [f uc |y] on the true factor f uc , respectively. The middle panels present the sampling distributions of key parameters φuc , β, λ0 , and γ1 . The bottom panel compares two empirical distribution functions of the t-statistics associated with the null hypothesis H0 : γ1 = 0. In each simulation either F or Fˆ are used to obtain simulated maximum likelihood parameter and standard error estimates. All distribution plots are based on 1000 simulations. The dimensions of the default panel are N = 112, and T = 100. The macro panel is of dimension N = 120, and T = 100.
This is consistent with their interpretation as factor standard deviations. The distributions of φ_uc, β_0, λ_0, and γ_1 are all centered around their true values. We conclude that our modeling framework enables us to discriminate between the possible sources of default rate variation: the resulting parameter estimates are accurate overall, for both ψ and the state vector α.

Finally, the standard errors for the estimated factor loadings γ do not take into account that the principal components are estimated with some error in a first step. We therefore need to investigate whether this impairs inference on these factor loadings. In each simulation we estimate parameters and associated standard errors using the true factors F_t as well as their principal components estimates F̂_t. The bottom panel in Fig. 1 plots the empirical distribution functions of the t-statistics associated with testing the null hypothesis H_0: γ_1 = 0 when either F_t or F̂_t is used. The t-statistics are very similar in both cases, and other standard errors are similarly unaffected. We conclude that the substitution of F̂_t for F_t has negligible effects on parameter estimation.

5. Estimation results and forecasting accuracy

We first describe the macroeconomic, financial, and firm default data used in our empirical study. We then discuss our main findings from the study. We conclude with the discussion of out-of-sample forecasting results for time-varying default probabilities.
5.1. Data

We use data from two main sources. First, a panel of more than 100 macroeconomic and financial time series is constructed from the Federal Reserve Economic Database FRED (http://research.stlouisfed.org/fred2). The aim is to select series which contain information about systematic credit risk conditions. The variables are grouped into five broad categories: (a) bank lending conditions, (b) macroeconomic and business cycle indicators, including labor market conditions and monetary policy indicators, (c) open economy macroeconomic indicators, (d) micro-level business conditions such as wage rates, cost of capital, and cost of resources, and (e) stock market returns and volatilities. The macro variables are quarterly time series from 1970Q1 to 2009Q4. Table 1 presents a listing of the series for each category. The macroeconomic panel contains both current information indicators (real GDP, industrial production, unemployment rate) and forward-looking variables (stock prices, interest rates, credit spreads, commodity prices).

A second dataset is constructed from Moody's default data. The database contains rating transition histories and default dates for all rated firms from 1981Q1 to 2009Q4. These data contain the information needed to determine quarterly values for y_jt and k_jt in (1). The database distinguishes 12 industries which we pool into D = 7 industry groups: banks and financials (fin); transport and aviation (tra); hotels, leisure, and media (lei); utilities and energy (egy); industrials (ind); technology and telecom (tec); and retailing and consumer goods (rcg).
Table 1
Macroeconomic and financial predictor variables. Per subcategory we give a summary listing of the series and the number of series in parentheses.

(a) Bank lending conditions
  Size of overall lending (12): Total commercial loans; Total real estate loans; Total consumer credit outst.; Commercial & industrial loans; Bank loans and investments; Household obligations/income; Household debt/income-ratio; Federal debt of non-fin. sector; Excess reserves of dep. institutions; Total borrowings from fed reserve; Household debt service payments; Total loans and leases, all banks.
  Extent of problematic banking business (7): Non-performing loans ratio; Net loan losses; Return on bank equity; Non-perf. commercial loans; Non-performing total loans; Total net loan charge-offs; Loan loss reserves.

(b) Macro and BC conditions
  General macro indicators (14): Real GDP; Industr. production index; Private fixed investments; National income; Manuf. sector output; Manuf. sector productivity; Government expenditure; ISM manufacturing index; Uni Michigan consumer sentiment; Real disposable personal income; Personal income; Consumption expenditure; Expenditure durable goods; Gross private domestic investment.
  Labor market conditions (6): Unemployment rate; Weekly hours worked; Employment/population-ratio; Total no. unemployed; Civilian employment; Unemployed, more than 15 weeks.
  Business cycle leading/coinciding indicators (14): New orders: durable goods; New orders: capital goods; Capacity util. manufacturing; Capacity util. total industry; Light weight vehicle sales; Housing starts; New building permits; Final sales of dom. product; Inventory/sales-ratio; Change in private inventories; Inventories: total business; Non-farm housing starts; New houses sold; Final sales to domestic buyers.
  Monetary policy indicators (8): M2 money stock; UMich infl. expectations; Personal savings; Gross saving; CPI: all items less food; CPI: energy index; Personal savings rate; GDP deflator, implicit.
  Firm profitability (4): Corp. profits; Net corporate dividends; After tax earnings; Corporate net cash flow.

(c) Intern'l competitiveness
  Terms of trade (2): Trade weighted USD; FX index major trading partners.
  Balance of payments (4): Current account balance; Balance on merchandise trade; Real exports goods, services; Real imports goods & services.

(d) Micro-level conditions
  Labor cost/wages (10): Unit labor cost: manufacturing; Total wages & salaries; Management salaries; Technical services wages; Employee compensation index; Unit labor cost: nonfarm business; Non-durable manufacturing wages; Durable manufacturing wages; Employment cost index: benefits; Employment cost index: wages & salaries.
  Cost of capital (10): 1 month commercial paper rate; 3 month commercial paper rate; Effective federal funds rate; AAA corporate bond yield; BAA corporate bond yield; Treasury bond yield, 10 years; Term structure spread; Corporate yield spread; 30 year mortgage rate; Bank prime loan rate.
  Cost of resources (6): PPI all commodities; PPI interm. energy goods; PPI finished goods; PPI industrial commodities; PPI crude energy materials; PPI intermediate materials.

(e) Equity market conditions
  Equity indexes and respective volatilities (10): S&P 500; Nasdaq 100; S&P small cap index; Dow Jones industrial average; Russell 2000.

Total: 107 series.
We further consider four age cohorts: less than 3, 3–6, 6–12, and more than 12 years from the time of the initial rating assignment. Age cohorts are included since default probabilities may depend on the age of a company; a proxy for age is the time since the initial rating was established. Finally, there are four rating groups: an investment grade group Aaa–Baa and three speculative grade groups Ba, B, and Caa–C. Pooling over investment grade firms is necessary since defaults are rare in this segment. In total we distinguish J = 7 × 4 × 4 = 112 different groups.

In the process of counting exposures and defaults, a previous rating withdrawal is ignored if it is followed by a later default.
Fig. 2. Aggregated default data and disaggregated default frequencies. The first three panels present time series plots of (a) the total default counts Σ_j y_jt aggregated to a univariate series, (b) the total number of firms Σ_j k_jt in the database, and (c) the aggregate default fractions Σ_j y_jt / Σ_j k_jt over time. The bottom panel presents disaggregated default frequencies y_jt/k_jt over time for four of the broad rating groups Aaa–Baa, Ba, B, and Caa–C. Each plot contains multiple default frequencies over time, disaggregated across industries and time from initial rating assignment.
If there are multiple defaults per firm, we consider only the first event. In addition, we exclude defaults that are due to a parent–subsidiary relationship. Such defaults typically share the same default date, resolution date, and legal bankruptcy date in the database. Inspection of the default history and parent number confirms the exclusion of these cases.

Aggregate default counts, exposure counts, and the corresponding fractions are presented in the top panel of Fig. 2. We observe pronounced default clustering around the recession years of 1991 and 2001 and the financial crisis of 2007–09. Since defaults cluster due to high levels of latent systematic risk, it follows that systematic risk is serially correlated and may also account for the autocorrelation
in aggregate defaults. Defaults may already rise before the onset of a recession, for example in the years 1990 and 2000, and they may remain elevated as the economy recovers from recession, for example in the year 2002. The bottom panel of Fig. 2 presents disaggregated default fractions for four broad rating groups. Default clustering is visible for all rating groups.

Our proposed model considers groups of homogeneous firms rather than individual firms. As a result it is not straightforward to include firm-specific information beyond rating classes and industry sectors. Firm-specific covariates such as equity returns, volatilities and leverage are found to be important in Vassalou and Xing (2004), Duffie et al. (2007), and Duffie et al. (2009). Ratings alone may not be sufficient statistics for future default. To accommodate this concern, our set of explanatory covariates is
Fig. 3. Macro factors from unbalanced data. We present the first ten principal components from our unbalanced panel of macro and financial time series as listed in Table 1. Shaded areas indicate NBER recession periods.
extended with average measures of firm-specific variables across firms in the same industry groups. We use S&P industry-level equity index data from Datastream to construct trailing equity return and spot volatility measures at the industry level. The equity volatilities are constructed as realized variance estimates based on average squared returns over the past year. As a robustness check, we also follow Das et al. (2007) and Duffie et al. (2009) by including the trailing 1-year return of the S&P 500 stock index, an S&P 500 spot volatility measure, and the 3-month T-bill rate from Datastream. These observed risk factors are then treated in the same way as the principal components from the macroeconomic dataset.

5.2. Observed macro factors and industry controls

Fig. 3 presents the ten principal components obtained from the macro panel of Table 1 and computed by the EM procedure of Section 4.1. The NBER recession dates are depicted as shaded areas. The estimated first factor from the macroeconomic and financial panel is mainly associated with production and employment data; it accounts for a large share, 24%, of the total variation in the panel. The first factor exhibits clear peaks around US business cycle troughs. Overall, we select R = 10 factors, which together capture 82% of the variation in the panel.

Intra-industry serial correlation may remain present in the data even after conditioning on macro and frailty factors that are common to all default data. Such leftover variation could be exploited for forecasting. We therefore regress trailing one-year default rates at the industry level on a constant and the trailing one-year aggregate default rate (to concentrate out macro and frailty effects that are common to all data). Observed industry factors are then obtained as the resulting standardized residuals and included as controls in the log-odds equation (3) for all models.
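A minimal sketch of this construction (names ours): regress each industry's trailing one-year default rate on a constant and the trailing aggregate rate, and keep the standardized residuals as the observed industry controls.

```python
import numpy as np

def industry_controls(ind_rate, agg_rate):
    """Regress each industry's trailing one-year default rate (T x D) on a
    constant and the trailing aggregate default rate (T,); return the
    standardized residuals as observed industry factors C_t."""
    T, D = ind_rate.shape
    Z = np.column_stack([np.ones(T), agg_rate])   # regressors: const, aggregate
    C = np.empty((T, D))
    for d in range(D):
        coef, *_ = np.linalg.lstsq(Z, ind_rate[:, d], rcond=None)
        resid = ind_rate[:, d] - Z @ coef
        C[:, d] = (resid - resid.mean()) / resid.std()
    return C
```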
5.3. Model specification

The model specification for the default counts of our J = 112 groups is as follows. Each individual time series of counts is modeled as a Binomial sequence with log-odds ratio θ_jt as given by (3) or (11), where the scalar coefficient λ_j is a fixed effect, the scalar β_j pertains to the frailty factor, the vector γ_j to the principal components, and the vector δ_j to the observed industry control variables, for j = 1, ..., J. The model includes the ten principal components that capture 82% of the variation in the 107 macro-financial predictor variables; equity returns and volatilities at the industry level; industry-specific factors; and the firm-specific ratings, industry group, and age cohorts. Since the cross-section is high-dimensional, we follow Koopman and Lucas (2008) in reducing the number of parameters by restricting the coefficients to the following additive structure

\[
\bar\chi_j = \chi_0 + \chi_{1,d_j} + \chi_{2,a_j} + \chi_{3,s_j}, \qquad \bar\chi = \lambda, \beta, \gamma, \delta, \tag{16}
\]

where χ_0 represents the baseline effect, χ_{1,d} is the industry-specific deviation, χ_{2,a} is the deviation related to age, and χ_{3,s} is the deviation related to rating group. The deviations of all seven industry groups (fin, tra, lei, egy, tec, ind, and rcg) cannot be identified simultaneously given the presence of χ_0. To identify the model, we assume that χ_{1,d_j} = 0 for the retail and consumer goods group, χ_{2,a_j} = 0 for the age group of 12 years or more, and χ_{3,s_j} = 0 for the rating group Caa–C. These normalizations are innocuous and can be replaced by alternative baseline choices without affecting our conclusions.
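To illustrate the additive structure (16), the sketch below maps each group j = (industry d_j, age a_j, rating s_j) to its coefficient. The deviation values shown are taken from the Model 3 column of Table 2 for the λ coefficients; the helper names and the dictionary encoding are ours.

```python
import itertools

industries = ["fin", "tra", "lei", "egy", "ind", "tec", "rcg"]
ages = ["0-3", "3-6", "6-12", "12+"]
ratings = ["IG", "Ba", "B", "Caa-C"]

def coeff(chi0, chi1, chi2, chi3, d, a, s):
    """Additive coefficient (16); the baselines rcg, 12+ and Caa-C carry
    zero deviations, enforced here via dict defaults."""
    return chi0 + chi1.get(d, 0.0) + chi2.get(a, 0.0) + chi3.get(s, 0.0)

# Example: fixed effects lambda_j with a subset of the Table 2 deviations.
chi1 = {"fin": 0.03, "tra": 0.19}
chi2 = {"0-3": -0.25}
chi3 = {"IG": -7.56, "Ba": -3.88, "B": -1.79}
lam = {(d, a, s): coeff(-2.88, chi1, chi2, chi3, d, a, s)
       for d, a, s in itertools.product(industries, ages, ratings)}
print(len(lam))  # 7 * 4 * 4 = 112 groups
```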
Table 2
Estimation results. We report the maximum likelihood estimates of selected coefficients in the specification for the signal or log-odds ratio (3) with parameterization χ̄_j = χ_0 + χ_{1,d_j} + χ_{2,a_j} + χ_{3,s_j} for χ̄ = λ, β. Coefficients λ refer to fixed effects or baseline log-odds, coefficients β refer to the frailty factor, and coefficients γ and δ refer to the macro and industry factors, respectively. Monte Carlo log-likelihood evaluation is based on M = 5000 importance samples. Data are from 1981Q1 to 2009Q4. Further details of the model specification are discussed in Section 5.3.

           Model 1: only Ft      Model 2: only ft^uc   Model 3: all factors
Par        Val       t-val       Val       t-val       Val       t-val
λ0         −2.62     12.72       −2.56      9.19       −2.88     10.24
λ1,fin      0.01      0.10        0.06      0.45        0.03      0.19
λ1,tra      0.19      1.13        0.18      1.16        0.19      1.24
λ1,lei      0.00      0.00       −0.09      0.78       −0.04      0.31
λ1,egy     −0.21      1.60       −0.05      0.25       −0.44      2.15
λ1,ind     −0.11      1.28       −0.19      1.85       −0.12      1.19
λ1,tec     −0.28      2.04       −0.25      2.10       −0.29      2.19
λ2,0–3     −0.20      1.74       −0.23      2.06       −0.25      2.21
λ2,4–5      0.26      2.39        0.17      1.46        0.14      1.38
λ2,6–12     0.24      2.04        0.14      1.20        0.15      1.30
λ3,IG      −7.55     14.46       −6.98     16.80       −7.56     13.69
λ3,Ba      −3.26     12.07       −3.21     15.62       −3.88     13.68
λ3,B       −1.21      6.51       −1.25      7.78       −1.79      7.86
β0           ···       ···        0.60      4.90        0.53      4.30
β1,fin       ···       ···       −0.14      0.81       −0.18      1.45
β1,tra       ···       ···        0.03      0.15        0.04      0.30
β1,lei       ···       ···        0.13      1.04       −0.00      0.03
β1,egy       ···       ···       −0.37      1.92        0.52      2.80
β1,ind       ···       ···        0.15      1.20       −0.17      1.93
β1,tec       ···       ···       −0.02      0.15       −0.07      0.66
β2,IG        ···       ···        0.36      1.18        0.07      0.18
β2,Ba        ···       ···        0.23      1.38        0.44      1.92
β2,B         ···       ···        0.20      2.22        0.35      2.54
γ1,IG       1.37      4.02         ···       ···        1.44      4.47
γ1,Ba       0.49      2.19         ···       ···        0.68      3.18
γ1,B        0.44      5.01         ···       ···        0.63      3.86
γ1,Caa      0.42      3.62         ···       ···        0.53      4.02
δfin        0.18      2.33        0.17      2.31        0.17      2.00
δtra       −0.45      3.57       −0.37      2.72       −0.34      2.76
δlei        0.04      0.52        0.13      1.90        0.11      1.71
δegy        0.27      3.04        0.19      2.22       −0.03      0.34
δind        0.02      0.29       −0.02      0.42       −0.05      1.13
δtec        0.30      5.21        0.25      3.98        0.25      3.53
δrcg        0.19      2.78        0.19      2.71        0.17      2.54
LogLik   −2660.98               −2639.43              −2595.87
For the frailty factor coefficients, we do not account for age and therefore set β_{2,a} = 0 for all a. For the principal components coefficients, we only account for rating groups, and therefore γ_{1,d} = 0 and γ_{2,a} = 0 for all d and a. For the industry factor coefficients, we only account for industry groups, and therefore δ_{2,a} = 0 and δ_{3,s} = 0 for all a and s. Using this parameter specification, we combine model parsimony with the ability to test a rich set of hypotheses empirically given the data at hand.

5.4. Empirical findings

Table 2 presents the parameter estimates for three different specifications of the signal equation (3). Model 1 does not contain the frailty factor, β_j = 0. Model 2 does not contain the macro factors, γ_{rj} = 0 for all r and j. Model 3 refers to specification (3) without restrictions. When comparing the log-likelihood values of Models 1 and 3, we conclude that adding a latent dynamic frailty factor increases the log-likelihood by approximately 65 points. This increase is statistically significant at the 1% level. Since in practice most default models rely on a small set of observed covariates, this finding indicates that a model without a frailty factor can systematically provide misleading indications of default conditions. Industry practice is therefore at best suboptimal, and at worst systematically misleading, when used for inference
on default conditions, e.g. during a stress testing exercise. Furthermore, our findings support Duffie et al. (2009) and Azizpour et al. (2010), who find that firms are exposed to a common dynamic latent component driving default in addition to observed risk factors. Ignoring the frailty component may lead to a significant downward bias when assessing default rate volatility and the associated probability of extreme default losses. Our estimated frailty factor actually captures the combined effect of frailty and contagion; see Azizpour et al. (2010) for a continuous-time model where frailty and contagion are disentangled.

We further find that Model 2 produces a better in-sample fit to the data than Model 1 in terms of the maximized log-likelihood value. Hence, a single unobserved component captures default conditions better than the first ten principal components from the macroeconomic panel. We therefore conclude that business cycle dynamics and default risk conditions are different processes. The principal components nevertheless represent important covariation in defaults: the difference in the log-likelihood values of Models 2 and 3 is 44 points and is significant at the 5% level. We may therefore conclude that all risk factors in our model are significant. However, not all principal components are of equal importance to default rates. For example, factors 3 and 6 capture 10% and 4% of the variation in the macro panel, respectively, but they have no effect on default counts.

5.5. Interpretation of the frailty factor

We have given evidence in Section 5.4 that firms are exposed to a common dynamic latent factor driving default after controlling for observed risk factors. Given its statistical and economic significance, we may conclude that the business cycle and the default cycle are related but depend on different processes. The approximation of the default cycle by business cycle indicators may not be sufficiently accurate.

Fig. 4 presents the frailty factor estimates for Models 2 and 3. The recession periods of 1983, 1991, 2001, and 2008–09 are marked as shaded areas. Recession periods coincide with peaks in the default cycle in the top panel for Model 2. The bottom panel presents the estimated frailty effects for Model 3. Duffie et al. (2009) suggest that the frailty factor captures omitted relevant macro-financial covariates together with other omitted effects that are difficult to quantify. Our results suggest that the frailty factor captures predominantly other omitted effects, as we have conditioned on macro factors that capture more than 84% of the variation in over 100 macro variables. The frailty effects in the period 2001–2002 could be attributed to the disappearance of trust in the accuracy of public accounting information following the Enron and WorldCom scandals. This development would cause lenders not to extend credit, causing illiquidity risk and eventually credit risk to rise. While such effects are important for risk conditions, they are difficult to quantify. Similarly, the downward movements of the frailty factor in 2005–2007 suggest that Model 3 is able to capture the positive effects of advances in credit risk transfer and securitization during that time, which led to cheap access to credit. The estimated frailty factor thus appears to capture different omitted effects at different times, rather than substituting for a single missing covariate.

Fig. 5 presents the estimated composite default signals θ_jt for investment grade firms (Aaa–Baa) against low speculative grade firms (Caa–C). The frailty effects are less important for investment grade firms, which typically have highly robust access to credit. The default clustering implied by observed risk factors is sufficient to match the time-varying default probabilities in the recession periods 1983, 1991, 2001, and 2008. The speculative grade firms have a less robust access to credit that depends much more on market circumstances.
Fig. 4. Frailty factor. We present estimates of f_t^uc from models M2 (top panel) and M3 (bottom panel). Standard error bands are obtained from conditional factor variance estimates (see Section 4.3) and correspond to a 95% confidence level.
For this group, frailty effects indicate additional default clustering in the 1980s and also during the 1991 recession. The bottom panel of Fig. 5 shows that the low default probabilities for bad risks in the years leading up to the financial crisis are attributed to the frailty component. Finally, we note that our frailty component partly includes contagion effects across and within industry sectors and may overestimate the 'pure' frailty effect. We refer to Azizpour et al. (2010), who disentangle contagion dependence from frailty in a continuous-time framework; they argue that contagion linkages remain an important source of variation even after frailty is taken into account. We do not make such a distinction explicitly here and leave it to future research to investigate whether such a distinction affects the out-of-sample forecasting results presented in the next subsection.
5.6. Out-of-sample forecasting accuracy

We compare the out-of-sample forecasting performance between models by considering a number of competing model specifications. Accurate forecasts are valuable in credit risk management, for short-term loan pricing, and for credit portfolio stress testing. Moreover, out-of-sample forecasting is a stringent diagnostic check for modeling and analyzing time series. We present a truly out-of-sample forecasting study by estimating the parameters of the model using data up to a certain year and by computing the forecasts of the cross-sectional default probabilities for the next year. In this way we have computed our forecasts for the nine years 2001, ..., 2009.

The measurement of forecasting accuracy of time-varying probabilities is not straightforward. Observed default fractions are only a crude measure of default conditions. We can illustrate this inaccuracy by considering a group of, say, 5 firms. Even if the default probability for this group is forecasted perfectly, it is unlikely to coincide with the observed default fraction of either 0, 1/5, 2/5, etc. The forecast error may therefore be large, but this does not necessarily indicate a bad forecast. The observed default fractions are only useful when a sufficiently large number of firms are pooled in a single group. For this reason we pool default and exposure counts over age cohorts, and focus on two broad rating groups, i.e., (i) all rated firms in a certain industry, and (ii) firms in that industry with ratings Ba and below (speculative grade). The mean absolute error (MAE) and the root mean squared error (RMSE) statistics are computed as

\[
\mathrm{MAE}(t) = \frac{1}{D}\sum_{d=1}^{D}\left|\hat\pi^{an}_{d,t+4|t} - \bar\pi^{an}_{d,t+4}\right|,
\qquad
\mathrm{RMSE}(t) = \left[\frac{1}{D}\sum_{d=1}^{D}\left(\hat\pi^{an}_{d,t+4|t} - \bar\pi^{an}_{d,t+4}\right)^2\right]^{1/2},
\]

where the index d = 1, ..., D refers to industry groups. The estimated and realized annual probabilities are given by

\[
\hat\pi^{an}_{d,t+4|t} = 1 - \prod_{h=1}^{4}\left(1 - \hat\pi_{d,t+h|t}\right),
\qquad
\bar\pi^{an}_{d,t+4} = 1 - \prod_{h=1}^{4}\left(1 - \frac{y_{d,t+h}}{k_{d,t+h}}\right),
\]

respectively, where π̂_{d,t+h|t}, for h = 1, ..., 4, are the forecasted quarterly probabilities for time t + h.
Fig. 5. Smoothed default signals. The two panels present the smoothed default signals θ_jt for investment grade (Aaa–Baa) and low speculative grade (Caa–C) firms. The panels decompose the total default signal into estimated factors, scaled by their respective factor loadings (standard deviations). We plot variation due to the first principal component F̂_{1,t}, all principal components F̂_{1,t} to F̂_{10,t}, and all factors including the latent component f̂_t^uc.
To obtain the required default signals, we first forecast all factors F̂_t, f̂_t^uc jointly using a low-order vector autoregression, using the in-sample mode estimates of F̂_t and f̂_t^uc. Although mode estimates of f_t^uc are indicated by f̄_t^uc, in our forecasting study we integrate them in a Gaussian vector autoregression, for which mode and mean estimates coincide. This vector autoregressive model takes into account that the factors F_t and f_t^uc are conditionally correlated with each other.
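A minimal sketch of the evaluation step (array names ours): quarterly probability forecasts for one year are compounded into annual probabilities and compared with realized default fractions across D industry groups.

```python
import numpy as np

def annual_pd(pi_q):
    """Compound quarterly probabilities (4 x D) into annual ones:
    pi_an = 1 - prod_h (1 - pi_{t+h})."""
    return 1.0 - np.prod(1.0 - pi_q, axis=0)

def mae_rmse(pi_q_hat, y, k):
    """MAE(t) and RMSE(t) across D industry groups for one forecast year;
    y and k are 4 x D realized default and exposure counts."""
    err = annual_pd(pi_q_hat) - annual_pd(y / k)
    return np.mean(np.abs(err)), np.sqrt(np.mean(err**2))
```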
Given the forecasts of F̂_t and f̂_t^uc, we compute π̂_{d,t+h|t} using Eqs. (2) and (3), based on the parameter estimates and the mode estimates of the signal θ_jt.

Table 3 reports the forecast error statistics for five competing models. Model 0 does not contain common factors; it corresponds to the common practice of estimating default probabilities using long-term historical averages. A second version of Model 0 replaces the common macro factors with three widely used observed covariates: changes in industrial production, changes in the unemployment rate, and the yield spread between Baa and Aaa rated corporate bonds. We label this version M0(X_t) and use it as our benchmark; this approach is more common in the literature and serves as a more realistic benchmark. The results reported in Table 3 are based
on out-of-sample forecasts from Models 1, 2, and 3, with their parameters replaced by their corresponding estimates as reported in Table 2.

As the main finding, the 'observed' risk factors F̂_t, the latent component f_t^uc, and the industry-specific risk factors in C_t each contribute, to different extents, to the out-of-sample forecasting performance for default rates. Feasible reductions in forecasting error are substantial and by far exceed the reductions achieved by using a few observed covariates directly. The reduction in mean absolute forecasting error due to the inclusion of the three observed covariates in Model 0 is less than 2%; using other observed risk factors provides similar results. Reductions in forecasting error increase when the observed covariates are replaced by principal components, and are as high as 10% on average over the years 2001–2009. This finding shows that principal components from a large macro and finance panel can capture default dynamics more successfully. Forecasts improve further when an unobserved component is added to the principal components and observed industry factors: the reduction in mean absolute forecasting error then reaches 43% on average. Reductions in MAE are most pronounced when frailty effects are highest. This is the case in 2002, when default rates remain high while the economy is recovering from recession, and in the years 2005–2007, when default conditions are substantially better than expected from macro and financial data. Reductions of more than 40% on average are substantial and have clear practical implications for the computation of capital requirements.
Table 3
Out-of-sample forecasting accuracy. The table reports forecast error statistics associated with one-year ahead out-of-sample forecasts of time-varying point-in-time default probabilities. Error statistics are relative to a benchmark model M0(Xt) with observed risk factors only, where Xt contains changes in industrial production, changes in the unemployment rate, and the yield spread between Baa and Aaa rated bonds; see Section 5.6. We report mean absolute error (MAE) and root mean square error (RMSE) statistics for all firms (All) and speculative grade (SpG), respectively, based on all industry-group forecasts for the years 2001–2009. The relative MAEs are also given for all industry-group forecasts, for each year. Model M0 contains constants only. Models M1, M2, and M3 contain in addition the factors Ft, ft^uc, and both Ft, ft^uc, respectively. The models may also contain covariates as indicated.

Model              Stat  Grp  Total  Ch.MAE   2001  2002  2003  2004  2005  2006  2007  2008  2009
M0: no factors     MAE   All  1.00    0.0%    1.05  0.66  1.62  1.08  1.01  1.04  1.02  1.06  0.80
                   MAE   SpG  0.99   −1.4%    1.01  0.65  1.58  1.08  1.01  1.04  1.03  1.05  0.76
                   RMSE  All  1.01            1.06  0.70  1.49  1.09  1.01  1.05  1.04  1.07  0.85
                   RMSE  SpG  0.99            1.04  0.71  1.43  1.09  1.01  1.05  1.04  1.06  0.73
M0: Xt, Ct         MAE   All  0.99   −0.8%    0.96  1.07  1.09  0.95  0.96  1.04  0.96  0.95  1.00
                   MAE   SpG  0.99   −1.3%    0.96  1.04  1.06  0.94  0.95  1.03  0.95  0.96  1.02
                   RMSE  All  1.01            0.97  1.12  1.06  0.96  0.98  1.05  0.96  0.97  1.07
                   RMSE  SpG  1.00            0.97  1.07  1.03  0.95  0.97  1.04  0.95  0.97  1.07
M1: Ft, Ct         MAE   All  0.91   −9.4%    0.93  0.84  1.10  0.74  0.66  0.83  0.96  0.89  1.27
                   MAE   SpG  0.90  −10.2%    0.99  0.82  1.04  0.75  0.64  0.81  0.96  0.91  1.25
                   RMSE  All  0.92            0.89  0.77  1.04  0.77  0.71  0.85  0.97  0.88  1.34
                   RMSE  SpG  0.92            0.96  0.77  1.00  0.78  0.69  0.85  0.97  0.89  1.31
M2: ft^uc, Ct      MAE   All  0.61  −38.7%    1.13  0.93  0.88  0.50  0.36  0.26  0.30  1.03  0.77
                   MAE   SpG  0.62  −38.2%    1.07  0.79  0.89  0.52  0.35  0.30  0.34  1.05  0.76
                   RMSE  All  0.65            1.15  0.88  0.80  0.55  0.41  0.28  0.35  1.17  0.76
                   RMSE  SpG  0.66            1.08  0.82  0.82  0.57  0.40  0.33  0.38  1.17  0.72
M3: Ft, ft^uc,     MAE   All  0.63  −36.9%    0.90  0.58  0.62  0.45  0.37  0.39  0.32  1.19  1.18
    no Ct          MAE   SpG  0.63  −37.4%    0.92  0.57  0.74  0.44  0.37  0.42  0.31  1.28  1.08
                   RMSE  All  0.68            0.91  0.67  0.70  0.46  0.40  0.39  0.37  1.24  1.16
                   RMSE  SpG  0.68            0.95  0.70  0.77  0.46  0.40  0.43  0.37  1.32  1.09
M3: Ft, ft^uc, Ct  MAE   All  0.57  −43.0%    0.95  0.58  0.61  0.37  0.35  0.31  0.29  1.10  0.98
                   MAE   SpG  0.57  −43.2%    0.94  0.56  0.73  0.36  0.35  0.35  0.29  1.18  0.86
                   RMSE  All  0.62            0.96  0.66  0.62  0.37  0.37  0.32  0.35  1.21  0.99
                   RMSE  SpG  0.63            0.97  0.68  0.70  0.38  0.39  0.37  0.35  1.26  0.90
It is also clear that the simple AR(1) dynamics for the frailty factor are too simplistic to capture the abrupt changes in common credit conditions during the crisis of 2008. As the frailty factor is negative over 2007, the forecast of default risk over 2008 based on the AR(1) dynamics is too low. In 2009, we find that the full model including frailty again does better than its competitors. To further improve the forecasting performance of the full model in crisis situations, one could extend the dynamic behavior of the frailty factor to include nonlinear effects. This is left for future research.

6. Conclusion

We have proposed a novel non-Gaussian panel data time series model with regression effects for the analysis and forecasting of corporate default rates. The dynamic model combines a non-Gaussian panel data specification with the principal components of a large number of macroeconomic covariates. In an empirical application to US data, the combined factors capture a significant share of the common dynamics in disaggregated default counts. We find a large and significant role for a dynamic frailty component, even after accounting for more than 80% of the variation in more than 100 macroeconomic and financial covariates. A latent component or frailty factor is thus needed to prevent a downward bias in the estimation of extreme default losses on portfolios of US corporate debt. Our results also indicate that the presence of a latent factor may not be due to a few omitted macroeconomic covariates; rather, it appears to capture different omitted effects at different times.

In an out-of-sample forecasting experiment, we obtain substantial reductions, between 10% and 43% on average, in mean absolute error when forecasting point-in-time default probabilities using our factor structure. The forecasts from our model are particularly more accurate at times when frailty effects are important and when aggregate default conditions deviate from financial and business cycle conditions. A frailty component implies additional default rate volatility and may contribute to default clustering during periods of stress. Practitioners who rely on observed macroeconomic and firm-specific data alone may underestimate their economic capital requirements and crisis default probabilities as a result.
References

Azizpour, S., Giesecke, K., Schwenkler, G., 2010. Exploring the sources of default clustering. Stanford University Working Paper Series.
Black, F., Cox, J.C., 1976. Valuing corporate securities: some effects of bond indenture provisions. The Journal of Finance 31 (2), 351–367.
Boissay, F., Gropp, R., 2007. Trade credit defaults and liquidity provision by firms. ECB Working Paper No. 753.
CreditMetrics, 2007. CreditMetrics (TM): technical document, RiskMetrics Group. www.riskmetrics.com/pdf/dnldtechdoc/CMTD1.pdf.
Das, S., Duffie, D., Kapadia, N., Saita, L., 2007. Common failings: how corporate defaults are correlated. The Journal of Finance 62 (1), 93–117.
de Jong, P., Shephard, N., 1995. The simulation smoother for time series models. Biometrika 82, 339–350.
Doornik, J.A., 2007. Ox: An Object-Oriented Matrix Language. Timberlake Consultants Press, London.
Duffie, D., Eckner, A., Horel, G., Saita, L., 2009. Frailty correlated default. Journal of Finance 64 (5), 2089–2123.
Duffie, D., Saita, L., Wang, K., 2007. Multi-period corporate default prediction with stochastic covariates. Journal of Financial Economics 83 (3), 635–665.
Durbin, J., Koopman, S.J., 1997. Monte Carlo maximum likelihood estimation for non-Gaussian state space models. Biometrika 84 (3), 669–684.
Durbin, J., Koopman, S.J., 2001. Time Series Analysis by State Space Methods. Oxford University Press, Oxford.
Durbin, J., Koopman, S.J., 2002. A simple and efficient simulation smoother for state space time series analysis. Biometrika 89 (3), 603–616.
Exterkate, P., van Dijk, D., Heij, C., Groenen, P.J.F., 2010. Forecasting the yield curve in a data-rich environment using the factor-augmented Nelson–Siegel model. Econometric Institute Working Paper, Erasmus University Rotterdam.
Forni, M., Hallin, M., Lippi, M., Reichlin, L., 2005. The generalized dynamic-factor model: one-sided estimation and forecasting. Journal of the American Statistical Association 100, 830–840.
Giesecke, K., 2004. Correlated default with incomplete information. Journal of Banking and Finance 28, 1521–1545.
Giesecke, K., Kim, B., 2010. Systemic risk: what defaults are telling us. Stanford University Working Paper, pp. 1–34.
Jorion, P., Zhang, G., 2007. Good and bad credit contagion: evidence from credit default swaps. Journal of Financial Economics 84 (3), 860–883.
Koopman, S.J., Lucas, A., 2008. A non-Gaussian panel time series model for estimating and decomposing default risk. Journal of Business and Economic Statistics 26 (4), 510–525.
Koopman, S.J., Lucas, A., Monteiro, A., 2008. The multi-stage latent factor intensity model for credit rating transitions. Journal of Econometrics 142 (1), 399–424.
Koopman, S.J., Shephard, N., Doornik, J.A., 2008. Statistical Algorithms for Models in State Space Form: SsfPack 3.0. Timberlake Consultants Press, London.
Lando, D., 2003. Credit Risk Modelling: Theory and Applications. Princeton University Press.
Lando, D., Nielsen, M.S., 2008. Correlation in corporate defaults: contagion or conditional independence. Working Paper, Copenhagen Business School.
Lang, L., Stulz, R., 1992. Contagion and competitive intra-industry effects of bankruptcy announcements. Journal of Financial Economics 32, 45–60.
Lawley, D.N., Maxwell, A.E., 1971. Factor Analysis as a Statistical Method. American Elsevier Publishing, New York.
Ludvigson, S.C., Ng, S., 2007. The empirical risk-return relation: a factor analysis approach. Journal of Financial Economics 83, 171–222.
Marcellino, M., Stock, J.H., Watson, M.W., 2003. Macroeconomic forecasting in the euro area: country specific versus area-wide information. European Economic Review 47 (1), 1–18.
McNeil, A.J., Frey, R., Embrechts, P., 2005. Quantitative Risk Management: Concepts, Techniques and Tools. Princeton University Press.
McNeil, A., Wendin, J., 2007. Bayesian inference for generalized linear mixed models of portfolio credit risk. Journal of Empirical Finance 14 (2), 131–149.
Merton, R., 1974. On the pricing of corporate debt: the risk structure of interest rates. Journal of Finance 29 (2), 449–470.
Stock, J., Watson, M., 2002a. Forecasting using principal components from a large number of predictors. Journal of the American Statistical Association 97 (460), 1167–1179.
Stock, J., Watson, M., 2002b. Macroeconomic forecasting using diffusion indexes. Journal of Business and Economic Statistics 20 (2), 147–162.
Vassalou, M., Xing, Y., 2004. Default risk in equity returns. Journal of Finance 59 (2), 831–868.
Journal of Econometrics 162 (2011) 326–344
Generalized runs tests for the IID hypothesis

Jin Seo Cho (a), Halbert White (b,*)

(a) School of Economics, Yonsei University, 50 Yonsei-ro, Seodaemun-gu, Seoul 120-749, South Korea
(b) Department of Economics, University of California, San Diego, 9500 Gilman Dr., La Jolla, CA 92093, USA

* Corresponding author. Tel.: +1 858 534 5985; fax: +1 858 534 7040. E-mail addresses: [email protected] (J.S. Cho), [email protected], [email protected] (H. White).
Article info

Article history: Received 5 August 2009; received in revised form 27 January 2011; accepted 3 February 2011; available online 12 February 2011.

JEL classification: C12; C23; C80.

Keywords: IID condition; Runs test; Geometric distribution; Gaussian process; Dependence; Structural break.

Abstract

We provide a family of tests for the IID hypothesis based on generalized runs, powerful against unspecified alternatives, providing a useful complement to tests designed for specific alternatives, such as serial correlation, GARCH, or structural breaks. Our tests have appealing computational simplicity in that they do not require kernel density estimation, with the associated challenge of bandwidth selection. Simulations show levels close to nominal asymptotic levels. Our tests have power against both dependent and heterogeneous alternatives, as both theory and simulations demonstrate. © 2011 Elsevier B.V. All rights reserved.
1. Introduction

The assumption that data are independent and identically distributed (IID) plays a central role in the analysis of economic data. In cross-section settings, the IID assumption holds under pure random sampling. As Heckman (2001) notes, violation of the IID property, and therefore of random sampling, can indicate the presence of sample selection bias. The IID assumption is also important in time-series settings, as processes driving time series of interest are often assumed to be IID. Moreover, transformations of certain time series can be shown to be IID under specific null hypotheses. For example, Diebold et al. (1998) show that to test density forecast optimality, one can test whether the series of probability integral transforms of the forecast errors is IID uniform (U[0, 1]).

There is a large number of tests designed to test the IID assumption against specific alternatives, such as structural breaks, serial correlation, or autoregressive conditional heteroskedasticity. Such special purpose tests may lack power in other directions, however, so it is useful to have available broader diagnostics
that may alert researchers to otherwise unsuspected properties of their data. Thus, as a complement to special purpose tests, we consider tests for the IID hypothesis that are sensitive to general alternatives. Here we exploit runs statistics to obtain necessary and sufficient conditions for data to be IID. In particular, we show that if the underlying data are IID, then suitably defined runs are IID with the geometric distribution. By testing whether the runs have the requisite geometric distribution, we obtain a new family of tests, the generalized runs tests, suitable for testing the IID property. An appealing aspect of our tests is their computational convenience relative to other tests sensitive to general alternatives to IID. For example, Hong and White’s (2005) entropy-based IID tests require kernel density estimation, with its associated challenge of bandwidth selection. Our tests do not require kernel estimation and, as we show, have power against dependent alternatives. Our tests also have power against structural break alternatives, without exhibiting the non-monotonicities apparent in certain tests based on kernel estimators (Crainiceanu and Vogelsang, 2007; Deng and Perron, 2008). Runs have formed an effective means for understanding data properties since the early 1940s. Wald and Wolfowitz (1940), Mood (1940), Dodd (1942) and Goodman (1958) first studied runs to test for randomness of data with a fixed percentile p used in defining the runs. Granger (1963) and Dufour (1981) propose using runs as a nonparametric diagnostic for serial correlation, noting
that the choice of p is important for the power of the test. Fama (1965) extensively exploits a runs test to examine stylized facts of asset returns in US industries, with a particular focus on testing for serial correlation of asset returns. Heckman (2001) observes that runs tests can be exploited to detect sample selection bias in crosssectional data; such biases can be understood to arise from a form of structural break in the underlying distributions. Earlier runs tests compared the mean or other moments of the runs to those of the geometric distribution for fixed p, say 0.5 (in which case the associated runs can be computed alternatively using the median instead of the mean). Here we develop runs tests based on the probability generating function (PGF) of the geometric distribution. Previously, Kocherlakota and Kocherlakota (KK, 1986) have used the PGF to devise tests for discrete random variables having a given distribution under the null hypothesis. Using fixed values of the PGF parameter s, KK develop tests for the Poisson, Pascal–Poisson, bivariate Poisson, or bivariate Neyman type A distributions. More recently, Rueda et al. (1991) study PGF-based tests for the Poisson null hypothesis, constructing test statistics as functionals of stochastic processes indexed by the PGF parameter s. Here we develop PGF-based tests for the geometric distribution with parameter p, applied to the runs for a sample of continuously distributed random variables. We construct our test statistics as functionals of stochastic processes indexed by both the runs percentile p and the PGF parameter s. By not restricting ourselves to fixed values for p and/or s, we create the opportunity to construct tests with superior power. Further, we obtain weak limits for our statistics in situations where the distribution of the raw data from which the runs are constructed may or may not be known and where there may or may not be estimated parameters. As pointed out by Darling (1955), Sukhatme (1972), Durbin (1973) and Henze (1996), among others, goodness-of-fit (GOF) based statistics such as ours may have limiting distributions affected by parameter estimation. As we show, however, our test statistics have asymptotic null distributions that are not affected by parameter estimation under mild conditions. We also provide straightforward simulation methods to consistently estimate asymptotic critical values for our test statistics. We analyze the asymptotic local power of our tests, and we conduct Monte Carlo experiments to explore the properties of our tests in settings relevant for economic applications. In studying power, we give particular attention to dependent alternatives and to alternatives containing an unknown number of structural breaks. To analyze the asymptotic local power of our tests against dependent alternatives, we assume a first-order Markov process converging to an IID process in probability at the rate n−1/2 , where n is the sample size, and we find that our tests have nontrivial local power. We work with first-order Markov processes for conciseness. Our results generalize to higher-order Markov processes, but that analysis is sufficiently involved that we leave it for subsequent work. Our Monte Carlo experiments corroborate our theoretical results and also show that our tests exhibit useful finite sample behavior. For dependent alternatives, we compare our generalized runs tests to the entropy-based tests of Robinson (1991), Skaug and Tjøstheim (1996) and Hong and White (2005). 
Our tests perform respectably, showing good level behavior and useful, and in some cases superior, power against dependent alternatives. For structural break alternatives, we compare our generalized runs tests to Feller's (1951) and Kuan and Hornik's (1995) RR test, Brown et al.'s (1975) RE-CUSUM test, Sen's (1980) and Ploberger et al.'s (1989) RE test, Ploberger and Krämer's (1992) OLS-CUSUM test, Andrews' (1993) Sup-W test, Andrews and Ploberger's (1994) Exp-W and Avg-W tests, and Bai's (1996) M-test. These prior tests are all designed to detect a finite number of structural breaks
at unknown locations. We find good level behavior for our tests and superior power against multiple breaks. An innovation is that we consider alternatives where the number of breaks grows with the sample size. Our new tests perform well against such structural break alternatives, whereas the prior tests do not.

This paper is organized as follows. In Section 2, we introduce our new family of generalized runs statistics and derive their asymptotic null distributions. These involve Gaussian stochastic processes. Section 3 provides methods for consistently estimating critical values for the test statistics of Section 2. This permits us to compute valid asymptotic critical values even when the associated Gaussian processes are transformed by continuous mappings designed to yield particular test statistics of interest. We achieve this using other easily simulated Gaussian processes whose distributions are identical to those of Section 2. Section 4 studies aspects of local power for our tests. Section 5 contains Monte Carlo simulations; this also illustrates the use of the simulation methods developed for obtaining the asymptotic critical values in Section 2. Section 6 contains concluding remarks. All mathematical proofs are collected in the Appendix.

Before proceeding, we introduce mathematical notation used throughout. We let 1{·} stand for the indicator function such that 1{A} = 1 if the event A is true, and 0 otherwise. ⇒ and → denote 'converge(s) weakly' and 'converge(s) to', respectively, and =_d denotes equality in distribution. Further, ‖·‖ and ‖·‖∞ denote the Euclidean and uniform metrics, respectively. We let C(A) and D(A) be the spaces of continuous and cadlag mappings from a set A to R, respectively, and we endow these spaces with Billingsley's (1968, 1999) or Bickel and Wichura's (1971) metric. We denote the unit interval as I := [0, 1].

2. Testing the IID hypothesis

2.1. Maintained assumptions

We begin by collecting together assumptions maintained throughout and proceed with our discussion based on these. We first specify the data generating process (DGP) and a parameterized function whose behavior is of interest.

A1 (DGP): Let (Ω, F, P) be a complete probability space. For m ∈ N, {X_t : Ω → R^m, t = 1, 2, ...} is a stochastic process on (Ω, F, P).

A2 (Parameterization): For d ∈ N, let Θ be a non-empty convex compact subset of R^d. Let h : R^m × Θ → R be a function such that (i) for each θ ∈ Θ, h(X_t(·), θ) is measurable; and (ii) for each ω ∈ Ω, h(X_t(ω), ·) is such that for each θ, θ′ ∈ Θ, |h(X_t(ω), θ) − h(X_t(ω), θ′)| ≤ M_t(ω)‖θ − θ′‖, where M_t is measurable and is O_P(1), uniformly in t.

Assumption A2 specifies that X_t is transformed via h. The Lipschitz condition of A2(ii) is mild and typically holds in applications involving estimation. Our next assumption restricts attention to continuously distributed random variables.

A3 (Continuous random variables): For given θ* ∈ Θ, the random variables Y_t := h(X_t, θ*) have continuous cumulative distribution functions (CDFs) F_t : R → I, t = 1, 2, ....

Our main interest attaches to distinguishing the following hypotheses:
H0 : {Yt : t = 1, 2, . . .} is an IID sequence; vs. H1 : {Yt : t = 1, 2, . . .} is not an IID sequence. Under H0 , Ft ≡ F (say), t = 1, 2, . . . . We separately treat the cases in which F is known or unknown. In the latter case, we estimate F using the empirical distribution function.
We also separately consider cases in which θ* is known or unknown. In the latter case, we assume that θ* is consistently estimated by θ̂_n. Formally, we impose

A4 (Estimator): There exists a sequence of measurable functions {θ̂_n : Ω → Θ} such that √n(θ̂_n − θ*) = O_P(1).

Thus, the sequence of transformed observations {Y_t : t = 1, 2, ...} need not be observable. Instead, it suffices that these can be estimated, as occurs when regression errors are of interest. In this case, h(X_t, θ*) can be regarded as a representation of the regression errors X_{1t} − E[X_{1t}|X_{2t}, ..., X_{mt}], say, and the estimated residuals then have the representation h(X_t, θ̂_n). We pay particular attention to the effect of parameter estimation on the asymptotic null distribution of our test statistics.

2.2. Generalized runs (GR) tests

Our first result justifies popular uses of runs in the literature. For this, we provide properties of the runs distribution, new to the best of our knowledge, that can be exploited to yield a variety of runs-based tests consistent against departures from the IID null hypothesis. We begin by analyzing the case in which θ* and F are known.

We define runs in the following two steps. First, for each p ∈ I, we let T_n(p) := {t ∈ {1, ..., n} : F(Y_t) < p}, n = 1, 2, .... This set contains those indices whose percentiles F(Y_t) are less than the given number p. That is, we first employ the probability integral transform of Rosenblatt (1952). Next, let M_n(p) denote the (random) number of elements of T_n(p), let t_{n,i}(p) denote the ith smallest element of T_n(p), i = 1, ..., M_n(p), and define the p-runs R_{n,i}(p) as

\[
R_{n,i}(p) :=
\begin{cases}
t_{n,i}(p), & i = 1;\\
t_{n,i}(p) - t_{n,i-1}(p), & i = 2, \ldots, M_n(p).
\end{cases}
\]

Thus, a p-run R_{n,i}(p) is the number of observations separating data values whose percentiles are less than the given value p. This is the conventional definition of runs found in the literature, except that F is assumed known for the moment. Thus, if the population median is known, then the conventional runs given by Wald and Wolfowitz (1940) are identical to ours with p = 0.5. The only difference is that we apply the probability integral transform; this enables us to later accommodate the influence of parameter estimation error on the asymptotic distribution. In Section 2.3 we relax the assumption that F is known and examine how this affects the results obtained in this section. Note that M_n(p)/n = p + o_P(1).

Conventional runs are known to embody the IID hypothesis nonparametrically; this feature is exploited in the literature to test for the IID hypothesis. For example, the Wald and Wolfowitz (1940) runs test considers the standardized number of runs, whose distribution differs asymptotically from the standard normal if the data are not IID, giving the test its power.

It is important to note that for a given n and p, n need not be an element of T_n(p). That is, there may be an "incomplete" or "censored" run at the end of the data that arises because F(Y_n) ≥ p. We omit this censored run from consideration to ensure that all the runs we analyze have an identical distribution. To see why this is important, consider the first run, R_{n,1}(p), and, for the moment, suppose that we admit censored runs (i.e., we include the last run, even if F(Y_n) ≥ p). When a run is censored, we denote its length by k = ∅. When the original data {Y_t} are IID, the marginal distribution of R_{n,1}(p) is then

\[
P(R_{n,1}(p) = k) =
\begin{cases}
(1-p)^{k-1}p, & \text{if } k \le n;\\
(1-p)^{n}, & \text{if } k = \emptyset.
\end{cases}
\]

Thus, when censored runs are admitted, the unconditional distribution of R_{n,1}(p) is a mixture distribution. The same is true for runs other than the first, but the mixture distributions differ due to the censoring. On the other hand, the uncensored run R_{n,1}(p) is distributed as G_p, the geometric distribution with parameter p. The same is also true for uncensored runs other than the first. That is, {R_{n,i}(p), i = 1, 2, ..., M_n(p)} is the set of runs with identical distribution G_p, as every run indexed by i = 1, 2, ..., M_n(p) is uncensored. (The censored run, when it exists, is indexed by i = M_n(p) + 1. When the first run is censored, M_n(p) = 0.) Moreover, as we show, the uncensored runs are independent when {Y_t} is IID. Thus, in what follows, we consider only the uncensored runs, as formally defined above. Further, we construct and analyze our statistics in such a way that values of p for which M_n(p) = 0 have no adverse impact on our results.
Lemma 1. Suppose assumptions A1, A2(i), and A3 hold. (a) Then for each n = 1, 2, . . . , {Yt , t = 1, . . . , n} is IID only if the following regularity conditions (R) hold: 1. for every p ∈ I such that Mn (p) > 0, {Rn,i (p), i = 1, . . . , Mn (p)} is IID with distribution Gp , the geometric distribution with parameter p; and 2. for every p, p′ ∈ I with p′ ≤ p such that Mn (p′ ) > 0, (i) Rn,j (p) is independent of Rn,i (p′ ) if j ̸∈ {Kn,i−1 + 1, Kn,i−1 + 2 , . . . , K n ,i } ; (ii) otherwise, for w = 1, . . . , Mn (p′ ), m = 1, . . . , w, and ℓ = m, . . . , w , m+Kn,i−1
P
−
Rn,j (p) = ℓ, Rn,i (p′ ) = w|Kn,i−1 , Kn,i
j=1+Kn,i−1
i = 1; i = 2, . . . , Mn (p).
Thus, a p-run Rn,i (p) is the number of observations separating data values whose percentiles are less than the given value p. This is the conventional definition of runs found in the literature, except that F is assumed known for the moment. Thus, if the population median is known, then the conventional runs given by Wald and Wolfowitz (1940) are identical to ours with p = 0.5. The only difference is that we apply the probability integral transform; this enables us to later accommodate the influence of parameter estimation error on the asymptotic distribution. In Section 2.3 we relax the assumption that F is known and examine how this affects the results obtained in this section. Note that Mn (p)/n = p + oP (1). Conventional runs are known to embody the IID hypothesis nonparametrically; this feature is exploited in the literature to test for the IID hypothesis. For example, the Wald and Wolfowitz (1940) runs test considers the standardized number of runs, whose distribution differs asymptotically from the standard normal if the data are not IID, giving the test its power. It is important to note that for a given n and p, n need not be an element of Tn (p). That is, there may be an ‘‘incomplete’’ or ‘‘censored’’ run at the end of the data that arises because F (Yn ) ≥ p. We omit this censored run from consideration to ensure that all the runs we analyze have an identical distribution. To see why this is important, consider the first run, Rn,1 (p), and, for the moment, suppose that we admit censored runs (i.e., we include the last run, even if F (Yn ) ≥ p). When a run is censored, we denote its length by k = ∅. When the original data {Yt } are IID, the marginal distribution of Rn,1 (p) is then
Thus, when censored runs are admitted, the unconditional distribution of Rn,1 (p) is a mixture distribution. The same is true for runs other than the first, but the mixture distributions differ due to the censoring. On the other hand, the uncensored run Rn,1 (p) is distributed as Gp , the geometric distribution with parameter p. The same is also true for uncensored runs other than the first. That is, {Rn,i (p), i = 1, 2, . . . , Mn (p)} is the set of runs with identical distribution Gp , as every run indexed by i = 1, 2, . . . , Mn (p) is uncensored. (The censored run, when it exists, is indexed by i = Mn (p) + 1. When the first run is censored, Mn (p) = 0.) Moreover, as we show, the uncensored runs are independent when {Yt } is IID. Thus, in what follows, we consider only the uncensored runs, as formally defined above. Further, we construct and analyze our statistics in such a way that values of p for which Mn (p) = 0 have no adverse impact on our results. To state our result, we let Kn,i stand for Kn,i (p, p′ ), with p′ ≤ p,
ℓ−1 ( )(1 − p)ℓ−m (p − p′ )m m−1 (1 − p′ )w−(ℓ+1) p′ , = ℓ−1 ℓ−m (m−1 )(1 − p) ′ m−1 ′ (p − p ) p ,
if ℓ = m, . . . , w − 1; if ℓ = w .
(b) If R holds, then Yt is identically distributed and pairwise independent. Conditions (1) and (2) of Lemma 1 enable us to detect violations of IID {Yt } in directions that differ from the conventional parametric approaches. Specifically, by Lemma 1(a), alternatives to IID {Yt } may manifest as p-runs with the following alternative (A) properties:
A(i): A(ii): A(iii): A(iv):
the p-runs have distribution Gq , q ̸= p; the p-runs have non-geometric distribution; the p-runs have heterogeneous distributions; the p-runs and p′ -runs have dependence between Rn,i (p) and Rn,j (p′ ) (i ̸= j, p′ ≤ p); A(v): any combination of (i)–(iv).
Popularly assumed alternatives to IID data can be related to the alternatives in A. For example, stationary autoregressive processes yield runs with geometric distribution, but for a given p, {Rn,i (p)} has a geometric distribution different from Gp and may exhibit serial correlation. Thus stationary autoregressive processes exhibit A(i) or A(iv). Alternatively, if the original data are independent but heterogeneously distributed, then for some p, {Rn,i (p)} is non-geometric or has heterogeneous distributions. This case thus belongs to A(ii) or A(iii). To keep our analysis manageable, we focus on detecting A(i)–A(iii) by testing the p-runs for distribution Gp . That is, the hypotheses considered here are as follows: H′0 : {Rn,i (p), i = 1, . . . , Mn (p)} is IID with distribution Gp for each p ∈ I such that Mn (p) > 0; vs. H′1 : {Rn,i (p), i = 1, . . . , Mn (p)} manifests A(i), A(ii), or A(iii) for some p ∈ I such that Mn (p) > 0. Stated more primitively, the alternative DGPs aimed at here include
J.S. Cho, H. White / Journal of Econometrics 162 (2011) 326–344
serially correlated and/or heterogeneous alternatives. Alternatives that violate A(iv) without violating A(i)–A(iii) will generally not be detectable. Thus, our goal is different from the rank-based white noise test of Hallin et al. (1985) and the distribution-functionbased serial independence test of Delgado (1996). Certainly, it is of interest to devise statistics specifically directed at A(iv) in order to test H0 fully against the alternatives of H1 . Such statistics are not as simple to compute and require analysis different from those motivated by H′1 ; moreover, the Monte Carlo simulations in Section 5 show that even with attention restricted to H′1 , we obtain well-behaved tests with power against both commonly assumed dependent and heterogeneous alternatives to IID. We thus leave consideration of tests designed specifically to detect A(iv) to other work. Lemma 1(b) is a partial converse of Lemma 1(a). It appears possible to extend this to a full converse (establishing Yt is IID) using results of Jogdeo (1968), but we leave this aside here for brevity. There are numerous ways to construct statistics for detecting A(i)–A(iii). For example, as for conventional runs statistics, we can compare the first two runs moments with those implied by the geometric distribution. Nevertheless, this approach may fail to detect differences in higher moments. To avoid difficulties of this sort, we exploit a GOF statistic based on the PGF to test the Gp hypothesis. For this, let −1 < s < 0 < s¯ < 1; for each s ∈ S := [s, s¯], define Mn (p)
1 − Gn (p, s) := √ sRn,i (p) − n i=1
sp
{1 − s(1 − p)}
,
(1)
329
This mainly follows by showing that {Gn (p, · ) : n = 1, 2, . . .} is tight (see Billingsley covariance structure 1999); the given (2) is derived from E Gn (p, s)Gn (p, s′ ) under the null. Let f : C (S) → R be a continuous mapping. Then by the continuous mapping theorem, under the null any test statistic f [Gn (p, · )] obeys f [Gn (p, · )] ⇒ f [G(p, · )]. As Granger (1963) and Dufour (1981) emphasize, the power of runs tests may depend critically on the specific choice of p. For example, if the original data set is a sequence of independent normal variables with population mean zero and variance dependent upon index t, then selecting p = 0.5 yields no power, as the runs for p = 0.5 follow G0.5 despite the heterogeneity. Nevertheless, useful power can be delivered by selecting p different from 0.5. This also suggests that better powered runs tests may be obtained by considering numerous p’s at the same time. To fully exploit Gn , we consider Gn as a random function of both p and s, and not just Gn (p, · ) for given p. Specifically, under the null, a functional central limit theorem ensures that Gn ⇒ G
(3)
on J × S, where J := [p, 1] with p > 0, and G is a Gaussian process such that for each (p, s) and (p′ , s′ ) with p′ ≤ p, E [G(p, s)] = 0, and E [G(p, s)G(p′ , s′ )] ss′ p′ (1 − s)(1 − s′ )(1 − p){1 − s′ (1 − p)} 2
=
{1 − s(1 − p)}{1 − s′ (1 − p′ )}2 {1 − ss′ (1 − p)}
.
(4)
if p ∈ (pmin,n , 1), and Gn (p, s) := 0 otherwise, where pmin,n := min[F (Y1 ), F (Y2 ), . . . , F (Yn )]. This is a scaled difference between the p-runs sample PGF and the Gp PGF. Two types of GOF statistics are popular in the literature: those exploiting the empirical distribution function (e.g., Darling, 1955; Sukhatme, 1972; Durbin, 1973; Henze, 1996) and those comparing empirical characteristic or moment generating functions (MGFs) with their sample estimates (e.g., Bierens, 1990; Brett and Pinkse, 1997; Stinchcombe and White, 1998; Hong, 1999; Pinkse, 1998). The statistic in (1) belongs to the latter type, as the PGF for discrete random variables plays the same role as the MGF, as noted by Karlin and Taylor (1975). The PGF is especially convenient because it is a rational polynomial in s, enabling us to easily handle the weak limit of the process Gn . Specifically, the rational polynomial structure permits us to represent this weak limit as an infinite sum of independent Gaussian processes, enabling us to straightforwardly estimate critical values by simulation, as examined in detail in Section 3. GOF tests using (1) are diagnostic, as are standard MGFbased GOF tests; thus, tests based on (1) do not tell us in which direction the null is violated. Also, like standard MGF-based GOF tests, they are not consistent against all departures from the IID hypothesis. Section 4 examines local alternatives to the null; we provide further discussion there. Our use of Gn builds on the work of Kocherlakota and Kocherlakota (KK, 1986), who consider tests for a number of discrete null distributions, based on a comparison of sample and theoretical PGFs for a given finite set of s’s. To test their null distributions, KK recommend choosing s’s close to zero. Subsequently, Rueda et al. (1991) examined the weak limit of an analog of Gn ( p, · ) to test the IID Poisson null hypothesis. Here we show that if {Rn,i (p)} is a sequence of IID p-runs distributed as Gp then Gn ( p, · ) obeys the functional central limit theorem; test statistics can be constructed accordingly. Specifically, for each p, Gn ( p, · ) ⇒ G(p, · ), where G(p, · ) is a Gaussian process such that for each s, s′ ∈ S, E [G(p, s)] = 0, and
When p = p′ , the covariance structure is as in (2). Note also that the covariance structure in (4) is symmetric in both s and p, as we specify that p′ ≤ p. Without this latter restriction, the symmetry is easily seen, as the covariance then has the form given in Box I To obtain (3) and (4), we exploit the joint probability distribution of runs associated with different percentiles p and p′ . Although our statistic Gn is not devised to test for dependence between Rn,j (p) and Rn,j (p′ ), verifying Eq. (4) nevertheless makes particular use of the dependence structure implied by A(iv). This structure also makes it straightforward to devise statistics specifically directed at A(iv); we leave this aside here to maintain a focused presentation. For each s, Gn ( · , s) is cadlag, so the tightness of {Gn } must be proved differently from that of {Gn (p, · )}. Further, although G is continuous in p, it is not differentiable almost surely. This is because Rn,i is a discrete random function of p. As n increases, the discreteness of Gn disappears, but its limit is not smooth enough to deliver differentiability in p. The weak convergence given in (3) is proved by applying the convergence criterion of Bickel and Wichura (1971, Theorem 3). We verify this by showing that the modulus of continuity based on the fourth-order moment is uniformly bounded on J × S. By taking p > 0, we are not sacrificing much, as Mn (p) decreases as p tends to zero, so that Gn (p, · ) converges to zero uniformly on S. For practical purposes, we can thus let p be quite small. We examine the behavior of the relevant test statistics in our Monte Carlo experiments of Section 5 by examining what happens when p is zero. As before, the continuous mapping theorem ensures that, given a continuous mapping f : D (J × S) → R, under the null the test statistic f [Gn ] obeys f [Gn ] ⇒ f [G]. Another approach uses the process Gn (· , s) on J. Under the null, we have Gn ( · , s) ⇒ G( · , s), where G( · , s) is a Gaussian process such that for each p and p′ in J with p′ ≤ p, E [G( · , s)] = 0, and
E [G(p, s)G(p, s′ )]
E [G(p, s)G(p′ , s)]
=
ss′ p2 (1 − s)(1 − s′ )(1 − p)
{1 − s(1 − p)}{1 − s′ (1 − p)}{1 − ss′ (1 − p)}
s2 p′ (1 − s)2 (1 − p){1 − s(1 − p)} 2
.
(2)
=
{1 − s(1 − p)}{1 − s(1 − p′ )}2 {1 − s2 (1 − p)}
.
(5)
330
J.S. Cho, H. White / Journal of Econometrics 162 (2011) 326–344
ss′ min[p, p′ ]2 (1 − s)(1 − s′ )(1 − max[p, p′ ]){1 − s′ (1 − max[p, p′ ])}
{1 − s(1 − max[p, p′ ])}{1 − s′ (1 − min[p, p′ ])}2 {1 − ss′ (1 − max[p, p′ ])} Box I.
Given a continuous mapping f : D (J) → R, under the null we have f [Gn ( · , s)] ⇒ f [G( · , s)]. We call tests based on f [Gn (p, · )], f [Gn (· , s)], or f [Gn ] generalized runs tests (GR tests) to emphasize their lack of dependence on specific values of p and/or s. We summarize our discussion as Theorem 1. Given conditions A1, A2(i), A3, and H0 , (i) for each p ∈ I, Gn ( p, · ) ⇒ G(p, · ), and if f : C (S) → R is continuous, then f [Gn (p, · )] ⇒ f [G(p, · )]; (ii) for each s ∈ S, Gn ( · , s) ⇒ G( · , s), and if f : D (J) → R is continuous, then f [Gn ( · , s)] ⇒ f [G( · , s)]; (iii) Gn ⇒ G, and if f : D (J × S) → R is continuous, then f [Gn ] ⇒ f [G]. The proofs of Theorem 1(i, ii, and iii) are given in the Appendix. Although Theorem 1(i and ii) follow as corollaries of Theorem 1(iii), we prove Theorem 1(i and ii) first and use these properties as lemmas in proving Theorem 1(iii). Note that Theorem 1(i) holds even when p = 0, because Gn ( 0, · ) ≡ 0, and for every s, G(0, s) ∼ N (0, 0) = 0. We cannot allow p = 0 in Theorem 1(iii), however, because however large n is, there is always some p close to 0 for which the asymptotics break down. This necessitates our consideration of J instead of I in (iii). We remark that we do not specify f in order to allow researchers to form their own statistics based upon their particular interests. There are a number of popular mappings and justifications for these in the literature, especially those motivated by Bayesian interpretations. For example, Davies (1977) considers the mapping that selects the maximum of the random functions generated by nuisance parameters present only under the alternative. The motivation for this is analogous to that for the Kolmogorov (K ) goodness-of-fit statistic, namely, to test non-spurious peaks of the random functions. Bierens (1990) also proposes this choice for his consistent conditional moment statistic. Andrews and Ploberger (1994) study this mapping together with others, and propose a mapping that is optimal in a well-defined sense. Alternatively, Bierens (1982) and Bierens and Ploberger (1997) consider integrating the associated random functions with respect to the nuisance parameters, similar to the Smirnov (S) statistic. This is motivated by the desire to test for a zero constant mean function of the associated random functions. Below, we examine K - and Stype mappings for our Monte Carlo simulations. A main motivation for this is that the goodness-of-fit aspects of the transformed data tested via the PGF have interpretations parallel to those for the mappings used in Kolmogorov’s and Smirnov’s goodness-of-fit statistics. 2.3. Empirical generalized runs (EGR) tests We now consider the case in which θ ∗ is known, but the null CDF of Yt is unknown. This is a common situation when interest attaches to the behavior of raw data. As the null CDF is unknown, Gn cannot be computed. Nevertheless, we can proceed by replacing the unknown F with a suitable estimator. The empirical distribution function is especially convenient here. ∑n Specifically, for each y ∈ R, we define Fn (y) := 1n 1 { Yt ≤y} . t =1 This estimation requires modifying our prior definition of p-runs as follows: First, for each p ∈ I, let Tn (p) := {t ∈ N : Fn (Yt ) < p}, n (p) denote the (random) number of elements of let M Tn (p), and let
n (p). tn,i (p) denote the ith smallest element of Tn (p), i = 1, . . . , M n (p)/n⌋ = p.) We define the empirical p-runs as (Note that ⌊M tn,i (p), i = 1; Rn,i (p) := n (p). tn,i (p) − tn,i−1 (p), i = 2, . . . , M For each s ∈ S, define n (p) M 1 − Gn (p, s) := √ sRn,i (p) −
n i=1
sp
(6)
{1 − s(1 − p)}
, 1 , and Gn (p, s) := 0 otherwise. The presence of Fn leads to an asymptotic null distribution for
if p ∈
1 n
Gn different from that for Gn . We now examine this in detail. For convenience, for each p ∈ I, let qn (p) := inf{x ∈ R : Fn (x) ≥ p}, let pn (p) := F ( qn (p)), and abbreviate pn (p) as pn . Then (6) can be decomposed into two pieces as Gn = Wn + Hn , where for each ∑ n (p) R (p) n,i − s pn /{1 − s(1 − pn )}), and (p, s), Wn (p, s) := n−1/2 M ( s i =1 n (p)(s Hn (p, s) := n−1/2 M pn /{1 − s(1 − pn )} − sp/{1 − s(1 − p)}). Our next result relates Wn to the random function Gn , revealing Hn to be the contribution of the CDF estimation error. Lemma 2. Given conditions A1, A2(i), A3, and H0 , (i) sup(p,s)∈J×S |Wn (p, s) − Gn (p, s)| = oP (1); (ii) Hn ⇒ H , where H is a Gaussian process on J × S such that for each (p, s) and (p′ , s′ ) with p′ ≤ p, E [H (p, s)] = 0, and ss′ pp′ (1 − s)(1 − s′ )(1 − p) 2
E [H (p, s)H (p′ , s′ )] =
{1 − s(1 − p)}2 {1 − s′ (1 − p′ )}2
. (7)
(iii) (Wn , Hn ) ⇒ (G, H ), and for each (p, s) and (p′ , s′ ), E [G(p, s) H (p′ , s′ )] = −E [H (p, s)H (p′ , s′ )]. Lemma 2 relates the results of Theorem 1 to the unknown distribution function case. As Wn is asymptotically equivalent to Gn (as defined in the known F case), Hn must be the additional component incurred by estimating the empirical distribution function. To state our result for the asymptotic distribution of Gn , we let G be a Gaussian process on J × S such that for each (p, s) and (p′ , s′ ) with p′ ≤ p, E [ G(p, s)] = 0, and E [ G(p, s) G(p′ , s′ )] ss′ p′ (1 − s)2 (1 − s′ )2 (1 − p)2 2
=
{1 − s(1 − p)}2 {1 − s′ (1 − p′ )}2 {1 − ss′ (1 − p)}
.
(8)
The analog of Theorem 1 can now be given as follows. Theorem 2. Given conditions A1, A2(i), A3, and H0 , (i) for each p ∈ I, Gn (p, · ) ⇒ G(p, · ), and if f : C (S) → R is continuous, then f [ Gn (p, · )] ⇒ f [ G(p, · )]; (ii) for each s ∈ S, Gn ( · , s) ⇒ G( · , s), and if f : D (J) → R is continuous, then f [ Gn ( · , s)] ⇒ f [ G( · , s)]; (iii) Gn ⇒ G, and if f : D (J × S) → R is continuous, then f [ Gn ] ⇒ f [ G]. We call tests based on f [ Gn (p, · )], f [ Gn ( · , s)], or f [ Gn ] empirical generalized runs tests (EGR tests) to highlight their use of the empirical distribution function. We emphasize that the distributions of the GR and EGR tests differ, as the CDF estimation error survives in the limit, a consequence of the presence of the component Hn .
J.S. Cho, H. White / Journal of Econometrics 162 (2011) 326–344
2.4. EGR tests with nuisance parameter estimation Now we consider the consequences of estimating θ ∗ by θˆ n satisfying A4. As noted by Darling (1955), Sukhatme (1972), Durbin (1973) and Henze (1996), estimation can affect the asymptotic null distribution of GOF-based test statistics. Nevertheless, as we now show in detail, this turns out not to be the case here. We elaborate our notation to handle parameter estimation. Let ∑n ˆ Yˆn,t := h(Xt , θˆ n ) and let Fˆn (y) := 1n t 1{Yˆn,t ≤y} , so that Fn is the empirical CDF of Yˆn,t . Note that we replace θ ∗ with its estimate θˆ n to accommodate the fact that θ ∗ is unknown in this case. Thus, Fˆn contains two sorts of estimation errors: that arising from the empirical distribution and the estimation error for θ ∗ . Next, we define the associated runs using the estimates Yˆn,t and Fˆn . For each p in I, we now let Tˆn (p) := {t ∈ N : Fˆn (Yˆn,t ) < p}, let
ˆ n (p) denote the (random) number of elements of Tˆn (p), and let M ˆ n (p). tˆn,i (p) denote the ith smallest element of Tˆn (p), i = 1, . . . , M ˆ n (p)/n⌋ = p.) We define the parametric empirical p(Note that ⌊M runs as
Rˆ n,i (p) :=
tˆn,i (p), tˆn,i (p) − tˆn,i−1 (p),
i = 1; ˆ n (p). i = 2, . . . , M
ˆ n (p, s) := n−1/2 For each s ∈ S, define G
∑Mˆ n (p) i=1
(sRn,i (p) − sp/{1 − ˆ
ˆ n (p, s) := 0 otherwise. Note that s(1 − p)}) if p ∈ n , 1 , and G these definitions are parallel to those previously given. The only difference is that we are using {Yˆn,t : t = 1, 2, . . . , n} instead of {Yt : t = 1, 2, . . . , n}. To see why estimating θ ∗ has no asymptotic impact, we begin ˆ n as Gˆ n = G¨ n + H¨ n , where, letting by decomposing G qn (p) := inf{y ∈ R : Fn (y) ≥ p} and pn := F ( qn (p)) as above, we 1
¨ n (p, s) := n−1/2 define G
331
primary role in determining the asymptotic runs distribution. This also implies that when θˆ n is estimated and F is known, it may be computationally convenient to construct the runs using Fˆn instead of F . The analog of Theorems 1 and 2 is: Theorem 3. Given conditions A1–A4 and H0 ,
ˆ n (p, · ) ⇒ (i) for each p ∈ I, G G(p, · ), and if f : C (S) → R is ˆ n (p, · )] ⇒ f [ continuous, then f [G G(p, · )]; ˆ n ( · , s) ⇒ (ii) for each s ∈ S, G G( · , s), and if f : D (J) → R is ˆ n ( · , s)] ⇒ f [ continuous, then f [G G( · , s)]; ˆn ⇒ (iii) G G, and if f : D (J × S) → R is continuous, then ˆ n ] ⇒ f [ f [G G]. ˆ n ( p, · )], f [Gˆ n (·, s)], or f [Gˆ n ] parametric We call tests based on f [G empirical generalized runs tests (PEGR tests) to highlight their use of estimated parameters. By Theorem 3, the asymptotic null ˆ n ] is identical to that of f [ distribution of f [G Gn ], which takes θ ∗ as known. We remark that I appears in Lemma 3(ii) and ¨ n and Hn only involve the empirical Theorem 3(i) rather than J, as H distribution and not the distribution of runs. This is parallel to the results of Chen and Fan (2006) and Chan et al. (2009). They study semiparametric copula-based multivariate dynamic models and show that their pseudo-likelihood ratio statistic has an asymptotic distribution that depends on estimating the empirical distribution but not other nuisance parameters. The asymptotically ¨ n in Lemma 3 reflects the asymptotic influence of surviving H estimating the empirical distribution, whereas estimating the nuisance parameters has no asymptotic impact, as seen in Theorem 3.
∑Mˆ n (p)
ˆ (sRn,i (p) − s pn /{1 − s(1 − pn )}), ¨ n (p, s) := n−1/2 M ˆ n (p)(s and H pn /{1 − s(1 − pn )} − sp/{1 − s(1 − p)}). Note that this decomposition is also parallel to the previous decomposition, Gn = Wn + Hn . Our next result extends j=1
Lemma 2. Lemma 3. Given conditions A1–A4 and H0 ,
¨ n (p, s) − Gn (p, s)| = oP (1); (i) sup(p,s)∈J×S |G
¨ n (p, s) − Hn (p, s)| = oP (1). (ii) sup(p,s)∈I×S |H
ˆ n = Wn + Hn + oP (1) = Given Lemma 2(i), it becomes evident that G ˆ n coincides with that Gn + oP (1), so the asymptotic distribution of G of Gn , implying that the asymptotic runs distribution is primarily determined by the estimation error associated with the empirical distribution Fˆn and not by the estimation of θ ∗ . The intuition behind this result is straightforward. As Darling (1955), Sukhatme (1972), Durbin (1973) and Henze (1996) note, the asymptotic distribution of an empirical process, say p → Zˆn (p) := n1/2 {F (ˆqn (p)) − p}, p ∈ I, where qˆ n (p) := inf{y ∈ R : Fˆn (y) ≥ p}, is affected by parameter estimation error primarily because the empirical process Zˆn is constructed using the Yˆn,t := h(Xt , θˆ n ) and the differentiable function F . Because h contains not θ ∗ but θˆ n , the parameter estimation error embodied in θˆ n is transmitted to the asymptotic distribution of Zˆn through qˆ n and F . Thus, if we were to define runs as T¨n (p) := {t ∈ N : F (Yˆn,t ) < p}, then their asymptotic distribution would be affected by the parameter estimation error. Instead, however, our runs {Rˆ n,i } are constructed using Tˆn (p) := {t ∈ N : Fˆn (Yˆn,t ) < p}, which replaces F with Fˆn , a step function. Variation in θˆ n is less important in this case, whereas the estimation of F plays the
3. Simulating asymptotic critical values Obtaining critical values for test statistics constructed as functions of Gaussian processes can be challenging. Nevertheless, the rational polynomial structure of our statistics permits us to construct representations of G and G as infinite sums of independent Gaussian random functions. Straightforward simulations then deliver the desired critical values. Given that Theorems 1–3 do not specify the continuous mapping f , it is of interest to have methods yielding the asymptotic distributions of G and G rather than f [G] and f [ G] for a particular mapping f , as the latter distributions are easily obtained from the methods provided here once f is specified. To represent G and G, we use the Karhunen–Loève (K –L) representation (ch. 11 Loève, 1978) of a stochastic process. This represents Brownian motion as an infinite sum of sine functions multiplied by independent Gaussian random coefficients. Grenander (1981) describes this representation as a complete orthogonal system (CONS) and provides many examples. For example, Krivyakov et al. (1977) obtain the asymptotic critical values of von Mises’ ω2 statistic in the multi-dimensional case by applying this method. In econometrics, Phillips (1998) has used the K –L representation to obtain asymptotic critical values for testing cointegration. Andrews’ (2001) analysis of test statistics for a GARCH(1,1) model with nuisance parameter not identified under the null also exploits a CONS representation. By Theorem 2 of Jain and Kallianpur (1970), Gaussian processes with almost surely continuous paths have a CONS representation and can be approximated uniformly. We apply this result to our GR and (P)EGR test statistics; this straightforwardly delivers reliable asymptotic critical values.
332
J.S. Cho, H. White / Journal of Econometrics 162 (2011) 326–344 d
d
(ii) G = Z∗ = Z, and if f : D (J × S) → R is continuous, then
3.1. Generalized runs tests
d
A fundamental property of Gaussian processes is that two Gaussian processes have identical distributions if their covariance structures are the same. We use this fact to represent G(p, · ), G( · , s), and G as infinite sums of independent Gaussian processes that can be straightforwardly simulated. To obtain critical values for GR tests, we can use the Gaussian process Z∗ defined by
Z∗ (p, s) :=
sp(1 − s)B00 (p)
+
(1 − s)2 {1 − s(1 − p)}2
{1 − s(1 − p)}2 ∞ − × sj Bjs (p2 , (1 − p)1+j ),
(9)
j =1
where B00 is a Brownian bridge, and {Bjs : j = 1, 2, . . .} is a sequence of independent Brownian sheets, whose covariance structure is given by E [Bjs (p, q)Bis (p′ , q′ )] = 1{i=j} min[p, p′ ] · min[q, q′ ]. The arguments of Bjs lie only in the unit interval, and it is readily verified that E [Z∗ (p, s)Z∗ (p′ , s′ )] is identical to (4), so Z has the same distribution as G. An inconvenient computational aspect of Z∗ is that the terms s Bj require evaluation on a two-dimensional square, which is computationally demanding. More convenient in this regard is the Gaussian process Z defined by
Z(p, s) :=
sp(1 − s)B00 (p)
+
(1 − s)2 {1 − s(1 − p)}2 2
{1 − s(1 − p)}2 ∞ − × sj (1 − p)1+j Bj
p
(1 − p)1+j
j =1
As deriving the covariance structures of the relevant processes is straightforward, we omit the proof of Theorem 4 from the Appendix. 3.2. (P)EGR tests For the EGR statistics, we can similarly provide a Gaussian process whose covariance structure is the same as (8) and that can be straightforwardly simulated. By Theorem 3, this Gaussian process also yields critical values for PEGR test statistics. We begin with a representation for H . Specifically, consider sp(1−s) the Gaussian process X defined by X(p, s) := − {1−s(1−p)}2 B00 (p),
where B00 is a Brownian bridge as before. It is straightforward to show that when p′ ≤ p, E [X(p, s)X(p′ , s′ )] is the same as (7), implying that this captures the asymptotic distribution of the empirical distribution estimation error Hn , which survives to the limit. The representation Z for G in Theorem 4(ii) and the covariance structure for G and H required by Lemma 2(iii) together ∗ or Z defined by suggest representing G as Z
∗ (p, s) := Z
(p, s) := Z ,
(10)
E [Z(p, s)Z(p′ , s′ )] ss′ p′ (1 − s)(1 − s′ )(1 − p){1 − s′ (1 − p)}
{1 − s(1 − p)}{1 − s′ (1 − p′ )}2 {1 − ss′ (1 − p)}
.
p (s) := Z
This covariance structure is also identical to (4), so Z has the same distribution as G. The processes B00 and {Bj } are readily simulated as a consequence of Donsker’s (1951) theorem or the K –L representation (ch. 11 Loève, 1978), ensuring that critical values for any statistic f [Gn ] can be straightforwardly found by Monte Carlo methods. Although one can obtain asymptotic critical values for p-runs test statistics f [Gn (p, · )] by fixing p in (9) or (10), there is a much simpler representation for G(p, · ). Specifically, consider the j/2
process Zp defined by Zp (s) := j=0 s (1 − p) Zj , where {Zj } is a sequence of IID standard normals. It is readily verified that for each p, E [Zp (s)Zp (s′ )] is identical to (2). Because Zp does not involve the Brownian bridge, Brownian motions, or Brownian sheets, it is more efficient to simulate than Z(p, · ). This convenient representation arises from the symmetry of Eq. (4) in s and s′ when p = p′ . The fact that Eq. (4) is asymmetric in p and p′ when s = s′ implies that a similar convenient representation for G(· , s) is not available. Instead, we obtain asymptotic critical values for test statistics f [Gn ( · , s)], by fixing s in (9) or (10). We summarize these results as follows. d
∑∞
j
Theorem 4. (i) For each p ∈ I, G(p, · ) = Zp , and if f : C (S) → R d
is continuous, then f [G(p, · )] = f [Zp ];
(11)
∞ − (1 − s)2 p2 j 1 +j , (12) s ( 1 − p ) B j {1 − s(1 − p)}2 j=1 (1 − p)1+j
∗ (resp. Z ) is the sum of Z∗ (resp. Z) and X respectively, so that Z with the identical B00 in each. As is readily verified, (8) is the same ∗ (p′ , s′ )Z ∗ (p, s)] and E [Z (p′ , s′ )Z (p, s)]. Thus, simulating as E [Z (11) or (12) can deliver the asymptotic null distribution of Gn and ˆ n. G Similar to the previous case, the following representation is convenient when p is fixed:
2
sp(1−s)(1−p)1/2 {1−s(1−p)}
∞ − (1 − s)2 sj Bjs (p2 , (1 − p)1+j ) {1 − s(1 − p)}2 j=1
and
where {Bj : j = 1, 2, . . .} is a sequence of independent standard Brownian motions independent of the Brownian bridge B00 . It is straightforward to compute E [Z(p, s)Z(p′ , s′ )]. Specifically, if p′ ≤ p then
=
d
f [G] = f [Z∗ ] = f [Z].
sp(1 − s)(1 − p)1/2
×
{1 − s(1 − p)} ∞ j − s (1 − s) − p(1 − sj+1 ) {1 − s(1 − p)}
j =0
(1 − p)j/2 Zj .
(·, s) or For fixed s, we use the representation provided by Z ∗ (·, s). Z We summarize these results as follows. Theorem 5.
d
(i) H = X; d
p , and if f (ii) For each p ∈ I, G(p, · ) = Z
: C (S) → R is d continuous, then f [G(p, · )] = f [Zp ]; d d ∗ = , and if f : D (J × S) → R is continuous, then (iii) G=Z Z d d ]. f [G] = f [Z∗ ] = f [Z As deriving the covariance structures of the relevant processes is straightforward, we omit the proof of Theorem 5 from the Appendix. 4. Asymptotic local power Generalized runs tests target serially correlated autoregressive processes and/or independent heterogeneous processes violating A(i)–A(iii), as stated in Section 3. Nevertheless, runs tests are not always consistent against these processes, because just as for
J.S. Cho, H. White / Journal of Econometrics 162 (2011) 326–344
MGF-based GOF tests, PGF-based GOF tests cannot handle certain measure zero alternatives. We therefore examine whether the given (P)EGR test statistics have nontrivial power under specific local alternatives. To study this, we consider a first-order Markov process under which (P)EGR test statistics have nontrivial power when the convergence rate of the local alternative to the null is n−1/2 . Another motivation for considering this local alternative is to show that (P)EGR test statistics can have local power directly comparable to that of standard parametric methods. We consider first-order Markov processes for conciseness. The test can also be shown to have local power against higher-order Markov processes. Our results for first-order processes provide heuristic support for this claim, as higher-order Markov processes will generally exhibit first-order dependence. A test capable of detecting true first-order Markov structure will generally be able to detect apparent firstorder structure, as well. The situation is analogous to the case of autoregression, where tests for AR(1) structure are generally also sensitive to AR(p) structures, p > 1. We provide some additional discussion below in the simulation section. To keep our presentation succinct, we focus on EGR test statistics in this section. We saw above that the distribution theory for EGR statistics applies to PEGR statistics. This also holds for local power analysis. For brevity, we omit a formal demonstration of this fact here. We consider a double array of processes {Yn,t }, and we let Fn,t denote the smallest σ -algebra generated by {Yn,t , Yn,t −1 , . . . , }. We suppose that for each n, {Yn,1 , Yn,2 , . . . , Yn,n } is a strictly stationary and geometric ergodic first-order Markov process having transition probability distributions P(Yn,t +1 ≤ y|Fn,t ) with the following Lebesgue–Stieltjes differential:
Hℓ1 : dFn (y|Fn,t ) = dF (y) + n−1/2 dD(y, Yn,t )
(13)
under the local alternative, where we construct the remainder term to be oP (n−1/2 ) uniformly in y. For this, we suppose that D(·, Yn,t ) is a signed measure with properties specified in A5, and that for a suitable signed measure Q with Lebesgue–Stieltjes differential dQ , Yn,t has marginal Lebesgue–Stieltjes differential dFn (y) = dF (y) + n−1/2 {dQ (y) + o(1)}.
(14)
We impose the following formal condition. A5 (Local alternative): (i) For each n = 1, 2, . . . , {Yn,1 , Yn,2 , . . . , Yn,n } is a strictly stationary and geometric ergodic first-order Markov process with transition probability distributions given by Eq. (13) and marginal distributions given by Eq. (14), where (ii) D : R × R → R is a continuous function such that D(·, z ) defines a signed measure for each z ∈ R; (iii) supx |D(x, Yn,t )| ≤ Mn,t such that E [Mn,t ] ≤ ∆ < ∞ uniformly in t and n, and limy→±∞ D(y, Yn,t ) = 0 a.s.∞ P uniformly in t and n; (iv) supy −∞ |D(y, x)|dF (x) ≤ ∆ and supy |
∞ y
D(y, x)dD(x, Yn,t )| ≤ Mn,t for all t and n.
Thus, as n tends to infinity, {Yn,1 , Yn,2 , . . .} converges in distribution to an IID sequence of random variables with marginal distribution F . Note that the marginal distribution given in Eq. (14) is obtained by substituting the conditional distribution of Yn,t −j+1 |Fn,t −j (j = 1, 2, . . .) into (13) and integrating with respect to the random variables other than Yn,t . For example, ∞ D ( y , z )dF (z ) = Q (y). This implies that the properties of Q −∞ are determined by those of D. For example, limy→∞ Q (y) = 0 and supy |Q (y)| ≤ ∆. Our motivations for condition A5 are as follows. We impose the first-order Markov condition for conciseness. Higher-order Markov processes can be handled similarly. Assumption A5(i) also implies that {Yn,t } is an ergodic β -mixing process by Theorem 1 of Davydov (1973). Next, assumptions A5(ii, iii) ensure that Fn ( · |Fn,t ) is a proper distribution for all n almost surely, corresponding to A3.
333
Finally, assumptions A5(iii) and (iv) asymptotically control certain remainder terms of probabilities relevant to runs. Specifically, applying an induction argument yields that for each k = 1, 2, . . . ,
P(Yn,t +1 ≥ y, . . . , Yn,t +k−1 ≥ y, Yn,t +k < y|Fn,t ) = p(1 − p)k−1 1
+ √ hk (p, Yn,t ) + rk (p, Yn,t ),
(15)
n
where p is a short-hand notation for F (y); for each p, h1 (p, Yn,t ) := C (p, Yn,t ) := D(F −1 (p), Yn,t ); h2 (p, Yn,t ) := w(p) − p C (p, Yn,t ); and for k = 3, 4, . . . , hk (p, Yn,t ) := w(p)(1 − p)k−3 (1 − (k − k−2 −1 1 )∞p) − p(1 − p) C (p, Yn,t ), where w(p) := α(F (p)) := D ( y , x ) dF ( x ) . Here, the remainder term r ( p , Y ) k n,t is sequeny tially computed using previous remainder terms and hk (p, Yn,t ). For for given p, r1 (p, Yn,t ) = 0, r2 (p, Yn,t ) := n−1 ∞ example, −1 D(F (p), x)dD(x, Yn,t ), and so forth. These remainder F −1 (p)
terms turn out to be OP (n−1 ), mainly due to assumptions A5(iii) and (iv). Runs distributions can also be derived from (15), with asymptotic behavior controlled by assumptions A5(iii) and (iv). That is, if Yn,t < y, then the distribution of a run starting from Yn,t +1 , say Rn,i (p), can be obtained from (15) as
P(Rn,i (p) = k)
= P(Yn,t +1 ≥ y, Yn,t +2 ≥ y, . . . , Yn,t +k < y|Yn,t < y) = p(1 − p)k−1 + n−1/2 Fn (F −1 (p))−1 hn,k (p) + Fn (F −1 (p))−1 rn,k (p),
(16)
F −1 (p)
hk (p, x)
hk (p, x)dF (x) and rn,k (p) :=
F −1 (p)
where, as n tends to infinity, for each k, hn,k (p) := −∞ dFn (x) → hk (p) :=
F −1 (p) −∞
rk (p, x)dFn (x) → rk (p) :=
F −1 (p) −∞
−∞
rk (p, x)dF (x); and for each
p, Fn (F −1 (p)) → p from assumptions A5(iii) and (iv). Further, the remainder term rk (p) is OP (n−1 ), uniformly in p. The local power of EGR test statistics stems from the difference between the distribution of runs given in Eq. (16) and that obtained under the null. Specifically, the second component on the righthand side (RHS) of (16) makes the population mean of Gn different from zero, so that the limiting distribution of Gn corresponding to that obtained under the null can be derived when its population mean is appropriately adjusted. This non-zero population mean yields local power for n−1/2 local alternatives for the EGR test statistics as follows. Theorem 6. Given conditions A1, A2(i), A3, A5, and Hℓ1 , Gn − µ ⇒ G, where for each (p, s) ∈ J × S, µ(p, s) := ps(1 − s){sw(p) − s(1−s)
F −1 (p)
Q (F −1 (p))}/{1 − s(1 − p)}2 + {1−s(1−p)} −∞
C (p, z )dF (z ).
It is not difficult to specify DGPs satisfying condition A5. For example, an AR(1) process can be constructed so as to belong to this case. That is, if for each t , Yn,t := n−1/2 Yn,t −1 + εt and εt ∼ IID N (0, 1), then we can let C (p, Yn,t ) = −ξ (p)Yn,t + oP (1) and w(p) = −ξ (p)2 , where ξ (p) := φ[Φ −1 (p)], and φ( · ) and Φ ( · ) are the probability density function (PDF) and CDF of a standard normal random variable. This gives µ(p, s) = {ξ (p)2 s(1 − s)2 }/{1 − s(1 − p)}2 , with Q ≡ 0. Because we have convergence rate n−1/2 , the associated EGR test statistics have the same convergence rate as the parametric local alternative. We point out several implications of Theorem 6. First, if the convergence rate in (13) is lower than 1/2, the EGR test may not have useful power; EGR tests are not powerful against every alternative to H′0 . For EGR tests to be consistent against firstorder Markov processes, the rate must be at least 1/2. Second, the statement for first-order Markov process can be extended to further higher-order Markov processes, although we do not pursue
334
J.S. Cho, H. White / Journal of Econometrics 162 (2011) 326–344
this here for brevity. Theorem 6 therefore should be understood as a starting point for identifying Markov processes as a class of n−1/2 -alternatives. Finally, the result of Theorem 6 does not hold for every local alternative specification. Our examination of a variety of other local alternative specifications reveals cases in which EGR tests have nontrivial power at the rate n−1/4 . For example, certain independent and non-identically distributed (INID) DGPs can yield EGR test statistics exhibiting n−1/4 rates. This rate arises because analysis of these cases requires an expansion of the conditional distribution of runs of order higher than that considered in Theorem 6. For brevity, we do not examine this further here. 5. Monte Carlo simulations In this section, we use Monte Carlo simulation to obtain critical values for test statistics constructed with f delivering the L1 (S-type) and uniform (K -type) norms of its argument. We also examine the level and power properties of tests based on these critical values. 5.1. Critical values p
We consider the following statistics: T1,n (S1 ) := S |Gn (p, s)| 1 p 1p,n (S1 ) := ds, T∞,n (S1 ) := sups∈S1 |Gn (p, s)|, T | G n (p, s)|ds, S
1
p
T∞,n (S1 ) := sups∈S1 | Gn (p, s)|, where S1 := [−0.99, 0.99], and p ∈ s {0.1, 0.3, 0.5, 0.7, 0.9}; T1s,n := I |Gn (p, s)|dp, T∞, n := supp∈I s s |Gn (p, s)|, T1,n := |Gn (p, s)|dp, T∞,n := supp∈I | Gn (p, s)|, I where s ∈ {− 0 . 5 , − 0 .3, −0.1, 0.1, 0.3, 0.5}; and T1,n (S) := |G (p, s)|dsdp, T∞,n (S) := sup(p,s)∈I×S |Gn (p, s)|, T1,n (S) := I S n ∞,n (S) := sup(p,s)∈I×S | | Gn (p, s)|dsdp, T Gn (p, s)|, where we I S consider S1 := [−0.99, 0.99] and S2 := [−0.50, 0.50] for S. As discussed above, these S- and K -type statistics are relevant for researchers interested in testing for non-zero constant mean function and non-spurious peaks of Gn on I × S in terms of T1,n (S) and T∞,n (S), respectively. Note that these test statistics are constructed using I instead of J. There are two reasons for doing this. First, we want to examine the sensitivity of these test statistics to p. We have chosen the extreme case to examine the levels of the test statistics. Second, as pointed out by Granger (1963) and Dufour (1981), more alternatives can be handled by specifying a larger space for p. Theorems 4 and 5 ensure that the asymptotic null distributions of these statistics can be generated by simulating Zp , Z (or Z∗ ), p , and Z (or Z ∗ ), as suitably transformed. We approximate these Z using
Wp (s) :=
50 sp(1 − s)(1 − p)1/2 −
{1 − s(1 − p)}
W (p, s) :=
p (s) := W
sp(1 − s)
sj (1 − p)j/2 Zj ,
j =0
(1 − s)2 {1 − s(1 − p)} {1 − s(1 − p)}2 40 − p2 j × sj (1 − p)1+j B , (1 − p)1+j j =1 00 (p) + B 2
sp(1 − s)(1 − p)1/2
{1 − s(1 − p)} 50 − p j × s − (1 − p)j/2 Zj , and {1 − s(1 − p)} j =0 40 − (1 − s)2 p2 j 1+j (p, s) := W s ( 1 − p ) B , j {1 − s(1 − p)}2 j=1 (1 − p)1+j
00 (p) := W0 (p) − pW0 (1), B j (x + p) := respectively, where B ∑x Wx+1 (p) + k=1 Wk (1) (with x ∈ N, and p ∈ I), and {Wk : k = 0, 1, 2, . . .} is a set of independent processes approximating Brownian motion using the K –L representation, defined as √ ∑ (k) 100 Wk (p) := 2 ℓ=1 {sin[(ℓ − 1/2)π p]}Zℓ /{(ℓ − 1/2)π }, where (j)
Zℓ ∼ IID N (0, 1) with respect to ℓ and j. We evaluate these functions for I, S1 , and S2 on the grids {0.01, 0.02, . . . , 1.00}, {−0.99, −0.98, . . . , 0.98, 0.99}, and {−0.50, −0.49, . . . , 0.49, 0.50}, respectively. Concerning these approximations, several comments are in order. First, the domains for p and s are approximated using a relatively fine grid. Second, we truncate the sum of the independent Brownian motions at 40 terms. The jth term contributes a random component with a standard deviation of sj p(1 − p)(1+j)/2 , which vanishes quickly as j tends to infinity. Third, j on the positive Euclidean line by the Brownian we approximate B motion on [0, 10,000]. Preliminary experiments showed the impact of these approximations to be small when S is appropriately chosen; we briefly discuss certain aspects of these experiments below. Table 1 contains the critical values generated by 10,000 replications of these processes. 5.2. Level and power of the test statistics In this section, we compare the level and power of generalized runs tests with other tests in the literature. We conduct two sets of experiments. The first examines power against dependent alternatives. The second examines power against structural break alternatives. To examine power against dependent alternatives, we follow Hong and White (2005) and consider the following DGPs:
• • • • • • • • •
DGP 1.1: Xt := εt ; DGP 1.2: Xt := 0.3Xt −1 + εt ; 1/2 DGP 1.3: Xt := ht εt , and ht = 1 + 0.8Xt2−1 ; 1/2
DGP 1.4: Xt := ht εt , and ht = 0.25 + 0.6ht −1 + 0.5Xt2−1 1{εt −1 <0} + 0.2Xt2−1 1{εt −1 ≥0} ; DGP 1.5: Xt := 0.8Xt −1 εt −1 + εt ; DGP 1.6: Xt := 0.8εt2−1 + εt ; DGP 1.7: Xt := 0.4Xt −1 1{Xt −1 >1} − 0.5Xt −1 1{Xt −1 ≤1} + εt ; DGP 1.8: Xt := 0.8|Xt −1 |0.5 + εt ; DGP 1.9: Xt := sgn(Xt −1 ) + 0.43εt ;
where εt ∼ IID N (0, 1). Note that DGP 1.1 satisfies the null hypothesis, whereas the other DGPs represent interesting dependent alternatives. As there is no parameter estimation, we apply our EGR statistics and compare these to the entropy-based nonparametric statistics of Robinson (1991), Skaug and Tjøstheim (1996) and Hong and White (2005), denoted as Rn , STn , and HWn , respectively. We present the results in Tables 2–4. To summarize, the EGR test statistics generally show approximately correct levels, even 1s,n exhibits level using I instead of J. We notice, however, that T distortion when s gets close to one. This is mainly because the number of Brownian motions in the approximation is finite, and these are defined on the finite Euclidean positive real line, [0, 10,000]. If s and p are close to one and zero, respectively, then the approximation can be coarse. Specifically, the given finite number of Brownian motions may not enough to adequately approximate the desired infinite sum of Brownian motions, and the given finite domain [0, 10,000] may be too small to adequately approximate the positive Euclidean real line. For the other tests, we do not observe similar level distortions. For the DGPs generating alternatives to H0 , the EGR tests generally gain power as n increases. As noted by Granger (1963)
J.S. Cho, H. White / Journal of Econometrics 162 (2011) 326–344
335
Table 1 Asymptotic critical values of the test statistics. Statistics \ level p
T1,n (S1 )
p
T∞,n (S1 )
T1s,n
s T∞, n
p = 0.1 0.3 0.5 0.7 0.9 p = 0.1 0.3 0.5 0.7 0.9 s = −0.5 −0.3 −0.1 0.1 0.3 0.5 s = −0.5 −0.3 −0.1 0.1 0.3 0.5
T1,n (S1 ) T∞,n (S1 ) T1,n (S2 ) T∞,n (S2 )
1%
5%
10%
Statistics \ level
0.2474 0.5892 0.8124 0.9007 0.7052 0.7483 1.3517 1.6846 1.7834 1.3791 0.3114 0.1698 0.0514 0.0466 0.1246 0.1769 0.8091 0.4349 0.1282 0.1124 0.2898 0.3962
0.1914 0.4512 0.6207 0.6841 0.5329 0.5683 1.0237 1.2818 1.3590 1.0486 0.2439 0.1330 0.0402 0.0361 0.0957 0.1356 0.6885 0.3650 0.1072 0.0939 0.2416 0.3304
0.1639 0.3828 0.5239 0.5763 0.4478 0.4750 0.8582 1.0728 1.4000 0.8839 0.2152 0.1164 0.0351 0.0315 0.0836 0.1183 0.6160 0.3267 0.0960 0.0840 0.2154 0.2948
T1,n (S1 )
0.4571 2.2956 0.1219 0.8091
0.3547 1.9130 0.0955 0.6885
0.3124 1.7331 0.0836 0.6160
T1,n (S1 ) T∞,n (S1 ) T1,n (S2 ) T∞,n (S2 )
Table 2 Level simulation at 5% level (in percent, 10,000 iterations). Statistics \n p T1,n (S1 )
p
T∞,n (S1 )
T1s,n
s T∞, n
DGP 1.1 p = 0.1 0.3 0.5 0.7 0.9 p = 0.1 0.3 0.5 0.7 0.9 s = −0.5 −0.3 −0.1 0.1 0.3 0.5 s = −0.5 −0.3 −0.1 0.1 0.3 0.5
T1,n (S1 ) T∞,n (S1 ) T1,n (S2 ) T∞,n (S2 ) Rn a STn a HWn a
100
300
500
4.05 5.02 5.22 5.34 4.35 3.86 3.54 5.42 7.00 4.21 4.36 4.32 4.23 4.19 4.01 3.85 4.90 4.92 5.15 4.89 5.11 5.05
4.71 4.95 4.74 4.90 6.63 4.44 4.65 4.40 5.03 6.27 4.72 4.77 4.71 4.64 4.08 3.63 6.07 6.11 6.29 6.22 5.76 5.18
5.09 4.61 4.87 5.31 6.07 4.49 4.59 5.05 5.08 4.71 4.26 4.08 3.83 3.68 3.25 2.55 5.95 5.86 6.13 5.94 6.06 5.57
3.71 4.91 4.06 4.77 4.7 5.1 6.5
4.21 6.40 4.65 5.94
3.60 6.65 4.12 5.74
a These results are those given in Hong and White (2005). Their number of replications is 1000.
and Dufour (1981), a particular selection of p or, more generally, the choice of mapping f can yield tests with better or worse 1p,n (S1 ) (resp. T1s,n )-based power. Generally, we see that the T
p
p
T∞,n (S1 )
T1s,n
s T∞, n
p = 0.1 0.3 0.5 0.7 0.9 p = 0.1 0.3 0.5 0.7 0.9 s = −0.5 −0.3 −0.1 0.1 0.3 0.5 s = −0.5 −0.3 −0.1 0.1 0.3 0.5
1%
5%
10%
0.2230 0.5004 0.6413 0.6065 0.3066 0.7454 1.3239 1.5909 1.5019 0.7912 0.2197 0.1124 0.0313 0.0262 0.0631 0.0780 0.6229 0.3107 0.0864 0.0714 0.1718 0.2175
0.1727 0.3842 0.4886 0.4632 0.2356 0.5677 1.0069 1.1990 1.1360 0.6028 0.1785 0.0905 0.0254 0.0210 0.0504 0.0625 0.5319 0.2668 0.0734 0.0604 0.1454 0.1869
0.1420 0.3225 0.4092 0.3889 0.1973 0.4750 0.8441 1.0091 0.9617 0.5060 0.1587 0.0803 0.0225 0.0187 0.0446 0.0552 0.4799 0.2412 0.0670 0.0547 0.1306 0.1684
0.3080 2.0411 0.0725 0.6229
0.2440 1.7101 0.0590 0.5319
0.2187 1.5615 0.0523 0.4799
choices for p often yield better power. Also, in general, the power 1,n (S2 )-based tests are midway between performances of the T 1s,n -based tests. Apart those of the best and worst cases for the T from these observations, there is no clear-cut relation between 1p,n (S1 )-based tests and the T1,n (S1 )-based tests. The more the T
1p,n (S1 )-based tests dominate the T1,n (S1 )-based tests powerful T 1,n (S1 )-based tests dominate for DGPs for DGPs 1.3–1.5, but the T 1.2, and 1.6–1.9. Comparing EGR tests to the entropy-based tests, we observe 1,n (S)-based tests or T1s,n (S2 )-based three notable features. First, T tests generally dominate entropy-based tests for DGP 1.2 and 1.8. Second, for DGPs 1.3, 1.6, and 1.7, entropy-based tests are more powerful than the EGR tests. Finally, for the other DGPs, the best powered EGR tests exhibit power roughly similar to that of the best powered entropy-based tests. Such mixed results are common in the model specification testing literature, especially in nonparametric contexts where there is no generally optimal test. For example, Fan and Li (2000) compare the power properties of specification tests using kernelbased nonparametric statistics with Bierens and Ploberger’s (1997) integrated conditional moment (ICM) tests. They find that these tests are complementary, with differing power depending on the type of local alternative. Similarly, the entropy-based and EGR tests can also be used as complements. In addition, we conducted Monte Carlo simulations for higherorder Markov processes. As the results are quite similar to those in Tables 3 and 4, we omit them for brevity. For structural break alternatives, we compare our PEGR tests to a variety of well-known tests. These include Feller’s (1951) and Kuan and Hornik’s (1995) RR test, Brown et al. ’s (1975) RE-CUSUM test, Sen’s (1980) and Ploberger et al. ’s (1989) RE test, Ploberger and Krämer’s (1992) OLS-CUSUM test, Andrews’ (1993) Sup-W test, Andrews and Ploberger’s (1994) Exp-W and Avg-W tests, and Bai’s (1996) M-test.1 As these are all designed to test for a single structural break at an unknown point, they may not perform well
p
s ∞,n (S1 ) (resp. T∞, tests outperform the T n )-based tests. Similarly, 1,n (S)-based tests outperform the T∞,n (S)-based tests for the T 1p,n (S1 )-based tests, more extreme both S1 and S2 . Among the T
1 We also examined Chu et al. ’s (1995a) ME test and Chu et al. ’s (1995b) REMOSUM and OLS-MOSUM tests. Their performance is comparable to that of the other prior tests, so for brevity we do not report those results here.
336
J.S. Cho, H. White / Journal of Econometrics 162 (2011) 326–344
Table 3 Power simulation at 5% level (in percent, 3000 iterations). Statistics \n p T1,n
( S1 )
p
T∞,n (S1 )
T1s,n
s T∞, n
DGP 1.2 p = 0.1 0.3 0.5 0.7 0.9 p = 0.1 0.3 0.5 0.7 0.9 s = −0.5 −0.3 −0.1 0.1 0.3 0.5 s = −0.5 −0.3 −0.1 0.1 0.3 0.5
T1,n (S1 ) T∞,n (S1 ) T1,n (S2 ) T∞,n (S2 ) Rn a STn a HWn a a
DGP 1.3
DGP 1.4
DGP 1.5
100
200
100
200
100
200
100
200
20.67 33.27 36.60 34.13 23.77 6.60 10.57 24.07 29.53 22.47 62.56 67.70 70.06 71.30 70.30 67.36 43.80 49.66 55.63 59.70 61.40 58.40
33.17 59.37 65.17 60.30 38.07 6.67 23.97 42.87 41.67 22.63 99.83 99.86 99.73 99.70 99.60 99.46 79.50 83.96 87.80 89.06 88.30 84.83
29.50 8.43 5.33 9.33 30.30 7.10 3.60 5.77 9.80 27.30 13.53 14.66 15.66 16.46 17.30 19.93 7.83 8.50 9.83 11.20 14.33 22.36
45.00 10.57 5.47 11.37 51.57 6.33 5.40 6.13 6.87 32.67 90.90 92.06 92.73 93.40 93.46 93.33 13.46 13.66 15.33 17.66 24.86 37.23
27.53 8.77 5.20 5.37 15.40 5.03 3.70 4.90 6.93 14.43 8.40 9.33 10.53 11.86 13.93 18.63 5.70 6.60 8.03 10.10 15.46 23.60
47.47 12.07 5.80 6.10 23.97 5.37 5.40 6.33 4.20 13.30 11.63 12.96 15.30 18.10 23.50 32.80 6.10 6.76 9.16 14.43 25.60 46.33
5.80 9.77 18.77 76.17 75.37 3.63 5.70 16.43 72.97 72.30 81.73 80.30 76.86 69.66 58.50 44.90 74.46 71.40 66.43 55.06 40.76 22.43
9.27 15.40 32.93 97.33 97.10 4.73 11.00 26.70 92.77 89.87 99.30 99.33 98.90 97.73 94.00 82.73 98.73 98.03 96.53 91.33 79.13 50.30
59.73 27.30 97.03 44.03 13.8 12.4 14.0
90.93 61.56 99.86 80.50 25.4 22.0 27.0
11.96 5.83 16.36 7.20 26.4 61.2 37.6
23.13 10.70 26.36 13.60 52.2 90.0 67.6
9.40 4.70 75.60 5.56 15.0 27.8 20.6
13.73 6.46 86.06 7.26 7.2 52.0 35.2
14.60 18.13 82.30 13.76 59.8 81.6 69.6
27.33 34.33 92.40 18.56 75.4 98.4 95.6
These results are those given in Hong and White (2005). Their number of replications is 500.
Table 4 Power simulation at 5% level (in percent, 3000 iterations). Statistics \n p T1,n (S1 )
p
T∞,n (S1 )
T1s,n
s T∞, n
T1,n (S1 ) T∞,n (S1 ) T1,n (S2 ) T∞,n (S2 ) Rn a STn a HWn a a
DGP 1.6 p = 0.1 0.3 0.5 0.7 0.9 p = 0.1 0.3 0.5 0.7 0.9 s = −0.5 −0.3 −0.1 0.1 0.3 0.5 s = −0.5 −0.3 −0.1 0.1 0.3 0.5
DGP 1.7
DGP 1.8
DGP 1.9
100
200
100
200
100
200
100
200
1.30 9.93 8.33 30.87 26.43 3.73 6.00 8.47 33.63 25.80 25.13 25.13 22.40 20.13 16.90 12.36 23.13 22.03 20.50 17.43 14.16 9.83
6.83 16.33 10.87 57.17 41.27 5.73 10.47 9.90 47.20 29.60 60.93 60.10 56.20 52.10 43.73 32.30 54.23 50.46 45.50 38.26 30.10 21.40
1.00 18.90 9.27 18.07 24.00 4.33 8.93 8.10 16.67 22.53 21.96 24.13 25.16 25.10 23.10 17.53 20.80 21.50 23.03 21.96 20.50 14.90
5.33 33.67 11.37 28.73 39.03 4.30 16.23 9.83 18.17 22.70 51.46 55.93 58.93 59.96 56.46 47.60 43.36 46.80 49.93 48.36 45.70 38.26
2.77 17.73 31.77 33.60 22.37 3.77 5.83 20.47 30.03 20.63 50.13 54.26 55.93 56.20 53.33 48.93 40.46 45.56 49.60 51.20 49.70 43.86
3.73 32.57 59.80 58.43 34.63 3.67 12.80 36.93 39.93 19.97 83.03 86.16 87.53 87.26 84.90 80.23 75.83 79.86 82.36 82.73 80.60 71.43
27.83 39.47 43.13 39.97 26.30 9.50 22.23 34.60 37.83 24.83 56.36 57.43 57.80 58.46 59.13 59.50 55.70 57.20 57.83 58.56 59.33 59.53
47.70 64.23 68.00 62.13 37.50 14.53 44.33 59.23 54.90 29.80 80.73 81.10 81.46 81.96 82.13 82.20 80.06 80.83 81.53 82.00 82.40 82.56
25.90 22.36 22.80 22.80 31.8 34.8 34.0
59.36 50.56 56.40 53.70 65.2 72.8 74.0
19.43 14.50 23.63 19.66 24.6 25.8 25.6
46.10 29.10 57.23 41.50 80.8 86.8 85.4
47.96 26.36 54.90 39.63 14.2 13.4 17.0
81.50 55.80 86.40 76.36 34.6 23.8 26.2
56.33 50.26 58.70 54.60 60.2 55.8 60.8
79.96 75.60 81.40 79.73 84.0 79.8 84.6
These results are those given in Hong and White (2005). Their number of replications is 500.
when there are multiple breaks. In contrast, our PEGR statistics are designed to detect general alternatives to IID, so we expect that these may perform well in such situations.
We consider the following DGPs for our Monte Carlo simulations. These have been chosen to provide a test bed in which the behaviors of the various tests can be clearly contrasted.
J.S. Cho, H. White / Journal of Econometrics 162 (2011) 326–344
337
Table 5. Level simulation at 5% level (in percent, 10,000 iterations), for DGPs 2.1 and 2.2 with n ∈ {100, 300, 500}. The statistics reported are T^p_{1,n}(S_1) and T^p_{∞,n}(S_1) for p ∈ {0.1, 0.3, 0.5, 0.7, 0.9}; T^s_{1,n} and T^s_{∞,n} for s ∈ {−0.5, −0.3, −0.1, 0.1, 0.3, 0.5}; T_{1,n}(S_1), T_{∞,n}(S_1), T_{1,n}(S_2), T_{∞,n}(S_2); and the comparison statistics M_n, RE_n, RR_n, SupW_n, AvgW_n, ExpW_n, RE-CUSUM_n, and OLS-CUSUM_n. [The numeric panel is not reproduced here, as the cell alignment was lost in extraction.]
DGP 2.1: Y_t := Z_t + ε_t;
DGP 2.2: Y_t := exp(Z_t) + ε_t;
DGP 2.3: Y_t := 1{t > ⌊0.5n⌋} + ε_t;
DGP 2.4: Y_t := Z_t 1{t ≤ ⌊0.5n⌋} − Z_t 1{t > ⌊0.5n⌋} + ε_t;
DGP 2.5: Y_t := Z_t 1{t ≤ ⌊0.3n⌋} − Z_t 1{t > ⌊0.3n⌋} + ε_t;
DGP 2.6: Y_t := Z_t 1{t ≤ ⌊0.1n⌋} − Z_t 1{t > ⌊0.1n⌋} + ε_t;
DGP 2.7: Y_t := exp(Z_t) 1{t ≤ ⌊0.5n⌋} + exp(−Z_t) 1{t > ⌊0.5n⌋} + ε_t;
DGP 2.8: Y_t := Z_t 1{t ∈ K_n(0.02)} − Z_t 1{t ∉ K_n(0.02)} + ε_t;
DGP 2.9: Y_t := Z_t 1{t ∈ K_n(0.05)} − Z_t 1{t ∉ K_n(0.05)} + ε_t;
DGP 2.10: Y_t := Z_t 1{t ∈ K_n(0.1)} − Z_t 1{t ∉ K_n(0.1)} + ε_t;
DGP 2.11: Y_t := Z_t 1{t odd} − Z_t 1{t even} + ε_t;
DGP 2.12: Y_t := exp(0.1 Z_t) 1{t odd} + exp(Z_t) 1{t even} + ε_t,
where Z_t = 0.5 Z_{t−1} + u_t; (ε_t, u_t)′ ∼ IID N(0, I_2); and K_n(r) := {t = 1, …, n : (k − 1)/r + 1 ≤ t ≤ k/r, k = 1, 3, 5, …}. For DGPs 2.1, 2.4-2.6, and 2.8-2.11, we use ordinary least squares (OLS) to estimate the parameters of the linear model Y_t = α + β Z_t + v_t, and we apply our PEGR statistics to the prediction errors v̂_t := Y_t − α̂ − β̂ Z_t. For DGP 2.3, we specify the model Y_t = α + v_t, and we apply our PEGR statistic to Y_t − n^{−1} ∑_{t=1}^{n} Y_t. The linear model is correctly specified for DGP 2.1, but is misspecified for DGPs 2.3-2.6 and 2.8-2.11. Thus, when DGP 2.1 is considered, the null hypothesis holds, permitting an examination of the level of the tests. As the model is misspecified for DGPs 2.3-2.6 and 2.8-2.11, the alternative holds for v̂_t, permitting an examination of power. DGPs 2.3-2.6 exhibit a single structural break at different break points, permitting us to see how the PEGR tests compare to standard structural break tests specifically designed to detect such alternatives. DGPs 2.8 through 2.11 are deterministic mixtures in which the true coefficient of Z_t depends on whether or not t belongs to a particular structural regime. A simulation sketch of these designs is given below.
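For readers who wish to reproduce the flavor of these experiments, the following minimal Python sketch (our illustration, not the authors' code; the helper names are ours, and Z_t is initialized at zero for simplicity) simulates a few of the DGPs above and forms the OLS prediction errors v̂_t to which the PEGR statistics would be applied.

```python
import numpy as np

def simulate_dgp(n, dgp, rng):
    """Simulate Y_t and Z_t for a few of the DGPs above (a sketch).

    Z_t = 0.5 Z_{t-1} + u_t with (eps_t, u_t) ~ IID N(0, I_2); Z_0 = 0.
    """
    eps = rng.standard_normal(n)
    u = rng.standard_normal(n)
    z = np.zeros(n)
    for t in range(1, n):
        z[t] = 0.5 * z[t - 1] + u[t]
    t_idx = np.arange(1, n + 1)
    if dgp == "2.1":              # correctly specified linear model
        y = z + eps
    elif dgp == "2.3":            # mean shift at mid-sample
        y = (t_idx > n // 2).astype(float) + eps
    elif dgp == "2.4":            # slope sign flip at mid-sample
        y = np.where(t_idx <= n // 2, z, -z) + eps
    else:
        raise ValueError(dgp)
    return y, z

rng = np.random.default_rng(0)
y, z = simulate_dgp(200, "2.4", rng)
# OLS fit of the (here misspecified) linear model Y_t = alpha + beta Z_t + v_t
X = np.column_stack([np.ones_like(z), z])
alpha_hat, beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
v_hat = y - alpha_hat - beta_hat * z   # prediction errors fed to the PEGR tests
```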
The number of structural breaks increases as the sample size increases, but the proportion of breaks to the sample size is constant. Also, the break points are equally spaced. Thus, for example, there are four break points in DGP 2.9 when the sample size is 100 and nine break points when the sample size is 200 (a count that can be verified directly from the definition of K_n(r); see the snippet below). The extreme case is DGP 2.11, in which the proportion of breaks equals one, and the coefficient of Z_t depends on whether or not t is even. Given the regular pattern of these breaks, this DGP may be hard to distinguish from one without a structural break. For DGPs 2.2, 2.7, and 2.12, we use nonlinear least squares (NLS) to estimate the parameter of the nonlinear model Y_t = exp(β Z_t) + v_t, and we apply our PEGR statistics to the prediction errors v̂_t := Y_t − exp(β̂ Z_t). The situation is analogous to that for the linear model, in that the null holds for DGP 2.2, whereas the alternative holds for DGPs 2.7 and 2.12. Examining these alternatives permits an interesting comparison of the PEGR tests, designed for general use, with the RR, RE, M, OLS-CUSUM, and RE-CUSUM statistics, which are expressly designed for use with linear models. Our simulation results are presented in Tables 5-7. To summarize, the levels of the PEGR tests are approximately correct for both the linear and nonlinear cases and generally improve as the sample size increases. On the other hand, there are evident level distortions for some of the other statistics, especially, as expected, for the linear-model statistics under the nonlinear DGP 2.2. The PEGR statistics also have respectable power. They appear consistent against our structural break alternatives, although they are not as powerful as the other (properly sized) break tests when there is a single structural break. This is as expected, as the other tests are specifically designed to detect a single break, whereas the PEGR test is not.
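The break-count arithmetic behind the example above can be checked directly from the definition of K_n(r); the snippet below (hypothetical helpers of ours, not from the paper) reproduces the four and nine break points quoted for DGP 2.9.

```python
# K_n(r) = {t : (k-1)/r + 1 <= t <= k/r, k = 1, 3, 5, ...}: alternating
# regimes of length 1/r.  Count the regime switches ("break points").
def kn_indicator(n, r):
    block = round(1 / r)                              # regime length 1/r
    return [((t - 1) // block) % 2 == 0 for t in range(1, n + 1)]

def n_breaks(n, r):
    s = kn_indicator(n, r)
    return sum(s[t] != s[t + 1] for t in range(n - 1))

print(n_breaks(100, 0.05), n_breaks(200, 0.05))       # -> 4 9 (DGP 2.9)
print(n_breaks(100, 0.02), n_breaks(200, 0.02))       # -> 1 3 (DGP 2.8)
```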
Table 6. Power simulation at 5% level (in percent, 3000 iterations), for DGPs 2.3-2.7 with n ∈ {100, 200}; the statistics are those of Table 5. [The numeric panel is not reproduced here, as the cell alignment was lost in extraction.]
As one might also expect, the power of the PEGR tests diminishes notably as the break moves away from the center of the sample. Nevertheless, the relative performance of the tests reverses when there are multiple breaks. All test statistics lose power as the proportion of breaks increases, but the loss of power for the non-PEGR tests is much faster than that for the PEGR tests. For the extreme alternative DGP 2.11, the PEGR tests appear to be the only consistent tests. We also note that, as for the dependent alternatives, the integral norm-based tests outperform the supremum norm-based tests. Finally, we briefly summarize the results of other interesting experiments omitted from our tabulations for the sake of brevity. To further examine level properties, we applied our EGR tests (i) with Y_t ∼ IID C(0, 1), where C(ℓ, s) denotes the Cauchy distribution with location and scale parameters ℓ and s, respectively, and (ii) with Y_t = (u_t + 1)1{ε_t ≥ 0} + (u_t − 1)1{ε_t < 0}, where (ε_t, u_t) ∼ IID N(0, I_2). We consider the Cauchy process to examine whether the absence of moments in the raw data matters, and we consider the normal random mixture to compare the results with those for the deterministic mixture, DGP 2.10. Both experiments yielded results very similar to those reported for DGP 1.1, affirming the claims for the asymptotic null distributions of the (P)EGR test statistics in the previous sections. To further examine the power of our (P)EGR tests, we also considered the mean shift processes analyzed by Crainiceanu and Vogelsang (2007), based on DGP 2.3. Our main motivation for this arises from the caveat in the literature that CUSUM and CUSQ tests may exhibit power functions that are non-monotonic in the mean-shift magnitude α. (See Deng and Perron (2008) for further details.) In contrast, we find that the (P)EGR test statistics do not exhibit this non-monotonicity.
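For concreteness, the following sketch (ours, for illustration only) computes the OLS-based CUSUM statistic of Ploberger and Krämer (1992) for a mean-shift design in the spirit of DGP 2.3; it is not the authors' simulation code, and it uses the simple mean-only model.

```python
import numpy as np

def ols_cusum(y):
    """OLS-based CUSUM statistic for the mean model Y_t = alpha + v_t
    (a minimal sketch): sup_r |sum_{t<=nr} e_t| / (sigma_hat * sqrt(n))."""
    n = len(y)
    e = y - y.mean()                       # OLS residuals
    sigma = e.std(ddof=1)
    return (np.abs(np.cumsum(e)) / (sigma * np.sqrt(n))).max()

rng = np.random.default_rng(1)
n = 200
for delta in (0.0, 0.5, 1.0, 4.0):         # mean-shift magnitudes
    y = (np.arange(1, n + 1) > n // 2) * delta + rng.standard_normal(n)
    # compare to the asymptotic 5% critical value of about 1.36
    print(delta, round(ols_cusum(y), 2))
```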
6. Conclusion

The IID assumption plays a central role in economics and econometrics. Here we provide a family of tests based on generalized runs that are powerful against unspecified alternatives, providing a useful complement to tests designed to have power against specific alternatives, such as serial correlation, GARCH, or structural breaks. Relative to other tests of this sort, for example the entropy-based tests of Hong and White (2005), our tests have an appealing computational simplicity, in that they do not require kernel density estimation, with the associated challenge of bandwidth selection. Our simulation studies show that our tests have empirical levels close to their nominal asymptotic levels. They also have encouraging power against a variety of important alternatives. In particular, they have power against dependent alternatives and heterogeneous alternatives, including those involving a number of structural breaks increasing with the sample size.

Acknowledgements

The authors are grateful to the Co-editor, Ronald Gallant, and two anonymous referees for their very helpful comments. They have also benefited from discussions with Robert Davies, Ai Deng, Juan Carlos Escanciano, Chirok Han, Yongmiao Hong, Estate Khmaladze, Jin Lee, Tae-Hwy Lee, Leigh Roberts, Peter Robinson, Peter Thomson, David Vere-Jones, and participants at FEMES07, NZESG, NZSA, SRA, UC-Riverside, SEF and SMCS of Victoria University of Wellington, and the Joint Economics Conference (Seoul National University, 2010).
Table 7. Power simulation at 5% level (in percent, 3000 iterations), for DGPs 2.8-2.12 with n ∈ {100, 200}; the statistics are those of Table 5. [The numeric panel is not reproduced here, as the cell alignment was lost in extraction.]
Appendix A. Proofs

To prove our main results, we first state some preliminary lemmas. The proofs of these and other lemmas below can be found at url: http://web.yonsei.ac.kr/jinseocho/. Recall that J := [p̲, 1], p̲ > 0, and, for notational simplicity, for every p, p′ ∈ I with p′ ≤ p and M_n(p′) > 0, we let K_{n,i} denote K_{n,i}(p, p′) such that K_{n,0}(p, p′) = 0 and
\[ \sum_{j=K_{n,i-1}(p,p')+1}^{K_{n,i}(p,p')} R_{n,j}(p) = R_{n,i}(p'). \]

Lemma A.1. Given A1, A2(i), A3, and H_0, if s ∈ S, p′, p ∈ I, and p′ ≤ p such that M_n(p′) > 0, then
\[ E\Big[\sum_{i=1}^{K_{n,1}} s^{R_{n,i}(p)}\Big] = E\big[s^{R_{n,1}(p)}\big]\,E[K_{n,1}]. \]

Lemma A.2. Given A1, A2(i), A3, and H_0, if s ∈ S, p′, p ∈ J, and p′ ≤ p such that M_n(p′) > 0, then
\[ E\big[K_{n,1}\,s^{R_{n,1}(p')}\big] = \frac{sp'\{1-s(1-p)\}}{\{1-s(1-p')\}^{2}}. \]
In the special case in which p = p′, we have K_{n,1} = 1 and E(s^{R_{n,1}(p)}) = sp/{1 − s(1 − p)}.

Lemma A.3. Given A1, A2(i), A3, and H_0, if s ∈ S, p′, p ∈ J, and p′ ≤ p such that M_n(p′) > 0, then
\[ E\Big[\sum_{i=1}^{K_{n,1}} s^{R_{n,i}(p)+R_{n,1}(p')}\Big] = \frac{s^{2}p'}{1-s^{2}(1-p)}\left[\frac{1-s(1-p)}{1-s(1-p')}\right]^{2}. \]

Lemma A.4. Given A1, A2(i), A3, and H_0, if p′, p ∈ J and p′ ≤ p such that M_n(p′) > 0, then P(K_{n,1} = k) = (p′/p)[(p − p′)/p]^{k−1} and E[K_{n,1}] = p/p′.

Lemma A.5. Let p ∈ I be such that M_n(p) > 0. If {R_{n,i}(p)}_{i=1}^{n} is IID with distribution G_p, then, for m = 1, 2, … and ℓ = m, m + 1, m + 2, …,
\[ P\Big(\sum_{i=1}^{m} R_{n,i}(p) = \ell\Big) = \binom{\ell-1}{m-1}(1-p)^{\ell-m}p^{m}. \]

Lemma A.6. Let p, p′ ∈ I be such that M_n(p′) > 0. Given condition R of Lemma 1, for i, k = 1, 2, …:
(i) if ℓ = i, i + 1, …, i + k − 1, then
\[ P\Big(\bigcup_{m=2}^{i+k+1-\ell}\Big\{\sum_{j=2}^{m} R_{n,j}(p') = i+k-\ell\Big\}\Big) = p'; \]
(ii) when p > p′,
\[ P\Big(\bigcup_{m=1}^{i}\Big\{\sum_{j=1}^{m} R_{n,j}(p) = i\Big\},\, R_{n,1}(p') = \ell\Big) = \begin{cases} p'(1-p')^{i-1}, & \text{if } \ell = i;\\ p'(p-p')(1-p')^{\ell-2}, & \text{if } \ell = i+1,\ldots,i+k. \end{cases} \tag{17} \]

Lemma A.7. Let p, p′ ∈ I be such that M_n(p′) > 0. Given condition R of Lemma 1 and p > p′, for i, k = 1, 2, …,
\[ \sum_{\ell=i}^{i+k-1} P\Big(\bigcup_{m=1}^{i}\Big\{\sum_{j=1}^{m} R_{n,j}(p) = i\Big\},\, R_{n,1}(p') = \ell,\, \bigcup_{m=2}^{i+k+1-\ell}\Big\{\sum_{j=2}^{m} R_{n,j}(p') = i+k-\ell\Big\}\Big) + P\Big(\bigcup_{m=1}^{i}\Big\{\sum_{j=1}^{m} R_{n,j}(p) = i\Big\},\, R_{n,1}(p') = i+k\Big) = pp'(1-p')^{i-1}. \]

Lemma A.8. Let K be a random positive integer, and let {X_i} be a sequence of random variables such that for each i = 1, 2, …, E(X_i) < ∞. Then
\[ E\Big[\sum_{i=1}^{K} X_i\Big] = E\Big[\sum_{i=1}^{K} E(X_i \mid K)\Big]. \]
Before proving Lemma 1, we define several relevant notions. First, for p ∈ I with M_n(p) > 0, we define the building time to i by
\[ B_{n,i}(p) := i - \sum_{j=1}^{U_{n,i}(p)} R_{n,j}(p), \]
where U_{n,i}(p) is the maximum number of runs such that ∑_{j=1}^{w} R_{n,j}(p) < i; i.e.,
\[ U_{n,i}(p) := \max\Big\{w \in \mathbb{N} : \sum_{j=1}^{w} R_{n,j}(p) < i\Big\}. \]
Now B_{n,i}(p) ∈ {1, 2, …, i − 1, i}; and if B_{n,i}(p) = i, then R_{n,1}(p) ≥ i. For p, p′ ∈ I, p′ < p, with M_n(p′) > 0, we also let W_{n,i}(p, p′) be the number of runs {R_{n,i}(p′)} such that
\[ \sum_{j=1}^{U_{n,i}(p)} R_{n,j}(p) = \sum_{j=1}^{W_{n,i}(p,p')} R_{n,j}(p'). \]

Proof of Lemma 1. As part (a) is easy, we prove only part (b). We first show that R implies that the original data {Y_t} are pairwise independent; i.e., we show that for any two variables, say (Y_i, Y_{i+k}) with i, k ≥ 1, P(F_i(Y_i) ≤ p, F_{i+k}(Y_{i+k}) ≤ p′) = pp′. We partition our consideration into three cases, (a) p = p′, (b) p < p′, and (c) p > p′, and obtain the given equality for each case.
(a) Let p = p′. We have
\[ P(F_i(Y_i) \le p,\, F_{i+k}(Y_{i+k}) \le p') = P\Big(\bigcup_{m=1}^{i}\Big\{\sum_{j=1}^{m} R_{n,j}(p) = i\Big\},\, \bigcup_{h=m+1}^{k+m}\Big\{\sum_{j=m+1}^{h} R_{n,j}(p) = k\Big\}\Big) = \sum_{m=1}^{i} P\Big(\sum_{j=1}^{m} R_{n,j}(p) = i\Big)\,\sum_{h=1}^{k} P\Big(\sum_{j=1}^{h} R_{n,j}(p) = k\Big) = p\cdot p = p^{2}, \]
where the second and third equalities follow from condition R and Lemma A.5, respectively.
(b) Next, suppose p < p′. We have
\[ P(F_i(Y_i) \le p,\, F_{i+k}(Y_{i+k}) \le p') = \sum_{h=1}^{i} P\Big(\bigcup_{m=1}^{i}\Big\{\sum_{j=1}^{m} R_{n,j}(p) = i\Big\},\, \sum_{j=1}^{h} R_{n,j}(p') = i,\, \bigcup_{m=h+1}^{k+h}\Big\{\sum_{j=h+1}^{m} R_{n,j}(p') = k\Big\}\Big) = \sum_{h=1}^{i} P\Big(\bigcup_{m=1}^{i}\Big\{\sum_{j=1}^{m} R_{n,j}(p) = i\Big\},\, \sum_{j=1}^{h} R_{n,j}(p') = i\Big)\,\sum_{m=1}^{k} P\Big(\sum_{j=1}^{m} R_{n,j}(p') = k\Big), \]
where the second equality follows from R. Further, Lemma A.5 implies that
\[ \sum_{h=1}^{i} P\Big(\bigcup_{m=1}^{i}\Big\{\sum_{j=1}^{m} R_{n,j}(p) = i\Big\},\, \sum_{j=1}^{h} R_{n,j}(p') = i\Big) = p \quad\text{and}\quad \sum_{m=1}^{k} P\Big(\sum_{j=1}^{m} R_{n,j}(p') = k\Big) = p'. \]
Thus, P(F_i(Y_i) ≤ p, F_{i+k}(Y_{i+k}) ≤ p′) = pp′.
(c) Finally, let p′ < p. We have P(F_i(Y_i) ≤ p, F_{i+k}(Y_{i+k}) ≤ p′) = ∑_{b=1}^{i} P(F_i(Y_i) ≤ p, F_{i+k}(Y_{i+k}) ≤ p′, B_{n,i}(p) = b), and we derive each piece constituting this sum separately. We first examine the case b = i. Then
\[ P(F_i(Y_i) \le p, F_{i+k}(Y_{i+k}) \le p', B_{n,i}(p) = i) = \sum_{\ell=i}^{i+k-1} P\Big(\bigcup_{m=1}^{i}\Big\{\sum_{j=1}^{m} R_{n,j}(p) = i\Big\}, R_{n,1}(p') = \ell, \bigcup_{m=2}^{i+k+1-\ell}\Big\{\sum_{j=2}^{m} R_{n,j}(p') = i+k-\ell\Big\}\Big) + P\Big(\bigcup_{m=1}^{i}\Big\{\sum_{j=1}^{m} R_{n,j}(p) = i\Big\}, R_{n,1}(p') = i+k\Big) = pp'(1-p')^{i-1}, \]
where the last equality follows from Lemma A.7. Next, we consider the cases b = 1, 2, …, i − 1. Writing w and u as shorthand for W_{n,i}(p, p′) and U_{n,i}(p), it follows that
\[ P(F_i(Y_i) \le p, F_{i+k}(Y_{i+k}) \le p', B_{n,i}(p) = b) = \Big[\sum_{\ell=b}^{b+k-1} P\Big(\bigcup_{m=1+w}^{b+w}\Big\{\sum_{j=1+w}^{m} R_{n,j}(p) = b\Big\}, R_{n,u+1}(p') = \ell, \bigcup_{m=u+2}^{b+k+u+1-\ell}\Big\{\sum_{j=u+2}^{m} R_{n,j}(p') = b+k-\ell\Big\}\Big) + P\Big(\bigcup_{m=1+w}^{b+w}\Big\{\sum_{j=1+w}^{m} R_{n,j}(p) = b\Big\}, R_{n,u+1}(p') = b+k\Big)\Big] \times P\Big(\bigcup_{m=1}^{i-b}\Big\{\sum_{j=1}^{m} R_{n,j}(p') = i-b\Big\}\Big). \]
Given this, we further note that P(∪_{m=1}^{i−b}{∑_{j=1}^{m} R_{n,j}(p′) = i − b}) = ∑_{m=1}^{i−b} P(∑_{j=1}^{m} R_{n,j}(p′) = i − b) = p′ by Lemma A.5; condition R implies that
\[ P\Big(\bigcup_{m=1+w}^{b+w}\Big\{\sum_{j=1+w}^{m} R_{n,j}(p) = b\Big\}, R_{n,u+1}(p') = b+k\Big) = P\Big(\bigcup_{m=1}^{b}\Big\{\sum_{j=1}^{m} R_{n,j}(p) = b\Big\}, R_{n,1}(p') = b+k\Big), \]
and, for ℓ = b, b + 1, …, b + k − 1,
\[ P\Big(\bigcup_{m=1+w}^{b+w}\Big\{\sum_{j=1+w}^{m} R_{n,j}(p) = b\Big\}, R_{n,u+1}(p') = \ell, \bigcup_{m=u+2}^{b+k+u+1-\ell}\Big\{\sum_{j=u+2}^{m} R_{n,j}(p') = b+k-\ell\Big\}\Big) = P\Big(\bigcup_{m=1}^{b}\Big\{\sum_{j=1}^{m} R_{n,j}(p) = b\Big\}, R_{n,1}(p') = \ell, \bigcup_{m=2}^{b+k+1-\ell}\Big\{\sum_{j=2}^{m} R_{n,j}(p') = b+k-\ell\Big\}\Big), \]
so that P(F_i(Y_i) ≤ p, F_{i+k}(Y_{i+k}) ≤ p′, B_{n,i}(p) = b) = P(F_b(Y_b) ≤ p, F_{b+k}(Y_{b+k}) ≤ p′, B_{n,b}(p) = b)·p′ = pp′²(1 − p′)^{b−1}, using the case already treated. Hence,
\[ P(F_i(Y_i) \le p, F_{i+k}(Y_{i+k}) \le p') = \sum_{b=1}^{i-1} pp'^{2}(1-p')^{b-1} + pp'(1-p')^{i-1} = pp'. \]
Thus, Y_i and Y_{i+k} are independent.
Next, suppose that {Y_t} is not identically distributed. Then there is a pair, say (Y_i, Y_j), such that for some y ∈ ℝ, p_i := F_i(y) ≠ p_j := F_j(y). Further, for the same y, P(R_{n,(j)}(p_j) = 1) = P(F_j(Y_j) ≤ F_j(y) | F_{j−1}(Y_{j−1}) ≤ F_{j−1}(y)) = P(F_j(Y_j) ≤ F_j(y)) = p_j, where the subscript (j) denotes the (j)th run of {R_{n,i}(p_j)} corresponding to F_j(y), and the second equality follows from the independence property just shown. Similarly, P(R_{n,(i)}(p_j) = 1) = P(F_j(Y_i) ≤ F_j(y)) = P(Y_i ≤ y) = p_i. That is, P(R_{n,(j)}(p_j) = 1) ≠ P(R_{n,(i)}(p_j) = 1). This contradicts the assumption that {R_{n,i}(p)} is identically distributed for all p ∈ I. Hence, {Y_t} must be identically distributed. This completes the proof. □
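The objects manipulated in these proofs are straightforward to compute. The sketch below (our illustration, not the authors' code) builds the generalized runs R_{n,i}(p) from the empirical CDF and evaluates the EGR process G_n(p, s) = n^{−1/2} ∑_{i=1}^{M_n(p)} (s^{R_{n,i}(p)} − sp/{1 − s(1 − p)}), the centering being the geometric moment noted above.

```python
import numpy as np

def runs(y, p):
    """Generalized runs R_{n,i}(p): gaps between successive times t with
    F_n(Y_t) <= p (a sketch using the empirical CDF via ranks)."""
    n = len(y)
    ranks = np.argsort(np.argsort(y)) + 1        # ranks 1..n
    hits = np.flatnonzero(ranks / n <= p) + 1    # times t with F_n(Y_t) <= p
    if hits.size == 0:
        return np.array([], dtype=int)
    prev = np.concatenate(([0], hits[:-1]))
    return hits - prev                           # completed runs only

def G_n(y, p, s):
    """Empirical EGR process G_n(p, s)."""
    r = runs(y, p)
    centering = s * p / (1.0 - s * (1.0 - p))
    return (s ** r - centering).sum() / np.sqrt(len(y))

rng = np.random.default_rng(0)
y = rng.standard_normal(500)                     # IID null
print(round(G_n(y, p=0.5, s=0.5), 3))            # O_P(1), approximately mean zero
```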
Proof of Theorem 1. (i) We separate the proof into three parts. In (a), we prove the weak convergence of G_n(p, ·). In (b), we show that E[G(p, s)] = 0 for each s ∈ S. Finally, (c) derives E[G(p, s)G(p, s′)].
(a) First, we show that for some ∆̄_1 > 0, E[{G_n(p, s) − G_n(p, s′)}^4] ≤ ∆̄_1|s − s′|^4. Note that for each p,
\[ E[\{G_n(p,s) - G_n(p,s')\}^4] = (1 - n^{-1})E[\{G(p,s) - G(p,s')\}^4] + n^{-2}E\Big[\sum_{i=1}^{M_n(p)}\Big\{s^{R_{n,i}(p)} - s'^{R_{n,i}(p)} + \frac{sp}{1-s(1-p)} - \frac{s'p}{1-s'(1-p)}\Big\}^4\Big] \le 2E[\{G(p,s) - G(p,s')\}^4] + n^{-1}E\Big[\Big\{s^{R_{n,i}(p)} - s'^{R_{n,i}(p)} + \frac{sp}{1-s(1-p)} - \frac{s'p}{1-s'(1-p)}\Big\}^4\Big], \]
a consequence of finite-dimensional weak convergence, which follows from the Lindeberg-Lévy central limit theorem (CLT) and the Cramér-Wold device. We examine each piece on the RHS separately. It is independently shown in Theorem 4(i) that G(p, ·) equals Z_p in distribution. Thus, E[{G(p, s) − G(p, s′)}^4] = E[{Z_p(s) − Z_p(s′)}^4] uniformly in p. If we let m_p(s) := sp(1 − s)(1 − p)^{1/2}{1 − s(1 − p)}^{−1} and B_j(p) := (1 − p)^{j/2}Z_j for notational simplicity, then Z_p(s) = m_p(s)∑_{j=0}^{∞} s^j B_j(p), and it follows that
\[ \{Z_p(s) - Z_p(s')\}^4 = \big(A_p(s)[m_p(s) - m_p(s')] + m_p(s')B_p(s)(s-s')\big)^4, \]
where A_p(s) := ∑_{j=0}^{∞} s^j B_j(p) and B_p(s) := ∑_{j=1}^{∞} (∑_{k=0}^{j-1} s^{j-1-k}s'^{k}) B_j(p). We can also use the mean value theorem to obtain that, for some s′′ between s and s′, m_p(s) − m_p(s′) = m_p′(s′′)(s − s′). Therefore, if we let s_0 := max[|s̲|, s̄] and ∆_1 := (1 − s̲)^2(1 − s_0)^{−2}, then E[{Z_p(s) − Z_p(s′)}^4] = E[{A_p(s)m_p′(s′′) + m_p(s′)B_p(s)}^4]|s − s′|^4 ≤ ∆_1^4{|E[A_p(s)^4]| + 4|E[A_p(s)^3B_p(s)]| + 6|E[A_p(s)^2B_p(s)^2]| + 4|E[A_p(s)B_p(s)^3]| + |E[B_p(s)^4]|}|s − s′|^4, because sup_{p,s}|m_p′(s)| ≤ ∆_1 and sup_{p,s}|m_p(s)| ≤ ∆_1. Some tedious algebra shows that sup_{(p,s)∈I×S} E[A_p(s)^4] ≤ ∆_2 := 6/(1 − s_0^4)^2 and sup_{(p,s)∈I×S} E[B_p(s)^4] ≤ ∆_3 := 72/(1 − s_0^4)^5, so that E[{Z_p(s) − Z_p(s′)}^4] ≤ 16∆_1^4∆_3|s − s′|^4. Using Hölder's inequality, we obtain |E[A_p(s)^3B_p(s)]| ≤ E[A_p(s)^4]^{3/4}E[B_p(s)^4]^{1/4} ≤ ∆_2^{3/4}∆_3^{1/4} ≤ ∆_3, |E[A_p(s)^2B_p(s)^2]| ≤ ∆_2^{1/2}∆_3^{1/2} ≤ ∆_3, and |E[A_p(s)B_p(s)^3]| ≤ ∆_2^{1/4}∆_3^{3/4} ≤ ∆_3, where the final inequalities follow from the fact that ∆_2 ≤ ∆_3. Next, we note that |s^{R_{n,i}(p)} − s′^{R_{n,i}(p)}| ≤ R_{n,i}(p)s_0^{R_{n,i}(p)-1}|s − s′| and |sp/{1 − s(1 − p)} − s′p/{1 − s′(1 − p)}| ≤ (1 − s_0)^{−2}|s − s′|, with E[R_{n,i}(p)^4 s_0^{4R_{n,i}(p)}] ≤ 24(1 − s_0^4)^{−5}. Thus, when we let Q_{n,i} := R_{n,i}(p)s_0^{R_{n,i}(p)} + (1 − s_0)^{−2}, it follows that E[Q_{n,i}^4] ≤ ∆̃_1 := 384(1 − s_0^4)^{−5}(1 − s_0)^{−8}, and E[{s^{R_{n,i}(p)} − s′^{R_{n,i}(p)} + sp/{1 − s(1 − p)} − s′p/{1 − s′(1 − p)}}^4] ≤ ∆̃_1|s − s′|^4. Given this, if ∆̄_1 is defined by ∆̄_1 := 32∆_1^4∆_3 + ∆̃_1, then E[|G_n(p, s) − G_n(p, s′)|^4] ≤ ∆̄_1|s − s′|^4.
Second, therefore, if we let s′′ ≤ s′ ≤ s, then E[|G_n(p, s) − G_n(p, s′)|^2|G_n(p, s′) − G_n(p, s′′)|^2] ≤ E[|G_n(p, s) − G_n(p, s′)|^4]^{1/2}E[|G_n(p, s′) − G_n(p, s′′)|^4]^{1/2} ≤ ∆̄_1|s − s′′|^4, where the first inequality follows from the Cauchy-Schwarz inequality. This verifies condition (13.14) of Billingsley (1999). The desired result follows from these facts, Theorem 13.5 of Billingsley (1999), and the finite-dimensional weak convergence, which is obtained by applying the Cramér-Wold device.
(b) Under the given conditions and the null,
\[ E\Big[\sum_{i=1}^{M_n(p)}\big(s^{R_{n,i}(p)} - sp/\{1-s(1-p)\}\big)\Big] = E\Big[\sum_{i=1}^{M_n(p)} E\big[s^{R_{n,i}(p)} - sp/\{1-s(1-p)\} \,\big|\, M_n(p)\big]\Big] = E\Big[\sum_{i=1}^{M_n(p)}\big(sp/\{1-s(1-p)\} - sp/\{1-s(1-p)\}\big)\Big] = 0, \]
where the first equality follows from Lemma A.8, and the second equality follows from the fact that, given M_n(p), R_{n,i}(p) is IID under the null.
(c) Under the given conditions and the null,
\[ E[G_n(p,s)G_n(p,s')] = n^{-1}E\Big[\sum_{i=1}^{M_n(p)} E\big[\big(s^{R_{n,i}(p)} - sp/\{1-s(1-p)\}\big)\big(s'^{R_{n,i}(p)} - s'p/\{1-s'(1-p)\}\big)\,\big|\, M_n(p)\big]\Big] = n^{-1}E\Big[M_n(p)\Big(\frac{ss'p}{1-ss'(1-p)} - \frac{ss'p^2}{\{1-s(1-p)\}\{1-s'(1-p)\}}\Big)\Big] = \frac{ss'p^2(1-s)(1-s')(1-p)}{\{1-ss'(1-p)\}\{1-s(1-p)\}\{1-s'(1-p)\}}, \]
where the first equality follows from Lemma A.8, and the last equality follows because n^{−1}E[M_n(p)] = p. Finally, it follows from the continuous mapping theorem that f[G_n(p, ·)] ⇒ f[G(p, ·)]. This is the desired result.
(ii) This can be proved in numerous ways. We verify the conditions of Theorem 13.5 of Billingsley (1999). Our proof is separated into three parts: (a), (b), and (c). In (a), we show the weak convergence of G_n(·, s). In (b), we prove that for each p, E[G(p, s)] = 0. Finally, in (c), we show that E[G(p, s)G(p′, s)] = s^2p′^2(1 − s)^2(1 − p)/[{1 − s(1 − p′)}^2{1 − s^2(1 − p)}].
(a) First, for each s, we have G(1, s) ≡ 0 as G_n(1, s) ≡ 0, and for any δ > 0, lim_{p→1} P(|G(p, s)| > δ) ≤ lim_{p→1} E(|G(p, s)|^2)/δ^2 = lim_{p→1} s^2p^2(1 − s)^2(1 − p)/[δ^2{1 − s(1 − p)}^2{1 − s^2(1 − p)}] = 0 uniformly on S, where the inequality and the equality follow from the Markov inequality and the result in (c), respectively. Thus, for each s, G(p, s) − G(1, s) ⇒ 0 as p → 1. Second, it is not hard to show that
\[ E[\{G_n(p,s) - G_n(p',s)\}^4] = E[\{G(p,s) - G(p',s)\}^4] - n^{-1}p'E[\{G(p,s) - G(p',s)\}^4] + n^{-1}p'E\Big[\Big\{\sum_{i=1}^{K_{n,1}}\big(s^{R_i} - E[s^{R_i}]\big) - \big(s^{R'_1} - E[s^{R'_1}]\big)\Big\}^4\Big], \]
using the finite-dimensional weak convergence result. We examine each term on the RHS separately. From some tedious algebra, it follows that
\[ E[\{G(p,s) - G(p',s)\}^4] = 3s^4(1-s)^4\{k_s(p)m_s(p) - 2k_s(p')m_s(p) + k_s(p')m_s(p')\}^2 \le 3\{|\{k_s(p) - k_s(p')\}m_s(p)| + |k_s(p')\{m_s(p') - m_s(p)\}|\}^2, \]
where for each p, k_s(p) := p^2/{1 − s(1 − p)}^2 and m_s(p) := (1 − p)/{1 − s^2(1 − p)}. Note that |k_s|, |m_s|, |k_s′|, and |m_s′| are bounded by ∆_4 := max[∆_1, ∆_2, ∆_3] uniformly in (p, s). This implies that there exists ∆̃_2 > 0 such that, for sufficiently large n, E[{G(p, s) − G(p′, s)}^4] ≤ ∆̃_2|p − p′|^2. Some algebra implemented using Mathematica then shows that for some ∆_5 > 0,
\[ p'E\Big[\Big\{\sum_{i=1}^{K_{n,1}}\big(s^{R_i} - E[s^{R_i}]\big) - \big(s^{R'_1} - E[s^{R'_1}]\big)\Big\}^4\Big] \le \Delta_5\,p'^{-1}|p - p'|, \]
so that, given p′ ≥ p̲ > 0, if n^{−1} is less than |p − p′|, then for sufficiently large n, E[{G_n(p, s) − G_n(p′, s)}^4] ≤ ∆̄_2|p − p′|^2, where ∆̄_2 := ∆̃_2(1 + p̲^{−1}) + ∆_5 p̲^{−1}. Finally, for each p′′ ≤ p′ ≤ p, E[{G_n(p, s) − G_n(p′, s)}^2{G_n(p′, s) − G_n(p′′, s)}^2] ≤ E[|G_n(p, s) − G_n(p′, s)|^4]^{1/2}E[|G_n(p′, s) − G_n(p′′, s)|^4]^{1/2} ≤ ∆̄_2|p − p′′|^2 by the Cauchy-Schwarz inequality. The weak convergence of {G_n(·, s)} holds by Theorem 13.5 of Billingsley (1999) and finite-dimensional weak convergence, which can be obtained by the Cramér-Wold device.
(b) For each p, E[G(p, ·)] = 0 follows from the proof of Theorem 1(i, b).
(c) First, for convenience, for each p and p′, we let M and M′ denote M_n(p) and M_n(p′), respectively, and let R_i and R_i′ stand for R_{n,i}(p) and R_{n,i}(p′). Then, from the definition of K_{n,j},
\[ E[G_n(p,s)G_n(p',s)] = n^{-1}E\Big[M' E\Big[\sum_{i=1}^{K_{n,1}}\big(s^{R_i} - E[s^{R_i}]\big)\big(s^{R'_1} - E[s^{R'_1}]\big)\,\Big|\,M, M'\Big]\Big] = n^{-1}E[M']\,E\Big[\sum_{i=1}^{K_{n,1}}\big(s^{R_i} - E[s^{R_i}]\big)\big(s^{R'_1} - E[s^{R'_1}]\big)\Big] = p'\,E\Big[\sum_{i=1}^{K_{n,1}} s^{R_i}s^{R'_1} - \Big(\sum_{i=1}^{K_{n,1}} s^{R_i}\Big)E[s^{R'_1}] - K_{n,1}s^{R'_1}E[s^{R_i}] + K_{n,1}E[s^{R_i}]E[s^{R'_1}]\Big], \]
where the first equality follows from Lemma A.8, since {K_{n,j}} is IID under the null and R_j′ is independent of R_ℓ if ℓ ≤ K_{n,j−1} or ℓ ≥ K_{n,j} + 1. The second equality follows, as {M, M′} is independent of {R_i, R_1′ : i = 1, 2, …, K_{n,1}}. Further, E[∑_{i=1}^{K_{n,1}} s^{R_i}] = E[s^{R_i}]·E[K_{n,1}], E[K_{n,1}s^{R_1′}] = sp′/{1 − s(1 − p′)}·{1 − s(1 − p)}/{1 − s(1 − p′)}, and E[∑_{i=1}^{K_{n,1}} s^{R_i + R_1′}] = s^2p′/{1 − s^2(1 − p)}·[{1 − s(1 − p)}/{1 − s(1 − p′)}]^2 by Lemmas A.1-A.4. Substituting these into the above equation yields the desired result.
(iii) We separate the proof into two parts, (a) and (b). In (a), we prove the weak convergence of G_n, and in (b) we derive its covariance structure.
(a) In order to show the weak convergence of G_n, we exploit the moment condition in Theorem 3 of Bickel and Wichura (1971, p. 1665). For this, we first let B and C be neighbors in J × S such that B := (p_1, p_2] × (s_1, s_2] and C := (p_1, p_2] × (s_2, s_3]. Without loss of generality, we suppose that |s_2 − s_1| ≤ |s_3 − s_2|. Second, we define |G_n(B)| := |G_n(p_1, s_1) − G_n(p_1, s_2) − G_n(p_2, s_1) + G_n(p_2, s_2)|; then |G_n(B)| ≤ |G_n(p_1, s_1) − G_n(p_1, s_2)| + |G_n(p_2, s_2) − G_n(p_2, s_1)|, so that, letting A_1 := |G_n(p_1, s_1) − G_n(p_1, s_2)| and A_2 := |G_n(p_2, s_2) − G_n(p_2, s_1)| for notational simplicity and using Hölder's inequality, E[|G_n(B)|^4] ≤ E[|A_1|^4] + 4E[|A_1|^4]^{3/4}E[|A_2|^4]^{1/4} + 6E[|A_1|^4]^{1/2}E[|A_2|^4]^{1/2} + 4E[|A_1|^4]^{1/4}E[|A_2|^4]^{3/4} + E[|A_2|^4]. We already saw that E[|A_1|^4] ≤ ∆̄_1|s_1 − s_2|^4 and E[|A_2|^4] ≤ ∆̄_1|s_1 − s_2|^4 in the proof of Theorem 1(i). Thus, E[|G_n(B)|^4] ≤ 16∆̄_1|s_1 − s_2|^4. Third, we define |G_n(C)| := |G_n(p_1, s_2) − G_n(p_1, s_3) − G_n(p_2, s_2) + G_n(p_2, s_3)|; then |G_n(C)| ≤ |G_n(p_1, s_2) − G_n(p_2, s_2)| + |G_n(p_2, s_3) − G_n(p_1, s_3)|. Using the same logic as above, Hölder's inequality, and the result in the proof of Theorem 1(ii), we obtain E[|G_n(C)|^4] ≤ 16∆̄_2|p_2 − p_1|^2 for sufficiently large n. Fourth, therefore, using Hölder's inequality, we obtain that for all sufficiently large n,
\[ E[|G_n(B)|^{4/3}|G_n(C)|^{8/3}] \le E[|G_n(B)|^4]^{1/3}E[|G_n(C)|^4]^{2/3} \le \bar\Delta\{|s_2-s_1|^2\cdot|p_2-p_1|^2\}^{2/3} \le \bar\Delta\{|s_2-s_1|\cdot|p_2-p_1|\}^{2/3}\{|s_3-s_2|\cdot|p_2-p_1|\}^{2/3} = \{\bar\Delta^{3/4}\lambda(B)\}^{2/3}\{\bar\Delta^{3/4}\lambda(C)\}^{2/3}, \]
where ∆̄ := 16∆̄_1^{1/3}∆̄_2^{2/3}, and λ(·) denotes the Lebesgue measure of the given argument. This verifies the moment condition (3) in Theorem 3 of Bickel and Wichura (1971, p. 1665). Fifth, it trivially holds from the definition of G_n that G_n = 0 on {(p, s) ∈ J × S : s = 0}. Finally, the continuity of G on the edge of J × S was verified in the proof of Theorem 1(ii). Therefore, the weak convergence of {G_n} follows from the corollary in Bickel and Wichura (1971, p. 1664) and the finite-dimensional weak convergence obtained by the Lindeberg-Lévy CLT and the Cramér-Wold device.
(b) As before, for convenience, for each p and p′, we let M and M′ denote M_n(p) and M_n(p′), respectively, and we let R_i and R_i′ be shorthand for R_{n,i}(p) and R_{n,i}(p′). Also, we let K_{n,j} be as previously defined. Then, under the given conditions and the null,
\[ E[G_n(p,s)G_n(p',s')] = n^{-1}E\Big[\sum_{j=1}^{M'}\sum_{i=1}^{M}\big(s^{R_i} - E[s^{R_i}]\big)\big(s'^{R'_j} - E[s'^{R'_j}]\big)\Big] \]
\[ = p'\,E\Big[\sum_{i=1}^{K_{n,1}} s^{R_i}s'^{R'_1} - \Big(\sum_{i=1}^{K_{n,1}} s^{R_i}\Big)E[s'^{R'_1}] - K_{n,1}s'^{R'_1}E[s^{R_i}] + K_{n,1}E[s^{R_i}]E[s'^{R'_1}]\Big], \tag{18} \]
where the first equality follows from the definition of G_n, and the second equality holds for the same reason as in the proof of Theorem 1(ii). From Lemmas A.1-A.4, we have that E[∑_{i=1}^{K_{n,1}} s^{R_i}] = E[s^{R_i}]·E[K_{n,1}], E[K_{n,1}s′^{R_1′}] = s′p′/{1 − s′(1 − p′)}·{1 − s′(1 − p)}/{1 − s′(1 − p′)}, and E[∑_{i=1}^{K_{n,1}} s^{R_i}s′^{R_1′}] = ss′p′/{1 − ss′(1 − p)}·[{1 − s′(1 − p)}/{1 − s′(1 − p′)}]^2. Thus, substituting these into (18) gives
\[ E[G_n(p,s)G_n(p',s')] = \frac{ss'p'^2(1-s)(1-s')(1-p)\{1-s'(1-p)\}}{\{1-s(1-p)\}\{1-s'(1-p')\}^2\{1-ss'(1-p)\}}. \]
Finally, it follows from the continuous mapping theorem that f[G_n] ⇒ f[G]. □

Proof of Lemma 2. (i) First, sup_{p∈I} |p̂_n(p) − p| → 0 almost surely by Glivenko-Cantelli. Second, G_n ⇒ G by Theorem 1(ii). Third, (D(J × S) × D(J)) is a separable space. Thus, (G_n, p̂_n(·)) ⇒ (G, ·) by Theorem 3.9 of Billingsley (1999). Fourth, |G(p, s) − G(p′, s′)| ≤ |G(p, s) − G(p′, s)| + |G(p′, s) − G(p′, s′)|, and each term on the RHS can be made as small as desired by letting |p − p′| and |s − s′| tend to zero, as G ∈ C(J × S) a.s. Finally, note that for each (p, s), W_n(p, s) = G_n(p̂_n(p), s). Therefore, W_n − G_n = G_n(p̂_n(·), ·) − G_n(·, ·) ⇒ G − G = 0 by a lemma of Billingsley (1968/1999, p. 151) and the four facts just shown. This implies that sup_{(p,s)∈J×S} |W_n(p, s) − G_n(p, s)| → 0 in probability, as desired.
(ii) We write p̂_n(p) as p̂_n for convenience. By the mean value theorem, for some p_n^*(p) (in I) between p ∈ J and p̂_n,
\[ \Big|\Big[\frac{s\hat p_n}{1-s(1-\hat p_n)} - \frac{sp}{1-s(1-p)}\Big] - \frac{s(1-s)(\hat p_n - p)}{\{1-s(1-p)\}^2}\Big| = \frac{2s^2(1-s)(\hat p_n - p)^2}{\{1-s(1-p_n^*(p))\}^3}, \]
where sup_{p∈J} |p_n^*(p) − p| → 0 a.s. by Glivenko-Cantelli. Also,
\[ \sup_{p,s}\Big|\frac{\hat M_n(p)\,s^2(1-s)(\hat p_n - p)^2}{\sqrt n\,\{1-s(1-p_n^*(p))\}^3}\Big| \le \frac{1}{\sqrt n}\,\sup_{p}\Big|\frac{\hat M_n(p)}{n}\Big|\;\sup_{p,s}\Big|\frac{s^2(1-s)}{\{1-s(1-p_n^*(p))\}^3}\Big|\;\sup_{p}\big|n(\hat p_n - p)^2\big|, \]
where n^{−1}M̂_n(p) and s^2(1 − s){1 − s(1 − p_n^*(p))}^{−3} are uniformly bounded by 1 and 1/(1 − s_0)^3, respectively, with s_0 := max[|s̲|, s̄]; and n(p̂_n − p)^2 = O_P(1) uniformly in p. Thus,
\[ \sup_{p,s}\Big|\frac{\hat M_n(p)}{\sqrt n}\Big[\frac{s\hat p_n}{1-s(1-\hat p_n)} - \frac{sp}{1-s(1-p)}\Big] - \frac{\hat M_n(p)}{\sqrt n}\,\frac{s(1-s)(\hat p_n - p)}{\{1-s(1-p)\}^2}\Big| = o_P(1). \]
Given these, the weak convergence of H_n follows immediately, as sup_p |n^{−1}M̂_n(p) − p| = o_P(1), and the function of p defined by √n(p̂_n − p) converges weakly to a Brownian bridge, permitting application of the lemma of Billingsley (1968/1999, p. 151). These facts also suffice for the tightness of H_n. Next, the covariance structure of H follows from the fact that, for each (p, s) and (p′, s′) with p′ ≤ p,
\[ E\Big[\frac{\hat M_n(p)s(1-s)(\hat p_n - p)}{\sqrt n\,\{1-s(1-p)\}^2}\Big] = 0, \qquad E\Big[\frac{\hat M_n(p)s(1-s)(\hat p_n - p)}{\sqrt n\,\{1-s(1-p)\}^2}\cdot\frac{\hat M_n(p')s'(1-s')(\hat p'_n - p')}{\sqrt n\,\{1-s'(1-p')\}^2}\Big] = \frac{ss'pp'^2(1-s)(1-s')(1-p)}{\{1-s(1-p)\}^2\{1-s'(1-p')\}^2}, \]
which is identical to E[H(p, s)H(p′, s′)].
(iii) To show the given claim, we first derive the given covariance structure. For each (p, s) and (p′, s′), E[W_n(p, s)H_n(p′, s′)] = E[E[W_n(p, s)|X_1, …, X_n]H_n(p′, s′)], where the equality follows because H_n is measurable with respect to the smallest σ-algebra generated by {X_1, …, X_n}. Given this, we have
\[ E[W_n(p,s)\,|\,X_1,\ldots,X_n] = n^{-1/2}\sum_{i=1}^{\hat M_n(p)}\big(E[s^{R_{n,i}(p)}\,|\,X_1,\ldots,X_n] - s\hat p_n/\{1-s(1-\hat p_n)\}\big) = n^{-1/2}\sum_{i=1}^{\hat M_n(p)}\big(sp/\{1-s(1-p)\} - s\hat p_n/\{1-s(1-\hat p_n)\}\big) = -H_n(p,s). \]
Thus, E[E[W_n(p, s)|X_1, …, X_n]H_n(p′, s′)] = −E[H_n(p, s)H_n(p′, s′)]. Next, we have that E[G(p, s)H(p′, s′)] = lim_{n→∞} E[W_n(p, s)H_n(p′, s′)] by Lemma 2(i). Further, E[H(p, s)H(p′, s′)] = lim_{n→∞} E[H_n(p, s)H_n(p′, s′)]. It follows that E[G(p, s)H(p′, s′)] = −E[H(p, s)H(p′, s′)].
Next, we consider (G̃_n, H_n)′ and apply Example 1.4.6 of van der Vaart and Wellner (1996, p. 31) to show weak convergence. Note that G̃_n = W_n + H_n = G_n + H_n + o_P(1), and that G_n and H_n are each tight, so G̃_n is tight, too. Further, G_n and H_n have continuous limits by Theorem 1(ii) and Lemma 2(ii). Thus, if the finite-dimensional distributions of G_n + H_n have weak limits, then G̃_n must weakly converge to the Gaussian process G̃ with the covariance structure (8). We may apply the Lindeberg-Lévy CLT to show this unless G_n + H_n ≡ 0 almost surely. That is, for each (p, s) with s ≠ 0, E[G_n(p, s) + H_n(p, s)] = 0, and E[{G_n(p, s) + H_n(p, s)}^2] = E[G_n(p, s)^2] − E[H_n(p, s)^2] + o(1) ≤ E[G_n(p, s)^2] + o(1) = s^2p^2(1 − s)^2(1 − p)/[{1 − s^2(1 − p)}{1 − s(1 − p)}^2] + o(1), which is uniformly bounded, so that for each (p, s) with s ≠ 0, the sufficiency conditions for the Lindeberg-Lévy CLT hold. The first equality above follows by applying Lemma 2(ii). If s = 0, then G_n(·, 0) + H_n(·, 0) ≡ 0, so that the probability limit of G_n(·, 0) + H_n(·, 0) is zero. Given these, the finite-dimensional weak convergence of G̃_n now follows from the Cramér-Wold device. Next, note that G̃_n is asymptotically independent of H_n because E[G̃_n(p, s)H_n(p′, s′)] = E[W_n(p, s)H_n(p′, s′)] + E[H_n(p, s)H_n(p′, s′)] = 0 by the covariance structure given above. It follows that (G̃_n, H_n)′ ⇒ (G̃, H)′ by Example 1.4.6 of van der Vaart and Wellner (1996, p. 31). To complete the proof, take (G̃_n − H_n, H_n)′ = (W_n, H_n)′ and apply the continuous mapping theorem. □

Proof of Theorem 2 (i, ii, and iii). The proof of Lemma 2(iii) establishes that G̃_n ⇒ G̃. This and the continuous mapping theorem imply the given claims. □

Proof of Lemma 3. (i) First, as shown in the proof of Lemma 3(ii) below, F̂_n(y(·)) converges to F(y(·)) in probability uniformly on I, where for each p, y(p) := inf{x ∈ ℝ : F(x) ≥ p}. Second, G_n ⇒ G by Theorem 1(ii). Third, (D(J × S) × D(J)) is a separable space. Therefore, it follows that (G_n, F̂_n(y(·))) ⇒ (G, F(y(·))) by Theorem 3.9 of Billingsley (1968/1999). Fourth, G ∈ C(J × S). Finally, G̈_n(·, ·) = G_n(F̂_n(y(·)), ·), so that G̈_n(·, ·) − G_n(·, ·) = G_n(F̂_n(y(·)), ·) − G_n(·, ·) ⇒ G − G = 0, where the weak convergence follows from the lemma of Billingsley (1968/1999, p. 151). Thus, sup_{(p,s)} |G̈_n(p, s) − G_n(p, s)| = o_P(1).
(ii) First, the definition of Ḧ_n(p, s) permits the representation Ḧ_n(p, s) = {M̂_n(p)/n}·√n[sp̂_n/{1 − s(1 − p̂_n)} − sp/{1 − s(1 − p)}]. Second, it follows that √n[sp̂_n/{1 − s(1 − p̂_n)} − sp/{1 − s(1 − p)}] ⇒ −s(1 − s)B^{00}(p)/{1 − s(1 − p)}^2 by Lemma 2(ii) and Theorem 5(i) below. Third, if F̂_n(·) converges to F(·) in probability, then n^{−1}M̂_n(p) converges to p in probability uniformly in p, because for each p, M̂_n(p) is defined as ∑_{t=1}^{n} 1{F̂_n(Ŷ_t) ≤ p}. These facts imply that sup_{(p,s)} |Ḧ_n(p, s) − H_n(p, s)| = o_P(1) by the lemma of Billingsley (1968/1999, p. 151); this completes the proof. Therefore, we only have to show that F̂_n(·) converges to F(·) in probability; for this we exploit Glivenko-Cantelli. That is, if for each p, F̂_n(y(p)) converges to F(y(p)) in probability, then the
uniform convergence follows from the properties of the empirical distribution function: boundedness, monotonicity, and right continuity. Thus, the pointwise convergence of F̂_n completes the proof. We proceed as follows. First, letting y = y(p) for notational simplicity, for each y and any ε_1 > 0, we have {ω ∈ Ω : h_t(θ̂_n) < y} ⊂ {ω ∈ Ω : h_t(θ^*) < y + |h_t(θ̂_n) − h_t(θ^*)|} = {ω ∈ Ω : h_t(θ^*) < y + |h_t(θ̂_n) − h_t(θ^*)|, |h_t(θ̂_n) − h_t(θ^*)| < ε_1} ∪ {ω ∈ Ω : h_t(θ^*) < y + |h_t(θ̂_n) − h_t(θ^*)|, |h_t(θ̂_n) − h_t(θ^*)| ≥ ε_1} ⊂ {ω ∈ Ω : h_t(θ^*) < y + ε_1} ∪ {|h_t(θ̂_n) − h_t(θ^*)| ≥ ε_1}. Second, for the same y and ε_1, {ω ∈ Ω : h_t(θ̂_n) < y} ⊃ {ω ∈ Ω : h_t(θ^*) < y − ε_1} \ {ω ∈ Ω : |h_t(θ^*) − h_t(θ̂_n)| > ε_1}. These two facts imply that
\[ n^{-1}\sum_{t=1}^{n} 1_{\{h_t(\theta^*) < y-\varepsilon_1\}} - n^{-1}\sum_{t=1}^{n} 1_{\{|h_t(\theta^*)-h_t(\hat\theta_n)| \ge \varepsilon_1\}} \le n^{-1}\sum_{t=1}^{n} 1_{\{h_t(\hat\theta_n) < y\}} \le n^{-1}\sum_{t=1}^{n} 1_{\{h_t(\theta^*) < y+\varepsilon_1\}} + n^{-1}\sum_{t=1}^{n} 1_{\{|h_t(\hat\theta_n)-h_t(\theta^*)| \ge \varepsilon_1\}}. \]
Next, for any δ > 0 and ε_2 > 0, there is an n^* such that if n > n^*, then P(n^{−1}∑_{t=1}^{n} 1{|h_t(θ^*) − h_t(θ̂_n)| > ε_1} ≥ δ) ≤ ε_2. This follows because P(n^{−1}∑_{t=1}^{n} 1{|h_t(θ^*) − h_t(θ̂_n)| > ε_1} ≥ δ) ≤ (nδ)^{−1}∑_{t=1}^{n} E(1{|h_t(θ^*) − h_t(θ̂_n)| > ε_1}) = (δn)^{−1}∑_{t=1}^{n} P(|h_t(θ^*) − h_t(θ̂_n)| > ε_1) ≤ ε_2, where the first inequality follows from the Markov inequality, and the last inequality follows from the fact that |h_t(θ^*) − h_t(θ̂_n)| ≤ M_t‖θ̂_n − θ^*‖ = o_P(1) uniformly in t by A2 and A3(ii). It follows that for any ε_1 > 0, F(y − ε_1) + o_P(1) ≤ F̂_n(y) ≤ F(y + ε_1) + o_P(1). As ε_1 may be chosen arbitrarily small, F̂_n(y) converges to F(y) in probability, as desired. □
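The convergence just established can be visualized with a small simulation (ours, for illustration only): the empirical CDF of estimated residuals tracks the true error CDF uniformly as n grows.

```python
import numpy as np
from math import erf, sqrt

def norm_cdf(x):
    return 0.5 * (1 + erf(x / sqrt(2)))

rng = np.random.default_rng(0)
for n in (100, 1000, 10000):
    z = rng.standard_normal(n)
    y = 1.0 + 2.0 * z + rng.standard_normal(n)        # Y_t = alpha + beta Z_t + v_t
    X = np.column_stack([np.ones(n), z])
    a, b = np.linalg.lstsq(X, y, rcond=None)[0]
    v_hat = y - a - b * z                             # estimated residuals
    grid = np.linspace(-3, 3, 601)
    F_hat = (v_hat[:, None] <= grid[None, :]).mean(axis=0)
    F_true = np.array([norm_cdf(g) for g in grid])
    print(n, round(np.abs(F_hat - F_true).max(), 3))  # sup-distance shrinks
```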
Proof of Theorem 3 (i, ii, and iii). Ĝ_n = G̈_n + Ḧ_n = G_n + H_n + o_P(1) by Lemma 3(i) and (ii). Further, G_n + H_n = G̃_n ⇒ G̃ by Theorem 2(ii). Thus, Ĝ_n ⇒ G̃, which, together with the continuous mapping theorem, implies the desired result. □

Appendix B

The following lemmas collect further supplementary claims needed to prove the weak convergence of the EGR test statistics under the local alternative. As before, we use the notation p = F(y) for brevity and suppose that R_{n,i}(p) is defined by observations starting from Y_{n,t+1}, unless otherwise noted.
Lemma B.1. Given conditions A1, A2(i), A3, A5, and H_1^ℓ:
(i) for each y and k = 2, 3, …, E[1{Y_{n,t+k} < y} | F_{n,t}] = F(y) + ∑_{j=1}^{k−1} n^{−j/2}Q_j(y) + n^{−k/2}G_k(y, Y_{n,t}), where for each j = 1, 2, …, Q_j(y) := ∫⋯∫ D(y, x_1) dD(x_1, x_2) ⋯ dD(x_{j−1}, x) dF(x), and G_k(y, Y_{n,t}) := ∫⋯∫ D(y, x_1) dD(x_1, x_2) ⋯ dD(x_{k−2}, x_{k−1}) dD(x_{k−1}, Y_{n,t});
(ii) for each y, F̄_n(y) = F(y) + ∑_{j=1}^{∞} n^{−j/2}Q_j(y), and Q_1(y) = Q(y);
(iii) for each y and k = 1, 2, …, E[J̃_{n,t+k}(y)J̃_{n,t}(y)] = O(n^{−k/2}), where J̃_{n,t}(y) := 1{Y_{n,t} < y} − F̄_n(y).

Lemma B.2. Given conditions A1, A2(i), A3, A5, and H_1^ℓ, for each y, n^{1/2}{F̄_n(y) − F(y)} is asymptotically distributed as N(Q(y), F(y){1 − F(y)}).

In what follows, we assume that p ∈ J, unless otherwise noted.

Lemma B.3. Given conditions A1, A2(i), A3, A5, and H_1^ℓ:
(i) sup_y |α(y)| ≤ ∆, where α(y) := ∫_y^∞ D(y, x) dF(x);
(ii) for each p and k = 1, 2, …, (15) holds, and r_k(p, Y_{n,t}) = O_P(n^{−1}) uniformly in p;
(iii) for each p and k = 1, 2, …, (16) holds, and r̄_{n,k}(p) = O(n^{−1}) uniformly in p;
(iv) for each k = 1, 2, …, h̄_{n,k}(p) → h_k(p) and r̄_{n,k}(p) → r_k(p) uniformly in p a.s.-P;
(v) for each p ∈ I such that p > 1/n, if we let p̃_n := F(q̂_n(p)), then
\[ P(R_{n,i}(p) = k \mid \tilde p_n) = (1-\tilde p_n)^{k-1}\tilde p_n + n^{-1/2}\,\frac{\bar h_{n,k}(\tilde p_n)}{\tilde F_n(F^{-1}(\tilde p_n))} + \frac{\bar r_{n,k}(\tilde p_n)}{\tilde F_n(F^{-1}(\tilde p_n))}; \tag{19} \]
(vi) for each (p, s) ∈ I × S such that p > 1/n,
\[ \sqrt n\Big(E[s^{R_{n,i}(p)} \mid \tilde p_n] - \frac{s\tilde p_n}{1-s(1-\tilde p_n)}\Big) = \frac{\nu(\tilde p_n, s)}{\tilde F_n(F^{-1}(\tilde p_n))} + o_P(1), \tag{20} \]
where ν(p, s) := ps^2(1 − s)w(p)/{1 − s(1 − p)}^2 + [s(1 − s)/{1 − s(1 − p)}] ∫_{−∞}^{F^{−1}(p)} C(p, y) dF(y).

Lemma B.4. Given conditions A1, A2(i), A3, A5, and H_1^ℓ, if R_{n,i}(p) is defined by observations starting from Y_{n,t+1} and p > 1/n, then:
(i) for each p and k = 1, 2, …,
\[ P(R_{n,i}(p) = k \mid \mathcal F_{n,t}) = p(1-p)^{k-1} + n^{-1/2}p^{-1}h_k(p) + p^{-1}r_k(p) + n^{-1}b_k(p, Y_{n,t-1}) + o_P(n^{-1}), \tag{21} \]
where b_k(p, Y_{n,t−1}) := p^{−1} ∫_{−∞}^{y} h_k(p, x) dD(x, Y_{n,t−1}) − p^{−2}h_k(p)D(y, Y_{n,t−1});
(ii) for each p and k = 1, 2, …, P(R_{n,i}(p) = k) = p(1 − p)^{k−1} + n^{−1/2}p^{−1}h_k(p) + p^{−1}r_k(p) + n^{−1}b̄_k(p) + o(n^{−1}), where b̄_k(p) := ∫ b_k(p, z) dF(z).

Lemma B.5. Given conditions A1, A2(i), A3, A5, and H_1^ℓ, if R_{n,i}(p) is defined by observations starting from Y_{n,t+1} and p > 1/n, then:
(i) for each p and k, m = 1, 2, …, P(R_{n,i}(p) = k | F_{n,t−m}) − P(R_{n,i}(p) = k) = n^{−(m+1)/2}B_{k,m}(p, Y_{n,t−m}) + o_P(n^{−(m+1)/2}), where B_{k,1}(p, Y_{n,t−1}) := b_k(p, Y_{n,t−1}) − b̄_k(p), and, for m = 2, 3, …, B_{k,m}(p, Y_{n,t−m}) := ∫⋯∫ b_k(p, z) dD(z, x_1) ⋯ dD(x_{m−2}, Y_{n,t−m}) − ∫⋯∫ b_k(p, z) dD(z, x_1) ⋯ dD(x_{m−2}, x) dF(x);
(ii) for each p and k, ℓ, m = 1, 2, …, P(R_{n,i}(p) = k | R_{n,i−m}(p) = ℓ) = P(R_{n,i}(p) = k) + O_P(n^{−(m+1)/2});
(iii) for each p and k, ℓ, m = 1, 2, …, P(R_{n,i}(p) = k, R_{n,i−m}(p) = ℓ) = P(R_{n,i}(p) = k)P(R_{n,i−m}(p) = ℓ) + O(n^{−(m+1)/2}).

Lemma B.6. Given conditions A1, A2(i), A3, A5, and H_1^ℓ, for each (p, s):
(i) E[W_n(p, s) | p̃_n] = ν(p, s) + o_P(1);
(ii) E[H_n(p, s) | p̃_n] = −ps(1 − s)Q(F^{−1}(p))/{1 − s(1 − p)}^2 + o_P(1);
(iii) E[W_n(p, s)^2 | p̃_n] = p^2s^2(1 − p)(1 − s)^2/[{1 − s(1 − p)}^2{1 − s^2(1 − p)}] + ν(p, s)^2 + o_P(1);
(iv) E[H_n(p, s)^2 | p̃_n] = p^3s^2(1 − p)(1 − s)^2/{1 − s(1 − p)}^4 + p^2s^2(1 − s)^2Q(F^{−1}(p))^2/{1 − s(1 − p)}^4 + o_P(1).

Lemma B.7. Given conditions A1, A2(i), A3, A5, and H_1^ℓ, for each (p, s):
(i) W_n(p, s) − W̃_n(p, s) = ν(p, s) + o_P(1), where W̃_n(p, s) := n^{−1/2} ∑_{i=1}^{M̃_n(p)} (s^{R_{n,i}(p)} − E[s^{R_{n,i}(p)} | p̃_n]);
(ii) E[S_{n,i}(p, s) | F_{n,t−m}] = O_P(n^{−(m+1)/2}), where S_{n,i}(p, s) := s^{R_{n,i}(p)} − E[s^{R_{n,i}(p)}], and R_{n,i}(p) is the run defined by observations starting from Y_{n,t+1}.
Lemma B.8. Given conditions A1, A2(i), A3, A5, and H_1^ℓ, for each (p, s),
\[ \hat G_n(p,s) \stackrel{A}{\sim} N\Big(\mu(p,s),\; \frac{s^2p^2(1-s)^4(1-p)^2}{\{1-s(1-p)\}^4\{1-s^2(1-p)\}}\Big). \]
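The variance in Lemma B.8 is consistent with the null covariance structures derived above. The following algebra (an editor's check, writing q := 1 − p, and assuming the expressions for E[G(p, s)^2] and E[H(p, s)^2] given earlier) shows that their difference collapses to the displayed variance:

```latex
\[
E[G(p,s)^2] - E[H(p,s)^2]
  = \frac{s^2 p^2 (1-s)^2 q}{\{1-sq\}^2\{1-s^2 q\}}
    - \frac{s^2 p^3 (1-s)^2 q}{\{1-sq\}^4}
  = \frac{s^2 p^2 (1-s)^2 q\,\big[\{1-sq\}^2 - p\{1-s^2 q\}\big]}
         {\{1-sq\}^4\{1-s^2 q\}}
  = \frac{s^2 p^2 (1-s)^4 q^2}{\{1-sq\}^4\{1-s^2 q\}},
\]
\[
\text{since } \{1-sq\}^2 - p\{1-s^2 q\}
  = q - 2sq + s^2 q(q+p) = q(1-s)^2 .
\]
```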
Lemma B.9. Given conditions A1, A2(i), A3, A5, and H_1^ℓ, (i) W_n − ν ⇒ G; and (ii) H_n − (µ − ν) ⇒ H.

Remark. (a) For brevity, we omit deriving the asymptotic covariance structure of W_n and H_n under H_1^ℓ, as this can be obtained in a manner similar to that used to obtain the asymptotic variance of Ĝ_n(p, s). (b) Given the fact that G and H are in C(J × S), they are tight, so that Lemma 1.3.8(ii) of van der Vaart and Wellner (1996, p. 21) implies that W_n and H_n are tight.

Proof of Theorem 6. Given the weak convergence in Lemma B.8, the desired result follows from the tightness implied by Lemma B.9(i) (see Remark (b)) and the fact that (W_n, H_n)′ is tight by Lemma 1.4.3 of van der Vaart and Wellner (1996, p. 30). □

References

Andrews, D., 1993. Tests for parameter instability and structural change with unknown change point. Econometrica 61, 821-856.
Andrews, D., 2001. Testing when a parameter is on the boundary of the maintained hypothesis. Econometrica 69, 683-734.
Andrews, D., Ploberger, W., 1994. Optimal tests when a nuisance parameter is present only under the alternative. Econometrica 62, 1383-1414.
Bai, J., 1996. Testing for parameter constancy in linear regressions: an empirical distribution function approach. Econometrica 64, 597-622.
Bickel, P., Wichura, M., 1971. Convergence criteria for multiparameter stochastic processes and some applications. Annals of Mathematical Statistics 42, 1656-1670.
Bierens, H., 1982. Consistent model specification tests. Journal of Econometrics 20, 105-134.
Bierens, H., 1990. A consistent conditional moment test of functional form. Econometrica 58, 1443-1458.
Bierens, H., Ploberger, W., 1997. Asymptotic theory of integrated conditional moment tests. Econometrica 65, 1129-1151.
Billingsley, P., 1968/1999. Convergence of Probability Measures. Wiley, New York.
Brett, C., Pinkse, J., 1997. Those taxes are all over the map! A test for spatial independence of municipal tax rates in British Columbia. International Regional Science Review 20, 131-151.
Brown, R., Durbin, J., Evans, J., 1975. Techniques for testing the constancy of regression relationships over time. Journal of the Royal Statistical Society, Series B 37, 149-163.
Chan, N.-H., Chen, J., Chen, X., Fan, Y., Peng, L., 2009. Statistical inference for multivariate residual copula of GARCH models. Statistica Sinica 10, 53-70.
Chen, X., Fan, Y., 2006. Estimation and model selection of semiparametric copula-based multivariate dynamic model under copula misspecification. Journal of Econometrics 135, 125-154.
Chu, J., Hornik, K., Kuan, C.-M., 1995a. MOSUM tests for parameter constancy. Biometrika 82, 603-617.
Chu, J., Hornik, K., Kuan, C.-M., 1995b. The moving-estimates test for parameter stability. Econometric Theory 11, 699-720.
Crainiceanu, C., Vogelsang, T., 2007. Nonmonotonic power for tests of a mean shift in a time series. Journal of Statistical Computation and Simulation 77, 457-476.
Darling, D., 1955. The Cramér-Smirnov test in the parametric case. Annals of Mathematical Statistics 28, 823-838.
Davies, R., 1977. Hypothesis testing when a nuisance parameter is present only under the alternative. Biometrika 64, 247-254.
Davydov, Y., 1973. Mixing conditions for Markov chains. Theory of Probability and its Applications 18, 312-328.
Delgado, M., 1996. Testing serial independence based on the empirical distribution function. Journal of Time Series Analysis 17, 271-286.
Deng, A., Perron, P., 2008. A non-local perspective on the power properties of the CUSUM and CUSUM of squares tests for structural change. Journal of Econometrics 142, 212-240.
Diebold, F., Gunther, T., Tay, A., 1998. Evaluating density forecasts with applications to financial risk management. International Economic Review 39, 863-883.
Dodd, E., 1942. Certain tests for randomness applied to data grouped into small sets. Econometrica 10, 249-257.
Donsker, M., 1951. An invariance principle for certain probability limit theorems. Memoirs of the American Mathematical Society 6.
Dufour, J.-M., 1981. Rank tests for serial dependence. Journal of Time Series Analysis 2, 117-128.
Durbin, J., 1973. Weak convergence of the sample distribution function when parameters are estimated. Annals of Statistics 1, 279-290.
Fama, E., 1965. The behavior of stock market prices. Journal of Business 38, 34-105.
Fan, Y., Li, Q., 2000. Kernel-based tests versus Bierens' ICM tests. Econometric Theory 16, 1016-1041.
Feller, W., 1951. The asymptotic distribution of the range of sums of independent random variables. Annals of Mathematical Statistics 22, 427-432.
Goodman, L., 1958. Simplified runs tests and likelihood ratio tests for Markov chains. Biometrika 45, 181-197.
Granger, C., 1963. A quick test for serial correlation suitable for use with nonstationary time series. Journal of the American Statistical Association 58, 728-736.
Grenander, U., 1981. Abstract Inference. Wiley, New York.
Hallin, M., Ingenbleek, J.-F., Puri, M., 1985. Linear serial rank tests for randomness against ARMA alternatives. Annals of Statistics 13, 1156-1181.
Heckman, J., 2001. Micro data, heterogeneity, and the evaluation of public policy: Nobel lecture. Journal of Political Economy 109, 673-746.
Henze, N., 1996. Empirical distribution function goodness-of-fit tests for discrete models. Canadian Journal of Statistics 24, 81-93.
Hong, Y., 1999. Hypothesis testing in time series via the empirical characteristic function: a generalized spectral density approach. Journal of the American Statistical Association 94, 1201-1220.
Hong, Y., White, H., 2005. Asymptotic distribution theory for nonparametric entropy measures of serial dependence. Econometrica 73, 837-901.
Jain, N., Kallianpur, G., 1970. Norm convergent expansions for Gaussian processes in Banach spaces. Proceedings of the American Mathematical Society 25, 890-895.
Jogdeo, K., 1968. Characterizations of independence in certain families of bivariate and multivariate distributions. Annals of Mathematical Statistics 39, 433-441.
Karlin, S., Taylor, H., 1975. A First Course in Stochastic Processes. Academic Press, San Diego.
Kocherlakota, S., Kocherlakota, K., 1986. Goodness-of-fit tests for discrete distributions. Communications in Statistics: Theory and Methods 15, 815-829.
Krivyakov, E., Martynov, G., Tyurin, Y., 1977. On the distribution of the ω² statistics in the multi-dimensional case. Theory of Probability and its Applications 22, 406-410.
Kuan, C.-M., Hornik, K., 1995. The generalized fluctuation test: a unifying view. Econometric Reviews 14, 135-161.
Loève, M., 1978. Probability Theory II. Springer-Verlag, New York.
Mood, A., 1940. The distribution theory of runs. Annals of Mathematical Statistics 11, 367-392.
Phillips, P., 1998. New tools for understanding spurious regressions. Econometrica 66, 1299-1326.
Pinkse, J., 1998. Consistent nonparametric testing for serial independence. Journal of Econometrics 84, 205-231.
Ploberger, W., Krämer, W., 1992. The CUSUM test with OLS residuals. Econometrica 60, 271-285.
Ploberger, W., Krämer, W., Kontrus, K., 1989. A new test for structural stability in the linear regression model. Journal of Econometrics 40, 307-318.
Robinson, P., 1991. Consistent nonparametric entropy-based testing. Review of Economic Studies 58, 437-453.
Rosenblatt, M., 1952. Remarks on a multivariate transform. Annals of Mathematical Statistics 23, 470-472.
Rueda, R., Pérez-Abreu, V., O'Reilly, F., 1991. Goodness-of-fit test for the Poisson distribution based on the probability generating function. Communications in Statistics: Theory and Methods 20, 3093-3110.
Sen, P., 1980. Asymptotic theory of some tests for a possible change in the regression slope occurring at an unknown time point. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 52, 203-218.
Skaug, H., Tjøstheim, D., 1996. Measures of distance between densities with application to testing for serial independence. In: Robinson, P., Rosenblatt, M. (Eds.), Time Series Analysis in Memory of E.J. Hannan. Springer-Verlag, New York, pp. 363-377.
Stinchcombe, M., White, H., 1998. Consistent specification testing with nuisance parameters present only under the alternative. Econometric Theory 14, 295-324.
Sukhatme, P., 1972. Fredholm determinant of a positive definite kernel of a special type and its applications. Annals of Mathematical Statistics 43, 1914-1926.
van der Vaart, A., Wellner, J., 1996. Weak Convergence and Empirical Processes with Applications to Statistics. Springer-Verlag, New York.
Wald, A., Wolfowitz, J., 1940. On a test whether two samples are from the same population. Annals of Mathematical Statistics 11, 147-162.
Journal of Econometrics 162 (2011) 345–361
Bayesian inference in a correlated random coefficients model: Modeling causal effect heterogeneity with an application to heterogeneous returns to schooling

Mingliang Li a,∗, Justin L. Tobias b
a Department of Economics, SUNY-Buffalo, United States
b Department of Economics, Purdue University, United States

Article history: Received 4 July 2009; Received in revised form 9 September 2010; Accepted 8 February 2011; Available online 25 February 2011.
Abstract. We consider the problem of causal effect heterogeneity from a Bayesian point of view. This is accomplished by introducing a three-equation system, similar in spirit to the work of Heckman and Vytlacil (1998), describing the joint determination of a scalar outcome, an endogenous ''treatment'' variable, and an individual-specific causal return to that treatment. We describe a Bayesian posterior simulator for fitting this model which recovers far more than the average causal effect in the population, the object which has been the focus of most previous work. Parameter identification and generalized methods for flexibly modeling the outcome and return heterogeneity distributions are also discussed. Combining data sets from High School and Beyond (HSB) and the 1980 Census, we illustrate our methods in practice and investigate heterogeneity in returns to education. Our analysis decomposes the impact of key HSB covariates on log wages into three parts: a ''direct'' effect and two separate indirect effects through educational attainment and returns to education. Our results strongly suggest that the quantity of schooling attained is determined, at least in part, by the individual's own return to education. Specifically, a one percentage point increase in the return to schooling parameter is associated with the receipt of (approximately) 0.14 more years of schooling. Furthermore, when we control for variation in returns to education across individuals, we find no difference in predicted schooling levels for men and women. However, women are predicted to attain approximately 1/4 of a year more schooling than men on average as a result of higher rates of return to investments in education. © 2011 Elsevier B.V. All rights reserved.
1. Introduction A substantial volume of work in economics and statistics has been devoted to the issue of identifying and estimating the causal impact of an endogenous or ‘‘treatment’’ variable. The endogeneity issue in these models arises when the level or intensity of treatment is not randomly assigned, but, instead, is selected by the individual. This self-selection problem is the defining feature of observational data, as those agents observed to choose high levels of treatment are likely to possess observed and unobserved characteristics that differ from those choosing lower levels of treatment. This aspect of the problem generates a well-known bias and inconsistency in standard estimators that fail to account for the endogeneity of treatment levels.1
∗
Corresponding author. Tel.: +1 716 645 2121; fax: +1 716 645 2127. E-mail addresses: [email protected] (M. Li), [email protected] (J.L. Tobias).
1 In our paper, we continue to use the word ‘‘treatment’’ and use it in reference to the endogenous right-hand side outcome variable. In most cases in the literature, however, ‘‘treatment’’ refers to variables that are binary, or perhaps discrete, while here we will consider a continuous treatment outcome. Furthermore, ‘‘treatment effects’’ are typically defined in the context of a binary treatment outcome in 0304-4076/$ – see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.jeconom.2011.02.006
Recent work has also emphasized the importance of addressing causal effect heterogeneity – that individual agents may have different returns to treatment – and providing the proper interpretation of traditional estimators in the presence of such heterogeneity. For example, the important work of Imbens and Angrist (1994) shows that, under suitable conditions in a binary treatment context, the standard instrumental variable (IV) technique recovers the local average treatment effect (LATE), a treatment impact for a subpopulation of ‘‘compliers’’ whose behavior can be manipulated by an instrument. Whether or not this LATE parameter is inherently an object of interest is necessarily
a potential outcomes framework that explicitly models both the treated and untreated states. We work in this paper with a simplified model of observed outcomes only, which, arguably, is the most common model used in applied work. We will refer to the individual-specific return to treatment as a causal effect, causal impact, or ‘‘return’’, and derive procedures for characterizing many properties of the distribution of causal effects in the population. Conventional treatment effects, such as ATE, TT, and LATE, are based upon a more general representation of the model with a binary treatment, and thus are not directly comparable to those considered here. Our framework, instead, is based upon the observed outcomes representation as discussed in Wooldridge (1997), Heckman and Vytlacil (1998), and Wooldridge (2003), among others.
346
M. Li, J.L. Tobias / Journal of Econometrics 162 (2011) 345–361
an application and instrument-specific question, and, as such, researchers sometimes focus on methods that enable recovery of the average treatment effect (ATE), or average causal effect, in the presence of treatment effect heterogeneity. Notable studies in this regard, based upon observed outcomes models like the one we consider here, include those of Wooldridge (1997), Heckman and Vytlacil (1998), and Wooldridge (2003), who introduce assumptions and procedures under which the average causal effect can be consistently estimated when treatment returns are heterogeneous and potentially correlated with treatment levels. In a similar vein, Heckman and Vytlacil (1999, 2005) and others describe the marginal treatment effect (MTE) (e.g., Björklund and Moffitt, 1987, Heckman, 1997, and Heckman and Smith, 1999) as a type of unifying parameter which, when properly integrated, can be used to calculate all of the conventional mean binary treatment effect parameters, including ATE, LATE, and the effect of treatment on the treated (TT), thereby capturing important aspects of treatment effect heterogeneity.

In this paper, we take up the issue of causal effect heterogeneity in a similar spirit to the work mentioned above, but choose to address this issue via a Bayesian approach. Specifically, we consider Bayesian estimation in a variant of the correlated random coefficient (CRC) model, similar to that considered by Wooldridge (1997), Heckman and Vytlacil (1998) and Wooldridge (2003). To be sure, our study is certainly not the first such effort, and, indeed, a sizeable Bayesian literature has evolved for the estimation of treatment–response models with observational data.2 Early efforts in this regard primarily focused on the Markov chain Monte Carlo (MCMC) implementation (e.g., Koop and Poirier, 1997 and Chib and Hamilton, 2000) and included some discussion of recovering individual-level treatment impacts within a potential outcomes framework.3 More recent work has focused on problems associated with weak instruments generally, has discussed priors that yield posteriors similar to sampling distributions for the two-stage least squares (2SLS) and limited information maximum likelihood (LIML) estimators (e.g., Kleibergen and Zivot, 2003), has introduced a non-parametric modeling of outcomes via a Dirichlet process prior (Conley et al., 2008), and has obtained new results associated with the seminal Angrist and Krueger (1991) study (e.g., Hoogerheide et al., 2007).4

Our model of interest consists of a three-equation system describing the joint determination of an observed scalar outcome variable, an endogenous ''treatment'' variable, and an individual-specific causal effect parameter. The novelty of our approach is that we directly model the process generating the individual-specific causal effect and thus can calculate any statistic of interest (such as return percentiles or the probability of a positive treatment impact) associated with the causal effect heterogeneity distribution. Of course, our ability to do this stems from particular parametric assumptions made regarding
2 Important examples of this work include Koop and Poirier (1997), Li (1998), Chib and Hamilton (2000, 2002), Poirier and Tobias (2003), and Chib (2007). Li et al. (2003), Munkin and Trivedi (2003), and Deb et al. (2006) provide applications of these methods. 3 That is, these models explicitly consider outcomes in the treated and untreated states together with the treatment decision. Chib and Hamilton (2002), for example, point out the possibility of learning about individual-level treatment effects, while Koop and Poirier (1997) and Poirier and Tobias (2003) discuss the potential for learning about outcome gain distributions and the cross-regime correlation parameter. In recent work, Chib (2007) argues in favor of avoiding explicit modeling of the counterfactual, owing to concerns associated with modeling the nonidentifiable cross-regime correlation parameter. We follow in a similar spirit of working with observed outcomes in the present paper. 4 The Bayesian approach to this model has also received considerable attention in recent textbooks, including Lancaster (2004, Chapter 8), Rossi et al. (2005, Chapter 7), and Koop et al. (2007, pp. 223–236).
the heterogeneity distribution, and, to this end, we describe methods that enable a flexible representation of this distribution. In addition, and similar in spirit to the income maximization presumption of the Roy (1951) model, agents can potentially choose the amount of treatment based on their own knowledge of the return to such treatment. Provided sufficient sources of exogenous variation are available, we show that this presumption becomes empirically testable.

Within the Bayesian literature, the structure of our three-equation system seems rather similar in spirit to the innovative work of Manchanda et al. (2004).5 In this paper, the authors are interested in providing a joint description of ''detailing'' efforts made by drug companies and the number of physician prescriptions, noting that the decision to detail particular physicians may depend, at least in part, on the responsiveness of that physician to the detailing effort (i.e., how many more prescriptions he or she will write as a consequence of being detailed). Our model seeks to address a similar problem to that considered by Manchanda et al. (2004), though in our case the standard treatment–response framework, which must contend with problems such as confounding on unobservables, is generalized to allow treatment levels to be selected based on the ''returns'' to treatment. Similarly, Conley et al. (2008, section 2.5) discuss the possibility of employing a Dirichlet process prior to simultaneously allow for a nonparametric distribution of outcomes and heterogeneous treatment impacts. Despite its flexibility, this specification does not explicitly model the potential structural dependence of the endogenous treatment variable on the return to treatment, and thus differs from the specification considered here.

In some sense, one might regard our efforts in this endeavor as a step back relative to the existing classical literature, in light of the fact that we need to make specific distributional assumptions whereas others (e.g., Wooldridge, 2003) only require a few moment conditions to be satisfied. While our assumptions are clearly stronger than those typically made in these types of analysis, we argue that the benefits afforded by such assumptions may warrant their adoption: we are able to identify all parameters of our model (provided satisfactory exclusion restrictions exist) and expand our focus to directly model the entire treatment effect distribution. We develop an efficient posterior simulator for fitting our model and illustrate in generated data experiments that it mixes well and performs adequately in recovering parameters of the data-generating process. In addition, we carefully discuss the conditions required for parameter identification and methods for relaxing normality, thereby allowing for a more flexible modeling of the outcome and return heterogeneity distributions.

Finally, we employ our methods in a real application and investigate the issue of heterogeneity in the economic returns to education, following the influential work of Card (2001). To this end, we combine data sources from the sophomore cohort of the High School and Beyond (1992) Survey and the 1980 Census. We show in our paper that successful identification of our model's parameters requires the availability of some variable that has a structural impact on individual-level returns to treatment, but remains conditionally uncorrelated with our outcome variable and the level of treatment received.
In this regard we first use 1980 Census data to calculate county-level average returns to education. The lagged county-level returns to schooling are then used as exogenous sources of variation which should correlate positively with the individual's (1991) private returns to education (and we find strong evidence in support of this), but are assumed to be conditionally uncorrelated with educational attainment and log wage outcomes.
5 This paper is also described in Rossi et al. (2005).
Our results suggest strong evidence that the amount of schooling attained is determined, in part, by the individual's own return to education. Specifically, a one percentage point increase in the return to schooling parameter is associated with the receipt of (approximately) 0.14 more years of education. Further, we find evidence of heterogeneity in returns to education, with females, blacks and Hispanics possessing higher returns to schooling than males and whites.

The outline of our paper is as follows. Section 2 briefly introduces the model while Section 3 discusses identification, strategies for posterior simulation, and how learning takes place regarding the causal effect heterogeneity parameters. Section 4 conducts generated data experiments while Section 5 describes the data sets involved with our application. Results of that application are presented in Section 6, and the paper concludes with a summary in Section 7. The Appendix provides technical details regarding identification.

2. The model

The model we consider is a three-equation system as described below:6

$y_i = \beta_0 + x_i \beta + s_i \theta_i + u_i$ (1)
$s_i = \delta_0 + x_i \delta + z_i \gamma + \theta_i \rho + v_i$ (2)
$\theta_i = \eta_0 + x_i \eta + w_i \lambda + \epsilon_i.$ (3)
In Eq. (1), $y_i$ denotes the (continuous) outcome of interest and $s_i$ is a continuous and (potentially) endogenous treatment variable. The covariates in $x_i$ are common to all equations, while we also allow instrumental variables $z_i$ and $w_i$ to appear in (2) and (3), respectively.7 Consistent with a majority of recent work in this area, we do not wish to impose identical treatment returns for each agent, and explicitly allow for heterogeneous causal impacts. The private return to $s_i$ in our model is denoted as $\theta_i$, and the notation makes clear that the return can differ across individuals. Eq. (3) then relates the (unobserved) return $\theta_i$ to observables $w_i$ and $x_i$. That is, individual-specific returns to education may potentially depend on things like the ability (i.e., test score) of the agent as well as other demographic characteristics. Finally, in (2), we allow the quantity of the endogenous variable $s_i$ to depend, at least in part, on the return to treatment $\theta_i$. That is, economic agents who may have knowledge of their private return to education, for example, may potentially choose the amount of schooling based on this knowledge. A related assumption appears frequently in close variants of this model; the Roy (1951) model, for example, is based on income maximization and posits that individuals select into binary ''treatment'' based on their economic gain from doing so.

3. Identification and model generality

It may not be immediately clear that the parameters of the system above are identifiable, or what conditions might be required in order to achieve identification. To begin our discussion of these issues, we first consider the likelihood $p(y, s \mid \Gamma_{-\theta})$ under the assumption of jointly normal errors with unrestricted covariance matrix $\Sigma$:8
$$\begin{bmatrix} u_i \\ v_i \\ \epsilon_i \end{bmatrix} \Bigg|\, x_i, w_i, z_i \;\overset{i.i.d.}{\sim}\; N\left( \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} \sigma_y^2 & \sigma_{ys} & \sigma_{y\theta} \\ \sigma_{ys} & \sigma_s^2 & \sigma_{s\theta} \\ \sigma_{y\theta} & \sigma_{s\theta} & \sigma_\theta^2 \end{bmatrix} \right) \equiv N(0, \Sigma). \quad (4)$$

6 Bold script is used to denote vectors and matrices, while capitals are used for matrices.
7 We consider only a scalar treatment variable in this analysis, noting that others, such as Wooldridge (2003), allow for multiple endogenous variables. Extension to the multivariate case is possible and reasonably straightforward, but is not considered here.
8 Here, $\Gamma_{-\theta}$ denotes all parameters other than $\theta = [\theta_1 \; \theta_2 \; \cdots \; \theta_n]'$, as the heterogeneity terms are to be integrated out.
Details regarding parameter identification under this assumption are described in the Appendix, yet it is worth noting here the main results of this exercise. In a sense, the most important exclusion restriction for identification purposes turns out to be the set of variables $w$, as their presence enables full identification of the model parameters. Specifically, if no ''traditional'' instrument $z$ is available (i.e., $\gamma = 0$), then the system in (1)–(3) is still identified, provided that $\rho \neq 0$ and $w$ appears only in (3). If $z$ is additionally available in (2), as previously written in Eqs. (1)–(3), then the variables $w$ could actually be included in both (1) and (3), and the model would remain fully identified. On the other hand, if $\lambda = 0$ so that $z$, the traditional instrument, is the only excluded variable, then the model is no longer fully identified.

It may not be clear why $w$ is so important for identification purposes, and in what follows we seek to provide an intuitive explanation for this. The presence of $w$ in (3) enables exogenous shifts in the distribution of $\theta_i$ which, in turn, permits the identification of $\rho$ in (2) and subsequently all of the remaining structural parameters. In the more difficult case where $\gamma = 0$ so that $z$ is absent from the model, we obtain, upon integrating out the heterogeneity terms,

$y_i = \beta_0 + x_i \beta + s_i \eta_0 + s_i x_i \eta + s_i w_i \lambda + (s_i \epsilon_i + u_i)$
$s_i = (\delta_0 + \rho \eta_0) + x_i(\delta + \rho \eta) + w_i \rho \lambda + (\rho \epsilon_i + v_i).$

Loosely (a more formal treatment is provided in the Appendix), the coefficient on $s_i w_i$ enables us to recover an estimate of $\lambda$, and the reduced-form expression for $s_i$ enables us to estimate the product $\rho\lambda$, thus providing a means to recover $\rho$.9 When $z$ is available, but $w$ is not present, the interaction $s_i w_i$ disappears and $\rho$ is no longer separately identifiable, although $\eta_0$, interpretable as the average causal impact in the population (when $\lambda = 0$ and $x$ is standardized to be mean zero), does remain estimable.

Of course, whether or not exclusion restrictions such as $w$ and $z$ are available in practice is inevitably an application-specific question, and the degree to which such restrictions can be credibly maintained will depend, in part, on adequate conditioning data and, to no small degree, on the persuasiveness of the researcher. What emerges from our model, however, is a primary requirement that is somewhat different from the traditional instrumental variables assumption: what is most necessary for identification purposes is the existence of some variable affecting the return to treatment that is also (conditionally) uncorrelated with the outcomes of interest, and the endogenous treatment variable in particular. In what follows, we illustrate application of this model to a widely studied question in the labor economics literature: estimating the return to education and characterizing heterogeneity in schooling returns.

It is also worthwhile to pause and place our study in the context of the previous literature. From the classical perspective, numerous papers have addressed various aspects of causal effect heterogeneity, and previous efforts that seem most similar to ours
9 The first equation of this system is very similar to that described by Wooldridge (2003), who notes that IV/2SLS can be used to estimate the average causal effect η0 using w and its interactions as instruments. What is required here is that the conditional covariance between si and θi does not depend on w . This is true given the assumptions of our model. Note, however, that if treatment effect homogeneity were assumed, and standard IV were applied to (1) directly using z (or w ) to instrument for s, then one will not recover in general an estimate of the average causal impact. Further details regarding this issue are available upon request.
include those of Heckman and Vytlacil (1998) and Wooldridge (1997, 2003). In the most recent of these, Wooldridge (2003) introduces three assumptions that enable consistent estimation of the average treatment effect. His representation of the model is more general than ours, as multiple treatments are considered, no specific distributional assumptions are employed, and no explicit modeling of the treatment variable $s$ is necessary. In this sense, our model might seem to offer a step in the wrong direction, as Eqs. (1)–(3) impose considerable structure beyond what has been used in past work. In our view, however, the added structure may be a worthy investment, as it enables us to expand our focus beyond the average causal effect and learn about all aspects of the heterogeneity distribution, including learning about individual-level treatment impacts. With respect to distributional concerns, these seem decidedly more minor, as we will now replace trivariate normality with a finite mixture of Gaussian distributions,10 which provides a very flexible way to model the joint distribution of the outcome, endogenous variable, and causal effect parameter.

3.1. Posterior analysis
The normality assumption made in (4) is potentially inappropriate and often controversial, and, to this end, it is important to recognize that it can be significantly generalized. We pursue such a generalization in this and the following sections via a finite Gaussian mixture representation. We begin by writing this model in a somewhat non-traditional way as

$y_i = \beta_{0i} + x_i \beta_i + s_i \theta_i + u_i$ (5)
$s_i = \delta_{0i} + x_i \delta_i + z_i \gamma_i + \rho_i \theta_i + v_i$ (6)
$\theta_i = \eta_{0i} + x_i \eta_i + w_i \lambda_i + \epsilon_i$ (7)

and

$$[u_i \;\; v_i \;\; \epsilon_i]' \mid x_i, z_i, w_i \overset{ind}{\sim} N(0, \Sigma_i),$$

whence the density for $y_i, s_i, \theta_i \mid \cdot$ is readily available, as the associated Jacobian of the transformation from the error vector to $[y_i \;\; s_i \;\; \theta_i]'$ is unity.

The notation in (5)–(7) is quite general, as it allows for individual-specific slope and intercept parameters. The finite mixture formulation of the model adds structure to this by proposing that there are, say, $G$ distinct groups with identical parameters within each group yet different parameters across groups. Whether or not these ''groups'' have any intrinsic meaning as a discrete partitioning of the population of interest is mostly irrelevant, though, if such an interpretation can be convincingly given in a particular application, the mixture components may be afforded a specific interpretation. In most instances, mixture models are commonly employed as a flexible computational tool to allow for skew, multimodality, and heavy tails in the outcome distributions, and no specific meaning need be ascribed to the various mixture components.11

With an eye toward the implementation of our posterior simulator, we first define the parameter sets (writing a bar over component-level values to distinguish them from the individual-level parameters)

$$\phi_i = [\beta_{0i} \;\; \beta_i' \;\; \delta_{0i} \;\; \delta_i' \;\; \gamma_i \;\; \eta_{0i} \;\; \eta_i' \;\; \lambda_i]' \in \bar{\phi} = \{\bar{\phi}_1, \bar{\phi}_2, \ldots, \bar{\phi}_G\},$$
$$\rho_i \in \bar{\rho} = \{\bar{\rho}_1, \bar{\rho}_2, \ldots, \bar{\rho}_G\},$$
$$\Sigma_i \in \bar{\Sigma} = \{\bar{\Sigma}_1, \bar{\Sigma}_2, \ldots, \bar{\Sigma}_G\}.$$

All parameters other than $\rho_i$, $\theta_i$, and $\Sigma_i$ are lumped into the vector $\phi_i$. The above equations imply that each of the individual-specific parameters will be assigned a hierarchical prior where, once the component of the mixture is known, the distribution for $\rho_i$, $\phi_i$ and $\Sigma_i$ is degenerate around one of the $G$ values in the parameter sets given above. The allocation of agents to the appropriate component of the mixture is achieved via the addition of component indicator variables. Specifically, we will let $c_{ig} = 1$ denote that individual $i$ ''belongs to'' the $g$th component of the mixture. Formally, we let

$$c_i = [c_{i1} \;\; c_{i2} \;\; \cdots \;\; c_{iG}]'$$

be the component label vector for individual $i$, and specify priors of the form

$$p(\phi_i, \rho_i, \Sigma_i \mid c_{ig} = 1, \bar{\phi}, \bar{\rho}, \bar{\Sigma}) = I\left(\phi_i = \bar{\phi}_g, \rho_i = \bar{\rho}_g, \Sigma_i = \bar{\Sigma}_g\right) \quad (8)$$
$$c_i \overset{i.i.d.}{\sim} \text{Mult}(1, \pi) \quad (9)$$
$$\pi \sim \text{Dirichlet}(\alpha), \quad (10)$$

with $\pi = [\pi_1 \;\; \pi_2 \;\; \cdots \;\; \pi_G]'$, $\alpha = [\alpha_1 \;\; \alpha_2 \;\; \cdots \;\; \alpha_G]'$, $I(\cdot)$ denoting the standard indicator function, Mult($\cdot$) denoting the multinomial distribution and Dirichlet($\cdot$) denoting the Dirichlet distribution (see, e.g., Koop et al., 2007, pp. 340). Thus, given $c_{ig} = 1$ and the set of component-specific parameters and covariance matrices, the parameter vector for individual $i$ is known, and $\phi_i$, $\rho_i$ and $\Sigma_i$ merely serve as ''place-holders'' which are convenient for simplifying the exposition. The multinomial prior on $c_i$ and the corresponding Dirichlet prior on the component probability vector $\pi$ imply that, unconditionally,

$$p(\phi_i, \rho_i, \Sigma_i \mid \pi, \bar{\phi}, \bar{\rho}, \bar{\Sigma}) = \sum_{g=1}^{G} \pi_g I\left(\phi_i = \bar{\phi}_g, \rho_i = \bar{\rho}_g, \Sigma_i = \bar{\Sigma}_g\right).$$

Likewise, the trivariate distribution for $y_i, s_i, \theta_i$ (not conditioned on $c_i$) is

$$p(y_i, s_i, \theta_i \mid \pi, \bar{\phi}, \bar{\Sigma}, \bar{\rho}) = \sum_{g=1}^{G} \pi_g\, p(y_i, s_i, \theta_i \mid \phi_i = \bar{\phi}_g, \rho_i = \bar{\rho}_g, \Sigma_i = \bar{\Sigma}_g),$$

where the distribution within each component of the mixture is obtained by a change of variables from (5)–(7). This illustrates the finite mixture representation of the likelihood. We complete the specification of our model with the following priors:

$$\bar{\phi}_g \overset{i.i.d.}{\sim} N(\phi_0, V_\phi), \qquad g = 1, 2, \ldots, G \quad (11)$$
$$p(\bar{\Sigma}_1, \bar{\Sigma}_2, \ldots, \bar{\Sigma}_G) \propto I\left(\bar{\Sigma}_{ss1} < \bar{\Sigma}_{ss2} < \cdots < \bar{\Sigma}_{ssG}\right) \prod_{g=1}^{G} p_{IW}(\bar{\Sigma}_g \mid p, pR) \quad (12)$$
$$\bar{\rho}_g \overset{i.i.d.}{\sim} N(\rho_0, V_\rho), \qquad g = 1, 2, \ldots, G. \quad (13)$$

The prior on the set of covariance matrices serves to identify the mixture components, and it does so by providing an ordering restriction on variances in the schooling equation (i.e., Eq. (2)).

10 A proof regarding parameter identification in the finite mixture framework is omitted here, but is available upon request. The proof follows a similar strategy to that given in the Appendix, where $y \mid s, \Gamma_{-\theta}$ and $s \mid \Gamma_{-\theta}$ are obtained and their moments are characterized, revealing parameter identification. Such a strategy dates back to at least Pearson (1894), who used a moment-based approach to estimate the parameters of a two-component Gaussian mixture.
11 See, e.g., Geweke and Keane (2007) for use of a related methodology, the smoothly mixing regression model.
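The mechanics of the hierarchical prior in (8)–(10) are easy to see in code. The sketch below draws from that prior; the values of $G$, $n$, and the hyperparameters are illustrative assumptions, and only the $\bar{\rho}_g$ layer of (13) is shown for brevity.

```python
# Sketch of the hierarchical prior (8)-(10): component probabilities from a
# Dirichlet, component labels from a multinomial, and individual-level
# parameters degenerate at the component-level values. G, n, and all
# hyperparameter values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
n, G = 1000, 3
alpha = np.ones(G)                      # Dirichlet hyperparameters

pi = rng.dirichlet(alpha)               # (10)
c = rng.multinomial(1, pi, size=n)      # (9): each c_i is a one-of-G indicator

# Component-level draws, here rho_g ~ N(rho0, Vrho) as in (13)
rho0, Vrho = 0.0, 10.0
rho_bar = rng.normal(rho0, np.sqrt(Vrho), size=G)

# (8): given c_ig = 1, rho_i is degenerate at the g-th component value,
# i.e. rho_i = sum_g c_ig * rho_bar_g
rho_i = c @ rho_bar
print("pi:", pi, "  first few rho_i:", rho_i[:5])
```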
To derive the joint posterior distribution for the unobservables in our model, we first define $y = [y_1 \;\; y_2 \;\; \cdots \;\; y_n]'$, $s = [s_1 \;\; s_2 \;\; \cdots \;\; s_n]'$, $\theta = [\theta_1 \;\; \theta_2 \;\; \cdots \;\; \theta_n]'$, $\rho = [\rho_1 \;\; \rho_2 \;\; \cdots \;\; \rho_n]'$, $\phi = [\phi_1' \;\; \phi_2' \;\; \cdots \;\; \phi_n']'$, $\Sigma = \{\Sigma_1, \Sigma_2, \ldots, \Sigma_n\}$ and $c = [c_1' \;\; c_2' \;\; \cdots \;\; c_n']'$.

With this notation in hand, Bayes' theorem gives the joint posterior distribution up to proportionality:

$$p(\theta, \phi, \rho, \Sigma, \bar{\phi}, \bar{\rho}, \bar{\Sigma}, \pi, c \mid y, s) \propto p(\pi) p(\bar{\phi}) p(\bar{\rho}) p(\bar{\Sigma}) \prod_{i=1}^{n} \left[ p(\phi_i, \rho_i, \Sigma_i \mid c_i, \bar{\phi}, \bar{\rho}, \bar{\Sigma})\, p(c_i \mid \pi)\, |\Sigma_i|^{-1/2} \exp\!\left(-\tfrac{1}{2} h_i' \Sigma_i^{-1} h_i\right) \right], \quad (14)$$

where

$$h_i = \begin{bmatrix} y_i - \beta_{0i} - x_i \beta_i - s_i \theta_i \\ s_i - \delta_{0i} - x_i \delta_i - z_i \gamma_i - \rho_i \theta_i \\ \theta_i - \eta_{0i} - x_i \eta_i - w_i \lambda_i \end{bmatrix}$$

and $p(\phi_i, \rho_i, \Sigma_i \mid c_i, \bar{\phi}, \bar{\rho}, \bar{\Sigma})$ has been given in (8).

3.2. The Gibbs algorithm

In principle, a standard Gibbs sampler can be applied to fit this model, drawing, in turn, from each of the complete posterior conditionals as implied by (14). However, this standard Gibbs sampler turns out to suffer from poor mixing properties. To this end, we employ a blocking (grouping) step where the parameters $\phi$ are drawn from their conditional distribution, marginalized over the heterogeneity terms $\theta$, and then the remaining parameters are drawn from their complete conditional posterior distributions. This blocking procedure samples $\phi$ and $\theta$ together in a single step, which substantially improves the mixing of the posterior simulations, as will be shown in the following section. To implement this blocking procedure, it is useful to write the augmented likelihood in a different, though equivalent, way. Completing the square on $\theta_i$ in (14) enables us to express the augmented joint posterior as

$$p(\theta, \phi, \rho, \Sigma, \bar{\phi}, \bar{\rho}, \bar{\Sigma}, \pi, c \mid y, s) \propto p(\pi) p(\bar{\phi}) p(\bar{\rho}) p(\bar{\Sigma}) \prod_{i=1}^{n} \left[ p(\phi_i, \rho_i, \Sigma_i \mid c_i, \bar{\phi}, \bar{\rho}, \bar{\Sigma})\, p(c_i \mid \pi)\, \phi_N(\theta_i;\, qq_i^{-1} tq_i,\, qq_i^{-1})\, |\Sigma_i|^{-1/2} |qq_i|^{-1/2} \exp(-[1/2]\, tt_i) \right], \quad (15)$$

where we have defined

$$t_i \equiv \begin{bmatrix} y_i - \beta_{0i} - x_i \beta_i \\ s_i - \delta_{0i} - x_i \delta_i - z_i \gamma_i \\ -\eta_{0i} - x_i \eta_i - w_i \lambda_i \end{bmatrix}, \qquad q_i \equiv \begin{bmatrix} s_i \\ \rho_i \\ -1 \end{bmatrix}, \quad (16)$$

$$qq_i \equiv q_i' \Sigma_i^{-1} q_i, \qquad tq_i \equiv t_i' \Sigma_i^{-1} q_i, \quad (17)$$

$$\sigma qq_i \equiv \Sigma_i^{-1} - \Sigma_i^{-1} q_i [qq_i]^{-1} q_i' \Sigma_i^{-1} \qquad \text{and} \qquad tt_i \equiv t_i' [\sigma qq_i] t_i.$$

To account for the ordering constraint $\bar{\Sigma}_{ss1} < \bar{\Sigma}_{ss2} < \cdots < \bar{\Sigma}_{ssG}$, we generate simulations by first ignoring the constraint and making use of the ''unconstrained'' sampler described below, and then permuting the labels at the end of the simulation period as necessary to achieve agreement with the ordering restriction. More discussion of this issue can be found in Frühwirth-Schnatter (2001, particularly Sections 3.3 and 3.4) and Geweke (2007, particularly Section 3). For the empirical application of Section 6, we also emphasize that the parameters of interest reported and the posterior predictive analyses conducted are not affected by the labeling issue, and as such the need to permute the labels or impose component identification through the prior is irrelevant for these pursuits. The Gibbs sampler, apart from the label permutation, then proceeds in six steps, which we enumerate below.

Step 1: $\bar{\phi}, \phi \mid \cdot, y, s$. To sample the regression parameters $\phi$ and $\bar{\phi}$ marginalized over the random effects $\theta$, first let

$$r_i \equiv (y_i \;\; s_i \;\; 0)', \qquad X_i \equiv \begin{bmatrix} 1 & x_i & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & x_i & z_i & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & x_i & w_i \end{bmatrix},$$

and

$$XX_g \equiv \sum_{\{i: c_{ig} = 1\}} X_i' [\sigma qq_i] X_i, \qquad Xr_g \equiv \sum_{\{i: c_{ig} = 1\}} X_i' [\sigma qq_i] r_i.$$

With some algebra, we obtain

$$p(\bar{\phi}, \phi \mid \cdot, y, s) \propto \prod_{g=1}^{G} \phi_N\!\left[ \bar{\phi}_g \mid (V_\phi^{-1} + XX_g)^{-1}(V_\phi^{-1} \phi_0 + Xr_g),\; (V_\phi^{-1} + XX_g)^{-1} \right] \prod_{\{i: c_{ig} = 1\}} I(\phi_i = \bar{\phi}_g),$$

where it is to be understood that ''$\cdot$'' in the conditioning in this case denotes all parameters other than $\phi$, $\bar{\phi}$, and $\theta$. This result implies that $\bar{\phi}$ and $\phi$ can be sampled by first drawing independently, for $g = 1, 2, \ldots, G$, from

$$\bar{\phi}_g \mid \cdot, y, s \sim N\!\left[ (V_\phi^{-1} + XX_g)^{-1}(V_\phi^{-1} \phi_0 + Xr_g),\; (V_\phi^{-1} + XX_g)^{-1} \right] \quad (18)$$

and then setting, for $i = 1, 2, \ldots, n$,

$$\phi_i = \sum_{g=1}^{G} c_{ig} \bar{\phi}_g. \quad (19)$$

The ''sampling'' of $\phi_i$ in (19) reiterates that these quantities are largely incidental to the problem, and merely simplify the exposition of the model and the algorithm.

Step 2: $\theta_i \mid \cdot, y, s$. The conditional posterior density for the heterogeneity terms $\theta_i$ can be deduced directly from the form of the joint posterior in (15). Specifically,

$$\theta_i \mid \cdot, y, s \overset{ind}{\sim} N\!\left([qq_i]^{-1} tq_i,\; [qq_i]^{-1}\right),$$

where the terms in this conditional have been defined just prior to Step 1.

Step 3: $\bar{\rho}, \rho \mid \cdot, y, s$. To sample the parameters $\rho$, we again need to introduce some additional notation. Let

$$f_i = \begin{bmatrix} y_i - \beta_{0i} - x_i \beta_i - s_i \theta_i \\ s_i - \delta_{0i} - x_i \delta_i - z_i \gamma_i \\ \theta_i - \eta_{0i} - x_i \eta_i - w_i \lambda_i \end{bmatrix} \quad \text{and} \quad p_i = \begin{bmatrix} 0 \\ \theta_i \\ 0 \end{bmatrix}, \qquad i = 1, 2, \ldots, n. \quad (20)$$

The conditional posterior distribution of $\bar{\rho}$ and $\rho$ can be shown to be, up to proportionality,

$$p(\bar{\rho}, \rho \mid \theta, \phi, \Sigma, c, \bar{\phi}, \bar{\Sigma}, \pi, y, s) \propto \prod_{g=1}^{G} \phi_N\!\left[ \bar{\rho}_g \,\middle|\, \Big( V_\rho^{-1} + \sum_{i: c_{ig}=1} p_i' \Sigma_i^{-1} p_i \Big)^{-1} \Big( V_\rho^{-1} \rho_0 + \sum_{i: c_{ig}=1} p_i' \Sigma_i^{-1} f_i \Big),\; \Big( V_\rho^{-1} + \sum_{i: c_{ig}=1} p_i' \Sigma_i^{-1} p_i \Big)^{-1} \right] \prod_{\{i: c_{ig}=1\}} I(\rho_i = \bar{\rho}_g).$$

This implies that the sampling of $\bar{\rho}$ and $\rho$ can proceed by first independently drawing, for $g = 1, 2, \ldots, G$, from

$$\bar{\rho}_g \mid \cdot, y, s \sim N\!\left[ \Big( V_\rho^{-1} + \sum_{i: c_{ig}=1} p_i' \Sigma_i^{-1} p_i \Big)^{-1} \Big( V_\rho^{-1} \rho_0 + \sum_{i: c_{ig}=1} p_i' \Sigma_i^{-1} f_i \Big),\; \Big( V_\rho^{-1} + \sum_{i: c_{ig}=1} p_i' \Sigma_i^{-1} p_i \Big)^{-1} \right] \quad (21)$$

and then setting, for $i = 1, 2, \ldots, n$,

$$\rho_i = \sum_{g=1}^{G} c_{ig} \bar{\rho}_g. \quad (22)$$

Step 4: $\bar{\Sigma}, \Sigma \mid \cdot, y, s$. From (14), the conditional posterior distribution of $\bar{\Sigma}$ and $\Sigma$ is, up to proportionality, given as

$$p(\bar{\Sigma}, \Sigma \mid \cdot, y, s) \propto p(\bar{\Sigma}) \prod_{i=1}^{n} p(\Sigma_i \mid \bar{\Sigma}, c_i)\, |\Sigma_i|^{-1/2} \exp\!\left(-\tfrac{1}{2} h_i' \Sigma_i^{-1} h_i\right) \propto \prod_{g=1}^{G} p_{IW}\!\left( \bar{\Sigma}_g \,\middle|\, p + \sum_{i=1}^{n} c_{ig},\; pR + \sum_{\{i: c_{ig}=1\}} h_i h_i' \right) \prod_{\{i: c_{ig}=1\}} I(\Sigma_i = \bar{\Sigma}_g).$$

This implies that the sampling of $\bar{\Sigma}$ and $\Sigma$ can proceed by first drawing, independently, from

$$\bar{\Sigma}_g \mid \cdot, y, s \overset{ind}{\sim} IW\!\left( p + \sum_{i=1}^{n} c_{ig},\; pR + \sum_{\{i: c_{ig}=1\}} h_i h_i' \right), \qquad g = 1, 2, \ldots, G, \quad (23)$$

and then setting, for $i = 1, 2, \ldots, n$:

$$\Sigma_i = \sum_{g=1}^{G} c_{ig} \bar{\Sigma}_g. \quad (24)$$

Step 5: $c_i \mid \cdot, y, s$. To describe the sampling from the component label vector $c_i$, we first define terms similar to those defined just prior to Step 1. In this case, we make the dependence of the objects in (16) and (17) on specific parameters explicit, to avoid possible confusion when constructing the component probabilities. To this end, let

$$t_i(\bar{\phi}_g) \equiv \begin{bmatrix} y_i - \bar{\beta}_{0g} - x_i \bar{\beta}_g \\ s_i - \bar{\delta}_{0g} - x_i \bar{\delta}_g - z_i \bar{\gamma}_g \\ -\bar{\eta}_{0g} - x_i \bar{\eta}_g - w_i \bar{\lambda}_g \end{bmatrix}, \qquad q_i(\bar{\rho}_g) \equiv \begin{bmatrix} s_i \\ \bar{\rho}_g \\ -1 \end{bmatrix},$$

$$qq_i(\bar{\rho}_g, \bar{\Sigma}_g) \equiv q_i(\bar{\rho}_g)' \bar{\Sigma}_g^{-1} q_i(\bar{\rho}_g),$$
$$\sigma qq_i(\bar{\rho}_g, \bar{\Sigma}_g) \equiv \bar{\Sigma}_g^{-1} - \bar{\Sigma}_g^{-1} q_i(\bar{\rho}_g) [qq_i(\bar{\rho}_g, \bar{\Sigma}_g)]^{-1} q_i(\bar{\rho}_g)' \bar{\Sigma}_g^{-1},$$
$$tt_i(\bar{\phi}_g, \bar{\rho}_g, \bar{\Sigma}_g) \equiv t_i(\bar{\phi}_g)' [\sigma qq_i(\bar{\rho}_g, \bar{\Sigma}_g)] t_i(\bar{\phi}_g),$$

and

$$\tilde{\pi}_{ig} \equiv \frac{\pi_g |\bar{\Sigma}_g|^{-1/2} |qq_i(\bar{\rho}_g, \bar{\Sigma}_g)|^{-1/2} \exp\!\left(-\tfrac{1}{2} tt_i(\bar{\phi}_g, \bar{\rho}_g, \bar{\Sigma}_g)\right)}{\sum_{h=1}^{G} \pi_h |\bar{\Sigma}_h|^{-1/2} |qq_i(\bar{\rho}_h, \bar{\Sigma}_h)|^{-1/2} \exp\!\left(-\tfrac{1}{2} tt_i(\bar{\phi}_h, \bar{\rho}_h, \bar{\Sigma}_h)\right)}, \qquad \tilde{\pi}_i \equiv (\tilde{\pi}_{i1} \;\; \tilde{\pi}_{i2} \;\; \cdots \;\; \tilde{\pi}_{iG})'.$$

Straightforward algebra produces that

$$c_i \mid \cdot, y, s \sim \text{Mult}(1, \tilde{\pi}_i). \quad (25)$$

Step 6: $\pi \mid \cdot, y, s$. Finally, let

$$\tilde{\alpha} = \begin{bmatrix} \alpha_1 + \sum_{i=1}^{n} c_{i1} \\ \alpha_2 + \sum_{i=1}^{n} c_{i2} \\ \vdots \\ \alpha_G + \sum_{i=1}^{n} c_{iG} \end{bmatrix}.$$

The conditional posterior distribution of $\pi$ can be shown to be

$$\pi \mid \cdot, y, s \sim \text{Dirichlet}(\tilde{\alpha}). \quad (26)$$
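To make these conditionals concrete in code, the sketch below implements Step 2 — the draw of a single heterogeneity term $\theta_i$ — together with the $qq_i$ and $tq_i$ objects of (16) and (17), in the one-component case. All parameter and data values are hypothetical stand-ins for the sampler's current state; the remaining five steps update their blocks from the conditionals displayed above in the same fashion.

```python
# Illustrative implementation of Step 2 for one observation, one component.
# All inputs are made-up values standing in for the sampler's current state.
import numpy as np

rng = np.random.default_rng(2)

# Current parameter state (hypothetical)
beta0, beta = 1.0, 0.5
delta0, delta, gamma, rho = 2.0, -0.3, 0.7, 4.0
eta0, eta, lam = 0.06, 0.02, 0.8
Sigma = np.array([[0.25,  0.10,  0.004],
                  [0.10,  1.00,  0.008],
                  [0.004, 0.008, 0.0016]])
Sinv = np.linalg.inv(Sigma)

# One observation (hypothetical data)
y_i, s_i, x_i, z_i, w_i = 2.3, 13.0, 0.4, -0.2, 0.1

# (16): t_i stacks the theta-free residuals; q_i stacks the theta loadings,
# so that h_i = t_i - q_i * theta_i
t = np.array([y_i - beta0 - x_i * beta,
              s_i - delta0 - x_i * delta - z_i * gamma,
              -eta0 - x_i * eta - w_i * lam])
q = np.array([s_i, rho, -1.0])

# (17): qq_i = q' Sigma^{-1} q and tq_i = t' Sigma^{-1} q
qq = q @ Sinv @ q
tq = t @ Sinv @ q

# Step 2: theta_i | . , y, s ~ N(qq^{-1} tq, qq^{-1})
theta_draw = rng.normal(tq / qq, np.sqrt(1.0 / qq))
print("conditional mean:", tq / qq, " draw:", theta_draw)
```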
A posterior simulator proceeds by sampling from (18)–(26).

3.3. Causal effect heterogeneity

Having discussed issues of model flexibility and methods for posterior simulation, we now turn our attention to key parameters of interest. Of course, a primary focus of our model concerns the causal effect heterogeneity terms, $\theta_i$. In particular, we would like to make use of our analysis to answer questions such as the following. What have we learned about the overall distribution of such impacts in the population? How can we use our model to make predictive statements regarding the effect of future, out-of-sample treatments?

We separately consider the cases of in-sample and out-of-sample prediction. To this end, we first note from (5)–(10) that, marginally,

$$p(\theta_i \mid \cdot) = \sum_{g=1}^{G} \pi_g\, \phi(\theta_i;\; \eta_{0g} + x_i \eta_g + w_i \lambda_g,\; \sigma_{\theta g}^2),$$

where the subscript $g$ denotes parameters associated with the $g$th component of the mixture, the ''$\cdot$'' in the conditioning explicitly reflects that we are conditioning on the model's parameters, and $\phi(x; \mu, \sigma^2)$ denotes a normal density for $x$ with mean $\mu$ and variance $\sigma^2$. Thus, our model assumes that the distribution of treatment effect heterogeneity can be adequately represented as a finite mixture of Gaussian distributions, which seems unlikely to be a controversial assumption in practice.

If interest centers on summarizing the overall shape of the heterogeneity distribution, or in making predictions about
treatment returns for a future sample, such questions can be addressed by deriving and calculating the appropriate posterior predictive density. For example, adding a subscript $f$ to denote ''future'' outcomes, and considering the case of a particular agent with known characteristics $x_f$ and $w_f$, the desired posterior predictive distribution can be obtained via ''Rao–Blackwellization''. Specifically,

$$p(\theta_f \mid x_f, w_f, y, s) \approx \frac{1}{M} \sum_{m=1}^{M} \sum_{g=1}^{G} \pi_g^{(m)}\, \phi\!\left(\theta_f;\; \eta_{0g}^{(m)} + x_f \eta_g^{(m)} + w_f \lambda_g^{(m)},\; \sigma_{\theta g}^{2(m)}\right),$$

where $M$ denotes the total number of post-convergence simulations and $m$ indexes these simulations. A single, ''representative'' posterior predictive density could be calculated by fixing $x_f$ and $w_f$ to their sample means or rounded integer values, as appropriate. Perhaps more desirably, an additional step can be added to average these over the empirical distribution of the $x_f$ and $w_f$ characteristics to effectively drop the conditioning on $x_f$ and $w_f$.

In this out-of-sample exercise, we focus on the $\theta_f$ marginal density since the future treatment level $s_f$ and future outcome $y_f$ are not observed. Within the sample, however, this is not the case, and the mechanism for learning about $\theta_i$ is slightly different. In particular, the structure of our model suggests that the observed $y_i$ and $s_i$ convey information about $\theta_i$ beyond what is learned from (7) only. To shed some light on this issue more formally, let $\Gamma$ denote all parameters other than $\theta_i$, and note that

$$p(\theta_i \mid y, s) = \int p(\theta_i, \Gamma \mid y, s)\, d\Gamma = \int p(\theta_i \mid \Gamma, y, s)\, p(\Gamma \mid y, s)\, d\Gamma = \int p(\theta_i \mid \Gamma, y_i, s_i)\, p(\Gamma \mid y, s)\, d\Gamma,$$

where the last line follows from the fact that, given $\Gamma$, $\theta_i$ is independent of outcomes other than $y_i$ and $s_i$.

To fix ideas and make progress in understanding how we learn about specific within-sample treatment impacts, let us consider the single-component Gaussian model.12 The conditional posterior $p(\theta_i \mid \Gamma, y_i, s_i)$, which is to be averaged over $p(\Gamma \mid y, s)$ to obtain $p(\theta_i \mid y, s)$, can be obtained from (1)–(3). Importantly, this derivation shows that the conditional posterior distribution of $\theta_i$ is not just the marginal density in (3); instead, the conditional distribution $\theta_i \mid \Gamma, y_i, s_i$ is normal with mean $E(\theta_i \mid \Gamma, y_i, s_i)$, given in Box I, where we have defined $\mu_{\theta i} \equiv \eta_0 + x_i \eta + w_i \lambda$, $\tilde{y}_i \equiv y_i - \beta_0 - x_i \beta$, $\tilde{s}_i \equiv s_i - \delta_0 - x_i \delta - z_i \gamma$, and $\sigma^{ij}$ denotes the $(i, j)$ element of $\Sigma^{-1}$.13 This mean, of course, is different from $\mu_{\theta i}$, the mean of (3) that is used for out-of-sample prediction purposes. The law of iterated expectations then implies that

$$E(\theta_i \mid y, s) = E_{\Gamma \mid y, s}\!\left[E(\theta_i \mid \Gamma, y, s)\right] = E_{\Gamma \mid y, s}\!\left[E(\theta_i \mid \Gamma, y_i, s_i)\right],$$

so that the posterior expected return to treatment for agent $i$ is the conditional posterior mean $E(\theta_i \mid \Gamma, y_i, s_i)$ averaged over the posterior distribution of $\Gamma$. With a little rearranging, the conditional posterior mean $E(\theta_i \mid \Gamma, y_i, s_i)$ above can be represented as a type of weighted average of three pieces: $\tilde{y}_i / s_i$, $\tilde{s}_i / \rho$, and $\mu_{\theta i}$. These three pieces emerge quite naturally as ''estimators'' of $\theta_i$ from (1)–(3), as each of these involves ''solving'' for $\theta_i$ in the respective equations. In the limiting case where $\Sigma$ is diagonal, it is straightforward to show, for example, (holding all else constant in each case) that $\theta_i \mid \Gamma, y_i, s_i$ collapses around $\tilde{y}_i / s_i$ as $\sigma_y \to 0$, collapses around $\tilde{s}_i / \rho$ as $\sigma_s \to 0$, and collapses around $\mu_{\theta i}$ as $\sigma_\theta \to 0$. Thus, in-sample predictions regarding individual-level treatment impacts use information from all three equations of our system, and therefore more precise estimates of our outcome and treatment equations in (1) and (2) can lead to better learning about individual-level causal effect parameters.

12 Alternatively, what follows applies to a particular component of the mixture, so that this assumption is made essentially without loss of generality.
13 It is also worth mentioning that Var$(\theta_i \mid \Gamma, y_i, s_i)$ is simply the inverse of the denominator in the expression for $E(\theta_i \mid \Gamma, y_i, s_i)$ above.
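A minimal sketch of the Rao–Blackwellized predictive density computation described above appears below; the arrays of draws are synthetic placeholders standing in for actual retained Gibbs output, and the grid and agent characteristics are illustrative assumptions.

```python
# Sketch of the Rao-Blackwellized posterior predictive density for theta_f:
# average the G-component normal mixture density over M retained draws.
# All draws below are synthetic placeholders for actual simulator output.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
M, G = 500, 2
x_f, w_f = 0.0, 0.08                       # characteristics of the future agent

pi_draws   = rng.dirichlet(np.ones(G), size=M)        # (M, G)
eta0_draws = rng.normal([0.05, 0.10], 0.01, (M, G))
eta_draws  = rng.normal(0.0, 0.005, (M, G))
lam_draws  = rng.normal(0.16, 0.03, (M, G))
sig_draws  = np.abs(rng.normal(0.04, 0.005, (M, G)))  # sigma_{theta g}

grid = np.linspace(-0.1, 0.3, 401)
mu = eta0_draws + x_f * eta_draws + w_f * lam_draws   # (M, G) component means
dens = norm.pdf(grid[:, None, None], mu, sig_draws)   # (len(grid), M, G)

# (1/M) sum_m sum_g pi_g^(m) * phi(theta_f; mu_g^(m), sigma_g^(m)^2)
pred = (pi_draws[None, :, :] * dens).sum(axis=2).mean(axis=1)
print("mass on the grid (should be ~1):", pred.sum() * (grid[1] - grid[0]))
```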
4. Generated data experiments
We illustrate the performance of our algorithm via two generated data experiments. In the first experiment, we simulate a large sample size of n = 10,000 observations from a correctly specified two-component mixture version of the model in (1)–(3). For this case, the exogenous variables are generated as follows:
$$\begin{bmatrix} x_i' \\ z_i \\ w_i \end{bmatrix} = \begin{bmatrix} x_{1,i} \\ x_{2,i} \\ z_i \\ w_i \end{bmatrix} \sim N_4\left( \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & 0.1 & -0.2 & 0.1 \\ 0.1 & 0.25 & -0.2 & 0.05 \\ -0.2 & -0.2 & 4 & -0.2 \\ 0.1 & 0.05 & -0.2 & 1 \end{bmatrix} \right), \quad (27)$$
and the following hyperparameters are selected: $\phi_0 = 0_{k \times 1}$, $V_\phi = 10^6 \times I_k$, $\rho_0 = 0$, $V_\rho = 10^6$, $p = 8$, $R = \mathrm{diag}\{1, 10^2, 0.1^2\}$, $\alpha_1 = 1$, $\alpha_2 = 1$. These yield reasonably ''diffuse'' marginal prior distributions, so the information contained in the prior is small relative to the information contained in the data.

Table 1 reports the actual parameters of the data-generating process as well as their posterior means and posterior standard deviations from the experiment. The results of this table reveal that our algorithm successfully recovers the parameters of the data-generating process and that our code for fitting such models is likely to be free of errors.14 Experiments with fewer observations provided similar results, and, finally, experiments based upon more than two mixture components also revealed that the code and posterior simulator performed adequately.15

In Table 2, we illustrate the performance of our method in a different way. The table presents inefficiency factors associated with the sampling of three different parameters: $\bar{\beta}_{01}$, $\bar{\rho}_1$, and $\bar{\Sigma}_{yy1}$. The first column presents such factors using our posterior simulator outlined in the previous section, while the second column, for the sake of comparison, presents analogous results for a sampler that fails to block $\phi$ and $\theta$ together. As the table clearly illustrates, the gains to blocking are substantial, and, moreover, the mixing of the simulations in our sampler is adequate, though still falling somewhat short of the numerical efficiency obtained under i.i.d. sampling.

A second generated data experiment was also conducted to illustrate the performance of the mixture method under misspecification.
14 A more formal diagnosis of the code was obtained by performing some of the checks suggested by Geweke (2004). Though not reported here, these tests did not provide any evidence that a mistake had been made. 15 It is worth noting that the ability of our algorithm to accurately estimate the true parameter values clearly depended on the quality of the instruments (i.e., the magnitude of γ and λ) and the degree of confounding (i.e., ρys , ρyθ , ρsθ ). We do not attempt to further characterize these relationships in the present study, as doing so thoroughly will lead us well beyond the scope and goals of this paper. Whether or not such issues are relevant for the applied researcher will inevitably depend on the application at hand and the data available.
$$E(\theta_i \mid \Gamma, y_i, s_i) = \frac{\sigma^{11} \tilde{y}_i s_i + \sigma^{22} \tilde{s}_i \rho + \sigma^{33} \mu_{\theta i} + \sigma^{12}[s_i \tilde{s}_i + \rho \tilde{y}_i] - \sigma^{13}[\tilde{y}_i + s_i \mu_{\theta i}] - \sigma^{23}[\tilde{s}_i + \rho \mu_{\theta i}]}{\sigma^{11} s_i^2 + \sigma^{22} \rho^2 + \sigma^{33} + 2\sigma^{12} s_i \rho - 2\sigma^{13} s_i - 2\sigma^{23} \rho}.$$

Box I.
Table 1
Parameter posterior means, standard deviations, and true values.

                          Component 1                        Component 2
Parameter        True value  E(β|D)   Std(β|D)      True value  E(β|D)   Std(β|D)
Component probability
  πg             0.7         0.689    0.006         0.3         0.311    0.006
Equation for y
  β0             0.5         0.523    0.033         1           1.07     0.084
  β1             −3          −2.94    0.039         −1.5        −1.4     0.111
  β2             −1          −1.06    0.054         −3          −3.0     0.161
Equation for s
  δ0             1.5         1.45     0.044         0.5         0.351    0.075
  δ1             −2          −1.89    0.051         2           2.07     0.081
  δ2             1           0.943    0.056         −1          −1.14    0.135
  γ              −1.5        −1.49    0.014         1.5         1.51     0.031
  ρ              2.5         2.47     0.013         1.5         1.48     0.021
Equation for θ
  η0             −2.5        −2.52    0.008         −2          −1.96    0.025
  η1             3           3        0.007         2.5         2.48     0.021
  η2             −0.5        −0.495   0.015         −2.5        −2.54    0.044
  λ              2           2        0.007         3.0         3.0      0.021
Covariance matrix
  σy             1           0.975    0.027         2           1.98     0.071
  σs             2           2        0.022         3           3.03     0.046
  σθ             0.5         0.495    0.006         1           0.98     0.017
  ρys            0.2         0.16     0.030         0.1         0.08     0.044
  ρyθ            −0.1        −0.136   0.030         −0.2        −0.15    0.042
  ρsθ            0.1         0.0941   0.016         0.2         0.24     0.024
Table 2
Inefficiency factors for a selection of three parameters.

Parameter   Blocking φ and θ together   Without blocking φ and θ together
β01         2.72                        929
ρ1          22.84                       3554
Σyy1        34.99                       388
Since the return heterogeneity distribution is of primary interest in our study, we introduced a departure from normality in the generation of these returns by first sampling $\epsilon_i$ from a lognormal distribution with reasonable skew, recentered to have mean zero:

$$\epsilon_i \overset{i.i.d.}{\sim} \text{lognormal}(0, 0.25) - \exp(0.125).$$

The joint distribution of $u_i$ and $v_i$ was then sampled as a (conditionally) bivariate normal:

$$\begin{bmatrix} u_i \\ v_i \end{bmatrix} \Bigg|\, \epsilon_i \overset{ind}{\sim} N\left( \begin{bmatrix} -0.2 \log[\epsilon_i + \exp(0.125)] \\ 0.4 \log[\epsilon_i + \exp(0.125)] \end{bmatrix}, \begin{bmatrix} 0.99 & 0.42 \\ 0.42 & 3.96 \end{bmatrix} \right),$$
and the parameter values and process for generating the covariates were then identical to those employed for the first component of the first generated data experiment. Unconditional moments of the joint distribution of $u$, $v$, and $\epsilon$ can be derived for this experiment, but we omit these details here for the sake of brevity. Instead, we focus on the most important issue of assessing how well our mixture model fares at picking up this departure from normality when it is present.16
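The two displayed distributions are straightforward to sample; the sketch below (our own illustration, reading 0.25 as the variance of the underlying normal, consistent with the $\exp(0.125)$ recentering) draws from this non-normal heterogeneity process.

```python
# Sketch of the non-normal DGP of the second experiment: recentered
# lognormal eps_i, with (u_i, v_i) conditionally bivariate normal given eps_i.
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(5)
n = 100_000

# eps ~ lognormal(0, 0.25) - exp(0.125), so E[eps] = 0
eps = rng.lognormal(mean=0.0, sigma=np.sqrt(0.25), size=n) - np.exp(0.125)

# log(eps + exp(0.125)) recovers the underlying N(0, 0.25) variate
m = np.log(eps + np.exp(0.125))
L = np.linalg.cholesky(np.array([[0.99, 0.42], [0.42, 3.96]]))
zuv = rng.standard_normal((n, 2)) @ L.T
u = -0.2 * m + zuv[:, 0]
v = 0.4 * m + zuv[:, 1]

print("skewness of eps:", skew(eps))           # positive: non-normal returns
print("corr(u, eps):", np.corrcoef(u, eps)[0, 1])
```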
16 The slope coefficients and remaining model parameters were also well estimated in this exercise. Details of these results are available upon request.
We fit a variety of different mixture models to these data, focusing on models with 2–5 different mixture components. For the sake of parsimony, we only allow the intercepts and covariance matrices of the mixture components to differ and restrict all the slope coefficients to be the same across components. A summary of the different model performances in terms of the return distribution is presented graphically in Fig. 1. To fix ideas, we used the simulations produced from the algorithm of Section 3.2 to obtain a posterior predictive return heterogeneity distribution for an individual of average characteristics (i.e., setting all covariates to zero). We repeated this exercise four different times, considering models with 2–5 mixture components.17 We then compared the posterior predictive densities for these cases to the actual heterogeneity distribution for this ''average'' individual. As is evident from Fig. 1, the mixture models perform well. Even the two-component model is able to capture the most salient features of the lognormal heterogeneity distribution, while the four-component and five-component models are able to capture its shape almost exactly.

Table 3 shows the calculated Bayes factors18 for the competing mixture models, using the notation $M_j$ to denote the specification employing $j$ mixture components. The results in the table are reported as Bayes factors in support of the four-component model.

Table 3
Bayes factors supporting the four-component model from the second generated data experiment.

Bayes factor              j = 2           j = 3           j = 4   j = 5
p(y, s|M4)/p(y, s|Mj)     2.62 × 10^814   1.05 × 10^155   1       1.60
17 The one-component Gaussian model produced a symmetric predictive return distribution and was clearly inferior to those with more mixture components. 18 These were computed using the method described in Gelfand and Dey (1994) and Geweke (1999).
Fig. 1. True return heterogeneity distribution for an individual of average characteristics (solid lines) compared to posterior predictive densities estimated from models with 2–5 mixture components (dashed lines).
These clearly reveal that the four-component specification is favored relative to those with two or three components, and also reveal near indifference between the use of four and five mixture components. As Fig. 1 suggested, both the four-component and five-component models performed well in terms of capturing features of the return heterogeneity distribution, and our marginal likelihood calculations weakly support the more parsimonious four-component model over the five-component alternative.

The results of this exercise are encouraging, and they suggest that the mixture models can fare well in picking up departures from normality. Of course, this exercise has not investigated other sources of misspecification, such as measurement error or incorrect specification of the conditional mean functions. We should not expect our method to be immune to such problems, and indeed it will not be; what we have documented here is simply a degree of robustness to distributional assumptions in the absence of other confounding problems.

5. The data

For our empirical application, we make use of data from two distinct sources. The first and primary data set, which has been widely employed in the applied literature, is the High School and Beyond (HSB) survey. HSB is a survey conducted on behalf of the National Center for Education Statistics (NCES), and it was designed with the intent of yielding a sample of students representative of American high school students. HSB is a biennial survey starting in 1980, and we focus attention on the sophomore subsample of the HSB data. For our earnings outcome measure, we employ the most recent data available to us, the 1992 survey, from which 1991 earnings can be obtained.
Table 4
Descriptive statistics.

Variable                                      Mean      Std. Dev.
Log monthly earnings                          7.51      0.465
Schooling                                     13.6      2.06
Father's education                            12.5      3.28
Mother's education                            12.3      2.81
Base year family income ($10,000)             2.09      0.999
Base year test score                          0         1
Number of siblings                            2.92      1.61
Female                                        0.463     0.499
Age as of 1 January 1991                      26.8      0.536
Hispanic                                      0.154     0.361
Native American                               0.0191    0.137
Asian/Pacific                                 0.0316    0.175
Black                                         0.123     0.328
Other/Missing race                            0.00371   0.0608
1980 county grp. avg. log hourly wage         1.53      0.109
1980 county grp. avg. schooling               12.7      0.434
1980 county grp. avg. return to schooling     0.0785    0.012
In practice, we restrict the HSB sample to individuals who have worked for at least nine months during 1991, and whose monthly earnings were between $500 and $6000. Finally, it is worth mentioning that the HSB data set employed here, like several other widely used micro data sets, contains a wealth of demographic information on the sample respondents, such as family background characteristics and individual test scores, making it an attractive data source for our application. Descriptive statistics associated with key variables in the model are provided in Table 4.

As discussed earlier in the paper, for identification purposes, we require an instrument or set of instruments. The most important
of these is some characteristic or set of characteristics that are conditionally correlated with individual-level returns to schooling, but can be excluded from the earnings and schooling equations. As shown in the Appendix, the model parameters are fully identifiable with such an exclusion restriction, provided that $\rho \neq 0$. Our choice in this regard is to obtain a county-level return to schooling estimate based on 1980 Census data and use this lagged return as a right-hand side variable in Eq. (3).

The lagged county-level returns are constructed using data from the public use 5% sample of the 1980 Census. We restrict the Census sample to those individuals who are between 16 and 28 years of age19 working for at least 40 hours a week in 1979 with an hourly wage between $1 and $100. For each race and gender cell within a given county group, we calculate the corresponding county-level average log hourly wage, highest grade completed, and return to schooling.20 This county-level return to schooling is obtained by running a regression of individual log hourly wages in the given county on highest grade completed, potential labor market experience (age minus schooling minus 6), and potential experience squared. We consider the schooling coefficient estimated from this regression as the 1980 county-level average return to schooling for that group21 (a stylized sketch of this construction is given at the end of this section). The lagged county-level average returns are then matched with the HSB data.

This, of course, requires county identifiers for the HSB sample. However, this matching is not trivially performed, as there are no directly available county-level indicators in HSB, and the NCES does not publicize this information. To this end, we follow and expand upon the approach of Hanushek and Taylor (1990), Rivkin (1991), Ganderton (1992), Grogger (1996a,b), and Li (2006, 2007), who are able to match individuals to states of residence via other information provided in HSB. Specifically, a school survey component of the HSB data provides a variety of information on local labor market conditions associated with each school represented in the survey. In practice, we implement our identification strategy in two steps. First, we utilize the available state-level geographically related HSB variables and match them to publicly available state-level data to uncover the state associated with each HSB individual. In the second step, once the state has been identified, we repeat the same procedure at the county level by making use of available county-level labor market conditions to identify the county of residence for each HSB participant during their high school years. In this way, we are able to match lagged county-level returns to schooling to each individual in the HSB sample.22
19 The HSB sophomores were around 16 in 1980 and 28 in 1992. Therefore, labor market outcomes and educational attainments of individuals from the Census aged between 16 and 28 can be considered most relevant to our HSB sample. 20 In the 1980 Census, county groups are typically defined as contiguous areas with an aggregate population of at least 100,000. They may be actual county groups or single counties. In some cases, a county is split up into several ‘‘county groups’’. In such situations, we create a larger county group by encompassing all ‘‘county groups’’ belonging to the same county. 21 In some cases, a subgroup (i.e., a particular race and gender category within a particular county group) is found to have fewer than 15,000 observations. In such cases, we first enlarge the sample by including all persons who are from the same county group and of the same sex, regardless of their ethnic background, and add a set of race dummies to the log hourly wage regression. If the resulting pooled sample size is still below 15,000 observations, we combine all people from the same county group, irrespective of their gender and ethnic origin, and then include dummies for gender and race in the log hourly wage regression. 22 Specific (and tedious) details regarding how this is done are available upon request.
The validity of lagged returns to schooling as an instrumental variable rests on two assumptions. First, we must assume that lagged county-level returns to education are correlated with the (contemporaneous) private return to education for the given individual. Recognizing that many individuals will choose to work in the same county group as their high school was located, this correlation may result from a type of autoregressive process in county-level returns to education. Unlike traditional IV analyses, this first identification assumption, however, is not ''directly'' empirically testable, as $\theta_i$ in (3) is not observed.23

The second (and surely more controversial) assumption is that the lagged county-level returns can be excluded from, most importantly, the schooling equation in (2) and, to a lesser extent, the log wage equation in (1). That is, conditioned on a variety of individual-level controls, 1980 average returns to education in a county are uncorrelated with unobservables affecting wages and educational attainment observed in 1991. There are certainly a few reasons to think that this assumption is suspect. For example, educational attainment decisions may be based, in part, on the contemporaneous return to schooling observed by the agent. That is, sophomores in 1980 (who generally become seniors in 1982) may make decisions about college entry based on currently available information regarding the return to a college degree. If this is the case, then lagged returns may have some non-ignorable role in explaining 1991 schooling outcomes, which would undermine our identification strategy. Our assumption in this regard, however, is that the agent makes educational attainment decisions based on his or her own return to education parameter $\theta_i$, and conditioned on this parameter (and other characteristics), lagged county-level information is superfluous. Again, this assumption is probably not without controversy, but we maintain it in the current analysis.

As a way to partially mitigate some of these effects, we also include in the log earnings equation the 1980 average log earnings for that county group, and likewise, in the schooling equation, we include the 1980 average level of schooling for that particular county group. Thus our argument is that, conditioned on the individual return to schooling parameter as well as lagged average levels of schooling and earnings, lagged county-level returns to education do not play an independent structural role in schooling decisions and earnings determination.

Finally, in addition to the instruments described above, we also control for parental education and income, family size, sophomore year test scores,24 age, gender, and a variety of racial indicator variables. This produced an HSB sample of 8886 individuals from 471 county groups (or 536 counties).
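The stylized sketch promised above illustrates the county-level return construction on synthetic placeholder data; the sample size, coefficient values, and noise level are our own assumptions, not features of the 1980 Census extract.

```python
# Stylized county-level return to schooling: within one county-group cell,
# regress log hourly wages on schooling, potential experience, and its
# square; the schooling coefficient is the county-level average return.
# All data below are synthetic placeholders.
import numpy as np

rng = np.random.default_rng(6)
n = 5_000                                   # workers in one county-group cell

schooling = rng.integers(8, 19, size=n).astype(float)
age = rng.integers(16, 29, size=n).astype(float)
exper = np.clip(age - schooling - 6.0, 0.0, None)   # potential experience

true_return = 0.08                          # hypothetical
log_wage = (0.5 + true_return * schooling + 0.03 * exper
            - 0.0005 * exper**2 + rng.normal(0, 0.4, size=n))

X = np.column_stack([np.ones(n), schooling, exper, exper**2])
coef = np.linalg.lstsq(X, log_wage, rcond=None)[0]
print("estimated county-level return to schooling:", coef[1])
```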
23 We do, however, find strong evidence supporting a role for the lagged returns in (3), as documented in the following section.
24 In 1980, the sophomores who participated in the HSB survey also took a battery of seven tests that were designed to measure the cognitive abilities of these individuals. We add together the number of questions answered correctly in the seven tests and rescale this variable so that it has a mean of 0 and a standard deviation of 1.

6. Empirical results

Before discussing results from any particular model, we first consider the general issue of model selection. To this end, we estimate the single-component normal model along with two-component and three-component normal mixture models as competing specifications. For each case, we run our posterior simulator for 200,000 iterations and discard the first 20% (40,000) as the burn-in period. For the mixture models, we restrict all slope coefficients to be the same across components, yet allow the intercepts for each equation and covariance matrices to differ. These slope restrictions were imposed in order to minimize added parameterization while still being able to accommodate skew, heavy tails, or other departures from normality.
Under the same priors employed for the generated data experiments of Section 4 (and equal prior probabilities over each of these three models), we find evidence against normality for our application. Specifically, the two-component specification is favored over the ''textbook'' Gaussian model by an overwhelming factor of 2.58 × 10^15 and it is also favored over the more general three-component model by a factor of 1043. This support for the more parameter-rich two-component model over the Gaussian specification occurs despite the fact that our priors, though proper, are still quite uninformative.25 Given these results, we focus in the remainder of this discussion on results obtained from the two-component mixture model, as it is strongly preferred over these competitors, and model-averaged posteriors would essentially reduce to those same posteriors obtained under the two-component specification.

We first consider the log monthly earnings equation from the two-component model, as reported in Table 5.26 From this first portion of the table it is evident that, holding all else constant, males earn more, on average, than women, as do workers who come from smaller families (in terms of fewer siblings) with high annual incomes.27 In addition, whites, the excluded category, generally earn more than other racial/ethnic groups, and the lagged county-level average log wage also has an important role in describing current monthly earnings.28

Our second equation explains the variation in the quantity of schooling attained by individuals in our sample. There is overwhelming evidence from Table 5 supporting the assertions that, holding all else constant, the quantity of education attained increases with student-level test scores and also increases with parental education and income, as no posterior simulations associated with these parameters were negative. Similarly, children of larger families attain less schooling, on average, than those from smaller families, while those of Asian/Pacific descent attain about one more year of education than whites. We also find an important role for the individual-level return to schooling parameter $\theta_i$ in the schooling equation. Specifically, a one percentage point increase in the return to education will lead the individual to acquire 0.135 more years of schooling, on average. If we consider a one-standard-deviation increase in the conditional distribution of the return to education parameter (which corresponds, approximately, to increasing $\theta_i$ by 0.05), this leads to an expected increase in the number of years of education equal to 2/3 of a year. These results suggest both statistical and economic significance of the return to education variable in
25 Bartlett's paradox, for example, illustrates that the adoption of such priors results in the Bayes factor lending support for the restricted model.
26 Although not reported in this set of tables, calculations of the coefficient prior means and prior standard deviations reveal that a substantial amount of learning has taken place regarding all slope, intercept, and covariance matrix parameters. The priors used are the same as those employed in Section 4. For the sake of space, we do not report in the table posterior statistics for the intercept and covariance matrix parameters, although these are available upon request. The results do, however, show a strong, positive correlation between schooling and log earnings unobservables, and strong negative correlations between these unobservables and those associated with returns to education. A similar pattern is found in terms of observed characteristics, a point we discuss later in this section.
27 Values of the third column are reported as one (zero) when all (none) of the posterior simulations were positive.
28 In the HSB data, information on family income, parental education, number of siblings, and base year test score is often missing. We do not, in this paper, take up the issue of how best to model the missing data, or whether or not these observations are missing at random. Instead, in the case where these variables are absent, we set the corresponding variables equal to their sample mean values and add a dummy variable to the regression equation denoting whether or not the given covariate is missing or observed in the sample. The posterior means and standard deviations of the parameters associated with the missing indicators are included in the analysis but not reported in the tables for the sake of brevity.
The results here are quite interesting, as they clearly reveal that agents with higher returns to education do, in fact, acquire more schooling.

The final equation of our system, with estimates reported in Table 6, explains the individual-level variation in returns to education. In terms of coefficient point estimates, the results of the table generally suggest that those family background characteristics leading individuals to acquire more schooling and receive higher earnings, such as parental education and family income, also tend to lower an individual's return to an added year of education.29 This interesting result makes some intuitive sense, since we can certainly imagine that a college degree, for example, may significantly alter the earnings profile for someone coming from a low-income family. At the same time, a college degree for an individual from a high-income family will also be valued in the labor market, but it is seemingly likely that the high-income child, owing to family connections or other social networks, would fare better in the absence of the college degree than the low-income individual.

Finally, we also note that returns to schooling do not seem to vary in any systematic way with test scores, as the posterior probability that this parameter was positive was 0.68. This is not to say that test scores do not matter in the production of wages and education; indeed, previous portions of the table clearly point to an important role for test scores in the production of both variables. Instead, we find little evidence that returns to education vary in a systematic way with cognitive ability; Koop and Tobias (2004) also document a similar result, albeit with a very different model and data set. In addition, we find modest evidence supporting the notion that minority groups – blacks in particular and Hispanics to a lesser extent – have higher returns to education than whites, and that females have higher returns to education than males.30 Lagged county-level returns to education were also clearly important in explaining the individual-level variation in returns to schooling, which is critical for identification purposes. Specifically, a one percentage point increase in the 1980 return to education in the county is associated with a 0.16 percentage point increase in the individual's 1991 private return, and no posterior simulations associated with this parameter were negative.

6.1. Decomposing a covariate's effect on log wages

The foregoing discussion clearly illustrates that a covariate in our model has many channels through which it impacts log wage outcomes. Our previous discussion of results has, in fact, focused primarily on directional impacts and brief discussions akin to the "significance" of particular variables in light of the multifaceted nature of their influences. Variables such as family income and parental education, for example, have direct effects on schooling levels and returns to schooling, and each of these filters through the model to define that variable's "total" impact on earnings.
29 As the reader can see, there is considerable uncertainty associated with many of the parameters at this stage of the model. Formal Bayes factors, computed via the Savage–Dickey density ratio, were found to support the inclusion of only the black, female, and lagged county-level returns to education variables in Eq. (3), although there is rather considerable support based on the marginal posterior distributions for retaining the mother's education and family income as well. Of course, the priors employed for these parameters were quite flat, lending substantial prior support to the restricted variants of the model (e.g., Bartlett's paradox). The results of the table are clearly suggestive that females and blacks have higher returns to education while individuals from wealthy families have lower returns to schooling.
30 Henderson et al. (2009) recently document a similar result, as have previous studies in the literature.
Table 5
Posterior means, standard deviations, probabilities of being positive, and numerical standard error (NSE) values from the two-component mixture model.

Log monthly earnings equation
Variable                                     E(β|D)     Std(β|D)   Pr(β > 0|D)   NSE
Father's education                           0.00828    0.0116     0.761         4.17e−05
Mother's education                           0.0208     0.013      0.945         5.43e−05
Base year family income ($10,000)            0.15       0.033      1             9.7e−05
Base year test score                         0.0185     0.0372     0.691         0.000606
Number of siblings                           −0.0214    0.0193     0.133         5.59e−05
Female                                       −0.527     0.0624     0             0.000723
Age as of 1 January 1991                     −0.0437    0.0589     0.21          0.00996
Hispanic                                     −0.0634    0.0914     0.244         0.00026
Native American                              −0.299     0.266      0.13          0.000777
Asian/Pacific                                0.0286     0.179      0.562         0.000543
Black                                        −0.281     0.102      0.00286       0.000293
Other/Missing race                           −0.682     0.539      0.102         0.00148
1980 county group average log hourly wage    0.47       0.041      1             0.00194

Schooling equation
Variable                                     E(β|D)     Std(β|D)   Pr(β > 0|D)   NSE
Father's education                           0.117      0.0142     1             0.00023
Mother's education                           0.0894     0.0171     1             0.000667
Base year family income ($10,000)            0.236      0.0595     1             0.00483
Base year test score                         0.724      0.0437     1             0.00103
Number of siblings                           −0.0824    0.0242     0.000687      0.000442
Female                                       −0.0807    0.148      0.307         0.0141
Age as of 1 January 1991                     −0.379     0.0745     0             0.0119
Hispanic                                     0.218      0.121      0.952         0.00435
Native American                              −0.401     0.341      0.098         0.00618
Asian/Pacific                                0.918      0.206      1             0.000617
Black                                        0.24       0.154      0.928         0.0101
Other/Missing race                           −0.509     0.738      0.248         0.0293
1980 county group average schooling          0.149      0.0425     1             0.00326
Return to schooling                          13.5       5.2        0.997         0.569
Table 6
Posterior means, standard deviations, probabilities of being positive, and numerical standard error (NSE) values from the two-component mixture model.

Return to schooling equation
Variable                                     E(β|D)     Std(β|D)   Pr(β > 0|D)   NSE
Father's education                           −0.00057   0.000831   0.247         2.41e−06
Mother's education                           −0.00128   0.000945   0.0879        3.07e−06
Base year family income ($10,000)            −0.00868   0.00236    0.00015       6.8e−06
Base year test score                         0.00119    0.0026     0.678         3.77e−05
Number of siblings                           0.00111    0.00141    0.785         4.09e−06
Female                                       0.0242     0.00452    1             5.45e−05
Age as of 1 January 1991                     0.00137    0.00442    0.626         0.000784
Hispanic                                     0.00846    0.00678    0.894         1.93e−05
Native American                              0.0155     0.0208     0.771         6.09e−05
Asian/Pacific                                0.00121    0.0121     0.541         3.55e−05
Black                                        0.0174     0.00748    0.99          2.15e−05
Other/Missing race                           0.0548     0.0419     0.906         0.000115
1980 county grp. avg. return to schooling    0.162      0.0261     1             0.000125
In an attempt to identify this total impact as well as the component pieces that define it, in this section we look into the posterior predictive distribution, as discussed in Section 3.3. Specifically, let y_f, s_f, and θ_f denote the log monthly earnings, the level of schooling, and the return to schooling parameter for some hypothetical or "future" individual f. The posterior predictive distribution of these outcomes can be obtained as

\[
p(y_f, s_f, \theta_f \mid y, s, x_f, w_f, z_f) = \int p(y_f, s_f, \theta_f \mid \Gamma_{-\theta}, x_f, w_f, z_f)\, p(\Gamma_{-\theta} \mid y, s)\, d\Gamma_{-\theta}. \quad (28)
\]
Samples from this trivariate posterior predictive distribution can therefore be drawn, given a set of simulations from the posterior distribution p(Γ_{−θ}|y, s), the maintained model in (1)–(3), and values of the covariates x_f, w_f, and z_f. We generate a series of simulations from this posterior predictive distribution and use these to summarize the effects of various covariate changes on each outcome; a sketch of this simulation step is given below.
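Concretely, the sampling step behind (28) can be illustrated as follows. This is a minimal sketch rather than the authors' code: it assumes each posterior draw of Γ_{−θ} is stored as a dict of arrays (the key names are ours), and it collapses the two-component error mixture into a single Gaussian component with covariance matrix Σ for clarity.

```python
import numpy as np

def predictive_draw(g, x_f, w_f, z_f, rng):
    """One draw of (y_f, s_f, theta_f) from (28), given one posterior draw g."""
    # Joint draw of the (y, s, theta) unobservables; g["Sigma"] is their
    # 3x3 covariance matrix in this simplified single-component sketch.
    e_y, e_s, e_t = rng.multivariate_normal(np.zeros(3), g["Sigma"])
    # Return to schooling, Eq. (3): theta_f = eta0 + x*eta + w*lambda + eps.
    theta_f = g["eta0"] + x_f @ g["eta"] + w_f @ g["lam"] + e_t
    # Schooling, Eq. (2): s_f = delta0 + x*delta + z*gamma + rho*theta_f + v.
    s_f = g["delta0"] + x_f @ g["delta"] + z_f @ g["gamma"] + g["rho"] * theta_f + e_s
    # Log earnings, Eq. (1): y_f = beta0 + x*beta + s_f*theta_f + u.
    y_f = g["beta0"] + x_f @ g["beta"] + s_f * theta_f + e_y
    return y_f, s_f, theta_f

# rng = np.random.default_rng(0)
# draws = np.array([predictive_draw(g, x_f, w_f, z_f, rng) for g in posterior_draws])
```

Each covariate-change exercise below amounts to running this simulation twice, at the baseline and at the modified x_f, and summarizing the differences across draws.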
Specifically, we consider the effects of (a) both parents attaining a BA degree (as opposed to both being high school graduates only), (b) increasing family income by $10,000 (which corresponds almost exactly to a one-standard-deviation increase in family income), (c) increasing the baseline achievement scores by one standard deviation, (d) having two additional siblings, and (e) being female. Table 7 shows the posterior mean and posterior standard deviation of the impacts of such covariate changes on each of the three outcomes of interest. In reading the table, recognize, for example, that the "schooling" column summarizes the direct impact of the covariate change on educational attainment (as read directly from the schooling equation in Table 5) plus any indirect effect that such a change may also have on returns to education and, consequently, schooling levels. The monthly earnings figure in the first column of the table offers a complete summary of how the given covariate change filters through all channels and affects earnings. To evaluate all of these effects, we generate draws from the posterior predictive distribution, as described in (28), with each exercise requiring appropriate definitions of the covariates x_f. The rightmost column of Table 7, which describes the effect of the stated change on returns to schooling, can simply be read from Table 6.
Table 7
Posterior means (and standard deviations) from posterior predictive exercises.

Covariate change                                 Monthly earnings ($)   Schooling        Return to education
Both parents have a BA                           103.8 (103.5)          0.727 (0.031)    −0.007 (0.004)
Increase family income $10,000                   81.7 (47.1)            0.119 (0.020)    −0.009 (0.002)
One-standard-deviation increase in test scores   160.4 (124.3)          0.740 (0.020)    0.001 (0.002)
Adding two siblings                              −41.5 (28.13)          −0.135 (0.023)   0.002 (0.003)
Female                                           −360.2 (171.2)         0.245 (0.047)    0.024 (0.004)
Again, these results reveal that females have a much higher return to education (approximately 2.4 percentage points higher) than men. Student achievement scores do not appear to play a role in explaining variation in returns to schooling, while parental education and family income lower returns to education. Of these, family income plays the largest, though still reasonably minor, role, as a one-standard-deviation increase in family income lowers returns to schooling by less than 1 percentage point on average.

The second column summarizes the impacts of the considered covariate changes on the quantity of schooling attained. A one-standard-deviation increase in student achievement scores, or both parents attaining a four-year degree, produces a large change in the quantity of education attained, as both effects are found to increase educational attainment by approximately 3/4 of a year. Furthermore, these effects are estimated rather precisely, as the posterior standard deviation of the schooling impact in either case is very small relative to the mean. For these two particular exercises, the schooling increase primarily arises from the "direct effect", as revealed in Table 6, since increases in parental education and test scores were not strongly linked to private rates of return to education.

In contrast to this, Table 7 also shows that females acquire more schooling on average than males. This result appears, at first glance, at odds with Table 5, as the coefficient on the female indicator in the schooling equation is actually negative, though with considerable mass placed on both sides of zero. What Table 5 does not directly summarize, however, is the fact that females have much higher returns to education and, given that ρ > 0, tend to acquire more schooling on average as a result. Higher levels of educational attainment for women are also a feature of our data: women receive, on average, 0.33 years more education than men in our HSB sample. Our model is able to reproduce this feature of the data, as it predicts women to receive approximately 0.25 years more education than men, with the observed outcome of 0.33 falling within two posterior standard deviations of this point estimate. In our view, this result is quite interesting and, perhaps, new to the literature: once the variation in rates of return to education has been accounted for, there is no discernible difference in the predicted quantity of schooling attained by men and women. However, women attain more schooling, on average, than men because of a comparably high rate of return on such an investment.

The first column of Table 7 aggregates all of these channels and provides overall estimates of the various impacts on monthly earnings. In terms of posterior means, females earn approximately $360 less per month than men, even though they tend to acquire more schooling, on average. A one-standard-deviation increase in student achievement scores increases monthly earnings by about $160 on average, with the bulk of this increase explained by increased levels of educational attainment for those of higher ability. Similarly, graduation from college by both parents increases the child's monthly earnings by about $104, which, again, results primarily from higher educational attainment by such children. The impacts of family income and number of siblings also operate
in the directions we might expect, although the magnitude of these changes is smaller: a one-standard-deviation increase in parental income is associated with an average increase in (child) monthly earnings equal to $81.7, while the addition of two siblings tends to lower monthly earnings by about $42. At this stage of the model, there are also reasonably large amounts of uncertainty surrounding these mean impacts, as their estimation involves an aggregation of effects at each level of the system.

In Fig. 2, we again use our posterior simulations to characterize the differences in rates of return to education, educational attainment, and monthly earnings, and this time calculate such quantities for a representative white male and a representative black female. When performing these calculations, we fix the covariates at group-specific sample averages rather than restricting the covariate vectors to be equal for both groups. As shown in the leftmost columns of Fig. 2, the posterior distribution of the return to education parameter for black females is shifted to the right relative to that of a white male, with the posterior mean of the former being 0.093 and the latter approximately 0.05. Returns to education are, however, rather variable and difficult to completely characterize through observables, as summarized by the calculation Pr(θ_bf > θ_wm | y, s) ≈ 0.73.31

Unlike the return to schooling distributions, the educational attainment posterior predictive distributions are quite similar for both groups. Higher rates of return for black females lead them to acquire more education than white males, although this increase is offset by the fact that black females tend to come from less educated, larger, and less wealthy families, on average, than those of white males, and black females also have lower average test scores in our data. These offsetting effects culminate in very similar predictive distributions for educational attainment for both groups.

The rightmost columns of Fig. 2 plot the posterior predictive monthly earnings distributions for both representative individuals. The expected monthly earnings of a white male were approximately $2300 and those of a black female were approximately $1700.32 White males are far less likely to be characterized as low income as, for example, Pr(MonthEarn_wm > $1100 | y, s) ≈ 0.95, while Pr(MonthEarn_bf > $1100 | y, s) ≈ 0.80.33 Taken together, these calculations illustrate how complete outcome summaries can be obtained within the framework of our model while still identifying the separate individual channels through which particular covariates, or changes in them, filter to affect earnings.
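Probability statements such as Pr(θ_bf > θ_wm | y, s) are simple Monte Carlo side-calculations on such predictive draws. A hedged sketch (the array names are ours; draws_wm and draws_bf would hold predictive simulations of (y_f, s_f, θ_f) generated at the group-specific covariate means):

```python
import numpy as np

# draws_wm, draws_bf: arrays of shape (R, 3) with columns (y_f, s_f, theta_f)
# for the representative white male and black female, respectively.
def group_comparisons(draws_wm, draws_bf, threshold=1100.0):
    theta_wm, theta_bf = draws_wm[:, 2], draws_bf[:, 2]
    # Pr(theta_bf > theta_wm | y, s): share of paired draws with higher return.
    p_theta = np.mean(theta_bf > theta_wm)
    # Monthly earnings in dollars, from the log monthly earnings draws.
    earn_wm, earn_bf = np.exp(draws_wm[:, 0]), np.exp(draws_bf[:, 0])
    p_wm = np.mean(earn_wm > threshold)   # roughly 0.95 in the paper
    p_bf = np.mean(earn_bf > threshold)   # roughly 0.80 in the paper
    return p_theta, p_wm, p_bf
```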
31 Here, the subscript "bf" refers to black female while the subscript "wm" denotes white male.
32 Although these numbers might seem small, keep in mind that these are 1991 outcomes for a sample of young workers whose average age (and standard deviation) is 26.8 (0.54).
33 The choice of $1100 as a threshold is simply to fix ideas, yet is partially guided by policy. The 1991 HHS poverty threshold for a family of four, for example, was $13,400, motivating the monthly figure of $1100 as a choice with some interest.
Fig. 2. Posterior predictive outcome distributions for white males (top row) and black females (bottom row).
We conclude this discussion of our findings by offering a few comments regarding how our results fit within the context of the rather vast literature on this topic. To this end, we consider as a benchmark what we term the "standard IV model". This model is obtained by imposing homogeneity in causal impacts (i.e., setting θ_i = θ in (1), so that Eq. (3) becomes irrelevant) and dropping the term θ_iρ from (2). The resulting two-equation system is then fit using standard MCMC methods. When doing so, we obtain an estimate (posterior mean) of the common schooling effect equal to 0.059. This result is not terribly out of line with the findings of previous IV-based studies, although the majority of such studies tend to report larger impacts.34

In Table 8, we also report estimates of the overall average return to education, as well as estimates of this effect broken down by racial and gender groups, using our two-component version of the correlated random coefficient (CRC) model. On the whole, we can see that our estimate of the average causal effect and that from the homogeneous-effect standard IV model are rather similar, differing by about 0.7 percentage points, and with a fair degree of overlap between the marginal posterior distributions of these quantities.
34 Card (2001), for example, provides a review of a number of influential IV studies on returns to education. Most of these report IV estimates substantially exceeding their OLS estimates, and often in excess of 10%. Our homogeneous "IV-type" estimate tends to be closer to the consensus OLS estimate of these studies rather than their IV counterparts. As a partial explanation for this difference, it is important to recognize that we focus on a sample of young, reasonably well-educated workers in the HSB data, for whom returns to education are likely to be smaller, on average.
Table 8
Posterior estimates of average return to education across groups and models.

Predictive return to schooling               E(·|D)    Std(·|D)   Pr(· > 0|D)
Standard IV model
  θ                                          0.0591    0.0148     1
CRC model
  xη + wλ + Σ_{g=1}^{2} π_g η_{0g}           0.0663    0.0133     1
  White male                                 0.0511    0.0136     1
  White female                               0.0753    0.0136     1
  Black male                                 0.0686    0.0150     1
  Black female                               0.0927    0.0149     1
Point estimates of the average return to education for most racial and gender groups exceed the common-effect IV estimate, while the point estimate of returns to education for white males is lower than the IV estimate. Despite the similarity of results for the average causal effects from both models, it remains important to note that we should not expect these two estimates to converge to the same parameter; in general, the standard IV procedure will not consistently estimate the average causal effect in the population when treatment effect heterogeneity, as described by our model, is present.35 Our analysis can, however, recover this causal effect and much more, including characterizing the distribution of heterogeneous returns, describing if and to what extent agents act upon knowledge of their private returns, and clarifying the various channels through which covariates influence the outcomes of interest.
35 Further details on this issue are available on request, although most of these will repeat the arguments of Wooldridge (2003), who establishes conditions under which a properly implemented IV procedure will consistently recover the average causal effect.
7. Conclusion

In this paper, we have taken up the issue of Bayesian estimation of a correlated random coefficients model. In the past, estimation in these types of models has focused almost exclusively on the average causal effect in the population. Our model, though decidedly more richly parameterized than those of previous studies, enables the estimation of far more quantities of interest, including the variability and other features of the causal effect distribution, in addition to how individuals make treatment decisions on the basis of their gain from receipt of that treatment.

We applied our method to a widely studied problem in labor economics: estimation of the private return to education. Using data combined from High School and Beyond and the 1980 Census, we find evidence of heterogeneity in returns to education. Specifically, we find that some characteristics of agents typically associated with higher levels of schooling (such as family income) are, at the same time, associated with lower returns to schooling. This finding supports the idea that those who benefit most from education are not necessarily the ones who are observed to acquire the most education. In addition, individuals can be viewed as making their schooling decisions based, at least in part, on their return to education. Specifically, a one percentage point increase in the return to education is associated with an increase in schooling quantity equal to approximately 0.135 years.

Appendix. Identification

To fix ideas, we focus on one observation's contribution to the likelihood, denoted p(y, s|Γ_{−θ}), where the subscript i is dropped for simplicity and Γ_{−θ} denotes all parameters other than the return θ, which is to be integrated out of (1)–(3). In this regard, we first note that the marginal density s|Γ_{−θ} is obtained as

\[
s = [\delta_0 + \rho\eta_0] + x(\delta + \rho\eta) + z\gamma + w\rho\lambda + \tilde{u}_s, \quad (29)
\]
where ũ_s ≡ ρϵ + v. We now seek to derive the conditional density p(y|s, Γ_{−θ}). To this end, we note that

\[
p(y \mid s, \Gamma_{-\theta}) = \int_{-\infty}^{\infty} p(y, \theta \mid s, \Gamma_{-\theta})\, d\theta \quad (30)
\]
\[
= \int_{-\infty}^{\infty} p(y \mid \Gamma, s)\, p(\theta \mid \Gamma_{-\theta}, s)\, d\theta, \quad (31)
\]

where Γ denotes all parameters in the model. The assumptions of (1)–(3) imply that

\[
y \mid \Gamma, s \sim N\big(\beta_0 + x\beta + s\theta + r_1[s - \delta_0 - x\delta - z\gamma - \theta\rho] + r_2[\theta - \eta_0 - x\eta - w\lambda],\; V_y\big), \quad (32)
\]

with

\[
V_y \equiv \sigma_y^2 - [\sigma_{ys}\;\; \sigma_{y\theta}]
\begin{bmatrix} \sigma_s^2 & \sigma_{s\theta} \\ \sigma_{s\theta} & \sigma_\theta^2 \end{bmatrix}^{-1}
\begin{bmatrix} \sigma_{ys} \\ \sigma_{y\theta} \end{bmatrix}
\]

and

\[
r_1 \equiv \frac{\sigma_\theta^2 \sigma_{ys} - \sigma_{y\theta}\sigma_{s\theta}}{\sigma_\theta^2 \sigma_s^2 - \sigma_{s\theta}^2}, \qquad
r_2 \equiv \frac{\sigma_s^2 \sigma_{y\theta} - \sigma_{s\theta}\sigma_{ys}}{\sigma_\theta^2 \sigma_s^2 - \sigma_{s\theta}^2}. \quad (33)
\]

The integration in (31) also requires p(θ|Γ_{−θ}, s). Given the bivariate normality of (s, θ|Γ_{−θ}) implied from (2) and (3), we obtain

\[
\theta \mid \Gamma_{-\theta}, s \sim N\Big(\eta_0 + x\eta + w\lambda + \frac{b}{c}\big(s - \delta_0 - x\delta - z\gamma - \rho[\eta_0 + x\eta + w\lambda]\big),\; V_\theta\Big), \quad (34)
\]

where

\[
b \equiv \sigma_{s\theta} + \rho\sigma_\theta^2, \qquad c \equiv \sigma_s^2 + 2\rho\sigma_{s\theta} + \rho^2\sigma_\theta^2, \qquad V_\theta \equiv \sigma_\theta^2 - (b^2/c). \quad (35)
\]

Given (32) and (34), the integration involves completing the square in θ, recognizing a portion of the integrand as the kernel of a Gaussian distribution, and then accounting for all remaining terms. The result of this calculation shows that y|s, Γ_{−θ} is also normal; its regression function contains a constant and additive terms involving x, w, s, z, s², sx, sw, and sz. The parameters multiplying each of these terms in the regression function for y|s are given in Table 9. Similar algebra also reveals that

\[
\mathrm{Var}(y \mid s, \Gamma_{-\theta}) = \sigma_y^2 - \frac{[\sigma_{ys} + \rho\sigma_{y\theta}]^2}{c} + 2 s_i\left[\sigma_{y\theta} - (\sigma_{ys} + \rho\sigma_{y\theta})\frac{b}{c}\right] + s_i^2\left[\sigma_\theta^2 - \frac{b^2}{c}\right]. \quad (36)
\]

Table 9
Coefficients on terms in E(y|s, Γ_{−θ}), with b ≡ σ_{sθ} + ρσ_θ² and c ≡ σ_s² + 2ρσ_{sθ} + ρ²σ_θ².

Variable    Coefficient
Constant    β₀ − c⁻¹(δ₀ + ρη₀)[σ_ys + ρσ_yθ]
x           β − c⁻¹(δ + ρη)[σ_ys + ρσ_yθ]
w           −λρc⁻¹[σ_ys + ρσ_yθ]
s           η₀ + c⁻¹[σ_ys + ρσ_yθ] − [b/c](δ₀ + ρη₀)
z           −γc⁻¹[σ_ys + ρσ_yθ]
s²          bc⁻¹
sx          η − bc⁻¹(δ + ηρ)
sw          λ − bc⁻¹ρλ
sz          −γbc⁻¹

Some quick accounting suggests that, without any further restrictions placed on the model, there are 15 sets of unknowns and 17 sets of equations from s|Γ_{−θ} and y|s, Γ_{−θ} that can be used to recover these "structural" parameters. Henceforth, we restrict ourselves to establishing identification in the more difficult (and perhaps more realistic) case where γ = 0, ρ ≠ 0, and w is a scalar. In other words, we have one exclusion restriction in (3), no exclusion restrictions in (2), and make the assumption that ρ ≠ 0. In this case, we have one less set of parameters (γ) to estimate, but setting γ = 0 eliminates three of our estimating equations. Under these restrictions, the parameter vector can be broken down into eight sets of regression parameters

\[
[\beta_0 \;\; \beta \;\; \delta_0 \;\; \delta \;\; \rho \;\; \eta_0 \;\; \eta \;\; \lambda]
\]

and six parameters of the covariance matrix:

\[
[\sigma_s^2 \;\; \sigma_\theta^2 \;\; \sigma_y^2 \;\; \sigma_{ys} \;\; \sigma_{y\theta} \;\; \sigma_{s\theta}].
\]

In terms of equations that can be used to identify the above values, let a_r^j denote the (estimable) coefficient on variable r in equation j, j ∈ {s, y} and r ∈ {co, x, w, s, s², sx, sw}. It is understood that j = s refers to s|Γ_{−θ} in Eq. (29), j = y refers to the equation for y|s, Γ_{−θ} in Table 9, and r = co denotes the constant term in each equation. From the s marginal density, we estimate three coefficients and a variance parameter which, in the above notation, provides

\[
[a_{co}^s \;\; a_x^s \;\; a_w^s \;\; \hat{c}].
\]

Similarly, the y|s, Γ_{−θ} equation gives

\[
[a_{co}^y \;\; a_x^y \;\; a_w^y \;\; a_s^y \;\; a_{s^2}^y \;\; a_{sx}^y \;\; a_{sw}^y \;\; V_{co}^y \;\; V_s^y \;\; V_{s^2}^y],
\]

where, for the final three terms, V_r^y denotes the (estimable) coefficient multiplying the variable r in the expression for Var(y|s, Γ_{−θ}) in (36). Thus, we have 14 sets of equations to use in recovering the 14 sets of structural parameters. Note that

\[
a_{co}^s = \delta_0 + \rho\eta_0 \quad (37)
\]
\[
a_x^s = \delta + \rho\eta \quad (38)
\]
\[
a_w^s = \rho\lambda. \quad (39)
\]

The relationships in Table 9 can be stacked together to produce

\[
\begin{bmatrix}
1 & 0 & -a_{co}^s \hat{c}^{-1} & 0 & 0 & 0 & 0 \\
0 & I & -a_x^s \hat{c}^{-1} & 0 & 0 & 0 & 0 \\
0 & 0 & -a_w^s \hat{c}^{-1} & 0 & 0 & 0 & 0 \\
0 & 0 & \hat{c}^{-1} & 1 & -a_{co}^s \hat{c}^{-1} & 0 & 0 \\
0 & 0 & 0 & 0 & \hat{c}^{-1} & 0 & 0 \\
0 & 0 & 0 & 0 & -a_x^s \hat{c}^{-1} & I & 0 \\
0 & 0 & 0 & 0 & -a_w^s \hat{c}^{-1} & 0 & 1
\end{bmatrix}
\begin{bmatrix}
\hat{\beta}_0 \\ \hat{\beta} \\ \widehat{\sigma_{ys} + \rho\sigma_{y\theta}} \\ \hat{\eta}_0 \\ \hat{b} \\ \hat{\eta} \\ \hat{\lambda}
\end{bmatrix}
=
\begin{bmatrix}
a_{co}^y \\ a_x^y \\ a_w^y \\ a_s^y \\ a_{s^2}^y \\ a_{sx}^y \\ a_{sw}^y
\end{bmatrix}
\]

or, succinctly, H_y Γ_y = a_y, where I denotes an identity matrix of appropriate size. The matrix H_y is full rank; hence, the terms in Γ_y are identified and could be estimated as Γ_y = H_y^{−1} a_y. The remaining parameters ρ, δ, and δ₀ can then be obtained from (37)–(39) as

\[
\hat{\rho} = a_w^s \hat{\lambda}^{-1} \quad (40)
\]
\[
\hat{\delta} = a_x^s - \hat{\rho}\hat{\eta} \quad (41)
\]
\[
\hat{\delta}_0 = a_{co}^s - \hat{\rho}\hat{\eta}_0. \quad (42)
\]

It remains to discuss the parameters of Σ. Note that ĉ, b̂, [σ_ys + ρσ_yθ], and (36) can be employed to recover their values. Specifically,

\[
\begin{bmatrix}
\rho & 0 & 0 & 0 & 1 & 0 \\
\rho^2 & 1 & 0 & 0 & 2\rho & 0 \\
0 & 0 & 0 & \rho & 0 & 1 \\
0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0
\end{bmatrix}
\begin{bmatrix}
\sigma_\theta^2 \\ \sigma_s^2 \\ \sigma_y^2 \\ \sigma_{y\theta} \\ \sigma_{s\theta} \\ \sigma_{ys}
\end{bmatrix}
=
\begin{bmatrix}
\hat{b} \\ \hat{c} \\ [\sigma_{ys} + \rho\sigma_{y\theta}] \\ V_{co}^y + [\sigma_{ys} + \rho\sigma_{y\theta}]^2\hat{c}^{-1} \\ V_s^y/2 + [\sigma_{ys} + \rho\sigma_{y\theta}]\hat{b}\hat{c}^{-1} \\ V_{s^2}^y + \hat{b}^2\hat{c}^{-1}
\end{bmatrix}
\]

or, succinctly, H_σ Γ_σ = V_σ. The matrix H_σ is, again, full rank; hence, the parameters of the covariance matrix are identifiable and could be estimated as Γ_σ = H_σ^{−1} V_σ.
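As a numerical sanity check on this algebra (our own, not part of the paper), one can build H_y and H_σ for arbitrary parameter values, with scalar x and w so that the identity blocks reduce to 1, and confirm that both matrices are invertible:

```python
import numpy as np

# Illustrative values only; a_w != 0 corresponds to the assumption rho*lambda != 0.
rho, a_co, a_x, a_w, c_hat = 0.5, 1.2, 0.8, 0.3, 2.0

Hy = np.array([
    [1, 0, -a_co / c_hat, 0, 0,             0, 0],
    [0, 1, -a_x / c_hat,  0, 0,             0, 0],
    [0, 0, -a_w / c_hat,  0, 0,             0, 0],
    [0, 0, 1 / c_hat,     1, -a_co / c_hat, 0, 0],
    [0, 0, 0,             0, 1 / c_hat,     0, 0],
    [0, 0, 0,             0, -a_x / c_hat,  1, 0],
    [0, 0, 0,             0, -a_w / c_hat,  0, 1],
])  # unknowns: (beta0, beta, sigma_ys + rho*sigma_ytheta, eta0, b, eta, lambda)

Hs = np.array([
    [rho,    0, 0, 0,   1,     0],  # b = sigma_stheta + rho*sigma_theta^2
    [rho**2, 1, 0, 0,   2*rho, 0],  # c = sigma_s^2 + 2*rho*sigma_stheta + rho^2*sigma_theta^2
    [0,      0, 0, rho, 0,     1],  # sigma_ys + rho*sigma_ytheta
    [0,      0, 1, 0,   0,     0],  # sigma_y^2 recovered from V_co^y
    [0,      0, 0, 1,   0,     0],  # sigma_ytheta recovered from V_s^y
    [1,      0, 0, 0,   0,     0],  # sigma_theta^2 recovered from V_{s^2}^y
])  # unknowns: (sigma_theta^2, sigma_s^2, sigma_y^2, sigma_ytheta, sigma_stheta, sigma_ys)

for H in (Hy, Hs):
    assert np.linalg.matrix_rank(H) == H.shape[0]  # full rank => identified
```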
References

Angrist, J.D., Krueger, A.B., 1991. Does compulsory school attendance affect schooling and earnings? Quarterly Journal of Economics 106, 979–1014.
Björklund, A., Moffitt, R., 1987. The estimation of wage gains and welfare gains in self-selection models. Review of Economics and Statistics 69 (1), 42–49.
Card, D., 2001. Estimating the return to schooling: progress on some persistent econometric problems. Econometrica 69, 1127–1160.
Chib, S., 2007. Analysis of treatment response data without the joint distribution of potential outcomes. Journal of Econometrics 140 (2), 401–412.
Chib, S., Hamilton, B.H., 2000. Bayesian analysis of cross-section and clustered data treatment models. Journal of Econometrics 97 (1), 25–50.
Chib, S., Hamilton, B.H., 2002. Semiparametric Bayes analysis of longitudinal data treatment models. Journal of Econometrics 110, 67–89.
Conley, T., Hansen, C., McCulloch, R., Rossi, P.E., 2008. A semi-parametric Bayesian approach to the instrumental variable problem. Journal of Econometrics 144 (1), 276–305.
Deb, P., Munkin, M.K., Trivedi, P., 2006. Bayesian analysis of the two-part model with endogeneity: application to health care expenditure. Journal of Applied Econometrics 21 (7), 1081–1099.
Frühwirth-Schnatter, S., 2001. Markov chain Monte Carlo estimation of classical and dynamic switching mixture models. Journal of the American Statistical Association 96 (453), 194–209.
Ganderton, P.T., 1992. The effect of subsidies in kind on the choice of college. Journal of Public Economics 48, 269–292.
Gelfand, A.E., Dey, D.K., 1994. Bayesian model choice: asymptotics and exact calculations. Journal of the Royal Statistical Society, Series B, 501–514.
Geweke, J., 1999. Using simulation methods for Bayesian econometric models: inference, development and communication. Econometric Reviews 18 (1), 1–73.
Geweke, J., 2004. Getting it right: joint distribution tests of posterior simulators. Journal of the American Statistical Association 99, 799–804.
Geweke, J., 2007. Interpretation and inference in mixture models: simple MCMC works. Computational Statistics and Data Analysis 51, 3529–3550.
Geweke, J., Keane, M., 2007. Smoothly mixing regressions. Journal of Econometrics 138, 252–290.
Grogger, J., 1996a. Does school quality explain the recent black/white wage trend? Journal of Labor Economics 14 (2), 231–253.
Grogger, J., 1996b. School expenditures and post-schooling earnings: evidence from high school and beyond. Review of Economics and Statistics 78 (4), 628–637.
Hanushek, E.A., Taylor, L.L., 1990. Alternative assessments of the performance of schools: measurement of state variation in achievement. Journal of Human Resources 25 (2), 179–201.
Heckman, J.J., 1997. Instrumental variables: a study of implicit behavioral assumptions used in making program evaluations. Journal of Human Resources 32 (3), 441–462.
Heckman, J.J., Smith, J., 1999. Evaluating the welfare state. In: Strom, S. (Ed.), Econometrics and Economic Theory in the 20th Century: The Ragnar Frisch Centennial Symposium. Econometric Society Monographs, Cambridge University Press, Cambridge.
Heckman, J.J., Vytlacil, E., 1998. Instrumental variables methods for the correlated random coefficient model: estimating the average rate of return to schooling when the return is correlated with schooling. Journal of Human Resources 33, 974–987.
Heckman, J.J., Vytlacil, E., 1999. Local instrumental variables and latent variable models for identifying and bounding treatment effects. Proceedings of the National Academy of Sciences 96, 4730–4734.
Heckman, J.J., Vytlacil, E., 2005. Structural equations, treatment effects and econometric policy evaluation. Econometrica 73 (3), 669–738.
Henderson, D.J., Polachek, S.W., Wang, L., 2009. Heterogeneity in schooling rates of return. Working Paper, Department of Economics, SUNY-Binghamton.
Hoogerheide, L., Kleibergen, F., van Dijk, H.K., 2007. Natural conjugate priors for the instrumental variables regression model applied to the Angrist–Krueger data. Journal of Econometrics 138 (1), 63–103.
Imbens, G., Angrist, J., 1994. Identification and estimation of local average treatment effects. Econometrica 62 (2), 467–475.
Kleibergen, F.R., Zivot, E., 2003. Bayesian and classical approaches to instrumental variable regression. Journal of Econometrics 114, 29–72.
Koop, G., Poirier, D.J., 1997. Learning about the across-regime correlation in switching regression models. Journal of Econometrics 78, 217–227.
Koop, G., Poirier, D.J., Tobias, J.L., 2007. Bayesian Econometric Methods. Cambridge University Press.
Koop, G., Tobias, J.L., 2004. Learning about heterogeneity in returns to schooling. Journal of Applied Econometrics 19 (7), 827–849.
Lancaster, T., 2004. An Introduction to Modern Bayesian Econometrics. Blackwell.
Li, K., 1998. Bayesian inference in a simultaneous equation model with limited dependent variables. Journal of Econometrics 85 (2), 387–400.
Li, M., 2006. High school completion and future youth unemployment: new evidence from high school and beyond. Journal of Applied Econometrics 21, 23–53.
Li, M., 2007. Bayesian proportional hazard analysis of the timing of high school dropout decisions. Econometric Reviews 26, 529–556.
Li, M., Poirier, D.J., Tobias, J.L., 2003. Do dropouts suffer from dropping out? Estimation and prediction of outcome gains in generalized selection models. Journal of Applied Econometrics 19 (2), 203–225.
Manchanda, P., Rossi, P.E., Chintagunta, P.K., 2004. Response modeling with nonrandom marketing-mix variables. Journal of Marketing Research XLI, 467–478.
Munkin, M., Trivedi, P., 2003. Bayesian analysis of a self-selection model with multiple outcomes using simulation-based estimation: an application to the demand for healthcare. Journal of Econometrics 114, 197–220.
Pearson, K., 1894. Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society of London, Series A 185, 71–110.
Poirier, D.J., Tobias, J.L., 2003. On the predictive distributions of outcome gains in the presence of an unidentified parameter. Journal of Business and Economic Statistics 21 (2), 258–268.
Rivkin, S.G., 1991. Schooling and employment in the 1980s: who succeeds? Ph.D. Dissertation, UCLA, Department of Economics.
Rossi, P.E., Allenby, G.M., McCulloch, R., 2005. Bayesian Statistics and Marketing. Wiley.
Roy, A.D., 1951. Some thoughts on the distribution of earnings. Oxford Economic Papers, New Series 3, 135–146.
Wooldridge, J.M., 1997. On two stage least squares estimation of the average treatment effect in a random coefficient model. Economics Letters 56, 129–133.
Wooldridge, J.M., 2003. Further results on instrumental variables estimation of average treatment effects in the correlated random coefficient model. Economics Letters 79, 185–191.
Journal of Econometrics 162 (2011) 362–368
Regression with imputed covariates: A generalized missing-indicator approach

Valentino Dardanoni a, Salvatore Modica a, Franco Peracchi b,*
a University of Palermo, Italy
b Tor Vergata University and EIEF, Italy
Article history: Received 2 October 2009; Received in revised form 11 December 2010; Accepted 8 February 2011; Available online 25 February 2011.
JEL classification: C12; C13; C19.
Keywords: Missing covariates; Imputations; Bias-precision trade-off; Model reduction; Model averaging; BMI and income.
Abstract: A common problem in applied regression analysis is that covariate values may be missing for some observations but imputed values may be available. This situation generates a trade-off between bias and precision: the complete cases are often disarmingly few, but replacing the missing observations with the imputed values to gain precision may lead to bias. In this paper, we formalize this trade-off by showing that one can augment the regression model with a set of auxiliary variables so as to obtain, under weak assumptions about the imputations, the same unbiased estimator of the parameters of interest as complete-case analysis. Given this augmented model, the bias-precision trade-off may then be tackled by either model reduction procedures or model averaging methods. We illustrate our approach by considering the problem of estimating the relation between income and the body mass index (BMI) using survey data affected by item non-response, where the missing values on the main covariates are filled in by imputations.
Introduction

A common problem in applied regression analysis is that covariate values may be missing for some observations but imputed values may be available, either values provided by the data-producing agency or directly constructed by the researcher. This problem has received little attention compared to the more general problem of missing covariate values, but is of considerable practical relevance, as all empirical researchers know well.

In many cases, it is safe to assume that the mechanism leading to missing covariate values does not depend on the outcome of interest. In these cases, one can ignore the missing data mechanism and focus on the problem of what use to make of the available imputations. There are two main approaches to this problem. One is to simply ignore the imputations and only use the observations with complete data on all covariates, the so-called complete-case analysis. Although this may entail a loss of precision, it has the strong appeal of yielding an unbiased estimator of the parameters of interest when the missing data mechanism is ignorable. The other approach is more concerned with precision and replaces
the missing covariate values with the imputations. A refined version of this approach corrects for the use of the imputed values by some variant of the so-called missing-indicator method (Little, 1992; Horton and Kleinman, 2007; Little and Rubin, 2002), which consists of augmenting the regression model with a set of binary indicators for each covariate with missing values. Although frequently used in practice, this approach is known to produce biased estimates (Jones, 1996; Horton and Kleinman, 2007). It also raises the problem of how to assess the precision of the estimators, a problem that we ignore in this paper because it can easily be handled by multiple imputation methods (Rubin, 1987).

Thus, when covariate values are missing we face a trade-off between bias and precision: the complete cases are often disarmingly few, but replacing the missing observations with the imputed values to gain precision may lead to bias. In this paper, we formalize the bias-precision trade-off by showing that one can augment the regression model with a set of auxiliary variables so as to obtain, under weak assumptions about the imputations, the same unbiased estimator of the parameters of interest as complete-case analysis. Given this augmented model, the bias-precision trade-off may then be tackled either by standard model reduction procedures or, more aptly in our view, by model averaging methods.

We illustrate our approach by considering the problem of estimating the relationship between income and the body mass
index (BMI) using survey data affected by item non-response, where the missing values on the main covariates are filled in by imputation.

The sequel of the paper is organized as follows. Section 1 presents the basic notation. Section 2 discusses complete-case analysis. Sections 3 and 4 present the augmented model with auxiliary variables and discuss its missing-indicator interpretation. Section 5 contains our main result. Section 6 discusses the trade-off between bias and precision. Section 7 presents our application to modeling the relation between BMI and income. Finally, Section 8 offers some concluding remarks.

1. Notation

Observations are indexed by n = 1, ..., N, and covariates by k = 0, 1, ..., K − 1, with k = 0 corresponding to the constant term and K > 1. We consider the classical linear model

\[
y = X\beta + u, \quad (1)
\]

where y is the N × 1 vector of observations on the outcome of interest, X is an N × K matrix of observations on the covariates, β is the K × 1 vector of coefficients, and u is an N × 1 vector of homoskedastic and serially uncorrelated regression errors with zero mean conditional on X.

A subsample with incomplete data is a group of observations where one or more covariates are missing. Because the constant term is always observed, the number of possible subsamples with incomplete data is equal to 2^{K−1} − 1. Not all such subsamples need be present in a data set. In addition to the subsample with complete data (indexed by j = 0), we assume to have J ≤ 2^{K−1} − 1 subsamples with incomplete data, indexed by j = 1, ..., J. This formulation covers both the case when some patterns of missing covariates are not present in the data and the case when the investigator decides to drop from the analysis some groups with incomplete data.

Let N_j, K_j and K_j* = K − K_j, respectively, denote the sample size, the number of available covariates (the covariates with no missing values, including the constant term), and the number of missing covariates in the jth subsample. By construction, Σ_{j=0}^{J} N_j = N, K_0 = K, K_0* = 0, and 1 ≤ K_j, K_j* < K for j = 1, ..., J. Let y^j, X_a^j and X_m^j, respectively, denote the N_j × 1 outcome vector, the N_j × K_j submatrix containing the values of the available covariates, and the N_j × K_j* submatrix containing the values of the missing covariates for the jth subsample. Also, let X^j = [X_a^j, X_m^j], an N_j × K matrix. We assume that X^0 = X_a^0 is of full column rank, which implies that N_0 ≥ K.
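Grouping observations into the subsamples j = 0, 1, ..., J is mechanical; the following is a minimal sketch (ours, not the authors', with missing values coded as NaN):

```python
import numpy as np

def split_by_pattern(X):
    """Group row indices by their pattern of missing (NaN) covariates.

    Returns a list of index arrays; entry 0 is the complete-data subsample
    (j = 0) and the remaining entries are the J incomplete-data subsamples.
    """
    M = np.isnan(X)                              # missing-data indicator matrix
    groups = {}
    for n, pattern in enumerate(map(tuple, M)):
        groups.setdefault(pattern, []).append(n)
    complete = tuple([False] * X.shape[1])       # pattern with nothing missing
    keys = [complete] + sorted(k for k in groups if k != complete)
    return [np.array(groups[k]) for k in keys if k in groups]
```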
2. Complete-case analysis

Our benchmark in dealing with missing values is the so-called complete-case method, which uses only the observations with complete data on all covariates. Let M denote the N × K missing-data indicator matrix, whose (n, k)th element m_nk takes value 1 if the nth observation contains a missing value on the kth covariate and value 0 otherwise. The following assumption is common to most approaches to the problem of missing covariate values and is maintained throughout this paper.

Assumption 1 (Ignorability). M and y are conditionally independent given X.

By symmetry of conditional independence, it is easily seen that Assumption 1 is equivalent to the following two assumptions:
\[
P(y \mid X, M) = P(y \mid X) \quad (2)
\]

and

\[
P(M \mid y, X) = P(M \mid X). \quad (3)
\]
Assumption (2) basically says that if we knew the true values of the missing covariates, knowing the pattern of missing data would not help in predicting y. Assumption (3) implies that the missing data mechanism, seen as a function of y and X, depends on X only. Assumption 1 may fail if, for example, observations with missing covariate values have a different regression function than observations with no missing values. On the other hand, it does not place restrictions on how M is generated from X. For example, M may exhibit patterns such that cases with low or high levels of some covariates systematically have a greater percentage of missing values.

Theorem 1 provides a formal proof of the fact that, under Assumption 1, the OLS estimator for the complete case is unbiased. This result has been known for a long time, but may be considered a "folk theorem". Little (1992) and Little and Rubin (2002) attribute it to an unpublished 1986 technical report by William Glynn and Nan Laird. Private communication with Nan Laird, however, informs us that the report has never been published and is no longer available. Jones (1996) offers a proof for the case of two covariates, one of which has missing values, whereas Wooldridge (2002, p. 553) shows that the two-stage least-squares estimator for the complete case is consistent.

Theorem 1 (Complete-Case Estimation). If Assumption 1 holds, then the OLS estimator of β obtained by using only the observations with complete data on all covariates is unbiased for β.

Proof. The OLS estimator for the complete data may be written as follows:
\[
\hat{\beta} = (X' D X)^{-1} X' D y,
\]

where D is an N × N diagonal matrix whose nth diagonal element d_n takes value 1 if no covariate is missing for the nth observation and value 0 otherwise. The elements of D are related to the elements of the missing-indicator matrix M through d_n = \prod_{k=1}^{K} (1 − m_{nk}). The Ignorability assumption implies that any function of M, in particular D, is independent of y conditional on X. From (2),

\[
E(\hat{\beta} \mid X, D) = (X' D X)^{-1} X' D X \beta = \beta,
\]

and therefore E(\hat{\beta}) = \beta.
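In code, the complete-case estimator in the proof is essentially one line; the sketch below is our own illustration (not from the paper) and codes missing covariate values as NaN:

```python
import numpy as np

def complete_case_ols(y, X):
    """OLS of y on X using only rows with no missing (NaN) covariate values."""
    M = np.isnan(X).astype(int)        # missing-data indicator matrix M
    d = (M.sum(axis=1) == 0)           # d_n = prod_k (1 - m_nk): complete rows
    Xc, yc = X[d], y[d]                # keep complete cases only
    beta_hat, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
    return beta_hat

# Example: 200 observations, constant plus two covariates, some values missing.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=200)
X[rng.random(200) < 0.2, 1] = np.nan   # 20% of the first covariate missing
print(complete_case_ols(y, X))
```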
An implication of Theorem 1 is that the subsample with complete data satisfies

\[
y^0 = X^0 \beta + u^0, \quad (4)
\]

where u^0 is an N_0 × 1 vector of homoskedastic and serially uncorrelated regression errors. This result supports the common practice of complete-case analysis, namely estimating β by regressing y^0 on X^0. However, severe loss of information, and hence of precision, may result unless the fraction of deleted cases is small.

3. The augmented model with auxiliary variables

Suppose that, for each subsample j = 1, ..., J with incomplete data, the values of the K_j* missing covariates are filled in using some imputation procedure. A covariate with imputed values is called an imputed covariate. The N_j × K_j* matrix corresponding to the set of imputed covariates is called the imputation matrix for the jth subsample and is denoted by L^j. The N_j × K matrix W^j = [X_a^j, L^j], whose columns correspond to the K_j available covariates and the K_j* imputed covariates, is called the completed design
matrix for the jth subsample. Our treatment of imputation is very general and covers a variety of imputation procedures, including regression and donor-based methods such as nearest-neighbor and hot-deck imputations. It also allows for the possibility that different imputation procedures are used for different covariates, or for different subsamples with incomplete data.

Consider modeling the N_j × 1 outcome vector y^j for the jth subsample as a linear function of the observed covariates in W^j. The best (minimum mean-square error) linear predictor of y^j given W^j = [X_a^j, L^j] is

\[
E^*(y^j \mid X_a^j, L^j) = E^*(X^j \beta \mid X_a^j, L^j)
= X_a^j \beta_a^j + E^*(X_m^j \mid X_a^j, L^j)\beta_m^j
= X_a^j \beta_a^j + (X_a^j \Delta^j + L^j \Gamma^j)\beta_m^j
= X_a^j \gamma_a^j + L^j \gamma_m^j,
\]

where β_a^j and β_m^j are the subvectors of β associated with X_a^j and X_m^j, respectively, E^*(X_m^j | X_a^j, L^j) = X_a^j Δ^j + L^j Γ^j is the best linear predictor of X_m^j given X_a^j and L^j, and

\[
\gamma_a^j = \beta_a^j + \Delta^j \beta_m^j, \qquad \gamma_m^j = \Gamma^j \beta_m^j.
\]

The resulting linear model for the jth subsample may be written, more compactly, as

\[
y^j = W^j \gamma^j + u^j, \qquad j = 1, \ldots, J, \quad (5)
\]
where γ^j is the K × 1 vector consisting of the coefficients associated with the observed and the imputed covariates in W^j = [X_a^j, L^j], and u^j is an N_j × 1 vector of projection errors that, by construction, have mean zero and are uncorrelated with W^j. Two important features distinguish model (5) from the original model (1). First, the vector of population coefficients γ^j is generally different from β unless Δ^j = 0 and Γ^j is equal to the identity matrix or, equivalently, E^*(X_m^j | X_a^j, L^j) = L^j, that is, given the imputations, the available covariates contain no further information about the missing covariates. Second, the elements of the error vector u^j are not necessarily homoskedastic, even when homoskedasticity holds for the elements of u.

Letting δ^j = γ^j − β, j = 1, ..., J, and stacking on top of each other the complete-case model and the J linear models for the subsamples with incomplete data give

\[
\begin{bmatrix} y^0 \\ y^* \end{bmatrix} =
\begin{bmatrix} X^0 \\ W^* \end{bmatrix}\beta +
\begin{bmatrix} 0 \\ Z^* \end{bmatrix}\delta +
\begin{bmatrix} u^0 \\ u^* \end{bmatrix},
\]

where

\[
y^* = \begin{bmatrix} y^1 \\ \vdots \\ y^J \end{bmatrix}, \quad
W^* = \begin{bmatrix} W^1 \\ \vdots \\ W^J \end{bmatrix}, \quad
Z^* = \begin{bmatrix} W^1 & & \\ & \ddots & \\ & & W^J \end{bmatrix}, \quad
u^* = \begin{bmatrix} u^1 \\ \vdots \\ u^J \end{bmatrix},
\]

and δ is the JK × 1 vector consisting of δ^1, ..., δ^J. We can now write the model for the available and the imputed data as the grand model:

\[
y = W\beta + Z\delta + u, \quad (6)
\]

where β is the parameter of primary interest, δ is a vector of nuisance parameters, and

\[
y = \begin{bmatrix} y^0 \\ y^* \end{bmatrix}, \quad
W = \begin{bmatrix} X^0 \\ W^* \end{bmatrix}, \quad
Z = \begin{bmatrix} 0 \\ Z^* \end{bmatrix}, \quad
u = \begin{bmatrix} u^0 \\ u^* \end{bmatrix},
\]

respectively an N-vector, an N × K matrix, an N × JK matrix, and an N-vector. Note that the matrix W is obtained by filling in the missing covariate values with the available imputations. Model (6) includes all observations re-ordered groupwise: first the complete cases, then the first group with incomplete data, and so on. The ordering of the groups is arbitrary and plays no role in the analysis. In the terminology of Danilov and Magnus (2004), the K columns of W are the "focus" regressors, while the JK columns of Z are the "auxiliary" regressors.

4. A missing-indicator interpretation

Before presenting our main result it is instructive to give a missing-indicator interpretation of model (6). Indeed, the JK auxiliary variables in the matrix Z are obtained by multiplying the covariates in each group by the various indicators of group membership. To see this, write Z = [Z_1, ..., Z_j, ..., Z_J], where Z_j is the N × K matrix that contains the auxiliary variables for the jth group. Let 1_K denote the 1 × K vector whose elements are all equal to one and let d_j denote the N × 1 vector of group-membership indicators for the jth group (the elements of d_j are equal to one for observations in group j and zero otherwise). Then

\[
Z_j = [1_K \otimes d_j] \cdot W, \qquad j = 1, \ldots, J,
\]

where ⊗ denotes the Kronecker product and · the Hadamard (elementwise) product. As an illustration, consider the linear model

\[
E(y_n \mid x_{n1}, x_{n2}) = \beta_0 + \beta_1 x_{n1} + \beta_2 x_{n2},
\]

with a constant term and two covariates, x_1 and x_2. Suppose that, in addition to the group with complete data, one has two groups with incomplete data: in group 1 [resp. 2] only the first [resp. second] covariate is missing. If d_{n0}, d_{n1} and d_{n2} denote the group-membership indicators, and L^1_{n1} and L^2_{n2} denote the imputed values in each group with incomplete data, then we may write

\[
y_n = d_{n0}(\beta_0 + \beta_1 x_{n1} + \beta_2 x_{n2}) + d_{n1}(\gamma_{01} + \gamma_{11} L^1_{n1} + \gamma_{21} x_{n2}) + d_{n2}(\gamma_{02} + \gamma_{12} x_{n1} + \gamma_{22} L^2_{n2}) + u_n.
\]

Let w_{nk} be equal to x_{nk} if the kth covariate is observed for the nth observation and to its imputed value otherwise. Then, the last relation may be written as

\[
y_n = \beta_0 + \beta_1 w_{n1} + \beta_2 w_{n2} + \delta_{01} d_{n1} + \delta_{02} d_{n2} + \delta_{11} d_{n1} w_{n1} + \delta_{12} d_{n2} w_{n1} + \delta_{21} d_{n1} w_{n2} + \delta_{22} d_{n2} w_{n2} + u_n,
\]

where δ_{kj} = γ_{kj} − β_k. This is exactly model (6) for this special case, where the auxiliary variables added to the w_k's are the group-membership indicators and their interactions with the constant term and the observed or imputed covariates. (A numerical construction of Z along these lines is sketched below.)
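As noted above, the construction of Z has a direct array implementation. The sketch below is ours, not the authors' code; it simply multiplies each column of the completed design matrix W by the group-membership indicators:

```python
import numpy as np

def build_Z(W, group):
    """Auxiliary regressors Z = [Z_1, ..., Z_J] with Z_j = [1_K (x) d_j] . W.

    W     : (N, K) completed design matrix (missing values already imputed)
    group : (N,) integer labels, 0 for the complete-data group, 1..J otherwise
    """
    J = int(group.max())
    blocks = []
    for j in range(1, J + 1):
        d_j = (group == j).astype(float)      # group-membership indicator d_j
        blocks.append(W * d_j[:, None])       # Hadamard product with 1_K (x) d_j
    return np.hstack(blocks)                  # N x JK matrix

# The grand model (6) then regresses y on [W, build_Z(W, group)].
```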
5. Main result

The following result shows that, no matter which imputation procedure is chosen, the OLS estimate of β in the grand model (6) and that in the complete-case model (4) are numerically the same. Thus, the statistical properties of the two estimators are also the same. In particular, if the latter is unbiased (for example, when the conditions of Theorem 1 hold), so is the former.

Theorem 2. Suppose that the matrix W is of full column rank K and that N ≥ K(J + 1). Then, for any choice of imputation matrices L^1, ..., L^J, the OLS estimate of β in the "grand model" (6) coincides with the OLS estimate of β in the complete-case model (4).
Proof. Given any matrix A, let R_A = I − A(A′A)^−A′, where (A′A)^− denotes a g-inverse of A′A. Since Z′Z and Z have the same rank, the rank of Z(Z′Z)^−Z′ is equal to the rank of Z (Rao and Mitra, 1971). The fact that the rank of Z may be less than JK implies that the rank of R_Z must be at least N − JK. Because N ≥ K(J + 1) implies that K ≤ N − JK, it follows that the rank of W cannot exceed the rank of R_Z, so the matrix W′R_Z W must be nonsingular. Thus, by the Frisch–Waugh–Lovell (Partitioned Regression) Theorem, the OLS estimate of β in model (6) is
\[
\hat{\beta} = (W' R_Z W)^{-1} W' R_Z y = (\tilde{X}'\tilde{X})^{-1}\tilde{X}' y,
\]

where X̃ = R_Z W. Next notice that

\[
\tilde{X} = \begin{bmatrix} I_{N_0} & & & \\ & R_{W^1} & & \\ & & \ddots & \\ & & & R_{W^J} \end{bmatrix}
\begin{bmatrix} X^0 \\ W^1 \\ \vdots \\ W^J \end{bmatrix}
= \begin{bmatrix} X^0 \\ 0 \\ \vdots \\ 0 \end{bmatrix},
\]

where we used the fact that W^j (W^{j\prime} W^j)^{-} W^{j\prime} W^j = W^j for all j and any choice of g-inverse (Rao and Mitra, 1971). Therefore, X̃′X̃ = X^{0′}X^0 and X̃′y = X^{0′}y^0. Hence

\[
\hat{\beta} = (\tilde{X}'\tilde{X})^{-1}\tilde{X}' y = (X^{0\prime} X^0)^{-1} X^{0\prime} y^0,
\]
which is the complete-case estimate.
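Theorem 2 is easy to confirm by simulation. The following check is our own construction under the theorem's assumptions: missing values are replaced by arbitrary imputations, and the OLS estimate of β in the grand model (6) matches the complete-case estimate to machine precision:

```python
import numpy as np

rng = np.random.default_rng(2)
N, K = 300, 3                                    # constant plus two covariates
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=N)

group = np.zeros(N, dtype=int)                   # j = 0: complete data
group[200:250] = 1                               # group 1: first covariate missing
group[250:] = 2                                  # group 2: second covariate missing

W = X.copy()
W[group == 1, 1] = rng.normal(size=50)           # arbitrary imputations L^1
W[group == 2, 2] = rng.normal(size=50)           # arbitrary imputations L^2

Z = np.hstack([W * (group == j).astype(float)[:, None] for j in (1, 2)])

# Complete-case estimate (Theorem 1) ...
b_cc, *_ = np.linalg.lstsq(X[group == 0], y[group == 0], rcond=None)
# ... and the beta-block of OLS in the grand model (6).
b_grand, *_ = np.linalg.lstsq(np.hstack([W, Z]), y, rcond=None)
assert np.allclose(b_cc, b_grand[:K])            # Theorem 2: numerically equal
```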
The matrix W is of full column rank if, as we already assumed, the K columns of X^0 are linearly independent. The use of a g-inverse in the proof of the theorem is necessary because some of the completed design matrices W^j may be singular, which happens if N_j < K or if N_j ≥ K but the columns of W^j are linearly dependent. The latter is, for example, the case when a missing covariate value is replaced by its average value for the available cases (mean imputation) or by its predicted values based on the observed covariates X_a^j and the coefficients from an OLS regression using the subsample with complete data (deterministic regression imputation). One can replace a g-inverse with the regular inverse when the J subsamples with incomplete data are such that all W^j's have full column rank. In practice, this may be achieved by dropping groups that contain too few observations and avoiding mean imputation or deterministic regression imputation. After all, these two imputation methods are known to produce completed data sets with undesirable properties; for example, they have less variability than a set of truly observed values (see, e.g., Lundström and Särndal, 2002).

The complete-case model (4) and the grand model (6) may at first appear as two polar approaches to the problem of handling missing data in model (1). At one extreme is complete-case analysis. Under the assumption of Theorem 1, this gives an unbiased estimate of β but may throw away too much information by retaining only the observations in the subsample with complete data. At the other extreme, all observations are retained but some imputation procedure is adopted to fill in the missing data. In fact, Theorem 2 shows that if β and δ in (6) are left unconstrained, then this second approach is equivalent to complete-case analysis as far as estimation of β is concerned. A referee offered the following heuristic. Our model places no restrictions (equivalently, uses no information) on the imputation method. In the decomposition γ^j = β + δ^j, it is only the observations from the complete case that sort out the part that should be β. Since the remaining cases provide absolutely no information, the estimates are the same. "No information added, no change".

The standard practice of regressing y only on the completed design matrix W, omitting the variables in Z, corresponds to using a restricted version of the grand model (6) where all elements of
the vector δ are set equal to zero. This is the same as assuming that the missing data mechanism satisfies Assumption 1 and that the imputation procedure is such that β = γ^j for each j = 1, ..., J. The less frequent practice of regressing y on W and the set of group-membership indicators (which we shall refer to as the simple missing-indicator method) corresponds to using another restricted version of the grand model, where all interactions between the group-membership indicators and the observed or derived covariates are set equal to zero.

Both sets of restrictions are testable. Testing the first set of restrictions corresponds to testing the hypothesis that all regression coefficients do not change across the J groups containing missing covariates, while testing the second set of restrictions corresponds to testing the hypothesis that, except for the intercepts, all other regression coefficients do not change across the J groups containing missing covariates. The precise nature of these tests, in particular the form of the test statistics, depends on the properties of the error vector u in model (6). Given OLS estimates β̂ and δ̂ of β and δ in the grand model, classical F-tests would be appropriate if it can safely be assumed that u is a vector of homoskedastic and serially uncorrelated regression errors. If this assumption cannot be justified, then one could use a "robustified" version of these tests based on an estimator of the sampling variance of β̂ and δ̂ that is consistent under heteroskedasticity or autocorrelation of unknown form in the elements of u. In our view, however, the key issue is not what statistic to use for testing, but whether it makes sense to ask questions such as: Is it true that δ = 0? Following Leamer (1978) and Magnus and Durbin (1999), we think that asking such questions in this context is wrong. The right question is: What is the best available estimator of β?

6. Bias versus precision

Theorem 2 says that unbiased estimates of β may be obtained in two equivalent ways: either by using the N_0 observations in the subsample with complete data, or by using all N observations and the grand model (6), which includes the imputed values of the missing covariates in the matrix W and the auxiliary variables in the matrix Z. We also know from standard results that placing restrictions on the elements of δ may lead to biased but more precise estimates of β. Two approaches may be followed to handle this trade-off between bias and precision in the estimation of β: model reduction and model averaging. Either approach can be applied to model (6).

Model reduction involves first selecting an intermediate model between the grand model and the fully restricted model where δ = 0, and then estimating the parameter of interest β conditional on the selected model. Model reduction may be carried out through variable selection methods, such as stepwise regression (see, e.g., Kennedy and Bancroft, 1971), or more complex general-to-specific procedures (see Campos et al., 2005, for a survey). The details of the model reduction procedure may also depend on whether one allows dropping arbitrary subsets of auxiliary variables in Z, or only subsets of auxiliary variables corresponding to specific subsamples with missing covariates. Dropping one of the columns of Z amounts to selecting a group j and, in the corresponding equation (5), restricting one element of δ^j to zero. This in turn corresponds to forcing the coefficient of that particular covariate in the completed design matrix W^j to be the same as in the subsample with complete data. Dropping the columns of Z corresponding to the jth subsample amounts instead to restricting all elements of δ^j to be zero, which in turn corresponds to forcing the relationship between y^j and the completed design matrix W^j to be the same as that between y^0 and X^0 in the subsample with complete data.
366
V. Dardanoni et al. / Journal of Econometrics 162 (2011) 362–368
One well-known problem with this approach is pre-testing. A second problem is that model selection and estimation are completely separated. As a result, the reported conditional estimates tend to be interpreted as if they were unconditional. A third problem is that, since there are J subsamples with incomplete data and K covariates (including the constant term), the model space may contain up to 2JK models. Thus, the model space is huge, unless both J and K are small. Model averaging is different. Instead of selecting a model out of the available set of models, one first estimates the parameter of interest β conditional on each model in the model space, and then computes the estimator of β as a weighted average of these conditional estimators. When the model space contains I models, a model averaging estimator of β is of the form
β¯ =
I −
βi , λi
(7)
i=1
where the λi are non-negative weights that add up to one, and βi is the estimator of β obtained by conditioning on the ith model. In Bayesian model averaging (BMA), each βi is weighted by the posterior probability of the corresponding model. If equal prior probabilities are assigned to each model under consideration, then the λi are just proportional to the marginal likelihood of each model. Bayesian averaging of both classical (least-squares) and Bayesian estimators have been considered, with the posterior mean of β for the model under consideration as the typical Bayesian estimator. Bayesian averaging of Bayesian estimators has been popularized by Raftery et al. (1997), while Bayesian averaging of classical estimators has been popularized by Sala-iMartin et al. (2004). The choice between the different approaches involves considering the computational burden and the statistical properties of the resulting estimators and, in the case of BMA, the nature of the assumed priors. The role of priors would also arise if a Bayesian model reduction approach is taken. Magnus et al. (forthcoming) study the properties of model averaging estimators of the same form as (7) with λi = λi ( u), where u is the vector of OLS residuals from the regression of y on W only. Their class of weighted-average least-squares (WALS) estimators generalizes to the case when I ≥ 2 the class of estimators introduced by Magnus and Durbin (1999), which contains the classical pre-test estimator as a special case. Although WALS estimators are in fact BMA estimators, they differ from standard BMA in three important respects: their computational burden, the choice of prior for δ, and their statistical properties. The main advantage of WALS is that, although we may have up to I = 2JK models, the computational burden is only proportional to JK . With medium or large values of J or K , the computation burden is minimal compared to standard BMA. Like standard BMA, WALS assume a classical Gaussian linear model for (6) and noninformative priors for β and the error variance σ 2 . The assumption that the regression errors are homoskedastic and serially uncorrelated is not crucial for WALS, and the method can be generalized to non-spherical errors (Magnus et al., forthcoming). The key step in WALS is to reparameterize the model replacing Z δ by Z ∗ δ∗ , with Z ∗ = ZP Λ−1/2 and δ∗ = Λ1/2 P ′ δ, where P is an orthonormal matrix and Λ is a diagonal matrix such that P ′ Z ′ RW ZP = Λ. The main difference with respect to standard BMA is that, instead of a multivariate Gaussian prior for δ, WALS use a Laplace distribution with zero mean for the independently and identically distributed elements of the transformed parameter vector η = δ∗ /σ , whose ith element, ηi is the population t-ratio on δi , the ith element of δ. In this formulation, ignorance is a situation where it is equally likely for these population t-ratios to be larger or smaller than one in absolute value. Finally, unlike standard BMA, WALS have bounded risk and are near-optimal in terms of a well-defined regret criterion (Magnus et al., 2010).
7. An application In this section, we apply our approach in the context of a concrete example with missing data. The problem at hand is that of estimating the relation between body-mass and income using survey data affected by item non-response. We first present the estimates one obtains by the complete-case approach, by using raw data (no dummies), and by the simple indicator method. We then compare them with the estimates one obtains using different model-selection or model-averaging techniques on the basis of the grand model (6). The BMI, namely the ratio of weight (in kg) to squared height (in meters), is one way of combining weight and height into a single measure. Due to its ease of measurement and calculation, the BMI is the most common diagnostic tool to identify obesity problems within a population. As such, it has received lots of attention in the recent literature on the obesity epidemic and its economic and public health consequences (Cutler et al., 2003; Philipson and Posner, 2008). The obesity epidemic is essentially an imbalance between food intake and energy expenditure. It has been argued that this imbalance may be linked to income (see e.g. Drewnowski and Specter, 2004). The available empirical evidence – Cawley et al. (2008) for elderly people in the USA and Sanz-de-Galdeano (2005), and García Villar and Quintana-Domeque (2009) for Europe – is inconclusive for men, whereas for women there appears to be a more clear indication of a negative correlation between BMI and income. Our data are from Release 2 of the first wave of the Survey of Health, Ageing, and Retirement in Europe (SHARE), a multidisciplinary and cross-national household panel survey designed to investigate several aspects of the elderly population in Europe. The target population of SHARE consists of people aged above 50 living in residential households, plus their co-resident partners irrespective of age. The first wave, conducted in 2004, covered 15,544 households and 22,431 individuals in 11 European countries. All national samples are selected through probability sampling. The key to ensure comparability is the adoption of a common survey instrument. The physical health module of the questionnaire collects self-reported height and weight, the income module collects information on 25 income components, which are then aggregated into a measure of household income, and the consumption module collects household expenditure on four consumption categories (food at home, food outside the home, telephone, and all goods and services) in the last month. Non-response to household income and food expenditure is substantial, and in this case we use the imputations provided by SHARE. Complete or partial non-response to household income occurs for as much as 60 percent of the observations, such a high fraction being due to the fact that this variable is obtained by aggregating a large number of income components across household members. Non-response to food expenditure occurs for about 15 percent of the observations. To impute missing values, SHARE uses a complex two-stage multivariate procedure (Kalwij and van Soest, 2005). Imputations are first obtained recursively for a few core variables. In the second stage, the imputed values from the first stage are used to impute the other variables. This procedure essentially employs only univariate regression imputation methods. It is important to note that height and weight are never used to impute missing variables. 
To allow multiple imputation methods, SHARE provides five imputations for each missing value. SHARE imputes total household income by separately imputing each income component and then aggregating them. Imputations are provided for individual incomes of all eligible partners who did not agree to participate to the survey. We focus on the income-BMI relationship for males. We model the mean of log BMI as a function of age and age squared, log
V. Dardanoni et al. / Journal of Econometrics 162 (2011) 362–368
367
Table 1 Estimated coefficients.
age agesq lowed lypc lfpc N
Complete case
Fully restricted
Simple indicator
Stata’s stepwise
WALS
BMA
0.0008 (0.0008) −0.0082*** (0.0020) 0.0144*** (0.0047) −0.0196*** (0.0030) 0.0054 (0.0045) 4067
0.0023*** (0.0004) −0.0107*** (0.0012) 0.0201*** (0.0027) −0.0097*** (0.0016) 0.0042 (0.0026) 11,475
0.0023*** (0.0004) −0.0107*** (0.0012) 0.0201 *** (0.0027) −0.0098*** (0.0016) 0.0044* (0.0026) 11,475
0.0020*** (0.0005) −0.0108*** (0.0012) 0.0197*** (0.0027) −0.0174*** (0.0026) 0.0049* (0.0026) 11,475
0.0012* (0.0007) −0.0087*** (0.0018) 0.0160*** (0.0041) −0.0165*** (0.0027) 0.0046 (0.0039) 11,475
0.0023*** (0.0004) −0.0107*** (0.0012) 0.0201*** (0.0027) −0.0101*** (0.0020) 0.0042 (0.0026) 11,475
Note: Observed p-values: * p < 0.10; ** p < 0.05; *** p < 0.01.
household income per capita, log household food expenditure per capita, and a dummy indicator for low educational attainment. In addition to the subsample with complete data (4067 obs., 35.5%), we have three subsamples with incomplete data: one where only food expenditure is missing (287 obs., 2.5%), one where only household income is missing (5891 obs., 51.3%), and one where both household income and food expenditure are missing (1230 obs., 10.7%). For each variable, we use the first of the five available imputations. Table 1 shows the estimated coefficients for age and its square (agesq), log household income per capita (lypc), log food expenditure per capita (lfpc), and a dummy for not having a highschool degree (lowed). The first three columns contain estimates for the complete-case/grand model, the fully restricted estimator corresponding to δ = 0, and the simple missing-indicator method. The other three are obtained by model-selection or averaging on the basis of the grand model: Stata’s stepwise procedure with pvalue equal to.05, WALS and BMA. Estimates of the coefficients for the 18 auxiliary regressors are not presented but are available upon request. For simplicity, all estimates are based on the assumption of spherically distributed errors in the grand model (6). The BMA and WALS estimates and their standard errors have been computed using the Matlab code downloaded from Jan Magnus’s web page at http://center.uvt.nl/staff/magnus/wals/. This BMA implementation estimates all possible models, so it becomes very time consuming when J or K are large. In our case, with J = 3 subsamples with incomplete data and K = 6 focus regressors (including the constant term), examining all possible 218 models required about one day on our desktop computer. Faster implementations are available, but they estimate only a randomly chosen subset of all possible models and have the important disadvantage of not using the distinction between focus and auxiliary regressors, which is key to our analysis. As for WALS, it is worth discussing briefly the concept and treatment of uncertainty implicit in the choice of a Laplace prior for the elements of the transformed parameter vector η. Assuming this particular prior means that we think that it is equally likely that the observed value of the t-statistic on any element of δ is greater or smaller than one. This is equivalent to say that we are agnostic about the quality of the imputation: it could be either good or bad. Since we are simply users, not producers, of the imputations, this may not be a bad assumption. There is agreement between the different methods on the qualitative effect of the various variables: concave for age, negative for education and income, and positive for food expenditure. The magnitude of the estimated coefficients, however, differs considerably across methods. At one extreme are the fully restricted estimator, the simple missing-indicator method and BMA that produce nearly identical results: they assign more importance to age and less importance to income. At the other extreme are the complete-case estimator and WALS: they assign less importance to age and more importance to income. It is
noteworthy that, in this example, WALS is close to complete-case (all dummies in the model), while BMA is close to fully restricted (no dummy). Thus, starting with the grand model, WALS seems to give more weight to the auxiliary dummies than BMA. The stepwise procedure gives estimates of the relative effects of age and income that are somewhat in between these two extremes. 8. Concluding remarks In this paper, we formalized the trade-off between bias and efficiency that arises when there are missing covariate values in a regression relationship of interest and showed how to tackle this trade-off by model reduction procedures or model averaging methods. In future work, we plan to extend our approach to generalized linear models (GLM), such as logit, probit and Poisson regression, for which we conjecture that versions of Theorems 1 and 2 also hold. Our conjecture is motivated by the fact that maximum-likelihood estimators of exponential family models may be obtained by iteratively reweighted least squares. Acknowledgement We thank Jan Magnus for comments and insightful discussions, and an associate editor and a referee for helpful suggestions. References Campos, J., Ericsson, N.R., Hendry, D.F., 2005. General-to-specific modeling: an overview and selected bibliography. FRB International Finance Discussion Paper No. 838. Available at SSRN: http://ssrn.com/abstract=791684. Cawley, J., Moran, J.R., Simon, K.I., 2008. The impact of income on the weight of elderly Americans. NBER Working Paper No. 14104. Cutler, D.A., Glaeser, E.L., Shapiro, J.M., 2003. Why have Americans become more obese? Journal of Economic Perspectives 17, 93–118. Danilov, D., Magnus, J.R., 2004. On the harm that ignoring pretesting can cause. Journal of Econometrics 122, 27–46. Drewnowski, A., Specter, S., 2004. Poverty and obesity: the role of energy density and costs. American Journal of Clinical Nutrition 79, 6–16. García Villar, J., Quintana-Domeque, C., 2009. Income and body mass index in Europe. Economics & Human Biology 7, 73–83. Horton, N.J., Kleinman, K.P., 2007. Much ado about nothing: a comparison of missing data methods and software to fit incomplete data regression models. The American Statistician 61, 79–90. Jones, M.P., 1996. Indicator and stratification methods for missing explanatory variables in multiple linear regression. Journal of the American Statistical Association 91, 222–230. Kalwij, A., van Soest, A., 2005. Item non-response and alternative imputation procedures. In: Börsch-Supan, Axel, Jürges, Hendrik (Eds.), The Survey of Health, Aging, and Retirement in Europe-Methodology. MEA, Mannheim, pp. 128–150. Kennedy Jr., W.J., Bancroft, T.A., 1971. Model-building for prediction in regression based on repeated significance tests. Annals of Mathematical Statistics 42, 1273–1284. Leamer, E.E., 1978. Specification Searches. Ad Hoc Inference with Nonexperimental Data. Wiley. Little, R.J.A., 1992. Regression with missing X’s: a review. Journal of the American Statistical Association 87, 1227–1237. Little, R.J.A., Rubin, D., 2002. Statistical Analysis with Missing Data, 2nd ed. Wiley. Lundström, S., Särndal, C.-E., 2002. Estimation in the Presence of Nonresponse and Frame Imperfections. Statistics Sweden. Magnus, J.R., Durbin, J., 1999. Estimation of regression coefficients of interest when other regression coefficients are of no interest. Econometrica 67, 639–643.
368
V. Dardanoni et al. / Journal of Econometrics 162 (2011) 362–368
Magnus, J.R., Powell, O., Prüfer, P., 2010. A comparison of two averaging techniques with an application to growth empirics. Journal of Econometrics 154, 139–153. Magnus, J.R., Wan, A.T.K., Zhang, X., 2011. WALS estimation with nonspherical disturbances and an application to the Hong Kong housing market. Computational Statistics & Data Analysis (forthcoming). Philipson, T., Posner, R., 2008. Is the obesity epidemic a public health problem? A decade of research on the economics of obesity. NBER Working Paper No. 14010. Raftery, A.E., Madigan, D., Hoeting, J.A., 1997. Bayesian model averaging for linear regression models. Journal of the American Statistical Association 92, 179–191.
Rao, C.R., Mitra, S.K., 1971. Generalized Inverse of Matrices and its Applications. Wiley. Rubin, D., 1987. Multiple Imputations. Wiley. Sala-i-Martin, X., Doppelhofer, G., Miller, R.I., 2004. Determinants of long-term growth: a Bayesian averaging of classical estimates (BACE) aproach. American Economic Review 94, 813–835. Sanz-de-Galdeano, A., 2005. The obesity epidemic in Europe. IZA Discussion Paper No. 1814. Wooldridge, J.M., 2002. Econometric Analysis of Cross Section and Panel Data. MIT Press.
Journal of Econometrics 162 (2011) 369–382
Contents lists available at ScienceDirect
Journal of Econometrics journal homepage: www.elsevier.com/locate/jeconom
Bayesian estimation of an extended local scale stochastic volatility model Philippe J. Deschamps ∗ Séminaire d’économétrie, Université de Fribourg, Boulevard de Pérolles 90, CH-1700 Fribourg, Switzerland
article
info
Article history: Received 1 May 2008 Received in revised form 10 September 2010 Accepted 25 February 2011 Available online 21 March 2011
abstract A new version of the local scale model of Shephard (1994) is presented. Its features are identically distributed evolution equation disturbances, the incorporation of in-the-mean effects, and the incorporation of variance regressors. A Bayesian posterior simulator and a new simulation smoother are presented. The model is applied to publicly available daily exchange rate and asset return series, and is compared with t-GARCH and Lognormal stochastic volatility formulations using Bayes factors. © 2011 Elsevier B.V. All rights reserved.
JEL classification: C11 C13 C15 C22 Keywords: State space models Markov chain Monte Carlo Simulation smoothing Generalized error distribution Generalized t distribution
1. Introduction Since the seminal paper by Engle (1982), models that attempt to explain the conditional heteroskedasticity of asset returns have become essential tools for financial analysts. These models can be subdivided into two broad classes. The first one is the class of ARCH, GARCH, and EGARCH models with their many variants; see, e.g., Engle (1982), Bollerslev (1986), Engle et al. (1987), Nelson (1991), and the excellent survey in Bollerslev et al. (1994). In this first class, the variance equation is deterministic, and much effort has been devoted to its specification. The second one is the class of stochastic volatility (SV) models, where the variance equation is stochastic and can be considered as the evolution equation in a state space model. The evolution equation in SV models usually has a rather simple autoregressive form. Nevertheless, it may be argued that SV models are potentially more flexible than models in the GARCH class, since they involve two random shocks rather than one. Furthermore, since an inference on the conditional variances in SV models can be based on the entire sample rather than on past observations only, a sophisticated modeling of the variance equation is perhaps less necessary in SV models than in GARCH
∗
Tel.: +41 26 300 8252; fax: +41 26 300 9781. E-mail address: [email protected].
0304-4076/$ – see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.jeconom.2011.02.022
models: the evolution equation in SV models can be viewed as a hierarchical Bayesian prior on the volatilities, which is updated by the information present in the entire sample. As a consequence, the smoothed estimates of the latent variables can exhibit systematic nonlinearities that were not implied by the prior specification. The paper by Deschamps (2003) illustrates this fact in a multivariate state space model of consumer demand. Stochastic volatility models are non-linear or non-Gaussian state space models, and this makes their estimation difficult. It is perhaps for this reason that their use has been less frequent than that of models in the GARCH class. Their treatment seems to follow three main approaches. The first approach is based on a linear evolution equation implying Lognormal volatilities. This formulation appears to be the most common one: see, e.g., Jacquier et al. (1994, 2004), Kim et al. (1998), Chib et al. (2002), and Omori et al. (2007). The second one is based on arbitrary differentiable, but Gaussian, observation and/or evolution equations; an example of this approach is in Stroud et al. (2003). The third approach to SV models assumes a multiplicative evolution equation with Beta innovations. Such an equation appears to have been first used by Smith and Miller (1986) with an Exponential observation density, for the time series analysis of extreme values; it was subsequently used by Shephard (1994) in the context of SV models with observation equation disturbances having the Generalized Error Distribution, which includes the Normal as a special case. The model of Shephard (1994) is known as the local scale model.
370
P.J. Deschamps / Journal of Econometrics 162 (2011) 369–382
Uhlig (1997) presents a multivariate generalization of the local scale model but his analysis is limited to multivariate Normal observation equation errors. The same limitation is present in the closely related paper by Philipov and Glickman (2006). The local scale model has two potential advantages over the first two approaches. First, the one step ahead prediction densities are known analytically. In contrast, prediction in the SV models that use the first two approaches must be done by particle filtering, conditional on parameter values; it is impractical to do so for all the replications in a large posterior sample of these parameters. The second advantage involves posterior simulation. The hyperparameter likelihood (marginalized over the volatilities) can be computed exactly from the prediction error decomposition, so that a posterior sampler can be defined on the hyperparameters only; this reduces the dimension of the problem. The present paper follows the tradition of the local scale model. It presents a new version of this model, and a new method of estimation. Its main contributions are the following. First, contrary to the original local scale formulation, the disturbances in the evolution equation are identically distributed, with a distribution that depends on two fundamental parameters in a natural way. This new formulation will be called the steady-state version of the local scale model, and has some advantages over the original version. Second, ARCH-M effects are introduced and found to be very significant in an application to asset return data. This may be due, at least in part, to the fact that these effects are formulated in terms of the scale parameter of the one step ahead prediction density, which is available in closed form for the local scale model. Indeed, French et al. (1987) show that it is important in this context to distinguish between anticipated and unanticipated components of volatility. Third, a new simulation smoother is presented. This algorithm draws from the exact conditional posterior of the volatilities, contrary to the quasi-smoother presented in Shephard (1994) and previously suggested by Harvey (1989, p. 359), who acknowledged that the quasi-smoother has no firm theoretical foundation. Our simulation smoother is particularly easy to implement, since it only involves generating the inverse volatilities from a linear stochastic difference equation with Gamma innovations. The advantages of using a Bayesian approach in latent variable models have been stated elsewhere in the literature; see, e.g., Harvey et al. (2007). An outline of the paper follows. Section 2 presents and motivates our version of the local scale model. Section 3 presents the simulation smoother; its validity is proved in Appendix A. Section 4 presents a posterior simulator of the hyperparameters, with further details given in Appendix B; a posterior sample of the volatilities can be obtained by applying the simulation smoother to each hyperparameter replication. Section 5 applies the local scale model to daily asset return and exchange rate series. Section 6 compares the local scale model with t-GARCH and other stochastic volatility formulations, using Bayes factors. Section 7 concludes. 2. A steady-state version of the local scale model
with
ψ(r ) =
In what follows, we will denote a variable X having a standardized Generalized Error Distribution with parameter r ≥ 1 as X ∼ GED(r ). This distribution is discussed in Box and Tiao (1992, pp. 156–160), who call it the Exponential Power Distribution, and was used by Nelson (1991) in an important article. We write the density of X as: fGED (x; r ) =
r Γ (3/r )1/2 3/2
2 Γ (1/r )
exp −|x|r ψ(r ) ,
Γ (3/r ) Γ (1/r )
]r /2
.
Note that since ψ(2) = 1/2, fGED (x; 2) is the standard Normal density. The GED density has heavier tails than the Normal if r < 2 and thinner tails if r > 2. A Beta variable Y with parameters α and β is noted as Y ∼ Be(α, β), and its density as fB (y; α, β). Finally, a Gamma variable Z with density fG (z ; a, b) ∝ z a−1 exp(−bz ) is noted as Z ∼ Ga(a, b). 2.2. The model A general version of the local scale model is, for t = 1, . . . , T : yt = µt + (λθt )−1/r ut
(2.1)
ln θt = ln θt −1 + [ln ηt − E (ln ηt )] +
k −
βj Djt
(2.2)
j =1
θ0 ∼ Ga(a0 , b0 )
(2.3)
where the Djt are observable variance regressors, the βj are unknown coefficients, ut , ηt are independently distributed with:
ηt ∼ Be(ωat −1 , (1 − ω)at −1 ) ut ∼ GED(r ) and where 0 < ω < 1. In the Beta distribution of ηt , at −1 is a parameter of the filter density of θt −1 , which will be stated in Section 2.3. Smith and Miller (1986) show that the marginal distribution of θt in (2.2) is Gamma, and Shephard (1994) shows that this Gamma distribution is conjugate with the GED distribution implied by (2.1). There are two important differences between our version of the local scale model and the original one in Shephard (1994). First, here, at −1 will be a constant that depends on ω and r, so that the evolution equation disturbances ηt are identically distributed. Secondly, the empirical applications presented in Shephard (1994) were limited to λ = 1, µt = 0, and βj = 0 in (2.1) and (2.2), whereas we do not impose these restrictions. In Eq. (2.1), the presence of λ ensures that the θt do not depend on units of measurement, and the choice of the conditional expectation µt will be motivated by the results of LeBaron (1992), who finds that the first-order autocorrelation in an asset return series is a decreasing function of anticipated volatility; see also Bollerslev et al. (1994). We define the anticipated square root volatility as: st = (λE [θt | y1 , . . . , yt −1 ])−1/2
(2.4)
and model yt by an autoregression with coefficients that depend on st , so that:
µt = α00 + α10 F (st ) +
p −
α0i + α1i F (st ) yt −i ,
(2.5)
i =1
with: F (st ) =
2.1. Notation
[
1 1 + exp(−st )
1
− . 2
(2.6)
It will be seen in Section 2.3 that st turns out to be the scale parameter of the one step ahead predictive distribution of yt . Contrary to the observation equation (GED) standard error, which is (λθt )−1/r , st does not depend on the past volatilities, which are unobservable; it is therefore natural to use st in formulating in-themean effects. The logistic function F (st ) lies between 0 and 1/2 and is almost linear for st close to 0; its boundedness is important since st can take very large values. Eq. (2.2) states that the latent variables θt follow a logarithmic random walk, with covariates Djt . The random walk implies
P.J. Deschamps / Journal of Econometrics 162 (2011) 369–382
important restrictions on these covariates. The term j=1 βj Djt can be interpreted as a (non constant) drift, which should be bounded and should average out to zero for the model to remain stable. Unfortunately, this excludes an asymmetric news impact function of the EGARCH type. Suitable choices for the variance regressors Djt , which can depend on past observations on yt , will be discussed in Section 5.1. It should also be noted that the random walk (2.2) could not be replaced by a more general autoregression without losing the conjugacy of (2.1) and (2.2).
∑k
2.3. Filtering and prediction We now state the filtering equations associated with (2.1) and (2.2). These equations will enable us to obtain the one step ahead prediction densities, from which the hyperparameter likelihood follows by the prediction error decomposition. Upon defining, for convenience:
φt = exp
k −
βj Djt − E (ln ηt ) ,
(2.7)
2.4. Filter initialization We now address the choice of a0 and b0 in (2.3). Following Harvey (1989), Shephard (1994) proposes to initialize the filter (2.10)–(2.11) with the values a0 = b0 = 0, corresponding to a diffuse prior on θ0 . However, there are several reasons why a0 > 0 and b0 > 0 might be preferred. First, if a0 = b0 = 0, the marginal density of y1 is not defined. Second, we may view (2.2) and (2.3) as defining a joint prior ∏ on (θ0 , . . . , θT ); the marginal prior of θt is then Ga(ωt a0 , b0 [ ti=1 φi ]−1 ), which becomes improper as a0 and b0 tend to 0. This would exclude any prior simulation of the volatilities. Third, it will be seen in Section 3 that the full conditional posterior of θ0 becomes improper as b0 → 0. This would exclude any posterior simulation involving θ0 . Last, if a0 = 0, at will not reach a steady-state value until some time after the start of the sample; since this start usually depends on data availability rather than on more fundamental considerations, this is difficult to justify. We will adopt a flexible hierarchical prior on θ0 based on regularization concepts. Upon solving the recurrence equations (2.10) and (2.11), we obtain for t = 1, . . . , T :
j =1
at = ω t a0 +
it can be shown that: f (θt | y1 , . . . , yt −1 ) = fG (θt ; at |t −1 , bt |t −1 )
(2.8)
f (θt | y1 , . . . , yt ) = fG (θt ; at , bt )
(2.9)
371
bt =
with:
b0 t ∏
t 1−
r i=1
+
φi
t −
ωi−1 = ωt a0 +
1 − ωt r (1 − ω)
λψ(r )γj |yj − µj |r
(2.15)
(2.16)
j =1
i=1
at |t −1 = ωat −1 at = at | t − 1 +
1 r
and and
bt |t −1 =
b t −1
(2.10)
φt
bt = bt |t −1 + λψ(r )|yt − µt |r .
a0 =
f (yt | y1 , . . . , yt −1 ) ∞
∫
∏t
(2.11)
Finding the one step ahead prediction density is straightforward. The simplest case occurs when the observation equation disturbances in (2.1) are normally distributed, which occurs when r = 2. In this case, we have:
f (yt | µt , θt , λ)f (θt | y1 , . . . , yt −1 )dθt
=
with γj = i=j+1 φi−1 for j < t and γt = 1. Eq. (2.15) reveals that an obvious choice for a0 is the steadystate value:
(2.12)
0
which is a scale mixture of Normals with a Gamma mixing density; from (2.1) and (2.8), this is a Student-t with expectation µt , 2at |t −1 degrees of freedom, and scale parameter: s2t = bt |t −1 /(λat |t −1 ) = (λE [θt | y1 , . . . , yt −1 ])−1 ;
(2.13)
see, e.g., Bernardo and Smith (2000, p. 123). As noted by Shephard (1994), when r is arbitrary, the predictive becomes: f (yt | y1 , . . . , yt −1 ) = fGST (yt ; µt , s2t , at |t −1 , r ) r Γ at |t −1 + 1/r
=
2 Γ at |t −1 Γ (1/r )
[ × 1+
at |t −1 s2t
]1/r
since this implies at = a0 > 0 for all t. An appropriate choice for b0 is less obvious. The solution proposed in this paper assumes b−1 = 0 (so that the filter density of θ−1 is improper), leading to b0|−1 = 0 from (2.10). Using (2.11), we then set b0 equal to the expected value of λψ(r )|y0 −µ0 |r , when the pre-sample value y0 is generated by a steady-state version of the model, obtained from the density (2.14) with s20 = 1/λ and a0|−1 = ωa0 . Similar treatments of initial conditions have appeared in the literature; see, e.g., Schotman and van Dijk (1991). The choice of s20 = 1/λ is an additional structural assumption that replaces (2.13) when t = 0 and is motivated by considerations of parsimony: it turns out to be the simplest one yielding a solution that depends only on ω and r. It can be verified using standard analytic integration software that:
∫
+∞ −∞
b0 =
at |t −1 s2t
ψ(r )|yt − µt |r
(2.17)
1 ω b0 |yt − µt |r fGST yt ; µt , , , r dyt = , λ r (1 − ω) λψ(r )
with:
ψ(r )
[
1 r (1 − w)
]−at |t −1 −1/r (2.14)
where s2t is again given by (2.13). The density (2.14) is known in the literature as a Generalized Student-t (see McDonald and Newey, 1988) and was used by Bollerslev et al. (1994) as error distribution in an EGARCH-type model of asset returns. It is easy to see that (2.14) specializes to the usual Student distribution with 2at |t −1 degrees of freedom when r = 2.
ω . r ω + (ω − 1) r2
(2.18)
The denominator of (2.18) will be positive if ω > r /(1 + r ). When the observation equation disturbances are Normal (r = 2), this inequality merely constrains the degrees of freedom in the Student-t predictive to be greater than two, a constraint that ensures the existence of the first two moments. Since E (θ0 ) = a0 /b0 , (2.17) and (2.18) allow the full range of expectations 0 < E (θ0 ) < +∞ and do not, therefore, seriously restrict the flexibility of the model. Choosing a0 as in (2.17) and b0 as in (2.18) leads to the steadystate version of the local scale model, which is written below for the sake of easy reference:
372
P.J. Deschamps / Journal of Econometrics 162 (2011) 369–382
L(ω)
T=100
Steady state
Diffuse
0.7
0.75
0.8
0.85
0.9
ω
0.95
L(ω)
T=500
Steady state
Diffuse
0.7
0.75
0.8
0.85
0.9
ω
0.95 L(ω)
T=2000
Steady state
Diffuse
0.7
0.75
0.8
0.85
0.9
0.95
ω
Fig. 1. Loglikelihood comparisons.
yt = µt + (λθt )−1/r ut ,
ut ∼ GED(r ) i.i.d
θt = φt θt −1 ηt k − φt = exp βj Djt + Ψ j =1
ηt ∼ Be
ω 1 , r (1 − ω) r
(2.19) (2.20)
1 r (1 − ω)
−Ψ
ω r (1 − ω)
(2.21)
i.i.d
(2.22)
ω (2.23) r (1 − ω) r ω + r 2 (ω − 1) where Ψ (z ) = d ln Γ (z )/dz is the digamma function. Eqs. (2.19)– θ0 ∼ Ga
1
,
(2.23) can be viewed as a restatement of (2.1)–(2.3) where the choice of a0 and b0 is made endogenous; Eq. (2.21) follows from (2.7) and from a well-known result on the expected logarithm of a Beta variate. 2.5. Comparing the diffuse and steady-state likelihoods This subsection will illustrate some differences between the steady-state version of the local scale model and the Shephard
(1994) version, using simulated data. Fig. 1 presents, for the three sample sizes of T = 100, 500, and 2000, the loglikelihoods of ω obtained with the diffuse filter initialization (dashed lines) and with the steady-state initialization (solid lines). Since the density of y1 is not defined with the diffuse filter, both loglikelihoods are conditional on the first observation in order to ensure comparability: they are obtained by summing the logarithms of the densities (2.14) for t = 2, . . . , T . The data were simulated from the steady-state version of the model, assuming µt = 0, βj = 0, λ = 1, r = 2, and ω = 0.93. These values of r and ω are close to maximum likelihood estimates obtained by Shephard (1994). In computing the loglikelihoods, all parameters except ω were set equal to their assumed values. It is apparent in the top panel of Fig. 1, and to a lesser extent in the other panels, that the steady-state likelihood penalizes those values of ω that are close to 2/3 or close to unity. In our model, when ω tends to r /(1 + r ) = 2/3, E (θ0 ) tends to zero, and when ω tends to one, E (θ0 ) tends to +∞. Since neither assumption is supported by the data, the steady-state likelihood is lower than the diffuse one. When T becomes large, the two likelihoods of course become increasingly similar, but the differences remain apparent in this example.
P.J. Deschamps / Journal of Econometrics 162 (2011) 369–382
373
Since the evolution density becomes singular as ω tends to one, it may be argued that the steady-state likelihood, which bounds ω away from unity, is preferable to the diffuse one.
and:
3. Simulation smoothing
p(α, β, λ, ω, r ) ∝ fN (α; mα , Σα )fN (β; mβ , Σβ )
β′ = β1
...
βk .
We will adopt the following prior: This section addresses the simulation of (θ0 , . . . , θT ) from their distribution conditional on the observables (y1 , . . . , yT ) and on all the other unobservables in the model, including ω and r. We denote by yt1 :t2 the vector (yt1 , . . . , yt2 ) and similarly for θt1 :t2 . The full conditional posterior of θ0:T can be written as: T −1
f (θ0:T | y1:T ) = f (θT | y1:T )
∏
f (θt | y1:T , θt +1:T ).
(3.1)
t =0
A simulation smoother can be based on the decomposition (3.1), using an argument similar to the ones given by Carter and Kohn (1994) and Chib (1996) in different contexts. Each of the last T terms in the product on the right-hand side of (3.1) can be written as: f (θt | y1:T , θt +1:T ) ∝ f (θt | y1:t )f (θt +1 | θt , y1:t )
(3.2)
since we have, from Bayes’ theorem and from the decomposition of a joint density: f (θt | y1:T , θt +1:T ) ∝ f (θt | y1:t )f (yt +1:T , θt +1:T | y1:t , θt )
∝ f (θt | y1:t )f (θt +1 | θt , y1:t ) × f (yt +1:T , θt +2:T | y1:t , θt +1 , θt )
(3.3)
and the last term in (3.3) does not depend on θt . The first term on the right-hand side of (3.2) is of course the filter density of θt , given by (2.9); the second term can be viewed as the likelihood of θt implied by Eq. (2.20). It is shown in Appendix A that applying Bayes’ theorem to the right-hand side of (3.2) yields the following translated Gamma posterior: f (θt | y1:t , θt +1 ) = fG
θt −
φt−+11 θt +1 ;
1 r
, bt .
In order to simulate θ0:T from its full conditional posterior distribution, we may then use the following forward-filtering backward-sampling algorithm based on (3.1) and (3.2): (1) Compute and store bt from the filter (2.10)–(2.11) for t = 0, . . . , T . (2) Sample:
θT ∼ Ga
1 r (1 − ω)
, bT .
(3.4)
(3) Generate θT −1 , . . . , θ0 from the stochastic difference equation:
θt =
φt−+11 θt +1
+ ϵt
(3.5)
where the ϵt are independent draws from Ga(1/r , bt ) distributions. The posterior expectations θ¯t of θt can be obtained by replacing, in (3.4) and (3.5), the random variables by their expected values. This yields the recursion:
θ¯T =
1
(3.6)
rbT (1 − ω)
θ¯t = φt−+11 θ¯t +1 +
1 rbt
for t = T − 1, . . . , 0.
(3.7)
× fG (λ; aλ , bλ )I(rmin ,rmax ) (r )I(ωmin ,1) (ω)
(4.1)
where fN denotes the Multinormal density and I denotes an indicator function. From the definition of a GED density, we must have rmin ≥ 1; and since typical asset return distributions exhibit heavy tails, it is sensible to choose rmax not much greater than 2. For (2.18) to remain well-defined, we must have:
ωmin >
rmax 1 + rmax
which is an innocuous constraint in view of the remark made after the statement of (2.18). Our posterior sampler for the extended local scale model can be summarized as follows. Define ξ = (α, β, λ, ω, r ). First sample ξ(0) in the support of its prior distribution. A Markov chain {ξ(i) }Ni=1 is generated as follows: (1) Sample r (i) conditional on the data, α(i−1) , β(i−1) , ω(i−1) , and λ(i−1) . (2) Sample ω(i) conditional on the data, α(i−1) , β(i−1) , r (i) , and λ(i−1) . (3) Sample α(i) conditional on the data, β(i−1) , λ(i−1) , r (i) , and ω(i) . (4) Sample β(i) conditional on the data, α(i) , λ(i−1) , r (i) , and ω(i) . (5) Sample λ(i) conditional on the data, α(i) , β(i) , r (i) , and ω(i) . Note that this algorithm does not depend on the simulation smoother of Section 3, but uses the likelihood implied by the prediction error decomposition. This use is partly due to the fact that the joint likelihood of ξ and θ0:T is only defined for those values of the volatilities which are consistent with (2.20): it would not be practical to maintain this consistency across random draws of the remaining parameters. However, the deterministic smoother (3.6)–(3.7) will be used in the construction of candidate proposal densities, and this turns out to be important for a good performance of the algorithm.1 Steps (1)–(5) are ‘‘Metropolis within Gibbs’’ steps having the following generic expression. Let ϑ be the subvector of ξ being simulated, and let ϕ be that subvector of ξ which does not contain ϑ. Let k(ϑ | ϕ) be the kernel of the conditional posterior of ϑ. Let ϑold be the previous draw of ϑ, and let q(ϑ | ϑold , ϕ) be a normalized proposal density. The proposal densities used in this paper are described in Appendix B. One draws a candidate ϑ from q(ϑ | ϑold , ϕ), and sets ϑ(i) = ϑ with probability:
[
min 1,
k(ϑ | ϕ) q(ϑold , | ϑ, ϕ)
]
k(ϑold | ϕ) q(ϑ | ϑold , ϕ)
(i)
and ϑ = ϑold otherwise. In order to compute the posterior kernel k(ϑ | ϕ) above, one multiplies by the prior density (4.1) the likelihood implied by (2.14):
L(ξ) ≡ L(ϑ; ϕ) =
T ∏ t =1
[ fGST
ω yt ; µt (ξ), (ξ), ,r r (1 − ω) s2t
] (4.2)
where the µt (ξ) and s2t (ξ) are obtained by the filter of Section 2.3. For convenience, we restate this filter by incorporating into (2.10)–(2.11) the conditions (2.5), (2.13), (2.17), (2.18) and (2.21).
4. Posterior simulation Let, for convenience:
α′ = α00
α10
α01
α11
...
α0p
α1p
1 An alternative way of generating candidates would be to use random draws of the volatilities, generated at the beginning of each pass. This did not result in an improved sampler.
374
P.J. Deschamps / Journal of Econometrics 162 (2011) 369–382
700
We assume that p initial observations y1−p:0 are available. For given α, β, λ, r, and ω, the sequences b0:T , φ1:T , µ1:T and s1:T can be generated by the following recursion:
µt = α00 + α10 F (st ) +
600 500
p − α0i + α1i F (st ) yt −i
(4.3)
400
i=1
bt =
b t −1
φt
(4.4) 200
k
ln φt +1 =
−
βi Di,t +1 + Ψ
bt
r (1 − ω)
λφt +1
ω
1 r (1 − ω)
i=1
s2t +1 =
300
+ λψ(r )|yt − µt |r −Ψ
ω r (1 − ω)
0 -0.50
ω
ln φ1 = Ψ s21 =
(4.7)
1 r (1 − ω)
−Ψ
1−ω
0.50
ω r (1 − ω)
5.000 4.000
(4.8)
3.000
.
(4.9)
2.000
Using (2.14), it is easy to show that the likelihood (4.2) is equal
1.000
λφ1 [ω + r (ω − 1)]
0.00 0.25 Histogram of D(1t)
6.000
r ω + r 2 (ω − 1)
-0.25
(4.6)
with the initial conditions: b0 =
100
(4.5)
to:
Γ
×
T [r r +1 ψ(r )(1 − ω)]1/r 2 Γ r (1ω−ω) Γ 1r ω1/r
1 r (1−ω)
T ∏
0
[
s2t (ξ)−1/r 1 +
r ψ(r )(1 − ω)
ωs2t (ξ)
t =1
|yt − µt (ξ)|r
-6
1 r (ω−1)
. (4.10)
5.1. The data In this section, we will estimate versions of the local scale model on one asset return and two exchange rate series. The first series is constructed from the Standard and Poor 500 (S&P500) daily asset price data, available on finance.yahoo.com, and ranges from January 6, 1970 to April 17, 2009 (9916 observations). The second and third series range from January 5, 1982 to June 19, 2009 (6905 observations) and are constructed from the Swiss Franc/US dollar and US dollar/Pound Sterling exchange rates, available on www.federalreserve.gov. All three series are defined as yt = 100 ln(Pt /Pt −1 ), where Pt is either the closing price index (in the case of the S&P500 data) or the exchange rate (in the two other cases). The three series are irregularly dated; whereas the missing data in the S&P500 series only cover Saturdays, Sundays, holidays, and the September 2001 event, the two exchange rate series include occasional missing observations that are not due to market closure. Two predetermined variables D1t and D2t will be considered for use in the variance equation (2.2). D1t is the logistic transform F (yt −1 ), where: 1 1 + exp(−x)
-4
-3
-2
-1
0
1
2
3
4
5
6
.6
]
5. Estimating the local scale model
F (x) =
-5
Bar chart of D(2t)
1
Logistic transform F[y(t-1)]
.4 .2 .0 -.2 -.4 -.6 -24
-20
-16
-12
-8
-4
0
4
8
12
Lagged dependent variable y(t-1) Fig. 2. Variance regressors (S&P500 data).
of consecutive non-trading days prior to date t. Its presence allows for effects due to market closure. In view of the remark made at the end of the previous paragraph, it will be included in some local scale asset return models but omitted from the exchange rate models. Fig. 2 presents the histogram and bar chart of the variance regressors for the asset return data, as well as a scatter plot of yt on its logistic transform. The distribution of both variance regressors is clearly centered on zero; as mentioned at the end of Section 2.2, this ensures model stability. The bottom panel of Fig. 2 shows that F (yt ) is approximately linear for values of yt close to zero, but does not exhibit the extreme outliers present in the asset return series.
− . 2
Since D1t < 0 corresponds to a negative past asset return, one would expect its coefficient β1 to be positive in the S&P500 case (bad news increases volatility), and this can be interpreted as a form of leverage. D2t is defined as the first difference of the number
5.2. Model comparison Formulating a local scale model involves deciding whether to include ARCH-M effects and variance regressors, and whether to choose a GED rather than a Normal observation density. In
P.J. Deschamps / Journal of Econometrics 162 (2011) 369–382
375
Table 1 Logarithmic Bayes factors and numerical standard errors (S&P500 data). Observation density
Variance regressors
Normal
D1t , D2t
Normal
D2t
GED
D2t
Normal
None
GED
None
ARCH-M effects p=0
p=1
p=2
−52.6
−53.9
−24.6
−30.4
0.006 −22.5 0.008 −122.7 0.006 −84.6 0.007 −135.2 0.006 −94.1 0.007
0.006 −21.6 0.008 −111.8 0.006 −74.0 0.007 −125.2 0.006 −84.2 0.007
ln(BF ) NSE ln(BF ) NSE ln(BF ) NSE ln(BF ) NSE ln(BF ) NSE ln(BF ) NSE
D1t , D2t
GED
No ARCH-M
µt = 0
0.006 0.0
0.006
−4.7 0.008
−81.8 0.006
−51.9 0.007
−90.6 0.006
−58.9 0.007
−86.6 0.006
−55.7 0.007
−95.5 0.006
−62.7 0.007
Table 2 Logarithmic Bayes factors and numerical standard errors (exchange rate data, µt = 0). Data
Observation density
Variance regressors
ln(BF )
NSE
US dollar/Swiss Franc US dollar/Swiss Franc
Normal Normal
None D1t
−45.7 −48.9
0.004 0.004
US dollar/Swiss Franc US dollar/Swiss Franc
GED GED
None D1t
−2.6
0.005
US dollar/Sterling US dollar/Sterling
Normal Normal
None D1t
−48.6 −52.8
0.003 0.004
US dollar/Sterling US dollar/Sterling
GED GED
None D1t
−4.5
addition, the autoregressive order p must be specified. It can be argued that an ideal Bayesian implementation would do model averaging, but this is often practically equivalent to model selection in large samples (see Geweke and Amisano, 2010, p. 229). This subsection will therefore be limited to model comparison using the marginal likelihood: p(y) =
∫
f (y | ξ)p(ξ)dξ
(5.1)
where y = y1:T , f (y | ξ) is the right-hand side of (4.2), and p(ξ) is the prior (4.1). A number of methods are available for estimating (5.1); see, e.g Gelfand and Dey (1994), Meng and Wong (1996) and Chib and Jeliazkov (2001). The author chose the bridge sampling method of Meng and Wong (1996) for its ease of implementation and numerical efficiency. In the present context, the bridge sampling identity reads as: p(ξ)f (y | ξ)α(ξ) q(ξ)dξ
p(y) =
q(ξ)α(ξ) p(ξ | y)dξ
(5.2)
where q(ξ) is a normalized importance sampling density and α(ξ) is a ‘‘bridge function’’ to be defined shortly. So, the numerator in (5.2) can be estimated by an average of n replications of p(ξ)f (y | ξ)α(ξ), where ξ is drawn from q(ξ), and the denominator by an average of m replications of q(ξ)α(ξ), where ξ is drawn from the posterior. The bridge function is obtained by an iterative procedure, as:
α(ξ) =
1 nq(ξ) + m
p(ξ)f (y|ξ) p(y)
.
(5.3)
Frühwirth-Schnatter (2004) provides many useful implementation details, and uses theoretical arguments to show that this method is an improvement over earlier importance sampling methods, such as that of Gelfand and Dey (1994). Bridge sampling has been used successfully in several other instances (see DiCiccio et al., 1997; Frühwirth-Schnatter, 2004; Deschamps, 2008; Ardia, 2008, 2009).
0.0
0.0 0.005
The importance sampling density was chosen as: q(ξ) = q1 (r )q2 (ω)q3 (α)q4 (β)q5 (λ) where q1 and q2 are the densities of linear functions of Beta variates with ranges equal to the prior supports of r and ω; q3 and q4 are Normal; and q5 is Lognormal. The first two moments of the qi were chosen to match the empirical posterior moments obtained by MCMC. The priors on the elements of α and β were independent N (0, 10); the prior on λ was Ga(10−6 , 10−6 ); and that for ω was Uniform U (0.8, 1). The prior on r was U (1, 2) for the S&P500 data and U (1, 2.5) for the exchange rate series (in each case, these choices ensured prior supports much larger than the range of the posterior replications). The largest estimated logarithmic (base e) marginal likelihoods were −12687.0 (NSE = 0.006) for the S&P500 data, −7315.3 (NSE = 0.003) for the USD-CHF exchange rate data, and −5989.5 (NSE = 0.003) for the USD-GBP data. Tables 1 and 2 present natural logarithms of the estimated Bayes factors against the preferred models, and the numerical standard errors of these logarithms. The Bayes factor estimates are indeed very precise, confirming that our choice of the method and importance sampling densities is appropriate. For the S&P500 data, Bayes factors favor an AR(1) model with ARCH-M effects, two variance regressors, and GED errors (β1 ̸= 0, β2 ̸= 0, p = 1, r ̸= 2). The Bayes factor evidence against all the other models in Table 1 is decisive (Jeffreys, 1961, Appendix B). For both exchange rate series, ARCH-M effects are irrelevant and serial correlation is negligible, so that µt = 0 was imposed, and it can be seen in Table 2 that models with GED disturbances but without the variance regressor D1t are preferred. The evidence against the inclusion of D1t is strong in the Swiss Franc case, and very strong in the Sterling case. The evidence in favor of GED disturbances is decisive in both cases. 5.3. Posterior analysis Tables 3–5 present posterior replication summaries for the specifications having the largest posterior probabilities. These
376
P.J. Deschamps / Journal of Econometrics 162 (2011) 369–382
.964
Table 3 Posterior replication summaries (S&P500 data).
.960
θ
θ0.025
θ0.5
θ0.975
RNE
ρ1
ρ5
Reject. rate
α00 α10 α01 α11 β1 β2
−0.038 −0.288
0.019 0.035 0.202 −0.646 0.232 −0.076 1.676 0.135 0.951
0.075 0.359 0.266 −0.365 0.280 −0.047 1.755 0.350 0.957
0.80 0.80 0.70 0.75 0.52 0.14 0.82 0.36 0.71
0.11 0.11 0.18 0.14 0.32 0.76 0.10 0.47 0.17
−0.00 −0.01
0.58 0.58 0.58 0.58 0.73 0.73 0.33 0.24 0.63
r
λ ω
0.138 −0.924 0.186 −0.106 1.602 0.061 0.945
0.01 −0.00 0.01 0.23 0.01 0.02 0.03
.956 .952 .948 .944 .940 .936 2500
θα : quantile at probability α ; RNE: relative numerical efficiency; ρi : autocorrelation at lag i.
5000 Draws of ω
7500
10000
5000
7500
10000
7500
10000
1.2 Table 4 Posterior replication summaries (US Dollar/Swiss Franc exchange rate).
θ
θ0.025
θ0.5
θ0.975
RNE
ρ1
ρ5
Reject. rate
r
1.466 0.032 0.959
1.542 0.060 0.966
1.623 0.126 0.973
0.90 0.50 0.78
0.05 0.34 0.12
−0.01
0.53 0.34 0.54
λ ω
0.01 0.02
1.0 0.8 0.6
θα : quantile at probability α ; RNE: relative numerical efficiency; ρi : autocorrelation at lag i.
0.4
Table 5 Posterior replication summaries (US Dollar/Sterling exchange rate).
0.0
0.2
2500
θ
θ0.025
θ0.5
θ0.975
RNE
ρ1
ρ5
Reject. rate
r
1.443 0.045 0.951
1.519 0.087 0.959
1.601 0.188 0.966
0.91 0.51 0.78
0.05 0.33 0.13
−0.01
0.52 0.32 0.60
λ ω
0.03 0.01
θα : quantile at probability α ; RNE: relative numerical efficiency; ρi : autocorrelation at lag i.
results are based on replications obtained by running the posterior simulator twice for 30,000 passes, of which the first 5000 were discarded. Convergence was checked by heteroskedasticity and autocorrelation consistent Wald equality tests on the expected values of the two chains, and by the method of Gelman and Rubin (1992). The final posterior sample was then obtained by combining the two chains and selecting every fifth replication, yielding (2 × 25,000)/5 = 10,000 final draws. The autocorrelations in the final posterior sample decay quickly, as indicated by ρ1 and ρ5 in Table 3 to 5. The priors are the same as in Section 5.2. For the S&P500 data (Table 3), the credible set for α01 implies the presence of autocorrelation, and that for α11 only contains negative values, confirming the findings of LeBaron (1992): autocorrelation is a decreasing function of anticipated volatility. The credible set for β1 confirms the intuition in Section 5.1 of a positive sign. The credible set for β2 confirms the stylized fact that market closure causes a subsequent increase in volatility (recall the definition of D2t in Section 5.1, and that θt is an inverse volatility). The GED parameter r is clearly less than 2, confirming the leptokurticity of the observation distribution. This is also the case for the exchange rate series (Tables 4 and 5). The parameter ω is particularly well identified in all cases. The burn-in period of 5000 passes (which was also used in the other simulations of Section 5.2) appears to be very conservative. Indeed, Fig. 3 presents the sample paths of 10000 replications of ω, λ, and r for the S&P500 data (using the model of Table 3), obtained after a burn-in run of 1000 passes only and without discarding intermediate draws; it suggests convergence after a few hundred sweeps of the Metropolis–Hastings algorithm. The 11,000 replications took about one hour of processor time on a five-year old 3.2 GH workstation, using compiled code. The top panel of Fig. 4 is a time series of point estimates (posterior means) of the anticipated volatilities st = st (ξ) for the S&P500 data, obtained by the filter (4.3)–(4.9) from a posterior
Draws of λ 1.9
1.8
1.7
1.6
1.5 2500
5000
Draws of GED parameter (r) Fig. 3. MCMC sample paths (S&P500 data).
sample. The middle panel is a time series of point estimates of the conditional expectations: E (yt | st ) =
α00 + α10 F (st ) 1 − α01 − α11 F (st )
(5.4)
where F (st ) is the logistic function (2.6). The bottom panel is a scatter plot of st against E (yt | st ). These graphs clearly indicate that a significant risk premium is present. Fig. 5 presents, for all three data sets, line graphs of the observations, together with median volatilities and 95% volatility confidence bands. For each t, the median volatilities and confidence bands were obtained from 1000 posterior replications of (λθt )−1/r , with θt drawn by the simulation smoother of Section 3. The volatility graphs closely reflect intuition; however, for the S&P500 data, it is interesting to note that the peak volatility estimate occurs during the recent financial crisis rather than during the 1987 crash. 6. Comparing the local scale and other models In this section, we will compare the local scale model with other formulations. The models used for comparison will be:
P.J. Deschamps / Journal of Econometrics 162 (2011) 369–382
377
and will be denoted by AR(0) t-GARCH when γ0 = γ1 = 0, and AR(1) t-GARCH otherwise. (2) Versions of the Lognormal SV model in Omori et al. (2007), with and without leverage. The basic version of this model is:
5 4 3 2 1
yt = ϵt exp(ht /2)
(6.4)
ht +1 = µ + φ(ht − µ) + ηt [ ] 1 ρσ 0 ϵt ∼N , ηt 0 ρσ σ 2
(6.5) i.i.d.
(6.6)
0 70 72 74 76 78 80 82 84 86 88 90 92 94 96 98 00 02 04 06 08 Estimates of anticipated volatilities s(t) .032 .031
We will first discuss comparisons of the preferred local scale models in Section 5 with the t-GARCH models and with the Lognormal SV models with and without leverage, but with Normal observation equation disturbances. A discussion of the heavytailed Lognormal SV models will be given at the end of this section. The prior used in the Lognormal SV model was the same as in Omori et al. (2007). For the t-GARCH model, the author used:
.030 .029 .028 .027 .026
p(γi ) = fN (γi ; 0, 10)
.025 70 72 74 76 78 80 82 84 86 88 90 92 94 96 98 00 02 04 06 08 Estimates of E[y(t) | s(t)]
Estimates of E[y(t) | s(t)]
The Lognormal SV model incorporates a leverage effect when
ρ ̸= 0. Omori et al. (2007) also estimate heavy-tailed versions of their model, obtained √ by multiplying the right-hand side of (6.4) by a latent variable λt , where λt has a suitable mixing distribution.
p(α0 ) ∝ fN (α0 ; 0, 10) I(0,∞) (α0 ) p(αi ) ∝ fN (αi ; 0.5, 10) I(0,∞) (αi ) (i = 1, 2)
.032
p(β) ∝ fN (β; 0.5, 10) I(0,1) (β)
We will first discuss comparisons of the preferred local scale models in Section 5 with the t-GARCH models and with the Lognormal SV models with and without leverage, but with Normal observation equation disturbances. A discussion of the heavy-tailed Lognormal SV models will be given at the end of this section. The prior used in the Lognormal SV model was the same as in Omori et al. (2007). For the t-GARCH model, the author used:

p(γi) = fN(γi; 0, 10)
p(α0) ∝ fN(α0; 0, 10) I(0,∞)(α0)
p(αi) ∝ fN(αi; 0.5, 10) I(0,∞)(αi)  (i = 1, 2)
p(β) ∝ fN(β; 0.5, 10) I(0,1)(β)
ν = 2 + ν∗,  ν∗ ∼ Exponential(0.1).

[Fig. 4. Anticipated volatilities and ARCH-M effects (S&P500 data): estimates of the anticipated volatilities s(t), estimates of E[y(t) | s(t)] over time, and a scatter plot of E[y(t) | s(t)] against s(t).]

Table 6
Logarithmic Bayes factors (competing models against best local scale models).

Data           Model                         ln(BF)    NSE
S&P500         AR(0) t-GARCH                 −25.2     0.020
S&P500         AR(1) t-GARCH                 −0.1      0.019
S&P500         Lognormal SV (leverage)       −38.7     0.071
S&P500         Lognormal SV (no leverage)    −123.6    0.044
USD-CHF        AR(0) t-GARCH                 −1.1      0.022
USD-CHF        AR(1) t-GARCH                 −12.6     0.031
USD-CHF        Lognormal SV (leverage)       −14.0     0.048
USD-CHF        Lognormal SV (no leverage)    −11.8     0.040
USD-Sterling   AR(0) t-GARCH                 −0.4      0.031
USD-Sterling   AR(1) t-GARCH                 −9.2      0.022
USD-Sterling   Lognormal SV (leverage)       −26.0     0.057
USD-Sterling   Lognormal SV (no leverage)    −24.8     0.045
Table 6 presents logarithmic (base e) Bayes factors against the extended local scale model (p = 1, r ≠ 2, β1 ≠ 0, β2 ≠ 0) in the asset return case, and against the basic local scale model (µt = 0, r ≠ 2, β1 = β2 = 0) for the exchange rate series. The t-GARCH marginal likelihoods were estimated by bridge sampling, and those for Lognormal SV were estimated by the method of Chib (1995), as implemented in Chib and Greenberg (1998). The log-likelihood of the Lognormal SV model, which is an input of Chib's method, was estimated by the particle filter described in Omori et al. (2007). The MCMC algorithm used in t-GARCH estimation is fully described in Ardia (2008, pp. 59–64); the sampler for Lognormal SV can be found in Omori et al. (2007).

The evidence in favor of leverage in the Lognormal SV model (ρ ≠ 0) can be obtained as a difference of the relevant logarithmic Bayes factors. It is not surprising that this evidence is decisive for asset returns, whereas the absence of leverage (ρ = 0) is favored for the exchange rates (though not very strongly). For the t-GARCH models, the evidence decisively favors AR(1) (γ0 ≠ 0, γ1 ≠ 0) for the S&P500 series, and AR(0) (γ0 = γ1 = 0) for the exchange rates.

The local scale models chosen in Section 5 are preferred to all Lognormal SV and t-GARCH formulations. It should be noted, however, that the evidence against the preferred t-GARCH models is very weak in all instances, the odds against t-GARCH and in favor of local scale never exceeding three to one. The relatively good performance of t-GARCH in this paper is corroborated by the results obtained by Geweke and Amisano (2010) with S&P500 daily returns. Comparing Tables 1 and 6 reveals that for the S&P500 data, the odds against Lognormal SV are due to the inclusion of D1t and to µt ≠ 0 in the local scale model, rather than to GED errors. Comparing Tables 2 and 6 reveals that for the exchange rate series, the superiority of local scale to Lognormal SV is due to the GED observation density.

The MCMC estimates of the t-GARCH and Lognormal SV models are in Tables 7 and 8. Table 7 shows that the variance equations in the Lognormal SV models are close to logarithmic random walks. Table 8 suggests that some asymmetry is present in the t-GARCH variance equations for the exchange rate series, even though this
[Fig. 5. Observations and volatility confidence bands. For each of the S&P500 data, the US dollar/Swiss franc rate, and the US dollar/Sterling rate, the panels show the observations and the smoothed volatilities with confidence bands.]
Table 7
MCMC estimates of the best Lognormal SV models.

        S&P500                        USD-CHF                       USD-Sterling
θ       θ0.025    θ0.5     θ0.975    θ0.025    θ0.5     θ0.975    θ0.025    θ0.5     θ0.975
φ       0.981     0.986    0.989     0.968     0.979    0.987     0.977     0.984    0.990
σ²      0.015     0.018    0.023     0.008     0.012    0.018     0.010     0.015    0.022
ρ       −0.618    −0.558   −0.481    NA        NA       NA        NA        NA       NA
µ       −0.280    −0.193   0.008     −0.897    −0.770   −0.637    −1.335    −1.147   −0.949

θα: quantile at probability α; NA: not applicable.
can obviously not be ascribed to leverage; as expected, strong asymmetry is present in the S&P500 case. Fig. 6 presents ratios of volatility posterior means obtained with the local scale and competing models. The volatilities were defined as exp(ht /2) for Lognormal SV, σt for t-GARCH, and (λθt )−1/r for the local scale; they were estimated by simulation smoothing in
the local scale and Lognormal SV cases. These ratios have a much higher amplitude when the local scale is compared to t-GARCH. The Lognormal SV and local scale estimated volatilities are quite close; this suggests that model specification does not have too much impact on the smoothed volatilities, where prior information is dominated by the information present in the entire sample.
Table 8
MCMC estimates of the best t-GARCH models.

        S&P500                        USD-CHF                       USD-Sterling
θ       θ0.025    θ0.5     θ0.975    θ0.025    θ0.5     θ0.975    θ0.025    θ0.5     θ0.975
γ0      0.014     0.029    0.044     NA        NA       NA        NA        NA       NA
γ1      0.054     0.074    0.094     NA        NA       NA        NA        NA       NA
α0      0.006     0.008    0.012     0.004     0.007    0.011     0.002     0.003    0.006
α1      0.012     0.020    0.029     0.027     0.038    0.052     0.033     0.045    0.059
α2      0.083     0.097    0.113     0.034     0.044    0.057     0.042     0.056    0.072
β       0.922     0.933    0.942     0.931     0.946    0.959     0.926     0.942    0.956
ν       7.415     8.570    10.072    6.559     7.822    9.537     5.855     6.820    8.133

θα: quantile at probability α; NA: not applicable.
[Fig. 6. Ratios of volatility posterior means. For the S&P500 data, the US dollar–Swiss franc rate, and the US dollar–Sterling rate, the panels show the Lognormal SV / local scale and t-GARCH / local scale ratios over time.]
Since the Lognormal SV models used in Table 6 have Normal disturbances, and since Tables 1 and 2 clearly illustrated the benefits of introducing heavy-tailed errors in the local scale models, a legitimate question is the robustness of the results in Table 6 with respect to the Normality assumption in the Lognormal SV observation density. Unfortunately, Omori et al. (2007) do not discuss particle filters for the heavy-tailed versions of their model.
It is, however, a simple matter to estimate versions of the local scale model on the TOPIX data used by Omori et al., since these are publicly available on sites.google.com/site/jnakajimaweb/sv (I am grateful to Y. Omori and J. Nakajima for pointing out this source). The remainder of this section will therefore discuss a comparison of local scale marginal likelihoods with the highest marginal likelihood reported by Omori et al., using the TOPIX data.
Table 9
Logarithmic Bayes factors of local scale models against best Lognormal SV (TOPIX data). All models include the variance regressor D1t.

Observation density              No ARCH-M (µt = 0)    ARCH-M, p = 0    ARCH-M, p = 1
Normal      ln(BF)               −21.6                 −23.9            −13.2
            NSE                  0.14                  0.14             0.14
GED         ln(BF)               −22.3                 −25.0            −15.2
            NSE                  0.14                  0.14             0.14
Table 9 presents logarithmic (base e) Bayes factors of six versions of the local scale model, each including the variance regressor D1t and estimated using the same prior as that used for the S&P500 data, against the best Lognormal SV formulation in Omori et al. (2007). This formulation has the leverage effect (ρ ≠ 0) and a Student-t observation density. Local scale is clearly dominated by Lognormal SV for this particular data set. However, this dominance is not due to heavy-tailed errors, since the evidence in Table 9 slightly favors a Normal over a GED observation density (this corroborates the findings in Omori et al. (2007), where the evidence in favor of heavy tails was very slight). The superior performance of Lognormal SV for this data set may be due to the fact that the variance process estimated by Omori et al. is not close to a logarithmic random walk, contrary to our results in Table 7; the posterior credible interval for φ in their article has bounds of 0.908 and 0.98. This suggests that Lognormal SV and the local scale are complementary, in the sense that Lognormal SV (with a stationary variance process) is preferable when the volatilities are not nearly integrated. To conclude this section, we note that the Bayes factors in Table 9 decisively favor our ARCH-M formulation with p = 1.

7. Discussion and conclusions

This paper has attempted to develop a full Bayesian treatment of an extended version of the local scale model in Shephard (1994). Applications to publicly available data sets have shown that posterior simulation is straightforward, and that exact, efficient simulation smoothing is possible.

This paper has also shown that introducing ARCH-M effects and variance regressors significantly improves the marginal likelihood when the model is estimated on asset return data. The ARCH-M function has two features that would be difficult to introduce in Lognormal SV models. First, it is formulated in terms of the scale parameter st of the one step ahead prediction density of the dependent variable. This is arguably more natural than a formulation using the expectation of the volatility conditional on its past history, since volatility is unobservable. Second, the autoregression coefficients depend on st, and this dependence was found to be very significant for the S&P500 and TOPIX data.

The model was compared to t-GARCH and Lognormal SV formulations, using Bayes factors. The posterior odds favored local scale over t-GARCH, though not very strongly. The results of comparisons with Lognormal SV were more ambiguous. Even though the local scale was preferred to Lognormal SV with a Normal observation density for one asset return and two exchange rate series, it is not yet clear that comparisons with heavy-tailed Lognormal SV formulations would yield similar results; and the Bayes factors for the TOPIX data suggested that stationary variance processes are preferable to the local scale integrated variance process in some cases. Generalizing the local scale variance process to more general autoregressions would therefore be an interesting topic for further research. The challenge, in this case, would be to obtain the one step ahead prediction density in explicit form.

A potential disadvantage of the local scale model, when compared to Lognormal SV, is the difficulty of introducing correlation between the observation and evolution disturbances (Yu, 2005; Omori et al., 2007). It is, however, straightforward to introduce dependence between current volatilities and past observations. Another limitation is the inability to specify an asymmetric news impact function of the EGARCH type: all the attempts to generalize the model in that direction led to serious numerical problems for large values of T, including exponent overflows. This limitation does not appear to have been mentioned in the previous literature. However, the fact that the model remained competitive with asymmetric t-GARCH formulations suggests that the empirical implications of this shortcoming are not too serious.

Acknowledgement

The author thanks anonymous reviewers and participants at the European Meeting of the Econometric Society, Barcelona, Spain, August 2009 for helpful comments on previous versions.

Appendix A. Validity of the simulation smoother

In this appendix, we will show that the two densities on the right-hand side of (3.2), though not conjugate, yield a translated Gamma posterior.

Proposition 1. Bayes' theorem implies:

f(θt | y1:t, θt+1) = fG(θt − φt+1^(−1) θt+1; 1/r, bt)

for t = T − 1, ..., 0, where φt+1 is given by (4.5) and bt is given by (4.4).

Proof. First note that if Y = cX with X ∼ Be(α, β) and c > 0, then:

fY(y) = fB(y/c; α, β) (1/c) I(0,c)(y) = [Γ(α + β)/(Γ(α)Γ(β))] y^(α−1) (c − y)^(β−1) c^(1−α−β) I(0,c)(y).
Applying the preceding result to θt+1 = φt+1 θt ηt+1 and recalling that at = a = 1/[r(1 − ω)] yields:

f(θt+1 | θt, y1:t) ∝ θt+1^(ωa−1) (φt+1 θt − θt+1)^(1/r−1) (φt+1 θt)^(1−ωa−1/r) I(0, φt+1 θt)(θt+1)
                 ∝ θt+1^(ωa−1) φt+1^(1/r−1) (θt − φt+1^(−1) θt+1)^(1/r−1) (φt+1 θt)^(1−ωa−1/r) I(0, φt+1 θt)(θt+1).    (A.1)

Viewing (A.1) as the likelihood of θt and omitting all those terms that do not depend on θt yields:

L(θt; θt+1, y1:t) ∝ (θt − φt+1^(−1) θt+1)^(1/r−1) θt^(1−ωa−1/r) I(φt+1^(−1) θt+1, ∞)(θt).    (A.2)

On the other hand, (2.9) implies:

f(θt | y1:t) ∝ θt^(a−1) exp(−bt θt) I(0,∞)(θt).    (A.3)
Multiplying (A.2) and (A.3) yields:

f(θt | y1:t, θt+1) ∝ (θt − φt+1^(−1) θt+1)^(1/r−1) exp(−φt+1^(−1) θt+1 bt) exp[−bt(θt − φt+1^(−1) θt+1)] I(φt+1^(−1) θt+1, ∞)(θt)
                  ∝ (θt − φt+1^(−1) θt+1)^(1/r−1) exp[−bt(θt − φt+1^(−1) θt+1)] I(φt+1^(−1) θt+1, ∞)(θt)

where we have used a − ωa = 1/r. This is the kernel of a Ga(1/r, bt) density on θt − φt+1^(−1) θt+1, proving Proposition 1.
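Proposition 1 translates directly into a backward simulation pass: given θt+1, one draws θt as φt+1^(−1) θt+1 plus a Ga(1/r, bt) increment. The following Python sketch illustrates this; the filtered quantities φ and b and the terminal draw are placeholders, since in practice they come from the filter (4.3)–(4.9), and the helper name is ours.

import numpy as np

def backward_draw(theta_next, phi_next, b_t, r, rng):
    """One step of the simulation smoother implied by Proposition 1:
    theta_t | y_{1:t}, theta_{t+1} equals theta_{t+1}/phi_{t+1} plus a
    Ga(1/r, b_t) increment (numpy parameterises Gamma by shape/scale)."""
    return theta_next / phi_next + rng.gamma(shape=1.0 / r, scale=1.0 / b_t)

rng = np.random.default_rng(1)
T, r, omega = 200, 1.8, 0.95
phi = np.full(T + 1, 1.02)     # placeholder for phi_{t+1} from (4.5)
b = np.full(T + 1, 50.0)       # placeholder for b_t from (4.4)
theta = np.zeros(T + 1)
# Terminal draw from the filtered Ga(a, b_T) with a = 1/[r(1 - omega)]:
theta[T] = rng.gamma(shape=1.0 / (r * (1.0 - omega)), scale=1.0 / b[T])
for t in range(T - 1, -1, -1):         # backward pass t = T-1, ..., 0
    theta[t] = backward_draw(theta[t + 1], phi[t + 1], b[t], r, rng)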
Appendix B. Proposal densities

B.1. Sampling ω and r

Since the method used will be the same, we will use the generic expression x to denote a draw of either parameter. The prior of both parameters is Uniform U(xmin, xmax). The candidate we propose is a linear function xmin + (xmax − xmin)X of a Beta variable X with parameters ax and bx:

q(x | ϕ) ∝ (x − xmin)^(ax−1) (xmax − x)^(bx−1) I(xmin, xmax)(x).    (B.1)

In (B.1), ϕ contains all the parameters in ξ except x, and ax, bx are chosen in such a way that the location of the candidate density corresponds to an approximate expected value x̄ of the full conditional posterior of x. This value can be computed as:

x̄ = Σ_{i=1}^{L} pi xi,  where  pi = L(xi; ϕ) / Σ_{j=1}^{L} L(xj; ϕ)

and where the xi form a grid of equally spaced values x1, ..., xL covering the prior support. The values for ax and bx solve the equation:

ax / (ax + bx) = (x̄ − xmin) / (xmax − xmin)    (B.2)

subject to ax = δx (if the right-hand side of (B.2) is less than 0.5) or bx = δx (if the right-hand side of (B.2) is larger than 0.5), where δx ≥ 1 is a tuning parameter. This ensures that both ax and bx are at least equal to δx. Decreasing δx will increase the variance of the candidate while leaving its location unchanged. In all the simulations of this paper, the same values of δx = 20, L = 10 were chosen after some experimentation and led to well-mixing MCMC chains.

B.2. Sampling α

We obtain the proposal density by combining the Multinormal prior in (4.1) with a likelihood of α suggested by (2.1) and (2.5). However, considerations suggested by the choice of an appropriate importance sampling density indicate that heavier tails than the Normal might be required. Let ϕα denote all the parameters in ξ except α. We first run the filter (4.3)–(4.9) to obtain b0:T, φ1:T, and s1:T, then run the smoother (3.6)–(3.7) to obtain the conditional posterior expectations θ̄t. Note that all these quantities depend on α and ϕα. Define y = y1:T,

Θ(α, ϕα) = diag[(λθ̄1)^(2/r), ..., (λθ̄T)^(2/r)]

and Xα(α, ϕα) as the T by 2p + 2 matrix with row t equal to:

[1  yt−1  ···  yt−p  F(st)  F(st)yt−1  ···  F(st)yt−p].

The proposal density is multivariate Student with να degrees of freedom, location vector α∗, and scale matrix s²α Σα∗:

q(α | αold, ϕα) ∝ (det s²α Σα∗)^(−1/2) [1 + (α − α∗)′(s²α Σα∗)^(−1)(α − α∗)/να]^(−(να+2p+2)/2)    (B.3)

where:

(Σα∗)^(−1) = Σα^(−1) + Xα′(αold, ϕα) Θ(αold, ϕα) Xα(αold, ϕα)
α∗ = Σα∗ [Σα^(−1) mα + Xα′(αold, ϕα) Θ(αold, ϕα) y]

and where να and s²α are tuning parameters. In all the simulations of this paper, identical values of να = 3, s²α = 1 were chosen after some experimentation and led to well-mixing MCMC chains.

B.3. Sampling β

We obtain the proposal density by combining the Multinormal prior in (4.1) with a likelihood suggested by (2.2). It is easy to show that:

σ²η = V(ln ηt+1) = Ψ′(ω/[r(1 − ω)]) − Ψ′(1/[r(1 − ω)])    (B.4)

where Ψ′(z) = d² ln Γ(z)/dz² is the trigamma function, which can be computed using the algorithm in Bowman (1984). Let ϕβ denote all the parameters in ξ except β. We run the filter and smoother to obtain the θ̄t, define yβ(β, ϕβ) as the (T − 1) × 1 vector with element t equal to ln θ̄t+1 − ln θ̄t for 1 ≤ t ≤ T − 1, and Xβ as the (T − 1) × k matrix with row t equal to:

[D1,t+1  ···  Dk,t+1].

The proposal density is multivariate Student with νβ degrees of freedom, location vector β∗, and scale matrix s²β Σβ∗:

q(β | βold, ϕβ) ∝ (det s²β Σβ∗)^(−1/2) [1 + (β − β∗)′(s²β Σβ∗)^(−1)(β − β∗)/νβ]^(−(νβ+k)/2)    (B.5)

where:

(Σβ∗)^(−1) = Σβ^(−1) + Xβ′Xβ / σ²η
β∗ = Σβ∗ [Σβ^(−1) mβ + Xβ′ yβ(βold, ϕβ) / σ²η]

and where νβ and s²β are again tuning parameters. In all the simulations of this paper, identical values of νβ = 3, s²β = 400 were chosen after some experimentation and led to well-mixing MCMC chains. Setting s²β = 1 led to low rejection rates but high autocorrelations.

B.4. Sampling λ

We obtain the candidate density by combining the Gamma prior in (4.1) with a likelihood suggested by (2.1). Let ϕλ denote all the parameters in ξ except λ. We run the filter and smoother to obtain the µt and θ̄t, and define:

Λt(λ, ϕλ) = |yt − µt(λ, ϕλ)|^r θ̄t(λ, ϕλ).

The candidate density is:

q(λ | λold, ϕλ) = fG(λ; δλ a∗λ, δλ b∗λ)    (B.6)

with:

a∗λ = aλ + T/r,  b∗λ = bλ + ψ(r) Σ_{t=1}^{T} Λt(λold, ϕλ)
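For concreteness, the grid-based construction of the Beta candidate in B.1 can be sketched in Python as follows. This is an illustrative version under the conventions above: the helper names are ours, the grid log-likelihood is assumed to be supplied by the caller, and 0 < x̄ < 1 (after rescaling) is taken for granted.

import numpy as np

def beta_candidate_params(x_min, x_max, grid_loglik, delta_x=20.0):
    """Construct (a_x, b_x) for the Beta candidate in (B.1): the candidate
    mean matches the approximate conditional posterior mean of (B.2)."""
    ll = np.asarray(grid_loglik)
    w = np.exp(ll - ll.max())              # stabilised likelihood weights
    p = w / w.sum()                        # normalised weights p_i
    grid = np.linspace(x_min, x_max, len(ll))
    x_bar = (p * grid).sum()               # approximate posterior mean
    m = (x_bar - x_min) / (x_max - x_min)  # right-hand side of (B.2)
    if m < 0.5:                            # pin the smaller parameter at delta_x
        a_x, b_x = delta_x, delta_x * (1.0 - m) / m
    else:
        a_x, b_x = delta_x * m / (1.0 - m), delta_x
    return a_x, b_x

def draw_candidate(x_min, x_max, a_x, b_x, rng):
    # Candidate x = x_min + (x_max - x_min) X with X ~ Be(a_x, b_x), as in (B.1).
    return x_min + (x_max - x_min) * rng.beta(a_x, b_x)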
and where the tuning parameter δλ is chosen by experimentation. Lowering δλ increases the variance of the candidate but leaves its expectation unchanged. In all the simulations of this paper, an identical value of δλ = 1/625 was chosen after some experimentation and led to well-mixing MCMC chains. As before, high values of δλ led to low rejection rates but high autocorrelations.

References

Ardia, D., 2008. Financial Risk Management with Bayesian Estimation of GARCH Models: Theory and Applications. Springer-Verlag, Berlin.
Ardia, D., 2009. Bayesian estimation of a Markov-switching threshold asymmetric GARCH model with Student-t innovations. Econometrics Journal 12, 105–126.
Bernardo, J.M., Smith, A.F.M., 2000. Bayesian Theory. Wiley, Chichester.
Bollerslev, T., 1986. Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics 31, 307–327.
Bollerslev, T., Engle, R.F., Nelson, D.B., 1994. ARCH models. In: Engle, R.F., McFadden, D.L. (Eds.), Handbook of Econometrics, vol. 4. Elsevier, Amsterdam, pp. 2959–3038.
Bowman, K.O., 1984. Computation of the polygamma functions. Communications in Statistics – Simulation and Computation 13, 409–415.
Box, G.E.P., Tiao, G.C., 1992. Bayesian Inference in Statistical Analysis. Wiley, New York.
Carter, C.K., Kohn, R., 1994. On Gibbs sampling for state space models. Biometrika 81, 541–553.
Chib, S., 1995. Marginal likelihood from the Gibbs output. Journal of the American Statistical Association 90, 1313–1321.
Chib, S., 1996. Calculating posterior distributions and modal estimates in Markov mixture models. Journal of Econometrics 75, 79–97.
Chib, S., Greenberg, E., 1998. Analysis of multivariate probit models. Biometrika 85, 347–361.
Chib, S., Jeliazkov, I., 2001. Marginal likelihood from the Metropolis–Hastings output. Journal of the American Statistical Association 96, 270–281.
Chib, S., Nardari, F., Shephard, N., 2002. Markov chain Monte Carlo methods for stochastic volatility models. Journal of Econometrics 108, 281–316.
Deschamps, P.J., 2003. Time-varying intercepts and equilibrium analysis: an extension of the dynamic almost ideal demand model. Journal of Applied Econometrics 18, 209–236.
Deschamps, P.J., 2008. Comparing smooth transition and Markov switching autoregressive models of US unemployment. Journal of Applied Econometrics 23, 435–462.
DiCiccio, T.J., Kass, R.E., Raftery, A., Wasserman, L., 1997. Computing Bayes factors by combining simulation and asymptotic approximations. Journal of the American Statistical Association 92, 903–915.
Engle, R.F., 1982. Autoregressive conditional heteroscedasticity with estimates of the variance of United Kingdom inflation. Econometrica 50, 987–1007.
Engle, R.F., Lilien, D.M., Robins, R.P., 1987. Estimating time varying risk premia in the term structure: the ARCH-M model. Econometrica 55, 391–407.
French, K.R., Schwert, G.W., Stambaugh, R.F., 1987. Expected stock returns and volatility. Journal of Financial Economics 19, 3–29.
Frühwirth-Schnatter, S., 2004. Estimating marginal likelihoods for mixture and Markov switching models using bridge sampling techniques. Econometrics Journal 7, 143–167.
Gelfand, A.E., Dey, D.K., 1994. Bayesian model choice: asymptotics and exact calculations. Journal of the Royal Statistical Society (Series B) 56, 501–514.
Gelman, A., Rubin, D., 1992. Inference from iterative simulation using multiple sequences. Statistical Science 7, 457–511.
Geweke, J., Amisano, G., 2010. Comparing and evaluating Bayesian predictive distributions of asset returns. International Journal of Forecasting 26, 216–230.
Glosten, L.R., Jagannathan, R., Runkle, D., 1993. On the relation between the expected value and the volatility of the nominal excess return on stocks. Journal of Finance 48, 1779–1801.
Harvey, A.C., 1989. Forecasting, Structural Time Series Models and the Kalman Filter. Cambridge University Press, Cambridge.
Harvey, A.C., Trimbur, T.M., van Dijk, H.K., 2007. Trends and cycles in economic time series: a Bayesian approach. Journal of Econometrics 140, 618–649.
Jacquier, E., Polson, N.G., Rossi, P.E., 1994. Bayesian analysis of stochastic volatility models. Journal of Business and Economic Statistics 12, 371–389.
Jacquier, E., Polson, N.G., Rossi, P.E., 2004. Bayesian analysis of stochastic volatility models with fat-tails and correlated errors. Journal of Econometrics 122, 185–212.
Jeffreys, H., 1961. Theory of Probability, 3rd ed. Oxford University Press, Oxford.
Kim, S., Shephard, N., Chib, S., 1998. Stochastic volatility: likelihood inference and comparison with ARCH models. Review of Economic Studies 65, 361–393.
LeBaron, B., 1992. Some relations between volatility and serial correlation in stock market returns. Journal of Business 65, 199–220.
McDonald, J.B., Newey, W.K., 1988. Partially adaptive estimation of regression models via the generalized t distribution. Econometric Theory 4, 428–457.
Meng, X.-L., Wong, W.H., 1996. Simulating ratios of normalizing constants via a simple identity: a theoretical exploration. Statistica Sinica 6, 831–860.
Nelson, D.B., 1991. Conditional heteroskedasticity in asset returns: a new approach. Econometrica 59, 347–370.
Omori, Y., Chib, S., Shephard, N., Nakajima, J., 2007. Stochastic volatility with leverage: fast and efficient likelihood inference. Journal of Econometrics 140, 425–449.
Philipov, A., Glickman, M.E., 2006. Multivariate stochastic volatility via Wishart processes. Journal of Business and Economic Statistics 24, 313–328.
Schotman, P.C., van Dijk, H.K., 1991. A Bayesian analysis of the unit root in real exchange rates. Journal of Econometrics 49, 195–238.
Shephard, N., 1994. Local scale models: state space alternative to integrated GARCH processes. Journal of Econometrics 60, 181–202.
Smith, R.L., Miller, J.E., 1986. A non-Gaussian state space model and application to prediction of records. Journal of the Royal Statistical Society (Series B) 48, 79–88.
Stroud, J.R., Müller, P., Polson, N.G., 2003. Nonlinear state-space models with dependent variances. Journal of the American Statistical Association 98, 377–386.
Uhlig, H., 1997. Bayesian vector autoregressions with stochastic volatility. Econometrica 65, 59–73.
Yu, J., 2005. On leverage in a stochastic volatility model. Journal of Econometrics 127, 165–178.
Journal of Econometrics 162 (2011) 383–396
Stick-breaking autoregressive processes

J.E. Griffin (School of Mathematics, Statistics and Actuarial Science, University of Kent, UK; corresponding author) and M.F.J. Steel (Department of Statistics, University of Warwick, UK)

Article history: Available online 16 March 2011
JEL classification: C11; C14; C22
Keywords: Bayesian nonparametrics; Dirichlet process; Poisson–Dirichlet process; Time-dependent nonparametrics

Abstract: This paper considers the problem of defining a time-dependent nonparametric prior for use in Bayesian nonparametric modelling of time series. A recursive construction allows the definition of priors whose marginals have a general stick-breaking form. The processes with Poisson–Dirichlet and Dirichlet process marginals are investigated in some detail. We develop a general conditional Markov Chain Monte Carlo (MCMC) method for inference in the wide subclass of these models where the parameters of the marginal stick-breaking process are nondecreasing sequences. We derive a generalised Pólya urn scheme type representation of the Dirichlet process construction, which allows us to develop a marginal MCMC method for this case. We apply the proposed methods to financial data to develop a semi-parametric stochastic volatility model with a time-varying nonparametric returns distribution. Finally, we present two examples concerning the analysis of regional GDP and its growth.
1. Introduction

Nonparametric estimation is an increasingly important element in the modern econometrician's toolbox. This paper focuses on nonparametric Bayesian models, which, in spite of the name, involve infinitely many parameters, and are typically used to express uncertainty in distribution spaces.¹ We will concentrate on the infinite mixture model, which for an observable y, expresses the density as

f(y) = Σ_{i=1}^{∞} pi k(y|θi)

where k(y|θ) is a probability density function for y, while p1, p2, p3, ... is an infinite sequence of positive numbers such that Σ_{i=1}^{∞} pi = 1 and θ1, θ2, θ3, ... is an infinite sequence of parameter values for k. The model represents the distribution of y as an infinite mixture which can flexibly represent a wide range of distributional shapes and generalises the standard finite mixture model. The Bayesian analysis of this model is completed by the choice of a prior for p1, p2, p3, ... and θ1, θ2, θ3, .... Typically θ1, θ2, θ3, ... are assumed independent and identically distributed from a distribution H. The model is often expressed more compactly in terms of a mixing measure G,

f(y) = ∫ k(y|ϕ) dG(ϕ)    (1)

where G is a random probability measure

G = Σ_{i=1}^{∞} pi δθi

while δθ is the Dirac delta function which places measure 1 on the point θ. The distribution G is almost surely discrete and each element is often called an ''atom''. Many priors have been proposed for G including the Dirichlet process (Lo, 1984), Stick-Breaking (Ishwaran and James, 2001, 2003) (which will be discussed in Section 2), and Normalised Random Measures (James et al., 2009).

If covariates are observed with y, the infinite mixture model can be extended by allowing the kernel k to depend on x (Leslie et al., 2007; Chib and Greenberg, 2009), the mixing measure G to depend on x (De Iorio et al., 2004; Müller et al., 2004; Griffin and Steel, 2006; Dunson et al., 2007) or both (Geweke and Keane, 2007). This allows posterior inference about the unknown distribution at a particular covariate value to borrow strength from the distribution at other covariate values and so posterior estimates of the distributions at different covariate values are smoothed. Similarly, if observations are made over time then the mixture model can be extended to allow time dependence in the kernel k or the mixing measure G.

∗ Corresponding address: School of Mathematics, Statistics and Actuarial Science, University of Kent, Canterbury, CT2 7NF, UK. Tel.: +44 1227 823627; fax: +44 1227 827932. E-mail address: [email protected] (J.E. Griffin).
¹ Surveys of Bayesian nonparametric methods are provided in Walker et al. (1999) and Müller and Quintana (2004). Early applications in economics include autoregressive panel data models (Hirano, 2002), longitudinal data treatment models (Chib and Hamilton, 2002) and stochastic frontiers (Griffin and Steel, 2004).
We will concentrate on the second case and consider an extended infinite mixture model to define the density ft(y) at time t by

ft(y) = ∫ k(y|ϕ) dGt(ϕ)    (2)

Gt = Σ_{j=1}^{∞} pj(t) δθj

where Σ_{j=1}^{∞} pj(t) = 1 for all t. The kernels have fixed parameters but their probabilities are allowed to change over time. There are several models which fit into the framework of (2). One possible approach expresses p1(t), p2(t), p3(t), ... as the stick-breaking transformation of V(t) = (V1(t), V2(t), V3(t), ...) where 0 < Vj(t) < 1 for all j and t, so that

pj(t) = Vj(t) ∏_{k<j} (1 − Vk(t)).
The ''arrivals'' construction of Griffin and Steel (2006) defines a sequence of times τ1, τ2, τ3, ... which follow a Poisson process and increases the size of V(t) at time τj by introducing an extra variable 0 < Vj⋆ < 1. This process defines distributions that change in continuous time and whose autocorrelation can be controlled by the intensity of the Poisson process. The Time Series Dependent Dirichlet Process of Nieto-Barajas et al. (2008) defines a stochastic process for Vj(t) independent from Vk(t) for k ≠ j using an auxiliary sequence of binomial random variables. This process does not include the introduction of new atoms and so has rather different areas of application than the ''arrivals'' construction and the models developed in this paper.

Alternatively, Zhu et al. (2005) model the distribution function of the observables directly by defining a Time-Sensitive Dirichlet Process which generalises the Pólya urn scheme of the Dirichlet process. However, the process is not consistent under marginalisation of the sample. A related approach is described by Caron et al. (2007) who define a time-dependent nonparametric prior with Dirichlet process marginals by allowing atoms to be deleted (unlike Griffin and Steel, 2006) as well as added at each discrete time point. A careful construction of the addition and deletion process allows a Pólya urn scheme to be derived and defines a process which is consistent under marginalisation. In discrete time, Dunson (2006) proposes the evolution Gt = πGt−1 + (1 − π)ϵt where π is a parameter and ϵt is a realisation from a Dirichlet process. This defines an AR-process type model and an explicit Pólya urn-type representation allows efficient inference. This has some similarities to the ''arrivals'' π-DDP in discrete time, where the model is generalised to Gt = πt−1 Gt−1 + (1 − πt−1)ϵt−1 where ϵt is a discrete distribution with a Poisson-distributed number of atoms and πt is a random variable correlated with ϵt. Griffin and Steel (2006) show how to ensure that the marginal law of Gt follows a Dirichlet process for all t. An alternative approach is to assume that pj(t) = pj for all t and the time dependence is instead introduced through the atoms θj; in that case we obtain an infinite mixture of time-series model in the framework of single-p DDP models. Rodriguez and ter Horst (2008) develop methods in this direction.

Other time-dependent nonparametric models have been developed as generalisations of hidden Markov models. Fox et al. (2008) develop a hidden Markov model with an infinite number of states and apply their method to regime-switching in financial time series. Taddy and Kottas (2009) propose a Markov-switching model where there are a finite number of states each of which is associated with a nonparametric distribution.
In this paper we develop a process of distributions {Gt} which evolve by adding new atoms in continuous time and can be used in (2). An infinite mixture model, as opposed to a finite mixture model, is natural in this context since it can be thought of as being generated by the introduction of atoms from infinitely far in the past. The processes are strictly stationary and have the same stick-breaking process prior marginally for all t. The specific cases where the marginal is a Dirichlet process and a Poisson–Dirichlet² process are discussed (and represent two special cases of the construction). A Pólya urn-type scheme is derived for the important special case of a Dirichlet process marginal. Markov chain Monte Carlo (MCMC) schemes for inference with both the general process and the special cases are described. This generalises the work of Griffin and Steel (2006) from Dirichlet process marginals to more general stick-breaking processes.

The paper is organised in the following way: Section 2 briefly discusses stick-breaking processes and Section 3 describes the link between stick-breaking processes and time-varying nonparametric models and presents a method for constructing stationary processes with a given marginal stick-breaking process. This section also discusses two important special cases: with Dirichlet process and Poisson–Dirichlet marginals. Section 4 briefly describes the proposed computational methods. In Section 5 we explore three econometric examples using the two leading processes. In particular, we develop a stochastic volatility model with a time-varying nonparametric returns distribution, and analyse models for regional GDP and growth of regional GDP. Proofs of all Theorems are grouped in Appendix A, whereas Appendix B presents details of the MCMC algorithms and Appendix C compares the marginal and the conditional algorithms for the case with Dirichlet process marginals.

2. Stick-breaking processes

The choice of a prior for G defines the mixture model. We will concentrate on stick-breaking processes which are defined as follows for the static case:

Definition 1. Suppose that a = (a1, a2, ...) and b = (b1, b2, ...) are sequences of positive real numbers. A random probability measure G follows a stick-breaking process if

G =d Σ_{i=1}^{∞} pi δθi,  pi = Vi ∏_{j<i} (1 − Vj)

where θ1, θ2, θ3, ... ∼ H and V1, V2, V3, ... is a sequence of independent random variables for which Vi ∼ Be(ai, bi) for all i. This process will be denoted by G ∼ Π(a, b, H).

Ishwaran and James (2001) show that the process is well-defined (i.e. Σ_{i=1}^{∞} pi = 1 almost surely) if

Σ_{i=1}^{∞} log(1 + ai/bi) = ∞.
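For concreteness, the following Python sketch draws a truncated approximation to the weights of a stick-breaking process, with the Dirichlet process (ai = 1, bi = M) and Poisson–Dirichlet (ai = 1 − a, bi = M + ai) parameterisations discussed in the next paragraph; truncation at K atoms is our simplification for illustration and is not part of the definition.

import numpy as np

def stick_breaking_weights(a, b, rng):
    """Draw weights p_1, ..., p_K from a stick-breaking process with
    parameter sequences a, b (truncated at K = len(a) atoms)."""
    v = rng.beta(a, b)                         # V_i ~ Be(a_i, b_i)
    remaining = np.cumprod(1.0 - v)            # prod_{j<=i} (1 - V_j)
    return v * np.concatenate(([1.0], remaining[:-1]))

rng = np.random.default_rng(0)
K, M, a = 50, 1.0, 0.3
i = np.arange(1, K + 1)
p_dp = stick_breaking_weights(np.ones(K), M * np.ones(K), rng)        # Dirichlet process
p_pd = stick_breaking_weights((1 - a) * np.ones(K), M + a * i, rng)   # Poisson-Dirichlet
atoms = rng.normal(size=K)    # theta_i drawn i.i.d. from H (here a standard normal)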
Note that the ordering of the atoms θi in a stick-breaking process matters, since the mean weight E[pi] is decreasing with i; thus the atoms later in the ordering will tend to have less weight. Two specific stick-breaking priors have been well-studied: the Dirichlet process, where ai = 1 and bi = M, denoted by G ∼ DP(MH), and the Poisson–Dirichlet process, where ai = 1 − a and bi = M + ai (with 0 ≤ a < 1, M > −a). The Dirichlet process can be expressed as a special case of the Poisson–Dirichlet process where a = 0. The Poisson–Dirichlet process is an attractive generalisation of the Dirichlet process where E[pi] can decay at a sub-geometric rate (in contrast to the geometric rate associated with the Dirichlet process). This also affects the distribution of the number of clusters³ in a sample of n values. The mean for the Dirichlet process is approximately M log n whereas the mean for the Poisson–Dirichlet process is Sa,M n^a (Pitman, 2003) where Sa,M depends on a and M. The parameter a also affects the shape of the distribution of the number of clusters. Fig. 1 shows this distribution for different values of a with M fixed at 1. Clearly, the variance of the distribution increases with a. This extra flexibility allows a wide range of possible variances for a given prior mean number of clusters.

² This process is also sometimes called a Pitman–Yor process, after Pitman and Yor (1997).

[Fig. 1. Number of clusters in n = 20 observations, using M = 1 and different values of a for the Poisson–Dirichlet process (a = 0 corresponds to the Dirichlet process).]

3. Stick-breaking autoregressive processes

In this section, we will construct {Gt}t∈[0,T] which are strictly stationary with stick-breaking marginals and can be used in the time-dependent mixture model in (2). We start with a process in continuous time defined by
Gt = G̃N(t)

where N(t) is a Poisson process with intensity λ and G̃s is defined by the recursion

G̃s = Vs G̃s−1 + (1 − Vs) δθs    (3)

where θs ∼ H and Vs ∼ Be(1, M). The process Gt evolves through jumps in the distribution at the arrival times of the Poisson process. This occurs in a specific way: a new atom θs is introduced into Gt (so a new cluster is introduced into the mixture model for y in (2)) and the previous atoms are discounted by Vs (so all old clusters are downweighted in the mixture model). Griffin and Steel (2006) show that both G̃ and G are strictly stationary processes and that G̃s and Gs follow a Dirichlet process with mass parameter M and centring distribution H. They also show that for any set B

Corr(Gt(B), Gt+k(B)) = exp(−λk/(M + 1)) ≡ ρk    (4)

and so the dependence between measures on any set decreases exponentially at the same rate. The form of (3) defines a simple recursion but it also restricts the form of the marginal Gt to a Dirichlet or two-parameter beta process (Ishwaran and Zarepour, 2000). The process Gt is stationary and so the choice of marginal process will control the distribution of clusters over time. The Dirichlet process has a relatively small variance of the number of clusters for a fixed mean. Therefore, the number of clusters will be relatively stable over time. Alternative marginal processes, such as a Poisson–Dirichlet process as defined in the previous section, offer the potential for larger variances and so more flexibility in modelling the number of clusters over time.

³ An important property of these Bayesian nonparametric models is that the same atom can be assigned to more than one observation; in other words, they induce clustering of the n observations into typically less than n groups.

The distribution G̃s defined by (3) can be expressed in stick-breaking form as

G̃s = Σ_{i=−∞}^{s} δθi Vi ∏_{j=i+1}^{s} (1 − Vj).    (5)

The stick-breaking is applied ''backwards in time'' which, combined with the ordering of the expected weights, means that weights tend to be larger for atoms introduced at later times. Because the new atoms which ''refresh'' the distribution are placed at the front of the ordering (i.e. with the highest expected weight), the Vi's in (5) are associated with a different place in the ordering whenever a new atom appears. If the distribution of Vi depends on its place in this ordering (as is the case for the Poisson–Dirichlet marginals), this needs to be taken into account. Thus, for more general marginals for Gt, we need to consider the more general model

G̃s = Σ_{i=−∞}^{s} δθi Vi,s−i+1 ∏_{j=i+1}^{s} (1 − Vj,s−j+1),    (6)

where Vj,1, Vj,2, ... denotes the value of the break introduced at time j as it evolves over time and Vj,t represents its value at time j + t. The stochastic processes {Vi,t} and {Vj,t} must be independent for i ≠ j if the process Gt is to have a stick-breaking form. If we want Gt to follow a stick-breaking process with parameters a and b (see Definition 1) then the marginal distribution of Vi,t has to be distributed as Be(at, bt) for all i and t and clearly we need time dependence of the breaks. The following theorem allows us to define a reversible stochastic process Vi,t which has the correct distributions at all time points to define a stationary process whose marginal process has a given stick-breaking form. In the following theorem we define that if X is distributed as Be(a, 0) then X = 1 with probability 1 and if X ∼ Be(0, b) then X = 0 with probability 1.

Theorem 1. Suppose that at+1 ≥ at and bt+1 ≥ bt. Then Vj,t+1 = wj,t+1 Vj,t + (1 − wj,t+1) ϵj,t+1, where wj,t+1 ∼ Be(at + bt, at+1 + bt+1 − at − bt), ϵj,t+1 ∼ Be(at+1 − at, bt+1 − bt) and wj,t+1 is independent of ϵj,t+1, implies that the marginal distributions of Vj,t and Vj,t+1 are Vj,t ∼ Be(at, bt) and Vj,t+1 ∼ Be(at+1, bt+1).
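Theorem 1 gives a direct simulation recipe for evolving a single break while preserving its Beta marginals. The following Python sketch implements one transition, with the Be(c, 0) and Be(0, c) conventions handled as point masses; the Poisson–Dirichlet sequences used in the example (at = 1 − a, bt = M + at) are an assumption for illustration.

import numpy as np

def evolve_break(v_t, a_t, b_t, a_next, b_next, rng):
    """One Theorem 1 transition: V_{j,t+1} = w V_{j,t} + (1 - w) eps with
    w ~ Be(a_t + b_t, a_next + b_next - a_t - b_t) and
    eps ~ Be(a_next - a_t, b_next - b_t); degenerate Beta distributions
    are treated as point masses at 0 or 1, as in the text."""
    def beta_or_point(c, d):
        if c <= 0.0:
            return 0.0      # Be(0, d) puts all mass at 0
        if d <= 0.0:
            return 1.0      # Be(c, 0) puts all mass at 1
        return rng.beta(c, d)
    w = beta_or_point(a_t + b_t, a_next + b_next - a_t - b_t)
    eps = beta_or_point(a_next - a_t, b_next - b_t)
    return w * v_t + (1.0 - w) * eps

# Poisson-Dirichlet case: a_m = 1 - a constant, b_m = M + a*m nondecreasing,
# so eps ~ Be(0, a) is a point mass at zero and the break only decays.
rng = np.random.default_rng(2)
a_par, M = 0.3, 1.0
v = rng.beta(1 - a_par, M + a_par)     # V_{j,1} ~ Be(a_1, b_1)
for m in range(1, 6):                  # evolve the break over five arrivals
    v = evolve_break(v, 1 - a_par, M + a_par * m,
                     1 - a_par, M + a_par * (m + 1), rng)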
The application of this theorem allows us to construct stochastic processes with the correct margins for any stick-breaking process for which a1, a2, ... and b1, b2, ... form nondecreasing sequences. Other choices can be accommodated, but nondecreasing sequences allow for the important case of Poisson–Dirichlet marginals. In addition, they are the most computationally convenient since the transitions of the stochastic process are mixtures, for which we have well-developed simulation techniques. In this case, Theorem 1 leads to a stick-breaking representation for Vj,t, which will be formalised in a continuous-time setting in the following definition for our general class of nonparametric models.

Definition 2. Assume that τ1, τ2, τ3, ... follow a homogeneous Poisson process with intensity λ and the count mj(t) is the number of arrivals of the point process between τj and t, so that mj(t) = #{k | τj < τk < t}. Let Π be a stick-breaking process (see Definition 1) with parameters a, b, H where a and b are nondecreasing sequences. Define

Gt = Σ_{j=1}^{∞} pj(t) δθj

where θ1, θ2, θ3, ... ∼ H i.i.d. and

pj(t) = 0 if τj > t,  pj(t) = Vj(t) ∏_{k | τj < τk < t} (1 − Vk(t)) if τj < t,

Vj(t) = Σ_{l=1}^{mj(t)+1} ϵj,l (1 − wj,l) ∏_{i=l+1}^{mj(t)+1} wj,i

with ϵj,1 ∼ Be(a1, b1), wj,1 = 0 and wj,m ∼ Be(am−1 + bm−1, am + bm − am−1 − bm−1), ϵj,m ∼ Be(am − am−1, bm − bm−1) for m ≥ 2. Then we call Gt a Π-AR (for Π-autoregressive) process, denoted as Π-AR(a, b, H; λ).

This process is strictly stationary and so it will also mean revert. A new atom (or a new cluster in the mixture model) is introduced at each jump point of the Poisson process and other atoms are downweighted (as with the process with Dirichlet process marginal). However, the values of Vj,t will also decay as t increases, in line with Theorem 1. The mixture with the Π-AR as the mixing distribution defines a standard change-point model (Carlin et al., 1992) if Gt is a single atom for all t. Therefore the new hierarchical model defines a generalisation of change-point models that allow observations to be drawn from a mixture distribution where components will change according to a point process. Alternatively, as λ → 0, the process tends to the corresponding static nonparametric model and as λ → ∞ the process becomes uncorrelated in time. We have restricted attention to nondecreasing sequences since efficient inferential methods can be developed. This definition could be extended to define priors where the marginal process is stick-breaking but one or both parameter sequences are decreasing.⁴

⁴ Details can be found in a working paper version entitled ''Time-Dependent Stick-Breaking Processes'' which is freely available at http://www.warwick.ac.uk/go/crism/research/2009/paper09-05.

3.1. Special cases

3.1.1. Dirichlet process marginal

The Dirichlet process is the most commonly used nonparametric prior for the mixing distribution which arises when aj = 1 and bj = M for all j. So here we do not need to apply Theorem 1. The Π-AR with a Dirichlet process marginal (denoted as DPAR model) has a simple form with ϵj,1 ∼ Be(1, M) and wj,m = 1, ϵj,m = 0 for all m ≥ 2 so that Vj(t) = ϵj,1 for all t. Writing Vj = ϵj,1 motivates the following definition.

Definition 3. Let τ1, τ2, ... follow a homogeneous Poisson process with intensity λ and V1, V2, ... ∼ Be(1, M) i.i.d. Then we say that {Gt} follows a DPAR(M, H; λ) if

Gt = Σ_{j=1}^{∞} pj(t) δθj

where θ1, θ2, ... ∼ H i.i.d. and

pj(t) = 0 if τj > t,  pj(t) = Vj ∏_{k | τj < τk < t} (1 − Vk) if τj < t.
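A DPAR path is straightforward to simulate from Definition 3. The Python sketch below truncates the infinite past at a user-chosen horizon t_back, which is our approximation; the atoms themselves and the centring distribution H are left implicit.

import numpy as np

def simulate_dpar_weights(M, lam, t_grid, t_back=50.0, seed=0):
    """Simulate the weights of a DPAR(M, H; lambda) path at the times in
    t_grid, truncating atoms that arrive before t_grid[0] - t_back."""
    rng = np.random.default_rng(seed)
    t0, t1 = t_grid[0] - t_back, t_grid[-1]
    n_arr = rng.poisson(lam * (t1 - t0))
    tau = np.sort(rng.uniform(t0, t1, n_arr))   # Poisson process arrival times
    v = rng.beta(1.0, M, n_arr)                 # V_j ~ Be(1, M), fixed over time
    weights = []
    for t in t_grid:
        va = v[tau < t][::-1]                   # active atoms, newest first
        # stick-breaking applied backwards in time: the newest atom gets
        # the first break, older atoms are discounted by later (1 - V_k).
        p = va * np.concatenate(([1.0], np.cumprod(1.0 - va)[:-1]))
        weights.append(p[::-1])                 # return to arrival order
    return tau, weights

tau, w = simulate_dpar_weights(M=1.0, lam=1.0, t_grid=np.linspace(0.0, 10.0, 11))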
An important property of the Dirichlet process is the availability of a Pólya urn scheme representation of a sample drawn from a distribution with a Dirichlet process prior. This representation implies that if x1, x2, x3, ... are i.i.d. with distribution G and G ∼ DP(MH) then xn+1 | x1, x2, ..., xn is drawn from the probability measure

[M/(M + n)] H + Σ_{i=1}^{n} [1/(M + n)] δxi.

Thus, this means that xn+1 will either share its value with one of the previously obtained xi's or will be a new value drawn from the distribution H. Importantly, this is a mixture distribution with a finite number of components, and this representation typically leads to a considerable simplification in our computations, since it allows us to marginalise with respect to G. We shall now show that the DPAR model can also be represented sequentially through mixture distributions with a finite number of components.

Let x1, x2, ..., xn be a sequence of values drawn at times t1, t2, ..., tn respectively. Suppose that xi ∼ Gti and {Gt} ∼ DPAR(M, H; λ). We will construct a representation of the process by sequentially drawing xi conditional on x1, x2, ..., xi−1 and t1, t2, ..., ti. The DPAR model assumes that each value x1, x2, ..., xn is introduced into Gt at particular time points and the set of these times is Tn. The size of Tn will be denoted by kn. It will be useful to prove the following lemma before the representation. This concerns the effect of the previous observations on the distribution of T = (τ1, τ2, τ3, ...). We will work with allocation variables s1, s2, ..., sn which are chosen so that xi = θsi and define an active set, An(t) = {i | ti ≥ t and τsi < t for 1 ≤ i ≤ n}, which contains the observations available to be allocated at time t. If all observations are made at the same time, the model reduces to the standard Dirichlet process and An(t) will be an increasing process. However, once we make observations at different times then the process will be increasing between observed times but it can be decreasing at these times. To derive a convenient representation of the DPAR it is useful to consider the following lemma which derives the distribution of τ conditional on Tn and a set of m ≥ n times t1, t2, ..., tm.

Lemma 1. Let Sn,m = Tn ∪ {t1, ..., tm} where m ≥ n, which has size ln,m, and let TnC be the set difference of T and Tn. If φ1 < φ2 < ··· < φln,m are the elements of Sn,m, the distribution of TnC conditional on s1, s2, ..., sn and Sn,m is an inhomogeneous Poisson process with a piecewise constant intensity, f(·), where

f(u) = λ,  −∞ < u ≤ φ1
f(u) = λ M/(M + An(φi)),  φi−1 < u ≤ φi, 2 ≤ i ≤ ln,m
f(u) = λ,  φln,m < u < ∞.

This lemma implies that the intensity of the underlying Poisson process falls as An(φi) increases, with larger values of M associated with smaller decreases. Intuitively, the knowledge that xi arrives in Gt at time τsi reduces the chance of observing new values between τsi and ti. The following theorem now gives a finite representation of a sample from a DPAR prior.

Theorem 2. Let τ1⋆ < τ2⋆ < ··· < τkn⋆ be the elements of Tn and φ1 < φ2 < ··· < φln,m+1 be the elements of Sn,m+1. We define

Ci = exp{−λM²(φi+1 − φi) / [(M + An(φi+1))²(1 + M + An(φi+1))]}
   = ρ^(M²(M+1)(φi+1−φi) / [(M+An(φi+1))²(1+M+An(φi+1))]),  1 ≤ i < ln,m+1

and

Di = (M + An(τi⋆)) / (1 + ηi + M + An(τi⋆)),  1 ≤ i ≤ kn

where ηi = Σ_{j=1}^{n} I(sj = i). Let φp = tm+1 and τq⋆ be the largest element of Tn smaller than tm+1. The distribution of sn+1 given Sn,m, s1, ..., sn, tn+1 can be represented by

p(sn+1 = j) = (1 − Dj) ∏_{h | τj⋆ ≤ φh ≤ φp} Ch ∏_{i=j+1}^{q} Di,  1 ≤ j ≤ q

p(sn+1 = kn + 1 and τkn+1⋆ ∈ (φj, φj+1)) = (1 − Cj) ∏_{h=j+1}^{p} Ch ∏_{i | φj < τi⋆ ≤ φp} Di

p(sn+1 = kn + 1 and τkn+1⋆ ∈ (−∞, φ1)) = ∏_{h=1}^{p} Ch ∏_{i=1}^{q} Di.

Let TEx(a,b)(λ) represent an exponential distribution truncated to (a, b) with p.d.f. f(x) ∝ λ exp{−λx}, a < x < b. If sn+1 = kn + 1 then the new time τkn+1⋆ is drawn as follows: if τkn+1⋆ ∈ (−∞, φ1), τkn+1⋆ = φ1 − x where x ∼ Ex(λ/(M + 1)), which represents an exponential distribution with mean (M + 1)/λ, and if τkn+1⋆ ∈ (φj, φj+1), τkn+1⋆ = φj+1 − x where x ∼ TEx(0, φj+1−φj)(λM² / [(M + An(φj+1))²(M + An(φj+1) + 1)]).

Thus, a value xn+1 drawn at time tn+1 from this DPAR nonparametric distribution is equal to θj (which has been previously drawn) with probability p(sn+1 = j) or is assigned a new value drawn from H with arrival time τkn+1⋆ as specified in the theorem. This representation of the predictive distribution has a stick-breaking structure with a finite number of breaks. The representation defines a probability distribution for any choice of C and D. However, it is not clear whether other choices of C and D would follow from a time series model with marginals that define exchangeable sequences drawn from random probability measures.

3.1.2. Poisson–Dirichlet process marginal

The stick-breaking representation of the Poisson–Dirichlet process has Vj ∼ Be(1 − a, M + aj) where 0 ≤ a < 1 and M > −a, and the Dirichlet process is the special case when a = 0. Applying Theorem 1 gives wj,m ∼ Be(1 + M + a(m − 2), a) and ϵj,m = 0 for m ≥ 2 and leads to the following definition.

Definition 4. Let τ1, τ2, τ3, ... follow a homogeneous Poisson process with intensity λ. We say that {Gt} follows a Poisson–Dirichlet autoregressive model, denoted by PDAR(a, M, H; λ), if

Gt = Σ_{j=1}^{∞} pj(t) δθj

where θ1, θ2, θ3, ... ∼ H i.i.d. and

pj(t) = 0 if τj > t,  pj(t) = Vj(t) ∏_{k | τj < τk < t} (1 − Vk(t)) if τj < t,

with Vj(t) = ϵj ∏_{i=2}^{mj(t)+1} wj,i
where ϵj ∼ Be(1 − a, a + M) and wj,m ∼ Be(1 + M + a(m − 2), a) for m > 1. This will always define a process with a Poisson–Dirichlet marginal.

Fig. 2 shows some realisations of the Vj(t) process. For comparison, in the Dirichlet process case, Vj(t) would be constant. The effect of the Poisson–Dirichlet extension is to discount the value of Vj(t) over time. As pj(t) involves the product of factors (1 − Vk(t)) for τj < τk < t, this leads to larger pj(t) for atoms that were introduced in the past than under the Dirichlet process-generating scheme. As a increases the process for the Vj(t) is increasingly discounted and we generate larger autocorrelations for past shocks. This effect is illustrated by the shape of the autocorrelation function shown in Fig. 3, which gives the autocorrelation function for various values of M and a and for a fixed value of λ. Larger values of a lead to increasingly slow decay of the autocorrelation function for fixed M. Larger values of M for fixed a lead to non-negligible autocorrelation at longer lags (as with the Dirichlet Process AR which corresponds to a = 0).

The Poisson–Dirichlet process gives us an extra parameter compared to the Dirichlet process which is related to the dispersion of the distribution of the number of clusters in a sample of size n drawn from the process. Larger values of a are related to a more dispersed distribution for given M. In our time series model, the parameter a controls the number of clusters in a sample of size n at time t given that there were C clusters at time t − 1 in a sample of the same size. Fig. 4 shows the distribution for n = 20 and various values of a and C. For a fixed value of C, the mode of the distributions seems unchanged by the value of a but the right-hand tail has more mass on larger numbers of clusters if a increases. This provides the PDAR process with the extra flexibility (over the Dirichlet process) to model situations where the number of clusters underlying the data is rapidly changing.

4. Computational methods

Inference for these models can be made using Markov chain Monte Carlo methods. Samplers for Dirichlet process marginals can be defined using the generalised Pólya urn scheme developed in Section 3.1.1. Samplers for other marginal processes can be implemented by an extension of the retrospective sampler for Dirichlet processes (Papaspiliopoulos and Roberts, 2008). Appendix B groups most of the details; here we specifically focus on introducing latent variables that make Gibbs updating simpler by exploiting the stick-breaking form of Vj(t) induced by Theorem 1.

We assume that the arrival times of the atoms are τ1 < τ2 < τ3 < ··· and that associated with τj we have sequences ϵj = (ϵj,1, ϵj,2, ...) and wj = (wj,2, wj,3, ...). We will follow standard methods for stick-breaking mixtures by introducing latent variables s1, s2, ..., sn that allocate each observation to one of the atoms. We define ri by τri = max{τj | τj < ti}.
Fig. 2. Four realisations of Vj (t ) for different values of a and M in the Poisson–Dirichlet process with λ = 1.
Fig. 3. The autocorrelation function for various choices of a and M for a fixed value of λ.
Fig. 4. Number of clusters in 20 draws at time 2 conditional on C clusters in 20 draws at time 1: M = 1 and λ = 1.
J.E. Griffin, M.F.J. Steel / Journal of Econometrics 162 (2011) 383–396
=
m (t )+1 j− i
mj (ti )+1
ϵj,l (1 − wj,l )
l =1
h=l+1
(ti )+1 k−
∏
mk (ti )+1
(1 − ϵk,l )(1 − wk,l )
k=j+1
l =1
∏
wk,h .
h=l+1
This form is not particularly helpful for simulation of the posterior distribution but we define a more convenient form by introducing latent variables ζ as follows mk (ti )+1
∏
p(ζijk = l) = (1 − wk,l )
wk,h ,
h=l+1
1 ≤ l ≤ mj (ti ) + 1, 1 ≤ i ≤ n, 1 ≤ j ≤ k ≤ ri
p(si = j|ζ ) = ϵj,ζijj
(1 − ϵk,ζijk ).
In effect, each stick-break is a mixture distribution and the indicators ζ choose components of that mixture distribution. Then p(s, ζ ) =
i =1
×
ri ∏
ϵsi ,ζisi si
(1 − ϵk,ζisi k )
k=si +1
ri ri ∏ n ∏ ∏ i=1 j=si k=j
(1 − wk,ζijk )
ρ δ σv
Posterior median
95% HPD
0.37 0.995 0.36 0.11
(0.14, 0.85) (0.992, 0.998) (0.01, 0.87) (0.04, 0.40)
features. A stochastic volatility model assumes that the conditional variance follows a stochastic process. If y1 , y2 , . . . , yT are the observations then a typical discrete time specification assumes that yt =
ht ϵt
log ht ∼ N(α + δ log ht −1 , σv2 ).
k=j+1
n ∏
M
where ϵt are i.i.d. from some returns distribution and the conditional variance ht is modelled by
and ri ∏
389
Table 1 Posterior inference for some parameters of the model for the Standard and Poors data.
wj,h
m
ri
×
∏
mj (ti )+1
∏
wk,h
h=ζijk +1
which is a form that will be useful for simulation of the full conditional distributions of w and ϵ which will be beta distributed. Further details on the MCMC algorithms proposed here are contained in Appendix B. We compare the marginal (using the finite mixture description of the predictive in Section 3.1.1) and the general conditional algorithms for the DPAR model in Appendix C. Results clearly indicate the superiority of the marginal algorithm, which will thus be used for the DPAR model in the following examples. 5. Examples We consider three examples: modelling returns of a stock index using a nonparametric stochastic volatility model, and modelling regional GDP data and regional growth data using a nonparametric mixture model. The first example is a single, long time series whilst the latter two examples involve a short panel of many time series. In all examples, the parameter M is given an exponential prior distribution with mean one. The parameter λ controls the rate at which distribution changes over time and so the prior distribution for λ represents our belief about how quickly the distributions are changing. In these examples, we used exponential priors with mean µλ . This implies that the correlation in the DPAR model in (4) at lag µ1 has a Be(M + 1, 1) prior distribution, given M. We λ used µλ = 0.01 in the financial time series example with daily data and µλ = 1 in the yearly regional GDP and regional growth data. This suggests that the distributions are changing over time but not rapidly. The methods described in this paper are rather computationally intensive. For example, the financial time series example was run for 30 000 iterations which took about 30 h on an Apple Imac with a Intel 2.16 GHz Core Duo processor in MatLab. 5.1. Financial time series Financial time series often show a number of stylised features such as heavy tails and volatility clustering. Building models with stochastic volatility is a popular method for capturing these
The model is identified by either assuming that the variance of $\epsilon_t$ is 1 or that α = 0. The distribution of $\epsilon_t$ is usually taken to be normal. However, financial time series often contain large values which may not be fully explained by the time-varying variance. This has motivated the use of other choices. Jacquier et al. (2004) considered using t-distributions. Jensen and Maheu (2010) consider a full Bayesian nonparametric model and show that the return distribution for an asset index may not be well represented by either normal or t distributions. They model the returns distribution with a Dirichlet process mixture of normals. We define a model that allows the returns distribution to change over time. The hierarchical model⁵ can be written as

$$y_t/\sqrt{h_t} \sim \mathrm{N}(\mu_t, \sigma_t^2)$$
$$\log h_t \sim \mathrm{N}(\delta \log h_{t-1}, \sigma_v^2)$$
$$(\mu_t, \sigma_t^2) \sim G_t$$
$$G_t \sim \mathrm{DPAR}(M, H; \lambda),$$

where $H(\mu, \sigma^{-2}) = \mathrm{N}(\mu|0, 0.01\sigma^2)\,\mathrm{Ga}(\sigma^{-2}|1, 0.1)$ and Ga(x|c, d) represents a Gamma distribution with shape c and mean c/d. The parameters δ and $\sigma_v^2$ are given relatively flat priors over the relevant ranges: $\sigma_v^{-2} \sim \mathrm{Ga}(1, 0.005/2)$ and $\delta \sim \mathrm{N}(0, 10)$ truncated to [0, 1], as described in Jacquier et al. (2004). Jensen and Maheu (2010) describe computational methods for the Dirichlet process-based model, which can be extended using the method in Section 3.1.1.

⁵ In this model, we set α = 0 rather than impose the equivalent of Var[$\epsilon_t$] = 1.

The method is applied to the daily returns of the Standard and Poors index from January 1, 1980 to December 30, 1987, as shown in the top left panel of Fig. 5. Posterior inference about the parameters of the model is shown in Table 1.

Table 1
Posterior inference for some parameters of the model for the Standard and Poors data.

          Posterior median   95% HPD
M         0.37               (0.14, 0.85)
ρ         0.995              (0.992, 0.998)
δ         0.36               (0.01, 0.87)
σv        0.11               (0.04, 0.40)

The posterior median of M is around 0.37, implying that the nonparametric distribution has two or three normals with nonnegligible mass, and the first-order autocorrelation, ρ, is large, indicating that the distributions do not change rapidly over time. The estimate of δ is much smaller than usually found in stochastic volatility models (where δ is often estimated to be greater than 0.9) and suggests that much of the dependence in the volatility can be explained by changes in its mean level. The posterior predictive inference for the scaled returns distributions (i.e. $y_t/\sqrt{h_t}$) is shown in panels (a)–(c) of Fig. 5.

[Fig. 5. Inference for the Standard and Poors data set: (a) selected predictive density functions for scaled returns at times 481 (solid line), 681 (dotted line) and 1181 (dashed line); (b) heatmap of the predictive density functions of the scaled returns at each time point (darker colours represent higher density values); (c) variance of the scaled returns distribution over time.]

As suggested by the estimates of M and ρ, we have results that would be roughly consistent with a changepoint analysis in which several regions with very similar returns distributions have been identified. Fig. 5(a) shows representative distributions for the main periods and illustrates the range of
shapes in the scaled returns distribution. A heatmap of these distributions over time is given in Fig. 5(b). The main difference between the distributions is their spread, and the variance of the fitted distributions is shown in Fig. 5(c). The results are extremely smooth and can be thought of as representing an estimate of underlying, long-run volatility (since daily changes in volatility are captured through the volatility equation). A parametric analysis assuming that the returns distribution is normal leads to an estimate of the long-run variance of 1.66, which is roughly an average of our nonparametric estimates.

5.2. Regional GDP data

The data contain the real (log) per capita GDP of 110 EU regions from 1977 to 1996 and have been previously analysed by Grazia Pittau and Zelli (2006). We ignore the longitudinal nature of the data and assume that the problem can be treated as density estimation over time with independent measurements, and model

$$y_{it} \sim \mathrm{N}(\mu_{it}, \sigma_{it}^2)$$
$$(\mu_{it}, \sigma_{it}^2) \sim G_t \qquad (7)$$
$$G_t \sim \mathrm{DPAR}(M, H; \lambda),$$

where $H(\mu, \sigma^{-2}) = \mathrm{N}(\mu|\mu_0, 0.01\sigma^2)\,\mathrm{Ga}(\sigma^{-2}|1, 0.1)$. The hyperparameter $\mu_0$ represents an overall mean value for the data and the sample mean is adopted as a suitable value. Results are
presented in Fig. 6. Panel (a) shows a heatmap of the estimated distribution plotted for each year. The most striking feature of the plots is that the distribution changes from 1988 to 1989, with larger values observed from 1989 onwards. It is clear that the model very much behaves like a change-point model with one changepoint. Panel (b) shows the estimated densities for each year. The change in the main mode of the distribution is obvious, but there is also a change in other aspects of the distribution. To check whether this change in distribution is supported by the data, we fitted independent Dirichlet process mixture models to each year. The results are presented in panel (c) and support the two main distributions inferred from the data. In fact, the yearly distributions are very similar for the second period (post-1988). It is interesting to note that the density estimates are much smoother for the independent models than for the DPAR model, which is due to the smaller amount of information available for each estimate.

[Fig. 6. Income data: (a) heatmap of the estimated distribution for each year using a DPAR mixture model; (b) density estimates for each year using DPAR; (c) density estimates for each year using independent Dirichlet process mixture models (pre-1989 shown in light grey and other years in black).]

5.3. Regional growth data

The data consist of annual per capita GDP growth rates for 258 NUTS2 European regions covering the period from 1995 to 2004 (NUTS2 regions provide a roughly uniform breakdown of the EU into territorial units). The data are modelled using a mixture of normal distributions as in model (7), with the exception that we now use the model with Poisson–Dirichlet marginals: $G_t \sim \mathrm{PDAR}(a, M, H; \lambda)$. We consider the cases where a is fixed and where a is given a uniform prior distribution on [0, 1). Fig. 7 shows the posterior distribution of a, which places its mass on smaller values of a (under 0.3) with mode around 0.05.

[Fig. 7. Posterior distribution of a for the NUTS data.]

The yearly predictive distribution of growth is shown in Fig. 8 for the PDAR model with a = 0.05, a = 0.1 and a unknown. This figure also presents results for a model where the distribution of each year's growth is estimated independently with a Dirichlet process mixture. The results are remarkably consistent across all models. This is perhaps not surprising since we are looking at posterior means with a substantial amount of data in the sample and large differences between each year's distributions.

[Fig. 8. Heatmaps of the fitted yearly distribution of growth for the NUTS data.]

When the data are thinned at random to a sample of 60 regions over 9 years, Fig. 9 contrasts the results for the independent DP and the PDAR model with a = 0.05. The results for the PDAR model then show more smoothing, particularly when distributions in consecutive years are similar, as we would expect from a "changepoint" type analysis.

[Fig. 9. Heatmaps of the fitted yearly distribution of growth for the thinned NUTS data.]

However, even with the full data set there are differences between the posterior distributions of the parameters of the model. Fig. 10 shows the posterior distribution of λ, which is the mean number of new clusters introduced each year. The distribution is concentrated between 2 and 4. Once again, this indicates the large differences between the distributions for each year. The mean of λ is 2.69 when a = 0.05 and 2.98 when a = 0.1. When a is unknown the mean, 2.87, falls between these two values.

[Fig. 10. Posterior distribution of λ for the NUTS data.]

As we increase a in the Poisson–Dirichlet model, we are more likely to introduce smaller components, which allows the introduction of larger numbers of components at each year. This idea is supported by the posterior distribution of the number of clusters: the median number of clusters is 32 if a = 0.05 and 36 if a = 0.1. When a is unknown the median number of clusters is 34. Fig. 11 displays its posterior distribution and also presents the posterior distribution of the number of clusters when we model the data in each year with independent Dirichlet process mixture models. In the latter case we obtain a substantially larger number of clusters (the median is 88, i.e. roughly two-and-a-half times the number under the time-dependent model). This suggests that, despite the lack of similarity between the distributions of each year, some clusters can usefully be carried over from one year to the next.

[Fig. 11. Posterior distribution of the number of clusters for the NUTS data.]
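The role of a can be illustrated directly from the stick-breaking representation of the Poisson–Dirichlet (Pitman–Yor) process, under which $V_j \sim \mathrm{Be}(1-a, M+ja)$. The sketch below (Python/NumPy; the truncation level, replication count and M = 1 are arbitrary choices, and it draws from a single marginal G rather than from the dependent PDAR process) shows how the number of clusters in a sample of 258 observations, matching the number of NUTS2 regions, tends to grow with a:

```python
import numpy as np

rng = np.random.default_rng(1)

def pd_weights(a, M, trunc=2000):
    """Truncated stick-breaking draw from a Poisson-Dirichlet (Pitman-Yor) process:
    V_j ~ Be(1 - a, M + j*a), weight_j = V_j * prod_{l<j} (1 - V_l)."""
    j = np.arange(1, trunc + 1)
    V = rng.beta(1.0 - a, M + j * a)
    w = V * np.concatenate(([1.0], np.cumprod(1.0 - V)[:-1]))
    return w / w.sum()          # renormalise the truncated weights

def median_n_clusters(a, M, n=258, reps=200):
    counts = []
    for _ in range(reps):
        w = pd_weights(a, M)
        s = rng.choice(len(w), size=n, p=w)   # allocate n observations
        counts.append(len(np.unique(s)))
    return np.median(counts)

for a in (0.0, 0.05, 0.1):    # a = 0 recovers the Dirichlet process
    print(a, median_n_clusters(a, M=1.0))
```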
6. Discussion

This paper introduces, develops and implements inference with a new class of time-dependent measure-valued processes with stick-breaking marginals, which can be used as a prior distribution in Bayesian nonparametric time-series modelling. The Dirichlet process and Poisson–Dirichlet process marginals arise as natural special cases. We derive a generalised Pólya urn scheme for the Dirichlet process case which allows us to develop a new algorithm using a marginalised method. This method typically leads to better mixing of the parameters (particularly the intensity parameter of the Poisson process). We also develop a conditional simulation method using retrospective sampling methods when the parameters of the stick-breaking process are nondecreasing. Moving from Dirichlet process to Poisson–Dirichlet process marginals allows for more flexibility in the conditional distribution of the number of clusters in a sample of size n at time t given the number at time t − 1. The processes provide smoothed estimates of the distributions of interest. The models can behave like a change-point model, which allows the discovery of periods where distributions are relatively unchanged.

Acknowledgements

The authors would like to acknowledge helpful comments from the Co-Editor and anonymous reviewers and from seminar audiences at the Universities of Newcastle, Nottingham, Bath, Sheffield and Bristol, Imperial College London and the Gatsby Computational Neuroscience group.
Appendix A

A.1. Proof of Theorem 1

We will use the notation Ga(a) to denote a Gamma distribution with shape a and unitary scale. A standard property of beta random variables implies that $V_{j,t} = q_{j,t}/(q_{j,t} + r_{j,t})$, where $q_{j,t} \sim \mathrm{Ga}(a_t)$, $r_{j,t} \sim \mathrm{Ga}(b_t)$ and $q_{j,t}$ and $r_{j,t}$ are independent for all j and t. Let $q_{j,t+1} = q_{j,t} + x_{j,t+1}$ and $r_{j,t+1} = r_{j,t} + z_{j,t+1}$, where $x_{j,t+1} \sim \mathrm{Ga}(a_{t+1} - a_t)$, $z_{j,t+1} \sim \mathrm{Ga}(b_{t+1} - b_t)$ and $x_{j,t+1}$ and $z_{j,t+1}$ are independent; then $q_{j,t+1} \sim \mathrm{Ga}(a_{t+1})$, $r_{j,t+1} \sim \mathrm{Ga}(b_{t+1})$ and $q_{j,t+1}$ and $r_{j,t+1}$ are independent. Writing

$$V_{j,t+1} = \frac{q_{j,t+1}}{q_{j,t+1} + r_{j,t+1}} = w_{j,t+1} V_{j,t} + (1 - w_{j,t+1})\,\epsilon_{j,t+1},$$

where $w_{j,t+1} = \frac{q_{j,t} + r_{j,t}}{q_{j,t} + r_{j,t} + x_{j,t+1} + z_{j,t+1}}$ and $\epsilon_{j,t+1} = \frac{x_{j,t+1}}{x_{j,t+1} + z_{j,t+1}}$, implies that $V_{j,t+1}$ is beta distributed with the correct parameters. Standard properties of beta and gamma distributions show that $x_{j,t+1} + z_{j,t+1}$ is independent of $\epsilon_{j,t+1}$, that $w_{j,t+1}$ is independent of $\epsilon_{j,t+1}$, and that $w_{j,t+1} \sim \mathrm{Be}(a_t + b_t, a_{t+1} + b_{t+1} - a_t - b_t)$ and $\epsilon_{j,t+1} \sim \mathrm{Be}(a_{t+1} - a_t, b_{t+1} - b_t)$.

A.2. Proof of Lemma 1

Let $k_i = \#\{j \mid \phi_i < \tau_j < \phi_{i+1}\}$ for $1 \le i \le l_{n,m} - 1$. Then

$$p(k_1, k_2, \ldots, k_{l_{n,m}-1} \mid s_1, s_2, \ldots, s_n) \propto \prod_{i=1}^{l_{n,m}-1} \left(\frac{M}{M + A_n(\phi_{i+1})}\right)^{k_i} \frac{(\lambda(\phi_{i+1} - \phi_i))^{k_i}}{k_i!} \exp\{-\lambda(\phi_{i+1} - \phi_i)\},$$

which shows that the number of points on $(\phi_i, \phi_{i+1})$ is Poisson distributed with mean $\frac{M}{M + A_n(\phi_{i+1})}\lambda(\phi_{i+1} - \phi_i)$. The position of the points is unaffected by the likelihood and so the posterior is a Poisson process. There is no likelihood contribution for the intervals $(-\infty, \phi_1)$ and $(\phi_{l_{n,m}}, \infty)$. Since the Poisson process has independent increments, the posterior distribution on these intervals is also a Poisson process with intensity λ.

A.3. Proof of Theorem 2

In order to calculate the predictive distribution we need to calculate the probability of generating the sample $s_1, s_2, \ldots, s_n$, which is given by

$$p(s_1, s_2, \ldots, s_n) = E[p(s_1, \ldots, s_n \mid V_1, V_2, V_3, \ldots, \tau_1, \tau_2, \tau_3, \ldots)].$$

This expectation can be derived by first noting that

$$p(s_1, \ldots, s_n \mid V_1, V_2, V_3, \ldots, \tau_1, \tau_2, \tau_3, \ldots) \propto \prod_{i \in R} V_i^{\eta_i} (1 - V_i)^{A_n(\tau_i)},$$

where $R = \{i \mid \min\{\tau_{s_i}, 1 \le i \le n\} \le \tau_i \le \max\{t_i, 1 \le i \le n\}\}$. Marginalising over V gives

$$p(s_1, \ldots, s_n \mid \tau_1, \tau_2, \tau_3, \ldots) = \prod_{i \in R} \frac{M\,\eta_i!\,\Gamma(M + A_n(\tau_i))}{\Gamma(M + 1 + \eta_i + A_n(\tau_i))}.$$

From Lemma 1, $\#\{j \mid \phi_{i-1} < \tau_j < \phi_i\}$ is Poisson distributed with mean $\lambda \frac{M}{M + A_n(\phi_i)}(\phi_i - \phi_{i-1})$, and so it follows that

$$p(s_1, \ldots, s_n \mid S_{n,m}) = \prod_{i=1}^{k_n} \frac{M\,\eta_i!\,\Gamma(M + A_n(\tau_i^\star))}{\Gamma(M + 1 + \eta_i + A_n(\tau_i^\star))} \times \prod_{i=2}^{l_{n,m}} \left(\frac{M}{M + A_n(\phi_i)}\right)^{\#\{j \mid \phi_{i-1} < \tau_j < \phi_i\}} \exp\{-\lambda(\phi_i - \phi_{i-1})\}.$$

It also follows that if $s_{n+1} = j$ where $j \le k_n$ then, if $m \ge n + 1$,

$$p(s_1, \ldots, s_n, s_{n+1} = j \mid S_{n,m}) = \frac{M(\eta_j + 1)!\,\Gamma(M + A_n(\tau_j^\star))}{\Gamma(M + 2 + \eta_j + A_n(\tau_j^\star))} \prod_{i=1;\,i \neq j}^{k_n} \frac{M\,\eta_i!\,\Gamma(M + A_n(\tau_i^\star))}{\Gamma(M + 1 + \eta_i + A_n(\tau_i^\star))} \times \prod_{i=2}^{l_{n,m}} \left(\frac{M}{M + A_n(\phi_i)}\right)^{\#\{h \mid \phi_{i-1} < \tau_h < \phi_i\}} \exp\{-\lambda(\phi_i - \phi_{i-1})\},$$

so that

$$p(s_{n+1} = j \mid S_{n,m}, s_1, \ldots, s_n) = \frac{p(s_1, \ldots, s_n, s_{n+1} = j \mid S_{n,m})}{p(s_1, \ldots, s_n \mid S_{n,m})},$$

and after some algebra we get the form in the Theorem. Otherwise, $s_{n+1} = k_n + 1$ and we need to calculate $p(s_1, \ldots, s_n, s_{n+1} = k_n + 1, \tau_{k_n+1}^\star \in (\phi_{i-1}, \phi_i) \mid S_{n,m})$. It is clear that each point τ in $(\phi_{i-1}, \phi_i)$ contributes a factor

$$\begin{cases} \dfrac{M}{(M + A_n(\phi_i))(M + 1 + A_n(\phi_i))} & \text{if } \tau = \tau_{k_n+1}^\star \\[1ex] \dfrac{M}{M + A_n(\phi_i) + 1} & \text{if } \tau_{k_n+1}^\star < \tau < \phi_i \\[1ex] \dfrac{M}{M + A_n(\phi_i)} & \text{if } \phi_{i-1} < \tau < \tau_{k_n+1}^\star. \end{cases}$$

Let $\tau_{k_n+1}^\star = (1 - w)\phi_i + w\phi_{i-1}$. Conditioning on the number of points,

$$E\left[\left(\frac{M}{M + A_{n+1}(\phi_i)}\right)^{\#\{j \mid \phi_{i-1} < \tau_j < \phi_i\}} \,\middle|\, \tau_{k_n+1}^\star,\ \#\{j \mid \phi_{i-1} < \tau_j < \phi_i\} = k\right] = \frac{M}{(M + A_n(\phi_i))(M + 1 + A_n(\phi_i))}\left[\frac{M}{M + A_n(\phi_i) + 1}\,w + \frac{M}{M + A_n(\phi_i)}\,(1 - w)\right]^{k-1},$$

and marginalising over the Poisson distributed number of points and then over $\tau_{k_n+1}^\star$ gives

$$p(s_{n+1} = k_n + 1, \tau_{k_n+1}^\star \in (\phi_{i-1}, \phi_i) \mid S_{n,m}, s_1, \ldots, s_n) = \frac{p(s_1, \ldots, s_n, s_{n+1} = k_n + 1, \tau_{k_n+1}^\star \in (\phi_{i-1}, \phi_i) \mid S_{n,m})}{p(s_1, \ldots, s_n \mid S_{n,m})} \propto \exp\left\{-\frac{\lambda M(\phi_i - \phi_{i-1})A_n(\phi_i)}{(M + A_n(\phi_i))^2}\right\}\left[1 - \exp\left\{-\frac{M^2\lambda(\phi_i - \phi_{i-1})}{(M + A_n(\phi_i))^2(M + A_n(\phi_i) + 1)}\right\}\right].$$

Finally, since $\tau_{k_n+1}^\star$ is uniformly distributed on $(\phi_{i-1}, \phi_i)$ if $\tau_{k_n+1}^\star \in (\phi_{i-1}, \phi_i)$, then

$$p(\tau_{k_n+1}^\star = \phi_i - x \mid \tau_{k_n+1}^\star \in (\phi_{i-1}, \phi_i)) \propto \exp\left\{-\frac{\lambda M^2(\phi_i - \tau_{k_n+1}^\star)}{(M + A_n(\phi_i))^2(M + A_n(\phi_i) + 1)}\right\},$$

which implies that $x = \phi_i - \tau_{k_n+1}^\star$ follows the distribution given in the Theorem.
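As a quick Monte Carlo check of the construction in A.1 (a sketch; the values of $a_t, b_t, a_{t+1}, b_{t+1}$ are arbitrary, subject to $a_{t+1} \ge a_t$ and $b_{t+1} \ge b_t$), simulating $V_{j,t+1} = wV_{j,t} + (1-w)\epsilon$ with the beta distributions above should reproduce the $\mathrm{Be}(a_{t+1}, b_{t+1})$ moments:

```python
import numpy as np

rng = np.random.default_rng(2)
a_t, b_t = 1.0, 2.0      # Be(a_t, b_t) marginal at time t (illustrative values)
a_t1, b_t1 = 1.5, 3.0    # target Be(a_{t+1}, b_{t+1}) marginal at time t+1

n = 200_000
V_t = rng.beta(a_t, b_t, n)
# w ~ Be(a_t + b_t, a_{t+1} + b_{t+1} - a_t - b_t), eps ~ Be(a_{t+1} - a_t, b_{t+1} - b_t)
w = rng.beta(a_t + b_t, a_t1 + b_t1 - a_t - b_t, n)
eps = rng.beta(a_t1 - a_t, b_t1 - b_t, n)
V_t1 = w * V_t + (1.0 - w) * eps

# compare simulated moments with the Be(a_{t+1}, b_{t+1}) target
print(V_t1.mean(), a_t1 / (a_t1 + b_t1))
print(V_t1.var(), a_t1 * b_t1 / ((a_t1 + b_t1) ** 2 * (a_t1 + b_t1 + 1)))
```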
Appendix B. Computational details

We will write the times in reverse time-order $T > \tau_1 > \tau_2 > \tau_3 > \cdots > \tau_k$, where $T = \max\{t_i\}$ and $k = \max\{s_j\}$. Let $k_{-i} = \max_{j \neq i}\{s_j\}$. We will use the notation from Definition 2, $m_j(t) = \#\{k \mid \tau_j < \tau_k < t\}$.

B.1. General sampler

Updating s. We update $s_i$ using a retrospective sampler (see Papaspiliopoulos and Roberts, 2008). Let $\Delta = \{\theta, w, \epsilon, \tau\}$. This method proposes a new value of $(s_i, \Delta)$, which will be referred to as $(s_i', \Delta')$, which is either accepted or rejected in a Metropolis–Hastings sampler. The proposal is made in the following way. Let $\theta_i' = \theta_i$, $\epsilon_i' = \epsilon_i$, $w_i' = w_i$ and $\tau_i' = \tau_i$ for $1 \le i \le k_{-i}$, let $\alpha = \max_{j \le k_{-i}}\{k(y_i|\theta_j')\}$, and define

$$q_j = p(s_i = j)\,k(y_i|\theta_j'), \quad 1 \le j \le k_{-i}.$$

Simulate $u \sim \mathrm{U}(0, 1)$. If $u < \sum_{j=1}^{k_{-i}} q_j$ then find the m for which $\sum_{j=1}^{m-1} q_j < u < \sum_{j=1}^{m} q_j$. Otherwise, we simulate in the following way. Let

$$q_j = \alpha\,p(s_i = j), \quad j > k_{-i},$$

and sequentially simulate $\Delta'_{k_{-i}+1}, \Delta'_{k_{-i}+2}, \ldots, \Delta'_m$ until we meet the condition that $u < \sum_{j=1}^{m} q_j$. We can simulate $\Delta_j'$ given $\Delta_{j-1}'$ using the relation $\tau_j' = \tau_{j-1}' - \nu_j$, where $\nu_j \sim \mathrm{Ex}(\lambda)$, and simulating $\theta_j'$, $\epsilon_j'$ and $w_j'$ from their priors. The new state $(s_i', \Delta')$ is accepted with probability

$$\begin{cases} 1 & \text{if } m \le k_{-i} \\ \min\left\{1, \dfrac{k(y_i|\theta_m')}{\alpha}\right\} & \text{if } m > k_{-i}. \end{cases}$$

Updating ζ. If $j = s_i$, then the full conditional distribution of $\zeta_{ijk}$ is given by

$$p(\zeta_{i s_i s_i} = l) \propto \epsilon_{s_i,l}(1 - w_{j,l}) \prod_{h=l+1}^{m_j(t_i)} w_{j,h}$$

and

$$p(\zeta_{i s_i k} = l) = (1 - \epsilon_{s_i,l})(1 - w_{j,l}) \prod_{h=l+1}^{m_j(t_i)} w_{j,h}, \quad k < s_i.$$

Otherwise $\zeta_{ijk}$ is sampled from its prior distribution.

Updating ϵ. The full conditional distribution of $\epsilon_{j,k}$ is

$$\mathrm{Be}\Big(a_k - a_{k-1} + \sum_{\{i \mid s_i = j\}} I(\zeta_{ijj} = k),\ b_k - b_{k-1} + \sum_{\{i \mid s_i < j \text{ and } \tau_j < t_i\}} I(\zeta_{ijp} = k)\Big).$$

Updating w. The full conditional distribution of $w_{j,l}$ is $\mathrm{Be}(a^\star, b^\star)$, where

$$a^\star = a_{l-1} + b_{l-1} + \sum_{i=1}^{n}\sum_{j=s_i}^{r_i}\sum_{k=j}^{r_i} I(\zeta_{ijk} + 1 \le l \le m_j(t_i))$$

and

$$b^\star = a_l + b_l - a_{l-1} - b_{l-1} + \sum_{i=1}^{n}\sum_{h=s_i}^{r_i} I(h \le j \le r_i)\,I(\zeta_{ihj} = l).$$
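A schematic of the retrospective proposal for $s_i$ is sketched below (Python; p_prior, k_lik and sim_theta_prior are hypothetical stand-ins for the allocation prior, the likelihood kernel and the prior draw of an atom, and the cap max_new is our own safeguard rather than part of the algorithm):

```python
import numpy as np

rng = np.random.default_rng(3)

def update_s_i(y_i, theta, p_prior, k_lik, sim_theta_prior, max_new=100):
    """One retrospective proposal for s_i (schematic sketch).

    theta: list of existing atoms (length k_minus_i)
    p_prior(j): stand-in for the prior allocation probability p(s_i = j)
    k_lik(y, th): stand-in for the likelihood kernel k(y | th)
    sim_theta_prior(): stand-in for a prior draw of a new atom
    Returns the proposed s_i (1-based, or None if rejected) and the atom list.
    """
    k_mi = len(theta)
    alpha = max(k_lik(y_i, th) for th in theta)       # envelope constant
    q = [p_prior(j) * k_lik(y_i, theta[j - 1]) for j in range(1, k_mi + 1)]
    cum = np.cumsum(q)
    u = rng.uniform()
    if u < cum[-1]:
        return int(np.searchsorted(cum, u)) + 1, theta    # existing atom: accept
    total = cum[-1]
    for j in range(k_mi + 1, k_mi + max_new + 1):     # retrospective extension
        theta = theta + [sim_theta_prior()]           # new atom from the prior
        total += alpha * p_prior(j)                   # q_j = alpha * p(s_i = j)
        if u < total:
            # Metropolis-Hastings correction for the likelihood envelope
            if rng.uniform() < k_lik(y_i, theta[-1]) / alpha:
                return j, theta
            return None, theta                        # rejected: keep current s_i
    return None, theta
```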
Updating τ. The point process τ can be updated using a reversible jump MCMC step. We have three possible moves: (1) add a point to the process, (2) delete a point from the process and (3) move a point. The first two moves are proposed with the same probability $q_{\mathrm{CHANGE}}$ (where $q_{\mathrm{CHANGE}} < 0.5$) and the third move is proposed with probability $1 - 2q_{\mathrm{CHANGE}}$.

The Add move proposes the addition of a point to the process by uniformly sampling $\tau_{k+1}$ from $(\min\{\tau_i\}, \max\{t_i\})$, $\theta_{k+1} \sim H$, and simulating the necessary extra ϵ's, w's and ζ's from their priors. To improve acceptance rates we also update some allocations s. A point, $j^\star$, is chosen uniformly at random from $\{1, \ldots, k\}$ and we propose new values $s_i'$ if $s_i = j^\star$ according to the probabilities

$$p(s_i' = j^\star) = q_{i,1} = \frac{p'(s_i = j^\star)\,k(y_i|\theta_{j^\star})}{p'(s_i = k+1)\,k(y_i|\theta_{k+1}) + p'(s_i = j^\star)\,k(y_i|\theta_{j^\star})}$$

and

$$p(s_i' = k+1) = q_{i,2} = \frac{p'(s_i = k+1)\,k(y_i|\theta_{k+1})}{p'(s_i = k+1)\,k(y_i|\theta_{k+1}) + p'(s_i = j^\star)\,k(y_i|\theta_{j^\star})},$$

where p′ is calculated using the proposal and p is calculated using the current state. The acceptance probability is

$$\min\left\{1,\ q^\star\,\frac{\lambda(\max(t_i) - \min(\tau_i))}{k}\prod_{\{i \mid s_i' \neq s_i\}} \frac{k(y_i|\theta_{k+1}')}{k(y_i|\theta_{j^\star})} \prod_{i=1}^{n} \frac{p'(s_i')}{p(s_i)}\right\},$$

where

$$q^\star = \prod_{\{i \mid s_i = j^\star\}} \left(\frac{1}{q_{i,1}}\right)^{I(s_i' = j^\star)} \left(\frac{1}{q_{i,2}}\right)^{I(s_i' = k+1)}.$$

The Delete move proposes to remove a point of the process by uniformly selecting two distinct points, $j_1$ from the set $\{1, 2, \ldots, k\}\setminus\{i \mid \tau_i \le \tau_j \text{ for all } j\}$ and $j_2$ from $\{1, 2, \ldots, k\}$. We propose to remove $\tau_{j_1}$, $\theta_{j_1}$, and the vectors $w_{j_1}$ and $\epsilon_{j_1}$. For all points $\tau_i < \tau_{j_1}$, we propose new vectors $\epsilon_i'$ and $w_i'$ by deleting the element $\epsilon_{i,m}$, where $m = \#\{j \mid \tau_j > \tau_i\}$, from $\epsilon_i$ and the element $w_{i,m}$ from $w_i$. Finally, we set $s_i' = j_2$ if $s_i = j_1$. The acceptance probability is zero if $\tau_{j_2} > t_i$ for any i such that $s_i = j_1$. Otherwise, the acceptance probability is

$$\min\left\{1,\ q^\star\,\frac{k}{\lambda(\max(t_i) - \min(\tau_i))}\prod_{\{i \mid s_i' \neq s_i\}} \frac{k(y_i|\theta_{j_2})}{k(y_i|\theta_{j_1})} \prod_{i=1}^{n} \frac{p'(s_i')}{p(s_i)}\right\},$$

where

$$q^\star = \prod_{\{i \mid s_i = j_1 \text{ or } s_i = j_2\}} \left(\frac{1}{q_{i,1}}\right)^{I(s_i' = j_1)} \left(\frac{1}{q_{i,2}}\right)^{I(s_i' = j_2)}.$$

The reverse proposals $q_{i,1}$ and $q_{i,2}$ are calculated as

$$q_{i,1} = \frac{p(s_i = j_1)\,k(y_i|\theta_{j_1})}{p(s_i = j_1)\,k(y_i|\theta_{j_1}) + p(s_i = j_2)\,k(y_i|\theta_{j_2})}$$

and

$$q_{i,2} = \frac{p(s_i = j_2)\,k(y_i|\theta_{j_2})}{p(s_i = j_1)\,k(y_i|\theta_{j_1}) + p(s_i = j_2)\,k(y_i|\theta_{j_2})}.$$

The Move step uses a Metropolis–Hastings random walk proposal. A distinct point is chosen at random from the set $\{1, 2, \ldots, k\}\setminus\{i \mid \tau_i \le \tau_j \text{ for all } j\}$, say $j^\star$, and a new value $\tau_{j^\star}' = \tau_{j^\star} + \epsilon$ is proposed, where $\epsilon \sim \mathrm{N}(0, \sigma_{\mathrm{PROP}}^2)$. The move is rejected if $\tau_{j^\star}' < \min(\tau_j)$, $\tau_{j^\star}' > \max(t_i)$ or $\tau_{j^\star}' > t_i$ for any i such that $s_i = j^\star$. Otherwise, the acceptance probability is

$$\min\left\{1,\ \prod_{i=1}^{n} \frac{p'(s_i')}{p(s_i)}\right\}.$$

Updating λ. The parameter λ can be updated in the following way. Let $\tau^{(\mathrm{old})}$ and $\lambda^{(\mathrm{old})}$ be the current values in the Markov chain. Simulate λ from the distribution proportional to

$$p(\lambda)\,\lambda^{\#\{i \mid \tau_i > \min\{t_i\}\}}\exp\{-\lambda(\max(t_i) - \min(t_i))\}$$

and set $\tau_i = \min(t_i) - \frac{\lambda^{(\mathrm{old})}}{\lambda}(\min(t_i) - \tau_i^{(\mathrm{old})})$ if $\tau_i^{(\mathrm{old})} < \min(t_i)$.

Updating θ. The parameter $\theta_j$ can be updated from the full conditional distribution proportional to

$$h(\theta_j) \prod_{\{i \mid s_i = j\}} k(y_i|\theta_j).$$
B.2. Poisson–Dirichlet process

In this case the general sampler can be simplified to a method that generalises the computational approach described by Dunson et al. (2007) for a process where each break is formed by the product of two beta random variables. In our more general case we can write

$$V_j(t) = \epsilon_j \prod_{i=2}^{m_j(t)+1} w_{j,i},$$

so that

$$p(s_i = j) = \epsilon_j \prod_{i=2}^{m_j(t_i)+1} w_{j,i} \prod_{l < j}\left(1 - \epsilon_l \prod_{i=2}^{m_l(t_i)+1} w_{l,i}\right).$$

We introduce latent variables $r_{ijk}$ which take values 0 or 1, where

$$p(r_{ij1} = 1) = \epsilon_j, \qquad p(r_{ijk} = 1) = w_{j,k} \quad \text{for } k = 2, \ldots, m_j(t_i) + 1,$$

and which are linked to the usual allocation variables $s_i$ by the relationship $s_i = \min\{j \mid r_{ijk} = 1 \text{ for all } 1 \le k \le m_j(t_i) + 1\}$. Thus

$$p(s_i = j) = p(r_{ijk} = 1 \text{ for all } 1 \le k \le m_j(t_i)) \prod_{l < j} p(\text{there exists } k \text{ such that } r_{ilk} = 0)$$
$$= p(r_{ijk} = 1 \text{ for all } 1 \le k \le m_j(t_i)) \prod_{l < j} \big(1 - p(r_{ilk} = 1 \text{ for all } 1 \le k \le m_l(t_i))\big).$$

Conditional on r the full conditional distributions of ϵ and w will be beta distributions and any hyperparameters of the stick-breaking process can be updated using standard techniques. Updating of the other parameters proceeds by marginalising over r but conditioning on s.

Updating s. We could update s using the method in Appendix B.1 but we find that this can run very slowly in the Poisson–Dirichlet case. This is because at each update of $s_i$ we potentially simulate a proposed value $s_i'$ which is much bigger than $\max\{s_i\}$ (due to the slow decay of Poisson–Dirichlet processes) and generate very many values for w. This section describes an alternative approach which updates in two steps: (1) update $s_i$ marginalising over any new ϵ and w vectors and (2) simulate the new ϵ and w vectors conditional on the new value of $s_i$. The algorithm is much more efficient since many proposed values of $s_i$ are rejected at stage (1) and extensive simulation is avoided. We make the following changes to the algorithm:
$$q_j = \alpha\,\frac{1-b}{1 + a + b(j-1)} \prod_{k_{-i} < l < j} \frac{a + bl}{1 + a + b(l-1)} \prod_{l \le k_{-i}} (1 - V_l(t_i)), \quad j > k_{-i},$$

and sequentially simulate $(\theta_{k_{-i}+1}', \tau_{k_{-i}+1}'), (\theta_{k_{-i}+2}', \tau_{k_{-i}+2}'), \ldots, (\theta_m', \tau_m')$ in the same way as before until we meet the condition that $u < \sum_{j=1}^{m} q_j$. The new state $(\theta', \tau', s_i')$ is accepted with probability

$$\begin{cases} 1 & \text{if } \max\{s_i'\} \le k \\ \min\left\{1, \dfrac{k(y_i|\theta_m')}{\alpha}\right\} & \text{if } \max\{s_i'\} > k. \end{cases}$$

If the move is accepted we simulate $\epsilon_j$ and $w_j$ for $j > k_{-i}$ in the following way: simulate $r_{ijk}$, where $k = \#\{l \mid \tau_j < \tau_l < t_i\}$, for $k_{-i} < j \le s_i$ and simulate $\epsilon_j$ and $w_j$ using the method for updating ϵ and w.

Updating ϵ and w. We can generate r conditional on s using the following scheme. For the i-th observation, we can simulate $r_{ij1}, r_{ij2}, \ldots, r_{ijk}$, where $k = \#\{l \mid \tau_j < \tau_l < t_i\}$, sequentially. Initially,

$$p(r_{ij1} = 1) = \frac{\epsilon_j\left(1 - \prod_{h=2}^{k} w_{j,h}\right)}{1 - \epsilon_j \prod_{h=2}^{k} w_{j,h}}.$$

To simulate $r_{ijl}$, if $r_{ijh} = 1$ for $1 \le h < l$ then

$$p(r_{ijl} = 1) = \frac{w_{j,l}\left(1 - \prod_{h=l+1}^{k} w_{j,h}\right)}{1 - w_{j,l} \prod_{h=l+1}^{k} w_{j,h}};$$

otherwise $p(r_{ijl} = 1) = w_{j,l}$. Finally, we set $r_{i(k+1)1} = 1, \ldots, r_{i(k+1)k} = 1$. Then the full conditional distribution of $\epsilon_j$ is

$$\mathrm{Be}\Big(1 - b + \sum_{\{i \mid 1 < \#\{\tau_k \mid \tau_j \le \tau_k < t_i\}\}} r_{ij1},\ a + b + \sum_{\{i \mid 1 < \#\{\tau_k \mid \tau_j \le \tau_k < t_i\}\}} (1 - r_{ij1})\Big)$$

and the full conditional distribution of $w_{j,k}$ for $k \ge 1$ is

$$\mathrm{Be}\Big(1 + a + (k-2)b + \sum_{\{i \mid k < \#\{\tau_k \mid \tau_j \le \tau_k < t_i\}\}} r_{ijk},\ b + \sum_{\{i \mid k < \#\{\tau_k \mid \tau_j \le \tau_k < t_i\}\}} (1 - r_{ijk})\Big).$$

Updating τ, θ and λ. These can be updated using the methods in Appendix B.1.
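The sequential generation of r for a stick that the observation is not allocated to can be sketched as follows (Python/NumPy; this assumes the conditional probabilities take the form reconstructed above, i.e. independent Bernoulli draws constrained so that not all indicators equal one):

```python
import numpy as np

rng = np.random.default_rng(5)

def simulate_r(eps_j, w_j):
    """Sequentially simulate r_{ij1}, ..., r_{ijk} for a stick the observation
    is NOT allocated to, conditional on at least one r being zero (a sketch).
    w_j holds the remaining break components (w_{j,2}, ..., w_{j,k})."""
    k = len(w_j) + 1
    tail = np.prod(w_j)                     # prod_{h=2}^{k} w_{j,h}
    # p(r_1 = 1 | not all ones) = eps*(1 - tail) / (1 - eps*tail)
    r = [rng.uniform() < eps_j * (1.0 - tail) / (1.0 - eps_j * tail)]
    all_ones = r[0]
    for l in range(2, k + 1):
        w = w_j[l - 2]                      # w_{j,l}
        tail = np.prod(w_j[l - 1:])         # prod_{h=l+1}^{k} w_{j,h} (empty = 1)
        if all_ones:
            # still constrained: some later indicator must be zero
            p = w * (1.0 - tail) / (1.0 - w * tail)
        else:
            p = w                           # constraint already satisfied
        r.append(rng.uniform() < p)
        all_ones = all_ones and r[-1]
    return np.array(r, dtype=int)
```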
B.3. Dirichlet process—marginal method

Updating s. We can update $s_j$ conditional on $s_1, \ldots, s_{j-1}, s_{j+1}, \ldots, s_n$. We define $A(t)$ to be the active set defined using the allocations $s_1, \ldots, s_{j-1}, s_{j+1}, \ldots, s_n$, $\mathcal{T} = \{\tau_{s_l} \mid l = 1, 2, \ldots, j-1, j+1, \ldots, n\}$, $\mathcal{S} = \mathcal{T} \cup \{t_1, t_2, \ldots, t_n\}$ and $k^\star(y_j) = \int k(y_j|\theta)h(\theta)\,d\theta$. We use the discrete distribution derived from Theorem 2. Let $\tau_1^\star < \tau_2^\star < \cdots < \tau_k^\star$ be the elements of $\mathcal{T}$, where k is the size of $\mathcal{T}$, and let $\phi_1 < \phi_2 < \cdots < \phi_l$ be the elements of $\mathcal{S}$. We define

$$C_i = \exp\left\{-\frac{\lambda M^2(\phi_{i+1} - \phi_i)}{(M + A(\phi_{i+1}))^2(1 + M + A(\phi_{i+1}))}\right\} = \rho^{\frac{M^2(M+1)(\phi_{i+1} - \phi_i)}{(M + A(\phi_{i+1}))^2(1 + M + A(\phi_{i+1}))}}$$

and

$$D_i = \frac{M + A(\tau_i^\star)}{1 + \eta_i + M + A(\tau_i^\star)},$$

where $\eta_i = \sum_{k=1, k \neq j} I(s_k = i)$. Let $\phi_p = t_j$ and $\tau_q^\star$ be the largest element of $\mathcal{T}$ smaller than $t_j$. The full conditional distribution of $s_j$ is given by

$$p(s_j = m) \propto k(y_j|\theta_m)(1 - D_m) \prod_{\{l \mid \tau_m^\star \le \phi_l \le \phi_p\}} C_l \prod_{h=m+1}^{q} D_h, \quad 1 \le m \le q,$$

$$p(s_j = k+1 \text{ and } \tau_{k+1}^\star \in (\phi_i, \phi_{i+1})) \propto k^\star(y_j)(1 - C_i) \prod_{h=i+1}^{p} C_h \prod_{\{\tau_h^\star \mid \phi_i < \tau_h^\star \le \phi_p\}} D_h$$

and

$$p(s_j = k+1 \text{ and } \tau_{k+1}^\star \in (-\infty, \phi_1)) \propto k^\star(y_j) \prod_{h=1}^{p} C_h \prod_{i=1}^{q} D_i.$$

If $s_j = k + 1$ the new time $\tau_{k+1}^\star$ needs to be drawn in the following way: if $\tau_{k+1}^\star \in (-\infty, \phi_1)$, $\tau_{k+1}^\star = \phi_1 - x$ where $x \sim \mathrm{Ex}\left(\frac{\lambda}{M+1}\right)$, and if $\tau_{k+1}^\star \in (\phi_i, \phi_{i+1})$, $\tau_{k+1}^\star = \phi_{i+1} - x$ where

$$x \sim \mathrm{TEx}_{(0,\phi_{i+1} - \phi_i)}\left(\frac{\lambda M^2}{(M + A_n(\phi_{i+1}))^2(M + A_n(\phi_{i+1}) + 1)}\right)$$

and $\mathrm{TEx}_{(a,b)}(\lambda)$ represents an exponential distribution truncated to $(a, b)$ as defined in Theorem 2.

Updating τ⋆. We can update $\tau_i^\star$ from its full conditional distribution. We define $\phi_1 < \phi_2 < \cdots < \phi_l$ to be the elements of $\{t_1, t_2, \ldots, t_n\} \cup \{\tau_1^\star, \ldots, \tau_{i-1}^\star, \tau_{i+1}^\star, \ldots, \tau_k^\star\}$. Let $A(t)$ be the active set excluding the values of j for which $\tau_{s_j} = \tau_i$, define K by $\phi_K = \min\{t_k \mid s_k = i\}$ and $\eta_i = \#\{k \mid s_k = i\}$. Let

$$P_j = \exp\left\{-\frac{\lambda M(\phi_{j+1} - \phi_j)A_n(\phi_j)}{(M + A_n(\phi_{j+1}))^2}\right\}, \qquad P_j' = \exp\left\{-\frac{\lambda M(\phi_{j+1} - \phi_j)(A_n(\phi_{j+1}) + \eta_i)}{(M + A_n(\phi_{j+1}))(\eta_i + M + A_n(\phi_{j+1}))}\right\}, \quad 1 \le j < l,$$

$$Q_j = \frac{\Gamma(M + A_n(\tau_j))}{\Gamma(1 + \eta_j + M + A_n(\tau_j))}, \qquad Q_j' = \frac{\Gamma(M + A_n(\tau_j) + \eta_i)}{\Gamma(M + A_n(\tau_j) + \eta_i + 1 + \eta_j)}, \quad 1 \le j \le k,$$

and

$$A_j = \frac{M\,\Gamma(M + A_n(\phi_{j+1}))}{\Gamma(1 + \eta_i + M + A_n(\phi_{j+1}))} \exp\left\{-\lambda\,\frac{M(\phi_{j+1} - \phi_j)(A_n(\phi_{j+1}) + \eta_i)}{(M + A_n(\phi_{j+1}))(M + A_n(\phi_{j+1}) + \eta_i)}\right\}.$$

The probability that $\tau_i^\star \in (\phi_j, \phi_{j+1})$ is proportional to

$$A_j \prod_{h=j+1}^{p} \frac{P_h'}{P_h} \prod_{\{i \mid \phi_j < \tau_i^\star \le \phi_p\}} \frac{Q_i'}{Q_i}, \quad j \le K,$$
and the probability that $\tau_i^\star < \phi_1$ is proportional to

$$\frac{\Gamma(M + 1)\,\Gamma(\eta_i)}{\Gamma(\eta_i + M)} \prod_{h=1}^{p} \frac{P_h'}{P_h} \prod_{i=1}^{q} \frac{Q_i'}{Q_i}.$$

This distribution is finite and discrete and draws can be simply simulated. Conditional on the atom being allocated to the region $(\phi_{j-1}, \phi_j)$, the simulated value is $\tau_i^\star = \phi_j - x$, where x is distributed

$$\mathrm{TEx}_{(0,\phi_j - \phi_{j-1})}\left(\lambda(\phi_j - \phi_{j-1})\,\frac{M}{M + A(\phi_j)}\,\frac{M + A(\phi_j) + \eta_i}{M + A(\phi_j)}\right),$$

and if $\tau_i^\star < \phi_1$ then $\tau_i^\star = \phi_1 - x$ where $x \sim \mathrm{Ex}(\lambda/(M+1))$.

Updating λ and M. To update these parameters from their full conditional distributions we first simulate the number of atoms, $c_i$, between $(\phi_i, \phi_{i+1})$ from a Poisson distribution with mean $\frac{M}{M + A_n(\phi_{i+1})}\lambda(\phi_{i+1} - \phi_i)$, and $V_1, V_2, \ldots, V_{k_n}$, where $V_i \sim \mathrm{Be}(1 + \eta_i, M + A_n(\tau_i))$. The full conditional distribution of λ is proportional to

$$p(\lambda)\,\lambda^{k_n + \sum_i c_i}\exp\{-\lambda(\max\{t_i\} - \min\{\tau_i\})\}.$$

The full conditional distribution of M is proportional to

$$p(M)\,M^{k_n + \sum_i c_i} \prod_{i=1}^{k_n} \frac{M + A_n(\tau_i)}{M + A_n(\tau_i) + 1 + \eta_i} \prod_{i=1}^{k_n+1} \left(\frac{M + A_n(\phi_{i+1})}{1 + M + A_n(\phi_{i+1})}\right)^{c_i}.$$
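The full conditional of M is nonstandard, so one simple (if crude) option is a griddy-Gibbs draw. The sketch below (Python/NumPy) is our own device rather than part of the paper's algorithm, and assumes an Exp(1) prior for M; the inputs eta, A_tau, A_phi and c are hypothetical stand-ins for the quantities defined above:

```python
import numpy as np

rng = np.random.default_rng(6)

def sample_M(eta, A_tau, A_phi, c, grid=np.linspace(0.01, 10.0, 500)):
    """Griddy-Gibbs draw of M from the full conditional above (a sketch).
    eta, A_tau: cluster sizes and active-set counts at the atom times;
    A_phi, c: active-set counts and latent atom counts for the intervals."""
    c = np.asarray(c)
    logp = -grid                                    # log of the Exp(1) prior
    logp += (len(eta) + c.sum()) * np.log(grid)     # M ** (k_n + sum_i c_i)
    for e, a in zip(eta, A_tau):
        logp += np.log(grid + a) - np.log(grid + a + 1 + e)
    for ci, a in zip(c, A_phi):
        logp += ci * (np.log(grid + a) - np.log(1 + grid + a))
    p = np.exp(logp - logp.max())                   # stabilise before normalising
    return rng.choice(grid, p=p / p.sum())
```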
Appendix C. Comparison of MCMC algorithms

We compare the marginal (see Appendix B.3) and the general conditional algorithms with Dirichlet process marginals by analysing three simulated data sets and looking at the behaviour of the chain for the two parameters λ and M. The integrated autocorrelation time is used as a measure of the mixing of the two chains, since an effective sample size can be estimated as the sample size divided by the integrated autocorrelation time (Liu, 2001). We introduce three simple, simulated datasets to compare performance over a range of possible data. In all cases, we make a single observation at each time point for $t = 1, 2, \ldots, 100$. The data sets are simulated from the following models. The first model has a single change point at time 50:

$$p(y_i) = \begin{cases} \mathrm{N}(-20, 1) & \text{if } i < 50 \\ \mathrm{N}(20, 1) & \text{if } i \ge 50. \end{cases}$$

The second model has a linear trend over time:

$$p(y_i) = \mathrm{N}\left(\frac{40(i-1)}{99} - 20, 1\right).$$

The third model has a linear trend before time 40 and then follows a mixture of three regressions after time 40:

$$p(y_i) = \begin{cases} \mathrm{N}\left(\dfrac{40(i-1)}{99} - 20, 1\right) & \text{if } i < 40 \\[1ex] \dfrac{3}{10}\,\mathrm{N}\left(\dfrac{40(i-1)}{99} - 20, 1\right) + \dfrac{2}{5}\,\mathrm{N}(-4, 1) + \dfrac{3}{10}\,\mathrm{N}\left(12 - \dfrac{40(i-1)}{99}, 1\right) & \text{if } i \ge 40. \end{cases}$$

These data sets are fitted by a mixture of normals model

$$y_t \sim \mathrm{N}(\mu_t, 1), \qquad \mu_t \sim G_t, \qquad G_t \sim \mathrm{DPAR}(M, H; \lambda),$$

where $H(\mu) = \mathrm{N}(\mu|0, 100)$. Table 2 shows the results for the three data sets, using exponential priors with unitary mean for M and λ. The mixing of λ is much better using the marginal sampler for each dataset (particularly data sets 2 and 3). The mixing of M is similar for the first two datasets but better for dataset 3.

Table 2
The integrated autocorrelation times for M and λ using the two sampling schemes.

          M                  λ
Dataset   Cond.    Marg.     Cond.    Marg.
1         3.9      3.5       12.3     6.4
2         3.0      3.4       41.4     5.1
3         15.0     6.1       36.3     5.9
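For reference, the integrated autocorrelation time can be estimated from a chain as follows (a sketch; truncating the autocorrelation sum at the first negative estimate is one common windowing choice among several):

```python
import numpy as np

rng = np.random.default_rng(7)

def iact(x, max_lag=None):
    """Integrated autocorrelation time: 1 + 2 * sum of autocorrelations,
    truncated at the first negative estimate."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    acf = np.correlate(x, x, mode="full")[n - 1:] / np.dot(x, x)
    tau = 1.0
    for k in range(1, max_lag or n // 3):
        if acf[k] < 0:
            break
        tau += 2.0 * acf[k]
    return tau

# an AR(1) chain with phi = 0.9 has IACT = (1 + phi)/(1 - phi) = 19
chain = np.empty(50_000)
chain[0] = 0.0
for t in range(1, len(chain)):
    chain[t] = 0.9 * chain[t - 1] + rng.standard_normal()
print(iact(chain), len(chain) / iact(chain))   # IACT and effective sample size
```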
References

Carlin, B.P., Gelfand, A.E., Smith, A.F.M., 1992. Hierarchical Bayesian analysis of changepoint problems. Journal of the Royal Statistical Society: Series C 41, 389–405.
Caron, F., Davy, M., Doucet, A., 2007. Generalized Pólya urn for time-varying Dirichlet process mixtures. In: 23rd Conference on Uncertainty in Artificial Intelligence, UAI 2007.
Chib, S., Greenberg, E., 2009. Additive cubic spline regression with Dirichlet process mixture errors. Technical Report. Washington University at St. Louis.
Chib, S., Hamilton, B.H., 2002. Semiparametric Bayesian analysis of longitudinal data treatment models. Journal of Econometrics 110, 67–89.
De Iorio, M., Müller, P., Rosner, G.L., MacEachern, S.N., 2004. An ANOVA model for dependent random measures. Journal of the American Statistical Association 99, 205–215.
Dunson, D.B., 2006. Bayesian dynamic modeling of latent trait distributions. Biostatistics 7, 551–568.
Dunson, D.B., Pillai, N., Park, J.H., 2007. Bayesian density regression. Journal of the Royal Statistical Society: Series B 69, 163–183.
Fox, E.B., Sudderth, E.B., Jordan, M.I., Willsky, A.S., 2008. An HDP-HMM for systems with state persistence. In: Proceedings of the International Conference on Machine Learning, Helsinki, Finland.
Geweke, J., Keane, M., 2007. Smoothly mixing regressions. Journal of Econometrics 138, 252–290.
Grazia Pittau, M., Zelli, R., 2006. Empirical evidence of income dynamics across EU regions. Journal of Applied Econometrics 21, 605–628.
Griffin, J.E., Steel, M.F.J., 2004. Semiparametric Bayesian inference for stochastic frontier models. Journal of Econometrics 123, 121–152.
Griffin, J.E., Steel, M.F.J., 2006. Order-based dependent Dirichlet processes. Journal of the American Statistical Association 101, 179–194.
Hirano, K., 2002. Semiparametric Bayesian inference in autoregressive panel data models. Econometrica 70, 781–799.
Ishwaran, H., James, L.F., 2001. Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association 96, 161–173.
Ishwaran, H., James, L.F., 2003. Some further developments for stick-breaking priors: finite and infinite clustering and classification. Sankhya-A 65, 577–592.
Ishwaran, H., Zarepour, M., 2000. Markov chain Monte Carlo in approximate Dirichlet and two-parameter process hierarchical models. Biometrika 87, 371–390.
Jacquier, E., Polson, N.G., Rossi, P.E., 2004. Bayesian analysis of stochastic volatility models with fat tails and correlated errors. Journal of Econometrics 122, 185–212.
James, L., Lijoi, A., Prünster, I., 2009. Posterior analysis for normalized random measures with independent increments. Scandinavian Journal of Statistics 36, 76–97.
Jensen, M.J., Maheu, J.M., 2010. Bayesian semiparametric stochastic volatility modeling. Journal of Econometrics 157, 306–316.
Leslie, D., Kohn, R., Nott, D.J., 2007. A general approach to heteroscedastic linear regression. Statistics and Computing 17, 131–146.
Liu, J.S., 2001. Monte Carlo Strategies in Scientific Computing. Springer-Verlag, New York.
Lo, A.Y., 1984. On a class of Bayesian nonparametric estimates: I. Density estimates. The Annals of Statistics 12, 351–357.
Müller, P., Quintana, F., 2004. Nonparametric Bayesian data analysis. Statistical Science 19, 95–110.
Müller, P., Quintana, F., Rosner, G., 2004. A method for combining inference across related nonparametric Bayesian models. Journal of the Royal Statistical Society: Series B 66, 735–749.
Nieto-Barajas, L., Müller, P., Ji, Y., Lu, Y., Mills, G., 2008. Time series dependent Dirichlet process. Mimeo.
Papaspiliopoulos, O., Roberts, G., 2008. Retrospective MCMC for Dirichlet process hierarchical models. Biometrika 95, 169–186.
Pitman, J., 2003. Poisson–Kingman partitions. In: Goldstein, D.R. (Ed.), Statistics and Science: A Festschrift for Terry Speed. IMS, Beachwood, pp. 1–34.
Pitman, J., Yor, M., 1997. The two-parameter Poisson–Dirichlet distribution derived from a stable subordinator. Annals of Probability 25, 855–900.
Rodriguez, A., ter Horst, E., 2008. Bayesian dynamic density estimation. Bayesian Analysis 3, 339–366.
Taddy, M.A., Kottas, A., 2009. Markov switching Dirichlet process mixture regression. Bayesian Analysis 4, 793–816.
Walker, S.G., Damien, P., Laud, P.W., Smith, A.F.M., 1999. Bayesian nonparametric inference for random distributions and related functions (with discussion). Journal of the Royal Statistical Society: Series B 61, 485–527.
Zhu, X., Ghahramani, Z., Lafferty, J., 2005. Time-sensitive Dirichlet process mixture models. Technical Report CMU-CALD-05-104. Carnegie Mellon University.