Editorial
Annals issue on forecasting—Guest editors’ introduction
Forecasting is one of the most active research areas in econometrics; see Elliott et al. (2006) for a broad view of the subject. The objective of this Annals issue is to show its most recent developments. Forecasting was the theme of a conference titled "Forecasting in Rio," held at the Graduate School of Economics of Getulio Vargas Foundation, Rio de Janeiro, Brazil, in July 2008. The conference website (http://epge.fgv.br/finrio/) lists all papers presented at the conference, and this Annals issue contains a refereed subset of the papers.

Three papers in the volume adopt a no-arbitrage framework to model the term structure of interest rates and test whether model restrictions and information from interest rate options prices can be exploited to generate better out-of-sample forecasts. In their paper "The Affine Arbitrage-Free Class of Nelson–Siegel Term Structure Models," Christensen, Diebold and Rudebusch develop dynamic term structure models that are theoretically attractive in that they are affine (and thus analytically tractable), exclude arbitrage and simultaneously perform well empirically with results that are easy to interpret. More richly parameterized versions of the models are shown to produce a good in-sample fit to the term structure of interest rates but result in poor out-of-sample forecasting performance. In contrast, parsimonious model specifications are found to produce better out-of-sample forecasts.

To analyze when restrictions on the parameter space help or hurt is the topic of the paper "How Useful are No-arbitrage Restrictions for Forecasting the Term Structure of Interest Rates?" by Carriero and Giacomini. Parameter estimation error adversely affects the predictive accuracy of many forecasting models, and so it is commonly found that imposing restrictions on model parameters, even when these lead to biased estimates, can improve forecasting performance simply by reducing estimation errors. Carriero and Giacomini develop a method for inference in the comparison of the predictive accuracy of more or less constrained models. Their forecast combination approach accounts for the forecaster's loss function and allows for time-variation in the effect of imposing constraints. In an empirical application to out-of-sample predictability of the term structure of interest rates, they find evidence that gains from imposing no-arbitrage restrictions have become smaller over time and that such gains are closely related to the associated reduction in estimation error.

Almeida, Graveline and Joslin, in their paper "Do Interest Rate Options Contain Information About Excess Returns?", model swap rates and interest rate option prices jointly to obtain a model for the term structure of interest rates. Like the analysis in Carriero and Giacomini and Christensen et al., Almeida et al. rely on an arbitrage-free term structure model for interest rates. They show that incorporating information from options prices leads to more precise forecasts of the spread in returns on long-term versus
short-term interest rate swaps. In particular, interest rate option prices are found to contain valuable information on a time-varying risk premium. Using information from interest rate options and accounting for stochastic volatility is found to improve the accuracy of interest rate forecasts.

The paper "A Component Model for Dynamic Correlations" by Colacito, Engle and Ghysels analyzes specification and estimation of models for dynamic correlations in a way that allows for both short-run and long-run components. Specifically, the paper proposes a new class of DCC–MIDAS models that combines dynamics in conditional correlations with data observed at different frequencies. A first empirical analysis considers the correlation between returns on industry portfolios and a government bond. A second application considers the formation of minimum variance portfolios comprising five international stock market portfolios and suggests sizeable gains in the in-sample and out-of-sample forecasting performance of the proposed DCC–MIDAS model over that of simpler alternatives.

In "Predictability of Stock Returns and Asset Allocation under Structural Breaks", Pettenuzzo and Timmermann use a Bayesian setup to focus on the issue of predictability of equity returns in the presence of structural breaks. They study monthly returns data over the period 1926–2005 and follow Chib (1998) and Pesaran et al. (2006) in proposing a changepoint model driven by an unobserved discrete state variable. Their analysis covers optimal asset allocation while allowing for different parameter values in each break segment (parameter uncertainty), uncertainty about the timing (dates) of historical breaks, uncertainty about the number of breaks, and, in an extension, uncertainty about the identity of the predictor variables ("model uncertainty"). The empirical findings suggest that model instability has a large effect on the asset allocation when compared to the effect of parameter estimation and model uncertainty. For the period 1926–2005, they find evidence of several structural breaks linked to events such as the Great Depression (1933), World War II, the oil shock of 1974, October 1987, and the change in the Fed's operating procedures around 1980.

The paper "A Control Function Approach for Testing the Usefulness of Trending Variables in Forecast Models and Linear Regression" by Elliott takes off from the observation that many predictor variables employed in forecasting macroeconomic and finance variables display a great deal of persistence. Tests for determining the usefulness of these predictors are typically oversized, overstating their importance. Similarly, hypothesis tests on cointegrating vectors will typically be oversized if there is not an exact unit root. Elliott proposes to use a control variable approach where adding stationary covariates with certain properties to the model can result in asymptotically normal inference for prediction
regressions and cointegration vector estimates in the presence of possibly non-unit-root trending covariates.

The paper "A Semiparametric Panel Model for Climate Change in the United Kingdom", by Atak, Linton, and Xiao proposes a semiparametric panel-data model applied to a climate change problem focused on temperatures. This is motivated by the great deal of attention that global warming has received in recent years. The model allows for deterministic seasonality and for unbalanced and/or internally missing data. The trend is allowed to evolve in a nonparametric way so as to obtain a fuller picture of the evolution of common temperature for medium-term horizons. Profile likelihood estimators (PLE) are proposed and their statistical properties are studied at length. A forecasting exercise based on UK regional temperature data and other weather outcomes over the last century is implemented using the model. Their work generalizes the effort in Gao and Hawthorne (2006), who explain annual global temperature in terms of a nonparametric time trend and use the southern oscillation index (SOI) as a covariate. It is also related to Campbell and Diebold (2005), who propose an alternative analysis of multivariate climate time series data, and to Pateiro-López and González-Manteiga (2006), who worked with a multivariate model with cross-sectionally correlated errors and different trends for each series. Applying the model to the UK dataset shows that there is a statistically significant upward trend over the last twenty years.

In "Model Selection, Estimation and Forecasting in VAR Models with Short-run and Long-run Restrictions," Athanasopoulos, Guillen, Issler, and Vahid study the joint determination of the lag length, the dimension of the cointegrating space and the rank of the matrix of short-run parameters of a vector autoregressive (VAR) model using model selection criteria. In the context of VAR models with restrictions, they consider model selection criteria that have data-dependent penalties (Chao and Phillips (1999)) as well as the traditional ones that do not (Vahid and Issler (2002)). They suggest a new two-step model selection procedure that is a hybrid of the two general approaches, and prove its consistency. Monte Carlo simulations are used to measure the improvements in forecasting accuracy that arise from the joint determination of lag length and rank using the proposed procedure. This is compared to the accuracy of an unrestricted VAR or a cointegrated VAR estimated by the commonly used procedure of selecting the lag length only and then testing for cointegration. Two empirical applications to predictability of Brazilian inflation and U.S. macroeconomic aggregate growth rates, respectively, show the usefulness of the proposed model selection strategy. Gains in different measures of forecasting accuracy are found to be substantial, especially for short horizons.

Geweke and Amisano, in their paper "Optimal Prediction Pools," propose pooling techniques for density forecasts. While the literature on pooling point forecasts has witnessed a large number of contributions – dating back to Bates and Granger (1969), extending through the many contributions reviewed in Timmermann (2006), and considering the recent addition by Issler and Lima (2009) – the literature on predictive density combination is more limited, although see Wallis (2005), one of the first econometricians to take up combinations of predictive densities, and Hall and Mitchell (2007).
In turn, Geweke and Amisano suggest a way of pooling density forecasts by finding the weights that maximize a weighted average of the models' log predictive densities; this log score is a concave function of the weights. In general, an optimal linear combination will include several models with positive weights, despite the fact that exactly one model has a limiting posterior probability that equals one. Geweke and Amisano derive several formal results: for example, a prediction model with positive weight in a pool may have zero weight if some other models are deleted from that pool. They apply the proposed techniques to the combination of GARCH-model densities of the return of the S&P 500.
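A minimal sketch of the pooling idea, assuming two hypothetical Gaussian predictive densities and simulated data rather than the authors' actual models, chooses the pool weight by maximizing the average log predictive score:

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
y = rng.standard_t(df=5, size=500)        # placeholder realizations

p1 = norm.pdf(y, loc=0.0, scale=1.0)      # predictive density of model 1
p2 = norm.pdf(y, loc=0.0, scale=1.5)      # predictive density of model 2

def neg_log_score(w):
    pool = w * p1 + (1.0 - w) * p2        # linear pool of the two densities
    return -np.mean(np.log(pool))         # the log score is concave in w

res = minimize(neg_log_score, x0=0.5, bounds=[(0.0, 1.0)])
print(f"optimal weight on model 1: {res.x[0]:.3f}")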
The paper "Quantile Regression for Dynamic Panel Data with Fixed Effects" by Galvão studies estimation, inference, and prediction in a quantile regression dynamic panel model with fixed effects. Panel data fixed effects estimators are typically biased when lagged dependent variables are used as regressors. To address this issue, the instrumental variables quantile regression method of Chernozhukov and Hansen (2006, 2008) with lagged regressors as instruments is proposed. The paper shows that the instrumental variables estimator is consistent and asymptotically normal, and proposes Wald and Kolmogorov–Smirnov type tests for general linear restrictions on the parameters. In addition, the paper describes how to employ the estimated models for prediction. Simulation results show that the instrumental variables approach sharply reduces the dynamic bias, and the observed coverage levels of prediction intervals are very close to nominal levels. Finally, the procedures are applied to forecasting output growth rates for 18 OECD countries.

Rossi and Sekhposyan's paper "Understanding Models' Forecasting Performance" proposes a new method to identify the sources of the out-of-sample forecasting performance of specific models. The authors decompose the models' forecasting performance into three asymptotically uncorrelated components. The first component measures the time-varying instabilities in forecasting. The second component measures the model's predictive content, which indicates the extent to which in-sample fit predicts out-of-sample forecasting performance. The last component measures over-fitting for a specific model, i.e., whether it includes irrelevant regressors that help in-sample fit but not out-of-sample predictive accuracy. Rossi and Sekhposyan's work relates to earlier work on forecast evaluation by Diebold and Mariano (1995) and West (1996), later refined by West and McCracken (1998), McCracken (2000), Clark and McCracken (2001) and Giacomini and White (2006), inter alia. The empirical application seeks to understand the sources of the poor forecasting performance of exchange-rate models using macroeconomic fundamentals. For short-term forecasting, Rossi and Sekhposyan find that lack of predictive content is the major culprit, while time-varying instabilities play a major role for medium-term forecasts.

The paper "Variable Selection, Estimation and Inference for Multi-period Forecasting Problems" by Pesaran, Pick, and Timmermann conducts a comparison of iterated and direct multi-period forecasting approaches applied to both univariate and multivariate models in the form of parsimonious factor-augmented vector autoregressions. The authors account for serial correlation in the residuals of the multi-period direct forecasting models using a new SURE-based estimation method and modified Akaike information criteria for model selection. They provide an empirical analysis of the 170 variables studied by Marcellino et al. (2006), which shows that information in factors helps improve forecasting performance for most types of economic variables, although it can also lead to larger biases. The study also shows that finite sample modifications to the Akaike information criterion can modestly improve the performance of the direct multi-period forecasts.
Doz, Giannone and Reichlin, in their paper "A Two-Step Estimator for Large Approximate Dynamic Factor Models based on Kalman Filtering", show consistency of a two-step estimation of the factors in a dynamic approximate factor model when the panel of time series is large (n large). In the first step, the parameters of the model are estimated by OLS on principal components. In the second step, the factors are estimated via the Kalman smoother. The analysis develops the theory for the estimator considered in Giannone et al. (2004, 2008) and for the many empirical papers using this framework for nowcasting.
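The first step can be illustrated with a small simulated panel; this is only a sketch of principal-components factor extraction under assumed dimensions, not the authors' code (the second step would re-estimate the factors with a Kalman smoother given these preliminary estimates):

import numpy as np

rng = np.random.default_rng(1)
T, n, r = 200, 100, 2                      # periods, series, factors
F = rng.standard_normal((T, r))            # latent factors (simulated)
L = rng.standard_normal((n, r))            # factor loadings
X = F @ L.T + 0.5 * rng.standard_normal((T, n))

Xs = (X - X.mean(0)) / X.std(0)            # standardize the panel
U, s, Vt = np.linalg.svd(Xs, full_matrices=False)
F_hat = np.sqrt(T) * U[:, :r]              # principal-components estimates
print(np.corrcoef(F_hat[:, 0], F[:, 0])[0, 1])  # close to +/-1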
References

Bates, J.M., Granger, C.W.J., 1969. The combination of forecasts. Operations Research Quarterly 20, 309–325.
Campbell, S.D., Diebold, F.X., 2005. Weather forecasting for weather derivatives. Journal of the American Statistical Association 100, 6–16.
Chao, J.C., Phillips, P.C.B., 1999. Model selection in partially nonstationary vector autoregressive processes with reduced rank structure. Journal of Econometrics 91, 227–271.
Chernozhukov, V., Hansen, C.B., 2006. Instrumental quantile regression inference for structural and treatment effect models. Journal of Econometrics 132 (2), 491–525.
Chernozhukov, V., Hansen, C.B., 2008. Instrumental variable quantile regression: a robust inference approach. Journal of Econometrics 142 (1), 379–398.
Chib, S., 1998. Estimation and comparison of multiple change point models. Journal of Econometrics 86, 221–241.
Clark, T.E., McCracken, M.W., 2001. Tests of equal forecast accuracy and encompassing for nested models. Journal of Econometrics 105 (1), 85–110.
Diebold, F.X., Mariano, R.S., 1995. Comparing predictive accuracy. Journal of Business and Economic Statistics 13, 253–263.
Elliott, G., Granger, C.W.J., Timmermann, A. (Eds.), 2006. Handbook of Economic Forecasting. North-Holland, Amsterdam.
Gao, J., Hawthorne, K., 2006. Semiparametric estimation and testing of the trend of temperature series. Econometrics Journal 9, 332–355.
Giacomini, R., White, H., 2006. Tests of conditional predictive ability. Econometrica 74 (6), 1545–1578.
Giannone, D., Reichlin, L., Sala, L., 2004. Monetary policy in real time. In: Gertler, M., Rogoff, K. (Eds.), NBER Macroeconomics Annual. MIT Press, pp. 161–200.
Giannone, D., Reichlin, L., Small, D., 2008. Nowcasting: the real-time informational content of macroeconomic data. Journal of Monetary Economics 55 (4), 665–676.
Hall, S., Mitchell, J., 2007. Combining density forecasts. International Journal of Forecasting 23, 1–13.
Issler, J.V., Lima, L.R., 2009. A panel data approach to economic forecasting: the bias-corrected average forecast. Journal of Econometrics 152, 153–164.
Marcellino, M., Stock, J.H., Watson, M.W., 2006. A comparison of direct and iterated multistep AR methods for forecasting macroeconomic time series. Journal of Econometrics 135 (1–2), 499–526.
McCracken, M., 2000. Robust out-of-sample inference. Journal of Econometrics 99 (2), 195–223.
Pateiro-López, B., González-Manteiga, W., 2006. Multivariate partially linear models. Statistics & Probability Letters 76, 1543–1549.
Pesaran, M., Pettenuzzo, D., Timmermann, A., 2006. Forecasting time series subject to multiple break points. Review of Economic Studies 73, 1057–1084.
Timmermann, A., 2006. Forecast combinations. In: Elliott, G., Granger, C.W.J., Timmermann, A. (Eds.), Handbook of Economic Forecasting. North-Holland, Amsterdam, pp. 135–196.
Vahid, F., Issler, J.V., 2002. The importance of common cyclical features in VAR analysis: a Monte-Carlo study. Journal of Econometrics 109, 341–363.
Wallis, K.F., 2005. Combining density and interval forecasts: a modest proposal. Oxford Bulletin of Economics and Statistics 67, 983–994.
West, K., 1996. Asymptotic inference about predictive ability. Econometrica 64, 1067–1084.
West, K.D., McCracken, M.W., 1998. Regression-based tests of predictive ability. International Economic Review 39 (4), 817–840.
João Victor Issler∗
Graduate School of Economics—EPGE, Getulio Vargas Foundation, Brazil
E-mail address: [email protected]

Oliver Linton
London School of Economics, United Kingdom

Allan Timmermann
University of California, San Diego and CREATES, United States
E-mail address: [email protected]

Available online 28 February 2011

∗ Corresponding editor.
The affine arbitrage-free class of Nelson–Siegel term structure models

Jens H.E. Christensen (a), Francis X. Diebold (b,∗), Glenn D. Rudebusch (a)

a Federal Reserve Bank of San Francisco, United States
b University of Pennsylvania and NBER, United States
Article history: Available online 26 February 2011
JEL classification: E43; G12; G17
Keywords: Yield curve; Interest rate; Bond market; Factor model; Forecasting
Abstract

We derive the class of affine arbitrage-free dynamic term structure models that approximate the widely used Nelson–Siegel yield curve specification. These arbitrage-free Nelson–Siegel (AFNS) models can be expressed as slightly restricted versions of the canonical representation of the three-factor affine arbitrage-free model. Imposing the Nelson–Siegel structure on the canonical model greatly facilitates estimation and can improve predictive performance. In the future, AFNS models appear likely to be a useful workhorse representation for term structure research.
1. Introduction

Understanding the dynamic evolution of the yield curve is important for many tasks, including pricing financial assets and their derivatives, managing financial risk, allocating portfolios, structuring fiscal debt, conducting monetary policy, and valuing capital goods. To investigate yield curve dynamics, researchers have produced a vast literature with a wide variety of models. However, those models tend to be either theoretically rigorous but empirically disappointing, or empirically successful but theoretically lacking. In this paper, we introduce a theoretically rigorous yield curve model that simultaneously displays empirical tractability, good fit, and good forecasting performance. Because bonds trade in deep and well-organized markets, the theoretical restrictions that eliminate opportunities for riskless arbitrage across maturities and over time hold powerful appeal, and they provide the foundation for a large finance literature on arbitrage-free (AF) models that started with Vasiček (1977) and Cox et al. (1985). Those models specify the risk-neutral evolution of the underlying yield curve factors as well as the dynamics of risk premia. Following Duffie and Kan (1996), the affine versions of those models are particularly popular, because yields are convenient linear functions of underlying latent factors (state variables that are unobserved by the econometrician) with parameters, or "factor loadings", that can be calculated from a simple system of differential equations.
∗ Corresponding author. Tel.: +1 2158981507; fax: +1 2156890849. E-mail address: [email protected] (F.X. Diebold).
Unfortunately, the canonical affine AF models often exhibit poor empirical time series performance, especially when forecasting future yields (Duffee, 2002). In addition, and crucially, the estimation of those models is known to be problematic, in large part because of the existence of numerous likelihood maxima that have essentially identical fit to the data but very different implications for economic behavior. The empirical problems appear to reflect an underlying model over-parameterization, and as a solution, many researchers (e.g. Duffee, 2002; Dai and Singleton, 2002) simply restrict to zero those parameters with small t-statistics in a first round of estimation. The resulting more parsimonious structure is typically somewhat easier to estimate and has fewer troublesome likelihood maxima. However, the additional restrictions on model structure are not well motivated theoretically or statistically, and their arbitrary application and the computational burden of estimation effectively preclude robust model validation and thorough simulation studies of the finite-sample properties of the estimators. In part to overcome the problems with empirical implementation of the canonical affine AF model, we develop in this paper a new class of affine AF models based on the workhorse yield curve representation introduced by Nelson and Siegel (1987) and extended to dynamic environments by Diebold and Li (2006). (We refer to the Diebold–Li extension as dynamic Nelson–Siegel, or DNS.) Thus, from one perspective, we take the theoretically rigorous but empirically problematic affine AF model and make it empirically tractable by incorporating DNS elements. From an alternative perspective, we take the DNS model, which is empirically successful but theoretically lacking, and make it
rigorous by imposing absence of arbitrage. This rigor is important because the Nelson–Siegel model is extremely popular in practice, among both financial market practitioners and central banks (e.g., Svensson, 1995; Bank for International Settlements, 2005; Gürkaynak et al., 2007). DNS’s popularity stems from several sources, both empirical and theoretical, as discussed in Diebold and Li (2006). Empirically, the DNS model is simple and stable to estimate, and it is quite flexible and fits both the cross-section and time series of yields remarkably well, in many countries and periods, and for many grades of bonds. Theoretically, DNS imposes certain economically desirable properties, such as requiring the discount function to approach zero with maturity, and Diebold and Li (2006) show that it corresponds to a modern three-factor model of time-varying level, slope and curvature. However, despite its good empirical performance and a certain amount of theoretical appeal, DNS fails on an important theoretical dimension: it does not impose the restrictions necessary to eliminate opportunities for riskless arbitrage (e.g. Filipović, 1999; Diebold et al., 2005). This motivates us in this paper to introduce the class of AF Nelson–Siegel (AFNS) models, which are affine AF term structure models that maintain the DNS factor loading structure. In short, the AFNS models proposed here combine the best of the AF and DNS traditions. Approached from the AF side, they maintain the AF theoretical restrictions of the canonical affine models but can be easily estimated, because the dynamic Nelson–Siegel structure helps to identify the latent yield curve factors and delivers analytical solutions (which we provide) for zero-coupon bond prices. Approached from the DNS side, they maintain the simplicity and empirical tractability of the popular DNS models, while simultaneously enforcing the theoretically desirable property of absence of riskless arbitrage. After deriving the new class of AFNS models, we examine their in-sample fit and out-of-sample forecast performance relative to standard DNS models. For both the DNS and the AFNS models, we estimate parsimonious and flexible versions (with both independent factors and more richly parameterized correlated factors). We find that the flexible versions of both models are preferred for in-sample fit, but that the parsimonious versions exhibit significantly better out-of-sample forecast performance. As a final comparison, we also show that an AFNS model can outperform the canonical affine AF model in forecasting. We proceed as follows. First we present the main theoretical results of the paper; in Section 2 we derive the AFNS class of models, and in Section 3 we characterize the precise relationship between the AFNS class and the canonical representation of affine AF models. We next provide an empirical analysis of four leading DNS and AFNS models, incorporating both parsimonious and flexible versions; in Section 4 we examine in-sample fit, and in Section 5 we examine out-of-sample forecasting performance. We conclude in Section 6, and we provide proofs and additional technical details in several appendices.
2. Nelson–Siegel term structure models

Here we review the DNS model and introduce the AFNS class of AF affine term structure models that maintain the Nelson–Siegel factor loading structure.

2.1. The dynamic Nelson–Siegel model

The original Nelson–Siegel model fits the yield curve with the simple functional form

y(τ) = β_0 + β_1 (1 − e^{−λτ})/(λτ) + β_2 ((1 − e^{−λτ})/(λτ) − e^{−λτ}),   (1)

where y(τ) is the zero-coupon yield with τ months to maturity, and β_0, β_1, β_2, and λ are parameters.

As noted earlier, this representation is commonly used by central banks and financial market practitioners to fit the cross-section of yields. Although such a static representation is useful for some purposes, a dynamic version is required to understand the evolution of the bond market over time. Hence Diebold and Li (2006) suggest allowing the β coefficients to vary over time, in which case, given their Nelson–Siegel loadings, the coefficients may be interpreted as time-varying level, slope and curvature factors. To emphasize this, we re-write the model as

y_t(τ) = L_t + S_t (1 − e^{−λτ})/(λτ) + C_t ((1 − e^{−λτ})/(λτ) − e^{−λτ}).   (2)

Diebold and Li assume an autoregressive structure for the factors, which produces the DNS model, a fully dynamic Nelson–Siegel specification. Indeed, it is a state-space model, with the yield factors as state variables, as emphasized in Diebold et al. (2006). Empirically, the DNS model is highly tractable and typically fits well. Theoretically, however, it does not require that the dynamic evolution of yields cohere such that arbitrage opportunities are precluded. Indeed, the results of Filipović (1999) imply that whatever stochastic dynamics are chosen for the DNS factors, it is impossible to preclude arbitrage at the bond prices implicit in the resulting Nelson–Siegel yield curve. In the next subsection, we show how to remedy this theoretical weakness.

2.2. The arbitrage-free Nelson–Siegel model

Our derivation of the AFNS model starts from the standard continuous-time affine AF structure of Duffie and Kan (1996).¹ To represent an affine diffusion process, define a filtered probability space (Ω, F, (F_t), Q), where the filtration (F_t) = {F_t : t ≥ 0} satisfies the usual conditions (Williams, 1997). The state variable X_t is assumed to be a Markov process defined on a set M ⊂ R^n that solves the stochastic differential equation (SDE)²

dX_t = K^Q(t)[θ^Q(t) − X_t] dt + Σ(t) D(X_t, t) dW_t^Q,   (3)

where W^Q is a standard Brownian motion in R^n, the information of which is contained in the filtration (F_t). The drift θ^Q : [0, T] → R^n and dynamics K^Q : [0, T] → R^{n×n} are bounded, continuous functions.³ Similarly, the volatility matrix Σ : [0, T] → R^{n×n} is a bounded, continuous function, while D : M × [0, T] → R^{n×n} has a diagonal structure with ith diagonal entry √(γ^i(t) + δ_1^i(t) x_t^1 + ⋯ + δ_n^i(t) x_t^n). To simplify the notation, γ(t) and δ(t) are defined as

γ(t) = [γ^1(t); …; γ^n(t)]   and   δ(t) = [δ_1^1(t) ⋯ δ_n^1(t); …; δ_1^n(t) ⋯ δ_n^n(t)],

where γ : [0, T] → R^n and δ : [0, T] → R^{n×n} are bounded, continuous functions. Given this notation, the SDE of the state variables can be written as

dX_t = K^Q(t)[θ^Q(t) − X_t] dt + Σ(t) diag( √(γ^1(t) + δ^1(t)X_t), …, √(γ^n(t) + δ^n(t)X_t) ) dW_t^Q,

where δ^i(t) denotes the ith row of the δ(t) matrix.

1 Krippner (2006) derives a special case of the AFNS model with constant risk premiums.
2 Note that (3) refers to the risk-neutral ("Q") dynamics.
3 Stationarity of the state variables is ensured if the real components of all eigenvalues of K^Q(t) are positive; see Ahn et al. (2002). However, stationarity is not a necessary requirement for the process to be well defined.
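For intuition, a constant-coefficient Gaussian special case of the SDE (3) can be simulated by Euler discretization and mapped to a yield with the loadings of Eq. (2); all parameter values below are illustrative assumptions, not estimates:

import numpy as np

rng = np.random.default_rng(0)
K = np.diag([0.2, 0.6, 1.0])               # assumed mean-reversion rates
theta = np.array([5.0, -1.0, 0.5])         # assumed factor means
Sigma = np.diag([0.3, 0.5, 0.8])           # constant volatility matrix
dt, T = 1.0 / 12.0, 240

X = np.zeros((T, 3)); X[0] = theta
for t in range(1, T):                      # Euler-Maruyama step
    X[t] = X[t-1] + K @ (theta - X[t-1]) * dt \
           + Sigma @ rng.standard_normal(3) * np.sqrt(dt)

lam, tau = 0.0609, 24.0                    # decay parameter, maturity (months)
slope = (1 - np.exp(-lam * tau)) / (lam * tau)
y_24m = X @ np.array([1.0, slope, slope - np.exp(-lam * tau)])  # Eq. (2)
print(y_24m[:5])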
Finally, the instantaneous risk-free rate is assumed to be an affine function of the state variables,

r_t = ρ_0(t) + ρ_1(t)′ X_t,

where ρ_0 : [0, T] → R and ρ_1 : [0, T] → R^n are bounded, continuous functions.

Duffie and Kan (1996) prove that zero-coupon bond prices in this framework are exponential-affine functions of the state variables,

P(t, T) = E_t^Q[ exp( −∫_t^T r_u du ) ] = exp( B(t, T)′ X_t + A(t, T) ),   (4)

where B(t, T) and A(t, T) are the solutions to the system of ordinary differential equations (ODEs)

dB(t, T)/dt = ρ_1 + (K^Q)′ B(t, T) − (1/2) ∑_{j=1}^n (Σ′ B(t, T) B(t, T)′ Σ)_{j,j} (δ^j)′,   B(T, T) = 0,   (5)

dA(t, T)/dt = ρ_0 − B(t, T)′ K^Q θ^Q − (1/2) ∑_{j=1}^n (Σ′ B(t, T) B(t, T)′ Σ)_{j,j} γ^j,   A(T, T) = 0,   (6)

and the possible time dependence of the parameters is suppressed in the notation. The pricing functions imply that zero-coupon yields are

y(t, T) = −(1/(T − t)) log P(t, T) = −(B(t, T)′/(T − t)) X_t − A(t, T)/(T − t).

Given the pricing functions, for a three-factor affine model with X_t = (X_t^1, X_t^2, X_t^3), the closest match to the Nelson–Siegel yield function is a yield function of the form⁴

y(t, T) = X_t^1 + ((1 − e^{−λ(T−t)})/(λ(T − t))) X_t^2 + ((1 − e^{−λ(T−t)})/(λ(T − t)) − e^{−λ(T−t)}) X_t^3 − A(t, T)/(T − t),

with ODEs for the B(t, T) functions that have the solutions

B^1(t, T) = −(T − t),
B^2(t, T) = −(1 − e^{−λ(T−t)})/λ,
B^3(t, T) = (T − t) e^{−λ(T−t)} − (1 − e^{−λ(T−t)})/λ.

In this case the factor loadings exactly match Nelson–Siegel, but there is an unavoidable "yield-adjustment term", −A(t, T)/(T − t), which depends only on the maturity of the bond, not on time. As described in the following proposition, there exists a class of affine AF models that satisfies the above ODEs.

Proposition 1. Suppose that the instantaneous risk-free rate is

r_t = X_t^1 + X_t^2,

where the state variables X_t = (X_t^1, X_t^2, X_t^3) are described by the following system of SDEs under the risk-neutral Q-measure

[dX_t^1; dX_t^2; dX_t^3] = [0 0 0; 0 λ −λ; 0 0 λ] ([θ_1^Q; θ_2^Q; θ_3^Q] − [X_t^1; X_t^2; X_t^3]) dt + Σ dW_t^Q,   λ > 0.

Then zero-coupon bond prices are

P(t, T) = E_t^Q[ exp( −∫_t^T r_u du ) ] = exp( B^1(t, T) X_t^1 + B^2(t, T) X_t^2 + B^3(t, T) X_t^3 + A(t, T) ),

where B^1(t, T), B^2(t, T), B^3(t, T), and A(t, T) are the solutions to the system of ODEs

[dB^1(t, T)/dt; dB^2(t, T)/dt; dB^3(t, T)/dt] = [1; 1; 0] + [0 0 0; 0 λ 0; 0 −λ λ] [B^1(t, T); B^2(t, T); B^3(t, T)]   (7)

and

dA(t, T)/dt = −B(t, T)′ K^Q θ^Q − (1/2) ∑_{j=1}^3 (Σ′ B(t, T) B(t, T)′ Σ)_{j,j},

with boundary conditions B^1(T, T) = B^2(T, T) = B^3(T, T) = A(T, T) = 0. The solution to this system of ODEs is

B^1(t, T) = −(T − t),
B^2(t, T) = −(1 − e^{−λ(T−t)})/λ,
B^3(t, T) = (T − t) e^{−λ(T−t)} − (1 − e^{−λ(T−t)})/λ,
A(t, T) = (K^Q θ^Q)_2 ∫_t^T B^2(s, T) ds + (K^Q θ^Q)_3 ∫_t^T B^3(s, T) ds + (1/2) ∑_{j=1}^3 ∫_t^T (Σ′ B(s, T) B(s, T)′ Σ)_{j,j} ds.

Finally, zero-coupon bond yields are

y(t, T) = X_t^1 + ((1 − e^{−λ(T−t)})/(λ(T − t))) X_t^2 + ((1 − e^{−λ(T−t)})/(λ(T − t)) − e^{−λ(T−t)}) X_t^3 − A(t, T)/(T − t).

Proof. See Appendix A.

The existence of an AFNS model, as defined in Proposition 1, is related to the work of Trolle and Schwartz (2009), who show that the dynamics of the forward rate curve in a general m-dimensional Heath–Jarrow–Morton (HJM) model can always be represented by a finite-dimensional Markov process with time-homogeneous volatility structure if each volatility function is given by

σ_i(t, T) = p_{n,i}(T − t) e^{−γ_i(T−t)},   i = 1, …, m,

where p_{n,i}(τ) is an nth-order polynomial in τ. Because the forward rates in the DNS model satisfy this requirement, there exists such an AF three-dimensional HJM model. However, the simplicity of the solution in the case of the Nelson–Siegel model presented in Proposition 1 is striking.

4 One could of course define "closest" in other ways. Our strategy is to find the affine AF model with factor loadings that match Nelson–Siegel exactly.
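The closed-form B-functions are easy to verify numerically; the following sketch, with illustrative values, checks that −B(t,T)/(T−t) reproduces the Nelson–Siegel factor loadings:

import numpy as np

def afns_B(lam, m):                        # m = T - t
    B1 = -m
    B2 = -(1 - np.exp(-lam * m)) / lam
    B3 = m * np.exp(-lam * m) - (1 - np.exp(-lam * m)) / lam
    return np.array([B1, B2, B3])

lam, m = 0.5, 5.0
loadings = -afns_B(lam, m) / m             # yield factor loadings
ns = np.array([1.0,
               (1 - np.exp(-lam * m)) / (lam * m),
               (1 - np.exp(-lam * m)) / (lam * m) - np.exp(-lam * m)])
print(np.allclose(loadings, ns))           # True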
Proposition 1 also has several interesting implications. First, the three state variables are Gaussian Ornstein–Uhlenbeck processes with a constant volatility matrix Σ.⁵ The instantaneous interest rate is the sum of level and slope factors (X_t^1 and X_t^2), while the curvature factor's (X_t^3) sole role is as a stochastic time-varying mean for the slope factor under the Q-measure. Second, Proposition 1 only imposes structure on the dynamics of the AFNS model under the Q-measure and is silent about the dynamics under the P-measure. Still, the very indirect role of curvature generally accords with the empirical literature, where it has been difficult to find sensible interpretations of curvature under the P-measure (Diebold et al., 2006). Similarly, the level factor is a unit-root process under the Q-measure, which accords with the usual finding that one or more of the interest rate factors are close to being nonstationary processes under the P-measure.⁶ Third, Proposition 1 provides insight into the nature of the parameter λ. Although a few authors (e.g. Koopman et al., 2010) have considered time-varying λ, it is a constant in the AFNS model and has the interpretation as the mean-reversion rate of the curvature and slope factors as well as the scale by which a deviation of the curvature factor from its mean affects the mean of the slope factor. Fourth, and crucially, AFNS contains an additional maturity-dependent term −A(t, T)/(T − t) relative to DNS. This "yield-adjustment" term is a key difference between DNS and AFNS, and we now examine it in detail.

2.3. The yield-adjustment term

The only parameters in the system of ODEs for the AFNS B(t, T) functions are ρ_1 and K^Q, i.e., the factor loadings of r_t and the mean-reversion structure for the state variables under the Q-measure. The drift term θ^Q and the volatility matrix Σ do not appear in the ODEs, but rather in the yield-adjustment term −A(t, T)/(T − t). Hence in the AFNS model the choice of the volatility matrix Σ affects both the P-dynamics and the yield function through the yield-adjustment term. In contrast, the DNS model is silent about the real-world dynamics of the state variables, so the choice of P-dynamics is irrelevant for the yield function. As discussed in the next section, we identify the AFNS models by fixing the mean levels of the state variables under the Q-measure at 0, i.e. θ^Q = 0. This implies that the yield-adjustment term is of the form

−A(t, T)/(T − t) = −(1/2) (1/(T − t)) ∑_{j=1}^3 ∫_t^T (Σ′ B(s, T) B(s, T)′ Σ)_{j,j} ds.

Given a general volatility matrix

Σ = [σ_11 σ_12 σ_13; σ_21 σ_22 σ_23; σ_31 σ_32 σ_33],

the yield-adjustment term can be derived in analytical form (see Appendix B) as

A(t, T)/(T − t) = A (T − t)²/6
+ B [ 1/(2λ²) − (1/λ³)(1 − e^{−λ(T−t)})/(T − t) + (1/(4λ³))(1 − e^{−2λ(T−t)})/(T − t) ]
+ C [ 1/(2λ²) + (1/λ²) e^{−λ(T−t)} − (1/(4λ))(T − t) e^{−2λ(T−t)} − (3/(4λ²)) e^{−2λ(T−t)} − (2/λ³)(1 − e^{−λ(T−t)})/(T − t) + (5/(8λ³))(1 − e^{−2λ(T−t)})/(T − t) ]
+ D [ (1/(2λ))(T − t) + (1/λ²) e^{−λ(T−t)} − (1/λ³)(1 − e^{−λ(T−t)})/(T − t) ]
+ E [ (3/λ²) e^{−λ(T−t)} + (1/(2λ))(T − t) + (1/λ)(T − t) e^{−λ(T−t)} − (3/λ³)(1 − e^{−λ(T−t)})/(T − t) ]
+ F [ 1/λ² + (1/λ²) e^{−λ(T−t)} − (1/(2λ²)) e^{−2λ(T−t)} − (3/λ³)(1 − e^{−λ(T−t)})/(T − t) + (3/(4λ³))(1 − e^{−2λ(T−t)})/(T − t) ],

where A = σ_11² + σ_12² + σ_13², B = σ_21² + σ_22² + σ_23², C = σ_31² + σ_32² + σ_33², D = σ_11σ_21 + σ_12σ_22 + σ_13σ_23, E = σ_11σ_31 + σ_12σ_32 + σ_13σ_33, and F = σ_21σ_31 + σ_22σ_32 + σ_23σ_33.

This result has two implications. First, the fact that AFNS zero-coupon bond yields are given by an analytical formula greatly facilitates empirical implementation of AFNS models. Second, the nine underlying volatility parameters are not identified. Indeed, only the six terms A, B, C, D, E, and F can be identified; hence the maximally flexible AFNS specification that can be identified has triangular volatility matrix⁷

Σ = [σ_11 0 0; σ_21 σ_22 0; σ_31 σ_32 σ_33].

Later we will quantify the yield-adjustment term and examine how it affects empirical performance in leading specifications, to which we now turn.

5 Proposition 1 can be extended to include jumps in the state variables. As long as the jump arrival intensity is state-independent, the Nelson–Siegel factor loading structure in the yield function is maintained because only A(t, T) is affected by the inclusion of such jumps. See Duffie et al. (2000) for the needed modification of the ODE for A(t, T) in this case.
6 With a unit root in the level factor, −A(t, T)/(T − t) → −∞ as maturity increases, which implies that with an unbounded horizon T the model is not arbitrage-free. Therefore, as is often done in theoretical discussions, we impose an arbitrary maximum horizon. Alternatively, we could modify the mean-reversion matrix K^Q to include a sufficiently small ε > 0 in the upper left-hand position to obtain an AF model that is indistinguishable from the AFNS model in Proposition 1.
7 The choice of upper or lower triangular is irrelevant.
J.H.E. Christensen et al. / Journal of Econometrics 164 (2011) 4–20
where the stochastic shocks ηt (L), ηt (S ), and ηt (C ) have covariance matrix 2 q11 0 0 Q = 0 q222 0 . 0 0 q233 In the correlated-factor DNS model, the state variables follow a first-order vector autoregression, as in Diebold et al. (2006). The transition equation is
ηt (L) + ηt (S ) , = ηt (C ) where the stochastic shocks ηt (L), ηt (S ), and ηt (C ) have covariance matrix Q = qq′ , where
L t − µL St − µS Ct − µC
q=
q11 q21 q31
a11 a21 a31
0 q22 q32
a12 a22 a32
0 0 q33
a13 a23 a33
Lt −1 − µL S t − 1 − µS C t − 1 − µC
1 − e−λτ1
1 − e−λτ1
− e−λτ1 λτ1 1 − e−λτ2 −λτ2 −e λτ2 .. . −λτN −λτN 1−e 1−e 1 − e−λτN λτN λτN εt (τ1 ) Lt εt (τ2 ) × St + .. , . Ct εt (τN ) where the measurement errors εt (τi ) are i.i.d. white noise. 1
λτ1 1 − e−λτ2 λτ2 .. .
yt (τ1 ) 1 yt (τ2 ) . = . . . . . yt (τN )
(8)
dWt = dWtP + Γt dt , Q
where Γt represents the risk premium. To preserve affine dynamics under the P-measure, we limit our focus to essentially affine risk premium specifications (see Duffee, 2002), in which case Γt takes the form 1 0 1 1 1 γ13 Xt γ1 γ11 γ12 1 1 1 2 Γt = γ20 + γ21 γ22 γ23 Xt . 1 1 1 γ30 γ31 γ32 γ33 Xt3 With this specification the SDE for the state variables under the Pmeasure, dXt = K P [θ P − Xt ]dt + Σ dWtP ,
(9)
remains affine. Due to the flexible specification of Γt , we are free to choose any mean vector θ P and mean-reversion matrix K P under the P-measure and still preserve the required Q -dynamic structure described in Proposition 1. Hence we focus on the two AFNS models that correspond to the two DNS models above. In the independent-factor AFNS model, the three state variables are independent under the P-measure, P dXt1 κ11 2 dXt = 0 0 dXt3
+
κ
P 22
0
σ1 0 0
0 θ1P Xt1 P 0 θ2 − Xt2 dt P κ33 θ3P Xt3
0
0
σ2 0
dW 1,P 0 t 0 dWt2,P . 3 ,P σ3 dWt
+
σ11 σ21 σ31
P κ12 P κ22 P κ32
P 1 P Xt κ13 θ1 P P κ23 θ2 − Xt2 dt P κ33 θ3P Xt3 dW 1,P 0 0 t σ22 0 dWt2,P . 3,P σ32 σ33 dWt
This is the most flexible AFNS model with all parameters identified. In both the independent-factor and correlated-factor AFNS models, the measurement equation is
1
yt (τ1 ) yt (τ2 )
1 . = . . . . . yt (τN )
1
The corresponding AFNS models are formulated in continuous time, and the relationship between the real-world dynamics under the P-measure and the risk-neutral dynamics under the Q measure is given by the measure change
In both the independent-factor and correlated-factor DNS models, the measurement equation is
P dXt1 κ11 P dXt2 = κ21 P 3 κ31 dXt
.
In the correlated-factor AFNS model, the three state variables may interact dynamically and/or their shocks may be correlated,
1 − e−λτ1
1 − e−λτ1
λτ1
λτ1
1 − e−λτ2
1 − e−λτ2
λτ2 .. .
λτ2
1 − e−λτN
−e
−λτ2
.. .
1 − e−λτN
λτN λτN A(τ ) 1 τ 1 εt (τ1 ) A(τ2 ) εt (τ2 ) − τ2 + . , .. .. . εt (τN ) A(τN ) τN
− e−λτ1
− e−λτN
1 X t2 X t3 X t
(10)
where the measurement errors εt (τi ) are i.i.d. noise. 3. The AFNS subclass of canonical affine AF models Before proceeding to an empirical analysis of the various DNS and AFNS models, we first answer a key theoretical question: What, precisely, are the restrictions that the AFNS model imposes on the canonical representation of three-factor affine AF models?8 Denoting the state variables by Yt , the canonical A0 (3) model is rt = δ0n + (δ1n )′ Yt dYt = KnP [θnP − Yt ]dt + Σn dWtP dYt = KnQ [θnQ − Yt ]dt + Σn dWt , Q
with δ0n ∈ R, δ1n , θnP , θnQ ∈ R3 , and KnP , KnQ , Σn ∈ R3×3 .9 If the essentially affine risk premium specification Γt = γn0 + γn1 Yt is imposed on the model, the drift terms under the P-measure (KnP , θnP ) can be chosen independently of the drift terms under the Q -measure (KnQ , θnQ ). Because the latent state variables may rotate without changing the probability distribution of bond yields, not all parameters in the above model can be identified. Singleton (2006) imposes identifying restrictions under the Q -measure. Specifically, he sets the mean θnQ = 0, the volatility matrix Σn equal to the identity matrix, and he sets the mean-reversion matrix KnQ equal to a
8 By this we mean the A (3) representation with three state variables and zero 0 square-root processes, as detailed in Singleton (2006, Chap. 12). 9 Note that Y denotes the state variables of the canonical representation, which t
are different from the Xt state variables in the AFNS models, and that subscripts or superscripts of ‘‘n’’ denote coefficients in the canonical representation.
J.H.E. Christensen et al. / Journal of Econometrics 164 (2011) 4–20
9
Table 1 AFNS parameter restrictions on the A0 (3) canonical representation. AFNS model
δ0n , δ1n
KnQ
KnP
θnP
Restrictions
Independent factor
δ0n = 0 δ1n,3 = 0
κ1n,,1Q = κ1n,,2Q = κ1n,,3Q = 0 κ2n,,2Q = κ3n,,3Q
KnP is diagonal
No restriction
12
Correlated factor
δ0n = 0
κ1n,,1Q = 0 κ2n,,2Q = κ3n,,3Q
No restriction
No restriction
3
triangular matrix.10 The canonical representation then has Q dynamics n ,Q
n,Q κ12 n,Q κ22
κ11 dYt1 dYt2 = − 0 dYt3 0
+
1 0 0
0 1 0
1 n ,Q κ13 Yt n ,Q 2 Yt dt κ23 n ,Q Yt3 0 κ33 1 ,Q
4. Estimation and in-sample fit of DNS and AFNS models
dWt 0 0 dWt2,Q , 3 ,Q 1 dWt
and P-dynamics n ,P
κ11 dYt1 dYt2 = κ n,P 21 n ,P dYt3 κ31
+
1 0 0
n,P 1 n,P θ1 κ13 Yt n,P n,P − Yt2 dt θ2 κ23 n,P Yt3 θ3n,P κ33 dW 1,P 0 t 0 dWt2,P .
n ,P κ12 n ,P κ22 n ,P κ32
0 1 0
1
form solution and, as described in the next section, eliminate in an appealing way the surfeit of troublesome likelihood maxima in estimation.
3 ,P
dWt
The instantaneous risk-free rate is rt = δ0n + δ1n,1 Yt1 + δ1n,2 Yt2 + δ1n,3 Yt3 . Hence there are 22 free parameters in the canonical representation of the A0 (3) model class.11 In the AFNS class, the mean-reversion matrix under the Q measure is triangular, so it is straightforward to derive the restrictions that must be imposed on the canonical affine representation to obtain the class of AFNS models. The procedure through which the restrictions are identified is based on the socalled affine invariant transformations. Appendix C describes such transformations and derives the restrictions associated with the AFNS models considered in this paper. The results are summarized in Table 1, which shows that for the correlated-factor AFNS model there are three key parameter restrictions on the canonical affine model. First, δ0n = 0, so there is no constant in the equation for the instantaneous risk-free rate. There is no need for this constant n,Q because, with the second restriction κ1,1 = 0, the first factor must be a unit-root process under the Q -measure, which also implies n ,Q that this factor can be identified as the level factor. Finally, κ2,2 =
κ3n,,3Q , so the own mean-reversion rates of the second and third
factors under the Q -measure must be identical. The independentfactor AFNS model maintains the three parameter restrictions and adds nine others under both the P- and Q -measures.12 The Nelson–Siegel parameter restrictions on the canonical affine AF model greatly facilitate estimation.13 They allow a closed-
Thus far we have derived the affine AF class of Nelson–Siegel term structure models, and we have explicitly characterized the restrictions that it places on the canonical A0 (3) model. Here we undertake estimation of the AFNS model and illustrate its relative simplicity. We proceed in several steps. First we introduce the state-space/Kalman-filter maximum-likelihood estimation framework that we employ throughout. Second, we estimate and compare independent- and correlated-factor DNS models. Third, we estimate independent- and correlated-factor AFNS models, which we compare to each other and to their DNS counterparts, devoting special attention to the estimated yield-adjustment terms. Throughout, our estimates are based on monthly US Treasury bill and bond yields from January 1987 to December 2002. The data are end-of-month, unsmoothed (Fama and Bliss, 1987) zero-coupon yields at sixteen maturities: 3, 6, 9, 12, 18, 24, 36, 48, 60, 84, 96, 108, 120, 180, 240, and 360 months. 4.1. Estimation framework We first display the state-space representations of the DNS and AFNS models. For the DNS models, the state transition equation is Xt = (I − A)µ + AXt −1 + ηt , where Xt = (Lt , St , Ct ), and the measurement equation is yt = BXt + εt .
(11)
For the continuous-time AFNS models, the conditional mean vector and the conditional covariance matrix are

E^P[X_T | F_t] = (I − exp(−K^P Δt)) θ^P + exp(−K^P Δt) X_t,
V^P[X_T | F_t] = ∫_0^{Δt} e^{−K^P s} ΣΣ′ e^{−(K^P)′ s} ds,

where Δt = T − t. We compute conditional moments of discrete observations and obtain the AFNS state transition equation

X_t = (I − exp(−K^P Δt)) θ^P + exp(−K^P Δt) X_{t−1} + η_t,

where Δt is the time between observations. The AFNS measurement equation is¹⁴

y_t = A + B X_t + ε_t.

14 Note that the matrix B is identical in the DNS and AFNS models (compare Eqs. (8) and (10)). The only difference is the addition of the vector A containing the yield-adjustment terms in the AFNS models.
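Under assumed parameter values, these conditional moments can be sketched numerically with a matrix exponential and a simple quadrature for the covariance integral (a numerical sketch only; the paper itself relies on analytical solutions):

import numpy as np
from scipy.linalg import expm

def afns_moments(K_P, theta_P, Sigma, dt, n_grid=200):
    Phi1 = expm(-K_P * dt)                             # exp(-K^P dt)
    Phi0 = (np.eye(len(theta_P)) - Phi1) @ theta_P
    grid = np.linspace(0.0, dt, n_grid)
    SS = Sigma @ Sigma.T
    vals = np.array([expm(-K_P*s) @ SS @ expm(-K_P.T*s) for s in grid])
    Q = np.trapz(vals, grid, axis=0)                   # conditional covariance
    return Phi0, Phi1, Q

K_P = np.diag([0.1, 0.5, 1.0])                         # assumed P-dynamics
Phi0, Phi1, Q = afns_moments(K_P, np.array([0.05, -0.02, -0.01]),
                             np.diag([0.01, 0.015, 0.02]), dt=1.0/12.0)
print(np.round(Phi1, 4)); print(np.round(Q, 8))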
In both the DNS and AFNS environments, the assumed error structure is

[η_t; ε_t] ∼ N( [0; 0], [Q 0; 0 H] ),

where the matrix H is diagonal, and the matrix Q is diagonal in the independent-factor case and non-diagonal in the correlated-factor case. In the AFNS case, moreover, Q has special structure,

Q = ∫_0^{Δt} e^{−K^P s} ΣΣ′ e^{−(K^P)′ s} ds.

In addition, the transition and measurement errors are assumed orthogonal to the initial state.

Now we consider Kalman filtering, which we use to evaluate the likelihood functions of the DNS and AFNS models. We initialize the filter at the unconditional mean and variance of the state variables under the P-measure.¹⁵ For the DNS models we have X_0 = μ and Σ_0 = V, where V solves V = AVA′ + Q. For the AFNS models we have X_0 = θ^P and Σ_0 = ∫_0^∞ e^{−K^P s} ΣΣ′ e^{−(K^P)′ s} ds, which we calculate using the analytical solutions provided in Fisher and Gilles (1996).

Denote the information available at time t by Y_t = (y_1, y_2, …, y_t), and denote model parameters by ψ. Consider period t − 1 and suppose that the state update X_{t−1} and its mean square error matrix Σ_{t−1} have been obtained. The prediction step is

X_{t|t−1} = E^P[X_t | Y_{t−1}] = Φ_t^{X,0}(ψ) + Φ_t^{X,1}(ψ) X_{t−1},
Σ_{t|t−1} = Φ_t^{X,1}(ψ) Σ_{t−1} Φ_t^{X,1}(ψ)′ + Q_t(ψ),

where for the DNS models we have Φ_t^{X,0} = (I − A)μ, Φ_t^{X,1} = A, and Q_t = Q, and for the AFNS models we have Φ_t^{X,0} = (I − exp(−K^P Δt))θ^P, Φ_t^{X,1} = exp(−K^P Δt), and Q_t = ∫_0^{Δt} e^{−K^P s} ΣΣ′ e^{−(K^P)′ s} ds, where Δt is the time between observations.

In the time t update step, X_{t|t−1} is improved by using the additional information contained in Y_t. We have

X_t = E[X_t | Y_t] = X_{t|t−1} + Σ_{t|t−1} B(ψ)′ F_t^{−1} v_t,
Σ_t = Σ_{t|t−1} − Σ_{t|t−1} B(ψ)′ F_t^{−1} B(ψ) Σ_{t|t−1},

where

v_t = y_t − E[y_t | Y_{t−1}] = y_t − A(ψ) − B(ψ) X_{t|t−1},
F_t = cov(v_t) = B(ψ) Σ_{t|t−1} B(ψ)′ + H(ψ),
H(ψ) = diag(σ_ε²(τ_1), …, σ_ε²(τ_N)).

At this point, the Kalman filter has delivered all ingredients needed to evaluate the Gaussian log likelihood, the prediction-error decomposition of which is

log l(y_1, …, y_T; ψ) = ∑_{t=1}^T ( −(N/2) log(2π) − (1/2) log|F_t| − (1/2) v_t′ F_t^{−1} v_t ),

where N is the number of observed yields. We numerically maximize the likelihood with respect to ψ using the Nelder–Mead simplex algorithm. Upon convergence, we obtain standard errors from the estimated covariance matrix,

Ω̂(ψ̂) = (1/T) [ (1/T) ∑_{t=1}^T (∂ log l_t(ψ̂)/∂ψ) (∂ log l_t(ψ̂)/∂ψ)′ ]^{−1},

where ψ̂ denotes the estimated model parameters.

Table 2
Estimated independent-factor DNS model. The top panel contains the estimated A matrix and μ vector. The bottom panel contains the estimated q matrix. Standard errors appear in parentheses. The estimated λ is 0.06040 (0.00100) for maturities measured in months. The maximized log likelihood is 16,332.94.

A matrix | A·,1            | A·,2            | A·,3            | μ
A1,·     | 0.9827 (0.0128) | 0               | 0               | 0.0696 (0.0137)
A2,·     | 0               | 0.9778 (0.0166) | 0               | −0.0249 (0.0151)
A3,·     | 0               | 0               | 0.9189 (0.0284) | −0.0108 (0.0079)

q matrix | q·,1            | q·,2            | q·,3
q1,·     | 0.0025 (0.0002) | 0               | 0
q2,·     | 0               | 0.0033 (0.0002) | 0
q3,·     | 0               | 0               | 0.0075 (0.0004)

Table 3
Estimated correlated-factor DNS model. The top panel contains the estimated A matrix and μ vector. The bottom panel contains the estimated q matrix. Standard errors appear in parentheses. The estimated λ is 0.06248 (0.00109) for maturities measured in months. The maximized log likelihood is 16,415.36.

A matrix | A·,1            | A·,2            | A·,3             | μ
A1,·     | 0.9874 (0.0165) | 0.0050 (0.0183) | −0.0097 (0.0157) | 0.0723 (0.0145)
A2,·     | 0.0066 (0.0228) | 0.9332 (0.0229) | 0.0819 (0.0202)  | −0.0294 (0.0159)
A3,·     | 0.0152 (0.0526) | 0.0401 (0.0418) | 0.9011 (0.0377)  | −0.0120 (0.0126)

q matrix | q·,1             | q·,2            | q·,3
q1,·     | 0.0025 (0.0001)  | 0               | 0
q2,·     | −0.0022 (0.0003) | 0.0023 (0.0001) | 0
q3,·     | 0.0028 (0.0007)  | 0.0006 (0.0006) | 0.0066 (0.0004)
4.2. DNS model estimation Independent-factor DNS estimates appear in Table 2, and correlated-factor DNS estimates appear in Table 3. In both models the level factor is the most persistent, and the curvature factor is least persistent. In the correlated-factor DNS model, only one off-diagonal element of the estimated A matrix is statistically significant.16 Volatility parameters are most easily compared by converting from Cholesky factors to conditional covariance matrices. For independent-factor DNS we have DNS Qindep = qq′
6.17 × 10−6 = 0 0
,
q·,1
0 1.11 × 10−5 0
0 0 5.58 × 10−5
,
(12)
and for correlated-factor DNS we have DNS Qcorr = qq′
15 We ensure covariance stationarity under the P-measure in the DNS case by restricting the eigenvalues of A to be less than 1, and in the AFNS case by restricting the real component of each eigenvalue of K P to be positive.
16 Interestingly, the significant parameter is A St ,Ct −1 , which is the key non-zero off-diagonal element required in Proposition 1 for the AFNS specification.
J.H.E. Christensen et al. / Journal of Econometrics 164 (2011) 4–20
11
Table 4 Summary statistics for in-sample model fit. Residual means and root mean squared errors for sixteen maturities. Maturities are in months; means and RMSEs are in basis points. Maturity
DNS indep-factor Mean
−1.64 −0.24 −0.54
3 6 9 12 18 24 36 48 60 84 96 108 120 180 240 360
4.04 7.22 1.18 −0.07 −0.67 −5.33 −1.22 1.31 0.03 −5.11 24.11 25.61 −29.62
6.03 × 10−6 = −5.47 × 10−6 6.76 × 10−6
DNS corr-factor RMSE 12.26 1.09 7.13 11.19 10.76 5.83 1.51 3.92 7.13 4.25 2.10 2.94 8.51 29.44 34.99 37.61
−5.47 × 10−6 1.01 × 10−5 −4.73 × 10−6
Mean
−1.84 −0.29 −0.51 4.11 7.28 1.19 −0.19 −0.85 −5.51 −1.30 1.29 0.07 −5.01 24.40 26.00 −29.12
6.76 × 10−6 −4.73 × 10−6 . 5.09 × 10−5
(13)
The variances of shocks to each state variable are similar across the independent- and correlated-factor DNS models, with level factor shocks the least volatile and curvature factor shocks the most volatile. The covariance estimates obtained in the correlatedfactor DNS model translate into a correlation of −0.701 for shocks to the level and slope factors, a correlation of 0.385 for shocks to the level and curvature factors, and a correlation of −0.208 for shocks to the slope and curvature factors. The independent- and correlated-factor DNS models are nested, so we can test the independent-factor restrictions using a standard likelihood-ratio (LR) test. Under the null hypothesis of independent-factor DNS, LR = 2[log L(θcorr ) − log L(θindep )] ∼ χ 2 (9). We obtain LR = 164.8, with associated p-value less than 0.0001, so we would formally reject the restrictions imposed in the independent-factor DNS model. This rejection reflects an elevated negative correlation between the shocks to the level and slope factors and a significant positive correlation through the mean-reversion matrix between changes in the slope factor and deviations of the curvature factor from its mean. Crucially, however, the extra parameters in the correlatedfactor model, although statistically significant, appear economically unimportant. That is, the increased flexibility of the correlated-factor DNS model provides little advantage in fitting observed yields, as documented in Table 4, which reports means and root mean squared errors (RMSEs) for model residuals. The RMSE differences appear negligible (typically less than one half of one basis point), maturity-by-maturity, and no consistent advantage across maturities accrues to the correlated-factor model. Interestingly, both models have difficulty fitting yields beyond the ten-year maturity, which suggests that a maturity-dependent yield-adjustment term could improve fit. We now examine the empirical performance of AFNS models, which incorporate precisely such yield adjustments.
AFNS indep-factor
AFNS corr-factor
RMSE
Mean
RMSE
11.96 1.34 6.92 10.86 10.42 5.29 2.09 4.03 7.31 4.25 2.02 3.11 8.53 29.66 35.33 37.18
−2.85 −1.19 −1.24
18.53 7.12 3.45 9.60 10.43 5.94 1.98 3.72 6.82 4.29 2.11 3.02 8.23 32.66 42.61 22.03
3.58 7.14 1.37 0.30 −0.40 −5.27 −1.50 1.03 −0.11 −4.95 27.87 35.96 1.37
Mean
−2.49 −0.03 −0.33 3.72 5.53 −1.18 −1.10 0.93 −2.01 0.89 1.05 −3.23 −11.65 3.85 4.32 −0.81
RMSE 11.55 0.64 6.91 10.14 8.33 4.37 3.16 4.13 5.22 3.83 1.83 5.26 14.00 16.53 23.97 23.04
term structure model is very difficult and time-consuming and effectively prevents the kind of repetitive re-estimation required in a comprehensive simulation study or out-of-sample forecast exercise, which we pursue with the AFNS model in the next section.17 By comparison, the estimation of the AFNS model is straightforward and robust in large part because the role of each latent factor is not left unidentified as in the maximally flexible A0 (3) model. Even though the factors are latent in the AFNS model, with the Nelson–Siegel factor loading structure, they can be clearly identified as level, slope, and curvature. This identification eliminates the troublesome local maxima reported by Kim and Orphanides (2005), i.e., maxima with likelihood values very close to the global maximum but with very different interpretations of the three factors and their dynamics.18 The estimated independent-factor AFNS model is reported in Table 5. Although the independent-factor DNS and AFNS models are non-nested, they contain the same number of parameters, so their likelihoods can be compared directly. The lower log likelihood value obtained for the AFNS model (16,280 vs. 16,332) suggests weaker in-sample performance, which appears consistent with the RMSEs in Table 4. Although the two independent-factor models differ statistically, they are quite similar economically, as can be seen in two ways. First, we compare mean-reversion matrices, covariance matrices, and mean vectors. To compare the independent-factor AFNS mean-reversion matrix to that of the independent-factor DNS model, we translate the continuous-time matrix in Table 5 into the one-month conditional mean-reversion matrix,
exp −K
1
P
12
=
0.993 0 0
0 0.983 0
0 0 . 0.902
(14)
Similarly, we convert the volatility matrix into a one-month conditional covariance matrix AFNS Qindep =
1 12
∫
P P ′ e−K s ΣΣ ′ e−(K ) s ds
0
4.3. AFNS model estimation Thus far we have examined just one simple model (DNS), comparing fit in the independent- and correlated-factor cases. Now we bring AFNS into the mix, and things get more interesting. In particular, we can compare independent- and correlatedfactor cases, with and without imposition of absence of arbitrage. As many have noted, estimation of the canonical affine A0 (3)
17 For example, Rudebusch et al. (2006) report difficulty replicating the published estimates of a no-arbitrage model even though they use identical data and estimation programs. 18 Other strategies to facilitate estimation include adding survey information (Kim and Orphanides, 2005) or assuming that the latent yield curve factors are observable (Ang and Piazzesi, 2003).
12
J.H.E. Christensen et al. / Journal of Econometrics 164 (2011) 4–20
Table 5 Estimated independent-factor AFNS model. The top panel contains the estimated K P matrix and θ P vector. The bottom panel contains the estimated Σ matrix. Standard errors appear in parentheses. The estimated λ is 0.5975 (0.0115) for maturities measured in years. The maximized log likelihood is 16,279.92. K matrix
Mean
K·,P1
K·,P2
K·,P3
θP
0
0
K2P,·
0.0816 (0.0615) 0
0
K3P,·
0
0.2114 (0.1780) 0
0.0710 (0.0129) −0.0282 (0.0173) −0.0093 (0.0061)
K1P,·
Yield-adjustment term in basis points
P
0
1.2330 (0.4240)
Σ matrix Σ1,· Σ2,· Σ3,·
Σ·,1
Σ·,2
Σ·,3
0.0051 (0.0001) 0
0
0
0.0110 (0.0006) 0
0
0
2.15 × 10−6 0 0
0 9.94 × 10−6 0
=
−
A(t , T )
=−
T −t
σ
1
2 T −t
5.26 × 10
B (s, T ) ds − 2
×
2
2 σ33
AFNS corr.-factor -100 0
.
(15)
2 = −σ11
+ − −
(T − t ) 6
1 1−e 4λ3 1 4λ
2 − σ22
−2λ(T −t )
1 2λ2
(T − t )e−2λ(T −t ) − −λ(T −t )
T −t
+
3
2
15
20
25
30
Fig. 1. Yield-adjustment terms for AFNS models.
with the clear result that correlated-factor DNS is dominated by correlated-factor AFNS. Combining the model comparison results above with those reported earlier in Section 4.2, correlated-factor AFNS emerges as the clear in-sample favorite among all the various combinations of independent-factor, correlated-factor, DNS and AFNS models. Presumably, this is due to the greater flexibility of the correlatedfactor AFNS yield adjustment. We report the estimated correlatedfactor AFNS model in Table 6, from which we can infer the estimated yield adjustment. In population, the adjustment is
−
A(t , T ) T −t
+
= −σ
2 11
(T − t )2 6
1 1 − e−2λ(T −t ) 4λ3
1 2λ2
+
1 −λ(T −t ) e
λ2
e−2λ(T −t ) 2
5 1−e
×
T −t
− (σ + σ ) 2 21
2 22
1
2λ
2
−
1 1 − e−λ(T −t )
λ3
T −t
2 2 2 − (σ31 + σ32 + σ33 )
T −t
−2λ(T −t )
T −t
−
1 2λ
3 4λ2
2
and is plotted in Fig. 1. It is everywhere negative, monotonically increasing in absolute value, and very smooth. Presumably a more flexible yield-adjustment term is needed to achieve substantial improvement in fit. The correlated-factor AFNS model, to which we now turn, achieves this. We begin with two model comparisons that involve correlatedfactor AFNS. First consider independent- vs. correlated-factor AFNS. The models are nested, so under the null hypothesis of independent-factor AFNS, LR = 2[log L(θcorr ) − log L(θindep )] ∼ χ 2 (9). We obtain LR = 428.7, with associated p-value less than 0.0001, so independent-factor AFNS is dominated by correlatedfactor AFNS. Second, consider correlated-factor DNS vs. correlatedfactor AFNS. The models are non-nested but contain equal numbers of parameters, so we compare their log likelihoods directly,
[
−σ11 σ31
×
−
1 2λ
−
1 −λ(T −t ) 1 e − (T − t )e−2λ(T −t ) 2 λ 4λ
e
,
+
−2λ(T −t )
−σ11 σ21
4λ
8λ3
10
t
λ3
2 − σ33
B (s, T ) ds 3
5
Maturity in years
2 T −t
1 1 − e−λ(T −t )
−
T −t
2 1−e
λ3
1
T
∫
1
2 T −t
t 2
σ
2 22
t
T
∫
-80
−5
B1 (s, T )2 ds −
-60
0.0264 (0.0014)
0 0
T
∫
-40
AFNS indep.-factor
Inspection reveals that the mean-reversion matrix and covariance matrix (and also the factor mean vector) are similar across the independent-factor DNS and AFNS models. Second, the similarity of the independent-factor DNS and AFNS models can be seen by noting that they make identical assumptions about the P-dynamics and therefore differ only by the yieldadjustment term, which is quite rigid in the independent-factor case. In particular, the independent-factor AFNS yield adjustment is 2 11
-20
−
1
+
8λ3
T −t
1 −λ(T −t ) 1 1 − e−λ(T −t ) e − 3 2 λ λ T −t
]
3 −λ(T −t ) 1 1 e + (T − t ) + (T − t )e−λ(T −t ) 2λ λ
+
− (σ21 σ31 + σ22 σ32 )
1 −λ(T −t ) 1 e − 2 e−2λ(T −t ) 2λ
λ2
3 1 − e−λ(T −t )
λ3
T −t
5 1 − e−2λ(T −t )
λ2
T −t
λ2
λ3
(T − t ) +
3 1 − e−λ(T −t )
λ3
2 1 − e−λ(T −t )
T −t
+
3 1 − e−2λ(T −t ) 4λ3
T −t
.
Replacing population parameters with estimates delivers the corresponding estimated yield adjustment, which we plot in Fig. 1. It is indeed more flexible, with an interesting hump in the fifteento twenty-year maturity range, which improves the fit of those
J.H.E. Christensen et al. / Journal of Econometrics 164 (2011) 4–20 Table 6 Estimated correlated-factor AFNS model. The top panel contains the estimated K P matrix and θ P vector. The bottom panel contains the estimated Σ matrix. Standard errors appear in parentheses. The estimated λ is 0.8244 (0.0122) for maturities measured in years. The maximized log likelihood is 16,494.29. K P matrix
K1P,· K2P,· K3P,·
Mean
K·,P1
K·,P2
K·,P3
θP
5.2740 (1.3100) −0.2848 (1.3200) −37.3100 (11.0000)
9.0130 (1.4200) 0.5730 (2.3200) −66.7700 (11.9000)
−10.7100
0.0794 (0.0084) −0.0396 (0.0200) −0.0279 (0.0193)
(1.4800) −0.5528 (2.7600) 80.0900 (12.1000)
Σ·,1 . Σ1,·
0.0154 (0.0004) −0.0013 (0.0051) −0.1641 (0.0069)
Σ2,· Σ3,·
Σ·,2
Σ·,3
0
0
0.0117 (0.0018) −0.0590 (0.0106)
0 0.0001 (6.8900)
7.5
Yield in percent
7.0
6.5
6.0
Empirical mean yields Indep.-factor DNS mean yields Corr.-factor DNS mean yields Indep.-factor AFNS mean yields Corr.-factor AFNS mean yields
5.5
5
exp −K
10
15
20
1
P
12
=
0.917 0.0390 0.456
−0.107 0.981 0.769
0.122 0.0112 . 0.0667
(16)
Evidently, the level factor becomes less persistent once the flexible correlated-factor AFNS yield adjustment is incorporated, because the level factor is more free to work with slope and curvature to improve fit at shorter maturities, given that the yield adjustment is most helpful at long maturities. The one-month conditional covariance matrix is 1 12
∫ =
P P ′ e−K s ΣΣ ′ e−(K ) s ds
0
7.42 × 10−6 = −6.11 × 10−6 −7.62 × 10−6
−6.11 × 10−6 1.07 × 10−5 5.89 × 10−7
−7.62 × 10−6 −7 . 5.89 × 10 1.87 × 10−4
(17)
The conditional variances in the diagonal are about the same for the level and slope factors as those obtained in the correlated-factor DNS model, but the conditional variance for curvature is much larger. In terms of covariances, the negative correlation between the shocks to level and slope is maintained. For the correlations between shocks to curvature and shocks to level and slope, the signs have changed relative to the unconstrained correlated-factor DNS model. This suggests that the off-diagonal elements of Σ are heavily influenced by the required shape of the yield-adjustment term rather than the dynamics of the state variables. On the other hand, the estimated covariances of the shocks in the DNS models are likely to be unbiased as they are varied to provide the best fit of the P-dynamics without any implications for the cross-sectional fit of the model. 5. Out-of-sample predictive performance
5.0 0
mean-reversion matrix
AFNS Qcorr
Σ matrix
13
25
30
Time to maturity in years
Fig. 2. Mean yield curves. We show the empirical mean yield curve, and the independent- and correlated-factor DNS and AFNS model mean yield curves.
long-term yields in particular, although it also helps with shorter maturities. Another way to appreciate the role of the yield-adjustment term is to compare the mean fitted yield curves from the independent- and correlated-factor AFNS and DNS models to the sample mean yield curve, which is done in Fig. 2. All of the models match the mean yield curve well for maturities up to ten years, but their behavior diverges for longer maturities. Note that the DNS model curve is monotonically increasing, while with the yieldadjustment terms, the AFNS models can bend downward and achieve better long-maturity fit.19 The enhanced flexibility produced by the correlated-factor AFNS yield-adjustment term allows the level factor to become less persistent, as evidenced by the estimated one-month conditional
19 This result suggests why the DNS model is not arbitrage free. At very long maturities, only the level factor has any appreciable influence on bond yields. To eliminate the arbitrage opportunity from going long on a bond with very long maturity and hedging the risk by shorting a bond with a slightly shorter maturity, eventually the yield curve must slope downwards (an application of Jensen’s inequality and an illustration of convexity), which the DNS model cannot support.
Here we investigate whether the in-sample superiority of the correlated-factor AFNS model carries over to out-of-sample forecast accuracy. We first describe the recursive estimation and prediction procedure employed. Second, we compare performance of the four uncorrelated/correlated-factor DNS/AFNS models, exactly as in the in-sample analysis of Section 4 except that we work out-of-sample as opposed to in-sample. Third, we compare the out-of-sample predictive performance of AFNS to that of the canonical A0 (3) model. 5.1. Construction of out-of-sample forecasts We construct six- and twelve-month-ahead forecasts from the four DNS and AFNS models for yields at various maturities. We estimate and forecast using an expanding sample. The first estimation sample is January 1987 to December 1996; then January 1987 to January 1997, and so on. The largest estimation sample for the one-month-ahead forecasts ends in November 2002 (72 forecasts in all). For the six- and twelve-month horizons, the largest samples end in June 2002 and December 2001 (67 and 61 forecasts), respectively. Under quadratic loss the optimal forecast is simply the relevant conditional expectation. The optimal DNS forecast for a maturity-τ yield made at time t for time t + h is therefore P P P yDNS t +h,t (τ ) ≡ Et [yt +h (τ )] = Et [Lt +h ] + Et [St +h ]
+ EtP [Ct +h ]
1−e
−λτ
λτ
1 − e−λτ
λτ
− e−λτ .
(18)
14
J.H.E. Christensen et al. / Journal of Econometrics 164 (2011) 4–20
But from the first-order transition dynamics we have immediately
EtP
[Xt +h ] =
h−1 −
A
i
(I − A)µ + Ah Xt ,
(19)
i =0
P P 1 P 2 yAFNS t +h,t (τ ) ≡ Et [yt +h (τ )] = Et [Xt +h ] + Et [Xt +h ]
+ EtP [Xt3+h ]
1 − e−λτ
λτ
− e−λτ
−
h=6
Three-month yield DNSindep
96.87
173.39
DNScorr AFNSindep
87.43 91.63
166.91 164.70
AFNScorr
88.49
λτ
one-year yield DNSindep DNScorr
103.25 102.71
170.85 173.14
,
AFNSindep AFNScorr
98.49 98.63
163.46 165.50
Three-year yield DNSindep DNScorr
92.22 99.55
135.24 145.82
AFNSindep AFNScorr
86.99 90.64
126.95 135.79
Five-year yield DNSindep DNScorr
87.87 94.95
122.09 132.40
AFNSindep AFNScorr
82.41 88.15
112.85 124.87
Ten-year yield DNSindep DNScorr
74.71 79.48
105.02 112.37
AFNSindep AFNScorr
67.48 90.21
92.39 123.89
Thirty-year yield DNSindep DNScorr
71.35 72.71
96.90 99.68
AFNSindep AFNScorr
48.06 71.38
61.97 96.75
1−e
A(τ )
τ
−λτ
where E0P [Xt ] = (I − exp(−K P t ))θ P + exp(−K P t )X0 , and Xt = (
Xt1
,
Xt2
,
Xt3
).
Forecast horizon in months Model
where Xt = (Lt , St , Ct ). The straightforward forecasting of the state vector (19) translates into straightforward forecasting of the yield vector via (18). Similarly, the optimal AFNS forecast for a maturity-τ yield made at time t for time t + h is
Table 7 Out-of-sample root mean squared forecast errors, four models. For each maturity and horizon, the smallest RMSFE is boxed. Units are basis points.
20
5.2. Evaluation of out-of-sample forecasts Predictive accuracy has been a key metric to evaluate the adequacy of yield curve models; recent analyses include Ang and Piazzesi (2003), Hördahl et al. (2005), De Pooter et al. (2007), Chua et al. (2008), Mönch (2008), and Zantedeschi et al. (2009). Define the h-step-ahead forecast error for maturity τ as eˆ t +h,t (τ ) = yt +h (τ ) − yˆ t +h,t (τ ). Then the forecast performances of the four models (DNS/AFNS, independent/correlated) are compared using the root mean squared forecast error (RMSFE) for τ = 3, 12, 36, 60, 120, 360, and h = 6, 12 (in months). These RMSFEs are shown in Table 7. For each of the twelve combinations of yield maturity and forecast horizon, the most accurate model’s RMSFE is boxed. The results are striking. In ten of the twelve combinations, the most accurate model is the independent-factor AFNS model. In particular, the in-sample advantage of the correlated-factor AFNS model disappears out of sample. Evidently, the correlated-factor AFNS model is prone to in-sample overfitting due to its rich Pdynamics.21 In examining forecast performance, we are interested in two broad questions. First, how does the forecast performance of the correlated-factor models compare to that of the independentfactor models, and second, how does the imposition of AF structure affect forecast performance. Fig. 3 suggests the answers, showing ratios of RMSFEs for various combinations of model, maturity and forecast horizon. The first question is addressed in the left and middle panels, which show the ratios of the independent-factor and correlated-factor DNS models and the independent-factor and correlated-factor AFNS models, respectively. The ratios are almost uniformly below one, which supports the parsimonious models. The second question is addressed in the right panel, which shows RMSFE ratios of the independent-factor AFNS and DNS models. The evidence is somewhat mixed—due largely to anomalous
20 Making the formulae operational of course requires replacing population system parameters with estimates. We denote the operational forecasts by yˆ DNS t +h,t (τ )
and yˆ AFNS t +h,t (τ ). 21 The two cases in which the independent-factor AFNS model is not the most accurate pertain to the three-month yield. This disadvantage likely reflects idiosyncratic fluctuations in short-term Treasury bill yields from institutional factors unrelated to yields on longer-maturity Treasuries, as described by Duffee (1996). The more flexible models appear to have a slight advantage in fitting such idiosyncratic movements.
h = 12
161.94
behavior at the twenty-year maturity—but overall the AF version dominates. Therefore, out-of-sample forecast performance appears largely improved by imposing freedom from arbitrage, especially at the longer twelve-month forecast horizon. 5.3. Comparison to Duffee (2002) An important remaining issue is the forecasting performance of AFNS relative to the canonical AF A0 (3) model. In this subsection we address that issue, and in so doing we provide insight into the benefits of imposing the Nelson–Siegel restrictions. We hasten to add that, quite apart from any effects on forecasting performance, imposition of the Nelson–Siegel restrictions delivers clear benefits simply in achieving estimation tractability. The simple estimation of AFNS contrasts starkly with the ‘‘challenging’’ estimation of the maximally flexible A0 (3) model, whose recalcitrance is well known. Our earlier-implemented expanding-sample AFNS estimation, for example, is infeasible for the maximally flexible A0 (3) model. Hence, instead of estimating a somewhat arbitrary A0 (3) model for our data set, we take an existing optimized empirical A0 (3) model from the literature, specifically Duffee (2002), and we compare it to an AFNS model estimated on the same data. Duffee (2002) examines the predictive performance of the A0 (3) model class, estimating both the maximally flexible version (given an essentially affine risk premium structure) and a more parsimonious ‘‘preferred’’ specification on a single sample from January 1952 to December 1994.22 Fixing the parameters at estimated val-
22 The data used are available at http://econ.jhu.edu/People/Duffee/index.htm.
J.H.E. Christensen et al. / Journal of Econometrics 164 (2011) 4–20
15
Fig. 3. Out-of-sample root mean squared forecast error ratios. Table 8 Estimated independent-factor AFNS model, Duffee (2002) data set. The top panel contains the estimated K P matrix and θ P vector. The bottom panel contains the estimated Σ matrix. Standard errors appear in parentheses. The estimated λ is 0.8131 (0.0183) for maturities measured in years. The maximized log likelihood is 14,948.79. K P matrix
Table 9 Out-of-sample root mean squared forecast errors, three models. We show RMSFEs for the random walk model, the preferred A0 (3) model as selected and estimated by Duffee (2002, Table 8), and the independent-factor AFNS estimated using Duffee’s data set. For each maturity and horizon, the smallest RMSFE is boxed. Units are basis points.
Mean
Forecast horizon in months
K·,P1
K·,P2
K·,P3
θP
Maturity/model
h=6
h = 12
0
0
K2P,·
0.0299 (0.0249) 0
0
Six-month yield Random walk Preferred A0 (3)
40.0 36.5
48.4 42.1
K3P,·
0
0.7436 (0.1550) 0
0.0609 (0.0224) −0.0162 (0.0054) −0.0043 (0.0026)
K1P,·
2.5250 (0.3540)
Σ matrix Σ·,1
Σ·,2
Σ·,3
0
0
Σ2,·
0.0069 (0.0002) 0
0
Σ3,·
0
0.0208 (0.0004) 0
Σ1,·
AFNSindep Two-year yield Random walk Preferred A0 (3) AFNSindep Ten-year yield Random walk Preferred A0 (3) AFNSindep
34.0 65.2 56.6 54.3 66.9 63.6 60.7
41.3 76.2 60.0 59.0 81.5 73.8 71.8
0.0363 (0.0009)
ues, Duffee sequentially updates the state variables and produces three-, six- and twelve-month-ahead yield forecasts.23 We extend Duffee’s forecast comparison to include the independent-factor AFNS model, estimated using three-month, six-month, one-year, two-year, five-year, and ten-year yields from January 1952 to December 1994, as reported in Table 8.24 Fixing parameters at estimated values, we sequentially update the state variables using the Kalman filter. Based on the updated state variables, we produce six- and twelve-month-ahead yield forecasts as above. RMSFEs appear in Table 9 for the two models examined by Duffee (2002) (random walk and A0 (3)) plus the independentfactor AFNS model, for the six-month, two-year and ten-year yield maturities examined by Duffee. RMSFEs for each forecasting model are based on 42 six-month-ahead forecasts from January 1995 to
23 The estimation method used by Duffee (2002) differs from ours in that he avoids filtering by assuming that the six-month, two-year, and ten-year yields are observed without error. Duffee therefore evaluates out-of-sample forecast performance only at those maturities. 24 There are 21 parameters estimated in Duffee’s preferred A (3) model and 16 0
parameters estimated in our AFNS model, including the six measurement error standard deviations.
June 1998, and 36 twelve-month-ahead forecasts from January 1995 to December 1997. For each maturity/horizon combination, the independent-factor AFNS forecasts are the most accurate, consistently outperforming both the random walk and Duffee’s preferred A0 (3) model. This superior out-of-sample forecast performance indicates that the AFNS class is a leading and, not least, well-identified member of the general A0 (3) class of models. 6. Concluding remarks Asset pricing, portfolio allocation, and risk management are fundamental tasks in financial asset markets. For fixed income securities, superior yield curve modeling translates into superior pricing, portfolio returns, and risk management. Accordingly, we have focused on two important and successful yield curve literature: the Nelson–Siegel empirically based one and the no-arbitrage theoretically based one. Yield curve models in both of these traditions are impressive successes, albeit for very different reasons. Ironically, both approaches are equally impressive failures, and for the same reasons, swapped. That is, models in the Nelson–Siegel tradition fit and forecast well, but they lack theoretical rigor insofar as they admit arbitrage possibilities. Conversely, models in the arbitrage-free tradition are theoretically
16
J.H.E. Christensen et al. / Journal of Econometrics 164 (2011) 4–20
rigorous insofar as they enforce absence of arbitrage, but they fit and forecast poorly. We have bridged the divide, proposing Nelson–Siegel-inspired models that enforce absence of arbitrage. We analyzed our models theoretically, relating them to the canonical (Dai and Singleton, 2000) representation of three-factor arbitrage-free affine models. We also analyzed our models empirically, both in terms of insample fit and out-of-sample prediction. As regards in-sample fit, we showed that the Nelson–Siegel parameter restrictions greatly facilitate estimation, enabling one to escape the challenging A0 (3) estimation environment in favor of the simple and robust AFNS environment, and that the data strongly favor the correlated-factor AFNS specification. As regards out-of-sample prediction, we showed that the tables are turned: the more parsimonious independent-factor models fare better. The results also suggest that gains may be achieved by imposing absence of arbitrage, particularly for moderate to long yield maturities and forecast horizons, although the evidence is much less conclusive than for in-sample fit. All told, the independent-factor AFNS model fares well in out-of-sample prediction, consistently outperforming, for example, the canonical A0 (3). Going forward, this new AFNS structure appears likely to be a useful representation for term structure research, as its embedded three-factor structure (level, slope, curvature) maintains fidelity to key aspects of term-structure data that have been recognized at least since Litterman and Scheinkman (1991), while simultaneously imposing absence of arbitrage. On the theoretical side, it has recently been significantly enriched to include nonlinear regime-switching dynamics by Zantedeschi et al. (2009). On the applied side, it has recently been extended in Christensen et al. (2010) to provide a joint empirical model of nominal and real yield curves and in Christensen et al. (2009) to model the interbank lending market.
it follows from the system of ODEs that T
∫ t
∫ T d (K Q )′ (T −s) Q ′ e(K ) (T −s) ρ1 ds, e B(s, T ) ds = ds t
or equivalently, using the boundary conditions, B(t , T ) = −e
Q ′ e(K ) (T −s) ρ1 ds.
t
Now impose the following structure on (K Q )′ and ρ1 :
(K ) = Q ′
0 0 0
0
0 0
λ −λ
1 1 . 0
and ρ1 =
λ
It is then easy to show that
e
(K Q )′ (T −t )
1 = 0 0
0
e
0 0
λ(T −t )
−λ(T − t )eλ(T −t )
eλ(T −t )
and
1 Q ′ e−(K ) (T −t ) = 0 0
0
e−λ(T −t ) λ(T − t )e−λ(T −t )
1 B(t , T ) = − 0 0 T
∫ × t
e−λ(T −t )
0
e
1 0 0
1
T
∫ t
e−λ(T −t )
0
eλ(T −s)
e−λ(T −t )
Because d dt
[e(K
Q )′ (T −t )
Because T
∫
ds = T − t , t
and T
∫
eλ(T −s) ds =
[
t
−1 λ(T −s) e λ
]T =−
1 − eλ(T −t )
λ
t
= e(K
Q )′ (T −t )
dB(t , T ) dt
− (K Q )′ e(K
Q )′ (T −t )
B(t , T ),
,
and T
∫
−λ(T − s)eλ(T −s) ds =
t
=
1
∫
0
xex dx
λ
λ(T −t )
1
x 0
λ
[xe ]λ(T −t ) −
1
∫
λ
0
ex dx λ(T −t )
1 − eλ(T −t )
λ
the system of ODEs can be reduced to 1 B(t , T ) = − 0 0
0
e−λ(T −t ) λ(T − t )e−λ(T −t )
B(t , T )]
ds. eλ(T −s) −λ(T − s)eλ(T −s)
B(T , T ) = 0.
0
= −(T − t )eλ(T −t ) −
= ρ1 + (K Q )′ B(t , T ),
1 1 ds
0 0
1
Appendix A. Proof of Proposition 1
dt
0 0
eλ(T −s) −λ(T − s)eλ(T −s)
0 e−λ(T −t ) λ(T − t )e−λ(T −t )
= − 0
×
dB(t , T )
.
0 0
−λ(T −t )
λ(T − t )e−λ(T −t )
Acknowledgements
Start the analysis by limiting the volatility to be constant. Then the system of ODEs for B(t , T ) is
0 0
Inserting this in the ODE, we obtain
0
We thank the editors, referees and seminar/conference participants at the University of Chicago, Copenhagen Business School (especially Anders Bjerre Trolle and Peter Feldhütter), Stanford University (especially Ken Singleton), the Wharton School, Cambridge University, HEC Paris, Humboldt University, the NBER summer Institute, the Warwick Royal Economic Society annual meeting, the Coimbra Workshop on Financial Time Series, the New York inaugural meeting of SoFiE, the Montreal CIRANO/CIREQ Financial Econometrics meeting, and the Brazilian Finance Association annual meeting for helpful comments. We thank Georg Strasser, Rong Hai, and Justin Weidner for research assistance. The views expressed are those of the authors and do not necessarily reflect the views of others at the Federal Reserve Bank of San Francisco.
T
∫
−(K Q )′ (T −t )
×
T −t 1 − eλ(T −t )
0 0
e
−λ(T −t )
λ λ(T −t ) 1−e λ(T −t ) −(T − t )e − λ −
,
J.H.E. Christensen et al. / Journal of Econometrics 164 (2011) 4–20
−(T − t ) 1 − e−λ(T −t )
=
I3 =
− , λ −λ(T −t ) 1−e −λ(T −t ) (T − t )e − λ
=
C
1
2 T −t C
= C
which is identical to the claim in Proposition 1.
−
In the AFNS models the yield-adjustment term is in general A(t , T ) T −t
=
=
1
1
2T −t
2T −t
×
3 −
t T
∫
1
1
T
∫
3 −
t
j =1
=
Σ ′ B(s, T )B(s, T )′ Σ
j=1
σ11 σ12 σ13
ds j ,j
1 B ( s, T ) σ31 σ32 B2 (s, T ) σ33 B3 (s, T )
σ21 σ22 σ23
× B1 (s, T ) B2 (s, T ) B3 (s, T ) σ11 σ12 σ13 × σ21 σ22 σ23 ds σ31 σ32 σ33 j,j ∫ T ∫ T A 1 B 1 B1 (s, T )2 ds + B2 (s, T )2 ds = 2T −t t 2T −t t ∫ T C 1 + B3 (s, T )2 ds 2 T −t t ∫ T 1 +D B1 (s, T )B2 (s, T )ds T −t t ∫ T 1 +E B1 (s, T )B3 (s, T )ds T −t t ∫ T 1 +F B2 (s, T )B3 (s, T )ds, T −t t
A=σ
+σ +σ ,
B=σ
+σ +σ ,
C =σ
+σ +σ ,
2 11 2 21 2 31
2 12 2 22 2 32
2 13 2 23 2 33
e
D
∫
To derive the analytical formula for solved:
A(t ,T ) , T −t
six integrals need to be
= I2 =
2T −t B
1
2T −t
B1 (s, T )2 ds
(T − s)2 ds =
t
∫
A 6
(T − t )2 .
T
B2 (s, T )ds
t
∫
T
[
−λ(T −s) ]2
B 1 1−e − ds 2T −t t λ [ ] −λ(T −t ) 1 1 1−e 1 1 − e−2λ(T −t ) = B − 3 + 3 . 2λ2 λ T −t 4λ T −t
=
+
8λ3
T −t
.
B1 (s, T )B2 (s, T )ds
( T − s)
1 − e−λ(T −s)
I5 = E
= E
T
∫
1
T −t t ∫ T 1 T −t
B1 (s, T )B3 (s, T )ds [−(T − s)]
t
[
× (T − s)e−λ(T −s) −
ds
λ
λ2 1
+ (T − t )e λ ∫ T
= F
]
1 3 −λ(T −t ) e + (T − t ) 2λ
= E
I6 = F
1 − e−λ(T −s)
−λ(T −t )
−
3 1 − e−λ(T −t )
λ3
T −t
.
1 B2 (s, T )B3 (s, T )ds T −t t ] ∫ T[ 1 1 − e−λ(T −s) T −t
−
λ
t
[
× (T − s)e−λ(T −s) − 1
λ
2
+
]
λ
ds
1 −λ(T −t ) 1 e − 2 e−2λ(T −t ) 2 λ 2λ
3 1 − e−λ(T −t )
λ3
1 − e−λ(T −s)
T −t
+
3 1 − e−2λ(T −t ) 4λ3
.
T −t
Derivation of the AFNS restrictions imposed on the canonical representation of the A0 (3) class of affine models starts with an arbitrary affine diffusion process represented by dYt = KY [θY − Yt ]dt + ΣY dWt .
T
2T −t t ∫ T A 1
T −t
ds T −t t λ ] [ 1 1 1 − e−λ(T −t ) 1 (T − t ) + 2 e−λ(T −t ) − 3 . = D 2λ λ λ T −t
Q
I1 =
λ3
5 1 − e−2λ(T −t )
Appendix C. Restrictions imposed in the AFNS model
F = σ21 σ31 + σ22 σ32 + σ23 σ33 .
∫
−
2 1 − e−λ(T −t )
Combining the six integrals, the analytical formula reported in Section 2.3 is obtained.
E = σ11 σ31 + σ12 σ32 + σ13 σ33 ,
1
1 −λ(T −t ) 1 e − (T − t )e−2λ(T −t ) 4λ
λ2
T
T −t t ∫ T D
−
D = σ11 σ21 + σ12 σ22 + σ13 σ23 ,
A
[ ]2 1 − e−λ(T −s) −λ(T −s) (T − s)e − ds λ
−2λ(T −t )
4λ2
= F
where
+
2λ2
Appendix B. The AFNS yield-adjustment term I4 =
t
1
3
T
∫
2 T −t
B3 (s, T )ds t
1
17
T
∫
Q
Q
Now consider the affine transformation TY : AYt + η, where A is a nonsingular square matrix of the same dimension as Yt and η is a vector of constants of the same dimension as Yt . Denote the transformed process by Xt = AYt + η. By Ito’s lemma it follows that dXt = AdYt = [AKY θY − AKY Yt ]dt + AΣY dWt Q
Q
Q
Q
= AKYQ A−1 [AθYQ − AYt − η + η]dt + AΣY dWtQ = AKYQ A−1 [AθYQ + η − Xt ]dt + AΣY dWtQ = KXQ [θXQ − Xt ]dt + ΣX dWtQ .
18
J.H.E. Christensen et al. / Journal of Econometrics 164 (2011) 4–20
Thus, Xt is itself an affine diffusion process with parameter specification: Q Q KX = AKY A−1 ,
θXQ = AθYQ + η,
ΣX = A ΣY .
and
A similar result holds for the dynamics under the P-measure. In terms of the short rate process there exists the following relationship: rt = δ + (δ ) Y 0 Y 0 Y 0
Y ′ −1 AYt 1 A
Y 0
= δ + (δ )
[AYt + η − η]
= δ − (δ )
η + (δ1Y )′ A−1 Xt .
rt = δ0Y + (δ1Y )′ Yt = δ0X + (δ1X )′ Xt . Because both Yt and Xt are affine latent factor processes that deliver the same distribution for the short rate process rt , they are equivalent representations of the same fundamental model; hence, TX is called an affine invariant transformation. In the canonical representation of the subset of A0 (3) affine term structure models considered here, the Q -dynamics are Y ,Q
Y ,Q κ12 Y ,Q κ22
κ11 dYt1 dYt2 = − 0 dYt3 0
+
1 0 0
Y ,Q κ13 Yt1 Y ,Q 2 κ23 Yt dt Y ,Q Yt3 0 κ33 dW 1,Q 0 t 0 dWt2,Q ,
0 1 0
1
3,Q dWt
Y ,P
Y ,P κ12 Y ,P κ22 Y ,P κ32
κ11 dYt1 Y ,P dYt2 = κ21 3 dYt κ Y ,P
31
+
1 0 0
Y ,P Y ,P θ1 κ13 Yt1 Y ,P Y ,P κ23 θ2 − Yt2 dt Y ,P Yt3 θ3Y ,P κ33 1,P
dWt 0 0 dWt2,P . 3,P 1 dWt
0 1 0
Finally, the instantaneous risk-free rate is
There are 22 parameters in this maximally flexible canonical representation of the A3 (0) class of models, and here we present the parameter restrictions needed to arrive at the affine AFNS models. (1) The AFNS model with independent factors The independent-factor AFNS model has P-dynamics X ,P
κ11 dXt1 dXt2 = 0 dXt3 0
0
κ
X ,P 22
0
X σ11 + 0
0
σ
X 22
0
0
dXt1 0 dXt2 = − 0 0 dXt3
0
1 θ1 0 Xt X ,P 2 0 θ2 − Xt dt X ,P Xt3 κ33 θ3X ,P 1 ,P dWt 0 2 ,P 0 dWt , X 3 ,P σ33 dWt
λ 0
σ
X 11
+0 0
1 Xt 0 −λ Xt2 dt λ Xt3
0
σ
X 22
0
σ
X 22
0
η= 0
0 X σ33
0
0
will convert the canonical representation into the independentfactor AFNS model. For the mean-reversion matrices, the relationship between the two representations is
⇐⇒ KYP = A−1 KXP A,
KXP = AKYP A−1
Q Q Q Q KX = AKY A−1 ⇐⇒ KY = A−1 KX A.
The equivalent mean-reversion matrix under the Q -measure is then 1 0 0 X σ11 X σ11 0 0 0 0 0 1 Q X 0 0 λ −λ 0 KY = 0 σ22 0 X 0 0 λ σ22 X 0 0 σ 33 1 0 0 X
0
0
= 0
λ
0
0
Y ,Q
K11
0
σ33
X σ33 . X σ22 λ
−λ
Y ,Q
= 0,
K12
Y ,Q
= 0,
K13
Y ,Q Y ,Q = 0 and K33 = K22 .
Y ,Q
Furthermore, notice that K23 will always have the opposite sign of Y ,Q K22
Y ,Q and K33 , but its absolute size can vary independently of these parameters. Because KXP A, and A−1 are all diagonal matrices,
two , KYP is a diagonal matrix, too. This gives another six restrictions. Finally, we can study the factor loadings in the affine function for the short rate process. In all AFNS models, rt = Xt1 + Xt2 , which is equivalent to fixing
δ = 0, X 0
1
δ = 1 . X 1
0 From the relation (δ1X )′ = (δ1Y )′ A−1 it follows that
(δ1Y )′ = (δ1X )′ A = 1
X = σ11
X σ22
X σ22
0
0 0
X σ33
0 .
δ0X = δ0Y − (δ1Y )′ A−1 η ⇐⇒ δ0Y = δ0X = 0. Thus, we have obtained two additional parameter restrictions
δ0Y = 0 and δ1Y,3 = 0. (2) The AFNS model with correlated factors In the correlated-factor AFNS model, the P-dynamics are X ,P
κ11 dXt1 X ,P dXt2 = κ21 3 dXt κ X ,P
31
1 ,Q
0
For the constant term it holds that
dWt 0 2 ,Q 0 dWt . X 3 ,Q σ33 dWt
1
X σ11 0 0 0
X ,P
and the Q -dynamics are given by Proposition 1 as
0
rt = δ0Y + δ1Y,1 Yt1 + δ1Y,2 Yt2 + δ1Y,3 Yt3 .
0
Thus, four restrictions need to be imposed on the upper triangular Q mean-reversion matrix KY :
and the P-dynamics are
X σ11
0
Thus, defining δ0X = δ0Y − (δ1Y )′ A−1 η and δ1X = (δ1Y )′ A−1 , the short rate process is left unchanged and may be represented in either way
A= 0
= δ + (δ )
Y ′ 1 Yt Y ′ −1 1 A Y ′ −1 1 A
Finally, the short rate process is rt = Xt1 + Xt2 . This model has a total of ten parameters; thus, twelve parameter restrictions need to be imposed on the canonical A0 (3) model. It is easy to verify that the affine invariant transformation TA (Yt ) = AYt + η with
X σ11
+0 0
X ,P κ12 X ,P κ22 X ,P κ32 X σ12 X σ22
0
X ,P X ,P κ13 θ1 Xt1 X ,P X ,P κ23 θ2 − Xt2 dt X ,P Xt3 κ33 θ3X ,P 1 ,P X dWt σ13 2 ,P X σ23 dWt , X 3 ,P σ33 dWt
J.H.E. Christensen et al. / Journal of Econometrics 164 (2011) 4–20
This shows that there are no restrictions on δ1Y . For the constant term, we have
and the Q -dynamics are given by Proposition 1 as
1
dXt 0 dXt2 = − 0
0
0
λ
1
Xt
δ0X = δ0Y − (δ1Y )′ A−1 η ⇐⇒ δ0Y = δ0X = 0.
−λ Xt2 dt 0 0 λ Xt3 X 1,Q X X dWt σ11 σ12 σ13 2,Q X X + 0 σ22 σ23 dWt . X 3,Q 0 0 σ33 dWt
dXt3
Thus, we have obtained one additional parameter restriction,
δ0Y = 0. Finally, for the mean-reversion matrix under the P-measure, we have
This model has a total of 19 parameters; thus, three parameter restrictions are needed. It is easy to verify that the affine invariant transformation TA (Yt ) = AYt + η with
X σ11
X σ13 X σ23 X σ33
X σ12 X σ22
A= 0 0
0
0 0 0
and η =
⇐⇒ KYP = A−1 KXP A,
Q Q Q Q KX = AKY A−1 ⇐⇒ KY = A−1 KX A.
The equivalent mean-reversion matrix under the Q -measure is then
σX − X 12 X σ11 σ22
1
X σ11 = 0
Q
KY
X σ22
0
×
1
0
0 0 0
0
λ 0
−λ
X σ12 X σ11
λ
λ
0
X σ33
σX 11 −λ 0 λ 0
0
0 = 0
X σ13 σX σX − − X 12 X 23 X X X σ11 σ33 σ11 σ22 σ33 X σ23 − X X σ22 σ33
1
0
X σ13 X σ23 X σ33 X
X σ12 X σ22
0
X X X σ12 σ33 − σ22 σ13 X X σ11 σ22 . X σ33 −λ X σ22 λ
Thus, two restrictions need to be imposed on the upper triangular Q mean-reversion matrix KY : Y ,Q
K11
Y ,Q
= 0,
K33
Y ,Q = K22 . Y ,Q
Furthermore, notice that K23
will always have the opposite sign
Y ,Q K33 ,
Y ,Q K22
but its absolute size can vary independently of of and the two other parameters. Next we study the factor loadings in the affine function for the short rate process. In the AFNS models, rt = Xt1 + Xt2 , which is equivalent to fixing 1
δ = 0,
δ = 1 .
X 0
X 1
0 From the relation (δ1X )′ = (δ1Y )′ A−1 , it follows that
(δ1Y )′ = (δ1X )′ A = 1
1
X σ11 0 0
X σ12 X σ22
0
= σ
X 11
σ +σ X 21
X 22
0
σ +σ X 13
X 23
.
KXP = AKYP A−1 ⇐⇒ KYP = A−1 KXP A. Because KXP is a free 3 × 3 matrix, KYP is also a free 3 × 3 matrix. Thus, no restrictions are imposed on the P-dynamics in the equivalent canonical representation of this model. References
will convert the canonical representation into the correlated-factor AFNS model. For the mean-reversion matrices, the relationships between the two representations are KXP = AKYP A−1
19
X σ13 X σ23 X σ33
Ahn, D., Dittmar, R.F., Gallant, A.R., 2002. Quadratic term structure models: theory and evidence. Review of Financial Studies 15, 243–288. Ang, A., Piazzesi, M., 2003. A no-arbitrage vector autoregression of term structure dynamics with macroeconomic and latent variables. Journal of Monetary Economics 50, 745–787. Bank for International Settlements, 2005. Zero-coupon yield curves: technical documentation. BIS Papers No. 25. Christensen, J.H.E., Lopez, J.A., Rudebusch, G.D., 2009. Do central bank liquidity facilities affect interbank lending rates? Working Paper #2009-13. Federal Reserve Bank of San Francisco. Christensen, J.H.E., Lopez, J.A., Rudebusch, G.D., 2010. Inflation expectations and risk premiums in an arbitrage-free model of nominal and real bond yields. Journal of Money Credit and Banking 42 (Suppl. 6), 143–178. Chua, C.T., Foster, D., Ramasnwamy, K., Stine, R., 2008. A dynamic model for the forward curve. Review of Financial Studies 21, 265–310. Cox, J.C., Ingersoll, J.E., Ross, S.A., 1985. A theory of the term structure of interest rates. Econometrica 53, 385–407. Dai, Q., Singleton, K.J., 2000. Specification analysis of affine term structure models. Journal of Finance 55, 1943–1978. Dai, Q., Singleton, K.J., 2002. Expectations puzzles, time-varying risk premia, and affine models of the term structure. Journal of Financial Economics 63, 415–441. De Pooter, M., Ravazzolo, F., van Dijk, D., 2007. Predicting the term structure of interest rates: incorporating parameter uncertainty. Model Uncertainty and Macroeconomic Information. Available at SSRN: http://ssrn.com/abstract=967914. Diebold, F.X., Li, C., 2006. Forecasting the term structure of government bond yields. Journal of Econometrics 130, 337–364. Diebold, F.X., Piazzesi, M., Rudebusch, G.D., 2005. Modeling bond yields in finance and macroeconomics. American Economic Review 95, 415–420. Diebold, F.X., Rudebusch, G.D., Aruoba, S.B., 2006. The macroeconomy and the yield curve: a dynamic latent factor approach. Journal of Econometrics 131, 309–338. Duffee, G.R., 1996. Idiosyncratic variation of treasury bill yields. Journal of Finance 51, 527–552. Duffee, G.R., 2002. Term premia and interest rate forecasts in affine models. Journal of Finance 57, 405–443. Duffie, D., Kan, R., 1996. A yield-factor model of interest rates. Mathematical Finance 6, 379–406. Duffie, D., Pan, J., Singleton, K.J., 2000. Transform analysis and asset pricing for affine jump-diffusions. Econometrica 68, 1343–1376. Fama, E.F., Bliss, R.R., 1987. The information in long-maturity forward rates. American Economic Review 77, 680–692. Filipović, D., 1999. A note on the Nelson–Siegel family. Mathematical Finance 9, 349–359. Fisher, M., Gilles, C., 1996. Term premia in exponential-affine models of the term structure. Manuscript. Board of Governors of the Federal Reserve System. Gürkaynak, R.S., Sack, B., Wright, J.H., 2007. The US treasury yield curve: 1961 to the present. Journal of Monetary Economics 54, 2291–2304. Hördahl, P., Tristani, O., Vestin, D., 2005. A joint econometric model of macroeconomic and term-structure dynamics. Journal of Econometrics 131, 406–436. Kim, D.H., Orphanides, A., 2005. Term structure estimation with survey data on interest rate forecasts. Finance and Economics Discussion Series No. 48. Board of Governors of the Federal Reserve System. Koopman, S.J., Mallee, M.I.P., van der Wel, M., 2010. Analyzing the term structure of interest rates using the dynamic Nelson–Siegel model with time-varying parameters. 
Journal of Business and Economic Statistics 28 (3), 329–343. Krippner, L., 2006. A theoretically consistent version of the Nelson and Siegel class of yield curve models. Applied Mathematical Finance 13, 39–59. Litterman, R., Scheinkman, J., 1991. Common factors affecting bond returns. Journal of Fixed Income 1, 77–85. Mönch, E., 2008. Forecasting the yield curve in a data-rich environment: a noarbitrage factor-augmented VAR Approach. Journal of Econometrics 146, 26–43. Nelson, C.R., Siegel, A.F., 1987. Parsimonious modeling of yield curves. Journal of Business 60, 473–489.
20
J.H.E. Christensen et al. / Journal of Econometrics 164 (2011) 4–20
Rudebusch, G.D., Swanson, E., Wu, T., 2006. The bond yield ‘conundrum’ from a macro-finance perspective. Monetary and Economic Studies 24, 83–128. Singleton, K.J., 2006. Empirical Dynamic Asset Pricing. Princeton University Press, Princeton. Svensson, L.E.O., 1995. Estimating forward interest rates with the extended Nelson–Siegel method. Sveriges Riksbank, Quarterly Review 3, 13–26. Trolle, A.B., Schwartz, E.S., 2009. A general stochastic volatility model for the pricing of interest rate derivatives. Review of Financial Studies 22, 2007–2057.
Vasiček, O., 1977. An equilibrium characterization of the term structure. Journal of Financial Economics 5, 177–188. Williams, D., 1997. Probability with Martingales. Cambridge University Press, Cambridge. Zantedeschi, D., Damien, P., Polson, N.G., 2009. Predictive macro-finance: a sequential term structure modeling approach based on regime switches. Manuscript. University of Chicago and University of Texas.
Journal of Econometrics 164 (2011) 21–34
Contents lists available at ScienceDirect
Journal of Econometrics journal homepage: www.elsevier.com/locate/jeconom
How useful are no-arbitrage restrictions for forecasting the term structure of interest rates? Andrea Carriero a , Raffaella Giacomini b,c,∗ a
School of Economics and Finance, Queen Mary, University of London, Mile End Road, E1 4NS London, United Kingdom
b
Department of Economics, University College London, Gower Street, WC1E 6BT London, United Kingdom
c
CEMMAP, United Kingdom
article
info
Article history: Available online 1 March 2011 JEL classification: C52 C53 E43 E47 Keywords: Forecast combination Encompassing Loss functions Instability Affine term structure models
abstract We develop a general framework for analyzing the usefulness of imposing parameter restrictions on a forecasting model. We propose a measure of the usefulness of the restrictions that depends on the forecaster’s loss function and that could be time varying. We show how to conduct inference about this measure. The application of our methodology to analyzing the usefulness of no-arbitrage restrictions for forecasting the term structure of interest rates reveals that: (1) the restrictions have become less useful over time; (2) when using a statistical measure of accuracy, the restrictions are a useful way to reduce parameter estimation uncertainty, but are dominated by restrictions that do the same without using any theory; (3) when using an economic measure of accuracy, the no-arbitrage restrictions are no longer dominated by atheoretical restrictions, but for this to be true it is important that the restrictions incorporate a time-varying risk premium. © 2011 Elsevier B.V. All rights reserved.
1. Introduction In recent years the finance literature has produced major advances in modeling the term structure of interest rates, building on the assumption of absence of arbitrage opportunities in bond markets. While the no-arbitrage approach has produced good results in terms of in-sample fit, see e.g. De Jong (2000) and Dai and Singleton (2000), the papers focusing on out-of sample forecasting have documented a mixed performance of these models. Duffee (2002) shows that beating a random walk with a traditional no-arbitrage affine term structure model is difficult. Ang and Piazzesi (2003) show that imposing no-arbitrage restrictions and an essentially affine specification of market prices of risk improves out-of-sample forecasts from a VAR(12), but the gains with respect to a random walk forecast are small. Carriero (in press) shows that the no-arbitrage restrictions provide better results if they are imposed on the data as prior information rather than as a set of restrictions. More encouraging results have been obtained by Almeida and Vicente (2008), Moench (2008) and Favero et al. (2007).
∗ Corresponding author at: Department of Economics, University College London, Gower Street, WC1E 6BT London, United Kingdom. Tel.: +44 0207679 5898; fax: +44 02079162775. E-mail address:
[email protected] (R. Giacomini). 0304-4076/$ – see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.jeconom.2011.02.010
A drawback of the above conclusions is that they are based on informal comparisons of mean squared forecast errors computed over a particular out-of-sample period. In this paper, we develop a formal framework for investigating the usefulness of parameter restrictions in general — and no-arbitrage restrictions in particular — when a model is used for forecasting. We achieve several goals: (1) we propose a measure of the usefulness of the restrictions that is tailored to the forecaster’s decision problem; (2) the measure can be time-varying; (3) we show how to perform inference about the proposed measure. Our framework can be used to answer questions such as ‘‘are no-arbitrage restrictions useful for forecasting the term structure of interest rates?’’, ‘‘are the restrictions useful for bond portfolio allocation?’’, and ‘‘have the restrictions become more or less useful over time?’’, which are not readily answered using conventional model evaluation and hypothesis testing tools. Our main idea is to cast the problem in an out-of-sample forecast combination framework, in which there is only one forecast model, but the forecaster has the option of imposing some restrictions on its parameters or to forecast with the unrestricted model. We consider a forecast combination and estimate the optimal combination weight in an out-of-sample framework. We say that the restriction is ‘‘useful for forecasting’’ when the optimal weight is large, and we can formally test the hypothesis that the restrictions are useless by an out-of-sample encompassing test. Optimality of the weight is defined with respect to a general
22
A. Carriero, R. Giacomini / Journal of Econometrics 164 (2011) 21–34
forecast loss function, but we show how to specialize the results to either the commonly used quadratic loss or to a loss based on (minus) the utility of a bond portfolio constructed using the model. The latter example of an economically meaningful loss has not been considered before for evaluating no-arbitrage models, and we show how its use can lead to substantially different conclusions than those based on conventional statistical measures of accuracy. We further generalize the techniques to an environment with possible instability and provide a test to assess whether the usefulness of the restrictions is time-varying. To gain some intuition for why the usefulness of restrictions could be timevarying, consider the case of a quadratic loss, for which it can be shown that the measure of usefulness captures the bias/variance tradeoff between a possibly misspecified restricted model and the unrestricted model. In this case, time variation could be due to the variance of the unrestricted model changing or the restrictions becoming more or less misspecified over time. We should point out that our methods do not allow one to disentangle the two sources of time variation. We stress that our techniques are not only applicable to the comparison between an unrestricted and a restricted forecast, but they can be more generally used for measuring the usefulness of two alternative sets of restrictions imposed on the same forecasting model. For example, the random walk model that is often used as a benchmark in forecasting can also be viewed as a set of restrictions on a VAR, and one could ask whether the no-arbitrage restrictions are useful relative to the random walk restrictions. Finally, our framework can be used to compare and combine forecasts from nested models, which is similar to the problem considered by Clark and McCracken (2009) in a different asymptotic context. From the perspective of forecast combination, our problem is non-standard because we do not combine forecasts from different models, but forecasts from the same model that are based on different estimators. This in principle poses challenges for the econometric methodology in that the unrestricted and restricted forecasts may be perfectly correlated in large samples if the restrictions are true. We overcome this problem by considering an out-of-sample environment with non-vanishing estimation uncertainty, as that considered by Giacomini and White (2006) in the different context of equal predictive ability tests, and use it to derive out-of-sample encompassing tests. Encompassing tests are appealing in our context because in case of rejection of the null hypothesis they provide as a byproduct a combination weight that can be naturally interpreted as a measure of usefulness of the restrictions, whereas a test of equal predictive ability would force one to choose either the restricted or the unrestricted forecast. This combination weight can further be used to produce combined forecasts that exploit the information contained in the economic restrictions in a way that is optimal for the loss function of interest. Finally, the encompassing approach naturally lends itself to extensions to unstable environments, since the combination weight can be postulated to be time-varying. Our contribution to the literature in this respect is to provide a valid asymptotic theory for testing hypotheses about the time-varying weight. 
Note that our problem is also different from testing the restrictions in-sample, since we allow for the possibility that the restrictions are not true, but are still useful for out-of-sample forecasting for a given loss function.
2. A measure of the usefulness of economic restrictions

2.1. Set-up and notation

Let $y_t = (x_t, z_t)'$ indicate the vector of observables, which includes the (scalar) variable of interest $x_t$ and the vector of predictors $z_t$. We assume that the user has obtained two sequences of h-step-ahead out-of-sample forecasts for $x_t$, by first estimating the model without imposing the restrictions (the "unrestricted forecast") and then re-estimating the model subject to the restrictions (the "restricted forecast"). If the interest is in comparing two alternative sets of restrictions, the unrestricted forecast is replaced by the alternative restricted forecast, but for simplicity we will continue to refer to the forecasts as "restricted" and "unrestricted". The forecasts are obtained by a rolling window estimation scheme, which entails estimating the model using data indexed $t-m+1, \ldots, t$ for each $t = m, \ldots, T-h$ and using the model estimated at time $t$ to produce a forecast for $x_{t+h}$. This gives two sequences of $n \equiv T-h-m+1$ forecasts, $\{f_{t,h}^U\}_{t=m}^{T-h}$ and $\{f_{t,h}^R\}_{t=m}^{T-h}$, denoting respectively the unrestricted and the restricted forecasts. The asymptotic framework considers the in-sample size $m$ fixed and lets the out-of-sample size $n$ grow to infinity, so that all results are implicitly conditional on the choice of $m$, which is user-defined. The computation of the time-varying measure of usefulness further requires choosing a smoothing window of size $d$, which is a constant fraction $\pi$ of the out-of-sample size $n$.

The user must finally choose a forecast loss function $L(x_{t+h}, f_{t,h})$. We consider in particular two types of loss functions, a quadratic loss and a portfolio utility loss. The quadratic loss is defined as
$$L(x_{t+h}, f_{t,h}) = (x_{t+h} - f_{t,h})^2, \quad (1)$$
where $x_t$ in our application will be the yield on a zero coupon bond of maturity $\tau$. The portfolio utility loss considers the asset allocation problem of an investor who is buying a portfolio of $q$ assets in period $t$ and then sells it in period $t+1$. In our application such assets will be $q$ zero coupon bonds of maturities $\tau_1, \tau_2, \ldots, \tau_q$. Defining $x_t$ as the vector of returns on each asset, $x_t = (x_1, x_2, \ldots, x_q)'$, and $w^*$ as a vector of optimal weights, the return on such a portfolio is given by $w^{*\prime} x_t$. The portfolio utility loss is similar to that considered by West et al. (1993), and is given by
$$L(x_{t+h}, f_{t,h}) = -w^*(f_{t,h})' x_{t+h} + \frac{\gamma}{2}\, w^*(f_{t,h})' \Sigma\, w^*(f_{t,h}), \quad (2)$$
where $w^*(f_{t,h})$ are the optimal portfolio weights for a quadratic utility, and are linear functions of the forecasts (the exact expression is given in (18)). Note that in this case $f_{t,h}$ is a vector containing the forecasts of each element in $x_t$. The matrix $\Sigma$ is the variance-covariance matrix of $x_{t+h}$, and $\gamma$ is a user-defined parameter related to the coefficient of relative risk aversion $\delta$ by the relationship $\gamma/(1-\gamma) = \delta$. Our empirical results are obtained by setting $\delta = 1$, so that $\gamma = 0.5$.

2.2. Methodology for a general loss function

Consider a combination of the restricted and unrestricted forecast, $f_{t,h}^* = f_{t,h}^R + (1-\lambda)(f_{t,h}^U - f_{t,h}^R)$, so that $\lambda$ is the weight on the restricted forecast. The optimal weight $\lambda^*$ minimizes the expected out-of-sample loss of the combined forecast:
$$\lambda^* = \arg\min_{\lambda \in \mathbb{R}} E\left[\frac{1}{n}\sum_{t=m}^{T-h} L\big(x_{t+h}, f_{t,h}^R + (1-\lambda)(f_{t,h}^U - f_{t,h}^R)\big)\right] = \arg\min_{\lambda \in \mathbb{R}} E[Q_n(\lambda)], \quad (3)$$
and is estimated by
$$\hat\lambda = \arg\min_{\lambda \in \mathbb{R}} \frac{1}{n}\sum_{t=m}^{T-h} L\big(x_{t+h}, f_{t,h}^R + (1-\lambda)(f_{t,h}^U - f_{t,h}^R)\big) = \arg\min_{\lambda \in \mathbb{R}} Q_n(\lambda). \quad (4)$$
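To make the estimator concrete, the following sketch computes $\hat\lambda$ in (4) by direct numerical minimization of $Q_n(\lambda)$. It assumes the realizations and the two rolling-window forecast sequences are already aligned as arrays; all names are illustrative rather than part of the paper's notation.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def estimate_lambda(x, f_R, f_U, loss):
    """Estimate the combination weight by minimizing Q_n(lambda) in Eq. (4).

    x    : realizations x_{t+h}, t = m, ..., T-h
    f_R  : restricted rolling-window forecasts f^R_{t,h}
    f_U  : unrestricted rolling-window forecasts f^U_{t,h}
    loss : elementwise loss function L(x, f), e.g. the quadratic loss in (1)
    """
    def Q_n(lam):
        f_comb = f_R + (1.0 - lam) * (f_U - f_R)  # combined forecast
        return np.mean(loss(x, f_comb))           # out-of-sample average loss
    # For the loss functions of Section 3, Q_n is a convex quadratic in
    # lambda, so a one-dimensional minimizer over the real line suffices.
    return minimize_scalar(Q_n).x

# Quadratic loss of Eq. (1):
quadratic_loss = lambda x, f: (x - f) ** 2
```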
The estimated optimal weight $\hat\lambda$ is our measure of the usefulness of the economic restrictions for forecasting, for a given loss function $L(\cdot)$. A small $\hat\lambda$ indicates that the restrictions are not useful for forecasting, whereas a large $\hat\lambda$ suggests that the economic restrictions can be usefully imposed to obtain more accurate forecasts. $\hat\lambda$ in (4) can be computed for a general loss function using numerical methods, but we show how to derive simple analytical expressions for the special cases of a quadratic and a portfolio loss function in Section 3. The asymptotic distribution of $\hat\lambda$ is obtained by recognizing that $\hat\lambda$ is an M-estimator, which minimizes the (typically well-behaved) objective function $Q_n(\lambda)$. A similar remark was made by Elliott and Timmermann (2004), in an environment where the forecasts are based on different models and are taken as given. The fact that in our context the forecasts are based on the same model and depend on in-sample data and estimated parameters introduces some complications, which we handle using a generalization of the key insight in Giacomini and White (2006). Specifically, we show that an asymptotic theory for $\hat\lambda$ can still be derived by relying on laws of large numbers, central limit theorems and functional central limit theorems for the objective function and its derivatives, in spite of the fact that such functions depend in a complex nonlinear manner on the in-sample data through $f_{t,h}^R$ and $f_{t,h}^U$. This is because we assume that the in-sample estimation window is finite, so that the objective function and its derivatives become functions of the finite history of "short memory" (mixing) processes, and are thus themselves short memory and plausibly satisfy laws of large numbers and central limit theorems. We rely on the asymptotic properties of $\hat\lambda$ to obtain formal methods for testing the usefulness of the restrictions, both in an environment where such usefulness is constant over time (Section 2.2.1) and in an environment with possibly time-varying usefulness (Section 2.2.2).

2.2.1. Testing the global usefulness of parameter restrictions

We first consider an environment in which $\lambda^*$ is constant over time, and can thus be interpreted as a "global" measure of the usefulness of the restrictions. Proposition 1 shows how to construct formal tests for whether the unrestricted forecast is useless ($H_0^U : \lambda^* = 1$) or whether the restricted forecast is useless ($H_0^R : \lambda^* = 0$), which are essentially out-of-sample encompassing tests. The tests are derived under the following assumptions.

Assumption A. (1) $E[Q_n(\lambda)]$ is uniquely minimized at $\lambda^* < \infty$; (2) $L(x_{t+h}, f_{t,h}^R + (1-\lambda)(f_{t,h}^U - f_{t,h}^R))$ is convex and twice continuously differentiable with respect to $\lambda$; (3) $\{x_t\}$ is mixing with $\phi$ of size $-r/(r-1)$ or $\alpha$ of size $-2r/(r-2)$, $r > 2$; (4) $E|L(x_{t+h}, f_{t,h}^R + (1-\lambda)(f_{t,h}^U - f_{t,h}^R))|^{r/2} < \infty$ for all $t$ and all $\lambda$; (5) $E|\nabla_\lambda L(x_{t+h}, f_{t,h}^R + (1-\lambda^*)(f_{t,h}^U - f_{t,h}^R))|^{2r} < \infty$ for all $t$, where $\nabla_\lambda$ indicates the first derivative with respect to $\lambda$; (6) $\Omega = E[(\sqrt{n}\,\nabla_\lambda Q_n(\lambda^*))^2] > 0$ for all $n$; (7) $H = E[\nabla_{\lambda\lambda} Q_n(\lambda^*)] > 0$ for all $n$; (8) $\sup_{\lambda \in \Lambda} \|\nabla_{\lambda\lambda} Q_n(\lambda) - E[\nabla_{\lambda\lambda} Q_n(\lambda)]\| \to_p 0$, where $\Lambda$ indicates a neighborhood of $\lambda^*$ and $\nabla_{\lambda\lambda}$ the second derivative with respect to $\lambda$; (9) $m < \infty$, $h < \infty$, $n \to \infty$.

Assumption A(1) is satisfied by the quadratic and the portfolio utility loss functions considered in Section 3, which are both quadratic polynomials in $\lambda$. Assumption A(2) is stronger than necessary and is only imposed for convenience and because it is satisfied by the loss functions in Section 3. Following Newey and McFadden (1994), it is straightforward to extend the results to an environment with non-convex and non-differentiable objective functions. Assumptions A(3)-(7) are the familiar primitive conditions guaranteeing applicability of laws of large numbers and central limit theorems for the objective function and its derivatives. Note that these conditions, while ruling out the presence of unit roots, allow the data to be heterogeneous and dependent. Assumption A(7) could be violated if the forecasts were perfectly correlated in large samples. To see why, consider for simplicity the quadratic loss case, where $H = E[2(f_{t,h}^R - f_{t,h}^U)^2]$. If the restrictions were true, the forecasts would become perfectly correlated as the estimation sample grows, making $H$ converge to zero. This occurrence is however ruled out in our context by A(9), which assumes that the estimation sample is fixed, thus preventing estimation uncertainty from disappearing asymptotically. Assumption A(8) requires a uniform law of large numbers for the second derivatives of the objective function. Primitive conditions for A(8) could easily be found, but we do not specify them here because A(8) becomes considerably simpler for the loss functions considered in Section 3, since in both cases the second derivative of the objective function does not depend on $\lambda$. For these loss functions, A(8) can be replaced with the condition that $\nabla_{\lambda\lambda} Q_n$ has finite $r/2$-th moments, which, together with A(3), guarantees that a law of large numbers can be invoked for $\nabla_{\lambda\lambda} Q_n$. Assumption A(9) shows that the asymptotic distribution is obtained by letting the out-of-sample size $n$ grow to infinity, whereas the in-sample size $m$ and the forecast horizon $h$ are finite.

Proposition 1 (Tests of Global Usefulness). Suppose Assumption A holds. Let
$$t^U = \frac{\sqrt{n}(\hat\lambda - 1)}{\hat\sigma}; \qquad t^R = \frac{\sqrt{n}\,\hat\lambda}{\hat\sigma}, \quad (5)$$
where $\hat\sigma$ is given by
$$\hat\sigma^2 = \hat H^{-1}\hat\Omega\hat H^{-1}; \quad \hat H = \nabla_{\lambda\lambda} Q_n(\hat\lambda); \quad \hat\Omega = \sum_{j=-p_n+1}^{p_n-1}\left(1 - \left|\frac{j}{p_n}\right|\right) n^{-1}\sum_{t=m+j}^{T-h} s_t(\hat\lambda)\, s_{t-j}(\hat\lambda); \quad s_t(\hat\lambda) = \nabla_\lambda L\big(x_{t+h}, f_{t,h}^R + (1-\hat\lambda)(f_{t,h}^U - f_{t,h}^R)\big), \quad (6)$$
where $p_n$ is a bandwidth that increases with the sample size (Newey and West, 1987). Then the hypotheses $H_0^U : \lambda^* = 1$ and $H_0^R : \lambda^* = 0$ are rejected at significance level $\alpha$ when, respectively, $|t^U| > c_{\alpha/2}$ and $|t^R| > c_{\alpha/2}$, with $c_{\alpha/2}$ indicating the $1-\alpha/2$ quantile of a $N(0,1)$ distribution.

The bandwidth $p_n$ used in the construction of the test statistic must be appropriately chosen to account for the possible serial correlation in the first derivatives of the loss function. In practice, the accuracy of the estimate of $\Omega$ can be an issue, particularly for long-horizon forecasts (see, e.g., Kim and Nelson, 1993; Harvey et al., 1998). In our application, we follow Kim and Nelson's (1993) recommendation and set $p_n = 2(h-1)$.
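As an illustration, a minimal sketch of the statistics in (5)-(6), assuming the score series $s_t(\hat\lambda)$ and the scalar Hessian $\hat H$ have already been computed for the loss at hand (the function name and inputs are ours, not the paper's):

```python
import numpy as np

def global_usefulness_tests(s, H, lam_hat, p_n):
    """t-statistics of Proposition 1.

    s       : scores s_t(lambda_hat), t = m, ..., T-h
    H       : second derivative of Q_n at lambda_hat
    lam_hat : estimated weight from Eq. (4)
    p_n     : Newey-West bandwidth, e.g. p_n = 2*(h - 1)
    """
    n = s.size
    # Bartlett-kernel (Newey-West) estimate of Omega in Eq. (6)
    omega = s @ s / n
    for j in range(1, max(p_n, 1)):   # loop is empty when p_n <= 1 (h = 1)
        omega += 2.0 * (1.0 - j / p_n) * (s[j:] @ s[:-j]) / n
    sigma = np.sqrt(omega / H**2)     # sigma^2 = H^{-1} Omega H^{-1}
    t_U = np.sqrt(n) * (lam_hat - 1.0) / sigma   # H0: lambda* = 1
    t_R = np.sqrt(n) * lam_hat / sigma           # H0: lambda* = 0
    return t_U, t_R                   # compare |t| with N(0,1) quantiles
```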
2.2.2. Testing the usefulness of parameter restrictions in the presence of instability

A question that may be of further interest to forecasters is whether the usefulness of the restrictions varies over time. To answer this question, we extend the previous analysis to the case of time-varying forecast combination weights. These time-varying weights can be interpreted as measuring the "local" usefulness of the restrictions, and solve the problem
$$\lambda_t^* = \arg\min_{\lambda_t \in \mathbb{R}} E\big[L\big(x_{t+h}, f_{t,h}^R + (1-\lambda_t)(f_{t,h}^U - f_{t,h}^R)\big)\big], \quad t = m, \ldots, T-h. \quad (7)$$

A simple nonparametric estimator of (7) can be obtained by computing rolling average weights over windows of size $d$:
$$\hat\lambda_{t,d} = \arg\min_{\lambda_t \in \mathbb{R}} \frac{1}{n}\sum_{j=t-d+1}^{t} L\big(x_{j+h}, f_{j,h}^R + (1-\lambda_t)(f_{j,h}^U - f_{j,h}^R)\big), \quad t = m+d-1, \ldots, T-h. \quad (8)$$

Instead of adopting a standard asymptotic approximation to conduct inference about (8), which would require the bandwidth $d/n$ to go to zero as $d$ and $n$ grow to infinity, we follow an approach similar to that of Giacomini and Rossi (2010) and obtain a distribution theory for $\hat\lambda_{t,d}$ that has better finite-sample properties by using a non-standard asymptotic approximation with a fixed bandwidth. Note that in this fixed-bandwidth approximation, however, $\hat\lambda_{t,d}$ is no longer a consistent estimator of $\lambda_t^*$; it consistently estimates a "smoothed" version of $\lambda_t^*$:
$$\lambda_{t,d}^* = \arg\min_{\lambda_t \in \mathbb{R}} \sum_{j=t-d+1}^{t} E\big[L\big(x_{j+h}, f_{j,h}^R + (1-\lambda_t)(f_{j,h}^U - f_{j,h}^R)\big)\big], \quad t = m+d-1, \ldots, T-h. \quad (9)$$

Further note that, as a result of adopting a non-standard fixed-bandwidth approximation, standard results for optimal bandwidth selection obtained in the nonparametric literature do not apply here. Instead, in our framework different choices of bandwidth result in a different null hypothesis being tested. A plot of the sample path of $\{\hat\lambda_{t,d}\}_{t=m+d-1}^{T-h}$ in (8) can uncover possible time variation in the usefulness of the economic restrictions. Proposition 2 further shows how to test the hypothesis that the unrestricted forecast was consistently useless ($H_0^U : \lambda_{t,d}^* = 1$ for all $t$) or that the restricted forecast was consistently useless ($H_0^R : \lambda_{t,d}^* = 0$ for all $t$) over time. We control the overall size of the procedure by deriving uniform confidence bands that have the desired coverage under the null hypothesis. The proposition relies on the following set of assumptions.

Assumption B. Let $\tau \in [0,1]$. Under the hypothesis that $\lambda_{t,d}^*$ is constant and equal to $\lambda^*$: (1) $n^{-1/2}\sum_{j=m}^{m+[\tau n]} \nabla_\lambda L(x_{j+h}, f_{j,h}^R + (1-\lambda^*)(f_{j,h}^U - f_{j,h}^R))$ obeys a Functional Central Limit Theorem, with $\Omega = \lim_{n\to\infty} E\big(n^{-1/2}\sum_{j=m}^{T-h} \nabla_\lambda L(x_{j+h}, f_{j,h}^R + (1-\lambda^*)(f_{j,h}^U - f_{j,h}^R))\big)^2 > 0$; (2) $d/n \to \pi \in (0,\infty)$ as $d \to \infty$, $n \to \infty$; $m < \infty$ and $h < \infty$; (3) $\hat\sigma \to_p \sigma$ and $\hat\lambda \to_p \lambda^*$.

Primitive conditions for B(1) and (3) analogous to those listed in Assumption A could be similarly specified here.

Proposition 2 (Tests of Time Variation in Usefulness). Suppose Assumption B holds. For a significance level $\alpha$, first construct the bands
$$\left[\hat\lambda_{t,d} - k_{\alpha,\pi}\frac{\hat\sigma}{\sqrt{d}},\;\; \hat\lambda_{t,d} + k_{\alpha,\pi}\frac{\hat\sigma}{\sqrt{d}}\right], \quad t = m+d-1, \ldots, T-h, \quad (10)$$
where $k_{\alpha,\pi}$ is tabulated in Table 1 for various values of $\pi = d/n$ and $\hat\sigma$ is as in Proposition 1. The null hypotheses $H_0^U : \lambda_{t,d}^* = 1$ for all $t$ and $H_0^R : \lambda_{t,d}^* = 0$ for all $t$ can be rejected if there exists at least one $t$ at which, respectively, 1 or 0 falls outside the bands.
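A sketch of the local analysis follows, reusing estimate_lambda from the sketch in Section 2.2 and taking the critical value $k_{\alpha,\pi}$ from Table 1 (for example, $k = 3.012$ for $\alpha = 0.05$ and $\pi = 0.3$); again the names are illustrative:

```python
import numpy as np

def rolling_lambda(x, f_R, f_U, loss, d, estimate_lambda):
    """Smoothed weights lambda_hat_{t,d} of Eq. (8), windows of size d."""
    n = x.size
    return np.array([estimate_lambda(x[t - d + 1:t + 1],
                                     f_R[t - d + 1:t + 1],
                                     f_U[t - d + 1:t + 1], loss)
                     for t in range(d - 1, n)])

def uniform_bands(lam_path, sigma_hat, d, k):
    """Uniform confidence bands of Eq. (10); k = k_{alpha,pi} from Table 1."""
    half_width = k * sigma_hat / np.sqrt(d)
    return lam_path - half_width, lam_path + half_width

# H0 (restricted forecast useless) is rejected if 0 falls outside the
# bands for at least one t; H0 (unrestricted useless) likewise with 1.
```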
3. Special cases: quadratic and portfolio utility loss

This section specializes the general methods described in Section 2.2 to the cases of a quadratic and a portfolio utility loss.

3.1. Quadratic loss

For a quadratic loss, the objective function in (3) is $Q_n(\lambda) = \sum_{t=m}^{T-h} (x_{t+h} - f_{t,h}^R - (1-\lambda)(f_{t,h}^U - f_{t,h}^R))^2$, which is minimized by
$$\lambda^* = \frac{E\left[\sum_{t=m}^{T-h} (x_{t+h} - f_{t,h}^U)(f_{t,h}^R - f_{t,h}^U)\right]}{E\left[\sum_{t=m}^{T-h} (f_{t,h}^R - f_{t,h}^U)^2\right]}. \quad (11)$$

A consistent estimator of $\lambda^*$ is
$$\hat\lambda = \frac{\sum_{t=m}^{T-h} (x_{t+h} - f_{t,h}^U)(f_{t,h}^R - f_{t,h}^U)}{\sum_{t=m}^{T-h} (f_{t,h}^R - f_{t,h}^U)^2}, \quad (12)$$
or, equivalently, the OLS estimator of $\lambda$ in the regression
$$x_{t+h} - f_{t,h}^U = \lambda (f_{t,h}^R - f_{t,h}^U) + \varepsilon_{t+h}, \quad t = m, \ldots, T-h. \quad (13)$$

The estimator $\hat\sigma$ that is needed for constructing the tests in Propositions 1 and 2 is in this case given by Eq. (14) in Box I, where $\hat\varepsilon_{t+h}$ are regression residuals from (13) and $p_n$ is a bandwidth that increases with the sample size (Newey and West, 1987). In the presence of possible instability, a consistent estimator of the smoothed measure of usefulness $\lambda_{t,d}^*$ in (9) can be similarly obtained as
$$\hat\lambda_{t,d} = \frac{\sum_{j=t-d+1}^{t} (x_{j+h} - f_{j,h}^U)(f_{j,h}^R - f_{j,h}^U)}{\sum_{j=t-d+1}^{t} (f_{j,h}^R - f_{j,h}^U)^2}, \quad t = m+d-1, \ldots, T-h, \quad (15)$$
or, equivalently, by estimating the OLS coefficient in the following regression over rolling samples of size $d$:
$$x_{j+h} - f_{j,h}^U = \lambda_{t,d} (f_{j,h}^R - f_{j,h}^U) + \varepsilon_{j+h}; \quad j = t-d+1, \ldots, t; \quad t = m+d-1, \ldots, T-h. \quad (16)$$

In the empirical application, the variable to forecast will be $x_t = y_t^{(\tau)}$, i.e. the yield of a bond of maturity $\tau$.
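For the quadratic loss, the closed-form expressions (12)-(14) amount to one OLS regression and a Newey-West variance; a minimal sketch (illustrative names):

```python
import numpy as np

def lambda_quadratic(x, f_R, f_U, p_n):
    """Quadratic-loss estimator: lambda_hat from Eq. (12)/(13) and the
    HAC variance sigma2_hat from Eq. (14)."""
    z = f_R - f_U                       # regressor in Eq. (13)
    y = x - f_U                         # dependent variable in Eq. (13)
    lam = (y @ z) / (z @ z)             # Eq. (12): OLS slope
    eps = y - lam * z                   # residuals of Eq. (13)
    n = z.size
    H = z @ z / n                       # (1/n) sum (f_R - f_U)^2
    g = z * eps
    omega = g @ g / n                   # Bartlett-weighted long-run variance
    for j in range(1, max(p_n, 1)):
        omega += 2.0 * (1.0 - j / p_n) * (g[j:] @ g[:-j]) / n
    return lam, omega / H**2            # Eq. (14)
```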
3.2. Portfolio utility loss

Let $x_t = (x_1, x_2, \ldots, x_q)'$ be a $q \times 1$ vector of returns on risky assets and consider the portfolio $w'x_t$, with weights summing to 1. In analogy with our empirical application to no-arbitrage VARs, we suppose the forecaster has a model for $x_{t+h}$ and has the option of estimating it unrestricted or by imposing restrictions that only affect the conditional mean parameters. We further assume that the model does not specify conditional variance dynamics, so that the conditional variance of $x_{t+h}$ at time $t$ simply equals the unconditional variance-covariance matrix of the $q$ assets: $\mathrm{Var}_t[x_{t+h}] = \mathrm{Var}[x_{t+h}] = \Sigma$. We suppose that at each time $t = m, \ldots, T-h$ the forecaster constructs a portfolio by choosing the weights that minimize the quadratic utility loss, subject to the weights summing to one:
$$w^* = \arg\min_{w} \left(-w' E_t[x_{t+h}] + \frac{\gamma}{2}\, w' \Sigma w\right), \quad (17)$$
Table 1
Critical values $k_{\alpha,\pi}$ for the confidence bands in Proposition 2.

α       π = d/n:  0.1     0.2     0.3     0.4     0.5     0.6     0.7     0.8     0.9
0.05              3.393   3.179   3.012   2.890   2.779   2.634   2.560   2.433   2.248
0.10              3.170   2.948   2.766   2.626   2.500   2.356   2.252   2.130   1.950

Box I
$$\hat\sigma^2 = \left[\frac{1}{n}\sum_{t=m}^{T-h}(f_{t,h}^R - f_{t,h}^U)^2\right]^{-1}\left[\sum_{j=-p_n+1}^{p_n-1}\left(1 - \left|\frac{j}{p_n}\right|\right) n^{-1}\sum_{t=m+j}^{T-h}(f_{t,h}^R - f_{t,h}^U)\,\hat\varepsilon_{t+h}\,(f_{t-j,h}^R - f_{t-j,h}^U)\,\hat\varepsilon_{t+h-j}\right]\left[\frac{1}{n}\sum_{t=m}^{T-h}(f_{t,h}^R - f_{t,h}^U)^2\right]^{-1} \quad (14)$$
Table 2
Empirical size of the tests in (5).

Panel A: Case β₁ = 1 (λ* = 0). Percentage of rejections of H₀: λ = 0

m \ n    50      100     150     250     500     1000
50       0.070   0.068   0.053   0.056   0.054   0.052
100      0.067   0.059   0.056   0.053   0.051   0.051
150      0.074   0.064   0.056   0.052   0.047   0.048
250      0.073   0.060   0.058   0.052   0.049   0.048

Panel B: Case β₁ = 0 (λ* = 1). Percentage of rejections of H₀: λ = 1

m \ n    50      100     150     250     500     1000
50       0.068   0.069   0.052   0.058   0.053   0.052
100      0.067   0.057   0.059   0.053   0.051   0.049
150      0.074   0.064   0.057   0.052   0.046   0.047
250      0.070   0.060   0.058   0.052   0.049   0.051

The table contains the percentage of rejections for the tests in (5) with a nominal size of 0.05.
where $E_t[\cdot]$ denotes the conditional mean at time $t$. The classical solution (Markowitz, 1952) to this problem is given by
$$w^* = a + B\,E_t[x_{t+h}]; \quad a = \frac{\Sigma^{-1}\iota}{\iota'\Sigma^{-1}\iota}; \quad B = \frac{1}{\gamma}\left(\Sigma^{-1} - \frac{\Sigma^{-1}\iota\iota'\Sigma^{-1}}{\iota'\Sigma^{-1}\iota}\right), \quad (18)$$
where $\iota$ is a $q \times 1$ vector of ones. When the economic restrictions only affect the conditional mean of the assets, as is the case for the no-arbitrage restrictions that we are interested in, the forecaster can construct two different portfolios, one by forecasting the conditional mean with the unrestricted model, so that $E_t[x_{t+h}] = f_{t,h}^U$, and one by imposing the restrictions and letting $E_t[x_{t+h}] = f_{t,h}^R$. We can similarly consider the portfolio whose optimal weights are a function of the combination forecast, and our measure of usefulness is then obtained by minimizing the expected portfolio utility loss in (2) with respect to the forecast combination weight $\lambda$:
$$\lambda^* = \arg\min_{\lambda \in \mathbb{R}} E\Big[-\big(a + B(f_{t,h}^R + (1-\lambda)(f_{t,h}^U - f_{t,h}^R))\big)' x_{t+h} + \frac{\gamma}{2}\big(a + B(f_{t,h}^R + (1-\lambda)(f_{t,h}^U - f_{t,h}^R))\big)' \Sigma \big(a + B(f_{t,h}^R + (1-\lambda)(f_{t,h}^U - f_{t,h}^R))\big)\Big]. \quad (19)$$

The closed-form solution for this problem is
$$\lambda^* = \frac{E\big[(f_{t,h}^R - f_{t,h}^U)' B' \big(x_{t+h} - \gamma\Sigma(a + Bf_{t,h}^U)\big)\big]}{E\big[\gamma (f_{t,h}^R - f_{t,h}^U)' B' \Sigma B (f_{t,h}^R - f_{t,h}^U)\big]}. \quad (20)$$

A consistent estimator of $\lambda^*$ is
$$\hat\lambda = \frac{\sum_{t=m}^{T-h}(f_{t,h}^R - f_{t,h}^U)' \hat B_t' \big(x_{t+h} - \gamma\hat\Sigma_t(\hat a_t + \hat B_t f_{t,h}^U)\big)}{\sum_{t=m}^{T-h} \gamma (f_{t,h}^R - f_{t,h}^U)' \hat B_t' \hat\Sigma_t \hat B_t (f_{t,h}^R - f_{t,h}^U)}, \quad (21)$$
where $\hat a_t$ and $\hat B_t$ are as defined in (18) with $\Sigma$ substituted at each time $t$ by an estimate computed over each rolling window of data up to time $t$:
$$\hat\Sigma_t = \frac{1}{m}\sum_{j=t-m+1}^{t}(x_j - \bar x)(x_j - \bar x)', \quad \text{with } \bar x = \frac{1}{m}\sum_{j=t-m+1}^{t} x_j. \quad (22)$$

The estimator of the asymptotic variance $\hat\sigma$ that is needed for constructing the test in Proposition 1 and the bands in Proposition 2 is obtained by setting
$$s_t(\hat\lambda) = (f_{t,h}^R - f_{t,h}^U)' \hat B_t' \big[x_{t+h} - \gamma\hat\Sigma_t\big(\hat a_t + \hat B_t(f_{t,h}^R + (1-\hat\lambda)(f_{t,h}^U - f_{t,h}^R))\big)\big] \quad (23)$$
and
$$\frac{\partial s_t(\hat\lambda)}{\partial\lambda} = \gamma (f_{t,h}^R - f_{t,h}^U)' \hat B_t' \hat\Sigma_t \hat B_t (f_{t,h}^R - f_{t,h}^U)$$
in Eq. (6). In the presence of time variation, a consistent estimator of the smoothed measure of usefulness (9) for a portfolio utility loss can be obtained as
$$\hat\lambda_{t,d} = \frac{\sum_{j=t-d+1}^{t}(f_{j,h}^R - f_{j,h}^U)' \hat B_j' \big(x_{j+h} - \gamma\hat\Sigma_j(\hat a_j + \hat B_j f_{j,h}^U)\big)}{\sum_{j=t-d+1}^{t} \gamma (f_{j,h}^R - f_{j,h}^U)' \hat B_j' \hat\Sigma_j \hat B_j (f_{j,h}^R - f_{j,h}^U)}, \quad t = m+d-1, \ldots, T-h. \quad (24)$$

In the empirical application, the variable to be forecast will be $x_t = (r_t^{(\tau_1)}, r_t^{(\tau_2)}, \ldots, r_t^{(\tau_q)})'$, i.e. a vector of returns on bonds of different maturities.
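The portfolio-loss calculations reduce to the linear algebra in (18), (21) and (22); the sketch below (illustrative names, with the rolling covariance estimates taken as given) mirrors those formulas:

```python
import numpy as np

def markowitz_ab(Sigma, gamma):
    """Coefficients a and B of the optimal weights in Eq. (18)."""
    iota = np.ones(Sigma.shape[0])
    Si = np.linalg.inv(Sigma)
    denom = iota @ Si @ iota
    a = Si @ iota / denom
    B = (Si - np.outer(Si @ iota, iota @ Si) / denom) / gamma
    return a, B

def lambda_portfolio(x, f_R, f_U, Sigma_t, gamma):
    """Estimator of lambda* in Eq. (21).

    x, f_R, f_U : (n, q) arrays of realized returns and forecasts
    Sigma_t     : (n, q, q) rolling covariance estimates, Eq. (22)
    """
    num = den = 0.0
    for t in range(x.shape[0]):
        a, B = markowitz_ab(Sigma_t[t], gamma)
        diff = f_R[t] - f_U[t]
        num += diff @ B.T @ (x[t] - gamma * Sigma_t[t] @ (a + B @ f_U[t]))
        den += gamma * diff @ B.T @ Sigma_t[t] @ B @ diff
    return num / den
```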
4. Illustrative example and finite sample properties

To gain intuition about the determinants of our measure of usefulness $\lambda^*$ and to assess the finite-sample properties of our tests, we consider a simple example of two competing sets of restrictions imposed on the parameters of a linear model and investigate the size properties of the global usefulness test in Proposition 1.

Suppose the data-generating process is
$$x_t = \beta_1 z_{1t} + \beta_2 z_{2t} + \varepsilon_t; \quad z_t \sim \text{i.i.d. } N(0, I_2), \quad \varepsilon_t \sim \text{i.i.d. } N(0,1), \quad (25)$$
and that the two models $M_U$ and $M_R$ impose the competing restrictions $\beta_1 = \beta_1^U$ or $\beta_1 = \beta_1^R$, while leaving $\beta_2$ unrestricted. This yields the one-step-ahead forecasts $f_t^U = \beta_1^U z_{1,t+1} + \hat\beta_2 z_{2,t+1}$ and $f_t^R = \beta_1^R z_{1,t+1} + \hat\gamma_2 z_{2,t+1}$, with $\hat\beta_2$ and $\hat\gamma_2$ OLS estimators. The optimal weight (11) for a quadratic loss function in this example is¹
$$\lambda^* = \frac{\beta_1 - \beta_1^U}{\beta_1^R - \beta_1^U}, \quad (26)$$
which reveals that the usefulness of the restrictions is determined by the relative amount of bias implied by the two models, so that $\lambda^*$ equals 0 or 1 when either restriction is true. Notice that the weight does not necessarily fall between 0 and 1, but in principle could be any value on the real line. Intuitively, in this simple example the relative bias of the models is the sole determinant of the usefulness of the restrictions because the two forecasts imply the same amount of estimation uncertainty, but in more general cases there will be a bias-variance trade-off between different sets of restrictions.

¹ To see why, note that the numerator in (11) is $(\beta_1 - \beta_1^U)(\beta_1^R - \beta_1^U) + E[(\beta_2 - \hat\beta_2)(\hat\gamma_2 - \hat\beta_2)]$ and the denominator is $(\beta_1^R - \beta_1^U)^2 + E[(\hat\gamma_2 - \hat\beta_2)^2]$, from which the result follows by noting that $E(\hat\beta_2) = E(\hat\gamma_2)$ and $E(\hat\beta_2\hat\gamma_2) = E(\hat\beta_2^2) = E(\hat\gamma_2^2)$.

We now proceed to illustrate the small-sample properties of our tests using a simple Monte Carlo simulation. We assume that the true DGP is given by (25) and we consider the following two restricted models. The first model imposes on (25) the restriction $\beta_1 = \beta_1^U = 1$:
$$M_U : x_t = z_{1t} + \beta_2 z_{2t} + \varepsilon_t. \quad (27)$$
The second model imposes on (25) the restriction $\beta_1 = \beta_1^R = 0$:
$$M_R : x_t = \gamma_2 z_{2t} + \varepsilon_t. \quad (28)$$
For this set of restrictions the expression in (26) simplifies to $\lambda^* = 1 - \beta_1$. We draw 5000 random samples from the DGP in (25), setting² $\beta_2 = 1$ and using in turn $\beta_1 = 1$ and $\beta_1 = 0$. The case $\beta_1 = 1$ yields by construction $\lambda^* = 0$ ($M_U$ imposes the right restrictions and $M_R$ is useless). The case $\beta_1 = 0$ yields by construction $\lambda^* = 1$ ($M_R$ imposes the right restrictions and $M_U$ is useless). For each of the 5000 replications we first estimate the two restricted models and obtain the corresponding forecasts. Then we compute $\hat\lambda$ and $\hat\sigma$ using (13) and (14), and use them to obtain the test statistics in (5). Results of our simulation for different sizes of the estimation window $m$ and forecast window $n$ are reported in Table 2. Panel A displays the empirical size of the test in (5) for $H_0 : \lambda = 0$, and Panel B reports the empirical size of the test in (5) for $H_0 : \lambda = 1$. In both cases the nominal size of the tests is 0.05. As is clear from the table, the tests in (5) are well-sized, and for a given estimation window the size tends to improve as the size of the forecast window increases.

² As is clear from (26), the choice of the value of $\beta_2$ in the true DGP does not influence the results, as both models estimate it unbiasedly.

Before turning to the empirical application we briefly discuss the relation of our approach to Bayesian forecast combination. A Bayesian econometrician would attach prior probabilities $p(M_U) = 1 - p(M_R)$ and $p(M_R)$ to models $M_U$ and $M_R$. Note that the two models exhaust all the possibilities. He would then use Bayes' formula to derive the posterior probability of $M_R$, i.e., $p(M_R|x)$, where $x$ denotes the available data. Under a flat prior this posterior probability is simply given by the (properly normalized) marginal likelihood of $M_R$. The optimal point forecast for a quadratic loss is then given by
$$f_t^* = E[x_{t+1}|x] = f_t^U\big(1 - p(M_R|x)\big) + f_t^R\, p(M_R|x). \quad (29)$$
The expression above is conceptually different from that resulting from forecast combination in a classical framework. In (29) the optimal weights are probabilities, and as such they are constrained to lie between 0 and 1, while in the classical framework the weights depend on the variances of and correlation between the forecasts, and they can in principle be negative or greater than 1. As the weights are between 0 and 1, the combined forecast produced by the Bayesian econometrician is a convex combination of the forecasts of the original models, which makes sense as he believes that the two competing models exhaust all the possibilities. On the other hand, for the classical econometrician both models can be misspecified, but their forecasts can be useful if combined in some optimal way. For a comparison of the two approaches, with a discussion of the cases in which they can coincide, see Palm and Zellner (1992).

There is a simple example in which the optimal combination weights computed in our framework can have an interpretation similar to the Bayesian one. Consider the DGP in (25), but further assume that $\beta_1 = \beta_1^U$ with probability $1-\pi$ (in which case $M_U$ is true) and $\beta_1 = \beta_1^R$ with probability $\pi$ (in which case $M_R$ is true). In this case, the optimal weight is³ $\lambda^* = \pi$, so the estimate $\hat\lambda$ can be thought of as the frequentist equivalent of the posterior probability of $M_R$ computed by a Bayesian econometrician with a flat prior.

³ To see why, note that $\beta_1$ is a discrete random variable independent of all the other variables in (25). It follows that the optimal $\lambda$ is $\lambda^* = (E[\beta_1] - \beta_1^U)/(\beta_1^R - \beta_1^U)$. Substituting $E[\beta_1] = (1-\pi)\beta_1^U + \pi\beta_1^R$ in the expression for $\lambda^*$ provides the result.
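A self-contained sketch of one cell of the Monte Carlo experiment behind Table 2 follows; the design uses (25), (27) and (28) with $\beta_2 = 1$, while the function names and the seed are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_size(m, n, beta1, n_rep=5000, crit=1.96):
    """Rejection frequency of the true null lambda* = 1 - beta1 for the
    tests in (5): DGP (25) with beta2 = 1, models (27)-(28), h = 1
    (so the Newey-West bandwidth p_n = 2(h-1) = 0)."""
    lam_true, rej, T = 1.0 - beta1, 0, m + n
    for _ in range(n_rep):
        z = rng.standard_normal((T + 1, 2))
        x = beta1 * z[:, 0] + z[:, 1] + rng.standard_normal(T + 1)
        f_U, f_R = np.empty(n), np.empty(n)
        for i, t in enumerate(range(m, T)):
            zw, xw = z[t - m:t], x[t - m:t]          # rolling window
            s22 = zw[:, 1] @ zw[:, 1]
            b2 = (xw - zw[:, 0]) @ zw[:, 1] / s22    # M_U imposes beta1 = 1
            g2 = xw @ zw[:, 1] / s22                 # M_R imposes beta1 = 0
            f_U[i], f_R[i] = z[t, 0] + b2 * z[t, 1], g2 * z[t, 1]
        zd, yd = f_R - f_U, x[m:T] - f_U             # regression (13)
        lam = (yd @ zd) / (zd @ zd)                  # Eq. (12)
        eps = yd - lam * zd
        s2 = ((zd * eps) @ (zd * eps) / n) / (zd @ zd / n) ** 2  # Eq. (14)
        rej += abs(np.sqrt(n) * (lam - lam_true) / np.sqrt(s2)) > crit
    return rej / n_rep
```

For example, empirical_size(50, 100, beta1=1.0) approximates the kind of entry reported in Panel A of Table 2.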
5. Application: usefulness of the no-arbitrage restrictions for predicting the term structure of interest rates

In this section we apply our proposed framework to the problem of forecasting the yield curve using no-arbitrage restrictions. Our framework enables us to address several questions, such as: "are no-arbitrage restrictions useful for forecasting the term structure of interest rates?", "are the restrictions useful for bond portfolio allocation?", "does time variation in the term premium help in forecasting?", and "have the restrictions become more or less useful over time?". We start by describing how the no-arbitrage restrictions can be imposed on a VAR model for the yields, and then turn to the forecasting exercise and provide the results for a quadratic and a portfolio loss function.

5.1. A benchmark no-arbitrage affine term structure model (ATSM)

We consider the affine term structure model (ATSM) proposed by Ang and Piazzesi (2003), which is a discrete-time version of the affine class introduced by Duffie and Kan (1996), where bond prices are exponential affine functions of underlying state variables. The assumption of no-arbitrage (Harrison and Kreps, 1979) guarantees the existence of a risk-neutral measure $Q$ such that the price at time $t$ of an asset $V_t$ that does not pay any dividends at time $t+1$ satisfies $V_t = E_t^Q[\exp(-i_t)V_{t+1}]$, where the expectation is taken with respect to the measure $Q$ and $i_t$ is the short-term rate. The assumption of no-arbitrage is equivalent to the assumption of the existence of the Radon-Nikodym derivative $\xi_{t+1}$, which allows
one to convert the risk-neutral measure into the data-generating measure: $E_t^Q[\exp(-i_t)V_{t+1}] = E_t[(\xi_{t+1}/\xi_t)\exp(-i_t)V_{t+1}]$. Assume $\xi_{t+1}$ follows a log-normal process:
$$\xi_{t+1} = \xi_t \exp(-0.5\Lambda_t'\Lambda_t - \Lambda_t'\varepsilon_{t+1}). \quad (30)$$
$\Lambda_t$ is called the market price of risk and is an affine function of a vector of $k$ factors $F_t$:
$$\Lambda_t = \Lambda_0 + \Lambda_1 F_t, \quad (31)$$
where $\Lambda_0$ is a $k$-dimensional vector and $\Lambda_1$ a $k \times k$ matrix. The short-term rate is also assumed to be an affine function of $F_t$:
$$i_t = \delta_0 + \delta_1' F_t, \quad (32)$$
where $\delta_0$ is a scalar and $\delta_1$ a $k$-dimensional vector. We assume that the factors follow a zero-mean stationary vector process:
$$F_t = \Psi F_{t-1} + \Omega\varepsilon_t, \quad (33)$$
where $\varepsilon_t \sim$ i.i.d. $N(0, \Sigma_\varepsilon)$ with $\Sigma_\varepsilon = I$ with no loss of generality. The nominal pricing kernel is defined as
$$m_{t+1} = \exp(-i_t)\xi_{t+1}/\xi_t = \exp(-\delta_0 - \delta_1'F_t - 0.5\Lambda_t'\Lambda_t - \Lambda_t'\varepsilon_{t+1}), \quad (34)$$
where the second equality comes from (32) and (30). The nominal pricing kernel prices all assets in the economy, so, letting $p_t^{(\tau)}$ denote the time-$t$ price of a $\tau$-period zero coupon bond, we have
$$p_t^{(\tau+1)} = E_t\big(m_{t+1} p_{t+1}^{(\tau)}\big). \quad (35)$$
Using the above equations, it is possible to show that bond prices are an affine function of the state variables:
$$p_t^{(\tau)} = \exp(\bar A_\tau + \bar B_\tau' F_t), \quad (36)$$
where $\bar A_\tau$ and $\bar B_\tau$ are a scalar and a $k$-dimensional vector obeying
$$\bar A_{\tau+1} = \bar A_\tau + \bar B_\tau'(-\Omega\Lambda_0) + 0.5\,\bar B_\tau'\Omega\Omega'\bar B_\tau - \delta_0; \qquad \bar B_{\tau+1}' = \bar B_\tau'(\Psi - \Omega\Lambda_1) - \delta_1', \quad (37)$$
with $\bar A_1 = -\delta_0$ and $\bar B_1 = -\delta_1$. See Ang and Piazzesi (2003) for a formal derivation. The continuously compounded yield on a $\tau$-period zero coupon bond is
$$y_t^{(\tau)} = -\ln p_t^{(\tau)}/\tau = A_\tau + B_\tau' F_t, \quad (38)$$
with $A_\tau = -\bar A_\tau/\tau$ and $B_\tau = -\bar B_\tau/\tau$, so yields are also an affine function of the factors. Eqs. (33) and (38) define a state-space model:
$$F_t = \Psi F_{t-1} + \Omega\varepsilon_t; \qquad Y_t = A + BF_t + v_t, \quad (39)$$
where $Y_t = (y_t^{(\tau_1)}, y_t^{(\tau_2)}, \ldots, y_t^{(\tau_q)})'$ is a $q$-dimensional vector process collecting all the yields at maturities $\tau_1, \tau_2, \ldots, \tau_q$, $A = (A_{\tau_1}, A_{\tau_2}, \ldots, A_{\tau_q})'$ and $B = (B_{\tau_1}, B_{\tau_2}, \ldots, B_{\tau_q})'$ are functions of the structural coefficients of the model according to Eq. (37), and $v_t$ is a vector of i.i.d. Gaussian measurement errors with variance $\Sigma_v$.

Following common practice, we use three factors, which can be interpreted as the level, slope and curvature of the term structure. Given that scaling, shifting, or rotation of the factors provides observational equivalence, a normalization is required. Following Dai and Singleton (2000) we identify the factors by assuming a factor mean equal to zero and a lower triangular structure for the matrix $\Psi$, and we set $\delta_1 = (1, 1, 0)'$. Given this identification scheme, the coefficient $\delta_0$ equals the unconditional mean of the instantaneous rate, which can be approximated by the sample average of the 1-month yield. As for second-order coefficients, we assume $\Omega$ and $\Sigma_v$ to be diagonal, and we assume absence of correlation between the state and the measurement equation disturbances, i.e. $\Sigma_{\varepsilon v} = 0$. We collect all the parameters to be estimated in the vector
$$\theta = \{\Psi, \Omega, \Lambda_0, \Lambda_1, \Sigma_v\}. \quad (40)$$
We estimate $\theta$ with the EM algorithm, evaluating the likelihood at each iteration by means of the Kalman filter. In our application we also consider a specification of the model with a constant risk premium, which amounts to setting $\Lambda_1 = 0$ in Eq. (31).
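The recursions in (37)-(38) are straightforward to implement; a sketch under our notational assumptions (row-vector convention for $\bar B_\tau'$, illustrative names):

```python
import numpy as np

def affine_yield_loadings(delta0, delta1, Psi, Omega, Lam0, Lam1, max_tau):
    """Price coefficients from the recursions in Eq. (37) and the yield
    loadings A_tau = -Abar_tau/tau, B_tau = -Bbar_tau/tau of Eq. (38)."""
    k = delta1.size
    Abar = np.zeros(max_tau + 1)
    Bbar = np.zeros((max_tau + 1, k))
    Abar[1], Bbar[1] = -delta0, -delta1          # initial conditions
    for tau in range(1, max_tau):
        Abar[tau + 1] = (Abar[tau] - Bbar[tau] @ Omega @ Lam0
                         + 0.5 * Bbar[tau] @ Omega @ Omega.T @ Bbar[tau]
                         - delta0)
        Bbar[tau + 1] = Bbar[tau] @ (Psi - Omega @ Lam1) - delta1
    taus = np.arange(1, max_tau + 1)
    return -Abar[1:] / taus, -Bbar[1:] / taus[:, None]   # A_tau, B_tau
```

Stacking the loadings for the maturities $\tau_1, \ldots, \tau_q$ gives the matrices $A$ and $B$ of the state-space model (39).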
5.2. VARs with no-arbitrage ATSM restrictions (ATSM-VAR)

Now consider a VAR(p) representation of the $q$-dimensional vector collecting all the yields at hand:
$$Y_t = \Phi_0 + \Phi_1 Y_{t-1} + \cdots + \Phi_p Y_{t-p} + u_t, \quad (41)$$
where $Y_t = (y_t^{(\tau_1)}, y_t^{(\tau_2)}, \ldots, y_t^{(\tau_q)})'$ and $u_t$ is a vector of one-step-ahead forecast errors having a multivariate normal distribution with variance $\Sigma_u$. The VAR in (41) can be interpreted as an approximation of the moving average (MA) representation of $Y_t$. The approximation gets better as more dynamics are added to the system. Importantly, as is clear from Eq. (39), the ATSM features an MA representation. As the ATSM depends on a vector of coefficients $\theta$ (Eq. (40)) having far fewer elements than the coefficient matrices of the VAR, the ATSM imposes a set of nonlinear cross-equation restrictions on the VAR in (41).

To impose such restrictions on the VAR we follow Del Negro and Schorfheide (2004), i.e. we first compute the moments of $Y_t$ under the state-space model in Eq. (39), and then impose them on the VAR in (41). To do so, rewrite the VAR in data-matrix notation:
$$Y = X\Phi + U, \quad (42)$$
where $Y$ is a $T \times q$ data matrix with rows $Y_t'$, $X$ is a $T \times k$ (where $k = 1 + qp$) data matrix with rows $X_t = (1, Y_{t-1}', Y_{t-2}', \ldots, Y_{t-p}')$, $\Phi = (\Phi_0, \Phi_1, \ldots, \Phi_p)'$, and $U$ is a $T \times q$ data matrix with rows $u_t'$. Let $E_\theta$ denote the expectation under the ATSM and define the autocovariance matrices $\Gamma_{xx}^*(\theta) = E_\theta(X_t X_t')$ and $\Gamma_{xy}^*(\theta) = E_\theta(X_t Y_t')$, which can be computed using the state-space representation in (39) for a given $\theta$. Then, under the ATSM, the relation between the ATSM parameters and the VAR parameters is⁴ $\Phi^* = [\Gamma_{xx}^*(\theta)]^{-1}\Gamma_{xy}^*(\theta)$, where the star indicates that the ATSM restrictions hold. Defining $\hat\theta$ as the maximum likelihood estimator of $\theta$, the maximum likelihood estimator for the VAR coefficients under the ATSM is
$$\hat\Phi^* = [\Gamma_{xx}^*(\hat\theta)]^{-1}\Gamma_{xy}^*(\hat\theta). \quad (43)$$
In the following we will refer to this model as the ATSM-VAR. We also consider a specification in which we impose on the ATSM-VAR the additional restriction of a constant risk premium, which is obtained simply by setting $\Lambda_1 = 0$ in Eq. (31). We label this case CRP-VAR. In the empirical application the structural coefficients $\theta$ and the corresponding ATSM moments $\Gamma_{xx}^*(\hat\theta)$ and $\Gamma_{xy}^*(\hat\theta)$ are re-estimated in pseudo-real-time as forecasting proceeds forward within our rolling estimation-forecasting scheme. The maximum likelihood estimator of the unrestricted VAR (UVAR) is simply $\hat\Phi = (X'X)^{-1}X'Y$.

⁴ As stressed above, the approximation is not exact because the state-space representation of the ATSM generates moving average terms.

5.3. Forecasting exercise

For our exercise we use monthly data on zero coupon bond yields of maturities 1, 3, 12, 36, and 60 months, from January 1964 to December 2003. The data are taken from the Fama CRSP zero coupon and Treasury Bill files. We produce 1-step-ahead forecasts using the UVAR (we label such forecasts $f_t^U$), the ATSM-VAR ($f_t^{ATSM}$), the CRP-VAR ($f_t^{CRP}$), and a simple random walk forecast ($f_t^{RW}$). For each of the models at hand the sequences of forecasts are produced over the sample 1974:1-2003:12 using a rolling estimation window of 10 years. The procedure thus starts with estimating all the models using the estimation window 1964:1-1973:12, and producing the forecasts for the vector of yields in 1974:1. Then the estimation window is moved one period ahead, to 1964:2-1974:1, and the new estimates are used to produce the forecasts for the vector of yields in 1974:2.
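A sketch of one step of the rolling scheme, with the ATSM moments $\Gamma_{xx}^*(\hat\theta)$ and $\Gamma_{xy}^*(\hat\theta)$ taken as given (their computation requires the ML estimation of $\theta$ described above, which we do not reproduce here; names are illustrative):

```python
import numpy as np

def restricted_var(Gamma_xx, Gamma_xy):
    """ML estimator of the VAR coefficients under the ATSM, Eq. (43)."""
    return np.linalg.solve(Gamma_xx, Gamma_xy)   # [Gamma_xx]^{-1} Gamma_xy

def unrestricted_var(X, Y):
    """ML/OLS estimator of the unrestricted VAR in Eq. (42)."""
    return np.linalg.solve(X.T @ X, X.T @ Y)     # (X'X)^{-1} X'Y

def one_step_forecast(Phi, Y_hist, p):
    """1-step-ahead VAR(p) forecast from the last p observations,
    using the regressor ordering X_t = (1, Y_t', ..., Y_{t-p+1}')."""
    x_t = np.concatenate([[1.0], Y_hist[-p:][::-1].ravel()])
    return x_t @ Phi
```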
In light of the forecasting focus of the paper, the structural coefficients $\theta$ and the corresponding ATSM moments $\Gamma_{xx}^*(\hat\theta)$ and $\Gamma_{xy}^*(\hat\theta)$ are re-estimated using the data in each of the rolling samples. We maximize the likelihood of the VAR with no-arbitrage ATSM restrictions using the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm with Brent line search.⁵ The procedure is iterated until the last forecast (i.e., that for 2003:12) is obtained. For the VAR models, we use a specification with 3 lags, which provides well-behaved residuals.⁶

⁵ In the first estimation window we initialize our algorithm as follows. First we compute a maximum, then we draw 100 alternative starting points by randomizing around this maximum (drawing from a normal with variance derived from the Hessian at the maximum), maximize again, and check that none of the random initial points leads to a point with higher likelihood (i.e. a new maximum). If this is not the case, we take the new maximum and repeat the randomization until no points with higher likelihood are found. Then, for all the remaining estimation windows, we use the optimum obtained in the previous period $t-1$ as the initial condition for the maximization performed in period $t$. The in-sample fit of the estimated models is extremely high throughout the sample (with the $R^2$ being around 0.998).

⁶ The Bayesian Information Criterion selects 1 lag, but the LM test statistic reported in Johansen (1995) rejects the null of no residual autocorrelation. The specification with 3 lags is the most parsimonious one which eliminates this problem. Our results are robust to specifications with 1-4 lags.

5.4. Results for a quadratic loss

We first consider the results for a quadratic loss function. In the case at hand the loss function in (1) specializes to
$$L\big(y_{t+1}^{(\tau)}, \hat y_{t+1}^{(\tau)}\big) = \big(y_{t+1}^{(\tau)} - \hat y_{t+1}^{(\tau)}\big)^2, \quad (44)$$
where $y_{t+1}^{(\tau)}$ is the yield to maturity of a bond of maturity $\tau$ in period $t+1$ and $\hat y_{t+1}^{(\tau)}$ is the 1-step-ahead forecast of that variable. We provide results for bonds of five different maturities: 1, 3, 12, 36, and 60 months.

We start with the results based on the global measure of usefulness. Results are displayed in Table 3, which is composed of four panels, each corresponding to a different combination: ATSM-VAR and UVAR (Panel A), CRP-VAR and UVAR (Panel B), ATSM-VAR and RW (Panel C), CRP-VAR and RW (Panel D). For each yield, column (1) in Table 3 contains the Root Mean Squared Forecast Error (RMSFE) (i.e., the realized loss) of the first model considered in the combination. Columns (2)-(4) report the percentage gains over the RMSFE of the model in column (1) obtained by using, respectively, the second model in the combination, a combination with equal weights ($f_t^{\lambda=1/2}$), and a combination based on the estimated optimal weight ($f_t^*$). Column (5) reports the value of the estimated optimal weight $\hat\lambda$ (as defined in Eq. (12)). Finally, columns (6)-(8) report the statistics for the encompassing tests of Proposition 1.

We included in the comparison the case of a combination with equal weights for reference, because there is a long literature documenting the fact that in practice equal weights usually yield more accurate forecasts than optimal combination weights. Of course this can happen in a real-time forecasting exercise, when the optimal weights are estimated using only past information, while in our case the optimal weights outperform the equal weights by construction. Still, it may be interesting to see whether the optimal weights are statistically different from 0.5, because this might explain the success of combinations with equal weights, and to see how large the gains are from using optimal rather than equal weights.

From Panel A we see that the estimated optimal weights $\hat\lambda$ range between 0.514 and 1.054, and in all cases the encompassing test rejects the null that the optimal weight is zero, i.e., that the restricted model is useless, while it cannot reject (except for the 1-month yield) the null that the unrestricted forecast is useless. There is therefore evidence that imposing the no-arbitrage ATSM restrictions on a VAR might help in forecasting, although the restrictions are not uniformly useful across yields; in particular, they seem to work better for bonds with longer maturity. This is in line with Carriero (in press), who shows that the misspecification of the ATSM restrictions is more pronounced at the short end of the yield curve while it is milder for yields of longer maturities. The pattern of the forecast gains is obviously related to that of the weights, with the gains from using the combination and using the restricted model being similar for longer maturities. With regard to the comparison with the combination with equal weights, in general the optimal weights lead to larger gains, as the optimal weights are well above 0.5 for all cases except the 1-month yield.

Panel B provides results for the case in which we impose on the ATSM-VAR the additional restriction of a constant risk premium, i.e. the CRP-VAR. The estimated optimal weights for the 3-, 36-, and 60-month yields do not change dramatically, while there is a sharp decrease in the usefulness of the restrictions for the 12-month and especially for the 1-month yield. The results in terms of significance of the optimal weights are entirely in line with those obtained with the specification with variation in the risk premium, with slightly stronger evidence against the usefulness of the restrictions. Therefore, most of the forecasting gains do not seem to be strongly related to the presence of time-varying rather than constant risk premia in the model, except at the short end of the yield curve. The combined forecasts based on the optimal weights lead to larger gains with respect to the case with equal weights.

Panels C and D provide results for the combination between the random walk and the VAR with no-arbitrage ATSM restrictions (with and without variation in the risk premium). Results for this case show that most of the gains documented in Panels A and B seem to be related to the failure of the unrestricted VAR to provide a good forecast of the yield curve rather than to the merits of the no-arbitrage ATSM restrictions. In particular, for the combination of the ATSM-VAR with the RW, the estimated optimal weights $\hat\lambda$ range between -0.186 and 0.629, while for the combination of the CRP-VAR with the RW the weights $\hat\lambda$ range between 0.090 and 0.380. These figures are much lower than those obtained when considering the combination of the VAR with no-arbitrage ATSM restrictions with the unrestricted VAR (see Panels A and B). In particular, all the weights decrease quite dramatically, and the encompassing test does not reject the null that the no-arbitrage ATSM restrictions are useless, with the only exception of the weight on the 1-month yield. Indeed, the latter is the only case in which the random walk produces quite poor forecasts, worse than those produced by the unrestricted VAR. As a result the gains coming from the optimal combination, though positive, are relatively small for all maturities but the 1-month. The combination with equal weights does not work very well in this case, yielding small losses rather than gains in most cases. This happens because the equal weights are too high with respect to the optimal ones.
Table 3
Results with Quadratic Loss.

Panel A: ATSM-VAR and UVAR

Yields     (1) RMSFE (f_t^U)   (2) %gain (f_t^ATSM)   (3) %gain (f_t^{λ=1/2})   (4) %gain (f_t^*)   (5) λ̂     (6) t^{λ=1}   (7) t^{λ=0}   (8) t^{λ=1/2}
1-month    0.808               0.51                   4.84                      4.85                0.514     -4.301***     4.553***      0.126
3-month    0.709               6.42                   6.62                      7.41                0.736     -1.486        4.147***      1.331
12-month   0.682               11.93                  8.50                      11.96               1.054     -0.231        4.525***      2.378**
36-month   0.526               8.19                   6.78                      8.37                0.875     -0.619        4.331***      1.856*
60-month   0.460               6.70                   5.79                      6.97                0.840     -0.872        4.572***      1.850*

Panel B: CRP-VAR and UVAR

Yields     (1) RMSFE (f_t^U)   (2) %gain (f_t^CRP)    (3) %gain (f_t^{λ=1/2})   (4) %gain (f_t^*)   (5) λ̂     (6) t^{λ=1}   (7) t^{λ=0}   (8) t^{λ=1/2}
1-month    0.808               -13.91                 1.04                      2.70                0.280     -8.833***     3.444***      -2.695***
3-month    0.709               6.20                   6.83                      7.51                0.709     -1.792*       4.367***      1.288
12-month   0.682               9.46                   8.97                      10.35               0.777     -1.071        3.473***      1.336
36-month   0.526               9.22                   7.50                      9.37                0.889     -0.567        4.545***      1.989**
60-month   0.460               6.05                   6.18                      6.93                0.740     -1.753*       4.995***      1.621

Panel C: ATSM-VAR and RW

Yields     (1) RMSFE (f_t^RW)  (2) %gain (f_t^ATSM)   (3) %gain (f_t^{λ=1/2})   (4) %gain (f_t^*)   (5) λ̂     (6) t^{λ=1}   (7) t^{λ=0}   (8) t^{λ=1/2}
1-month    0.837               3.86                   5.73                      5.99                0.629     -2.695***     4.562***      0.933
3-month    0.607               -9.33                  -3.06                     0.25                -0.186    -5.528***     -0.870        -3.199***
12-month   0.596               -0.88                  -0.09                     0.05                0.188     -2.135**      0.496         -0.820
36-month   0.473               -2.13                  -0.35                     0.05                0.128     -3.649***     0.536         -1.557
60-month   0.424               -1.11                  -0.002                    0.14                0.249     -2.530***     0.839         -0.845

Panel D: CRP-VAR and RW

Yields     (1) RMSFE (f_t^RW)  (2) %gain (f_t^CRP)    (3) %gain (f_t^{λ=1/2})   (4) %gain (f_t^*)   (5) λ̂     (6) t^{λ=1}   (7) t^{λ=0}   (8) t^{λ=1/2}
1-month    0.837               -10.08                 4.00                      4.80                0.356     -7.607***     4.212***      -1.697**
3-month    0.607               -9.59                  -1.94                     0.10                0.090     -8.043***     0.800         -3.621***
12-month   0.596               -3.71                  -0.42                     0.81                0.295     -2.945***     1.234         -0.855
36-month   0.473               -0.99                  0.55                      0.61                0.380     -2.645***     1.626         -0.509
60-month   0.424               -1.83                  -0.27                     0.06                0.148     -3.948***     0.685         -1.632

Each of the four panels displays the results from the combination of two models. The considered combinations are: VAR with no-arbitrage ATSM restrictions (f_t^ATSM) and unrestricted VAR (f_t^U) (Panel A); VAR with no-arbitrage ATSM restrictions and constant risk premium (f_t^CRP) and unrestricted VAR (f_t^U) (Panel B); VAR with no-arbitrage ATSM restrictions (f_t^ATSM) and random walk (f_t^RW) (Panel C); VAR with no-arbitrage ATSM restrictions and constant risk premium (f_t^CRP) and random walk (f_t^RW) (Panel D). For each panel, results are reported for each of the five yields at hand. Column (1) contains the Root Mean Squared Forecast Error (RMSFE) (i.e. the realized loss) of the first model considered in the combination. Columns (2)-(4) report the percentage gains over the RMSFE of the model in column (1) obtained by using, respectively, the second model in the combination, a combination with equal weights (f_t^{λ=1/2}), and a combination based on the estimated optimal weight (f_t^*). Column (5) reports the value of the estimated optimal weight λ̂ (as defined in Eq. (12)). Finally, columns (6)-(8) report the statistics for the encompassing tests of Proposition 1. The statistic t^{λ=1} is used to test the null that the unrestricted forecast is useless; t^{λ=0} to test the null that the restricted forecast is useless; t^{λ=1/2} to test the null that the optimal weight is 0.5. *, **, and *** indicate rejection of the null at the 10%, 5%, and 1% level.

The difficulty in beating the random walk forecasts is not surprising, given the results usually obtained for the 1-step-ahead case in the literature. For example, in the paper by Diebold and Li (2006) the gains in RMSFE of all the considered models at 1-step-ahead with respect to the random walk are quite low. Among all the models they consider, the best forecasts for the 3-month and 12-month yields provide gains of at most 4%, the best forecasts of the 24-month yield provide a gain of 1%, while for yields of longer maturity the gains are either zero or negative. The maximum gain in RMSFE in using an ATSM against the random walk found by Almeida and Vicente (2008) for the 1-step-ahead case is of 5% for the 2-year yield, while they found gains in the range of 1%-3.5% for intermediate maturities, and negative gains for yields of longer maturity. The ATSM estimated by Moench (2008) outperforms the random walk forecast only for the 6-month yield, with a gain of 3%, while it is outperformed for all the remaining maturities. Better results are obtained by Favero et al. (2007) and Moench (2008) by extending the approach with the inclusion of a broad macroeconomic information set.

We now turn to the results for the local measure of usefulness, which are summarized in Fig. 1. The figure is composed of 20 panels displayed in 5 rows and 4 columns. The rows display results for different yields, while the columns represent different forecast combinations (respectively $f_t^{ATSM}$ and $f_t^U$ in the first column, $f_t^{CRP}$ and $f_t^U$ in the second column, $f_t^{ATSM}$ and $f_t^{RW}$ in the third column, and $f_t^{CRP}$ and $f_t^{RW}$ in the last column). Each panel contains a plot of the estimated smoothed weights, as defined in Eq. (15), together with the 95% bands described in Proposition 2. We set $\pi = 0.3$, which, given that our out-of-sample size is $n = 360$, implies that
the smoothed weights are estimated using a window of $d = 108$ observations. Looking at the first two columns in Fig. 1, it is clear that the forecasting gains from imposing the no-arbitrage ATSM restrictions onto the VAR are not constant over time. In particular, the optimal weight is not statistically different from one in the first part of the sample, but in more recent years the estimated optimal weight $\hat\lambda_t$ decreases, signaling that there are small or no gains from using the no-arbitrage ATSM restrictions from around 1994 to 2003. Moreover, by comparing the first and the second column of Fig. 1, it is apparent that the effect of including a time-varying rather than constant risk premium is not strong, and is mostly limited to the short end of the yield curve, which confirms the results found using the global measure (see Table 3). Finally, by looking at the last two columns of Fig. 1, it is clear that when the VAR with no-arbitrage restrictions is combined with the random walk, one cannot reject the null that the no-arbitrage ATSM restrictions are useless throughout the sample.
Table 4
Results with Portfolio Utility Loss.

Panel                   (1) Utility loss   (2) %gain (2nd model)   (3) %gain (f_t^{λ=1/2})   (4) %gain (f_t^*)   (5) λ̂     (6) t^{λ=1}    (7) t^{λ=0}   (8) t^{λ=1/2}
A: ATSM-VAR and UVAR    -1.35 (f_t^U)      7.96 (f_t^ATSM)         23.55                     23.75               0.551     -6.082***      7.460***      0.689
B: CRP-VAR and UVAR     -1.35 (f_t^U)      -66.12 (f_t^CRP)        9.09                      15.57               0.304     -13.527***     5.906***      -3.811***
C: ATSM-VAR and RW      -1.60 (f_t^RW)     -8.66 (f_t^ATSM)        18.61                     18.82               0.453     -9.267***      7.668***      -0.799
D: CRP-VAR and RW       -1.60 (f_t^RW)     -71.34 (f_t^CRP)        6.74                      14.24               0.289     -17.48***      7.131***      -5.176***

Each of the four panels displays the results from the combination of two models. The considered combinations are: VAR with no-arbitrage ATSM restrictions (f_t^ATSM) and unrestricted VAR (f_t^U) (Panel A); VAR with no-arbitrage ATSM restrictions and constant risk premium (f_t^CRP) and unrestricted VAR (f_t^U) (Panel B); VAR with no-arbitrage ATSM restrictions (f_t^ATSM) and random walk (f_t^RW) (Panel C); VAR with no-arbitrage ATSM restrictions and constant risk premium (f_t^CRP) and random walk (f_t^RW) (Panel D). Column (1) contains the portfolio utility loss of the first model considered in the combination. Columns (2)-(4) report the percentage gains over the utility loss of the model in column (1) obtained by using, respectively, the second model in the combination, a combination with equal weights (f_t^{λ=1/2}), and a combination based on the estimated optimal weight (f_t^*). Column (5) reports the value of the estimated optimal weight λ̂ (as defined in Eq. (21)). Finally, columns (6)-(8) report the statistics for the encompassing tests of Proposition 1. The statistic t^{λ=1} is used to test the null that the unrestricted forecast is useless; t^{λ=0} to test the null that the restricted forecast is useless; t^{λ=1/2} to test the null that the optimal weight is 0.5. *, **, and *** indicate rejection of the null at the 10%, 5%, and 1% level.
5.5. Results for a portfolio utility loss

The portfolio utility loss considers the asset allocation problem of an investor who buys a portfolio of bonds in period $t$ and sells it in period $t+1$, therefore earning/losing the change in the value of the portfolio between $t$ and $t+1$. The holding period return on a bond of maturity $\tau+1$ is
$$r_{t+1}^{(\tau+1)} = p_{t+1}^{(\tau)} - p_t^{(\tau+1)} = -\tau\, y_{t+1}^{(\tau)} + (\tau+1)\, y_t^{(\tau+1)}. \quad (45)$$
Eq. (45) shows that a forecast of the yield $\hat y_{t+1}^{(\tau)}$ provides a forecast of the holding period return $\hat r_{t+1}^{(\tau+1)}$ via a simple transformation⁷:
$$\hat r_{t+1}^{(\tau+1)} = -\tau\, \hat y_{t+1}^{(\tau)} + (\tau+1)\, y_t^{(\tau+1)}. \quad (46)$$
Collecting all the returns under consideration in the vector $r_{t+1} = (r_{t+1}^{(\tau_1+1)}, r_{t+1}^{(\tau_2+1)}, \ldots, r_{t+1}^{(\tau_q+1)})'$, and setting $x_{t+1} = r_{t+1}$, the loss function in (2) specializes to
$$L(r_{t+1}, f_{t,1}) = -w^*(f_{t,1})' r_{t+1} + \frac{\gamma}{2}\, w^*(f_{t,1})' \Sigma\, w^*(f_{t,1}), \quad (47)$$
where $f_{t,1}$ is a vector of forecasts of $r_{t+1}$ and can be derived from the forecasts of the yields by using (46).

⁷ It also follows that the forecast error made in forecasting the holding period return is proportional to that made in forecasting the yield of a given bond: $\hat r_{t+1}^{(\tau+1)} - r_{t+1}^{(\tau+1)} = -\tau(\hat y_{t+1}^{(\tau)} - y_{t+1}^{(\tau)})$. This also implies that using the holding period returns rather than the yields in the quadratic loss function would not change the optimal weight $\hat\lambda$ in that case.
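A sketch of the mapping from yield forecasts to the portfolio loss, combining (46), (18) and (47); markowitz_ab is the helper from the sketch in Section 3.2, and all names are illustrative:

```python
import numpy as np

def return_forecast(y_hat_next, y_now, tau):
    """Eq. (46): forecast of the holding period return on a (tau+1)-bond
    from the forecast of the tau-period yield and the current yield."""
    return -tau * y_hat_next + (tau + 1) * y_now

def realized_portfolio_loss(r_next, f, Sigma, gamma, markowitz_ab):
    """Eq. (47) evaluated at the optimal weights of Eq. (18).

    r_next : realized holding period returns r_{t+1} (length-q array)
    f      : forecasts of r_{t+1} derived from the yield forecasts
    """
    a, B = markowitz_ab(Sigma, gamma)   # coefficients of Eq. (18)
    w = a + B @ f                       # optimal weights
    return -w @ r_next + 0.5 * gamma * w @ Sigma @ w
```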
Results for the global measure of usefulness are in Table 4. Each panel in the table corresponds to a different forecast combination: Panels A and B contain results from the combination of $f_t^{ATSM}$ and $f_t^{CRP}$ with the unrestricted VAR forecasts $f_t^U$, while Panels C and D contain results from the combination of $f_t^{ATSM}$ and $f_t^{CRP}$ with the random walk forecasts $f_t^{RW}$. Column (1) contains the portfolio utility loss of the first model considered in the combination. Columns (2)-(4) report the percentage gains over the utility loss of the model in column (1) obtained by using, respectively, the second model in the combination, a combination with equal weights ($f_t^{\lambda=1/2}$), and a combination based on the estimated optimal weight ($f_t^*$). Column (5) reports the value of the estimated optimal weight $\hat\lambda$ (as defined in Eq. (21)). Finally, columns (6)-(8) report the statistics for the encompassing tests of Proposition 1.

Two main results emerge from Table 4. First, looking at the t-statistics for the encompassing tests of Proposition 1, it appears that the random walk restrictions no longer dominate the no-arbitrage ATSM restrictions when considering a portfolio utility loss; both forecasts are useful. As a result, the gains from combining these two models are high (depending on the assumption on the risk premium, they are respectively 18.82% and 14.24%). Second, it appears that the choice between a constant or time-varying risk premium can be important. The losses occurring when the forecasts are produced with the no-arbitrage restricted VAR are indeed much higher in the case of a constant risk premium than in the case of a time-varying risk premium. This can be interpreted as evidence that keeping the risk premium fixed worsens the forecasts because it increases model misspecification. When the risk premium is time-varying the optimal weights are not statistically different from the equal weights (see column 8), while when the risk premium is fixed the optimal weight decreases.

We now turn to the results for the local measure of usefulness, which are summarized in Fig. 2. The figure is composed of four panels, each corresponding to a different forecast combination. Each panel contains a plot of the estimated smoothed weight, as defined in Eq. (24), together with the 95% bands described in Proposition 2. Similarly to the case of a quadratic loss, we observe a clear pattern of decreasing usefulness of the no-arbitrage ATSM restrictions over time.
Fig. 1. Results for Quadratic Loss. The blue solid line is the estimated optimal weight $\hat\lambda_t$ in Eq. (15); the red dashed lines are the 95% bands in Eq. (10).
Another interesting result, which is in stark contrast with the quadratic loss case, is that assuming a constant risk premium clearly worsens the performance of the no-arbitrage forecasts. As is clear from the figure, the weights are uniformly lower in Panels B and D than in Panels A and C. Moreover, as is clear in Panels B and D, in the second part of the sample it is not possible to reject the null that the no-arbitrage VAR with constant risk premium is useless. This suggests that the incorporation of a time-varying risk premium in the no-arbitrage ATSM restrictions may not be important from a statistical point of view, but it is essential when evaluating the forecasts in terms of their usefulness for constructing bond portfolios. The results displayed in Fig. 2 also confirm that, differently from the quadratic loss case, the random walk restrictions no longer dominate the no-arbitrage ATSM restrictions when considering a portfolio utility loss. Even though the usefulness of the no-arbitrage ATSM restrictions relative to a random walk has decreased over time, both sets of restrictions appear to be useful, and the optimal forecast combination exploits information from both.

5.6. Additional results

In this subsection we address two additional issues. First, we provide results for a 12-step-ahead forecast horizon. Second, we provide results for the case in which the optimal $\lambda$ is chosen ex ante within a pseudo-real-time forecasting exercise.
Table 5 displays results based on a 12-step-ahead forecast horizon for both the quadratic loss and the portfolio utility loss. The multi-step forecasts are obtained by iteration. Results are in line with those obtained for the 1-step-ahead case. For the quadratic loss, imposing the no-arbitrage ATSM restrictions improves the forecasting performance of a VAR in the yields, but the restrictions are less useful when used in combination with a random walk forecast. For the portfolio utility loss the no-arbitrage ATSM restrictions are always useful, providing very large forecasting gains. For both loss functions the forecasts based on optimal weights largely outperform the forecasts based on simple equal combination weights. The only major difference with respect to the 1-step-ahead case is that the assumption of a constant term premium does not seem to play an important role in the case of the portfolio utility loss.

Table 6 displays results for a forecasting exercise in which, rather than evaluating the usefulness of the restrictions, we try to understand whether forecast combination using estimated optimal weights is a viable and efficient way to produce forecasts in real time. Note that to implement our scheme in real time two estimation windows are needed. The first window is needed to estimate the model's coefficients. Then a second window is needed to estimate $\lambda$ using past (from the point of view of the simulation) out-of-sample forecasts. Once $\hat\lambda$ is obtained, forecasts for the future can be computed. As a result, the sample on which the figures displayed in Table 6 are based is different from that used
Fig. 2. Results for Portfolio Utility Loss. The blue solid line is the estimated optimal weight $\hat\lambda_t$ in Eq. (24); the red dashed lines are the 95% bands in Eq. (10).
for Tables 3–5. For analogy with our analysis for the estimation of the smoothed λ we select the estimation window to be equal to d = 108 observations. This implies that the first forecast used for evaluation is that of the period 1983:2, and the estimated optimal weights used at time t in this exercise are the lagged values (i.e., the values at time t − 1) of the optimal weights depicted in Figs. 1 and 2. For both loss functions the ATSM–VAR works well when combined with an unrestricted VAR, while when combined with a random walk the no-arbitrage ATSM restrictions are useful only under the portfolio utility loss, and not under the quadratic loss (except at the short end of the curve). The results obtained with equal weights are similar to those obtained with the optimal weights for the quadratic loss, while they are worse for the utility loss. The latter result is not surprising, as Fig. 2 shows that the optimal weights for the utility loss are statistically different from 0.5 in most of the sample. 6. Conclusions In this paper we have developed a general framework for analyzing the usefulness of imposing parameter restrictions on a forecasting model. We have proposed a measure of usefulness based on the weight that a set of restrictions receives within an optimal forecast combination. Importantly, the proposed measure can vary over time and depends on the forecaster’s loss function. We have shown how to estimate the measure of usefulness out-ofsample and perform inference about it, both in a stable framework and in a framework with possible instability. We have applied our methodology to the problem of analyzing the usefulness of no-arbitrage ATSM restrictions for forecasting the term structure of interest rates. Our results reveal that: (1) the restrictions have become less useful over time; (2) using a statistical measure of accuracy, the restrictions are a useful way to reduce parameter estimation uncertainty, but are dominated by restrictions that do the same without using any theory; (3) using an economic measure
of accuracy, the no-arbitrage ATSM restrictions are no longer dominated by atheoretical restrictions, but for this to be true it is important that they incorporate a time-varying risk premium.

We stress that these conclusions do not necessarily imply that bond markets have become more or less efficient over time. First, our method can only reveal time variation in the usefulness of the restrictions; it does not allow us to determine its source. Second, the results are conditional on the particular specification of the no-arbitrage restrictions, and for the ATSM in particular this involves a number of additional assumptions that are not necessarily grounded in economic theory.

Appendix. Proofs

Proof of Proposition 1. We first show that, under Assumption A, λ̂ is asymptotically normal, so that σ⁻¹√n(λ̂ − λ*) →d N(0, 1), where σ² = H⁻¹ΩH⁻¹. The results in the proposition then follow from showing consistency of σ̂² for σ². Asymptotic normality of λ̂ is obtained by verifying the assumptions of Theorem 3.1 of Newey and McFadden (1994), since λ̂ can be viewed as an extremum estimator obtained by maximizing the objective function −Qn(λ) over R. First, we show that assumptions (i)–(iii) of Theorem 2.7 of Newey and McFadden (1994) are satisfied, so that λ̂ →p λ*. Assumption (i) of Theorem 2.7 is equivalent to A(1). Assumption (ii) of Theorem 2.7 requires concavity of −Qn(λ), which is implied by A(2). Assumption (iii) requires that Qn(λ) − E[Qn(λ)] →p 0 for all λ. Since any measurable function of the finite history of yt is mixing of the same size as yt, f^U_{t,h} and f^R_{t,h} are mixing of the same size as yt, because they are functions of a window of in-sample data m that is finite by A(9). This implies that L(x_{t+h}, f^R_{t,h} + (1 − λ)(f^U_{t,h} − f^R_{t,h})) is also mixing with φ of size −r/(2r − 1) or α of size −r/(r − 1), which, together with A(4), implies that the conditions of Corollary 3.48 of White (2001) are satisfied and thus Qn(λ) − E[Qn(λ)] →p 0 for all λ. We next verify conditions (i)–(v)
Table 5
Quadratic and Portfolio Utility Loss: 12-step-ahead results.

Panel A: ATSM–VAR and UVAR
             (1) Loss   (2) %gain    (3) %gain       (4) %gain   (5) λ̂     (6) t_λ=1    (7) t_λ=0    (8) t_λ=1/2
             (f_t^U)    (f_t^ATSM)   (f_t^λ=1/2)     (f_t^*)
1-month      1.632      41.37        28.75           41.37       0.999     −0.007       18.728***    9.361***
3-month      1.711      50.29        34.14           50.29       0.996     −0.145       35.379***    17.617***
12-month     1.570      51.51        33.50           51.67       1.046     0.952        21.834***    11.393***
36-month     1.242      46.39        32.88           46.50       0.961     −0.474       11.677***    5.601***
60-month     1.119      41.02        31.03           41.55       0.911     −0.923       9.471***     4.274***
Portfolio    4.258      61.88        60.57           68.65       0.761     −4.137***    13.174***    4.518***

Panel B: CRP–VAR and UVAR
             (1) Loss   (2) %gain    (3) %gain       (4) %gain   (5) λ̂     (6) t_λ=1    (7) t_λ=0    (8) t_λ=1/2
             (f_t^U)    (f_t^CRP)    (f_t^λ=1/2)     (f_t^*)
1-month      1.632      20.77        21.63           24.45       0.733     −2.715***    7.455***     2.370***
3-month      1.711      37.29        30.29           38.62       0.860     −2.668***    16.406***    6.869***
12-month     1.571      42.60        31.98           43.11       0.915     −1.551       16.653***    7.542***
36-month     1.242      51.37        36.03           51.53       0.956     −0.966       21.234***    10.134***
60-month     1.119      56.28        39.21           56.57       0.947     −1.447       26.029***    12.291***
Portfolio    4.258      87.90        83.31           95.58       0.779     −3.710***    13.089***    4.670***

Panel C: ATSM–VAR and RW
             (1) Loss   (2) %gain    (3) %gain       (4) %gain   (5) λ̂     (6) t_λ=1    (7) t_λ=0    (8) t_λ=1/2
             (f_t^RW)   (f_t^ATSM)   (f_t^λ=1/2)     (f_t^*)
1-month      0.837      −14.38       1.27            2.96        0.285     −6.480***    2.585***     −1.947***
3-month      0.607      −40.09       −13.24          0.44        −0.104    −14.538***   −1.376       −7.957***
12-month     0.596      −27.87       −8.92           0.29        −0.105    −12.260***   −1.168       −6.714***
36-month     0.473      −40.80       −13.03          0.24        −0.075    −15.014***   −1.047       −8.031***
60-month     0.424      −55.50       −17.60          0.13        −0.044    −23.389***   −0.983*      −12.186***
Portfolio    −1.601     −201.36      −26.67          7.56        0.160     −37.034***   7.048***     −14.993***

Panel D: CRP–VAR and RW
             (1) Loss   (2) %gain    (3) %gain       (4) %gain   (5) λ̂     (6) t_λ=1    (7) t_λ=0    (8) t_λ=1/2
             (f_t^RW)   (f_t^CRP)    (f_t^λ=1/2)     (f_t^*)
1-month      0.837      −54.55       −8.57           2.79        0.163     −15.086***   2.944***     −6.071***
3-month      0.607      −76.70       −21.15          0.33        0.053     −34.100***   1.904**      −16.098***
12-month     0.596      −51.39       −12.97          0.30        0.063     −17.912***   1.275        −8.811***
36-month     0.473      −27.73       −6.93           0.06        0.042     −13.531***   0.590        −6.471***
60-month     0.424      −15.24       −4.14           0.003       −0.015    −10.685***   −0.157       −5.421***
Portfolio    −1.601     −132.15      −5.12           12.78       0.229     −18.337***   5.446***     −6.445***

The table summarizes the results for the 12-step-ahead forecast horizon, using the quadratic loss and the portfolio utility loss. Each panel considers a different forecast combination. For each panel, the first five rows display the results for the quadratic loss; the first five entries of column (1) in each panel are therefore the RMSFE in forecasting the corresponding yield 12 steps ahead. The last row (Portfolio) displays results for the portfolio utility loss; the last entry in column (1) is therefore the utility loss associated with the portfolio. Columns (2)–(8) have the same structure as in Tables 2 and 3. In particular, columns (2)–(4) report the percentage gains over the RMSFE or utility loss of the model in column (1) obtained by using, respectively, the second model in the combination, a combination with equal weights (f_t^λ=1/2), and a combination based on the estimated optimal weight (f_t^*). Column (5) reports the value of the estimated optimal weight λ̂ (as defined in Eq. (12) for the quadratic loss and in Eq. (21) for the utility loss). Finally, columns (6)–(8) report the statistics for the encompassing tests of Proposition 1. The statistic t_λ=1 tests the null that the unrestricted forecast is useless, t_λ=0 tests the null that the restricted forecast is useless, and t_λ=1/2 tests the null that the optimal weight is 0.5. The bandwidth used is 2(h − 1). *, **, and *** indicate rejection of the null at the 10%, 5%, and 1% level, respectively.
of Theorem 3.1 of Newey and McFadden (1994). Conditions (i) and (ii) are implied by A(1) and (2). Condition (iii) requires that Ω^{−1/2}√n ∇λ Qn(λ*) →d N(0, 1). Ω is finite by A(5) and it is positive by A(6). By arguments similar to those used above, one can show that A(3) implies that Zt ≡ Ω^{−1/2}∇λ L(x_{t+h}, f^R_{t,h} + (1 − λ*)(f^U_{t,h} − f^R_{t,h})) is mixing with φ of size −r/(2r − 2) or α of size −r/(r − 2). This, together with A(5), implies that the sequence {Zt} satisfies the conditions of Corollary 3.1 of Wooldridge and White (1988), and thus condition (iii) is satisfied. Conditions (iv) and (v) of Newey and McFadden (1994) coincide with A(7) and (8). Finally, A(3), (5) and (9) imply that the conditions of Theorem 6.20 of White (2001) are satisfied and thus Ω̂ is a consistent estimator of Ω. This, in turn, implies that the conditions of Theorem 4.1 of Newey and McFadden (1994) are satisfied and thus σ̂² →p σ², which completes the proof. □
Proof of Proposition 2. Let L_{t+h}(λ*) = L(x_{t+h}, f^R_{t,h} + (1 − λ*)(f^U_{t,h} − f^R_{t,h})) and, for ease of notation, henceforth drop the subscript d from λ̂_{t,d}. For t = m + d − 1, …, T − h we have

σ̂⁻¹√d (λ̂_t − λ*) = Ĥ⁻¹(λ̄) σ̂⁻¹ d^{−1/2} Σ_{j=t−d+1}^{t} ∇λ L_{j+h}(λ*)
  = (d/n)^{−1/2} Ĥ⁻¹(λ̄) σ̂⁻¹ Ω^{1/2} [ Ω^{−1/2} n^{−1/2} Σ_{j=m}^{t} ∇λ L_{j+h}(λ*) − Ω^{−1/2} n^{−1/2} Σ_{j=m}^{t−d} ∇λ L_{j+h}(λ*) ],

where λ̄ lies between λ̂_t and λ*. By B(1), we have

Ω^{−1/2} n^{−1/2} [ Σ_{j=m}^{t} ∇λ L_{j+h}(λ*) − Σ_{j=m}^{t−d} ∇λ L_{j+h}(λ*) ] ⇒ [B(τ) − B(τ − π)],

where B is a standard univariate Brownian motion. By B(2), (d/n)^{−1/2} → π^{−1/2}. By B(3), Ĥ⁻¹(λ̄) σ̂⁻¹ Ω^{1/2} →p H⁻¹σ⁻¹Ω^{1/2} = 1, and thus σ̂⁻¹√d (λ̂_t − λ*) ⇒ [B(τ) − B(τ − π)]/√π under the null hypothesis. Let k_{α,π} solve Pr{sup_τ |[B(τ) − B(τ − π)]/√π| > k_{α,π}} = α. Then, under either H₀^U or H₀^R, a fraction (1 − α) of the time λ* is contained within (λ̂_t − k_{α,π} σ̂/√d, λ̂_t + k_{α,π} σ̂/√d) for all t = m + d − 1, …, T − h. The values of k_{α,π} in Table 1 are obtained by Monte Carlo simulation. □
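The following minimal sketch (with assumed discretization and simulation sizes; not the authors' code) illustrates how such critical values can be approximated by simulating the limit process on a grid.

```python
# Monte Carlo approximation of k_{alpha,pi}, the (1-alpha) critical value of
# sup_tau |B(tau) - B(tau - pi)| / sqrt(pi) for a standard Brownian motion B.
import numpy as np

def critical_value(pi, alpha=0.05, n_sims=20000, n_grid=2000, seed=0):
    rng = np.random.default_rng(seed)
    dt = 1.0 / n_grid
    lag = int(round(pi * n_grid))              # grid offset corresponding to pi
    sups = np.empty(n_sims)
    for s in range(n_sims):
        # simulate B on [0, 1] by cumulating independent Gaussian increments
        b = np.concatenate(([0.0], np.cumsum(rng.normal(0.0, np.sqrt(dt), n_grid))))
        incr = (b[lag:] - b[:-lag]) / np.sqrt(pi)  # [B(tau) - B(tau - pi)]/sqrt(pi)
        sups[s] = np.abs(incr).max()
    return np.quantile(sups, 1.0 - alpha)      # k such that P(sup > k) = alpha
```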
Table 6
Pseudo-real-time simulation.

Panel A: ATSM–VAR and UVAR
             (1) Loss    (2) %gain     (3) %gain       (4) %gain
             (f_t^U)     (f_t^ATSM)    (f_t^λ=1/2)     (f_t^*)
1-month      0.561       1.66          6.05            5.55
3-month      0.341       11.63         12.87           13.47
12-month     0.376       7.31          7.04            7.73
36-month     0.380       3.78          4.77            4.42
60-month     0.371       2.75          3.69            3.20
Portfolio    −1.218      −37.19        −4.17           1.00

Panel B: CRP–VAR and UVAR
             (1) Loss    (2) %gain     (3) %gain       (4) %gain
             (f_t^U)     (f_t^CRP)     (f_t^λ=1/2)     (f_t^*)
1-month      0.561       −30.71        −3.58           1.02
3-month      0.341       −3.66         7.20            8.56
12-month     0.376       −1.54         3.91            2.21
36-month     0.380       4.06          4.67            4.20
60-month     0.371       3.57          5.32            5.39
Portfolio    −1.218      −167.69       −34.70          −0.60

Panel C: ATSM–VAR and RW
             (1) Loss    (2) %gain     (3) %gain       (4) %gain
             (f_t^RW)    (f_t^ATSM)    (f_t^λ=1/2)     (f_t^*)
1-month      0.594       7.14          11.38           11.03
3-month      0.294       −2.31         0.46            1.30
12-month     0.341       −1.99         −0.50           −0.66
36-month     0.352       −3.94         −0.88           −0.25
60-month     0.349       −3.62         −1.04           −1.11
Portfolio    −1.231      −38.07        5.18            8.20

Panel D: CRP–VAR and RW
             (1) Loss    (2) %gain     (3) %gain       (4) %gain
             (f_t^RW)    (f_t^CRP)     (f_t^λ=1/2)     (f_t^*)
1-month      0.594       −23.42        6.26            8.13
3-month      0.294       −20.02        0.16            −0.22
12-month     0.341       −11.73        −2.69           −1.97
36-month     0.352       −3.64         −0.55           −1.48
60-month     0.349       −2.75         −0.06           −1.50
Portfolio    −1.231      −166.99       −22.17          5.02

The table summarizes the results for the pseudo-real-time implementation of our forecasting exercise, at the 1-step-ahead horizon. The sample used for forecast evaluation is 1983:2–2003:12, and it differs from the evaluation sample used in Tables 3–5. This is because at each point in time a training sample of past out-of-sample forecasts is needed in order to estimate the optimal λ. By analogy with the computation of the smoothed λ we use a training sample of 108 observations. Each panel considers a different forecast combination. For each panel, the first five rows display the results for the quadratic loss; the first five entries of column (1) in each panel are therefore the RMSFE in forecasting the corresponding yield 1 step ahead. The last row (Portfolio) displays results for the portfolio utility loss; the last entry in column (1) is therefore the utility loss associated with the portfolio. Columns (2)–(4) have the same structure as in Tables 3 and 4: they report the percentage gains over the RMSFE or utility loss of the model in column (1) obtained by using, respectively, the second model in the combination, a combination with equal weights (f_t^λ=1/2), and a combination based on the estimated optimal weight (f_t^*).
References
Almeida, C., Vicente, J., 2008. The role of no-arbitrage on forecasting: lessons from a parametric term structure model. Journal of Banking & Finance 32 (12), 2695–2705.
Ang, A., Piazzesi, M., 2003. A no-arbitrage vector autoregression of term structure dynamics with macroeconomic and latent variables. Journal of Monetary Economics 50 (4), 745–787.
Carriero, A., 2011. Forecasting the yield curve using priors from no-arbitrage affine term structure models. International Economic Review (in press).
Clark, T.E., McCracken, M.W., 2009. Combining forecasts from nested models. Oxford Bulletin of Economics and Statistics 71 (3), 303–329.
Dai, Q., Singleton, K., 2000. Specification analysis of affine term structure models. Journal of Finance 55, 1943–1978.
De Jong, F., 2000. Time series and cross-section information in affine term structure models. Journal of Business and Economic Statistics 18, 300–314.
Del Negro, M., Schorfheide, F., 2004. Priors from general equilibrium models for VARs. International Economic Review 45, 643–673.
Diebold, F.X., Li, C., 2006. Forecasting the term structure of government bond yields. Journal of Econometrics 130, 337–364.
Duffee, G., 2002. Term premia and interest rate forecasts in affine models. Journal of Finance 57, 405–443.
Duffie, D., Kan, R., 1996. A yield-factor model of interest rates. Mathematical Finance 6, 379–406.
Elliott, G., Timmermann, A., 2004. Optimal forecast combinations under general loss functions and forecast error distributions. Journal of Econometrics 122, 47–79.
Favero, C.A., Niu, L., Sala, L., 2007. Term structure forecasting: no-arbitrage restrictions vs large information set. CEPR Discussion Papers 6206.
Giacomini, R., Rossi, B., 2010. Forecast comparisons in unstable environments. Journal of Applied Econometrics 25, 595–620.
Giacomini, R., White, H., 2006. Tests of conditional predictive ability. Econometrica 74, 1545–1578.
Harrison, J.M., Kreps, D.M., 1979. Martingales and arbitrage in multiperiod securities markets. Journal of Economic Theory 2, 381–408.
Harvey, D.I., Leybourne, S.J., Newbold, P., 1998. Tests for forecast encompassing. Journal of Business & Economic Statistics 16, 254–259.
Johansen, S., 1995. Likelihood-Based Inference in Cointegrated Vector Auto-Regressive Models. Oxford University Press, Oxford.
Kim, M.J., Nelson, C.R., 1993. Predictable stock returns: the role of small sample bias. The Journal of Finance 48 (2), 641–661.
Markowitz, H., 1952. Portfolio selection. The Journal of Finance 7 (1), 77–91.
Moench, E., 2008. Forecasting the yield curve in a data-rich environment: a no-arbitrage factor-augmented VAR approach. Journal of Econometrics 146 (1), 26–43.
Newey, W.K., McFadden, D., 1994. Large sample estimation and hypothesis testing. In: Engle, R.F., McFadden, D.L. (Eds.), Handbook of Econometrics, vol. 4. North-Holland, Amsterdam, pp. 2112–2245.
Newey, W., West, K., 1987. A simple, positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix. Econometrica 55, 703–708.
Palm, F.C., Zellner, A., 1992. To combine or not to combine? Issues of combining forecasts. Journal of Forecasting 11, 687–701.
West, K.D., Edison, H.J., Cho, D., 1993. A utility-based comparison of some models of exchange rate volatility. Journal of International Economics 35, 23–45.
White, H., 2001. Asymptotic Theory for Econometricians, revised ed. Academic Press, New York.
Wooldridge, J.M., White, H., 1988. Some invariance principles and central limit theorems for dependent heterogeneous processes. Econometric Theory 4, 210–230.
Do interest rate options contain information about excess returns?

Caio Almeida (Graduate School of Economics, Fundação Getulio Vargas, Brazil), Jeremy J. Graveline (University of Minnesota, Carlson School of Management, United States), Scott Joslin (MIT Sloan School of Management, United States)
Article history: Available online 26 February 2011.
JEL classification: C53; C58; E43; E47; G12.
Keywords: Interest rates; Options; Risk premia; Excess returns; Forecasting.
Abstract: There is strong empirical evidence that long-term interest rates contain a time-varying risk premium. Options may contain valuable information about this risk premium because their prices are sensitive to the underlying interest rates. We use the joint time series of swap rates and interest rate option prices to estimate dynamic term structure models. The risk premiums that we estimate using option prices are better able to predict excess returns for long-term swaps over short-term swaps. Moreover, in contrast to the previous literature, the most successful models for predicting excess returns have risk factors with stochastic volatility. We also show that the stochastic volatility models we estimate using option prices match the failure of the expectations hypothesis.
1. Introduction

A bond or swap that is sold before it matures has an uncertain return. There is strong empirical evidence that this return is predictable and time-varying, which suggests that long-term interest rates contain a time-varying risk premium.1 In this paper, we ask whether interest rate option prices can be used to obtain better estimates of this risk premium. Options may contain valuable information because their prices are sensitive to the volatility and risk premiums in the underlying interest rates. We use an arbitrage-free term structure model for our empirical analysis because it describes the joint dynamics of interest rate option prices and the underlying interest rates. Related papers that use term structure models to investigate the risk premium in long-term interest rates use only bonds or swap rates for estimation.2 Instead, we use the joint time series of both swap rates and interest rate cap prices with different maturities. Our main
1 See Fama and Bliss (1987), Campbell and Shiller (1991), and Cochrane and Piazzesi (2005).
2 See Duffee (2002), Dai and Singleton (2002), Duarte (2004), and Cheridito et al. (2007).
finding is that the risk premiums we estimate using option prices are better able to predict excess returns for long-term swaps. To measure predictability, we compute an R²-statistic based on the squared differences between expected returns and 3-month realized returns on zero-coupon swaps with different maturities. However, rather than computing expected returns using linear projection, we instead use our model-implied expected returns as a more stringent metric. Across different maturities, this measure of predictability typically doubles when we use options to estimate a term structure model with one stochastic volatility factor. The improvement is almost threefold for a term structure model with two stochastic volatility factors.

Risk premia in term structure models reflect the market prices of risk for the factors driving interest rates (or equivalently, their risk-neutral dynamics). To better understand the improvement in predictability, we examine the estimated risk premia and find that the risk-neutral dynamics of the factors are quite different when the models are estimated with and without options. In the models that are estimated without options, long-run interest rates under the risk-neutral measure are high (10.4%–13%) and have a very slow rate of mean reversion, with a half-life of 22.4–31.4 years. In contrast, the models estimated with options have a more modest risk-neutral long-run mean of 8% with a rate of mean reversion on the order of business cycle frequency (the half-life is about 8 years). Both of these combinations have similar implications for swap rates, which makes them difficult to distinguish using only
swaps. However, the two mechanisms are clearly differentiated by their implications for interest rate option prices. Therefore, option prices allow for a more precise identification of the market prices of risk, and the models that are estimated with options are better able to explain variation in excess returns.

Option prices also help to resolve the tension, identified in previous papers, between matching the first and second moments of bond returns. Both Duffee (2002) and Cheridito et al. (2007) find that excess bond returns are best captured by constant volatility models that cannot match the time-series variation in interest rate volatility. We study the same 3-factor affine term structure models that were developed by Cheridito et al. (2007) (a generalization of the models developed and studied by Duffee (2002)), but we use the joint time series of swaps and interest rate cap prices to estimate the models.3 When we use option prices to estimate the model parameters, the models with one or two stochastic volatility factors better capture the variation in interest rate volatility and are also best at predicting excess returns.

The models with stochastic volatility that we estimate with options also satisfy two additional challenges posed by Dai and Singleton (2003): they successfully price interest rate caps and they capture the failure of the expectations hypothesis.4 Dai and Singleton (2002) find that only models with constant volatility successfully match the failure of the expectations hypothesis. We show that term structure models with stochastic volatility also match the failure of the expectations hypothesis when we include options in estimation. We further analyze these results and show that option prices help to identify the portions of the risk premium that are related to the slope of the yield curve.

Our empirical analysis deviates from recent research that has focused on unspanned stochastic volatility, or USV, in fixed income markets.5 In term structure models that exhibit USV, interest rate options have both an econometric and an economic role because they cannot be replicated using the underlying bonds or swaps. We do not consider unspanned stochastic volatility models in this paper because our objective is to focus exclusively on the econometric benefits of using options to estimate the risk premium in long-term interest rates.

Other papers have used options to estimate term structure models, but do not examine their impact on a model's ability to capture the dynamics of interest rates and predict excess returns. Umantsev (2002) estimates affine models jointly using both swaps and swaptions and analyzes the volatility structure of these markets. Longstaff et al. (2001) and Han (2007) explore the correlation structure in yields that is required to simultaneously price both caps and swaptions. Bikbov and Chernov (2011) use both Eurodollar futures and short-dated option prices to estimate affine term structure models and discriminate between various volatility specifications.

Our paper is also related to empirical papers that examine the joint time series of option prices and returns in equity or foreign
3 Swaps are based on the same LIBOR interest rates as caps, which is why we choose to model swap rates rather than government bond yields. Dai and Singleton (2000) also note that the institutional features that affect government bond yields are not accounted for in standard term structure models.
4 The failure of the expectations hypothesis refers to the empirical property that excess returns to long-term bonds and swaps are negatively related to the slope of the yield curve (and increasingly so for longer maturity yields). If the expectations hypothesis holds, which would be the case if investors are risk-neutral with respect to interest rate risk, then forward rates are the best linear predictor of future interest rates. See Fama (1984a), Fama (1984b), Fama and Bliss (1987) and Campbell and Shiller (1991).
5 See Collin-Dufresne and Goldstein (2002a), Collin-Dufresne et al. (2009), Andersen and Benzoni (2008), Li and Zhao (2006), Thompson (2008), Bikbov and Chernov (2011), Joslin (2007), and Kim (2007).
exchange markets. Chernov and Ghysels (2000), Pan (2002), Jones (2003), and Eraker (2004) analyze S&P 100 or 500 index returns jointly with options on the index. Bakshi et al. (2008) and Graveline (2008) study foreign exchange options and the underlying currency returns. These papers use option prices to help estimate risk premia in the underlying equity or currency returns, and our paper has a similar objective applied to fixed income markets.6

The remainder of the paper is organized as follows. Section 2 describes the data and estimation procedure we use. Section 3 describes the fit of the models to swap rates, cap prices, and the conditional volatility of interest rates. Section 4 compares how well the models we estimate predict excess returns and Section 5 examines the dependence of expected excess returns on the level, slope, and curvature of the yield curve. Section 6 concludes.

2. Model and estimation

Our objective in this paper is to examine how well different arbitrage-free affine term structure models predict excess returns for long-term swaps. We consider dynamic term structure models in which the short interest rate, r, is driven by a three-dimensional latent factor Xt that follows an affine7 diffusion process. Following the notation in Duffie (2001), the models can be expressed as

r_t = ρ0 + ρ1 · X_t,  (1a)
dX_t = (K0^P + K1^P X_t) dt + σ(X_t) dW_t^P,  (1b)
dX_t = (K0^Q + K1^Q X_t) dt + σ(X_t) dW_t^Q,  (1c)

where W_t^P (W_t^Q) is a three-dimensional Brownian motion under the historical (risk-neutral) measure and σ(X_t)σ(X_t)⊤ is a 3 × 3 matrix H0 + Σ_k H1^k X_t^k whose (i, j) entry is given by H0,ij + Σ_{k=1}^{3} H1,ij^k X_t^k. The short interest rate and risk-neutral dynamics allow us to price a variety of fixed income instruments. For example, a security that pays g(X_T) at time T will have price at time t given by

P_t^g = E_t^Q [ e^{−∫_t^T r_s ds} g(X_T) ].  (2)
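As a concrete illustration of the dynamics in Eq. (1b), the sketch below simulates an A1(3)-style specification by Euler discretization. The function name and the treatment of the CIR factor at the boundary are our own illustrative assumptions, not part of the paper.

```python
# Minimal Euler discretization of dX = (K0 + K1 X) dt + sigma(X) dW (Eq. (1b)),
# where sigma(X) sigma(X)' = H0 + sum_k X_k * H1[k] and only the first factor
# drives volatility (an A1(3)-style specification). Illustrative only.
import numpy as np

def simulate_affine(K0, K1, H0, H1, x0, T=10.0, dt=1/52, seed=0):
    rng = np.random.default_rng(seed)
    n = int(T / dt)
    X = np.empty((n + 1, len(x0)))
    X[0] = x0
    for i in range(n):
        cov = H0 + sum(X[i, k] * H1[k] for k in range(len(H1)))  # local covariance
        drift = K0 + K1 @ X[i]
        # small jitter keeps the Cholesky factorization stable near the boundary
        chol = np.linalg.cholesky(cov + 1e-12 * np.eye(len(x0)))
        X[i + 1] = X[i] + drift * dt + chol @ rng.normal(size=len(x0)) * np.sqrt(dt)
        X[i + 1, 0] = max(X[i + 1, 0], 0.0)   # keep the CIR factor nonnegative
    return X
```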
Dai and Singleton (2000) discuss admissibility and identification issues for the model specification in Eq. (1). First, admissibility refers to the fact that, in order to ensure a well-defined process, parameter constraints must be imposed so that the covariance remains positive semi-definite. Second, due to the latent nature of the states, the parameters may not be econometrically identified, as two distinct sets of parameters can give rise to observationally equivalent models. Dai and Singleton (2000) partition N-factor models into subsets, denoted A_M(N), where M is the number of factors that drive stochastic volatility. We estimate 3-factor term structure models with M = 0, 1 or 2 factors driving stochastic volatility. To ensure admissibility we impose the following constraints on the parameters in estimation:

K0,i^P ≥ 0, i ≤ M;  (3a)
K0,i^Q ≥ 0, i ≤ M;  (3b)
K1,ij^P ≥ 0, i, j ≤ M, i ≠ j;  (3c)
K1,ij^Q ≥ 0, i, j ≤ M, i ≠ j;  (3d)
K1,ij^P = 0, i ≤ M < j;  (3e)
K1,ij^Q = 0, i ≤ M < j;  (3f)
H0 is positive semi-definite;  (3g)
H0,ii = 0, i ≤ M;  (3h)
H1^k is positive semi-definite with H1^k = 0 for k > M;  (3i)
H1,ii^k = 0 when i ≤ M and i ≠ k.  (3j)

6 See also Jackwerth (2000), Aït-Sahalia and Lo (2000), and Aït-Sahalia et al. (2001) for papers that compare the risk-neutral distribution of returns implied from option prices to the objective distribution of returns inferred from time-series data.
7 Researchers have also extensively studied quadratic term structure models (see Ahn et al., 2002, and Leippold and Wu, 2002). However, Cheng and Scaillet (2007) show that affine and quadratic term structure models are equivalent and therefore our choice to restrict the analysis to affine models is without loss of generality.
Under these constraints, the first M factors are CIR (Cox et al., 1985) processes that drive both volatility (through H1) and interest rates (through ρ1), while the remaining N − M factors are conditionally Gaussian with the local conditional volatility determined by the first M factors. Conditions (3a)–(3f) ensure that the conditional mean of the CIR factors stays positive under P and Q: positive constant in the drift, positive feedback from one CIR factor to another, and no feedback from the Gaussian factors (which can be either positive or negative) to the CIR factors. Conditions (3g)–(3j) ensure that the conditional covariance remains positive semi-definite: the matrices that determine volatility are positive semi-definite, and the volatility of a CIR factor is zero when the level of the factor is zero (which requires both (3i) and (3j)). Notice that conditions (3g) and (3h) together imply that H0,ij = 0 for i ≤ M and any j, since if the variance of a random variable is zero then so is its covariance with any other random variable. We also impose the Feller condition so that the factors driving stochastic volatility remain strictly positive under both P and Q:

K0,i^P ≥ (1/2) H1,ii^i, i ≤ M;  (4a)
K0,i^Q ≥ (1/2) H1,ii^i, i ≤ M.  (4b)

The Feller condition allows the historical and risk-neutral measures, P and Q, to be equivalent measures even when K1^Q and K1^P differ in the upper M × M block. This condition allows for a fully flexible market price of risk as in Cheridito et al. (2007) (see also Liptser et al., 2000).8 We impose the following additional constraints (as in Dai and Singleton, 2000) to obtain econometric identification of the parameters:
ρ1,i ≥ 0, i > M;  (5a)
H0,ij = 1 if i = j > M, 0 otherwise;  (5b)
H1,kk^k = 1, k ≤ M;  (5c)
H1,ij^k = 0, k ≤ M, i ≠ j;  (5d)
K0,i^P = 0, i > M;  (5e)
K1,ij^P = 0 if M = 0 and j > i.  (5f)

Condition (5a) fixes the sign of the locally Gaussian factors. Conditions (5b)–(5d) scale the latent factors and orthogonalize their innovations. Condition (5e) removes a level indeterminacy by translating the Gaussian variables so that their mean under P is zero. Finally, condition (5f) normalizes the Gaussian variables by using an orthogonal transformation of the variables (which maintains the orthogonal innovations) to generate a lower-triangular feedback for the Gaussian variables, as in the Schur decomposition of a matrix.

8 For all the specifications that we estimate, these inequality constraints do not in fact bind.

Duffie and Kan (1996) show that zero-coupon bond prices are exponential affine in the factors. Specifically, the price at time t of a zero-coupon bond that pays $1 at time T is given by

P_t^T = E_t^Q [ e^{−∫_t^T r_s ds} ] = e^{A(T−t) + B(T−t)·X_t},  (6)
where B(·) and A(·) solve the Riccati ODEs

dB(τ)/dτ = −ρ1 + K1^{Q⊤} B(τ) + (1/2) B(τ)⊤ H1 B(τ), B(0) = 0,  (7a)
dA(τ)/dτ = −ρ0 + K0^{Q⊤} B(τ) + (1/2) B(τ)⊤ H0 B(τ), A(0) = 0,  (7b)

and B(τ)⊤ H1 B(τ) is a vector whose kth entry is given by B(τ)⊤ H1^k B(τ).
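A minimal numerical sketch of Eq. (7), assuming a three-factor model and using SciPy's generic ODE integrator (the paper does not describe its solver), is:

```python
# Solve the Riccati ODEs (7a)-(7b) numerically for the loadings A(tau), B(tau)
# in P_t^T = exp(A(T-t) + B(T-t) . X_t). H1 is a list of three 3x3 matrices.
import numpy as np
from scipy.integrate import solve_ivp

def bond_loadings(tau, rho0, rho1, K0Q, K1Q, H0, H1):
    def odes(s, y):
        B, A = y[:3], y[3]
        quad = np.array([B @ H1[k] @ B for k in range(3)])  # kth entry B' H1^k B
        dB = -rho1 + K1Q.T @ B + 0.5 * quad                 # Eq. (7a)
        dA = -rho0 + K0Q @ B + 0.5 * B @ H0 @ B             # Eq. (7b)
        return np.concatenate([dB, [dA]])
    sol = solve_ivp(odes, (0.0, tau), np.zeros(4), rtol=1e-10, atol=1e-12)
    return sol.y[3, -1], sol.y[:3, -1]                      # A(tau), B(tau)
```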
Previous papers that investigate the risk premium in long-term interest rates use only the time series of bonds or swaps with different maturities for estimation. Instead, we also use the time series of interest rate cap prices with different maturities. An interest rate cap is a portfolio of options on the 3-month Libor rate that effectively caps the interest rate paid on the floating side of a swap.9 We include cap prices in estimation because they may contain additional econometric information about the risk premium in long-term interest rates. Since caps are interest rate options, their prices are sensitive to the volatility of the risk factors. Cap prices also depend on interest rate risk premia, which are embedded in the risk-neutral distribution of the factors. The price of an N-period cap with strike rate C on 3-month floating interest payments is

C_t^C = Σ_{n=2}^{N} E_t^Q [ e^{−∫_t^{t+0.25n} r_s ds} · 0.25 (L_{t+0.25(n−1)} − C)^+ ],  (8)

where the term inside the expectation is the discounted caplet payoff and L_{t+0.25(n−1)} is the 3-month Libor interest rate, so that

1 + 0.25 L_{t+0.25(n−1)} = 1/P_{t+0.25(n−1)}^{t+0.25n}.  (9)
Duffie et al. (2000) show that cap prices in affine term structure models can be computed as a sum of inverted Fourier transforms. However, when the solutions A and B to the Riccati ODEs in Eq. (7) are not known in closed form, direct Fourier inversion can be too computationally expensive for use in estimation. Instead, we use a more efficient adaptive quadrature method that is based on Joslin (2007).

Our data, obtained from Datastream, consists of weekly Libor rates, swap rates, and at-the-money cap-implied volatilities from January 1995 to February 2006. We use 3- and 6-month Libor and the entire term structure of swap rates to bootstrap zero-coupon swap rates at 1, 2, 3, 4, 5, 7, and 10 years.10 We also use at-the-money caps with maturities of 1, 2, 3, 4, 5, 7, and 10 years.
9 Other papers, such as Umantsev (2002), have used swaptions, which are options on swaps. We choose to use interest rate caps because it is easier to compute their prices without resorting to approximations.
10 Our bootstrap procedure uses the common assumption that forward swap zero rates are constant between observations.
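A rough sketch of the bootstrap step mentioned in footnote 10, under the simplifying assumptions of annual fixed payments and par swap rates observed at every annual maturity (actual market conventions are more involved), is:

```python
# Bootstrap continuously compounded zero-coupon rates from par swap rates,
# assuming annual fixed payments; a simplified sketch, not the paper's procedure.
import numpy as np

def bootstrap_zeros(par_rates):
    # par_rates[n-1] is the n-year par swap rate
    discounts = []
    for s in par_rates:
        annuity = sum(discounts)                          # sum of D_1 ... D_{n-1}
        # par condition s * sum(D_1..D_n) + D_n = 1, solved for D_n
        discounts.append((1.0 - s * annuity) / (1.0 + s))
    n = np.arange(1, len(par_rates) + 1)
    return -np.log(np.array(discounts)) / n               # zero rates by maturity
```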
We use quasi-maximum-likelihood to estimate model parameters for the A0(3), A1(3), and A2(3) models. Following Chen and Scott (1993), we estimate all of the models under the assumption that 3-month Libor and the 2- and 10-year zero-coupon swap rates are priced exactly, while the remaining rates are priced with error.11 In addition, we estimate another set of parameters for the A1(3) and A2(3) models under the assumption that at-the-money caps with maturities of 1, 2, 3, 4, 5, 7, and 10 years are also priced with error. We refer to these versions of the models that we estimate with option prices as the A1(3)o and A2(3)o models.

Given the dynamics in Eq. (1b), the conditional mean, X̄_{t,Δt} = E_t[X_{t+Δt}], satisfies the differential equation

∂X̄_{t,u}/∂u = K0^P + K1^P X̄_{t,u},  (10a)

with the initial condition X̄_{t,0} = E_t[X_t] = X_t. Similarly, the conditional covariance, V_{t,Δt} = E_t[(X_{t+Δt} − X̄_{t,Δt})(X_{t+Δt} − X̄_{t,Δt})⊤], satisfies the differential equation

∂V_{t,u}/∂u = K1^P V_{t,u} + V_{t,u} K1^{P⊤} + H0 + H1 · X̄_{t,u},  (10b)

with initial condition V_{t,0} = 0. This coupled system of linear constant-coefficient ordinary differential equations can be solved in closed form.
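Although the moments have a closed form, a simple way to see the content of Eq. (10) is to integrate the coupled linear system directly; the sketch below does so with SciPy (illustrative only, with H1·m denoting the matrix sum over factors):

```python
# Conditional mean and covariance of X_{t+dt} by integrating Eqs. (10a)-(10b).
import numpy as np
from scipy.integrate import solve_ivp

def conditional_moments(x, dt, K0, K1, H0, H1):
    n = len(x)
    def odes(u, y):
        m, V = y[:n], y[n:].reshape(n, n)
        dm = K0 + K1 @ m                                               # Eq. (10a)
        dV = K1 @ V + V @ K1.T + H0 + sum(m[k] * H1[k] for k in range(n))  # (10b)
        return np.concatenate([dm, dV.ravel()])
    y0 = np.concatenate([x, np.zeros(n * n)])    # initial conditions X_t and V = 0
    sol = solve_ivp(odes, (0.0, dt), y0, rtol=1e-10, atol=1e-12)
    return sol.y[:n, -1], sol.y[n:, -1].reshape(n, n)
```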
For quasi-maximum-likelihood we assume that, ignoring constants, the log-likelihood of the state vector is

L_X = −(1/2) Σ_t [ ln|V_{t,Δt}| + (X_{t+Δt} − X̄_{t,Δt})⊤ V_{t,Δt}^{−1} (X_{t+Δt} − X̄_{t,Δt}) ].  (11)

Let Y_t denote the vector of observed 3-month, 2-year, and 10-year zero-coupon swap rates that we use to invert for the latent states X_t.12 From Eq. (6) we can compute A ∈ R³ and B ∈ R³ˣ³ (given the parameters) and write

Y_t = A + B X_t  ⇒  X_t = B^{−1}[Y_t − A].  (12)

Using our quasi-maximum-likelihood assumption, the log-likelihood of Y_t is

L_Y = L_X − Σ_t ln|B|.  (13)

Using Eq. (6) and the states we have recovered from Eq. (12), we can compute the model-implied zero-coupon yield for any maturity. Let ε̃_k denote the T-dimensional full time series of the pricing errors (as measured by the difference between the observed yield and the model-implied yield) for the kth-maturity interest rate that is priced with error (6-month Libor and 1-, 3-, 4-, 5-, and 7-year zero-coupon swap rates). We assume that these zero-coupon swap rate pricing errors are independent with a mean-zero Gaussian distribution, so that the log-likelihood, ignoring constants, is

L_Ỹ = −(1/2) Σ_{k=1}^{6} [ T · ln σ̃_k² + ε̃_k⊤ ε̃_k / σ̃_k² ].  (14)

Here, we treat the σ̃_k as unknown parameters of the model. Similarly, using Eq. (8), let η̃_k denote the T-dimensional full time series of pricing errors for the kth-maturity interest rate cap (we price caps with maturities 1, 2, 3, 4, 5, 7, and 10 years). Again, we assume that these cap pricing errors are independent with a mean-zero Gaussian distribution, so that the log-likelihood, ignoring constants, is

L_C̃ = −(1/2) Σ_{k=1}^{7} [ T · ln υ̃_k² + η̃_k⊤ η̃_k / υ̃_k² ].  (15)

We treat the υ̃_k as unknown parameters of the model. Finally, we choose the parameters to maximize the log-likelihood of the yields that are priced exactly, the yields that are priced with error, and the caps that are priced with error,

L = L_Y + L_Ỹ + L_C̃.  (16)

Our estimates of the parameters are provided in Tables 1 and 2. Standard errors are computed by the BHHH method using the outer product of the gradient (see Davidson and MacKinnon, 1993). It is difficult to directly assign meaning to many of the parameters since they govern the dynamics of a latent process. That is to say, we are more interested in functions of the parameters that pertain to quantities of economic interest. To this end, we now proceed in Section 3 to compare and contrast the implications of the estimated models for the cross-sectional properties of the yield curve and interest rate option prices. We also examine the implications for the conditional volatility of yields. In Sections 4 and 5, our main focus turns to the predictability of bond returns across the estimated models.

11 By assuming that a subset of securities are priced correctly by the model, we can use these prices to invert for the values of the latent states. Duffee (2002), Dai and Singleton (2002), and Cheridito et al. (2007) also use this approach to invert for the latent states. See Chen and Scott (1993) for more details. We choose the 3-month rate as it is the reference rate for the underlying caps. Moreover, 2-year and 10-year swaps are among the most liquid maturities and capture both the moderate and long end of the term structure.
12 Our exposition of the estimation procedure draws from Fisher and Gilles (1996).
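The pieces in Eqs. (11)–(16) fit together as in the following sketch, in which `cond_moments` is a hypothetical helper returning the conditional mean and covariance from Eq. (10), the error series and their variances are supplied as lists, and constants are ignored as in the text:

```python
# Assemble the quasi-likelihood: invert exactly priced yields for the states
# (Eq. (12)), score the state transitions with Gaussian moments (Eq. (11)),
# add the Jacobian term (Eq. (13)) and the measurement densities (Eqs. (14)-(15)).
import numpy as np

def qml_loglik(Y_exact, A, B, cond_moments, yield_errs, cap_errs, sig2, ups2):
    X = np.linalg.solve(B, (Y_exact - A).T).T            # X_t = B^{-1}(Y_t - A)
    T = len(X) - 1
    LX = 0.0
    for t in range(T):
        m, V = cond_moments(X[t])                        # mean/cov of X_{t+1} given X_t
        e = X[t + 1] - m
        LX -= 0.5 * (np.log(np.linalg.det(V)) + e @ np.linalg.solve(V, e))
    LY = LX - T * np.log(abs(np.linalg.det(B)))          # change-of-variables term
    LYt = -0.5 * sum(len(e) * np.log(s) + e @ e / s for e, s in zip(yield_errs, sig2))
    LCt = -0.5 * sum(len(e) * np.log(u) + e @ e / u for e, u in zip(cap_errs, ups2))
    return LY + LYt + LCt                                # Eq. (16)
```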
3. Fit to prices and conditional volatility

In this section we examine how well the term structure models match zero-coupon swap rates, cap prices, and the conditional volatility of interest rates. Table 3 provides the root mean squared pricing errors (in basis points) for zero-coupon swap rates with different maturities. The root mean squared errors are 0 for the 3-month, 2-, and 10-year zero-coupon swap rates because the latent state variables are chosen so that the models correctly price these rates. The root mean squared pricing errors for other maturities range from about 4 basis points to about 10 basis points, with slightly larger errors for the short maturities and a better fit for longer maturities.13 There is very little difference in the cross-sectional fit between the A0(3), A1(3), and A2(3) models that we estimate without using options. Similarly, there is little difference between the A1(3)o and A2(3)o models that we estimate with options. The use of options to estimate the A1(3)o and A2(3)o models has only a small effect on the models' fit to the cross-section of zero-coupon swap rates with different maturities. Including options improves the fit (relative to models that we estimate without using options) by less than a basis point at the short end of the yield curve (up to 1 year) and worsens the fit by slightly more than a basis point at the long end of the yield curve (beyond 1 year).

Table 4 displays the root mean squared pricing errors (as a percentage of the current market price) for at-the-money caps with various maturities. For all of the models, the percentage pricing errors are worst for 1-year caps and decline as the maturity of the cap increases. The poor fit for short-maturity caps, as well as the weaker fit for short-maturity yields, may be due to the findings in Dai and Singleton (2002) and Piazzesi (2005) which suggest that a fourth factor is required to capture the short end of the yield curve. Additionally, as Piazzesi (2005) documents, jumps play an important role for short maturities and therefore adding jumps may improve the model along this dimension. We choose to implement more parsimonious 3-factor models because we are primarily interested in predicting changes in long-term interest rates. Amongst the models that we estimate without including
13 The cross-sectional pricing errors for all of the models that we estimate are comparable with the pricing errors reported in recent papers such as Dai and Singleton (2000), Duffee (2002), and Cheridito et al. (2007).
Table 1
Parameter estimates: K0^P, K0^Q, K1^P, and K1^Q.
[Entries: parameter estimates with standard errors in parentheses for the A0(3), A1(3), A1(3)o, A2(3), and A2(3)o models; the tabular layout could not be recovered.]
This table presents the parameter estimates of K0^P, K0^Q, K1^P, and K1^Q from Eq. (1) with standard errors in parentheses. The A0(3), A1(3), and A2(3) models were estimated by inverting 3-month, 2-year, and 10-year swap zeros and measuring 6-month Libor, and 1-, 3-, 4-, 5-, and 7-year swap zeros with error. The A1(3)o and A2(3)o models were estimated with the additional assumption that 1-, 2-, 3-, 4-, 5-, 7-, and 10-year at-the-money caps were priced with error.
Table 2
Parameter estimates: H1, ρ0, and ρ1.
[Entries: parameter estimates with standard errors in parentheses for the A0(3), A1(3), A1(3)o, A2(3), and A2(3)o models; the tabular layout could not be recovered.]
This table presents the parameter estimates of H1, ρ0, and ρ1 from Eq. (1) with standard errors in parentheses. The A0(3), A1(3), and A2(3) models were estimated by inverting 3-month, 2-year, and 10-year swap zeros and pricing 6-month Libor, and 1-, 3-, 4-, 5-, and 7-year swap zeros with error. The A1(3)o and A2(3)o models were estimated with the additional assumption that 1-, 2-, 3-, 4-, 5-, 7-, and 10-year at-the-money caps were priced with error.
options, the A2(3) model provides the best fit to the cross-section of at-the-money cap prices. The A1(3)o and A2(3)o models have slightly larger relative pricing errors for 1-year caps than their A1(3) and A2(3) counterparts that we estimate without options. However, the relative pricing errors for caps with longer maturities are considerably lower when we include caps in estimation. For example, the root mean squared relative pricing error for at-the-money 5-year caps is 17% in the A1(3) model and 9.2% in the A1(3)o model. Similarly, the root mean squared relative pricing error for at-the-money 5-year caps is 13.3% in the A2(3) model and 9.0% in the A2(3)o model. The relative pricing errors for the A2(3)o model are slightly better than those for the A1(3)o.

The pricing errors for caps from the A1(3)o and A2(3)o models that we estimate with options compare favorably with the pricing errors that have been reported in previous literature.14 Driessen
14 Previous papers have also used interest rate option prices other than caps to estimate dynamic term structure models. Umantsev (2002) finds that pricing errors for swaptions are significantly reduced when he uses swaption prices to estimate affine term structure models. Bikbov and Chernov (2011) use Eurodollar options to estimate term structure models with constant, stochastic, and unspanned stochastic volatility. They find that only stochastic volatility models (such as our A1 (3)o and A2 (3)o models) can reconcile both option prices and the term structure of interest rates.
Table 3
Pricing errors in BPS for swap-implied zeros.

             A0(3)   A1(3)   A1(3)o   A2(3)   A2(3)o
6 Month      7.1     7.1     6.8      7.1     6.8
1 Year       9.9     9.9     9.3      10.0    9.3
3 Year       4.1     4.1     4.5      4.1     4.5
4 Year       5.3     5.2     6.3      5.2     6.2
5 Year       5.2     5.2     6.7      5.2     6.6
7 Year       3.8     3.8     5.5      3.8     5.3

This table shows the root mean squared pricing errors in basis points for yields on swap-implied zeros. The A0(3), A1(3), and A2(3) models were estimated by inverting 3-month, 2-year, and 10-year swap zeros and pricing 6-month Libor and 1-, 3-, 4-, 5-, and 7-year swap zeros with error. The A1(3)o and A2(3)o models were estimated with the additional assumption that 1-, 2-, 3-, 4-, 5-, 7-, and 10-year at-the-money caps were priced with error.

Table 4
Relative pricing errors in % for at-the-money caps.

             A0(3)   A1(3)   A1(3)o   A2(3)   A2(3)o
1 Year       33.6    32.4    36.4     33.3    35.5
2 Year       19.9    19.0    14.6     16.9    14.4
3 Year       18.9    18.0    10.9     15.7    10.9
4 Year       17.3    17.3    9.6      14.2    9.6
5 Year       16.3    17.0    9.2      13.3    9.0
7 Year       14.3    16.1    8.6      11.7    8.3
10 Year      13.4    15.8    9.2      11.0    8.9

This table shows the root mean squared relative pricing errors in % for at-the-money caps. The A0(3), A1(3), and A2(3) models were estimated by inverting 3-month, 2-year, and 10-year swap zeros and pricing 1-, 3-, 4-, 5-, and 7-year zeros with error. The A1(3)o and A2(3)o models were estimated with the additional assumption that 1-, 2-, 3-, 4-, 5-, 7-, and 10-year at-the-money caps were priced with error.
et al. (2003) estimate a 3-factor Gaussian HJM model with cap data and report absolute pricing errors (averaged across maturities) of 22.2% of cap market prices. Li and Zhao (2006) estimate a 3-factor quadratic term structure model and find that the root mean squared percentage pricing error for 5-year at-the-money caps is 10.4%. Jagannathan et al. (2003) estimate a 3-factor CIR model and find that the mean absolute pricing errors are 36.62 basis points for 5-year caps compared to a mean market price of 284.84 basis points (the results are similar for caps with other maturities). Longstaff et al. (2001) estimate a 4-factor string market model using swaptions and find that it overprices caps. They report that the mean percentage valuation error for 5-year caps is 5.665% and ranges from a minimum of −2.385% to a maximum of 38.071%. All of these papers report larger pricing errors for shorter-dated caps. Although not reported here, the A1(3)o and A2(3)o models that we estimate with caps also provide an excellent fit to the prices of at-the-money swaptions.

The pricing performance of the models estimated with and without options provides a key insight for our main objective of analyzing the predictability of excess returns. All of the models have very similar pricing errors for the cross-section of yields, which suggests that the objective function may be very flat along this dimension. Put another way, a variety of risk-neutral models can give very similar implications for bond prices and swap rates. However, when we consider cap prices there is a stark difference between models that had similar pricing errors for yields. The use of option prices in estimation provides both lower option pricing errors and more efficient estimates of the risk-neutral Q parameters. In Section 4, we show that this increased efficiency is also beneficial for predicting bond returns.

Unlike prices, conditional volatility is not directly observed and therefore it must be estimated.15 For estimates of conditional
15 Implied volatilities from cap prices are forward looking and directly observable. However, in the case of models with stochastic volatility, the market prices of risk may cause the implied volatilities from cap prices to differ from the actual conditional volatility.
Fig. 1. Realized volatility of 10-year zero-coupon swap rate. Note: These figures plot weekly model conditional volatility of the 10-year zero-coupon swap rate against estimates of conditional volatility based on historical data: an exponentially weighted moving average (EWMA) with a 26-week half-life and an EGARCH(1,1). The top plot shows the conditional volatility in the A1(3) and A1(3)o models. The bottom plot shows the conditional volatility in the A0(3), A2(3), and A2(3)o models. The A0(3), A1(3), and A2(3) models were estimated by inverting 3-month, 2-year, and 10-year swap zeros and pricing 6-month Libor and 1-, 3-, 4-, 5-, and 7-year swap zeros with error. The A1(3)o and A2(3)o models were estimated with the additional assumption that 1-, 2-, 3-, 4-, 5-, 7-, and 10-year at-the-money caps were priced with error.

Table 5
Correlation between model and EGARCH volatility.

             A0(3)   A1(3)   A1(3)o   A2(3)   A2(3)o
6 Month      0       19.2    28.9     39.1    30.1
1 Year       0       50.8    56.3     58.3    52.9
2 Year       0       75.0    77.0     63.2    66.9
3 Year       0       83.0    81.5     39.0    70.6
4 Year       0       84.4    81.4     15.4    71.9
5 Year       0       84.1    79.3     −2.6    69.3
7 Year       0       84.3    77.4     −21.2   66.4
10 Year      0       82.0    75.0     −26.7   61.6

This table shows the correlation between model-implied one-week volatilities and EGARCH(1,1) volatility estimates. The A0(3), A1(3), and A2(3) models were estimated by inverting 3-month, 2-year, and 10-year swap zeros and pricing 6-month Libor and 1-, 3-, 4-, 5-, and 7-year swap zeros with error. The A1(3)o and A2(3)o models were estimated with the additional assumption that 1-, 2-, 3-, 4-, 5-, 7-, and 10-year at-the-money caps were priced with error.
volatility based on historical data we use an exponentially weighted moving average (EWMA) with a 26-week half-life, and also estimate an EGARCH(1,1) for each zero-coupon swap rate maturity. Fig. 1 plots the conditional volatility of the 10-year zero-coupon swap rate from the term structure models against our estimates of conditional volatility that use historical data. Table 5 provides the correlation between the conditional volatility in the pricing model and the EGARCH(1,1) estimates of conditional volatility.

The conditional volatility of all swap rates is constant in the A0(3) model and therefore these models cannot capture any time-series variation. Over our sample period, the average level of conditional volatility in the A0(3) model for zero-coupon swap rates with different maturities is slightly below our estimates based on historical data. In the stochastic volatility models, for maturities beyond 1 year, the conditional volatilities of zero-coupon swap rates in the A1(3), A1(3)o, and A2(3)o models are all highly correlated with
the EGARCH estimates of conditional volatility. The correlations are highest in the A1(3) model, followed closely by the A1(3)o model, and then by the A2(3)o model. The conditional volatility in the A2(3) model is positively correlated with the estimates of conditional volatility for maturities up to 4 years, but negatively correlated for maturities beyond 4 years. This result occurs because there are two stochastic volatility factors that jointly drive the level and volatility of yields in the A2(3) model. One factor is closely correlated with the volatility of short-maturity yields. However, without the additional discipline in estimation that is provided by options, the second factor is imprecisely estimated and is negatively correlated with the volatility of long-maturity yields.

The A1(3) and A2(3) models match the average level of conditional volatility for the EGARCH and EWMA estimates. For the 6-month zero-coupon rate, the volatility match is worse, though still positively related. It is very similar between the A1(3) and A1(3)o models, and between the A2(3) and A2(3)o models. In summary, none of the models match the conditional volatility of short-term interest rates, but the A1(3), A1(3)o, and A2(3)o models capture the conditional volatility of long-term interest rates. The failure to match the conditional volatility of short-term interest rates occurs mainly during periods of high volatility (as estimated by the EGARCH specification). Again, one could use a fourth latent factor with jumps to better capture these dynamics.
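For concreteness, a minimal version of the EWMA volatility proxy with a 26-week half-life (applied to weekly rate changes; the initialization is an assumption) is:

```python
# EWMA conditional volatility with a 26-week half-life for weekly observations.
import numpy as np

def ewma_vol(rate_changes, half_life=26):
    lam = 0.5 ** (1.0 / half_life)          # weekly decay implying the half-life
    var = rate_changes[0] ** 2              # initialize at the first squared change
    out = np.empty(len(rate_changes))
    for i, dx in enumerate(rate_changes):
        var = lam * var + (1.0 - lam) * dx ** 2
        out[i] = np.sqrt(var)
    return out
```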
Table 6
Predictability of excess returns (R²s).

             A0(3)        A1(3)       A1(3)o       A2(3)        A2(3)o       CP5     CP10
2 Year       0.7 (9.4)    0.9 (9.6)   3.5 (9.2)    −0.2 (9.4)   3.4 (9.2)    19.8    22.8
3 Year       6.5 (9.1)    4.0 (9.6)   8.6 (8.8)    2.7 (9.4)    9.7 (8.5)    20.1    22.2
4 Year       10.2 (8.9)   5.9 (9.6)   11.7 (8.5)   4.0 (9.5)    13.0 (8.3)   20.1    21.6
5 Year       12.5 (8.7)   7.0 (9.4)   13.5 (8.6)   4.6 (9.6)    14.6 (8.2)   19.7    21.2
6 Year       13.8 (8.6)   7.5 (9.3)   14.4 (8.4)   4.9 (9.4)    15.5 (8.1)   19.0    20.3
7 Year       14.9 (8.6)   7.9 (9.2)   15.1 (8.2)   5.2 (9.5)    16.0 (8.1)   18.5
8 Year       15.4 (8.4)   8.0 (9.2)   15.4 (8.4)   5.2 (9.5)    16.2 (8.1)   17.9
9 Year       15.9 (8.4)   8.0 (9.0)   15.6 (8.4)   5.3 (9.6)    16.3 (8.0)   17.5
10 Year      16.1 (8.3)   8.0 (9.1)   15.5 (8.3)   5.3 (9.3)    16.2 (8.0)   17.1

This table presents R²s obtained from overlapping weekly projections of 3-month realized zero-coupon swap rate returns, for different maturities, on model-implied returns. Regressions are based on overlapping data. Bootstrapped t-statistics are reported in parentheses. The A0(3), A1(3), and A2(3) models were estimated by inverting 3-month, 2-year, and 10-year swap zeros and pricing 6-month Libor and 1-, 3-, 4-, 5-, and 7-year swap zeros with error. The A1(3)o and A2(3)o models were estimated with the additional assumption that 1-, 2-, 3-, 4-, 5-, 7-, and 10-year at-the-money caps were priced with error.
4. Predictability of excess returns

In this section we examine how well the risk premiums that we estimate are able to predict excess returns for long-term swaps. The excess return on a τ-maturity bond that is purchased at time t and sold at time t + Δt is defined as

r_{t,Δt}^{e,τ} := ln(P_{t+Δt}^{t+τ}/P_t^{t+τ})/Δt − ln(1/P_t^{t+Δt})/Δt,  (17)

where the first term is the bond return and the second is the risk-free return.
In an affine term structure model this risk premium, or expected excess return, is given by

E_t[r_{t,Δt}^{e,τ}] = (1/Δt) { A(τ − Δt) + B(τ − Δt) · E_t[X_{t+Δt}] − [A(τ) + B(τ) · X_t] + A(Δt) + B(Δt) · X_t },  (18)
where A and B satisfy the Riccati ODEs in Eq. (7). To measure how well our estimated risk premiums predict excess returns, we compute the following modified R² statistic,

R² = 1 − mean[ (r_{t,Δt}^{e,τ} − E_t[r_{t,Δt}^{e,τ}])² ] / var(r_{t,Δt}^{e,τ}).  (19)
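A direct transcription of Eq. (19), with the realized and model-implied expected excess returns supplied as arrays, is:

```python
# Modified R^2 of Eq. (19): squared gaps between realized excess returns and
# model-implied expected excess returns, scaled by the variance of realized returns.
import numpy as np

def modified_r2(realized, model_expected):
    resid = realized - model_expected
    return 1.0 - np.mean(resid ** 2) / np.var(realized)
```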
The mean (sample average) and variance in Eq. (19) are computed using the returns beginning in each of the 483 weeks in our sample (there is overlap in returns with a horizon longer than a week). To emphasize, we compute our modified R² statistic by directly comparing realized excess returns to the model-implied expected returns computed from Eq. (18). Our modified statistic differs from the R² computed from a standard regression that uses linear projection and ignores the cross-sectional consistency of the no-arbitrage model. Table 6 presents the R² statistics with bootstrapped t-statistics for 3-month excess returns for the period from January 1995 to February 2006 that was used to estimate the model.16
16 For ease of exposition, we focus on 3-month holding period returns across maturities. Roughly speaking, for shorter (longer) holding periods the R²s are all lower (higher) than for the 3-month holding period, with similar patterns across maturities. This result is consistent with a higher signal-to-noise ratio for shorter holding periods and is also consistent with the empirical results in Cochrane and Piazzesi (2005) and Diebold and Li (2006), among others. See also Boudoukh et al. (2008).
We utilize the following semi-nonparametric procedure to bootstrap standard errors. For each bootstrap simulation, we choose a random week in the sample to initialize the state variable. We then use the estimated model parameters to simulate a 483-week sample of the state vector. Due to the lack of an economic model for the measurement errors, we complete the bootstrap procedure by using the block bootstrap with blocks of length 12 to generate pricing errors (for an overview, see Chernick, 2001). The reported bootstrapped t-statistics correspond to the reported R²s divided by the standard deviation of the bootstrapped R²s.

For comparison, in Table 6 we also provide this R² statistic for three versions of the regressions of excess returns on forward rates as performed in Cochrane and Piazzesi (2005). For each τ-year zero-coupon swap rate (τ = 2, 3, 4, 5), Cochrane and Piazzesi (2005) regress
r_{t,Δt}^{e,n} = β0^n + β1^n Y_t^1 + β2^n F_t^2 + β3^n F_t^3 + β4^n F_t^4 + β5^n F_t^5 + ε_t^n,  (20)

where F_t^n := n Y_t^(n) − (n − 1) Y_t^(n−1) is the 1-year forward rate at time t between t + n − 1 and t + n. In Table 6, we denote the regression results from Eq. (20) by CP5. CP10 denotes the corresponding regression results using one-year forward rates up to 10 years.

Amongst the three models that we estimate without options, the A0(3) model is best at predicting 3-month excess returns for zero-coupon swaps, followed by the A1(3) model and then by the A2(3) model. When we include options in estimation the results are reversed. The A2(3)o model provides the best prediction results, followed by the A1(3)o model and then the A0(3) model (the A0(3) model has slightly higher R²s for the 9- and 10-year maturities). Moreover, when we include options in estimation, the R²s are much closer in magnitude to those obtained from the regressions in Cochrane and Piazzesi (2005).
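A sketch of the projection in Eq. (20) follows; it reports the conventional regression R² rather than the modified statistic in Eq. (19), and `yields[n]` is assumed to hold the time series of the n-year zero-coupon yield.

```python
# Cochrane-Piazzesi style projection of excess returns on the 1-year yield and
# the 1-year forward rates F^n = n*Y^(n) - (n-1)*Y^(n-1), n = 2..max_fwd.
import numpy as np

def cp_regression(excess_returns, yields, max_fwd=5):
    fwd = [n * yields[n] - (n - 1) * yields[n - 1] for n in range(2, max_fwd + 1)]
    X = np.column_stack([np.ones(len(excess_returns)), yields[1]] + fwd)
    beta, *_ = np.linalg.lstsq(X, excess_returns, rcond=None)
    resid = excess_returns - X @ beta
    return beta, 1.0 - resid.var() / excess_returns.var()   # coefficients, R^2
```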
Table 7
Difference in R²s.

             A1(3)o − A1(3)       A2(3)o − A2(3)
2 Year       2.6 [−11.9, 7.4]     3.6 [−1.7, 9.1]
3 Year       4.6 [0.1, 9.4]       7.0 [1.6, 12.8]
4 Year       5.8 [1.4, 10.7]      9.0 [3.8, 14.8]
5 Year       6.5 [−0.5, 11.8]     10.0 [5.0, 15.8]
6 Year       6.9 [2.5, 11.7]      10.6 [5.4, 16.4]
7 Year       7.2 [2.7, 12.3]      10.8 [5.8, 16.7]
8 Year       7.4 [3.0, 12.4]      11.0 [5.9, 16.8]
9 Year       7.6 [3.2, 12.3]      11.0 [6.2, 16.6]
10 Year      7.5 [3.2, 12.2]      10.9 [6.3, 16.4]

This table presents the difference in R²s for 3-month realized zero-coupon swap rate returns using the model-implied expected returns. The table presents the differences between the A1(3)o and A1(3) models, and between the A2(3)o and A2(3) models. Bootstrapped 95% confidence intervals are presented in brackets and are computed using the bootstrap procedure described above. We estimated the A1(3) and A2(3) models by inverting 3-month, 2-year, and 10-year swap zeros and pricing 6-month LIBOR and 1-, 3-, 4-, 5-, and 7-year zeros with error. We estimated the A1(3)o and A2(3)o models with the additional assumption that 1-, 2-, 3-, 4-, 5-, 7-, and 10-year at-the-money caps were priced with error.
Fig. 2. Quarterly expected excess return – 5-year zero coupon. Note: This figure plots the quarterly expected excess return on a 5-year zero-coupon bond (annualized) for each of the models that we estimate. The top plot shows the expected excess return for the A1(3) and A2(3) models that we estimated without using options. The bottom plot shows the expected excess return for the A0(3) model that we estimated without using options, and the A1(3)o and A2(3)o models that we estimated using options.
Notice that the worst predictive performance is for short maturities, mirroring the pricing results in Tables 3 and 4. In the A2(3) model, the R² is even negative for the 3-month excess return on the 2-year zero-coupon swap, which indicates that the model's predictions are worse than the sample average of returns. We attribute this poorer performance for short-maturity zero-coupon swaps to the same reasons we discussed in Section 3 on the fit to prices.
Table 8
Mean reversion and long-run means.

Parameter           | A0(3) | A1(3) | A1(3)o | A2(3) | A2(3)o
θ^Q_{3m} (%)        | 11.10 | 10.50 | 8.02   | 13.00 | 8.14
θ^Q_{2y} (%)        | 11.10 | 10.50 | 8.02   | 13.00 | 8.14
θ^Q_{10y} (%)       | 11.00 | 10.40 | 7.86   | 12.70 | 7.98
half-life_1^Q       | 0.53  | 0.49  | 0.44   | 0.49  | 0.48
half-life_2^Q       | 1.02  | 1.12  | 1.22   | 1.06  | 1.16
half-life_3^Q       | 23.30 | 22.40 | 8.09   | 31.40 | 8.46
eigenvector_{3,3m}  | 0.61  | 0.61  | 0.66   | 0.61  | 0.65
eigenvector_{3,2y}  | 0.59  | 0.59  | 0.61   | 0.59  | 0.61
eigenvector_{3,10y} | 0.53  | 0.53  | 0.45   | 0.53  | 0.45

This table reports the risk-neutral long-run means (θ^Q_{3m}, θ^Q_{2y}, and θ^Q_{10y}) of the 3-month, 2-year, and 10-year zero-coupon yields. It also reports the half-lives of the eigenvectors of the mean reversion matrix under Q, as well as the elements of the third eigenvector. The A0(3), A1(3), and A2(3) models were estimated by inverting 3-month, 2-year, and 10-year swap zeros and pricing 1-, 3-, 4-, 5-, and 7-year zeros with error. The A1(3)o and A2(3)o models were estimated with the additional assumption that 1-, 2-, 3-, 4-, 5-, 7-, and 10-year at-the-money caps were priced with error.
Fig. 2 plots the 3-month expected excess return on a 5-year zero-coupon bond for each of the models that we estimate. Consistent with the results in Tables 6 and 7, the expected excess returns for the A1(3)o and A2(3)o models that we estimate using options are very similar to those for the constant volatility A0(3) model (the preferred model in Duffee (2002) and Cheridito et al. (2007)). The expected excess returns for the A1(3) and A2(3) models are very similar to each other, but different from their counterparts that we estimate using options. To summarize, the risk premiums that we estimate using interest rate cap prices are better able to predict excess returns for long-term swaps.

The question remains: which elements of the risk premium do options help to identify? To address this question, Table 8 highlights the properties of the risk-neutral distribution that differ when we use options to estimate the models. The risk-neutral long-run means (θ^Q_{3m}, θ^Q_{2y}, and θ^Q_{10y}) of the 3-month, 2-year, and 10-year zero-coupon yields range between 10.4% and 13% for the models that are estimated without options, but hover around 8% for the A1(3)o and A2(3)o models that are estimated with options. For comparison, the maximum values of the 3-month, 2-year, and 10-year zero-coupon swap rates over our sample period are 6.8%, 8%, and 8.2%, respectively, and the mean values are 4.2%, 4.8%, and 5.9%. In each model, the third eigenvector of K^Q acts as a level factor that is almost equally weighted across the three yields. For the models that are estimated without options, the half-life of shocks to this level factor ranges from 22.4 years to 31.4 years, while it is just 8.09 years in the A1(3)o model and 8.46 years in the A2(3)o model. That is, for the models that are estimated without options, the risk-neutral expectation of future yields is high, but shocks to the level of yields are very persistent. By contrast, for the A1(3)o and A2(3)o models, under the risk-neutral measure the level of yields reverts more quickly to a lower long-run mean (i.e., shocks are less persistent and die off more quickly). Both of these combinations produce an upward sloping yield curve consistent with the data. However, the faster mean reversion to a lower long-run mean fits the joint swaps and options data best. Although the argument is heuristic, we also believe that the lower long-run means represent a more economically plausible level of risk aversion, with a rate of mean reversion that is on the order of business cycle frequency. For example, Gurkaynak et al. (2005) find it difficult to reconcile the business cycle frequency variation (under P) in macroeconomic variables with the low frequency variation implicit in long-maturity bond yields (under Q). To some extent, our findings that the option-based estimates have higher rates of mean reversion (under Q) suggest that information in option prices partially attenuates this difference.
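A short numerical sketch of the half-life calculation behind Table 8 may help: if a shock loading on an eigenvector of K^Q with eigenvalue λ decays as e^{−λt}, its half-life is ln 2/λ. The matrix `KQ` below is an illustrative placeholder, not an estimate from the paper; its values are chosen only so that the implied half-lives are of the same order as those in Table 8.

```python
import numpy as np

# Illustrative (not estimated) annualized risk-neutral mean-reversion matrix.
KQ = np.array([[1.30, 0.00, 0.00],
               [0.25, 0.60, 0.00],
               [0.05, 0.10, 0.03]])

eigvals, eigvecs = np.linalg.eig(KQ)
half_lives = np.log(2.0) / eigvals.real      # in years if KQ is per-year
slowest = np.argmin(eigvals.real)            # smallest eigenvalue decays slowest
level_factor = eigvecs.real[:, slowest]      # loadings of the persistent 'level' factor
print(np.sort(half_lives), level_factor)
```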
Table 9
Composition of expected excess return.

          | A0(3)  | A1(3)  | A1(3)o | A2(3) | A2(3)o
CONSTANT  | −0.224 | −0.129 | −0.159 | 0.012 | −0.127
LEVEL     | 3.157  | 2.920  | 2.672  | 1.173 | 6.443
SLOPE     | 9.030  | 3.263  | 6.631  | 0.683 | 2.455
CURVATURE | 3.347  | 0.680  | 3.524  | 2.080 | 6.231

This table provides the values for CONSTANT, LEVEL, SLOPE, and CURVATURE for each of the models that we estimate, where the annualized 3-month expected excess return on a 10-year zero-coupon bond is

E_t[ r_{t,0.25}^{e,10} ] = CONSTANT + LEVEL × Y_t^{(0.25)} + SLOPE × ( Y_t^{(10)} − Y_t^{(0.25)} ) + CURVATURE × ( 2·Y_t^{(2)} − Y_t^{(10)} − Y_t^{(0.25)} ).
5. Factor analysis

Litterman and Scheinkman (1991) find that most of the variation in bond returns can be explained by the level, slope, and curvature of the yield curve. In this section we examine how the expected excess returns in each model depend on the level, slope, and curvature of interest rates (which are, in turn, affine functions of the underlying states in the model). In each model that we estimate we can write

E_t[ r_{t,Δt}^{e,τ} ] = CONSTANT + LEVEL × Y_t^{(0.25)} + SLOPE × ( Y_t^{(10)} − Y_t^{(0.25)} ) + CURVATURE × ( 2·Y_t^{(2)} − Y_t^{(10)} − Y_t^{(0.25)} ),  (21)
where Y_t^{(0.25)}, Y_t^{(2)}, and Y_t^{(10)} are the 3-month, 2-year, and 10-year zero-coupon swap rates, respectively. For ease of exposition, we focus on the 10-year return decomposition for a holding period (Δt) of 3 months. Results are similar for other moderate and long maturities. Table 9 provides the values for CONSTANT, LEVEL, SLOPE, and CURVATURE for each of the models that we estimate. Comparing the A1(3) and A1(3)o models, we see that in general the options model gives more weight to SLOPE and CURVATURE (3.26 versus 6.63 and 0.67 versus 3.52, respectively), while the weight on LEVEL is similar across these models. Regarding the A2(3) and A2(3)o models, we see a similar pattern across SLOPE and CURVATURE: including options results in a larger weight on these two variables (0.68 versus 2.45 and 2.08 versus 6.23, respectively). Note that although the differences are of similar magnitude, SLOPE has roughly three times the standard deviation of CURVATURE; we therefore conclude that including options in model estimation primarily helps to identify the risk premium (or expected excess return) associated with the slope of the yield curve and, to a lesser degree, the curvature of the yield curve. The A0(3) model shows a roughly similar pattern to the models that we estimate with options, with comparable values of LEVEL and higher values of SLOPE and CURVATURE. This result is consistent with our finding that, as a group, the A0(3), A1(3)o, and A2(3)o models outperformed the A1(3) and A2(3) models with respect to forecasting bond returns.

Dai and Singleton (2002) present two additional challenges for dynamic term structure models that are related to risk premia in fixed income markets. The first challenge, which Dai and Singleton (2002) refer to as the linear projection of yields, or LPY(I), is to match the pattern of violations of the expectations hypothesis documented in Fama and Bliss (1987) and Campbell and Shiller (1991). These papers regress excess returns on the slope of the yield curve and find that the regression coefficient is not 1, but instead is negative (and more so for longer maturities). Dai and Singleton's second, related challenge, LPY(II), states that when excess returns are adjusted by model-implied risk premia, the expectations hypothesis should be restored and the regression coefficient on the slope of the yield curve should be 1.
Specifically, Fama and Bliss (1987) and Campbell and Shiller (1991) perform the following regression:

Y_{t+Δt}^{(n−Δt)} − Y_t^{(n)} = Φ_{0,n} + Φ_{1,n} [ Δt/(n − Δt) ] ( Y_t^{(n)} − Y_t^{(Δt)} ) + ε_t^n,  (22)

and find that the regression coefficients, Φ̂_{1,n}, are increasingly negative for larger maturities n. Fig. 3 provides the LPY(I) regression results for the models that we estimate.17 Dai and Singleton (2002) find that only the A0(3) model matches LPY(I). We also find that, when not using options data, only the A0(3) model with constant volatility matches LPY(I), while the models with stochastic volatility fail to replicate this feature of the data. In contrast, we find that the models with stochastic volatility, when estimated jointly with options data, are statistically consistent with the empirically observed slope coefficients. For all of the models, the observed slope coefficients lie within a simulated 95% confidence interval. However, the point estimates in the A0(3), A1(3)o, and A2(3)o models are much closer to the observed regression coefficients in the data. In unreported results, we also found that, when adjusted for small-sample bias, the A1(3)o and A2(3)o models that we estimate with options also satisfy LPY(II). Our results differ from the findings in both Duffee (2002) and Dai and Singleton (2002) that dynamic term structure models with constant volatility were superior to those with stochastic volatility in terms of consistency in predicting yields. Instead, we find that using options data for estimation allows the models with stochastic volatility to maintain similar or superior performance along this dimension.

Fig. 3. Regression coefficients from linear projection on yields. Note: This figure shows the regression coefficients of the Campbell–Shiller regression of Y_{t+0.25}^{(n−0.25)} − Y_t^{(n)} on the slope, ( Y_t^{(n)} − Y_t^{(0.25)} )/(n − 0.25). The model values are simulated mean regression coefficients. The A0(3), A1(3), and A2(3) models were estimated by inverting 3-month, 2-year, and 10-year swap zeros and pricing 6-month LIBOR and 1-, 3-, 4-, 5-, and 7-year swap zeros with error. The A1(3)o and A2(3)o models were estimated with the additional assumption that 1-, 2-, 3-, 4-, 5-, 7-, and 10-year at-the-money caps were priced with error.

6. Conclusion

Interest rate options may contain information about the risk premium in long-term interest rates because their prices are sensitive to the volatility and market prices of the underlying interest rates. We use the time series of interest rate cap prices and swap rates to estimate 3-factor affine term structure models.
17 The mean regression coefficients for each model were generated using 1000 simulations of an Euler approximation of the stochastic volatility models. In the case of the A0(3) model, the confidence interval can in fact be computed exactly in closed form. The stochastic volatility models do not yield closed form expressions for the confidence intervals since closed form transition densities are not known for these processes.
The risk premiums that we estimate using interest rate option prices are better able to predict excess returns for long-term swaps over short-term swaps. We show that including options reduces the estimate of the risk-neutral long-run mean of yields and increases the estimated rate of mean reversion to a more economically plausible level. We also show that including options helps us to identify the portion of the risk premium that is associated with the slope and curvature of the yield curve. With a correction for a small-sample bias, the models with stochastic volatility that we estimate with options also capture the failure of the expectations hypothesis and match regressions of returns on the slope of the yield curve.

Acknowledgements

We thank Chris Armstrong, Snehal Banerjee, David Bolder, Mikhail Chernov, Anna Cieslak, Darrell Duffie, Peter Feldhütter, René Garcia, Chris Jones, Yaniv Konchitchki, Peter Reiss, David Runkle, and seminar participants at the American Finance Association meetings in Chicago, the Northern Finance Association meetings in Vancouver, the SAFE 2010 Term Structure Conference in Verona, the University of Waterloo, the Bank of Canada Fixed Income conference, Barclays Global Investors, and Stanford GSB. We are very grateful to Ken Singleton for many discussions and comments. Almeida gratefully acknowledges financial support from CNPq-Brazil and Graveline acknowledges support from a Carlson School Dean's Small Research Grant.

References

Ahn, D.H., Dittmar, R.F., Gallant, A.R., 2002. Quadratic term structure models: theory and evidence. The Review of Financial Studies 15, 243–288.
Aït-Sahalia, Y., Lo, A.W., 2000. Nonparametric risk management and implied risk aversion. Journal of Econometrics 94, 9–51.
Aït-Sahalia, Y., Wang, Y., Yared, F., 2001. Do option markets correctly price the probabilities of movement of the underlying asset? Journal of Econometrics 102, 67–110.
Andersen, T.G., Benzoni, L., 2008. Do bonds span volatility risk in the US treasury market? A specification test for affine term structure models. SSRN eLibrary.
Bakshi, G.S., Carr, P.P., Wu, L., 2008. Stochastic risk premiums, stochastic skewness in currency options, and stochastic discount factors in international economies. Journal of Financial Economics 87, 132–156.
Bikbov, R., Chernov, M., 2011. Yield curve and volatility: lessons from eurodollar futures and options. Journal of Financial Econometrics 9 (1), 66–105.
Boudoukh, J., Richardson, M., Whitelaw, R.F., 2008. The myth of long-horizon predictability. Review of Financial Studies 21, 1577–1605.
Campbell, J.Y., Shiller, R.J., 1991. Yield spreads and interest rate movements: a bird's eye view. Review of Economic Studies 58, 495–514.
Chen, R.R., Scott, L., 1993. Maximum likelihood estimation for a multifactor equilibrium model of the term structure of interest rates. Journal of Fixed Income 3, 14–31.
Cheng, P., Scaillet, O., 2007. Linear-quadratic jump–diffusion modeling. Mathematical Finance 17, 575–598.
Cheridito, P., Filipovic, D., Kimmel, R.L., 2007. Market price of risk specifications for affine models: theory and evidence. Journal of Financial Economics 83, 123–170.
Chernick, M., 2001. Bootstrap Methods: A Guide for Practitioners and Researchers, third ed. Wiley-Interscience.
Chernov, M., Ghysels, E., 2000. A study towards a unified approach to the joint estimation of objective and risk neutral measures for the purpose of options valuation. Journal of Financial Economics 56, 407–458.
Cochrane, J.H., Piazzesi, M., 2005. Bond risk premia.
American Economic Review 95, 138–160.
Collin-Dufresne, P., Goldstein, R.S., 2002a. Do bonds span the fixed income markets? Theory and evidence for unspanned stochastic volatility. Journal of Finance 57, 1685–1730.
Collin-Dufresne, P., Goldstein, R.S., Jones, C.S., 2009. Can interest rate volatility be extracted from the cross section of bond yields? Journal of Financial Economics 94, 47–66.
Cox, J.C., Ingersoll, J.E., Ross, S.A., 1985. A theory of the term structure of interest rates. Econometrica 53, 385–408.
Dai, Q., Singleton, K.J., 2000. Specification analysis of affine term structure models. Journal of Finance 55, 1943–1978.
Dai, Q., Singleton, K.J., 2002. Expectation puzzles, time-varying risk premia, and affine models of the term structure. Journal of Financial Economics 63, 415–441.
Dai, Q., Singleton, K.J., 2003. Term structure dynamics in theory and reality. Review of Financial Studies 16, 631–678.
Davidson, R., MacKinnon, J., 1993. Estimation and Inference in Econometrics. Oxford University Press.
Diebold, F., Li, C., 2006. Forecasting the term structure of government bond yields. Journal of Econometrics 130, 337–364.
Driessen, J., Klaassen, P., Melenberg, B., 2003. The performance of multi-factor term structure models for pricing and hedging caps and swaptions. Journal of Financial and Quantitative Analysis 38, 635–672.
Duarte, J., 2004. Evaluating an alternative risk preference in affine term structure models. Review of Financial Studies 17, 379–404.
Duffee, G.R., 2002. Term premia and interest rate forecasts in affine models. Journal of Finance 57, 405–443.
Duffie, J.D., 2001. Dynamic Asset Pricing Theory, third ed. Princeton University Press, Princeton, New Jersey.
Duffie, J.D., Kan, R., 1996. A yield-factor model of interest rates. Mathematical Finance 6, 379–406.
Duffie, J.D., Pan, J., Singleton, K., 2000. Transform analysis and asset pricing for affine jump-diffusions. Econometrica 68, 1343–1376.
Eraker, B., 2004. Do stock prices and volatility jump? Reconciling evidence from spot and option prices. Journal of Finance 59, 1367–1403.
Fama, E.F., 1984a. The information in the term structure. Journal of Financial Economics 13, 509–528.
Fama, E.F., 1984b. Term premiums in bond returns. Journal of Financial Economics 13, 529–546.
Fama, E.F., Bliss, R.R., 1987. The information in long-maturity forward rates. American Economic Review 77, 680–692.
Fisher, M., Gilles, C., 1996. Estimating exponential-affine models of the term structure. Working Paper.
Graveline, J.J., 2008. Exchange rate volatility and the forward premium anomaly. Working Paper.
Gurkaynak, R.S., Sack, B., Swanson, E., 2005. The sensitivity of long-term interest rates to economic news: evidence and implications for macroeconomic models. American Economic Review 95, 425–436.
Han, B., 2007. Stochastic volatilities and correlations of bond yields. Journal of Finance 62, 1491–1524.
Jackwerth, J.C., 2000. Recovering risk aversion from option prices and realized returns. Review of Financial Studies 13, 433–451.
Jagannathan, R., Kaplin, A., Sun, S., 2003. An evaluation of multi-factor CIR models using LIBOR, swap rates, and cap and swaption prices. Journal of Econometrics 116, 113–146.
Jones, C.S., 2003. The dynamics of stochastic volatility: evidence from underlying and options markets. Journal of Econometrics 116, 181–224.
Joslin, S., 2007. Pricing and hedging volatility risk in fixed income markets. Working Paper.
Kim, D.H., 2007. Spanned stochastic volatility: a reexamination of the relative pricing between bonds and bond options. BIS Working Papers.
Leippold, M., Wu, L., 2002. Asset pricing under the quadratic class. The Journal of Financial and Quantitative Analysis 37, 271–295.
Li, H., Zhao, F., 2006.
Unspanned stochastic volatility: evidence from hedging interest rate derivatives. Journal of Finance 61, 341–378.
Liptser, R., Shiryaev, A., Aries, B., 2000. Statistics of Random Processes I. Springer.
Litterman, R., Scheinkman, J., 1991. Common factors affecting bond returns. Journal of Fixed Income 1, 54–61.
Longstaff, F.A., Santa-Clara, P., Schwartz, E.S., 2001. The relative valuation of caps and swaptions: theory and empirical evidence. Journal of Finance 56, 2067–2109.
Pan, J., 2002. The jump-risk premia implicit in options: evidence from an integrated time-series study. Journal of Financial Economics 63, 3–50.
Piazzesi, M., 2005. Bond yields and the federal reserve. Journal of Political Economy 113, 311–344.
Thompson, S.B., 2008. Identifying term structure volatility from the LIBOR-swap curve. The Review of Financial Studies 21, 819–854.
Umantsev, L., 2002. Econometric analysis of European LIBOR-based options within affine term-structure models. Ph.D. Thesis. Stanford University.
Journal of Econometrics 164 (2011) 45–59
A component model for dynamic correlations

Riccardo Colacito a, Robert F. Engle b, Eric Ghysels a,c,*

a Division of Finance, Kenan-Flagler Business School, University of North Carolina at Chapel Hill, United States
b Department of Finance, Stern School of Business, New York University, United States
c Department of Economics, University of North Carolina at Chapel Hill, Gardner Hall, CB 3305 Chapel Hill, NC 27599-3305, United States

Article history: Available online 17 March 2011
JEL classification: C53; C58
Keywords: Dynamic correlations; Forecasting; Mixed data sampling
Abstract: We propose a model of dynamic correlations with a short- and long-run component specification, by extending the idea of component models for volatility. We call this class of models DCC–MIDAS. The key ingredients are the Engle (2002) DCC model, the Engle and Lee (1999) component GARCH model replacing the original DCC dynamics with a component specification, and the Engle et al. (2006) GARCH–MIDAS specification that allows us to extract a long-run correlation component via mixed data sampling. We provide a comprehensive econometric analysis of the new class of models, and provide extensive empirical evidence that supports the model's specification. © 2011 Elsevier B.V. All rights reserved.
1. Introduction

Component models have been widely used for volatility dynamics.1 The motivation is typically based on either one of the following two arguments. First, the component structure allows for a parsimonious representation of complex dependence structures. Second, the components are sometimes linked to economic principles, namely the idea that there are different short- and long-run sources that affect volatility. The purpose of this paper is to propose a component model of dynamic correlations with a short- and long-run component specification.2 We call this class of models DCC–MIDAS, as the key ingredients are a combination of the Engle (2002) DCC model, the Engle and Lee (1999) component GARCH model to replace the original DCC dynamics with a component specification, and the Engle et al. (2006) GARCH–MIDAS component specification that allows us to extract a long-run correlation component via mixed data sampling.
1 Engle and Lee (1999) introduced a GARCH model with a long and short run component. Several others have proposed related two-factor volatility models, see e.g. Ding and Granger (1996), Gallant et al. (1999), Alizadeh et al. (2002), Chernov et al. (2003), Adrian and Rosenberg (2004), Christoffersen et al. (2008), among many others. Chernov et al. (2003) examine quite an exhaustive set of diffusion models for the stock price dynamics and conclude quite convincingly that at least two components are necessary to adequately capture the dynamics of volatility.
2 It should be noted that there have been several prior attempts to think of component models for correlations, see inter alia Karolyi and Stulz (1996). Our approach focuses on autoregressive conditional correlation models.
We address the specification, estimation and interpretation of correlation models that distinguish short and long run components, and we show that the two components of correlations are indeed very different. Dynamic correlations are a natural extension of the GARCH–MIDAS model to the Engle (2002) DCC model. The idea captured by the DCC–MIDAS model is similar to that underlying GARCH–MIDAS. In the latter case, two components of volatility are extracted, one pertaining to short term fluctuations, the other pertaining to a secular component. In the GARCH–MIDAS the short run component is a GARCH component, based on daily (squared) returns, that moves around a long-run component driven by realized volatilities computed over a monthly, quarterly or biannual basis. The MIDAS weighting scheme helps us extract the slowly moving secular component around which daily volatility moves. Engle et al. (2006) explicitly link the extracted MIDAS component to macroeconomic sources. It is the same logic that is applied here to correlations. Namely, the daily dynamics obey a DCC scheme, with the correlations moving around a long run component. Short-lived effects on correlations will be captured by the autoregressive dynamic structure of DCC, with the intercept of the latter being a slowly moving process that reflects the fundamental or long-run causes of time variation in correlation.3 To estimate the parameters of the DCC–MIDAS model we follow the two-step procedure of Engle (2002). We start by estimating the parameters of the univariate conditional volatility models.
3 In principle we can link the secular correlation component to macroeconomic sources, very much like Engle et al. (2006) and Schwert (1989), who study long historical time series and link volatility directly to various key macroeconomic time series.
The second step consists of estimating the DCC–MIDAS parameters with the standardized residuals. We also discuss the regularity conditions we need to impose on the MIDAS-filtered long run correlation component as models of correlations are required to yield positive definite matrices. The paper concludes with an empirical illustration, showing the benefits of the component specification. Empirical specification tests reveal the superior empirical fit, both in- and out-of-sample of the new class of DCC–MIDAS correlation models. The remainder of the paper is organized as follows. Section 2 introduces the correlation component model and compares the DCC–MIDAS class of models with original DCC models. Sections 3 and 4 cover regularity conditions and estimation, while Section 5 contains the empirical applications. Section 6 concludes the paper.
2. A new class of component correlation models

The purpose of this section is to introduce the class of DCC–MIDAS dynamic correlation models. In a first subsection we provide some preliminaries. The second subsection introduces the structure of DCC–MIDAS.

2.1. Notation and preliminaries

Consider a set of n assets and let the vector of returns be denoted as r_t = (r_{1,t}, …, r_{n,t})'. The novelty of our approach consists of describing the dynamics of conditional variances and correlations, where we take into account both short and long run components. The long run component at time t will be a judiciously chosen weighted average of historical correlations. The assumption is that the long run component can be filtered from empirical correlations. Of course, what is critical is the choice of weights, which will be one of the key ingredients of the model specification. To proceed, let us assume that the vector of returns r_t = (r_{1,t}, …, r_{n,t})' follows the process:

r_t ∼ i.i.d. N(µ, H_t),  H_t = D_t R_t D_t  (2.1)

where µ is the vector of unconditional means, H_t is the conditional covariance matrix and D_t is a diagonal matrix with standard deviations on the diagonal, and:

R_t = E_{t−1}[ξ_t ξ_t'],  ξ_t = D_t^{−1}(r_t − µ).  (2.2)

Therefore r_t = µ + H_t^{1/2} ξ_t with ξ_t ∼ i.i.d. N(0, I_n). At the outset, it should be noted that component models for correlations also prompt us to think about component models for volatility, which feed into the correlation specification. Indeed the decomposition of the conditional covariance matrix H_t = D_t R_t D_t appearing in Eq. (2.1), with D_t a diagonal matrix of standard deviations and R_t the conditional correlation matrix, suggests a two-step model specification (and estimation) strategy. Consequently, we will first specify D_t followed by R_t.

The univariate volatility models build on recent work by Engle et al. (2006), who proposed component models for volatility where long and short run volatility dynamics are separated. The new class of models is called GARCH–MIDAS, since it uses a mean reverting unit daily GARCH process, similar to Engle and Rangel (2005), and a MIDAS polynomial which applies to monthly, quarterly, or bi-annual macroeconomic or financial variables. In what follows we will refer to g_i and m_i as the short and long run variance components respectively for asset i. Engle et al. (2006) consider various specifications for g_i and we select only a specific one where the long run component is held constant across the days of the month, quarter or half-year. Alternatively, one can specify m_i based on rolling samples that change from day to day. The findings in Engle et al. (2006) show that they yield very similar empirical fits, so we opted for the simplest to implement, which involves locally constant long run components. We will denote by N_v^i the number of days that m_i is held fixed. The superscript i indicates that this may be asset-specific. The subscript v differentiates it from a similar scheme that will be introduced later for correlations. It will be convenient to introduce two time scales t and τ. In particular, while g_{i,t} moves daily, m_{i,τ} changes only once every N_v^i days. More specifically we assume that for each asset i = 1, …, n, univariate returns follow the GARCH–MIDAS process:

r_{i,t} = µ_i + √(m_{i,τ} · g_{i,t}) ξ_{i,t},  ∀t = τN_v^i, …, (τ + 1)N_v^i  (2.3)

where g_{i,t} follows a GARCH(1, 1) process:

g_{i,t} = (1 − α_i − β_i) + α_i (r_{i,t−1} − µ_i)² / m_{i,τ} + β_i g_{i,t−1}  (2.4)

while the MIDAS component m_{i,τ} is a weighted sum of K_v^i lags of realized variances (RV) over a long horizon:

m_{i,τ} = m̄_i + θ_i Σ_{l=1}^{K_v^i} φ_l(ω_v^i) RV_{i,τ−l}  (2.5)

where the realized variances involve N_v^i daily squared returns, namely:

RV_{i,τ} = Σ_{j=(τ−1)N_v^i+1}^{τN_v^i} (r_{i,j})².

Note that N_v^i could for example be a quarter or a month. The above specification corresponds to the block sampling scheme as defined in Engle et al. (2006), involving so-called Beta weights defined as:

φ_l(ω_v^i) = (1 − l/K_v^i)^{ω_v^i − 1} / Σ_{j=1}^{K_v^i} (1 − j/K_v^i)^{ω_v^i − 1}.  (2.6)

In practice we will consider cases where the parameters N_v^i and K_v^i are independent of i, i.e. the same across all series. Similarly, we can also allow for different decay patterns ω_v^i across various series, but once again we will focus on cases with common ω_v (see the next subsection for further discussion). Obviously, despite the common parameter specification, we expect that the m_{i,τ} substantially differ across series, as they are data-driven.

2.2. The class of DCC–MIDAS dynamic correlation models

Dynamic correlations are a natural extension of the GARCH–MIDAS model to the Engle (2002) DCC model. More specifically, we will introduce two components, a long-run and a short-run one. In the case of volatility we noted that m_{i,τ} can be formulated either via keeping it locally constant, or else based on a local moving window. Engle et al. (2006) find for volatility that the difference between the two appears to be negligible. For correlations, we have potentially the same choice. Since the trailing local specification is more general, we adopt this for our formulation. Namely, using the standardized residuals ξ_{i,t} it is possible to obtain a matrix Q_t whose elements are:

q_{i,j,t} = ρ̄_{i,j,t}(1 − a − b) + a ξ_{i,t−1} ξ_{j,t−1} + b q_{i,j,t−1}
ρ̄_{i,j,t} = Σ_{l=1}^{K_c^{ij}} φ_l(ω_r^{ij}) c_{i,j,t−l}  (2.7)
c_{i,j,t} = Σ_{k=t−N_c^{ij}}^{t} ξ_{i,k} ξ_{j,k} / √( Σ_{k=t−N_c^{ij}}^{t} ξ_{i,k}² · Σ_{k=t−N_c^{ij}}^{t} ξ_{j,k}² )

where the weighting scheme is similar to that appearing in (2.5). Note that in the above formulation of c_{i,j,t} we could have used simple cross-products of ξ_{i,t}. The normalization will allow us later to discuss regularity conditions in terms of correlation matrices. Correlations can then be computed as:

ρ_{i,j,t} = q_{i,j,t} / √(q_{i,i,t} q_{j,j,t}).

We regard q_{i,j,t} as the short run correlation between assets i and j, whereas ρ̄_{i,j,t} is a slowly moving long run correlation. Rewriting the first equation of system (2.7) as

q_{i,j,t} − ρ̄_{i,j,t} = a (ξ_{i,t−1} ξ_{j,t−1} − ρ̄_{i,j,t}) + b (q_{i,j,t−1} − ρ̄_{i,j,t})  (2.8)

conveys the idea of short run fluctuations around a time-varying long run relationship.

The idea captured by the DCC–MIDAS model is similar to that underlying GARCH–MIDAS. In the latter case, two components of volatility are extracted, one pertaining to short term fluctuations, the other pertaining to a secular component. In the GARCH–MIDAS the short run component is a GARCH component, based on daily (squared) returns, that moves around a long-run component driven by realized volatilities computed over a monthly, quarterly or bi-annual basis. The MIDAS weighting scheme helps us in extracting the slowly moving secular component around which daily volatility moves. Engle et al. (2006) explicitly link the extracted MIDAS component to macroeconomic sources. It is the same logic that is applied here to correlations. Namely, the daily dynamics obey a DCC scheme, with the correlations moving around a long run component. Short-lived effects on correlations will be captured by the autoregressive dynamic structure of DCC, with the intercept of the latter being a slowly moving process that reflects the fundamental or secular causes of time variation in correlation. In principle we can link the long run correlation component to macroeconomic sources, very much like Engle et al. (2006), who study long historical time series, similar to Schwert (1989), and link volatility directly to various key macroeconomic time series. Note that in Eq. (2.5) we can allow for different weighting schemes across series. Likewise, the specification in (2.7) can potentially accommodate weights ω_r^{ij}, lag lengths N_c^{ij} and span lengths of historical correlations K_c^{ij} that differ across any pair of series.4 Typically we will use a single setting common to all pairs of series, similar to the choice of a common MIDAS filter in the univariate models. We will discuss in the next subsection the implications of single versus multiple parameter choices for the DCC–MIDAS filtering scheme.

It is also worth noting that our DCC–MIDAS model shares features with the local dynamic conditional correlation (LDCC) model introduced in Feng (2007), where variances are decomposed into a conditional and a local (unconditional) part. The correlation structure is modeled by a multivariate nonparametric ARCH-type approach that accommodates the presence of regressors. On the subject of alternative specifications, we should also note that Engle and Lee (1999) show that one can build a simple component GARCH model from a GARCH(2, 2) model. One could therefore think of a simple component DCC model built from a DCC(2, 2) model. Such a model could be a natural benchmark as well. It has, however, a much more challenging structure. In particular, based on the analogy with higher order GARCH models we expect that the higher order dynamics for correlation matrices are quite involved. This is the reason why we opted for the component structure of DCC–MIDAS. Nevertheless, the analogy with Engle and Lee (1999) is certainly a worthwhile project for future research.

4 Note that (ω_r^{ij}, N_c^{ij}, K_c^{ij}) = (ω_r^{ji}, N_c^{ji}, K_c^{ji}) are identical for all i and j.
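Before turning to the matrix notation, note that the univariate building blocks, the Beta weights of Eq. (2.6) and the long-run component of Eq. (2.5), map directly into code. The sketch below is a minimal illustration (Python/numpy); the array `rv` of realized variances and the parameter values are hypothetical inputs, not estimates from the paper.

```python
import numpy as np

def beta_weights(K, omega):
    # Eq. (2.6): normalized Beta-polynomial weights, decaying in the lag l.
    l = np.arange(1, K + 1)
    w = (1.0 - l / K) ** (omega - 1.0)
    return w / w.sum()

def midas_long_run(rv, m_bar, theta, K, omega):
    # Eq. (2.5): m_tau = m_bar + theta * sum_l phi_l(omega) * RV_{tau-l}.
    phi = beta_weights(K, omega)
    m = np.full(len(rv), np.nan)
    for tau in range(K, len(rv)):
        # Reverse the trailing window so phi[0] multiplies RV_{tau-1}.
        m[tau] = m_bar + theta * phi @ rv[tau - K:tau][::-1]
    return m
```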
To conclude this subsection we fix some notation that will allow us to discuss the general model's specification. First, we will collect all the elements ω_r^{ij} into a vector ω_r, keeping in mind that it may only contain a single element if all weights are equal, and we denote N_c = max_{i,j} N_c^{ij}. We can then write and collect the set of correlations appearing in Eq. (2.7), yielding generically in matrix form:

R_t = (Q_t^*)^{−1/2} Q_t (Q_t^*)^{−1/2}  (2.9)
Q_t^* = diag Q_t  (2.10)
Q_t = (1 − a − b) R̄_t(ω_r) + a ξ_{t−1} ξ_{t−1}' + b Q_{t−1}  (2.11)

where

R̄_t(ω_r) = Σ_{l=1}^{K_c} Φ_l(ω_r) ⊙ C_{t−l}
C_t = diag(v_{1,t}, …, v_{n,t})^{−1/2} ( Σ_{k=t−N_c}^{t} ξ_k ξ_k' ) diag(v_{1,t}, …, v_{n,t})^{−1/2}
v_{i,t} = Σ_{k=t−N_c}^{t} ξ_{i,k}²,  ∀i = 1, …, n  (2.12)

where Φ_l(ω_r) = φ_l(ω_r) ιι' and ⊙ stands for the Hadamard product.5
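Putting Eqs. (2.7) and (2.9)–(2.12) together, one pass of the DCC–MIDAS correlation filter can be sketched as follows. This is an illustrative sketch, not the authors' estimation code: `xi` stands for the T×n numpy array of standardized residuals from first-step univariate fits, `beta_weights` is the helper sketched earlier, and the scalar parameters a, b, Nc, Kc, omega_r are hypothetical.

```python
import numpy as np

def beta_weights(K, omega):
    # Beta weights of Eq. (2.6), reused here for the correlation filter.
    l = np.arange(1, K + 1)
    w = (1.0 - l / K) ** (omega - 1.0)
    return w / w.sum()

def trailing_corr(xi, t, Nc):
    # c_{i,j,t}: correlation-normalized cross products over a trailing window.
    w = xi[t - Nc:t + 1]
    s = w.T @ w
    d = np.sqrt(np.diag(s))
    return s / np.outer(d, d)

def dcc_midas(xi, a, b, Nc, Kc, omega_r):
    T, n = xi.shape
    phi = beta_weights(Kc, omega_r)
    t0 = Nc + Kc + 1
    Q = trailing_corr(xi, t0 - 1, Nc)   # initialize Q at a correlation matrix
    R = np.full((T, n, n), np.nan)
    for t in range(t0, T):
        # Long-run component R_bar_t: MIDAS-weighted trailing correlations.
        R_bar = sum(phi[l - 1] * trailing_corr(xi, t - l, Nc)
                    for l in range(1, Kc + 1))
        # Short-run recursion, Eq. (2.11), and rescaling, Eq. (2.9).
        Q = (1 - a - b) * R_bar + a * np.outer(xi[t - 1], xi[t - 1]) + b * Q
        d = 1.0 / np.sqrt(np.diag(Q))
        R[t] = Q * np.outer(d, d)
    return R
```

Recomputing the trailing correlations inside the loop is wasteful but keeps the sketch close to the formulas; a production implementation would cache them.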
3. Estimation

To estimate the parameters of the DCC–MIDAS model we follow the two-step procedure of Engle (2002). We start by collecting the parameters of the univariate conditional volatility models into a vector Ψ ≡ [(α_i, β_i, ω_i, m̄_i, θ_i), i = 1, …, n] and the parameters of the conditional correlation model into Ξ ≡ (a, b, ω_r). Then the quasi-likelihood function QL can be written as:

QL(Ψ, Ξ) = QL₁(Ψ) + QL₂(Ψ, Ξ)
  ≡ − Σ_{t=1}^{T} ( n log(2π) + 2 log|D_t| + r_t' D_t^{−2} r_t )
  − Σ_{t=1}^{T} ( log|R_t| + ξ_t' R_t^{−1} ξ_t − ξ_t' ξ_t ).  (3.1)

Given the structure of the log likelihood function, namely the separation of QL(Ψ, Ξ) into QL₁(Ψ) and QL₂(Ψ, Ξ), we can first estimate the parameters of the univariate GARCH–MIDAS processes, i.e. the parameters in Ψ, using QL₁(Ψ) and therefore each single series separately, yielding Ψ̂. The second step consists of estimating the DCC–MIDAS parameters with the standardized
5 In a later section we will generalize this setup to allow for multiple MIDAS filters in the long-run dynamics of correlations.
residuals ξ̂_t = D̂_t^{−1}(r_t − µ̂), using QL₂(Ψ̂, Ξ). The estimation of the MIDAS polynomial parameters in the dynamic correlations requires some further discussion. The approach we adopt is inspired by the estimation of MIDAS polynomial parameters in the GARCH–MIDAS model. So far we were not very explicit about the choice of the polynomial characteristics K_v^i and N_v^i in Eq. (2.5) and the choice of K_c and N_c in Eq. (2.7). In the former case, i.e. the univariate GARCH–MIDAS models, K_v^i determines the number of lags spanned in each MIDAS polynomial specification for the long run component; the other choice is the time span N_v^i over which to compute the realized variances RV, which are weighted by the MIDAS polynomials. As pointed out by Engle et al. (2006), this amounts to model selection with a fixed parameter space, and therefore is achieved via profiling the likelihood function for various combinations of K_v^i and N_v^i. To determine the long run component of conditional correlations R̄_t we proceed in exactly the same way, namely we select the number of lags K_c for historical correlations and the time span N_c over which to compute the historical correlations in Eq. (2.7). The similarity between the two procedures is not surprising, given the fact that DCC models build extensively on the ideas of GARCH, and in both cases we have a MIDAS filter extracting a component which behaves like a time-varying intercept. We will provide an explicit discussion of the procedure in the empirical applications, given the similarity with Engle et al. (2006).

The asymptotic properties of the two-step estimator are discussed in Engle and Sheppard (2001), Comte and Lieberman (2003), Ling and McAleer (2003) and McAleer et al. (2008). These papers deal with fixed parameter DCC models. It is beyond the scope of the current paper to establish the asymptotic properties of the MLE estimator when the MIDAS stochastic intercept is present. Recent work by Dahlhaus and Subba Rao (2006) discusses general time-varying coefficient ARCH(∞) processes and regularity conditions for (local) MLE estimation. The GARCH–MIDAS and DCC–MIDAS processes are to a certain degree special cases of their setup. Namely, Dahlhaus and Subba Rao (2006) allow all parameters to vary and assume a nonparametric setting to capture the time-varying coefficients. This leads them, like Feng (2007), to consider kernel-based estimators. Our setting is parametric, as the MIDAS filter is a parametric specification, and therefore presumably simpler. We leave the regularity conditions that guarantee standard asymptotic results for the two-step estimation of DCC–MIDAS as an open question for future research. Ghysels and Wang (2007) provide a rigorous analysis of the ML estimation of the GARCH–MIDAS model. Their paper shows that the asymptotic theory is quite involved. One can easily extrapolate and infer from this that the DCC–MIDAS asymptotics are quite involved too. We therefore refrained from a detailed development of the asymptotics of the DCC–MIDAS model. However, we do cover in this paper the regularity conditions we need to impose on the MIDAS-filtered long run correlation component to obtain positive definite matrices. This is the topic of the next section.
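Eq. (3.1) separates just as cleanly in code. The sketch below evaluates the two pieces under stated assumptions: `r` and `mu` are the T×n returns and their means, `D` is the T×n array of fitted conditional standard deviations from the first step, `R` the T×n×n array of fitted correlation matrices from the second step, and `xi = (r - mu) / D` the standardized residuals; all names are illustrative.

```python
import numpy as np

def ql1(r, mu, D):
    # First-step quasi-likelihood:
    # -sum_t ( n*log(2*pi) + 2*log|D_t| + r_t' D_t^{-2} r_t ).
    n = r.shape[1]
    e = (r - mu) / D
    return -np.sum(n * np.log(2 * np.pi)
                   + 2 * np.log(D).sum(axis=1)
                   + (e ** 2).sum(axis=1))

def ql2(xi, R):
    # Second-step quasi-likelihood:
    # -sum_t ( log|R_t| + xi_t' R_t^{-1} xi_t - xi_t' xi_t ).
    out = 0.0
    for t in range(len(xi)):
        _, logdet = np.linalg.slogdet(R[t])
        out += logdet + xi[t] @ np.linalg.solve(R[t], xi[t]) - xi[t] @ xi[t]
    return -out
```

In the two-step procedure one would maximize `ql1` series by series over the GARCH–MIDAS parameters, then hold those fixed and maximize `ql2` over (a, b, ω_r).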
4. Regularity conditions

In this section, we turn our attention to the long run component and the choice of weights ω_r^{ij}, keeping the lag lengths N_c^{ij} and span lengths of historical correlations K_c^{ij} fixed across all pairs of series. Hence, we focus on the memory decay in the long run correlations.

4.1. Long-run dynamics

The first case to consider is the one of a common decay parameter ω_r independent of the pair of returns series selected. The covariance matrices can be shown to be positive definite under a relatively mild set of assumptions. When considering Eq. (2.11), it is apparent that the matrix Q_t is a weighted average of three matrices. The matrix R̄_t is positive semi-definite because it is a weighted average of correlation matrices. The matrix ξ_t ξ_t' is always positive semi-definite by construction. Therefore, if the matrix Q_0 is initialized to a positive semi-definite matrix, it follows that Q_t must be positive semi-definite at each point in time. The positive semi-definiteness of the covariance matrix can be guaranteed without putting any restriction on the structure of the conditional variance estimators for the individual return series. This means, for example, that it is possible to assume a different number of GARCH–MIDAS lags for each return.

The case of two or more weighting schemes is more involved and therefore becomes more interesting. Indeed it is not always the case that the matrix R̄_t(ω_r) is positive semi-definite for any choice of MIDAS parameters, and specific restrictions on the parameter space ought to be imposed. The goal of this section is to provide sufficient conditions under which the sequence of matrices {Φ_l}_{l=1}^{K_c} defined in (2.11) is positive semi-definite. Since the sequence of matrices {C_{t−l}}_{l=1}^{K_c} is positive semi-definite by construction, we can invoke the Schur product theorem to state that any matrix R̄_t(ω_r) = Σ_{l=1}^{K_c} Φ_l(ω_r) ⊙ C_{t−l} is positive semi-definite as well.6

In this section we examine the case of block matrices, whose dynamics can be accounted for by three parameters: the first two for the blocks of correlations and the third one for the off-diagonal correlations. The following three definitions set the stage for the kind of matrices that we deal with in this section.

Definition 1 (Diagonal MIDAS Block). Let Φ_l^D(n_a, ω_r^a) be a symmetric, square matrix of size n_a such that all elements off the diagonal are equal to φ_l(ω_r^a) and all elements on the main diagonal are ones.

Definition 2 (Off-Diagonal MIDAS Block). Let Φ_l^F(n_a, n_b, ω_r^c) be a matrix of size n_a × n_b such that all elements are equal to φ_l(ω_r^c).

Definition 3 (Block MIDAS Matrix). Let

Φ_l(ω_r^a, ω_r^b, ω_r^c, n_a, n_b) = [ Φ_l^D(n_a, ω_r^a)        Φ_l^F(n_a, n_b, ω_r^c)
                                     Φ_l^F(n_a, n_b, ω_r^c)'   Φ_l^D(n_b, ω_r^b) ]

be a block matrix with Φ_l^D(n_a, ω_r^a) and Φ_l^D(n_b, ω_r^b) defined as in Definition 1 and Φ_l^F(n_a, n_b, ω_r^c) defined as in Definition 2.

The following lemmas lead up to the main proposition of this section.

Lemma 1. The determinant of the matrix Φ_l in Definition 3 is equal to det(Φ_l) = det_A · det_BCAC where

det_A = det( Φ_l^D(n_a, ω_r^a) )  (4.1)
det_BCAC = det( Φ_l^D(n_b, ω_r^b) − Φ_l^F(n_a, n_b, ω_r^c)' [Φ_l^D(n_a, ω_r^a)]^{−1} Φ_l^F(n_a, n_b, ω_r^c) ).  (4.2)

Proof. See Appendix.

This lemma suggests that in order to ensure the positive definiteness of each weighting matrix we can focus separately on the conditions that make the determinant of the first block matrix positive and on those that make the determinant of the function of matrices defined in (4.2) positive.

6 For a proof of the Schur product theorem see Horn and Johnson (1985, Theorem 7.5.3).
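Lemma 1 is the standard Schur-complement factorization of a block determinant and is easy to sanity-check numerically; the sketch below uses small illustrative sizes and weights (all names and values are ours, chosen only for the check).

```python
import numpy as np

def phi_D(n, p):
    # Diagonal MIDAS block: ones on the diagonal, p off the diagonal.
    return (1 - p) * np.eye(n) + p * np.ones((n, n))

def phi_F(na, nb, p):
    # Off-diagonal MIDAS block: all entries equal to p.
    return p * np.ones((na, nb))

na, nb, pa, pb, pc = 3, 2, 0.05, 0.08, 0.03   # illustrative values
A = phi_D(na, pa)
F = phi_F(na, nb, pc)
B = phi_D(nb, pb)
Phi = np.block([[A, F], [F.T, B]])

det_A = np.linalg.det(A)
det_schur = np.linalg.det(B - F.T @ np.linalg.solve(A, F))
assert np.isclose(np.linalg.det(Phi), det_A * det_schur)   # Lemma 1
```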
We shall start by focusing on the first diagonal matrix.

Lemma 2. If φ_l(ω_r^a) ≤ 1 then all leading principal minors of Φ_l^D(n_a, ω_r^a) are non-negative.

Proof. See Appendix.

According to the previous lemma, the only condition that needs to be verified for the leading principal minors of Φ_l, up to the determinant of the first block matrix, to be positive is that φ_l(ω_r^a) is less than one. This condition is always satisfied when using MIDAS filters. The next two lemmas deal with the determinant of the function of sub-matrices defined in (4.2).

Lemma 3. If φ_l(ω_r^a) ≥ 0 and φ_l(ω_r^b) ≤ 1, the scalar

ζ( φ_l(ω_r^b), φ_l(ω_r^c), n_b ) = (1 − φ_l(ω_r^b))^{n_b−1} (1 + (n_b − 1) φ_l(ω_r^b)) − n_a n_b φ_l(ω_r^c)²  (4.3)

is always smaller than det_BCAC defined in (4.2).

Proof. See Appendix.

Lemma 4. If φ_l(ω_r^b) ≤ 1, the function ζ( φ_l(ω_r^b), φ_l(ω_r^c), n_b ) is always non-increasing in n_b.

Proof. See Appendix.

We are now ready to state the first of the two propositions of this sub-section.

Proposition 1. Let φ_l(ω_r^a) < 1, φ_l(ω_r^b) < 1, and φ_l(ω_r^c) < 1 and

(1 − φ_l(ω_r^b))^{n_b−1} (1 + (n_b − 1) φ_l(ω_r^b)) − n_a n_b φ_l(ω_r^c)² > 0.  (4.4)

Then the matrix Φ_l(ω_r^a, ω_r^b, ω_r^c, n_a, n_b) is positive definite.

Proof. Follows directly from Lemmas 1, 2 and 4.

The assumption that φ_l(ω_r^a) < 1, φ_l(ω_r^b) < 1, and φ_l(ω_r^c) < 1 is always verified when using MIDAS filters, since the weights are all positive and are forced to sum up to one. Therefore, it amounts to checking that Eq. (4.4) is satisfied ∀l to ensure that the weighting matrices are positive definite. This can amount to checking a non-negligible number of conditions in the case of a lengthy MIDAS polynomial. For example, in the empirical applications, we show that the likelihood is maximized for 144 lags. A more useful theorem can be stated that amounts to checking only the first one of the conditions above. Its proof is a direct consequence of the following lemma and of Proposition 1.

Lemma 5. Let φ_l(ω_r^b) > φ_{l+k}(ω_r^b) and φ_l(ω_r^c) > φ_{l+k}(ω_r^c) for some positive scalar k. Then

ζ( φ_l(ω_r^b), φ_l(ω_r^c), n_b ) ≤ ζ( φ_{l+k}(ω_r^b), φ_{l+k}(ω_r^c), n_b )

where the function ζ(·, ·, ·) is defined in (4.3).

Proof. See Appendix.

Proposition 2. Let ω_r^a, ω_r^b, and ω_r^c be the characteristic parameters of the MIDAS filters {φ_l(ω_r^j)}_{l=1}^{K_c}, ∀j = {a, b, c}, and let {Φ_l^D(n_a, ω_r^a), Φ_l^D(n_b, ω_r^b), Φ_l^F(n_a, n_b, ω_r^c)}_{l=1}^{K_c} be the associated sequences of block matrices according to Definitions 1 and 2. If

(1 − φ_1(ω_r^b))^{n_b−1} (1 + (n_b − 1) φ_1(ω_r^b)) − n_a n_b φ_1(ω_r^c)² > 0

then each matrix in the sequence {Φ_l(ω_r^a, ω_r^b, ω_r^c, n_a, n_b)}_{l=1}^{K_c} is positive definite.

Proof. Follows directly from Proposition 1, from Lemma 5 and from the fact that each element of a MIDAS filter is bounded above by one.

This proposition conveniently states that in order to ensure the positive-definiteness of the long-run correlation matrix, one only needs to check one simple condition involving the first terms of the MIDAS polynomial. This condition is quite easily satisfied. For example, assume we have two sets of 10 series each (i.e. n_a = n_b = 10): one that is better described by long memory dynamics (say ω_r^a = 2) and one that can be characterized as a short memory process (say ω_r^b = 15). Also assume that the cross-correlations can be described as a MIDAS-average of the two processes (i.e. ω_r^c = (2 + 15)/2). The length of the MIDAS polynomial is K_c = 144, a parameter that is shown to be optimal in the empirical application. Then the left-hand side of the condition of Proposition 2 is equal to 0.4047. This simple example documents that even in a 20 by 20 system, this parameterization is quite flexible in ensuring the positive definiteness of the resulting long-run correlation matrix.

The multiple-MIDAS filter cases that we analyze in the empirical section are all based on weighting matrices of the type:

Φ_l(ω_r^a, ω_r^b, ω_r^c, n_a, n_b) = [ 1            φ_l(ω_r^a)   φ_l(ω_r^c)
                                      φ_l(ω_r^a)   1            φ_l(ω_r^c)
                                      φ_l(ω_r^c)   φ_l(ω_r^c)   1          ]

where n_a = 2, n_b = 1, and K_c = 144. The condition of Proposition 2 boils down to 1 − 2φ_1(ω_r^c)² > 0, which is always satisfied for ω_r^c < 175. Since a decay factor larger than 20 or 30 is hardly ever reached, we can state that this kind of 3 by 3 system is always positive definite.

4.2. Short-run dynamics

In general it will also prove convenient to allow for multiple sets of parameters to describe the DCC part of the correlation dynamics. To do so, we consider a richer structure for the short-run dynamics. In particular, let us reconsider Eq. (2.11) without imposing the common parameters a and b across all asset combinations. In particular, let there be matrices of parameters G, A, and B such that the short-run dynamics are written as:

Q_t = G ⊙ R̄_t(ω_r) + A ⊙ ξ_{t−1} ξ_{t−1}' + B ⊙ Q_{t−1}.

The Hadamard products imply that we relax the common parameter restriction and potentially apply different parameters to the asset combinations. In this subsection we study the positive semi-definiteness of the DCC part of the system, by imposing restrictions on the matrices of parameters G, A, and B. We start with the case of three blocks of matrices.

Definition 4 (DCC Block Matrices). Let {a_i}_{i=1}^3 and {b_i}_{i=1}^3 be positive scalars. The DCC block matrices A, B and G are

A = [ a_1 · ι_{N_j}' ι_{N_j}   a_3 · ι_{N_j}' ι_{N_k}
      a_3 · ι_{N_k}' ι_{N_j}   a_2 · ι_{N_k}' ι_{N_k} ],
B = [ b_1 · ι_{N_j}' ι_{N_j}   b_3 · ι_{N_j}' ι_{N_k}
      b_3 · ι_{N_k}' ι_{N_j}   b_2 · ι_{N_k}' ι_{N_k} ],
G = ι_{N_j+N_k}' ι_{N_j+N_k} − A − B

where ι_{N_l} = [1 ⋯ 1] is a row vector of N_l ones, ∀l ∈ {j, k}.
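The block matrices just defined, and the conditions under which they are positive semi-definite (stated formally just below), can be checked mechanically. A minimal sketch follows; the parameter values are illustrative and chosen to satisfy a_1 a_2 ≥ a_3² and b_1 b_2 ≥ b_3².

```python
import numpy as np

def dcc_blocks(a, b, Nj, Nk):
    # a = (a1, a2, a3), b = (b1, b2, b3): block parameters of Definition 4.
    def block(c1, c2, c3):
        return np.block([[c1 * np.ones((Nj, Nj)), c3 * np.ones((Nj, Nk))],
                         [c3 * np.ones((Nk, Nj)), c2 * np.ones((Nk, Nk))]])
    A = block(*a)
    B = block(*b)
    G = np.ones((Nj + Nk, Nj + Nk)) - A - B
    return A, B, G

a, b = (0.02, 0.03, 0.02), (0.95, 0.94, 0.94)   # a1*a2 >= a3^2, b1*b2 >= b3^2
A, B, G = dcc_blocks(a, b, Nj=2, Nk=3)
for M in (A, B):
    # Positive semi-definiteness up to numerical tolerance.
    assert np.linalg.eigvalsh(M).min() >= -1e-12
```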
The following three assumptions are needed in order to prove the positive semi-definiteness of the correlations matrices arising from the system above. Assumption 1. Let {ai }3i=1 , and {bi }3i=1 be non-negative scalars and let 1 − ai − bi ≥ 0, ∀i = {1, 2, 3}. Assumption 2. Let a1 a2 − a23 ≥ 0 and b1 b2 − b23 ≥ 0. Assumption 3. The matrices G⊙Rt ωr are positive semi-definite.
Proposition 3. Let the conditions of Assumptions 1–3 be satisfied. Then the DCC block matrices, A, B, and G are positive semi-definite. Proof. Any principal minor that is the determinant of a sub-matrix of order equal or larger than three is zero, since it has at least two identical columns or rows. Any principal minor that is the determinant of a sub-matrix of order equal or larger than two is non-negative, because of Assumptions 2 and 3. The fact that all entries of the matrices are non-negative (by Assumption 1) concludes the proof. A special case of this specification is the Generalized-DCC model of Cappiello et al. (2006). Definition 5 (Generalized DCC Matrices). Let there be 2 pairs of DCC
2
parameters aj , bj j=1 . The generalized DCC matrices, Ag , Bg , and Gg are Ag = aa′ ,
Bg = bb′ ,
where
√ a1 · · · , a = ′
Nk
′
b =
b1
···, Nk
Gg = (ιι′ − aa′ − bb′ )
√ a2 · · ·
and
Nj
··· .
b2
5.1. Correlation dynamics of industry portfolios and a 10 year bond
Note that the first assumption is equivalent to the one needed to prove the existence of well-defined correlation matrices in the standard DCC model. We are now ready to state the main theorem of this sub-section.
new models we propose in the paper and uses the methodology proposed by Engle and Colacito (2006) pertaining to minimum variance portfolio management. The example focuses on asset allocation with multiple international equities (five international stock markets) and a single MIDAS filter. The section is divided in subsections which cover the various examples.
Nj
This specification satisfies Assumption 2 above with equality, but there is no specific reason why that should be the case and, in any event, this is a testable restriction. Therefore the DCC block structure provides a more flexible specification, that under mild assumptions delivers well-defined correlation matrices. 5. Empirical applications We start in a first subsection with some small scale empirical examples that explore the various issues that are typically encountered in empirical applications: (1) model specification for the long and short run dynamics of correlations, and (2) testing of models. We consider examples involving stocks and bonds.7 The equity portfolios are formed based on an industry classification.8 The second example focuses on the economic significance of the
7 Independently, Baele et al. (2010) used our DCC–MIDAS framework to study the economic sources of stock-bond return co-movements and their time variation. They find that macroeconomic fundamentals contribute little to explaining stock and bond return correlations but that other factors, especially liquidity proxies, play a more important role. They also document that macro factors are still important in fitting bond return volatility, whereas the variance premium is critical in explaining stock returns. 8 Data were downloaded from Kenneth French’s website and correspond to the 10 industry portfolio classification. Accordingly, each NYSE, AMEX, and NASDAQ stock is assigned to an industry portfolio based on its four-digit SIC code. We report the details on the specific portfolios that we employed in our empirical analysis in the Appendix.
In this subsection we turn our attention to a smaller set of assets. The purpose of this section is more illustrative in nature - as we address various issues that are typically encountered in empirical applications. We investigate the short and long run correlation dynamics of industry portfolios and a 10 year bond.9 The sample starts on 1971-07-15 and ends on 2006-06-30. We first address the issue of selecting the number of MIDAS lags. We follow a procedure suggested in Engle et al. (2006), since the GARCH–MIDAS and DCC–MIDAS class of models share similar model selection issues. A convenient property of both models, is that the lag selection of the MIDAS filter involves a fixed number of parameters. In particular, Engle et al. (2006) compare various GARCH–MIDAS models with different time spans via profiling of the likelihood function. We skip the details here as the procedure is described in their paper. The procedure can be summarized as selecting the smallest number of MIDAS lags after which the loglikelihoods of the volatilities seem to reach their plateau.10 Since we consider more than two assets we have the possibility that several long run MIDAS filters as well as multiple DCC parameters apply.11 We provide two examples involving three asset returns, one where a single MIDAS filter suffices, and another where there is clearly a need for two filters. The former involves two industries and a bond, namely Energy and Hi-Tech portfolios vs. 10 year bond. The results appear in Fig. 1 and Table 1. In Table 3 we report likelihood ratio tests for various nested model specifications involving separate parameters for the DCC dynamics and/or MIDAS filters. Each entry in the table represents the p-value for testing that the likelihood of the model of the column is significantly higher than the likelihood of the model on the corresponding row. The first row of the table documents that the baseline ‘‘one DCC–one MIDAS’’ model may not be enough to account for the dynamics of the system. The specifications with two sets of DCC parameters seem to yield significant lower likelihoods in all pairwise comparisons. It is also the case that adding an additional MIDAS parameter does not improve the model’s performance when a second pair of DCC parameters has already been added. This is always true with the only possible exception of the comparison between one and two MIDAS with two sets of non-generalized DCC parameters. The model with three distinct sets of DCC parameters does not appear to significantly improve the likelihood. These results convey that one MIDAS parameter is enough to account for the long-run dynamics of the system. The shortrun dynamics is the one needing a more flexible specification in this example. A model with two sets of DCC parameters and one MIDAS parameter seems to suffice to accurately describe the joint dynamics of the three assets. However, we cannot draw any conclusion as to whether we should employ the generalized or non-generalized DCC structure using the likelihood ratio tests, as the two models are non-nested. The two left-most columns of Table 2 show that the generalized specification with two sets of
9 As noted before, the Appendix reports the specifics of the industry definitions.
10 The details for the choice of MIDAS lags in this example are available upon request to the authors.
11 The Appendix reports the specifics and the names of the models that are being estimated in this subsection.
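As a concrete illustration of the lag-selection step described above, the following is a minimal sketch, assuming a generic profile_loglik routine that re-estimates the model for a given MIDAS lag length; the interface and the plateau tolerance are ours, for illustration only, and do not reproduce the authors' estimation code.

```python
import numpy as np

def select_midas_lags(profile_loglik, candidate_lags, tol=2.0):
    """Pick the smallest MIDAS lag length after which the profiled
    log-likelihood flattens out.

    profile_loglik : callable returning the maximized log-likelihood for a
                     given lag length (hypothetical interface)
    candidate_lags : increasing sequence of lag lengths to profile
    tol            : gain (in log-likelihood units) below which the profile
                     is treated as having reached its plateau
    """
    lls = np.array([profile_loglik(n) for n in candidate_lags])
    for i in range(len(candidate_lags)):
        # plateau: no later lag length improves on lag i by more than tol
        if np.all(lls[i:] - lls[i] < tol):
            return candidate_lags[i]
    return candidate_lags[-1]
```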
Table 1
Energy, hi-tech and 10 year bond.

             µ              α              β              θ              ω               m
Energy       0.070 (0.000)  0.087 (0.000)  0.804 (0.016)  0.199 (0.065)  12.602 (0.000)  0.545 (0.123)
Hi-Tech      0.063 (0.000)  0.087 (0.064)  0.837 (0.001)  0.186 (0.000)   9.997 (0.000)  0.726 (0.332)
Bond         0.022 (0.002)  0.059 (0.000)  0.915 (0.000)  0.204 (0.002)   3.090 (0.000)  0.284 (0.000)

             a              b              ω
DCC–MIDAS    0.018 (0.004)  0.977 (0.000)  1.683 (0.000)
DCC          0.016 (0.001)  0.981 (0.001)  –

Notes—The top panel reports the estimates of the GARCH–MIDAS coefficients for the Energy portfolio, Hi-Tech portfolio and 10 year bond. The bottom panel reports the estimates of the DCC–MIDAS and original DCC parameters. The number of MIDAS lags is 36 for the GARCH processes and 144 for the DCC process. The sample covers 1971-07-15 until 2006-06-30.
Fig. 1. Long and short run volatilities and correlations for the energy and hi-tech portfolios and the 10 year bond. The pictures on the main diagonal refer to conditional variances of energy and hi-tech portfolios and of 10 year bond and those on the off diagonal report conditional correlations among the same group of asset returns. In each diagonal panel the dark line refers to the long-run volatility and the light line represents the short-run volatility. In each off-diagonal panel the dark line is the long-run correlation and the light line is the total correlation.
DCC parameters and one MIDAS filter appears to perform better than its non-generalized counterpart according to the AIC and BIC criteria. The next and final example shows that this is not always the case. In Table 6 we report likelihood ratio tests for the 10 year bond combined with the Manufacturing and Retail industries.12 The parameter estimates for the single MIDAS filter appear in Table 4, whereas the multiple filter case appears in Table 5. The bottom part of the table shows that, when the MIDAS parameters are estimated in all possible permutations of bivariate systems, a value close to 7 appears to do the job for bond vs. manufacturing and for bond vs. retail. A decisively shorter memory achieves the maximum likelihood when it comes to accounting for the long run
12 We use the term Retail, although the industry is more broadly defined. As noted in the Technical Appendix, 'Retail' is defined as: Wholesale, Retail, and Some Services (Laundries, Repair Shops). SIC codes: 5000-5999, 7200-7299, 7600-7699.
dynamics of the correlation between retail and manufacturing. The top part of Table 5 shows that it can indeed be quite restrictive to force one MIDAS parameter to describe the long-run dynamics of all pairs of correlations. The introduction of an additional MIDAS parameter not only brings the outcome of the estimation closer to what is suggested by the analysis of the bivariate systems, but it also sizeably increases the log-likelihood. Table 6 shows that this increase is significant at the 1% level. The variances and correlations appear in Figs. 2 and 3, respectively. The former shows the single filter patterns and the latter shows the patterns with two distinct filters. We observe that the second filter clearly changes the long run component of the correlation across the two industries. When we employ the Generalized DCC–MIDAS model, we obtain the parameter estimates reported in Table 5. These estimates seem to confirm the need for a second MIDAS filter applied to the correlation between the manufacturing and the retail portfolios. Fig. 4 shows that the long-run correlations filtered
Table 2
Energy, hi-tech, and 10 year bond: multiple MIDAS and DCC.

                        a1              a2              a3              b1             b2             b3             ω1              ω2              Log-likelihood  AIC        BIC
DCC = 1, MIDAS = 1      0.018 (0.004)   –               –               0.977 (0.000)  –              –              1.683 (0.000)   –               −11,990.18      23,986.36  24,007.62
DCC = 1, MIDAS = 2      0.018 (0.000)   –               –               0.977 (0.001)  –              –              1.368 (0.031)   3.607 (0.022)   −11,989.05      23,986.09  24,014.44
DCC = 2, MIDAS = 1      0.021 (0.000)   0.017 (0.000)   –               0.976 (0.000)  0.976 (0.000)  –              1.708 (0.000)   –               −11,986.53      23,983.06  24,018.49
DCC = 2, MIDAS = 2      0.021 (0.000)   0.016 (0.000)   –               0.976 (0.000)  0.976 (0.000)  –              1.527 (0.000)   3.738 (0.000)   −11,984.18      23,980.35  24,022.87
DCC = 2 (G), MIDAS = 1  0.023 (0.031)   0.017 (0.028)   –               0.973 (0.028)  0.982 (0.003)  –              2.595 (0.046)   –               −11,985.16      23,980.31  24,015.74
DCC = 2 (G), MIDAS = 2  0.023 (0.023)   0.011 (0.031)   –               0.973 (0.027)  0.981 (0.018)  –              1.516 (0.102)   4.157 (0.022)   −11,983.59      23,979.19  24,021.70
DCC = 3, MIDAS = 1      0.021 (0.000)   0.013 (0.000)   0.016 (0.000)   0.977 (0.000)  0.976 (0.000)  0.977 (0.000)  1.942 (0.000)   –               −11,986.11      23,986.20  24,035.80
DCC = 3, MIDAS = 2      0.021 (0.000)   0.014 (0.000)   0.016 (0.000)   0.976 (0.000)  0.977 (0.000)  0.976 (0.000)  1.396 (0.000)   3.684 (0.000)   −11,984.01      23,984.00  24,040.69

Notes—Each row reports the estimated coefficients, the log-likelihood, and the Akaike and Schwarz information criteria for the Energy, Hi-Tech, and 10 year Bond Portfolio for an increasing number of DCC and MIDAS parameters. When multiple sets of parameters are used, the second DCC and/or MIDAS coefficients are applied only to the correlation of Energy vs. Hi-Tech, except for the case of the generalized DCC, in which the product of the two parameters also affects all other correlations. For the case of 3 DCC sets of parameters, the third coefficient is applied to the correlations of Energy and Hi-Tech vs. Bond. The number of MIDAS lags is 36 for the GARCH processes and 144 for the DCC process. The sample starts on 1971-07-15 and ends on 2006-06-30.
Table 3
Energy, hi-tech and 10 yrs bond: likelihood ratio tests.

                        DCC = 1    DCC = 1    DCC = 2    DCC = 2    DCC = 2 (G)  DCC = 2 (G)  DCC = 3
                        MIDAS = 1  MIDAS = 2  MIDAS = 1  MIDAS = 2  MIDAS = 1    MIDAS = 2    MIDAS = 1
DCC = 1, MIDAS = 1      –          –          –          –          –            –            –
DCC = 1, MIDAS = 2      0.132      –          –          –          –            –            –
DCC = 2, MIDAS = 1      0.026∗     –          –          –          –            –            –
DCC = 2, MIDAS = 2      0.007∗∗    0.008∗∗    0.030∗     –          –            –            –
DCC = 2 (G), MIDAS = 1  0.007∗∗    –          –          –          –            –            –
DCC = 2 (G), MIDAS = 2  0.004∗∗    0.004∗∗    –          –          0.077        –            –
DCC = 3, MIDAS = 1      0.086      –          0.651      –          0.388        –            –
DCC = 3, MIDAS = 2      0.030∗     0.039∗     0.167      0.839      0.511        0.666        0.040∗

Notes—Each entry represents the p-value for testing that the likelihood of the (larger) model on the row is significantly higher than the likelihood of the nested model in the corresponding column. When multiple sets of parameters are used, the second DCC and/or MIDAS coefficients are applied only to the correlation of Energy vs. Hi-Tech, except for the case of the generalized DCC, in which the product of the two parameters also affects all other correlations. For the case of 3 DCC sets of parameters, the third coefficient is applied to the correlations of Energy and Hi-Tech vs. Bond. The number of MIDAS lags is 36 for the GARCH processes and 144 for the DCC process. The sample starts on 1971-07-15 and ends on 2006-06-30.

Table 4
Manufacturing, shops, and 10 year bond.
               µ              α              β              θ              ω               m
Bond           0.022 (0.001)  0.059 (0.000)  0.914 (0.000)  0.204 (0.002)   3.090 (0.000)  0.284 (0.000)
Manufacturing  0.072 (0.000)  0.103 (0.000)  0.801 (0.078)  0.175 (0.019)  10.925 (0.005)  0.564 (0.082)
Shops          0.069 (0.001)  0.101 (0.002)  0.816 (0.000)  0.171 (0.003)  11.529 (0.001)  0.632 (0.013)

               a              b              ω
DCC–MIDAS      0.029 (0.000)  0.954 (0.000)  11.680 (1.215)
DCC            0.217 (0.000)  0.975 (0.000)  –
Notes—The top panel reports the estimates of the GARCH–MIDAS coefficients for the 10 year Bond, Manufacturing portfolio, and the Shops portfolio. The bottom panel reports the estimates of the DCC–MIDAS and of the original DCC parameters. The number of MIDAS lags is 36 for the GARCH processes and 144 for the DCC process. The sample covers 1971-07-15 until 2006-06-30.
using this specification appear to be a little smoother when the 10 year bond is one of the assets, compared to the results obtained under the previous specification. The low-frequency correlation of the two portfolios is instead a little noisier. Aside from these small differences, we take the results as confirming the need for multiple sets of correlation parameters.
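The p-values reported in Tables 3 and 6 can be reproduced from the log-likelihoods via a standard likelihood ratio test of nested specifications. Below is a minimal sketch under that assumption; the example reproduces the comparison of one versus two MIDAS filters with a single set of DCC parameters using the log-likelihoods from Table 5, where the second filter imposes a single extra parameter.

```python
from scipy.stats import chi2

def lr_pvalue(loglik_small, loglik_large, n_restrictions):
    """p-value of a likelihood ratio test of a nested specification against
    the larger model that nests it."""
    stat = 2.0 * (loglik_large - loglik_small)
    return chi2.sf(stat, df=n_restrictions)

# One vs. two MIDAS filters with a single set of DCC parameters (Table 5):
# the result matches the 0.003 entry reported in Table 6.
print(round(lr_pvalue(-9242.53, -9237.99, n_restrictions=1), 3))  # 0.003
```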
Table 5
Manufacturing, shops, and 10 year bond: multiple MIDAS and DCC.

                        a1              a2              a3              b1             b2             b3             ω1               ω2               Log-likelihood  AIC        BIC
DCC = 1, MIDAS = 1      0.029 (0.000)   –               –               0.954 (0.000)  –              –              11.68 (1.216)    –                −9242.53        18,491.06  18,512.32
DCC = 1, MIDAS = 2      0.030 (0.000)   –               –               0.950 (0.000)  –              –              7.585 (0.000)    26.300 (0.000)   −9237.99        18,484     18,512.32
DCC = 2, MIDAS = 1      0.0357 (0.022)  0.034 (0.025)   –               0.943 (0.002)  0.943 (0.003)  –              14.297 (0.001)   –                −9238.863       18,487.73  18,526.15
DCC = 2, MIDAS = 2      0.036 (0.000)   0.034 (0.000)   –               0.943 (0.000)  0.942 (0.001)  –              7.768 (0.000)    27.029 (0.000)   −9233.92        18,479.85  18,522.37
DCC = 2 (G), MIDAS = 1  0.034 (0.000)   0.031 (0.000)   –               0.932 (0.000)  0.955 (0.000)  –              16.818 (0.000)   –                −9240.41        18,490.82  18,526.25
DCC = 2 (G), MIDAS = 2  0.033 (0.062)   0.029 (0.132)   –               0.939 (0.011)  0.956 (0.028)  –              9.681 (0.113)    26.949 (0.111)   −9237.38        18,486.76  18,529.27
DCC = 3, MIDAS = 1      0.034 (0.000)   0.0341 (0.000)  0.033 (0.000)   0.943 (0.000)  0.955 (0.000)  0.948 (0.000)  12.371 (0.000)   –                −9233.81        18,481.63  18,531.23
DCC = 3, MIDAS = 2      0.035 (0.000)   0.034 (0.013)   0.033 (0.015)   0.941 (0.019)  0.952 (0.008)  0.945 (0.020)  8.253 (0.022)    24.663 (0.022)   −9230.61        18,477.20  18,533.89

Bivariate models         a              b              ω               Log-likelihood
Bond vs. manufacturing   0.038 (0.000)  0.946 (0.000)  7.037 (0.000)   −9733.577
Bond vs. shops           0.038 (0.001)  0.936 (0.003)  7.608 (0.000)   −9818.094
Manufacturing vs. shops  0.031 (0.001)  0.952 (0.000)  24.587 (0.000)  −6939.103

Notes—Each row of the top panel reports the estimated coefficients, the log-likelihood, and the Akaike and Schwarz information criteria for the Manufacturing, Shops, and 10 year Bond Portfolio for an increasing number of DCC and MIDAS parameters. When multiple sets of parameters are used, the second DCC and/or MIDAS coefficients are applied only to the correlation of Manufacturing vs. Shops, except for the case of the generalized DCC, in which the product of the two parameters also affects all other correlations. For the case of 3 DCC sets of parameters, the third coefficient is applied to the correlations of Manufacturing and Shops vs. Bond. The bottom panel reports the estimates of the DCC–MIDAS model for the three bivariate systems obtained from all possible permutations of Bond, Manufacturing, and Shops. The number of MIDAS lags is 36 for the GARCH processes and 144 for the DCC process. The sample starts on 1971-07-15 and ends on 2006-06-30.

Table 6
10 yrs bond, manufacturing and shops: likelihood ratio tests.
                        DCC = 1    DCC = 1    DCC = 2    DCC = 2    DCC = 2 (G)  DCC = 2 (G)  DCC = 3
                        MIDAS = 1  MIDAS = 2  MIDAS = 1  MIDAS = 2  MIDAS = 1    MIDAS = 2    MIDAS = 1
DCC = 1, MIDAS = 1      –          –          –          –          –            –            –
DCC = 1, MIDAS = 2      0.003∗∗    –          –          –          –            –            –
DCC = 2, MIDAS = 1      0.026∗     –          –          –          –            –            –
DCC = 2, MIDAS = 2      0.000∗∗    0.017∗     0.002∗∗    –          –            –            –
DCC = 2 (G), MIDAS = 1  0.12       –          –          –          –            –            –
DCC = 2 (G), MIDAS = 2  0.016∗     0.538      –          –          0.014∗       –            –
DCC = 3, MIDAS = 1      0.000∗∗    –          0.006∗∗    –          0.001∗∗      –            –
DCC = 3, MIDAS = 2      0.000∗∗    0.005∗∗    0.000∗∗    0.036∗     0.000∗∗      0.001∗∗      0.011∗

Notes—Each entry represents the p-value for testing that the likelihood of the (larger) model on the row is significantly higher than the likelihood of the nested model in the corresponding column. When multiple sets of parameters are used, the second DCC and/or MIDAS coefficients are applied only to the correlation of Manufacturing vs. Shops, except for the case of the generalized DCC, in which the product of the two parameters also affects all other correlations. For the case of 3 DCC sets of parameters, the third coefficient is applied to the correlations of Manufacturing and Shops vs. Bond. The number of MIDAS lags is 36 for the GARCH processes and 144 for the DCC process. The sample starts on 1971-07-15 and ends on 2006-06-30.
5.2. Portfolio choice and correlation dynamics

We perform a comparison of covariance matrix estimators by employing the methodology proposed by Engle and Colacito (2006). For convenience, we briefly summarize their approach. An investor chooses optimal portfolio weights for N securities in order to minimize the expected one day ahead portfolio variance, subject to the constraint that the portfolio weights add up to some scalar w:

min_{w_t} w_t′ H_t w_t   s.t. w_t′ ι = w,    (5.1)

where H_t is the estimated one-period ahead conditional covariance matrix and ι is an N × 1 vector of ones. Let the true conditional covariance matrix be denoted as Ω_t. An investor choosing optimal portfolio weights according to the estimated H_t would end up with the following amount of volatility:

σ_t = w √(ι′ H_t^{−1} Ω_t H_t^{−1} ι) / (ι′ H_t^{−1} ι).    (5.2)
An investor choosing portfolio weights with the knowledge of the true covariance matrix Ωt would achieve the following portfolio volatility for each unit of investment:
Fig. 2. Long and short run volatilities and correlations for the 10 year bond and Manufacturing and Retail portfolios. The pictures on the main diagonal refer to conditional variances of bond, manufacturing and retail and those on the off-diagonal report conditional correlations among the same group of asset returns. In each diagonal panel the dark line refers to the long-run volatility and the light line represents the short-run volatility. In each off-diagonal panel the dark line is the long-run correlation and the light line is the total correlation.
Fig. 3. Long and short run volatilities and correlations for the 10 year bond and Manufacturing and Retail portfolios with 2 MIDAS filters. The second MIDAS filter is applied to the correlation between the manufacturing and the retail portfolios. The pictures on the main diagonal refer to conditional variances of bond, manufacturing and retail and those on the off-diagonal report conditional correlations among the same group of asset returns. In each diagonal panel the dark line refers to the long-run volatility and the light line represents the short-run volatility. In each off-diagonal panel the dark line is the long-run correlation and the light line is the total correlation.
σ_t∗ / w = 1 / √(ι′ Ω_t^{−1} ι).    (5.3)

Engle and Colacito (2006) show that σ_t∗ ≤ σ_t for any suboptimal estimator of the conditional covariance matrix. In order to quantify the gains from superior covariance information, we can look at the ratio (σ_t − σ_t∗)/σ_t∗. This ratio is always nonnegative and can be interpreted as the percentage reduction in portfolio investment that could have been achieved by knowing the true covariance matrix.
Fig. 4. Long and short run volatilities and correlations for the 10 year bond and Manufacturing and Retail portfolios using the Generalized DCC–MIDAS with 2 sets of DCC parameters and 2 MIDAS filters. The second set of parameters is applied to the correlation between the manufacturing and the retail portfolios. The pictures on the main diagonal refer to conditional variances of bond, manufacturing and retail and those on the off-diagonal report conditional correlations among the same group of asset returns. In each diagonal panel the dark line refers to the long-run volatility and the light line represents the short-run volatility. In each off-diagonal panel the dark line is the long-run correlation and the light line is the total correlation.
For example, a number like 1% means that an investor wanting to allocate $100 million and choosing portfolio weights according to H_t could have saved $1 million in achieving the minimum variance portfolio. Suppose we have two alternative time series of conditional covariance matrices, one produced by a DCC estimator, H_t^{DCC}, and one produced by a DCC–MIDAS estimator, H_t^{DCC–M}. In each period, a set of one day ahead minimum variance portfolio weights is constructed based on each covariance matrix. Let the portfolio returns attained according to each estimator be denoted as

π_t^j = (w_t^j)′ r_t,   ∀j ∈ {DCC, DCC–M},

where r_t stands for the demeaned vector of asset returns. Let the difference of the squared returns on the two portfolios be denoted as

u_t = (π_t^{DCC})² − (π_t^{DCC–M})².
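As a concrete illustration, the following minimal sketch computes the minimum variance weights of (5.1), the volatility ratio implied by (5.2)–(5.3), and a simple t-statistic for the mean of u_t (the test described in the next paragraph). The function names are ours, and the statistic below omits the serial-correlation correction that a full Diebold and Mariano implementation would include.

```python
import numpy as np

def min_variance_weights(H, w=1.0):
    """Weights solving (5.1): minimize w_t' H_t w_t subject to w_t' iota = w."""
    iota = np.ones(H.shape[0])
    x = np.linalg.solve(H, iota)              # H^{-1} iota
    return w * x / (iota @ x)

def volatility_ratio(H, Omega, w=1.0):
    """Per-period ratio (sigma_t - sigma_t*)/sigma_t* implied by (5.2)-(5.3),
    given an estimated covariance H and the 'true' covariance Omega."""
    iota = np.ones(H.shape[0])
    Hi = np.linalg.solve(H, iota)
    sigma = w * np.sqrt(Hi @ Omega @ Hi) / (iota @ Hi)              # (5.2)
    sigma_star = w / np.sqrt(iota @ np.linalg.solve(Omega, iota))   # (5.3)
    return sigma / sigma_star - 1.0

def dm_tstat(pi_dcc, pi_dccm):
    """t-statistic for the null that the mean of
    u_t = (pi_t^DCC)^2 - (pi_t^DCC-M)^2 is zero."""
    u = np.asarray(pi_dcc) ** 2 - np.asarray(pi_dccm) ** 2
    return u.mean() / (u.std(ddof=1) / np.sqrt(u.size))
```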
The null hypothesis is that the portfolio variances of the two estimators are equal. This can be tested using the Diebold and Mariano (1995) procedure: by regressing u on a constant and correcting the covariance matrix for heteroskedasticity, the null of equal variances is simply a test that the mean of u is zero. We assume that both DCC and DCC–MIDAS employ a GARCH(1, 1) as their measure of the forecasted variances; this should make our tests more revealing about specific differences in the correlation estimators. We perform both an in-sample and an out-of-sample comparison of estimators, as opposed to Engle and Colacito (2006), who only test in-sample differences. We think that there is even more scope for an out-of-sample forecasting comparison, in light of the recent events that have affected the financial industry. Specifically, we are interested in two one day ahead forecasting exercises. The first one uses the sample up to January 2008 to estimate the models' parameters. The second one uses data up to September 2008 as the pre-sample.13 The two sample choices roughly reflect the information set of an investor at the onset of the Bear Stearns collapse and the Lehman Brothers bankruptcy, respectively. In our empirical investigation, we focus on international stock market returns in five countries: United States, United Kingdom, Japan, France, and Germany. Data are collected at the daily frequency over the January 1988–October 2009 sample period (5216 daily observations). All returns are expressed in US Dollars.14 We construct portfolios of different sizes by selecting all possible combinations of two, three, and four return series. Table 7 reports the results of our analysis. At least three key findings emerge. First, the DCC–MIDAS correlation outperforms the DCC estimator in a large majority of the cases. This is suggestive of the potential efficiency gains that can be obtained, even in the context of a myopic, short-horizon asset allocation exercise, by modeling the low frequency correlation dynamics. Second, efficiency gains are as large as 360 basis points in sample and 260 basis points out of sample, and the gains from better correlation information are consistently higher in sample than out of sample. These numbers are per se non
13 The parameters' estimates in the two sub-samples are held constant in forecasting all subsequent volatilities and correlations.
14 Data source is MSCI-Barra. Series names are MSDUFR, MSDUGR, MSDUJN, MSDUUK, MSDUUS. With the exception of Table 8, parameters' estimates are not reported in the paper due to space constraints, but they are available upon request to the authors.
Table 7
Diebold and Mariano tests on international portfolios.

Countries           In sample          9/1/2008-onward    1/1/2008-onward
US, UK              0.395 (−1.407)     0.097** (−2.047)   0.131*** (−2.631)
US, GER             0.13 (−0.564)      −0.043 (0.827)     −0.083** (2.274)
US, JPN             1.2*** (−2.377)    0.331*** (−2.647)  0.219*** (−2.914)
US, FRA             0.371 (−1.273)     −0.159** (2.409)   −0.104*** (3.762)
UK, GER             1.04 (−1.002)      −0.153 (0.809)     0.079 (−0.349)
UK, JPN             0.955*** (−3.106)  0.148 (−1.568)     0.152* (−1.792)
UK, FRA             −0.286 (1.204)     −0.095 (1.283)     0.198* (−1.778)
GER, JPN            1.167*** (−3.070)  0.302** (−2.002)   0.223* (−1.638)
GER, FRA            2.603 (−1.437)     2.326 (−1.144)     2.149 (−1.123)
JPN, FRA            1.166*** (−3.109)  0.214 (−1.419)     0.228* (−1.638)
US, UK, GER         0.735 (−1.231)     −0.64** (2.214)    −0.465** (2.422)
US, UK, JPN         2.023*** (−3.326)  0.514*** (−3.204)  0.409*** (−3.069)
US, UK, FRA         0.307 (−0.745)     −0.372*** (2.629)  −0.55 (0.355)
US, GER, JPN        1.719*** (−2.999)  0.295 (−1.608)     0.225 (−1.411)
US, GER, FRA        2.638 (−1.316)     1.484 (−0.958)     1.383 (−0.915)
US, JPN, FRA        2.036*** (−3.444)  0.699*** (−2.874)  0.477*** (−2.539)
UK, GER, JPN        1.527* (−1.877)    −0.152 (0.402)     0.168 (−0.646)
UK, GER, FRA        2.915 (−1.232)     1.871 (−1.292)     2.01 (−1.310)
UK, JPN, FRA        0.811** (−2.086)   −0.081 (0.623)     0.041 (−0.377)
GER, JPN, FRA       3.528** (−2.010)   2.626 (−1.175)     2.387 (−1.247)
US, UK, GER, JPN    2.246*** (−2.890)  −0.114 (0.505)     −0.03 (0.165)
US, UK, GER, FRA    2.824 (−1.373)     0.771 (−0.731)     1.138 (−0.967)
UK, GER, JPN, FRA   3.613** (−2.258)   0.958 (−0.864)     0.812 (−0.784)

% of DCC–MIDAS lower variance   95.65  60.87  78.26
Average DCC–MIDAS gain          1.55   0.42   0.49
Bivariate DCC–MIDAS gain        0.87   0.30   0.32
Tri-variate DCC–MIDAS gain      1.82   0.63   0.61
Four-variate DCC–MIDAS gain     2.89   0.54   0.64

Notes—Each column reports the percentage efficiency gains (losses) from using the DCC–MIDAS estimator, computed as the ratio of (5.2) and (5.3). The numbers in parentheses are t-stats for the associated Diebold and Mariano test. The last four lines summarize the percentage gains, by breaking them into portfolio sizes. * Significance at 10% confidence levels. ** Significance at 5% confidence levels. *** Significance at 1% confidence levels.

Table 8
Germany vs. France.
             a              b              ω
DCC–MIDAS    0.042 (0.000)  0.921 (0.000)  4.619 (0.000)
DCC          0.026 (0.005)  0.973 (0.005)  –
Notes—Estimates of the DCC–MIDAS and original DCC parameters for the system including Germany and France. The sample is 1988:1 to 2009:10.
negligible and even more so when compared to those reported in Engle and Colacito (2006). In their application, the largest benefits are obtained by comparing the DCC model with a constant unconditional estimate of volatilities and correlations. For the case of minimum variance portfolios, the efficiency gain is typically in the ballpark of 100 basis points or less. The DCC–MIDAS approach appears to improve substantially on their findings. Last, but not least, the gains are increasing in the size of the portfolio. In sample, there appears to be an efficiency improvement of about 100 basis points for each asset that is added to the analysis. Modeling the slowly moving components of correlations is therefore extremely important for large dimensional systems. The case of the correlation between the UK, Germany, and France is revealing of the potential source of the benefits from employing the DCC–MIDAS estimator. Fig. 5 reports the forecasted correlations for the two alternative estimators. As suggested by Engle and Colacito
Proof of Lemma 2. It follows directly from observing that any leading principal minor of Φ_l^D(n_a, ω_r^a) can be written as

det_k = (1 − ϕ_l(ω_r^a))^{k−1} (1 + (k − 1) ϕ_l(ω_r^a)),  ∀k ≥ 1.

Proof of Lemma 3. Since det(B_C A_C) can be factorized, using the block decomposition built from I_{n_a} and Φ_l^F(n_a, n_b, ω_r^c)′ Φ_l^D(n_a, ω_r^a)^{−1}, as

det(Φ_l^D(n_a, ω_r^a)) · det(Φ_l^D(n_b, ω_r^b) − Φ_l^F(n_a, n_b, ω_r^c)′ Φ_l^D(n_a, ω_r^a)^{−1} Φ_l^F(n_a, n_b, ω_r^c))
  = −n_a n_b ϕ_l(ω_r^c)^2 (1 − ϕ_l(ω_r^b))^{n_b−1} / (1 + (1 + n_a) ϕ_l(ω_r^a)) + (1 − ϕ_l(ω_r^b))^{n_b−1} (1 + (n_b − 1) ϕ_l(ω_r^b)),

it amounts to showing that

(1 − ϕ_l(ω_r^b))^{n_b−1} / (1 + (1 + n_a) ϕ_l(ω_r^a)) ≤ 1.

This is always the case since the numerator is always smaller than unity (because ϕ_l(ω_r^b) ≤ 1 by assumption) and the denominator is always larger than one (because ϕ_l(ω_r^a) ≥ 0 by assumption).

Proof of Lemma 4. To simplify the proof we slightly modify the notation for ζ(ϕ_l(ω_r^b), ϕ_l(ω_r^c), n_b) to ζ(n_b), suppressing the other arguments. Now, write ζ(n_b) as the difference

ζ(n_b) = p(n_b) − q(n_b),

where

p(n_b) = (1 − ϕ_l(ω_r^b))^{n_b−1} (1 + (n_b − 1) ϕ_l(ω_r^b)),
q(n_b) = n_a n_b ϕ_l(ω_r^c)^2.

The term −q(n_b) is trivially always decreasing in n_b. The term p(n_b) can be written as

p(n_b) = 1 + ϕ_l(ω_r^b)^2 Σ_{j=1}^{n_b−1} (−1)^j · j · (ϕ_l(ω_r^b) − 1)^{j−1},

with p(n_b = 1) = 1. The increments

p(n_b) − p(n_b − 1) = ϕ_l(ω_r^b)^2 (−1)^{n_b−1} · (n_b − 1) · (ϕ_l(ω_r^b) − 1)^{n_b−2}

are always negative, because if n_b is odd (even), the term (−1)^{n_b−1} is positive (negative), while the term (ϕ_l(ω_r^b) − 1)^{n_b−2} is negative (positive), since ϕ_l(ω_r^b) ≤ 1 by assumption.

Proof of Lemma 5. Similar to the previous proof, we slightly modify the notation for ζ(ϕ_l(ω_r^b), ϕ_l(ω_r^c), n_b) to ζ(ϕ_l(ω_r^b), ϕ_l(ω_r^c)), suppressing this time the n_b argument. Now, decompose ζ(ϕ_l(ω_r^b), ϕ_l(ω_r^c)) as

ζ(ϕ_l(ω_r^b), ϕ_l(ω_r^c)) = p(ϕ_l(ω_r^b)) − q(ϕ_l(ω_r^c)),

where

p(ϕ_l(ω_r^b)) = 1 + Σ_{j=1}^{n_b−1} [−ϕ_l(ω_r^b)^2 (−1)^j · j · ϕ_l(ω_r^b)^{j−1}] = 1 + Σ_{j=1}^{n_b−1} p_j(ϕ_l(ω_r^b)),
q(ϕ_l(ω_r^c)) = n_a n_b ϕ_l(ω_r^c)^2.

The term −q(ϕ_l(ω_r^c)) is trivially decreasing in ϕ_l(ω_r^c). For the other term:

p_j(ϕ_l(ω_r^b)) = −ϕ_l(ω_r^b)^2 (−1)^j · j · ϕ_l(ω_r^b)^{j−1} ≤ −ϕ_{i+k}(ω_r^b)^2 (−1)^j · j · ϕ_{i+k}(ω_r^b)^{j−1} = p_j(ϕ_{i+k}(ω_r^b)),  ∀j = 1, . . . , n_b − 1.

A.2. Details on industry portfolios classification
Data are downloaded from Kenneth French's web-site. The Energy, Manufacturing, Hi-Tech, and Retail portfolios that we use in the empirical section correspond to the collection of the following SIC codes:

1. Energy: Oil, Gas, and Coal Extraction and Products. SIC codes: 1200-1399, 2900-2999.
2. Manufacturing: Machinery, Trucks, Planes, Chemicals, Off Furn, Paper, Com Printing. SIC codes: 2520-2589, 2600-2699, 2750-2769, 2800-2829, 2840-2899, 3000-3099, 3200-3569, 3580-3621, 3623-3629, 3700-3709, 3712-3713, 3715-3715, 3717-3749, 3752-3791, 3793-3799, 3860-3899.
3. Hi-Tech: Computers, Software, and Electronic Equipment, Industrial controls, computer programming and data processing, Computer integrated systems design, computer processing, data prep, information retrieval services, computer facilities management service, computer rental and leasing, computer maintenance and repair, computer related services, R&D labs, research, development, testing labs. SIC codes: 3570-3579, 3622-3622, 3660-3692, 3694-3699, 3810-3839, 7370-7372, 7373-7373, 7374-7374, 7375-7375, 7376-7376, 7377-7377, 7378-7378, 7379-7379, 7391-7391, 8730-8734.
4. Retail: Wholesale, Retail, and Some Services (Laundries, Repair Shops). SIC codes: 5000-5999, 7200-7299, 7600-7699.

A.3. Summary of specifications

In the empirical section, we analyze the performance of several combinations of short- and long-run specifications for a number of 3 by 3 systems. To simplify the reading of the results, we summarize and label the models in this sub-section.

1. MIDAS = 1: the typical MIDAS correlation weighting matrix is
Φ_l(ω_1) = [ 1          ϕ_l(ω_1)   ϕ_l(ω_1)
             ϕ_l(ω_1)   1          ϕ_l(ω_1)
             ϕ_l(ω_1)   ϕ_l(ω_1)   1 ].

2. MIDAS = 2: the typical MIDAS correlation weighting matrix is

Φ_l(ω_1, ω_2) = [ 1          ϕ_l(ω_2)   ϕ_l(ω_1)
                  ϕ_l(ω_2)   1          ϕ_l(ω_1)
                  ϕ_l(ω_1)   ϕ_l(ω_1)   1 ].
Hence, the second MIDAS polynomial governs the correlation between the first two assets, while the first governs their correlations with the third asset.

3. DCC = 1: the short-run dynamics are governed by the scalars a_1 and b_1.
4. DCC = 2: the short-run dynamics are described by the following matrices:

A = [ a_2  a_2  a_1        B = [ b_2  b_2  b_1
      a_2  a_2  a_1              b_2  b_2  b_1
      a_1  a_1  a_1 ],           b_1  b_1  b_1 ].
5. DCC = 3: the short-run dynamics are described by the following matrices:

A = [ a_2  a_2  a_3        B = [ b_2  b_2  b_3
      a_2  a_2  a_3              b_2  b_2  b_3
      a_3  a_3  a_1 ],           b_3  b_3  b_1 ].

6. DCC = 2 (G): the short-run dynamics are described by the following matrices:

A = [ a_2         a_2         √(a_1 a_2)
      a_2         a_2         √(a_1 a_2)
      √(a_1 a_2)  √(a_1 a_2)  a_1 ],

B = [ b_1         b_1         √(b_1 b_2)
      b_1         b_1         √(b_1 b_2)
      √(b_1 b_2)  √(b_1 b_2)  b_2 ].
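For concreteness, the labeled specifications can be mapped into coefficient matrices as in the following sketch; it assumes the 3 by 3 layout above, and the function name and interface are ours, for illustration only.

```python
import numpy as np

def short_run_matrices(spec, a1, b1, a2=None, b2=None, a3=None, b3=None):
    """Build the 3x3 coefficient matrices (A, B) for the labeled
    short-run DCC specifications."""
    if spec == "DCC = 1":
        return np.full((3, 3), a1), np.full((3, 3), b1)
    if spec == "DCC = 2":
        A = np.array([[a2, a2, a1], [a2, a2, a1], [a1, a1, a1]])
        B = np.array([[b2, b2, b1], [b2, b2, b1], [b1, b1, b1]])
    elif spec == "DCC = 3":
        A = np.array([[a2, a2, a3], [a2, a2, a3], [a3, a3, a1]])
        B = np.array([[b2, b2, b3], [b2, b2, b3], [b3, b3, b1]])
    elif spec == "DCC = 2 (G)":
        ga, gb = np.sqrt(a1 * a2), np.sqrt(b1 * b2)   # geometric-mean cross terms
        A = np.array([[a2, a2, ga], [a2, a2, ga], [ga, ga, a1]])
        B = np.array([[b1, b1, gb], [b1, b1, gb], [gb, gb, b2]])
    else:
        raise ValueError(f"unknown specification: {spec}")
    return A, B
```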
References

Adrian, T., Rosenberg, J., 2004. Stock returns and volatility: pricing the short-run and long-run components of market risk. Working Paper.
Aielli, G.P., 2009. Dynamic conditional correlations: on properties and estimation. University of Florence Working Paper.
Alizadeh, S., Brandt, M.W., Diebold, F., 2002. Range-based estimation of stochastic volatility models. Journal of Finance 57 (3), 1047–1091.
Baele, L., Bekaert, G., Inghelbrecht, K., 2010. The determinants of stock and bond return comovements. Review of Financial Studies 23, 2374–2428.
Cappiello, L., Engle, R., Sheppard, K., 2006. Asymmetric dynamics in the correlations of global equity and bond returns. Journal of Financial Econometrics 4, 537–572.
Chernov, M., Gallant, R., Ghysels, E., Tauchen, G., 2003. Alternative models for stock price dynamics. Journal of Econometrics 116, 225–257.
Christoffersen, P., Jacobs, K., Ornthanalai, C., Wang, Y., 2008. Option valuation with long-run and short-run volatility components. Journal of Financial Economics 90, 272–297.
Comte, F., Lieberman, O., 2003. Asymptotic theory for multivariate GARCH processes. Journal of Multivariate Analysis 84, 61–84.
Dahlhaus, R., Subba Rao, S., 2006. Statistical inference for time-varying ARCH processes. Annals of Statistics 34, 1075–1114.
Diebold, F.X., Mariano, R., 1995. Comparing predictive accuracy. Journal of Business and Economic Statistics 13, 253–265.
Ding, Z., Granger, C., 1996. Modeling volatility persistence of speculative returns: a new approach. Journal of Econometrics 73, 185–215.
Engle, R., 2002. Dynamic conditional correlation—a simple class of multivariate GARCH models. Journal of Business and Economic Statistics 20, 339–350.
Engle, R., Colacito, R., 2006. Testing and valuing dynamic correlations for asset allocation. Journal of Business and Economic Statistics 24, 238–253.
Engle, R., Ghysels, E., Sohn, B., 2006. On the economic sources of stock market volatility. NYU and UNC Unpublished Manuscript.
Engle, R., Kelly, B., 2009. Dynamic equicorrelation. NYU Working Paper.
Engle, R., Lee, G., 1999. A permanent and transitory component model of stock return volatility. In: Engle, R.F., White, H. (Eds.), Cointegration, Causality, and Forecasting: A Festschrift in Honor of Clive W.J. Granger. Oxford University Press, pp. 475–497.
Engle, R., Rangel, J., 2005. The spline GARCH model for unconditional volatility and its global macroeconomic causes. Manuscript NYU and UCSD.
Engle, R., Sheppard, K., 2001. Theoretical and empirical properties of dynamic conditional correlation multivariate GARCH. Discussion Paper, UCSD.
Engle, R., Shephard, N., Sheppard, K., 2008. Fitting vast dimensional time varying covariance models. NYU and University of Oxford Working Paper.
Feng, Y., 2007. A local dynamic conditional correlation model. MPRA Paper No. 1592.
Gallant, A.R., Hsu, C.-T., Tauchen, G., 1999. Using daily range data to calibrate volatility diffusions and extract the forward integrated variance. Review of Economics and Statistics 81, 617–631.
Ghysels, E., Wang, F., 2007. Statistical inference for volatility component models. Technical Report. Discussion Paper, UNC.
Horn, R.A., Johnson, C.R., 1985. Matrix Analysis. Cambridge University Press.
Karolyi, G., Stulz, R., 1996. Why do markets move together? An investigation of US–Japan stock return comovements. Journal of Finance 51, 951–986.
Ling, S., McAleer, M., 2003. Asymptotic theory for a new vector ARMA-GARCH model. Econometric Theory 19, 280–310.
McAleer, M., Chan, F., Hoti, S., Lieberman, O., 2008. Generalized autoregressive conditional correlation. Econometric Theory 24, 1554–1583.
Schwert, G.W., 1989. Why does stock market volatility change over time? Journal of Finance 44, 1207–1239.
Journal of Econometrics 164 (2011) 60–78
Predictability of stock returns and asset allocation under structural breaks✩

Davide Pettenuzzo a, Allan Timmermann b,∗
a Bates White, LLC, and Brandeis University, United States
b University of California, San Diego and CREATES, United States

Article history: Available online 1 March 2011

Abstract
This paper adopts a new approach that accounts for breaks to the parameters of return prediction models both in the historical estimation period and at future points. Empirically, we find evidence of multiple breaks in return prediction models based on the dividend yield or a short interest rate. Our analysis suggests that model instability is a very important source of investment risk for buy-and-hold investors with long horizons and that breaks can lead to a negative slope in the relationship between the investment horizon and the proportion of wealth that investors allocate to stocks. Once past and future breaks are considered, an investor with medium risk aversion reduces the allocation to stocks from close to 100% at short horizons to 10% at the five-year horizon. Welfare losses from ignoring breaks can amount to several hundred basis points per year for investors with long horizons. © 2011 Elsevier B.V. All rights reserved.
1. Introduction

Stock market investors face a daunting array of risks. First and foremost is the component of stock returns that cannot be predicted by any model for the return generating process. This source of uncertainty is substantial, given the low predictive power of return forecasting models. Second, even conditional on a particular forecasting model, investors face parameter uncertainty, i.e., the effect of not knowing the true model parameters (Kandel and Stambaugh, 1996; Barberis, 2000). Third, investors do not know the state variables or functional form of the true return process and so face model uncertainty (Avramov, 2002; Cremers, 2002). This paper deals with a fourth source of uncertainty of particular importance to long-run investors, namely model instability, i.e., random changes or ''breaks'' to the parameters of the return generating process. Conventional practice in economics and finance is to compute forecasts conditional upon a maintained model whose parameters are assumed to be constant both throughout the historical sample and during future periods to which the forecasts apply. This
✩ Three anonymous referees made constructive and helpful suggestions on an earlier version of the paper. We thank Jun Liu, Ross Valkanov, Jessica Wachter and Mark Watson as well as seminar participants at the Rio Forecasting conference, University of Aarhus (CAF), University of Arizona, New York University (Stern), Erasmus University at Rotterdam, Princeton, UC Riverside, UCSD, Tilburg, and Studiencenter Gerzensee for helpful comments on the paper. Alberto Rossi provided excellent research assistance. Timmermann acknowledges support from CREATES, funded by the Danish National Research Foundation.
∗ Corresponding author. E-mail address: [email protected] (A. Timmermann).
doi:10.1016/j.jeconom.2011.02.019
procedure ignores that, over estimation samples that often span several decades, the relation between economic variables is likely to change. Instability in economic models could reflect institutional, legislative and technological change, financial innovation, changes in stock market participation, large macroeconomic (oil price) shocks and changes in monetary targets or tax policy.1 In the context of financial return prediction models, Merton's intertemporal CAPM suggests that time-variation in aggregate risk aversion can lead to changes in the relationship between expected returns and predictor variables tracking movements in market risk or investment opportunities.2 Instability in the relation between stock returns and predictor variables such as the dividend yield or short-term interest rates has been documented empirically in several studies. Pesaran and Timmermann (1995), Bossaerts and Hillion (1999), Lettau and Ludvigson (2001), Timmermann and Paye (2006), Ang and Bekaert (2007) and Goyal and Welch (2008) find substantial variation across subsamples in the coefficients of return prediction models and in the degree of return predictability.3 Building on this
1 For example, the introduction of SEC rule 10b–18 in November 1982 changed firms' ability to repurchase shares and thus could have changed firms' payout policy, in turn affecting the relation between stock returns and dividend yields. Examples of changes in the dynamics and predictive content of short-term interest rates include the Accord of 1951 and the monetarist experiment from 1979 to 1982.
2 Menzly et al. (2004) provide theoretical reasons for expecting time-variation in the relation between expected stock returns and predictor variables such as the dividend yield.
3 Studies such as Barsky (1989), Dimson et al. (2002), McQueen and Roley (1993) and Boyd et al. (2005) have found evidence of time-variations in the correlation between stock and bond returns or stock returns and economic news variables.
evidence, recent studies such as Dangl and Halling (2008) and Johannes et al. (2009) capture time-variation in return prediction models by assuming that some parameters follow a random walk and thus change every period. In this paper we focus instead on the effect of rare but large structural breaks as opposed to small parameter changes occurring every period. The distinction between rare, large breaks versus frequent, small breaks can be difficult to make in practice (Elliott and Mueller, 2006). However, our analysis allows us to pinpoint the most important times where the return prediction model undergoes relatively sharp changes, which provides insights into the interpretation of the economic sources of model instability. Sudden, sharp changes in model parameters are consistent with empirical findings by both Dangl and Halling (2008) and Johannes et al. (2009) that the change in the parameters of return predictability models at times can be large. By considering few, large breaks, our approach is close in spirit to Pastor and Stambaugh (2001) who consider breaks in the risk-return trade-off and Lettau and Van Nieuwerburgh (2008) who consider a discrete break to the steady state value of a single predictor variable (the dividend yield).

Our approach builds on Chib (1998), Pastor and Stambaugh (2001) and Pesaran et al. (2006) in adopting a change point model driven by an unobserved discrete state variable. Specifically, we generalize the univariate model in Pesaran et al. (2006) to a multivariate setting so instability can arise either in the conditional model used to forecast returns, in the marginal process generating the predictor variable(s) or in the correlation between innovations to the two equations. Forecasting returns in this model requires accounting for the probability and magnitude of future breaks. To this end, we introduce a meta distribution that characterizes how the parameters vary across different break segments. The resulting hierarchical model nests as special cases both a pooled scenario where the similarity between the parameters in the different regimes is very strong (corresponding to a narrow dispersion in the distribution of parameters across regimes) as well as a more idiosyncratic scenario where these parameters have little in common and can be very different (corresponding to a wide dispersion). Which of these cases is most in line with the data is reflected in the posterior meta distribution. The proposed model is very general and allows for uncertainty about the timing (dates) of historical breaks as well as uncertainty about the number of breaks and their magnitude. We also extend our setup to allow for uncertainty about the identity of the predictor variables (model uncertainty) using Bayesian model averaging techniques. Hence, investors are not assumed to know the true model or its parameter values, nor are they assumed to know the number, timing and magnitude of past or future breaks. Instead, they come with prior beliefs about the meta distribution from which current and future values of the parameters of the return model are drawn and update these beliefs efficiently as new data are observed.

Our empirical analysis investigates predictability of US stock returns in the context of two popular predictor variables, namely the dividend yield and the short interest rate. We find evidence of multiple breaks in return models based on either of these predictor variables in data covering the period 1926–2005.
Many of the break dates coincide with major events such as changes in the Fed’s operating procedures (1979, 1982), the Great Depression, the Treasury-Fed Accord (1951), and the growth slowdown following the oil price shocks in the early 1970s. Variation in model parameters across these regimes is found to be extensive. For example, the predictive coefficient of the dividend yield varies between zero and 2.6, while the coefficient of the T-bill rate varies even more, between −9.4 and 3.3, across break segments. Instability in model parameters is particularly important to investors’ long-run asset allocation decisions which crucially rely
on forecasts of future returns. Long investment horizons make it more likely that breaks to model parameters will occur and some of these breaks could adversely affect the investment opportunity set, thereby significantly increasing investment risks. Asset allocation exercises mostly assume that although the parameters of the return prediction model or the identity of the ''true'' model need not be known to investors, the parameters of the data generating process remained constant through time (e.g., Barberis (2000) and Pastor and Stambaugh (2009)). Studies that have allowed for time-varying model parameters such as Dangl and Halling (2008) and Johannes et al. (2009) only consider mean-variance investors with single-period investment horizons. Our focus is instead on the effect of model instability on the risks faced by investors with a long investment horizon.

Structural breaks are found to have a large effect on investors' optimal asset allocations. Moreover, our analysis suggests that model instability is a more important source of investment risk than parameter estimation uncertainty for investors with long horizons and that breaks can lead to a steep negative slope in the relationship between the investment horizon and the proportion of wealth that a buy-and-hold investor allocates to stocks.4 For example, in the model with predictability from the dividend yield but no breaks, the allocation to stocks rises from 40% at short horizons to 60% at the five-year horizon. Allowing for past and future breaks, the allocation to stocks instead declines from close to 100% at short horizons to 10% at the five-year horizon.

Ignoring model instability can also lead to large welfare losses, particularly for low-to-medium risk averse investors who use the dividend yield prediction model. At very short investment horizons where breaks are unlikely to occur, welfare losses are quite modest. However, as the investment horizon grows and the risk of breaks increases, losses from ignoring model instability can rise to several hundred basis points per annum in certainty equivalent returns. Welfare losses from ignoring breaks are more modest for more highly risk averse investors or investors who use the short interest rate to predict returns.

Our portfolio allocation results lend further credence to the finding in Pastor and Stambaugh (2009) that the long-run risks of stocks can be very high. In a model that allows for imperfect predictors and unknown but stable parameters of the data generating process, Pastor and Stambaugh find that the true per-period predictive variance of stock returns can be increasing in the investment horizon due to the combined effect of uncertainties about current and future expected returns (and their relationship to observed predictor variables) and estimation risk. While this finding is similar to ours, the mechanism is very different: Pastor and Stambaugh (2009) derive their results from investors' imperfect knowledge of current and future expected returns and model parameters, whereas model instability is the key driver behind our results.

The paper is organized as follows. Section 2 introduces the break point methodology and Section 3 presents empirical estimates for return prediction models based on the dividend yield or the short interest rate. Section 4 shows how investors' optimal asset allocation can be computed while accounting for past and future breaks. Section 5 considers asset allocations empirically for a buy-and-hold investor.
Section 6 proposes various extensions to our approach and Section 7 concludes. Technical details are provided in appendices at the end of the paper.
4 Consistent with our results, Johannes et al. (2009) also find that parameter estimation uncertainty has a smaller effect on the asset allocation than uncertainty about changes to model parameters. While Dangl and Halling (2008) find that estimation uncertainty plays a dominant role, they also report that uncertainty about time-variation in coefficients is important, particularly during periods with turmoil such as the early seventies. Barberis (2000) finds that estimation risk significantly affects investors’ long-run asset allocations, but this finding is based on a relatively short data sample.
2. Methodology

Studies of asset allocation under return predictability (e.g., Barberis (2000), Campbell and Viceira (2001), Campbell et al. (2003) and Kandel and Stambaugh (1996)) have mostly used vector autoregressions (VARs) to capture the relation between asset returns and predictor variables. We follow this literature and focus on a simple model with a single risky asset and a single predictor variable. This gives rise to a bivariate model relating returns (or excess returns) on the risky asset to a predictor variable, x_t. Empirically, the coefficients on the lagged returns are usually found to be small, so we follow common practice and restrict them to be zero. The resulting model takes the form

z_t = B′ x̃_{t−1} + u_t,    (1)
where z_t = (r_t, x_t)′, x̃_{t−1} = (1, x_{t−1})′, r_t is the stock return at time t in excess of a short risk-free rate, while x_t is the predictor variable and u_t ∼ IIDN(0, Σ), where Σ = E[u_t u_t′] is the covariance matrix. We refer to µ_r and µ_x as the intercepts in the equations for the return and the predictor variable, respectively, while β_r and β_x are the coefficients on the predictor variable in the two equations:

r_t = µ_r + β_r x_{t−1} + u_{rt},
x_t = µ_x + β_x x_{t−1} + u_{xt}.    (2)
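To fix notation, a minimal simulation sketch of system (2) with Gaussian innovations follows; the interface is ours, and any parameter values passed to it would be purely illustrative rather than estimates from the paper.

```python
import numpy as np

def simulate_system(T, mu_r, beta_r, mu_x, beta_x, Sigma, x0=0.0, seed=0):
    """Simulate T observations of excess returns r_t and predictor x_t
    from Eq. (2) with u_t ~ N(0, Sigma)."""
    rng = np.random.default_rng(seed)
    u = rng.multivariate_normal(np.zeros(2), Sigma, size=T)
    r, x = np.empty(T), np.empty(T + 1)
    x[0] = x0
    for t in range(T):
        r[t] = mu_r + beta_r * x[t] + u[t, 0]        # return equation
        x[t + 1] = mu_x + beta_x * x[t] + u[t, 1]    # predictor equation
    return r, x[1:]
```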
2.1. Predictive distributions of returns under breaks

Asset allocation decisions require the ability to evaluate expected utility associated with the realization of future payoffs on risky assets. This, in turn, requires computing expectations over the predictive distribution of cumulated returns during an h-period investment horizon [T, T + h] conditional on information available at the time of the investment decision, T, which we denote by Z_T. To compute the predictive distribution of returns while allowing for breaks, we need to make assumptions about the probability that future breaks occur, their likely timing as well as the size of such breaks. If more than one break can occur over the course of the investment horizon, we also need to model the distribution from which future regime durations are drawn. We next explain how this is done.

To capture instability in the parameters in Eq. (2), we build on the multiple change point model proposed by Chib (1998). Shifts to the parameters of the return prediction model are captured through an integer-valued state variable, S_t, that tracks the regime from which a particular observation of returns and the predictor variable, x_t, are drawn. For example, s_t = k indicates that z_t has been drawn from f(z_t | Z_{t−1}, Θ_k), where Z_{t−1} = {z_1, . . . , z_{t−1}} is the information set at time t − 1, while a change from s_t = k to s_{t+1} = k + 1 shows that a break has occurred at time t + 1. Location and scale parameters in regime k are collected in Θ_k = (B_k, Σ_k). Allowing for K breaks or, equivalently, K + 1 break segments, between t = 1 and t = T, our model takes the form of Eq. (3) as given in Box I. Here Υ_K = {τ_0, . . . , τ_K} is the collection of break points with τ_0 = 1, and the innovations u_t are assumed to be multivariate Gaussian with zero mean. Within each regime we decompose the covariance matrix, Σ_k, into the product of a diagonal matrix representing the standard deviations of the variables, diag(ψ_k), and a correlation matrix, Λ_k:
Σ_k = diag(ψ_k) × Λ_k × diag(ψ_k).    (4)
This specification allows the mean parameters, volatilities and correlations all to vary across regimes.5 We collect the regression
5 Allowing for time-variations in both first and second moments could be important in practice. In a model that allows for stochastic volatility, Johannes et al. (2009) find that the level of return volatility affects the signal-to-noise ratio of the return equation and therefore also affects investors’ ability to infer the underlying state and compute expected returns.
coefficients, error term variances and correlation parameters in Θ = (vec(B)_k, ψ_k, Λ_k)_{k=1}^{K+1}. The state variable S_t is assumed to be driven by a first order hidden Markov chain whose transition probability matrix is designed so that, at each point in time, S_t can either remain in the current state or jump to the subsequent state.6 The one-step-ahead transition probability matrix therefore takes the form
P = [ p_{1,1}  p_{1,2}  0        ...          0            0            ...
      0        p_{2,2}  p_{2,3}  ...          0            0            ...
      ...               ...      ...          ...
      0        ...      p_{K,K}  p_{K,K+1}    0            0            ...
      0        ...      0        p_{K+1,K+1}  p_{K+1,K+2}  0            ...
      0        ...      0        0            p_{K+2,K+2}  ...
      ...                                     ...          ...        ].    (5)
Here p_{k,k+1} = Pr(s_t = k + 1 | s_{t−1} = k) is the probability of moving to regime k + 1 at time t, given that we are in state k at time t − 1, so p_{k,k+1} = 1 − p_{k,k}. K is the number of breaks in the historical sample up to time T, so the (K + 1) × (K + 1) sub-matrix in the upper left corner of P, denoted p = (p_{1,1}, p_{2,2}, . . . , p_{K+1,K+1})′, describes possible breaks in the historical data sample {z_1, . . . , z_T}. The remaining part of P describes the break point dynamics over the future out-of-sample period from T to T + h.7 The special case without breaks corresponds to K = 0 and p_{1,1} = 1. The persistence parameters in Eq. (5) are assumed to be regime-specific. This means that regimes can differ in their expected duration: the closer p_{k,k} is to one, the longer the regime is expected to last. Furthermore, p_{k,k} is assumed to be independent of p_{j,j}, for j ≠ k, and is drawn from a beta distribution:

p_{k,k} ∼ Beta(a, b).    (6)
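A minimal sketch of the break dynamics implied by Eqs. (5) and (6) follows. The finite matrix slice and function names are ours: the matrix in (5) extends indefinitely, since new regimes can keep arriving over the out-of-sample period.

```python
import numpy as np

def transition_slice(p_diag):
    """Finite slice of the transition matrix in Eq. (5): each state k is
    retained with probability p_diag[k] or advances to k + 1 with
    probability 1 - p_diag[k]."""
    n = len(p_diag)
    P = np.zeros((n, n + 1))
    for k, p in enumerate(p_diag):
        P[k, k], P[k, k + 1] = p, 1.0 - p
    return P

def simulate_future_regimes(h, a, b, p_current, seed=0):
    """Simulate the regime path over a future horizon of h periods: a break
    occurs with probability 1 - p_{k,k}, and each new regime draws its
    persistence from the Beta(a, b) distribution of Eq. (6)."""
    rng = np.random.default_rng(seed)
    p, regime, path = p_current, 0, []
    for _ in range(h):
        if rng.random() > p:          # leave the current regime
            regime += 1
            p = rng.beta(a, b)        # persistence of the new regime
        path.append(regime)
    return np.array(path)
```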
This break model is quite different from the drifting coefficient (random walk) models studied by Dangl and Halling (2008) and Johannes et al. (2009). The latter are designed to obtain a good local approximation to parameter values at a given point in time, whereas our break model attempts to capture rare, but large shifts in parameter values that affect the return distribution, particularly at longer horizons.

2.2. Meta distributions

Because we are interested in forecasting future returns, we follow Pastor and Stambaugh (2001) and Pesaran et al. (2006) and adopt a hierarchical prior formulation, but extend those studies to allow for structural breaks in a multivariate setting.8 To this
6 Some studies assume that the parameters of the return equation are driven by a Markov switching process with two or three states, e.g., Ang and Bekaert (2002), Ang and Chen (2002), Guidolin and Timmermann (2008) and Perez-Quiros and Timmermann (2000). The assumption of a fixed number of states amounts to imposing a restriction that 'history repeats'. This approach is well suited to identify patterns in returns linked to repeated events such as recessions and expansions. It is less clear that it is able to capture the effects of institutional and technological changes over long spans of time. These are more likely to lead to genuinely new and historically unique regimes.
7 Following Chib (1998), estimation proceeds under the assumption of K breaks in the historical sample (1 ≤ t ≤ T). This assumption greatly simplifies estimation. We show later that uncertainty about the number of in-sample breaks can be integrated out using Bayesian model averaging techniques.
8 Bai et al. (1998) apply a deterministic procedure to detect breaks in multivariate time-series models and find that when break dates are common across equations, the resulting breaks are estimated more precisely. Their framework is not well suited for our purpose, however, since asset allocation exercises build on the predictive distribution of future returns and thus require modeling the stochastic process underlying the breaks.
z_t = B_1′ x̃_{t−1} + u_t,      E[u_t u_t′] = Σ_1,      for τ_0 ≤ t ≤ τ_1        (s_t = 1)
z_t = B_2′ x̃_{t−1} + u_t,      E[u_t u_t′] = Σ_2,      for τ_1 + 1 ≤ t ≤ τ_2    (s_t = 2)
  ...
z_t = B_k′ x̃_{t−1} + u_t,      E[u_t u_t′] = Σ_k,      for τ_{k−1} + 1 ≤ t ≤ τ_k    (s_t = k)
  ...
z_t = B_{K+1}′ x̃_{t−1} + u_t,  E[u_t u_t′] = Σ_{K+1},  for τ_K + 1 ≤ t ≤ T      (s_t = K + 1)    (3)

Box I.
end we assume that the location and scale parameters within each regime, (B_k, Σ_k), are drawn from common meta distributions that characterize the degree of similarity in the parameters across different regimes. Suppose for example that the mean parameters do not vary much across regimes but that the variance parameters do. This will show up in the form of a wide dispersion in the meta distribution for the scale parameters and a narrow dispersion in the meta distribution for the location parameters. The assumption that the parameters are drawn from a common meta distribution implies that data from previous regimes carry information relevant for current data and for the new parameters after a future break. By using meta distributions that pool information from different regimes, our approach makes sure that historical information is used efficiently in estimating the parameters of the current regime.

We next describe the meta distributions in more detail. We use a random coefficient model to introduce a hierarchical prior for the regime coefficients in Eqs. (3) and (4), {B_k, diag(ψ_k), Λ_k}. We assume that there is a single return series and, for generality, m − 1 predictor variables for a total of m equations in the prediction model (3), and further assume that the m² location parameters are independent draws from a normal distribution, vec(B)_k ∼ N(b_0, V_0), k = 1, . . . , K + 1, while the m error term precisions ψ_{k,i}^{−2} are independent and identical draws (IID) from a Gamma distribution, ψ_{k,i}^{−2} ∼ Gamma(v_{0,i}/2, v_{0,i} d_{0,i}/2), i = 1, . . . , m. Finally, the m(m − 1)/2 correlations, λ_{k,i,c}, are IID draws from a normal distribution, λ_{k,i,c} ∼ N(µ_{ρ,i,c}, σ²_{ρ,i,c}), i, c = 1, . . . , m, i < c, truncated so that the correlation matrix is positive definite, which in the two-equation model means that λ_{k,i,c} ∈ (−1, 1). Hence b_0, v_{0,i} and µ_{ρ,i,c} represent location parameters, while V_0, d_{0,i} and σ²_{ρ,i,c} are scale parameters of the three meta distributions.

The pooled scenario in which all parameters are identical across regimes and the case where the parameters of each regime are virtually unrelated can be seen as special cases that are nested in our framework. Which scenario most closely represents the data can be inferred from the estimates of the scale parameters of the meta distribution, V_0, d_{0,i} and σ²_{ρ,i,c}. To characterize the parameters of the meta distribution, we assume that9

b_0 ∼ N(µ_β, Σ_β),    (7)
V_0^{−1} ∼ W(V_β^{−1}, ν_β),    (8)

where W(·) is a Wishart distribution and µ_β, Σ_β, ν_β and V_β are prior hyperparameters that need to be specified. The hyperparameters v_{0,i} and d_{0,i} of the error term precision are assumed to follow an exponential and a Gamma distribution, respectively (George et al., 1993), with prior hyperparameters ρ_{0,i}, c_{0,i} and d_{0,i}:

v_{0,i} ∼ Exp(−ρ_{0,i}),    (9)
9 Throughout the paper we use underscore bars (e.g., a ) to denote prior 0 hyperparameters.
d_{0,i} ∼ Gamma(c_{0,i}, d_{0,i}).    (10)
Following Liechty et al. (2004), we specify the following distributions for the hyperparameters of the correlation matrix:

µ_{ρ,i,c} ∼ N(µ_{µ,i,c}, τ²_{i,c}),    (11)
σ_{ρ,i,c}^{−2} ∼ Gamma(a_{ρ,i,c}, b_{ρ,i,c}),    (12)

where again µ_{µ,i,c}, τ²_{i,c}, a_{ρ,i,c} and b_{ρ,i,c} are prior hyperparameters for each element of the correlation matrix. Finally, we specify a prior distribution for the hyperparameters a and b of the transition probabilities,

a ∼ Gamma(a_0, b_0),    (13)
b ∼ Gamma(a_0, b_0).    (14)

H = (b_0, V_0, v_{0,1}, d_{0,1}, . . . , v_{0,m}, d_{0,m}, µ_{ρ,1,2}, σ²_{ρ,1,2}, . . . , µ_{ρ,m−1,m}, σ²_{ρ,m−1,m}, a, b) collects the hyperparameters of the meta distribution.
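To illustrate the role of the meta distributions, the following sketch draws one regime's parameters conditional on the hyperparameters in H for the bivariate case; the interface, the rejection step for the truncated normal, and the reshape convention for vec(B) are all ours, for illustration only.

```python
import numpy as np

def draw_regime_parameters(b0, V0, v0, d0, mu_rho, sigma_rho, rng, m=2):
    """Draw one regime's (B_k, psi_k, Sigma_k) from the meta distributions
    above, conditioning on the hyperparameters rather than sampling them."""
    vec_b = rng.multivariate_normal(b0, V0)            # vec(B)_k ~ N(b0, V0)
    # psi_{k,i}^{-2} ~ Gamma(v0_i/2, v0_i d0_i/2); numpy parameterizes the
    # Gamma by shape and *scale*, hence the reciprocal rate below
    precision = rng.gamma(np.asarray(v0) / 2.0,
                          2.0 / (np.asarray(v0) * np.asarray(d0)))
    psi = 1.0 / np.sqrt(precision)
    lam = 2.0
    while not -1.0 < lam < 1.0:                        # truncation to (-1, 1)
        lam = rng.normal(mu_rho, sigma_rho)
    Lambda = np.array([[1.0, lam], [lam, 1.0]])
    Sigma = np.diag(psi) @ Lambda @ np.diag(psi)       # Eq. (4)
    return vec_b.reshape(m, m), psi, Sigma
```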
2.3. Likelihood function and approximate marginal likelihood

The likelihood function is obtained by extending to the hierarchical setting the approach proposed by Chib (1998). The likelihood function, evaluated at the posterior means of the regime-specific parameters, Θ∗, hyperparameters, H∗, and transition probabilities, p∗, is obtained from the decomposition

ln f(Z_T | Θ∗, H∗, p∗) = Σ_{t=1}^{T} ln f(z_t | Z_{t−1}, Θ∗, H∗, p∗),    (15)

where

f(z_t | Z_{t−1}, Θ∗, H∗, p∗) = Σ_{k=1}^{K+1} f(z_t | Z_{t−1}, Θ∗, H∗, p∗, s_t = k) × p(s_t = k | Z_{t−1}, Θ∗, H∗, p∗),    (16)

and f(z_t | Z_{t−1}, Θ∗, H∗, p∗, s_t = k) is the conditional density of z_t given the regime s_t = k, while

p(s_t = k | Z_{t−1}, Θ∗, H∗, p∗) = Σ_{l=k−1}^{k} p∗_{l,k} × p(s_{t−1} = l | Z_{t−1}, Θ∗, H∗, p∗).    (17)
(See Appendix A.) The following expression, which is proportional to the SIC, is used to compute an asymptotic approximation to the marginal likelihood for a model with K breaks, M_K:
p(M_K|Z_T) ∝ ln f(Z_T|Θ*, H*, p*) − (N_K × ln(T))/2,  (18)
where N_K is the number of parameters for model M_K. Approximate posterior model probabilities for models with up to K̄ breaks can be computed by exponentiating the approximate marginal likelihood for a model with K breaks and dividing by the sum of the corresponding terms across models with K = 0, ..., K̄ breaks.
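As a minimal sketch of this normalization step (our own illustration; the approximate marginal likelihood values are those reported later in the dividend yield panel of Table 2):

```python
import numpy as np

# Approximate log marginal likelihoods (SIC values) for K = 0,...,10 breaks,
# taken from the dividend yield panel of Table 2.
approx_ml = np.array([6164.4, 6739.7, 6737.6, 7041.9, 7112.0, 7166.8,
                      7170.5, 7163.7, 7180.0, 7169.8, 7170.0])

# Exponentiate and normalize, subtracting the max first for numerical stability.
w = np.exp(approx_ml - approx_ml.max())
post_prob = w / w.sum()
print(post_prob.round(5))  # puts ~0.9998 on the eight-break model, as in Table 2
```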
2.4. Prior elicitation
To the extent possible, the choice of priors in the break point model must be guided by economic theory and intuition. Here we explain the choices made for the baseline results. In Section 6 we conduct a sensitivity analysis to shed light on the importance of these choices. We impose two constraints on the parameters in the return prediction model, (3). First, to rule out explosive behavior in the predictor variable (and consequently in stock returns), we impose that β_x < 1. Second, we require the unconditional mean of the predictor variable in each state to be non-negative, i.e. µ_x/(1 − β_x) ≥ 0. This restriction has a very limited impact on the posterior distributions of the individual regime parameters. However, it is important when generating out-of-sample forecasts and helps eliminate economically non-sensible trajectories for the predictor variables.10
Starting with the prior hyperparameters for the mean of the regression coefficient, b_0, we set µ_β = [0, 0, 0, 0.9]′ and Σ_β = diag(sc, sc, sc, 1), where sc is a scale factor set to 1000 to reflect uninformative priors. Both predictor variables that we consider (the dividend yield and the T-bill rate) are highly persistent, so we specify a more informative prior for the autoregressive coefficient, β_x, and center it at 0.9. The hyperparameters for the prior variance of the regression coefficient, V_0, are set at ν_β = 2m + 2, V_β = diag(0.5, 5, 0.00001, 0.1) for the dividend yield specification and V_β = diag(0.1, 1000, 0.00001, 0.1) for the T-bill rate specification. This is sufficient to preserve the variation in the regression coefficients across regimes and ensures that the mean of the inverse Wishart distribution exists. The small variation in µ_x and the somewhat larger variation in µ_r reflect the high persistence of the predictor variables, i.e., β_x is close to one. Moving to the variance hyperparameters, we maintain uninformative priors and set c_{0,i} = 1, d_{0,i} = ϵ (the smallest number that Matlab can represent), and ρ_{0,i} = 1/ϵ in all equations, hence specifying a very large variance. For the correlation coefficient, λ_{j,1,2}, we use an uninformative prior, i.e. µ_{µ,1,2} = 0, τ²_{1,2} = 100, a_{ρ,1,2} = 1 and b_{ρ,1,2} = 0.01. Finally, we specify uninformative priors for the hyperparameters a and b of the transition probabilities p_{k,k} in Eq. (5), namely a_0 = 1 and b_0 = ϵ. By using uninformative priors for the hyperparameters governing the diagonal elements of the transition probability matrix, we allow the data to dictate the frequency of breaks.
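Collected in one place, the baseline settings read as follows. This is a hypothetical configuration dictionary assembled from the values above; the ordering of the four regression coefficients is our assumption, and machine epsilon stands in for the tiny constant ϵ mentioned in the text.

```python
import numpy as np

eps = np.finfo(float).eps  # stand-in for the tiny constant in the text

baseline_priors = {
    # location of regression coefficients; assumed order [mu_r, beta_r, mu_x, beta_x]
    "mu_beta": np.array([0.0, 0.0, 0.0, 0.9]),
    "Sigma_beta": np.diag([1000.0, 1000.0, 1000.0, 1.0]),  # sc = 1000
    "nu_beta": 2 * 2 + 2,                                  # 2m + 2 with m = 2
    "V_beta_dy": np.diag([0.5, 5.0, 0.00001, 0.1]),        # dividend yield model
    "V_beta_tbill": np.diag([0.1, 1000.0, 0.00001, 0.1]),  # T-bill rate model
    "c0": 1.0, "d0": eps, "rho0": 1.0 / eps,               # error-term precisions
    "mu_mu_12": 0.0, "tau2_12": 100.0,                     # correlation prior
    "a_rho_12": 1.0, "b_rho_12": 0.01,
    "a0": 1.0, "b0": eps,                                  # transition probabilities
}
```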
3. Breaks in return forecasting models: empirical results
Using the approach from Section 2, we next report empirical results for two commonly used return prediction models based on the dividend yield or the short interest rate.

Table 1
Summary statistics for the excess return, dividend yield and T-bill rate data used throughout the paper. Mean, standard deviation, coefficient of kurtosis, coefficient of skewness, minimum and maximum values are reported for each series.

Statistic        Excess returns   Dividend yield   Treasury bill
Mean                 0.0050          0.0408           0.0030
St. Deviation        0.0558          0.0170           0.0024
Kurtosis             7.9662          4.1017           0.9237
Skewness            −0.3788          1.1463           0.9317
Minimum             −0.3391          0.0108           8.3E−06
Maximum              0.3478          0.1536           0.0126
3.1. Data
Following common practice in the literature on predictability of stock returns, we use as our dependent variable the continuously compounded return on a portfolio of US stocks comprising firms listed on the NYSE, AMEX and NASDAQ in excess of the 1-month T-bill rate. Data are monthly and cover the period 1926:12–2005:12. All data are obtained from the Center for Research in Security Prices (CRSP). As forecasting variables we include a constant and either the dividend–price ratio – defined as the ratio between dividends over the previous twelve months and the current stock price – or the short interest rate measured by the 1-month T-bill rate. The dividend yield has been found to predict stock returns by many authors including Campbell and Shiller (1988), Keim and Stambaugh (1986) and Fama and French (1988). It has played a key role in the literature on the asset allocation implications of return predictability (Kandel and Stambaugh, 1996; Barberis, 2000). Due to its persistence and the large negative correlation between shocks to the dividend yield and shocks to stock returns, the dividend yield is known to generate a large hedging demand for stocks, particularly at long investment horizons. The short interest rate has also been found to predict stock returns (Campbell, 1987; Ang and Bekaert, 2002). Table 1 reports descriptive statistics for the three variables.
Before turning to the empirical results we briefly summarize the estimation setup. Both the dividend yield and T-bill rate models were estimated using a Gibbs sampler with 2500 draws, with the first 500 draws discarded to allow the sampler to achieve convergence.11 We performed a variety of MCMC convergence diagnostics, including autocorrelation estimates, the Raftery and Lewis (1992a,b, 1995) MCMC diagnostics, Geweke (1992) numerical standard errors and relative numerical efficiency estimates, and the Geweke chi-squared test comparing the means from the first and last part of the sample. We found very little evidence of autocorrelation in the Gibbs sampler draws. This is further confirmed by the thinning ratio estimates obtained from the Raftery and Lewis (1995) diagnostics, which were very close to unity. Finally, the Geweke chi-squared test of the means from the first 20% of the sample versus the last 50% confirmed that the Gibbs sampler has achieved an equilibrium state. Appendix A provides details of the Gibbs sampler used to estimate the return prediction model with multiple breaks.

3.2. Predictability from the dividend yield
Determining whether the return prediction models are subject to breaks and, if so, how many breaks the data support, is the first step in our analysis. For a given number of breaks, K, we get a new model, M_K, with its own set of parameters. For all values of K, the model parameters are estimated using MCMC methods, with states based on the posterior modes of the break point probabilities. Models can then be compared through their (approximate) marginal likelihood values described in Section 2.3. Table 2 provides such a comparison of models with different numbers of breaks by reporting the log-likelihood and the approximate marginal likelihood.

10 This restriction is easy to motivate economically for the predictor variables that we consider, i.e., the dividend yield and the T-bill rate, which are almost always positive. The restriction we use is softer than imposing the constraint that the value of x_t should be non-negative for all trajectories.
11 We used a Windows XP based server with 8 Xeon x5355 2.66 GHz processors and 32 Gigabytes of DRAM. The Gibbs sampler for the dividend yield model based on eight breaks finished in 35 min, while the Gibbs sampler for the T-bill model based on five breaks ran for 33 min.
Table 2
Model comparison and selection of the number of breaks in the return forecasting models. The table shows estimates of the joint log-likelihood for stock returns and the predictor variable (either the dividend yield or the T-bill rate), approximate marginal likelihood values and approximate posterior probabilities for models with different numbers of breaks, along with the posterior modes for the time of the break points. The top and bottom panels display results when the predictor for the excess return is the lagged dividend yield (panel I) and the lagged T-bill rate (panel II), respectively.

Number of breaks   Joint log lik.   Approx. marg. lik.   Posterior probability
I. Excess returns — dividend yield
0                  6188.4           6164.4               0.00000
1                  6866.7           6739.7               0.00000
2                  6892.1           6737.6               0.00000
3                  7223.9           7041.9               0.00000
4                  7321.4           7112.0               0.00000
5                  7403.8           7166.8               0.00000
6                  7434.9           7170.5               0.00008
7                  7455.6           7163.7               0.00000
8                  7499.3           7180.0               0.99984
9                  7516.6           7169.8               0.00004
10                 7544.3           7170.0               0.00005
II. Excess returns — treasury bill rate
0                  7858.2           7834.1               0.00000
1                  8046.6           7919.6               0.00000
2                  8306.2           8151.7               0.00000
3                  8337.2           8155.2               0.00000
4                  8550.8           8341.4               0.00029
5                  8586.5           8349.5               0.99971
6                  8598.2           8333.8               0.00000
7                  8620.9           8329.0               0.00000
8                  8653.0           8333.7               0.00000
Break locations (posterior modes): for the selected eight-break dividend yield model, May-33, Jun-40, Jul-51, May-58, May-74, Feb-86, Nov-88 and Jul-96; for the selected five-break T-bill rate model, Apr-34, Dec-51, Jun-69, Aug-79 and Oct-82.
We find strong support for structural breaks in the return prediction model based on the dividend yield. The approximate posterior odds ratios for the models with multiple breaks are all very high relative to a model with no breaks. Among models with up to ten breaks, an eight-break specification obtains a posterior probability weight of nearly one. Although eight breaks appear to be a large number, this is consistent with the evidence reported by Pastor and Stambaugh (2001) of 15 breaks in the equity premium over a sample (1834–1999) twice the period covered here. Return models that allow for breaks include a larger number of parameters than the conventional stable model, so one might be concerned that they overfit the data. This is not an issue here, however, since we select the break point specification by the SIC, which approximates the marginal likelihood. The marginal likelihood captures the models' out-of-sample prediction record and so penalizes an increase in the number of estimated parameters. For the model with eight breaks, Fig. 1 shows the time of the associated breaks. More precisely, for an interval around the posterior modes of the eight break dates, each diagram shows the posterior probability of there being one break. Most break dates are reasonably precisely identified in the form of either single spikes with probabilities exceeding one-half or narrow spans covering a few months. There are some exceptions to this, however, notably the break dated 1940, where the alternative date of 1943 also achieves a high posterior probability, and the break dated 1951, for which 1954 is an equally plausible date.12
12 Lettau and Van Nieuwerburgh (2008) find breaks in the mean of the dividend yield in 1954 and 1994. These are very similar to two of our break dates, namely 1951 and 1996, with differences likely to be attributed to uncertainty in the determination of the break dates (1954) and differences in the number of breaks allowed.
Fig. 1. Posterior probabilities for break point locations in the return prediction model with eight breaks based on the dividend yield.
Most of the break locations are associated with major events and occurred around the Great Depression (1933), World War II (1940), the Treasury-Fed Accord (1951), and the major oil price shocks of the early seventies and the resulting growth slowdown (1974). Some breaks are also associated with changes in price dynamics, e.g., the interval spanning the October 1987 stock market crash (1986 and 1988) and, more recently, the take-off accompanying the bull market of the nineties (1996). These break dates suggest that changes to the conditional equity premium are associated with events such as major wars, changes to monetary policy and important slowdowns in economic activity caused, e.g., by major supply shocks. Parameter estimates for the model with eight breaks (nine regimes) as well as the no-break model are reported in Table 3. Consistent with results in the empirical literature, full-sample estimates of the parameters in the return Eq. (2) with no breaks (shown in the first column) reveal a mean coefficient on the dividend yield that is positive but slightly less than two standard errors away from zero. Turning to the estimates of the break model, the mean of the dividend yield coefficient in the return equation ranges from a low of zero in the earliest sample (1926–1933) to 2.6 during the final period (1996–2005). The substantial time-variation in the coefficient of the dividend yield is consistent with the subsample estimates reported by Ang and Bekaert (2007). It is also consistent with the finding in Lettau and Van Nieuwerburgh (2008) that uncertainty over the magnitude of breaks is very large. The standard deviation parameter of the return equation also varies considerably across regimes, from a high of 10% per month around
the Great Depression to a low of only 3.1% per month from 1988–1996. The parameter estimates for the dividend yield equation show that this process is highly persistent in all regimes with a mean autoregressive parameter that varies from 0.94 to 0.97. The variance of the dividend yield is again highest in the first regime and becomes much lower after 1988. Correlation estimates for the innovations to stock returns and the lagged dividend yield are large and negative in all regimes with mean values ranging from −0.97 to −0.92. Transition probabilities are high with mean values that always exceed 0.97 and go as high as 0.992, corresponding to mean durations ranging from 40 to 140 months. To address how similar the parameters of the return equation are across regimes, information on the posterior estimates of the hyperparameters of the meta distribution from the MCMC output is provided in Table 4. To preserve space we only report the values of the parameters that are easiest to interpret. The parameter tracking the grand mean of the slope of the dividend yield in the return equation is centered on 0.92 with a standard deviation centered at 0.50, giving rise to a 95% credible set of [0.0, 2.0]. The autoregressive slope βx in the dividend yield equation is centered on a value of 0.95 with a much smaller standard deviation of only 0.03 and a 95% credible set of [0.88, 0.99]. Similarly, the hyperparameter tracking the correlation between shocks to returns and shocks to the dividend yield is centered on −0.94 with a modest standard deviation of 0.03. The posterior distributions of the hyperparameters of the transition probability, a0 and b0 , are surrounded by greater uncertainty as indicated by their relatively large standard deviations. This is consistent with the considerable difference in the duration of the various regimes identified by our model. These findings suggest that the greatest variability in parameters across regimes is associated with the coefficient of the lagged
dividend yield in the return equation, the volatility of stock returns and the duration of the regimes. There is considerably less uncertainty about the persistence of the dividend yield or the correlation between shocks to returns and shocks to the dividend yield.

3.3. Predictability from the short interest rate
Turning to the return model based on the short interest rate, Table 2 shows that a model with five breaks is strongly supported by the data. These breaks again appear around the time of major events such as the Great Depression (1934), the Fed-Treasury Accord (1951), the Vietnam War (1969) and the beginning and end of the change to the Fed's operating procedures (1979 and 1982). Fig. 2 shows the posterior probabilities surrounding the modes of the five break points. The break dates are quite precisely estimated as, in each case, the posterior probabilities define narrow ranges. The break dated 1969 is surrounded by the greatest uncertainty.
Parameter estimates for the return model with five breaks are displayed in Table 5. The mean of the coefficient on the lagged T-bill rate in the return equation varies significantly over time, ranging from −9.4 during the very volatile ''monetarist policy experiment'' from 1979 to 1982 to 3.3 during 1934–1951. Furthermore, the estimates of the slope on the T-bill rate within each regime are surrounded by large standard errors, particularly up to 1951. The process for the short interest rate is highly persistent, with the mean of the autoregressive coefficient ranging from a low of 0.94 to a high of 0.996. The correlation between shocks to returns and shocks to the T-bill rate varies much more across regimes than in the dividend yield model, ranging from a low of −0.47 during 1979–1982 to a high of 0.07 during 1926–34. These changes appear not simply to reflect random sample variations since the standard deviations of the correlations are mostly quite low. All states continue to be highly persistent, with mean transition probability estimates varying from 0.973 to 0.993, resulting in state durations between 40 and more than 160 months.
Turning finally to the meta distribution parameters for the T-bill rate model shown in Table 6, once again the chief source of uncertainty is the slope coefficient of the interest rate in the return equation. For example, b_0(β_r) has a mean of −2.8, a standard deviation of 5.9, and a very long 95% credible set that ranges from −14.6 to 9.1. Compared with the model based on the dividend yield, there is now also greater uncertainty about the correlation between shocks to returns and shocks to the T-bill rate, as indicated by the higher standard deviation of µ_ρ and the wide 95% credible set from −0.45 to 0.24.
Table 3
Parameter estimates for the return (r_t) forecasting model with eight break points, based on the lagged dividend yield (x_{t−1}) as a predictor variable: r_t = µ_{r_k} + β_{r_k} x_{t−1} + ϵ_{r_t}, ϵ_{r_t} ∼ N(0, σ²_{r_k}); x_t = µ_{x_k} + β_{x_k} x_{t−1} + ϵ_{x_t}, ϵ_{x_t} ∼ N(0, σ²_{x_k}); Pr(s_t = k|s_{t−1} = k) = p_{k,k}; corr(ϵ_{r_t}, ϵ_{x_t}) = ρ_{rx_k}, τ_{k−1} + 1 ≤ t ≤ τ_k, k = 1, ..., 9.

Regimes          Full sample  26–33    33–40    40–51    51–58    58–74    74–86    86–88    88–96    96–05
µr     Mean       −0.0033     0.0037   0.0052  −0.0289  −0.0236  −0.0355  −0.0284  −0.0291  −0.0263  −0.0377
       s.d.        0.0046     0.0139   0.0162   0.0163   0.0208   0.0159   0.0147   0.0282   0.0176   0.0168
βr     Mean        0.2018    −0.0153   0.0512   0.7042   0.8321   1.2070   0.6760   1.0804   1.1649   2.6278
       s.d.        0.1039     0.2683   0.3289   0.3030   0.5281   0.5009   0.3112   0.8222   0.5895   1.0458
σr     Mean        0.0558     0.1082   0.0806   0.0386   0.0378   0.0359   0.0455   0.0609   0.0308   0.0449
       s.d.        0.0013     0.0084   0.0087   0.0026   0.0041   0.0019   0.0028   0.0090   0.0050   0.0031
µx×100 Mean        0.0830     0.2028   0.2063   0.1605   0.1889   0.1815   0.1903   0.1816   0.1382   0.0503
       s.d.        0.0290     0.0818   0.0791   0.0637   0.0678   0.0545   0.0684   0.0711   0.0532   0.0265
βx     Mean        0.9790     0.9567   0.9571   0.9722   0.9484   0.9429   0.9591   0.9416   0.9472   0.9663
       s.d.        0.0066     0.0189   0.0185   0.0117   0.0172   0.0172   0.0146   0.0231   0.0191   0.0165
σx×100 Mean        0.3533     0.9016   0.4976   0.2305   0.1627   0.1208   0.2256   0.2114   0.0991   0.0717
       s.d.        0.0082     0.0707   0.0476   0.0182   0.0180   0.0061   0.0136   0.0327   0.0188   0.0047
ρrx    Mean       −0.8807    −0.9359  −0.9399  −0.9290  −0.9529  −0.9732  −0.9734  −0.9324  −0.9666  −0.9547
       s.d.        0.0074     0.0222   0.0207   0.0248   0.0217   0.0131   0.0131   0.0232   0.0160   0.0193
p      Mean        –          0.9857   0.9866   0.9895   0.9802   0.9920   0.9895   0.9763   0.9837   –
       s.d.        –          0.0110   0.0102   0.0081   0.0150   0.0061   0.0081   0.0189   0.0130   –
Table 4
Estimates of the parameters of the meta distribution that characterizes variation in the parameters of the return model across different regimes. The estimates are from a model with predictability of returns from the dividend yield and assume eight historical breaks. Within the kth regime the model is z_t = B′_k x_{t−1} + u_t, where z_t = (r_t, x_t)′ is the vector of stock returns and the predictor variable, and vec(B)_k ∼ N(b_0, V_0). ρ_k ∼ N(µ_ρ, σ²_ρ) is the correlation between shocks to the dividend yield and shocks to returns in the kth regime, while p_{k,k} ∼ Beta(a_0, b_0) is the probability of remaining in the kth regime.

Hyperparameters of meta distributions   Mean      s.d.       95% cred. set
I. Return equation: mean parameters
b0(µr)                                  −0.0248    0.0791    [−0.1780, 0.1331]
b0(βr)                                   0.9264    0.5009    [0.0029, 2.0354]
√(V0(µr))                                0.2296    0.0545    [0.1497, 0.3621]
√(V0(βr))                                1.1151    0.3246    [0.6512, 1.8838]
II. Dividend yield equation: mean parameters
b0(µx)×100                               0.1673    0.0508    [0.0761, 0.2752]
b0(βx)                                   0.9478    0.0309    [0.8806, 0.9953]
√(V0(µx))×100                            0.0688    0.0232    [0.0372, 0.1256]
√(V0(βx))                                0.1020    0.0234    [0.0674, 0.1577]
Correlation parameters
µρ                                      −0.9377    0.0341    [−0.9878, −0.8606]
Transition probability parameters
a0                                      34.7707   18.4461    [9.7272, 75.7877]
b0                                       0.8126    0.3663    [0.2995, 1.7041]

4. Asset allocation under structural breaks

Investors are concerned with instability in the return model because it affects future asset payoffs and therefore could alter their optimal asset allocation. To gauge the economic importance of structural breaks in the return model, we next study the optimal asset allocation under a range of alternative modeling assumptions. Consider a buy-and-hold investor with a horizon of h periods who at time T has power utility over terminal wealth, W_{T+h}, and coefficient of relative risk aversion, γ:
u(W_{T+h}) = W_{T+h}^{1−γ} / (1 − γ),   γ > 0.  (19)
Following Kandel and Stambaugh (1996) and Barberis (2000), we assume that the investor has access to a risk-free asset whose single-period return is denoted by rf ,T +1 , and a risky stock market portfolio whose return, measured in excess of the (single-period) risk-free rate, is denoted by rT +1 . All returns are continuously compounded. The risk-free rate is allowed to change every period.
Fig. 2. Posterior probabilities for break point locations in the return prediction model with five breaks based on the T-bill rate.
4.1. The asset allocation problem
Without loss of generality we set initial wealth at one, W_T = 1, and let ω be the allocation to stocks. Terminal wealth is then given by
W_{T+h} = (1 − ω) exp(Σ_{τ=1}^{h} r_{f,T+τ}) + ω exp(Σ_{τ=1}^{h} (r_{T+τ} + r_{f,T+τ})).  (20)
Subject to the no short-sale constraint 0 ≤ ω ≤ 0.99,13 the buy-and-hold investor solves the following problem
max_ω E_T [((1 − ω) exp(R_{f,T+h}) + ω exp(R_{s,T+h}))^{1−γ} / (1 − γ)],  (21)
where the cumulative h-period returns on stocks and the corresponding return from rolling over one-period T-bills are given by R_{s,T+h} = Σ_{τ=1}^{h} (r_{T+τ} + r_{f,T+τ}) and R_{f,T+h} = Σ_{τ=1}^{h} r_{f,T+τ}, and E_T is the conditional expectation given information at time T, Z_T. How this expectation is computed reflects the modeling assumptions made by the investor.

4.2. No breaks, no parameter uncertainty
First consider the asset allocation problem for an investor who ignores parameter estimation uncertainty and breaks. Once the predictor variables have been specified, the VAR parameters Θ = (B, Σ) can be estimated and the model can be iterated forward conditional on these parameter estimates. Collecting cumulative stock and T-bill returns in the vector R_{T+h} = (R_{s,T+h}, R_{f,T+h}), we can generate a distribution for future asset returns, p(R_{T+h}|Θ̂, S_{T+h} = 1, Z_T), where S_{T+h} = 1 shows that past and future breaks are ignored. The investor therefore solves the problem
max_ω ∫ u(W_{T+h}) p(R_{T+h}|Θ̂, S_{T+h} = 1, Z_T) dR_{T+h}.  (22)
Here we use that, from Eq. (20), the only part of W_{T+h} that is uncertain is R_{T+h}. This of course ignores that Θ is not known precisely but typically is estimated with considerable uncertainty.14
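A minimal sketch of how a problem like (22) can be solved numerically: replace the integral by a Monte Carlo average over simulated draws of cumulative returns from the chosen predictive distribution and grid-search over ω. The simulation inputs below are placeholders of our own, not the paper's estimates.

```python
import numpy as np

def optimal_weight(R_s, R_f, gamma, grid=np.linspace(0.0, 0.99, 100)):
    """Grid-search the stock weight maximizing expected power utility.

    R_s, R_f: arrays of simulated cumulative stock and T-bill returns
    (continuously compounded, horizon h), one element per predictive draw.
    """
    best_w, best_u = 0.0, -np.inf
    for w in grid:
        wealth = (1.0 - w) * np.exp(R_f) + w * np.exp(R_s)
        util = np.mean(wealth ** (1.0 - gamma) / (1.0 - gamma))
        if util > best_u:
            best_w, best_u = w, util
    return best_w

# Illustrative draws standing in for the predictive distribution of returns
rng = np.random.default_rng(1)
h = 12
R_f = np.full(10_000, h * 0.003)                               # rolled-over T-bills
R_s = R_f + rng.normal(h * 0.005, np.sqrt(h) * 0.056, 10_000)  # stocks
print(optimal_weight(R_s, R_f, gamma=5.0))
```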
13 We use an upper bound on stock holdings of ω = 0.99 to ensure that the expected utility is bounded, which might otherwise be a problem; see Geweke (2001), Kandel and Stambaugh (1996) and Barberis (2000).
4.3. No breaks with parameter uncertainty
Next, consider the decision of an investor who accounts for parameter estimation uncertainty but ignores both past and future breaks, i.e., assumes that S_{T+h} = 1. In the absence of breaks the posterior distribution π(Θ|S_{T+h} = 1, Z_T) summarizes the uncertainty about the parameters given the historical data sample.15 Integrating over this distribution leads to the predictive distribution of returns conditioned only on the observed sample (and not on any fixed Θ) and the assumption of no breaks prior to time T + h:

14 To be more precise, we could condition also on M_{Kx}, i.e. the return prediction model based on the predictor variable x and conditional on K historical breaks, with K = 0 here. The importance of M_{Kx} will become clear when we integrate out uncertainty about the number of breaks and the predictor variables.
15 Throughout the paper, π(·|Z_T) refers to posterior distributions conditioned on information contained in Z_T.
p(R_{T+h}|S_{T+h} = 1, Z_T) = ∫ p(R_{T+h}|Θ, S_{T+h} = 1, Z_T) × π(Θ|S_{T+h} = 1, Z_T) dΘ.  (23)
This investor therefore solves the asset allocation problem
max_ω ∫ u(W_{T+h}) p(R_{T+h}|S_{T+h} = 1, Z_T) dR_{T+h}.  (24)
Comparing stock holdings in Eqs. (22) and (24) gives a measure of the economic importance of parameter estimation uncertainty. However, both solutions ignore model instability.

Table 5
Parameter estimates for the return (r_t) forecasting model with five break points, based on the lagged T-bill rate (x_{t−1}) as a predictor variable: r_t = µ_{r_k} + β_{r_k} x_{t−1} + ϵ_{r_t}, ϵ_{r_t} ∼ N(0, σ²_{r_k}); x_t = µ_{x_k} + β_{x_k} x_{t−1} + ϵ_{x_t}, ϵ_{x_t} ∼ N(0, σ²_{x_k}); Pr(s_t = k|s_{t−1} = k) = p_{k,k}; corr(ϵ_{r_t}, ϵ_{x_t}) = ρ_{rx_k}, τ_{k−1} + 1 ≤ t ≤ τ_k, k = 1, ..., 6.

Regimes          Full sample  26–34    34–51    51–69    69–79    79–82    82–05
µr     Mean        0.0081     0.0000   0.0065   0.0250   0.0166   0.0902   0.0042
       s.d.        0.0030     0.0174   0.0052   0.0061   0.0169   0.0384   0.0066
βr     Mean       −1.0334     0.1414   3.2724  −7.0505  −3.6461  −9.3972   0.5025
       s.d.        0.7537     7.0858   8.7202   2.4093   3.3379   3.9141   1.4710
σr     Mean        0.0559     0.1083   0.0591   0.0339   0.0456   0.0470   0.0437
       s.d.        0.0013     0.0078   0.0029   0.0017   0.0029   0.0057   0.0019
µx×100 Mean        0.0025     0.0047   0.0007   0.0045   0.0151   0.0484   0.0031
       s.d.        0.0013     0.0033   0.0003   0.0022   0.0087   0.0319   0.0020
βx     Mean        0.9922     0.9668   0.9962   0.9886   0.9746   0.9405   0.9902
       s.d.        0.0034     0.0163   0.0034   0.0083   0.0176   0.0349   0.0048
σx×100 Mean        0.0295     0.0285   0.0038   0.0169   0.0328   0.1112   0.0181
       s.d.        0.0007     0.0021   0.0002   0.0009   0.0023   0.0140   0.0008
ρrx    Mean       −0.0793     0.0746   0.0181  −0.0336  −0.4010  −0.4736   0.0352
       s.d.        0.0131     0.0252   0.0269   0.0321   0.0280   0.0267   0.0239
p      Mean        –          0.9866   0.9930   0.9929   0.9891   0.9732   –
       s.d.        –          0.0109   0.0054   0.0059   0.0085   0.0225   –

Table 6
Estimates of the parameters of the meta distribution that characterizes variation in the parameters of the return model across different regimes. The estimates are from a model with predictability of returns from the T-bill rate and assume five historical breaks. Within the kth regime the model is z_t = B′_k x_{t−1} + u_t, where z_t = (r_t, x_t)′ is the vector of stock returns and the predictor variable, and vec(B)_k ∼ N(b_0, V_0). ρ_k ∼ N(µ_ρ, σ²_ρ) is the correlation between shocks to the T-bill rate and shocks to returns in the kth regime, while p_{k,k} ∼ Beta(a_0, b_0) is the probability of remaining in the kth regime.

Hyperparameters of meta distributions   Mean      s.d.       95% cred. set
I. Return equation: mean parameters
b0(µr)                                   0.0252    0.0524    [−0.0783, 0.1292]
b0(βr)                                  −2.7676    5.8697    [−14.6019, 9.1074]
√(V0(µr))                                0.1287    0.0361    [0.0787, 0.2198]
√(V0(βr))                               13.6488    4.1496    [8.3880, 24.0129]
II. T-bill equation: mean parameters
b0(µx)×100                               0.0201    0.0155    [0.0009, 0.0592]
b0(βx)                                   0.9497    0.0377    [0.8571, 0.9978]
√(V0(µx))×100                            0.0434    0.0139    [0.0257, 0.0783]
√(V0(βx))                                0.1239    0.0363    [0.0764, 0.2171]
Correlation parameters
µρ                                      −0.1250    0.1714    [−0.4541, 0.2360]
Transition probability parameters
a0                                      26.8935   14.4843    [6.4051, 60.5744]
b0                                       0.6732    0.3085    [0.2348, 1.3780]

4.4. Past and future breaks
Both past and future breaks matter for the investor's estimates of the future return distribution. The predictive density of returns conditional on K + 1 regimes having emerged up to time T can be computed by integrating over the parameters, π(Θ, H, p|S_T = K + 1, Z_T):
p(R_{T+h}|S_T = K + 1, Z_T) = ∫∫∫ p(R_{T+h}|Θ, H, p, S_T = K + 1, Z_T) × π(Θ, H, p|S_T = K + 1, Z_T) dΘ dH dp.  (25)
Appendix B explains the steps involved in obtaining draws from the predictive distribution of cumulative returns that account for possible future breaks. An investor who considers the uncertainty about out-of-sample breaks but conditions on K historical (in-sample) breaks therefore solves
max_ω ∫ u(W_{T+h}) p(R_{T+h}|S_T = K + 1, Z_T) dR_{T+h}.  (26)
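To illustrate the mechanics behind (25) and (26), here is our own schematic of the simulation logic described in the text and Appendix B, not the authors' code: along each predictive path the current regime is kept with the drawn diagonal transition probability and, upon a break, fresh regime parameters are drawn from the meta distribution. All numerical inputs are placeholders.

```python
import numpy as np

def simulate_cumulative_return(h, theta_now, meta_draw, p_stay, rng):
    """One draw of the h-period cumulative return allowing future breaks.

    theta_now: (mu, sigma) of the current regime; meta_draw(rng) returns
    fresh regime parameters from the meta distribution; p_stay is the draw
    of the diagonal transition probability for out-of-sample regimes.
    """
    mu, sigma = theta_now
    total = 0.0
    for _ in range(h):
        if rng.random() > p_stay:          # a new break occurs
            mu, sigma = meta_draw(rng)     # parameters drawn from meta distribution
        total += rng.normal(mu, sigma)     # one-period return
    return total

# Illustrative inputs, not the paper's estimates
rng = np.random.default_rng(2)
meta = lambda g: (g.normal(0.005, 0.01), abs(g.normal(0.05, 0.02)))
draws = [simulate_cumulative_return(60, (0.005, 0.04), meta, 0.98, rng)
         for _ in range(5000)]
print(np.mean(draws), np.std(draws))
```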
Expression (26) does not restrict the number of future breaks, nor does it take the parameters as known. It does, however, take the number of historical breaks as fixed and also ignores uncertainty about the forecasting model itself. We next relax these assumptions.

4.5. Uncertainty about the number of historical breaks
The predictive densities computed so far have conditioned on the number of in-sample breaks (K) by setting S_T = K + 1. This is of course a simplification since the true number of historical breaks is unknown. To deal with this, we use Bayesian model averaging to compute the predictive density of returns as a weighted average of the predictive densities conditional on different numbers of historical (in-sample) breaks. For each choice of the number of breaks, K, and predictor variable, x, we get a model M_{Kx} with predictive density p_{Kx}(R_{T+h}|S_T = K + 1, X = x, Z_T). Integrating over the number of breaks (but keeping the choice of predictor variables, x, fixed), the predictive density under the Bayesian model average is
p_x(R_{T+h}|Z_T) = Σ_{K=0}^{K̄} p_{Kx}(R_{T+h}|S_T = K + 1, X = x, Z_T) × p(M_{Kx}|Z_T),  (27)
where K̄ is an upper limit on the number of in-sample breaks. The weights used in the average are proportional to the posterior probability of model M_{Kx}, given by the product of the prior for model M_{Kx}, p(M_{Kx}), and the marginal likelihood, f(Z_T|M_{Kx}):
p(M_{Kx}|Z_T) ∝ f(Z_T|M_{Kx}) p(M_{Kx}).  (28)
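A sketch of how draws from the model-averaged predictive density in (27) can be generated, given per-model predictive draws and posterior model probabilities; the function and variable names are hypothetical, and the example inputs are synthetic.

```python
import numpy as np

def bma_predictive_draws(draws_by_model, post_probs, n_draws, rng):
    """Mix predictive draws across break point models M_K.

    draws_by_model: list of arrays, one array of cumulative-return draws
    per model K = 0,...,Kbar; post_probs: posterior model probabilities.
    """
    post_probs = np.asarray(post_probs) / np.sum(post_probs)
    # Sample a model index for each draw, then a return draw from that model
    models = rng.choice(len(draws_by_model), size=n_draws, p=post_probs)
    return np.array([rng.choice(draws_by_model[k]) for k in models])

# Synthetic example: three models with different predictive means
rng = np.random.default_rng(3)
draws_by_model = [rng.normal(0.05 * (k + 1), 0.2, 1000) for k in range(3)]
mixed = bma_predictive_draws(draws_by_model, [0.1, 0.2, 0.7], 5000, rng)
```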
4.6. Model uncertainty
In addition to not knowing the parameters of a given return forecasting model and not knowing the number of historical breaks, investors do not know the true identity of the predictor variables. This point has been emphasized by Pesaran and Timmermann (1995) and, more recently in a Bayesian setting, investigated by Avramov (2002) and Cremers (2002). These papers treat model uncertainty by considering all possible combinations of a large range of predictor variables. We follow this analysis by integrating across return prediction models based on the dividend yield and the short interest rate. This is simply an illustration of how to handle model uncertainty, and our analysis could be extended to a much larger set of variables. However, to keep computations feasible, we simply combine the return models based on these two predictor variables, in each case accounting for uncertainty about the number of past and future breaks:
p(R_{T+h}|Z_T) = Σ_{K_x^r=1}^{K̄_x^r} p_{K_x^r}(R_{T+h}|S_T = K + 1, M_{K_x^r}, Z_T) × p(M_{K_x^r}|Z_T).  (29)
Here p(M_{K_x^r}|Z_T) is proportional to the marginal likelihood of the return equation for the model with x as predictor variable(s) and K breaks, and K̄_x^r is the number of different combinations of predictor variables used to forecast returns.

5. Empirical asset allocation results

We next use the methods from Section 4 to assess empirically the effect of structural breaks on a buy-and-hold investor's optimal asset allocation. We use the Gibbs sampler to evaluate the predictive distribution of returns under breaks. Details of the numerical procedure used to compute the distributions are provided in the appendices.
Before moving to the results, it is worth recalling two important effects for asset allocation under return predictability from variables such as the dividend yield. First, the dividend yield identifies a mean-reverting component in stock returns, which means that the risk of stock returns grows more slowly than in the absence of predictability. Negative shocks to stock prices are bad news in the period when they occur but tend to increase subsequent values of the dividend yield and thus become associated with higher future expected stock returns, creating a hedging demand for stocks; see Campbell et al. (2003). Second, parameter estimation uncertainty reduces a risk averse investor's demand for stocks. For example, if new information leads the investor to revise downward his belief about mean stock returns shortly after the investment decision is made, this will affect expected returns along the entire investment horizon, similar to a permanent negative dividend shock.
In our break point model there is an interesting additional interaction between parameter estimation uncertainty and structural breaks. In the absence of breaks, parameter estimation uncertainty has a greater impact on returns in the sense that parameter values are fixed and not subject to change. The presence of breaks means that bad draws of the parameters of the return model will eventually cease to affect returns as they get replaced by new parameter values following future breaks. On the other hand, breaks to the parameters tend to lower the precision of current parameter estimates and thus increase the importance of parameter estimation uncertainty. Which effect dominates depends on the extent of the variability in the parameter values across regimes as well as on the average duration of the regimes.
An alternative way to evaluate the importance of return predictability is to conduct a recursive analysis as is done by
Dangl and Halling (2008), Johannes et al. (2009) and Wachter and Warusawitharana (2009). In fact, our finding of large breaks to the return prediction model is likely to be an important reason for the widely reported poor recursive forecasting performance of return prediction models. For example, Lettau and Van Nieuwerburgh (2008) find that a prediction model that allows for breaks to the steady state of the dividend yield produces better in-sample forecasts, although they also report that it is difficult to exploit such breaks in real time due to the uncertainty surrounding the magnitude of the shift in the mean dividend yield.

5.1. Results based on the dividend yield
Figs. 3 and 4 plot the allocation to stocks under the three scenarios discussed in Section 4, namely (i) no breaks, no parameter uncertainty; (ii) no breaks with parameter uncertainty; (iii) past and future breaks. The first two scenarios use full-sample parameter estimates. We compute the optimal weight on stocks under two values for the coefficient of relative risk aversion, namely γ = 5 and γ = 10. To be consistent with the results for the T-bill rate model reported in Tables 5 and 6, the asset allocation results assume that the short rate follows a persistent process that is subject to breaks.
Fig. 3 starts the dividend yield off from its value at the end of the sample (2005:12), which is 1.8%. Under the models that assume no breaks, and setting γ = 5, the weight on stocks rises from a level near 10% at short investment horizons to 30% at the five-year horizon. The assumed absence of breaks means that a very long data sample (1926–2005) is available for parameter estimation. This reduces parameter estimation uncertainty and leads to a weight on stocks that increases with the investment horizon. This interpretation is confirmed by the finding that stock holdings are very similar irrespective of whether parameter estimation uncertainty is accounted for. Allowing for past and future breaks, the weight on stocks starts out at 70% at the 1-month horizon and declines to a level below 10% at the five-year horizon. Parameter instability clearly dominates the increased demand for stocks induced by return predictability from the dividend yield. When risk aversion is increased to γ = 10, the weight on stocks declines uniformly to levels close to half their values for γ = 5.
The resulting stock allocations seem low but are also affected by the assumed initial value of the dividend yield which, at 1.8%, is close to its historical minimum. To demonstrate this point, Fig. 4 shows the allocation to stocks when the initial value of the dividend yield is set at its sample mean of 4.1%. Comparing Figs. 3 and 4, the level of the optimal stock holding appears to be quite sensitive to the initial value of the dividend yield. The allocation to stocks under the no-break models now starts close to 40% at short investment horizons and increases to nearly 60% at the five-year horizon. Stock holdings under the model that accounts for breaks start just below 100% at the 1-month horizon but rapidly decline as the investment horizon is expanded, reaching a level close to 10% at the five-year horizon. Stock holdings are reduced considerably if γ is raised from five to ten, but the qualitative findings remain the same. These findings suggest that the allocation to stocks is generally increasing in the horizon if breaks in the return prediction model based on the dividend yield are ignored.
If past and future breaks are considered, we generally see a strongly declining allocation to stocks, the longer the investment horizon. Parameter instability in our model has a much larger effect on a buy-and-hold investor's optimal asset allocation than parameter estimation uncertainty. This can be seen by comparing the full-sample (no-break) plots in Figs. 3 and 4 with and without estimation error. In both cases these are very similar. This is to be expected since investors have access to 80 years of data.
Fig. 3. Optimal Asset Allocation as a function of the investment horizon for a buy-and-hold investor with power utility over terminal wealth, U(W_{T+h}) = W_{T+h}^{1−γ}/(1 − γ), where h is the forecast horizon and γ is the coefficient of relative risk aversion. The calculations use the dividend yield as a predictor variable. The panels show percentage allocations to stocks plotted against the investment horizon measured in months under the assumption that the dividend yield is set at its value at the end of the sample, i.e., 1.8%. The upward sloping curves track stock allocations under no breaks, while the downward sloping curves allow for past and future breaks.
This conclusion is related to the analysis by Pastor and Stambaugh (2009) who find that the variance of stock returns can increase more than in proportion with the forecast horizon due to a combination of uncertainty about current and future expected returns and estimation risk. We also find that the per-period variance of stock returns increases with the investment horizon in the dividend yield model. A closely related property is that, in the standard model with no return predictability, the annualized Sharpe ratio should be constant. Introducing return predictability from the dividend yield, but ignoring breaks, means that the annualized Sharpe ratio becomes increasing in the investment horizon. For the no-break dividend yield model, we find that the mean annualized Sharpe ratio is 9% higher at the 36-month horizon and 17% higher at the 60-month horizon, compared with the mean 12-month ratio. Conversely, allowing for breaks to the parameters of this model, we find that the mean of the 36-month and 60-month annualized Sharpe ratios are 10% and 20% lower, respectively, than the mean of the 12-month ratio. These numbers
help explain why long-run investors find it less attractive to hold stocks under the model that allows for breaks.16

5.2. Results based on the short interest rate
Optimal stock holdings under the return prediction model based on the T-bill rate, set at its value at the end of the sample of 3.8%, are shown in Fig. 5. When γ = 5, the allocation to stocks is flat at 40% under the no-break model, irrespective of whether parameter estimation uncertainty is considered. To see why, note that while shocks to the dividend yield and stock returns are strongly negatively correlated and thus give rise to a hedging demand for stocks that grows with the investment horizon, shocks
16 The downward sloping Sharpe ratios are consistent with findings by van Binsbergen et al. (2010) of Sharpe ratios that decrease in the maturity of dividend strips.
Fig. 4. Optimal Asset Allocation as a function of the investment horizon for a buy-and-hold investor with power utility over terminal wealth, U(W_{T+h}) = W_{T+h}^{1−γ}/(1 − γ), where h is the forecast horizon and γ is the coefficient of relative risk aversion. The calculations use the dividend yield as a predictor variable. The panels show percentage allocations to stocks plotted against the investment horizon measured in months under the assumption that the dividend yield is set at its mean value, i.e., 4.1%. The upward sloping curves track stock allocations under no breaks, while the downward sloping curves allow for past and future breaks.
to the short rate and stock returns are – on average – largely uncorrelated; see Table 5. This means that the long-run risk of stocks is perceived to be higher under the T-bill rate model, which helps explain the absence of an increase in the stock allocation, the longer the investment horizon. Raising the risk aversion from γ = 5 to γ = 10, the allocation to stocks is reduced to roughly half its previous level. When past and future breaks are considered, the allocation to stocks declines from 70% at short horizons to only 10% at the five-year horizon. Thus, we continue to find that the level and slope of stock holdings as a function of the investment horizon are highly sensitive to assumptions about model stability.

5.3. Uncertainty about the number of historical breaks
We next follow the analysis in Section 4 and integrate out uncertainty about the number of historical (in-sample) breaks. The effect of using this approach is illustrated in Fig. 6. This figure compares plots of optimal stock holdings under a forecasting model that assumes eight historical breaks for the dividend yield
model (or five breaks for the T-bill rate model) against holdings computed under Bayesian Model Averaging, which considers all the models displayed in Table 2 with identical prior weights on each of these models. For both predictor variables, the allocations calculated under the models with the highest posterior probabilities and those calculated under the model averaging approach are virtually identical. This is to be expected given the very high posterior probability assigned to the best models shown in Table 2.

5.4. Model uncertainty
Fig. 6 also shows the effect of accounting for model uncertainty in a simple experiment that, for both the dividend yield and T-bill predictor variables, weights the models based on the marginal likelihood of the return equation associated with the multivariate prediction models most supported by the data. Optimal stock holdings most resemble the allocation under the five-break forecasting model based on the T-bill rate. This happens because this model achieves a better fit than the best
Fig. 5. Optimal Asset Allocation as a function of the investment horizon for a buy-and-hold investor with power utility over terminal wealth, U(W_{T+h}) = W_{T+h}^{1−γ}/(1 − γ), where h is the forecast horizon and γ is the coefficient of relative risk aversion. The calculations use the T-bill rate as a predictor variable. The panels show percentage allocations to stocks plotted against the investment horizon measured in months under the assumption that the T-bill rate is set at its value at the end of the sample, i.e., 3.8% per annum (0.3% per month). Flat curves track stock allocations under no breaks, while the downward sloping curves allow for past and future breaks.
model based on the dividend yield and hence gets a greater weight in the forecast combination. If we considered a larger universe – in particular, models with more than one predictor variable – it is less likely that any particular model would dominate in the way we find here. This analysis is only meant to illustrate how our approach can be extended to account for model uncertainty. In reality, model uncertainty is far greater than that shown here due to the typically large dimension of the set of possible predictor variables. For example, Avramov (2002) considers 2^14 different models.

6. Sensitivity analysis and extensions

This section first explores the robustness of our empirical results with respect to the assumed priors. We then show how the expected volatility of returns changes across regimes. Finally, we compute welfare costs for an investor who assumes that the model for returns does not change over time although the true data generating process is subject to breaks.
6.1. Robustness to priors
We next investigate the sensitivity of our empirical results with regard to the assumed priors. The greatest sensitivity of our results is related to the specification of V_β. This matrix controls variations in the regression coefficients across regimes. If we make V_β larger than the value assumed in the empirical analysis, the posterior distribution of the parameter estimates within each regime gets more dispersed, whereas the location of the breaks is less affected. Conversely, for smaller values of V_β, the parameter estimates in the various regimes become more similar than suggested by the empirical results, since this reduces the variation in the posterior mean of the parameter estimates across regimes.
Imposing the constraint that the predictor variable, x_t, is stationary (0 < β_x < 1) does not have much effect on the results. An additional parameter constraint that requires the unconditional mean excess stock return within each regime to fall between zero and 1% per month (0 ≤ µ_r + β_r µ_x/(1 − β_x) ≤ 0.01) also does not affect the results in any major way.
Fig. 6. Optimal asset allocation computed under Bayesian model averaging, considering the dividend yield specification with up to 10 breaks (BMA over dividend yield models), the T-bill rate specification with up to 8 breaks (BMA over Treasury bill models) and the combined set of all dividend yield and T-bill rate return prediction models.
Following the analysis in Stambaugh (1999), we consider using an informative prior that centers the correlation between innovations in the return and the dividend yield equations, λ_{j,1,2}, on −0.9, with µ_{µ,1,2} = −0.9, τ²_{1,2} = 0.00001, a_{ρ,1,2} = 1000 and b_{ρ,1,2} = 100. This again makes little difference to the results, although at short horizons of up to six months the allocation to stocks is reduced slightly compared to the case with uninformative priors on this parameter.
Our benchmark analysis uses an informative prior for the autoregressive parameter that is centered at 0.9. We have also analyzed a model with an uninformative prior for this coefficient and find that this has little effect on the posterior parameter estimates and asset allocation results.
We finally estimate a model that imposes identical transition probability parameters across states by setting p_{k,k} = p for all k. Such a restriction is natural to consider since we effectively only have one observation for each p_{k,k}. Notice that this restriction changes the structure of the prior for the transition probability matrix, P, since we no longer have a hierarchical prior on a, b. This modification has little effect on the results. For example, for the dividend yield model with eight breaks, two of the estimated
break dates (1940 and 1958) change a little (to 1943 and 1954, respectively) but the estimated slope coefficients of the dividend yield in the return equation are very similar to the baseline case in Table 3, and asset allocations are basically unchanged. Similar results prevail for the return prediction model based on the T-bill rate, for which the parameter estimates and break dates in a five-break model change very little when imposing that p_{k,k} = p.

6.2. Time-varying volatility of returns
It is a well-known empirical fact that the volatility of stock returns varies over time. Ideally this should be captured by a return forecasting model used for asset allocation. In fact, since our model allows for breaks to the covariance matrix of returns, it is capable of accounting for heteroskedasticity in returns insofar as this coincides with the identified regimes. This is an important consideration since stock returns were clearly far more volatile during periods such as the Great Depression.
To see how the volatility of stock returns changes over time in our model, Fig. 7 provides a time-series plot of the standard deviation of the predictive density of returns. Since the standard
Fig. 7. Monthly standard deviations of the predictive distribution of stock returns when the predictor variable is the dividend yield (top panel) or the T-bill rate (bottom panel) under models with eight and five breaks, respectively.
deviation of returns (and of the yield) is allowed to vary across regimes in the break model, volatility follows a step function that tracks the various regimes. The mean value of the volatility of returns varies significantly, from close to 10% per month around the Great Depression to 3%–4% per month in the middle of the sample. This finding shows that the asset allocations we computed earlier account not simply for shifts to the conditional equity premium but, equally importantly, also for shifts to the volatility of stock returns.

6.3. Welfare costs from ignoring breaks
To quantify the economic costs of ignoring breaks, we undertook the following exercise. For each of the return prediction models we simulated returns under the assumption that the true data generating process corresponds to the break point models reported in Tables 3–4 and Tables 5–6, respectively. We then computed the optimal asset allocation under the three different model specifications considered earlier, i.e., (i) no breaks without parameter uncertainty; (ii) no breaks with parameter uncertainty; and (iii) past
and future breaks. This exercise thus addresses what loss an investor would incur if the true return process experiences breaks but the investor is forced to hold a portfolio that is optimized under a model that ignores breaks.
Table 7 reports results in the form of annualized certainty equivalent returns, a common measure of performance that allows us to quantify the economic loss from using a misspecified return prediction model, in this case one that ignores breaks. Once again we compute results for risk aversion coefficients of γ = 5 and γ = 10, and we set the values of the predictor variables close to their sample means.
First consider the model based on the dividend yield. When γ = 5, the loss in certainty equivalent return is quite modest at horizons up to one year. However, the loss quickly grows as the horizon expands. In fact, the models that ignore breaks generate negative certainty equivalent returns at horizons of 30 months or longer, compared with a certainty equivalent return around 5% for the model that accounts for breaks.
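As a reference for how such numbers can be computed, a small sketch of our own with hypothetical inputs: the certainty equivalent wealth solves u(W_CE) = E[u(W_{T+h})] under power utility, and the implied return is then annualized.

```python
import numpy as np

def annualized_cer(wealth_draws, gamma, h_months):
    """Annualized certainty equivalent return implied by power utility.

    wealth_draws: simulated terminal wealth W_{T+h} (initial wealth = 1)
    under the assumed true (break point) return process.
    """
    eu = np.mean(wealth_draws ** (1.0 - gamma) / (1.0 - gamma))
    w_ce = ((1.0 - gamma) * eu) ** (1.0 / (1.0 - gamma))  # u(W_CE) = E[u(W)]
    return w_ce ** (12.0 / h_months) - 1.0                # annualize

# Hypothetical terminal wealth draws over a 60-month horizon
rng = np.random.default_rng(4)
wealth = np.exp(rng.normal(0.25, 0.35, 10_000))
print(annualized_cer(wealth, gamma=5.0, h_months=60))
```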
Table 7
Annualized certainty equivalent return estimates computed under the assumption that the break point model is the return generating process. Asset allocations are computed when the investor (i) ignores breaks and parameter uncertainty; (ii) ignores breaks but accounts for parameter uncertainty; (iii) accounts for breaks and parameter uncertainty. The horizon is measured in months and the return values are in percent per annum.

Horizon   No breaks w/o param. uncertainty   No breaks with param. uncertainty   Past & future breaks
I. Dividend yield return prediction model (Yld = 4.1%; γ = 5)
1            9.07        9.07       12.26
6            7.12        7.16        7.24
12           4.62        4.83        6.28
18           2.39        2.73        5.91
24           0.04        0.91        5.68
30          −1.05       −0.58        5.53
36          −2.25       −0.88        5.42
42          −2.75       −1.85        5.33
48          −3.50       −1.78        5.27
54          −4.59       −2.45        5.22
60          −4.68       −2.64        5.15
Yld = 4.1%; γ = 10
1            6.49        6.49        8.09
6            5.56        5.56        5.58
12           4.51        4.51        5.07
18           3.60        3.60        4.85
24           2.54        2.83        4.72
30           2.22        2.52        4.62
36           1.74        2.04        4.55
42           1.36        1.93        4.48
48           1.08        1.64        4.43
54           0.85        1.38        4.37
60           0.63        1.38        4.30
II. T-bill rate return prediction model (Tbi = 3.8%; γ = 5)
1            6.46        6.56        6.96
6            6.22        6.24        6.30
12           5.71        5.74        5.77
18           5.39        5.39        5.61
24           5.05        5.12        5.47
30           4.78        4.86        5.39
36           4.58        4.67        5.34
42           4.20        4.43        5.25
48           4.22        4.34        5.24
54           4.08        4.20        5.19
60           3.81        4.06        5.14
Tbi = 3.8%; γ = 10
1            5.16        5.23        5.41
6            5.06        5.06        5.09
12           4.80        4.82        4.82
18           4.63        4.63        4.73
24           4.45        4.51        4.65
30           4.30        4.37        4.58
36           4.17        4.25        4.54
42           4.01        4.01        4.46
48           3.82        3.92        4.44
54           3.70        3.81        4.37
60           3.58        3.69        4.31

The finding that the cost of ignoring breaks increases with the investment horizon can be explained as follows. First, the
probability of a break increases, the longer the investment horizon. Second, the difference between the allocation to stocks under the no-break and break models increases as the horizon grows, with the no-break models allocating far too much to stocks. As a result, under the assumption that breaks do in fact affect the return generating process, the no-break models lead investors to take on too much risk and result in far lower certainty equivalent returns. This finding is particularly strong under the dividend yield prediction model since this assumes strong mean reversion in returns and so leads investors to underestimate the long-run risks from holding stocks. Accounting for parameter uncertainty helps investors a bit, although it only reduces the loss in certainty
equivalent returns by around 2% compared to the case where parameter uncertainty is ignored. Notice also that losses from ignoring breaks are much smaller when investors are more risk averse (γ = 10), since this leads them to naturally dampen their allocation to stocks and, as a consequence, their over-exposure to stocks is greatly reduced.
Turning to the return prediction model based on the T-bill rate, Table 7 shows much lower losses in certainty equivalent returns. Losses are now only a few basis points for investment horizons shorter than one year. Although losses from ignoring breaks grow as the investment horizon expands, they remain modest at just above one percent per annum for γ = 5 even at the five-year horizon. Once again losses are lower when the risk aversion is increased from γ = 5 to γ = 10.

7. Conclusion

This paper provides an analysis of the stability of return prediction models and the asset allocation implications of breaks to model parameters. Our analysis accounts for several sources of uncertainty, namely (i) parameter uncertainty; (ii) model uncertainty; (iii) uncertainty about the number, location and size of historical breaks to model parameters; (iv) uncertainty about future (out-of-sample) breaks.
Our empirical results suggest that the parameters of standard return forecasting models are highly unstable and subject to multiple breaks, many of which coincide with important historical events. Such breaks compound the effect of parameter estimation uncertainty. Moreover, we find that the possibility of past and future breaks has a large impact on investors' optimal asset allocation. Overall, we conclude that instabilities in standard return prediction models can imply long-run risk estimates that are far greater than usually perceived and hence make stocks less attractive to long-run investors.

Appendix A. Gibbs sampler for the return prediction model with multiple breaks

This appendix extends results in Pesaran et al. (2006) to cover multivariate dynamic models. We are interested in drawing from the posterior distribution π(Θ, H, p, S_T|Z_T), where
Θ = (v ec (B)1 , ψ1 , Λ1 , . . . , v ec (B)K +1 , ψK +1 , ΛK +1 ) are the K + 1 sets of regime-specific parameters (regression coefficients, error term variances and correlations) and 2 H = (b0 , V0 , v0,1 , d0,1 , . . . , v0,m , d0,m , µρ,1,2 , σρ, 1 ,2 , . . . , 2 µρ,m−1,m , σρ, m−1,m , a, b)
are the hyperparameters of the meta distribution that characterizes how much the parameters of the return model are allowed to vary across regimes. We also use the notation ST = (s1 , . . . , sT ) for the collection of values of the latent state variable and ZT = (z1 , . . . , zT )′ for the time-series of returns and predictor variables. Finally, p = (p1,1 , p2,2, , . . . , pK +1,K +1 )′ summarizes the unknown parameters of the transition probability matrix in Eq. (5). The Gibbs sampler applied to our setup works as follows. First, states, ST , are simulated conditional on the data, ZT , the parameters, Θ , the meta hyperparameters, H, and the elements of the transition probability matrix, P. Next, the parameters and hyperparameters of the meta distributions are simulated conditional on the data and ST . Specifically, the Gibbs sampler is implemented by simulating the conditional distributions π (ST |Θ , H , p, ZT ), π (Θ , H |p, ST , ZT ), and π (p|ST ).17
17 We use the identity π(Θ , H , p|S , Z ) = π(Θ , H |p, S , Z )π(p|S ) and note T T T T T that under our assumptions, π(p|Θ , H , ST , ZT ) = π(p|ST ).
D. Pettenuzzo, A. Timmermann / Journal of Econometrics 164 (2011) 60–78
Simulation of the states ST requires forward and backward passes through the data. Define St = (s1 , . . . , st ) and S t +1 = (st +1 , . . . , sT ) as the state history up to time t and from time t + 1 to T , respectively. We partition the joint density of the states as follows:
(30)
Chib (1996) shows that the generic element of Eq. (30) can be decomposed as follows p(st |S t +1 , Θ , H , p, ZT ) ∝ p(st |Θ , H , p, Zt )p(st +1 |st , p),
p(st = k|Zt , Θ , H , p) p(st = k|Zt −1 , Θ , H , p) × f (zt |v ec (B)k , Σk , H , Zt −1 ) k ∑
v β = v β + (K + 1), Vβ =
Moving to the posterior for the precision parameters within each regime k and for each equation i, let Ξ = (Zk − Xk Bk )′ (Zk − Xk Bk ) with Ξi,j being its ith row and jth column element. Note that 2 s− k,i |Θ−Sk , H , p, ST , ZT ∼ G
v0,i + nj v0,i d0,i + Ξi,i , 2
2
,
K +1
v0,i |Θ , H−v0,i , p, ST , ZT ∝
∏
2 G(s− k,i |v0,i , d0,i ) exp(−ρ
k=1
0 ,i
),
(32)
d0,i |Θ , H−d0,i , p, ST , ZT
K +1 −
2 s− k,i
+ d0,i .
k=1
,
p(st = l|Θ , H , p, Zt −1 ) × f (zt |v ec (B)l , Σl , H , Zt −1 )
for k = 1, 2, . . . , K + 1. In addition,
2 The full conditional densities for µρ,i,c and σρ, i,c are similar to conjugate densities with an additional factor due to the constraint requiring Λk to be positive definite (we write Rm to identify the space of all correlation matrices of dimension m):
f (µρ,i,c |Θ , H−µρ,i,c , p, ST , ZT )
p(st = k|Θ , H , p, Zt −1 )
K +1
k
−
where nk is the number of observations assigned to regime k. Location and scale parameters for the error term precision of each equation are then updated as follows:19
∼ G v0,i (K + 1) + c 0,i ,
l=k−1
=
K +1 − (v ec (B)j − b0 )(v ec (B)j − b0 )′ + V β .
(31)
where the normalizing constant is easily obtained since st takes only two values conditional on the value taken by st +1 . The last term in Eq. (31) is simply the transition probability from the Markov chain. The first term can be computed by a recursive calculation (the forward pass through the data) where, for a given p(st −1 |Θ , H , p, Zt −1 ), we obtain p(st |Θ , H , p, Zt ) and p(st +1 |Θ , H , p, Zt +1 ), and so on until p(sT |Θ , H , p, ZT ). Suppose p(st −1 |Θ , H , p, Zt −1 ) is available. From Chib (1998),
=
and
j =1
p(sT −1 |sT , Θ , H , p, ZT ) × · · · × p(st |S t +1 , Θ , H , p, ZT )
× · · · × p(s1 |S 2 , Θ , H , p, ZT ).
77
pl,k × p(st −1 = l|Θ , H , p, Zt −1 ),
∝
v ec (B)k |Θ−vec (B)k , H , p, ST , ZT ∼ N (v ec (B)k , V k ), where V k = (Xk′ Σk−1 Xk + V0−1 )−1 ,
v ec (B)k = V k (Xk′ Σk−1 Zk + V0−1 b0 ). The posterior densities of the location and scale parameters of the meta distribution for the regression parameter, b0 and V0 , take the form b0 |Θ , H−b0 , p, ST , ZT ∼ N (µβ , Σ β ),
2 exp{−(λk,i,c − µρ,i,c )2 /(2σρ, i,c )}
k=1
l=k−1
where pl,k is the Markov transition probability. For a given set of simulated states, ST , the data is partitioned into K + 1 groups. Let Zk = (zτ′ k−1 +1 , . . . , zτ′ k )′ and Xk = (zτ′ k−1 , . . . , zτ′ k −1 )′ be the values of the dependent and independent variables within the kth regime. To obtain the conditional distributions for the regression parameters and hyperparameters, note that the conditional distributions of v ec (B)k are independent across regimes with18
∏
× exp{−(µρ,i,c − µµ,i,c )2 /(2τ 2i,c )}I {Λk ∈ Rm }, 2 f (σρ, i,c |Θ , H−σ 2
ρ,i,c
, p, ST , ZT )
K +1
∝
∏
2(1−aρ,i,c )
2 exp{−(λk,i,c − µρ,i,c )2 /(2σρ, i,c )}σρ,i,c
k=1 2 m × exp(−bρ,i,c /σρ, i,c )I {Λk ∈ R }.
(33)
The posterior distributions of the correlation coefficients within 2 each regime, λk,i,c , and of the hyperparameters µρ,i,c and σρ, i,c are nonstandard so sampling is accomplished using a Griddy Gibbs sampling step inside the main Gibbs sampling algorithm. Finally, pk,k is simulated from the conditional beta posterior pk,k |ST ∼ Beta(a + lk , b + 1), where lk = τk − τk−1 − 1 is the duration of regime k. The posterior distribution for the hyperparameters a and b in Eqs. (13) and (14) is not conjugate so sampling is accomplished using a Metropolis–Hastings step.
−1
V0−1 |Θ , H−V0 , p, ST , ZT ∼ W (V β , v β ), where 1 −1 Σ β = ((K + 1)V0−1 + Σ − β ) , K +1 − −1 −1 µβ = Σ β V 0 v ec (B)j + Σ β µβ , j =1
18 Using standard set notation we define A as the complementary set of b in A, −b i.e. A−b = {x ∈ A : x ̸= b}.
Appendix B. Algorithm for generating draws of returns under breaks This appendix describes the steps used to obtain draws from the predictive distribution of returns that account for uncertainty about past and future breaks. For each draw j from the Gibbs sampler (j = 1, . . . , J),
19 See George et al. (1993), pp. 154–155. Drawing v from Eq. (32) is complicated 0 ,i since we cannot make use of standard distributions so we use an adaptive rejection sampling step in the Gibbs sampling algorithm.
78
D. Pettenuzzo, A. Timmermann / Journal of Econometrics 164 (2011) 60–78 j
1. First, obtain a draw of pK +1,K +1 from its posterior distribution. This is achieved by combining the information from the last regime with the prior information for pk,k in Eq. (6) j
j
j
pK +1,K +1 |ST ∼ Beta(a + lK +1 , b + 1), j
j
where lK +1 = T −τK − 1 is the number of observations in regime K + 1 in round j of the Gibbs sampler. Hence the probability that at time T + s (1 ≤ s ≤ h) the jth draw of the Gibbs sampler j j remains in regime K + 1, conditional on ST +s−1 , is pK +1,K +1 , j
while the probability of moving to a new regime is 1 − pK +1,K +1 . Next, for each period T + s (1 ≤ s ≤ h) proceed as follows: j
j
2. Draw a realization, UT +s , from a uniform distribution, UT +s ∼ j U 0 1 . If UT +s j regime. If UT +s
[ , ]
j pK +1,K +1 the sampler remains in the current j pK +1,K +1 the sampler moves to the next
,
≤ >
regime. j j 2a. If UT +s ≤ pK +1,K +1 , stay in the current regime and draw returns from j
j
j
rT +s ∼ p(rT +s |ΘK +1 , H j , pj , ST +s = K + 1, ZT ). Then go back to step 2 after incrementing the time indicator, s, by one. j j 2b. If UT +s > pK +1,K +1 , start by drawing a new set of hy-
perparameters H j from their meta distributions in Eqs. (7)– j j j (12). Next, draw BK +2 and ΣK +2 from π (BK +2 |H j , ZT ) and
π(ΣKj +2 |H j , ZT ), respectively. Finally, draw returns from the posterior predictive density, j
j
j
rT +s ∼ p(rT +s |ΘK +2 , H j , ST +s = K + 2, ZT ). j
j
3. If UT +s > pK +1,K +1 , so a break occurred, draw aj and bj from their conditional posterior distributions in Eqs. (13) and (14). j Generate a draw for pK +2,K +2 using the prior distribution for pk,k in Eq. (6) and aj and bj ,20 j
pK +2,K +2 |aj , bj ∼ Beta(aj , bj ). Then go back to step 2 of the algorithm after incrementing the time indicator, s, by one and increasing the regime indicator by one. 4. When s = h, add returns from periods T + 1, . . . , T + h to get the cumulated return over the investment horizon. Repeating across j = 1, . . . , J we obtain p(RT +h |ST = K + 1, ZT ). References Ang, A., Bekaert, G., 2002. Regime switches in interest rates. Journal of Business and Economic Statistics 20, 163–182. Ang, A., Bekaert, G., 2007. Stock return predictability: is it there? Review of Financial Studies 20 (3), 651. Ang, A., Chen, J., 2002. Asymmetric correlations of equity portfolios. Journal of Financial Economics 63, 443–494. Avramov, D., 2002. Stock return predictability and model uncertainty. Journal of Financial Economics 64, 423–458. Bai, J., Lumsdaine, R., Stock, J., 1998. Testing for and dating common breaks in multivariate time series. Review of Economic Studies 65, 394–432. Barberis, N., 2000. Investing for the long run when returns are predictable. Journal of Finance 55 (2), 225–264. Barsky, R., 1989. Why don’t the prices of stocks and bonds move together? American Economic Review 79, 1132–1145. Bossaerts, P., Hillion, P., 1999. Implementing statistical criteria to select return forecasting models: what do we learn? Review of Financial Studiess 12 (2), 405–428.
20 Because we do not have any information about the length of regime K + 2 from j the estimation sample, we rely on prior information to get an estimate for pK +2,K +2 .
Boyd, J.H., Hu, J., Jagannathan, R., 2005. The stock market’s reaction to unemployment news: Why bad news is usually good for stocks. Journal of Finance 60, 649–672. Campbell, J.Y., 1987. Stock returns and the term structure. Journal of Financial Economics 18, 373–399. Campbell, J.Y., Chan, Y., Viceira, L., 2003. A multivariate model of strategic asset allocation. Journal of Financial Economics 67, 41–80. Campbell, J.Y., Shiller, R.J., 1988. Stock prices, earnings, and expected dividends. Journal of Finance 43 (3), 661–676. Campbell, J.Y., Viceira, L., 2001. Who should buy long-term bonds? American Economic Review 91, 99–127. Chib, S., 1996. Calculating posterior distribution and modal estimates in Markov mixture models. Journal of Econometrics 75, 79–97. Chib, S., 1998. Estimation and comparison of multiple change point models. Journal of Econometrics 86, 221–241. Cremers, K., 2002. Stock return predictability: a bayesian model selection perspective. The Review of Financial Studies 15, 1223–1249. Dangl, T., Halling, M., 2008. Predictive regressions with time-varying coefficients. Working Paper, Vienna University of Technology. Dimson, E., Marsh, P., Staunton, M., 2002. Triumph of the Optimists: 101 Years of Global Investment Returns. Princeton Univ. Press. Elliott, G., Mueller, U., 2006. Efficient tests for general persistent time variation in regression coefficients. Review of Economic Studies 73, 907–940. Fama, E.F., French, K.R., 1988. Dividend yields and expected stock returns. Journal of Financial Economics 22 (1), 3–25. George, E.I., Makov, U.E., Smith, A.F.M., 1993. Conjugate likelihood distributions. Scandinavian Journal of Statistics 20, 147–156. Geweke, J., 1992. Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments (Disc: P189–193). Bayesian Statistics 4, 169–188. Geweke, J., 2001. A note on some limitations of CRRA utility. Economics Letters 71 (3), 341–345. Goyal, A., Welch, I., 2008. A comprehensive look at the empirical performance of equity premium prediction. Review of Financial Studies 21 (4), 1455–1508. Guidolin, M., Timmermann, A., 2008. Size and value anomalies under regime switching. Journal of Financial Econometrics 6, 1–48. Johannes, M., Korteweg, A., Polson, N., 2009. Sequential learning, predictive regressions, and optimal portfolio returns. Mimeo, Columbia University. Kandel, S., Stambaugh, R., 1996. On the predictability of stock returns: an asset allocation perspective. Journal of Finance 51, 385–424. Keim, D., Stambaugh, R., 1986. Predicting returns in the stock and bond markets. Journal of Financial Economics 17, 357–390. Lettau, M., Ludvigson, S., 2001. Consumption, aggregate wealth and expected stock returns. Journal of Finance 56, 815–849. Lettau, M., Van Nieuwerburgh, S., 2008. Reconciling the return predictability evidence. Review of Financial Studies 21 (4), 1607. Liechty, J.C., Liechty, M.W., Müller, P., 2004. Bayesian correlation estimation. Biometrika 91 (1), 1–14. McQueen, G., Roley, V., 1993. Stock prices, news, and business conditions. The Review of Financial Studies 6 (3), 683–707. Menzly, L., Santos, T., Veronesi, P., 2004. Understanding predictability. Journal of Political Economy 112, 1–47. Pastor, L., Stambaugh, R., 2001. The equity premium and structural breaks. Journal of Finance 56, 1207–1245. Pastor, L., Stambaugh, R., 2009. Are stocks really less volatile in the long run? Mimeo, University of Chicago. Perez-Quiros, G., Timmermann, A., 2000. Firm size and cyclical variations in stock returns. 
Journal of Finance 55, 1229–1262. Pesaran, M., Pettenuzzo, D., Timmermann, A., 2006. Forecasting time series subject to multiple break points. Review of Economic Studies 73, 1057–1084. Pesaran, M.H., Timmermann, A., 1995. Predictability of stock returns: robustness and economic significance. Journal of Finance 50 (4), 1201–1228. Raftery, A., Lewis, S., 1992a. How many iterations in the Gibbs sampler. Bayesian Statistics 4 (2), 763–773. Raftery, A., Lewis, S., 1992b. One long run with diagnostics: implementation strategies for Markov chain Monte Carlo. Statistical Science 7 (4), 493–497. Raftery, A., Lewis, S., 1995. The number of iterations, convergence diagnostics and generic Metropolis algorithms. Practical Markov Chain Monte Carlo. Stambaugh, R., 1999. Predictive regressions. Journal of Financial Economics 54, 375–421. Timmermann, A., Paye, B., 2006. Instability of return prediction models. Journal of Empirical Finance 13, 274–315. van Binsbergen, J.H., Brandt, M.W, Koijen, R.S.J., 2010. On the Timing and Pricing of Cash Flows. Manuscript, Stanford University. Wachter, J., Warusawitharana, M., 2009. Predictable returns and asset allocation: should a skeptical investor time the market? Journal of Econometrics 148 (2), 162–178.
Journal of Econometrics 164 (2011) 79–91
Contents lists available at ScienceDirect
Journal of Econometrics journal homepage: www.elsevier.com/locate/jeconom
A control function approach for testing the usefulness of trending variables in forecast models and linear regression Graham Elliott ∗ Department of Economics, University of California, San Diego, 9500 Gilman Drive, LA JOLLA, CA, 92093-0508, USA
article
info
Article history: Available online 8 March 2011 JEL classification: C12 C22 C53 Keywords: Near unit root Cointegration Prediction regressions
abstract Many predictors employed in forecasting macroeconomic and finance variables display a great deal of persistence. Tests for determining the usefulness of these predictors are typically oversized, overstating their importance. Similarly, hypothesis tests on cointegrating vectors will typically be oversized if there is not an exact unit root. This paper uses a control variable approach where adding stationary covariates with certain properties to the model can result in asymptotic normal inference for prediction regressions and cointegration vector estimates in the presence of possibly non-unit root trending covariates. The properties required for this result are derived and discussed. Published by Elsevier B.V.
1. Introduction A common problem in constructing forecasting regressions is understanding the sampling uncertainty of the coefficients in the prediction regression when the predictors appear to be persistent. Forecasters care about this sampling uncertainty because it is often unclear whether or not the predictors are useful for forecasting. Given the near unit root behavior of the predictors, standard asymptotic normal distributions have been shown to be very poor approximations to the sampling distributions of the coefficients in the prediction regression. Such approximations lead to oversized tests and are hence potentially misleading as guides as to whether or not the variables actually have any predictive usefulness. The classic example in the forecasting literature is the prediction of stock market returns with the dividend price ratio—the dividend price ratio appears to have a root on or near the unit circle. For this regression the problem extends to other popular predictors such as interest rate differentials and the earnings price ratio as well. Similar issues arise with forecasting exchange rates with the forward premium, forecasting changes in income or consumption with interest rate differentials and ratios of macroeconomic variables. A number of approaches have been taken to provide tests for inclusion of the predictive variables that control size even when the predictive variable has a unit root or near unit root. The common tool has been to use local to unity asymptotics — using
∗
Tel.: +858 534 4481; fax: +858 534 7040. E-mail address:
[email protected].
0304-4076/$ – see front matter. Published by Elsevier B.V. doi:10.1016/j.jeconom.2011.02.014
limit theory for sequences of models where the largest root for the predictive regression remains in the neighborhood of one — to approximate distributions. The tests then differ in the precise statistic to be computed and how they handle the unknown root. One approach has been to use Bonferroni or related methods (Cavanagh et al., 1995; Lewellen, 2004; Campbell and Yogo, 2006). Alternatively Moreira and Jansson (2006) condition on a sufficient statistic for the root to remove the dependence in their test. None of these methods dominates each other theoretically or in practice. The precise nature of the problem and these methods are reviewed in Section 2 of this paper. The methods we have so far have some limitations. First, for each of these methods extending the methods beyond the bivariate regression is extremely challenging, and no results exist in the literature for more than a single predictor in the regression. This is in part because the methods themselves are somewhat cumbersome to apply. This paper suggests a different approach to obtaining tests which control size. In the method presented here additional covariates are added to the regression. The problems of size distortion and inference are shown to depend on a convolution of the parameters of the trending process and nuisance parameters that describe the relationships between the shocks to the forecasting regression and the shocks to the variables used for forecasting. Judiciously chosen, the covariates have the potential to remove the dependence of the hypothesis tests on the nuisance parameters describing the trend, and so provide inference that is robust to lack of knowledge over the trending behavior of the data. The method is discussed in Section 2 in a simplified bivariate case and examined for very general models in Section 3.
80
G. Elliott / Journal of Econometrics 164 (2011) 79–91
The method is applicable for a very wide set of problems. First, we allow for the inclusion of a general number of regressors in the prediction regression, hence we are not (as in the methods above) restricted to a single regressor. Second, for the prediction regression the method involves only the running of OLS regressions, and hence even for the general case is straightforward to apply. Third, inference is standard asymptotic (mixed) normal and so hypothesis tests are also straightforward to apply. We also show, in Section 4, that the method can also be used when examining ‘cointegrating’ regressions where there is uncertainty over a unit root. Such uncertainty is ubiquitous in the cointegration literature, hence the use of unit root and rank pretests. With use of additional stationary covariates, we show similar results as for the prediction equation problem—asymptotically standard (mixed) normal inference can still apply even when the roots are not exactly on the unit circle. The suggested method, as in the case in instrumental variables, requires finding additional data with particular properties. Such data are referred to as orthogonalizing covariates. In Section 5 we go into more detail regarding these properties. We suggest how such variables can be found in practice. We also show, in Theorem 5, that the use of such covariates that satisfy these properties have at minimum the same power (and generally better power) as would be obtained if we knew the size of the largest root of the predictor variable. Since all of the methods discussed in Section 2 cannot have better power than the case of the root known, this leads directly to the implication that none of the current methods can have better power against any alternative than the method suggested here when we indeed have orthogonalizing covariates. Monte Carlo results are presented in Section 6 to examine both size and power properties of the suggested method. We also examine the implications of Theorem 5 for the relationship between these methods and the popular Campbell and Yogo (2006) procedure in terms of power. Proofs of the results are contained in an Appendix. 2. The testing problem and conditioning on stationary covariates This paper examines two models which have been treated rather differently in the literature but for which common methods apply. The first regression is where a relatively non-trending variable is regressed on lagged trending variables for purposes of prediction. The second is a cointegrating regression where we are uncertain that there are unit roots in the data. In this case the ‘cointegrating’ vector is a trending variable being predicted by contemporaneous trending variables. These problems are similar in the sense that it is not the behavior of the left-hand side variable (which differs here in properties) but the terms on the right-hand side that dominate the theoretical properties of the estimators and tests. In this sense we are able to generate a solution to both problems from a similar idea, as detailed in following sections. The differences lie mostly in the auxiliary assumptions of the model, and the perceived reasonable coefficients on the trending covariates. The remaining difference is that in terms of constructing methods that deal with uncertainty over the trending process, many papers have appeared that examine the prediction regression whereas only a few have examined the cointegrating regression. We focus on the predictive regression in this review section. 
The simplest predictive regression test would be to examine if a variable measured today predicts the outcome we are interested in seeing in the next period. For example there is a large literature on forecasting stock returns using the log of the dividend price ratio in the previous period as a predictor. For inference on predictive
ability we require assumptions on both the properties of the predictor variable and on the errors to the regression to justify any inferential procedure on the predictiveness or lack thereof of the regressor. Consider the model
(1 − ρ L)(y1t − ϕ1 ) = ε1t t = 2, . . . , T y2t = ϕ2 + β y1t −1 + ε2t t = 2, . . . , T yt Tt=1
(1)
{[y1t , y2t ]′ }Tt=1
where we observe { } = where both y1t and y2t are univariate, ϕ1 , ϕ2 , β and ρ are unknown parameters, and ε1t and ε2t are unknown residual terms. Define the parameter space for ρ as P. Define E [εit εjt′ ] = Σij for i, j = 1, 2. The correlation
between ε1t and ε2t is δ = Σ12 /(Σ11 Σ22 )1/2 . In addition, we must initialize the processes, set the initial value of ε1t = ξ such that T −1/2 ξ →p 0. Some of these restrictions are made for expository purposes in this review—in the next section y2t can be multivariate, the specification of the deterministic terms is more general, and we also allow serial correlation in (1 − ρ L)y1t . This regression is typically run to test the notion of unpredictability of y2t – the typical null hypothesis derived from economic theory – so we assume that ε2t is uncorrelated. Regressions of this form appear as noted above in the prediction of stock prices by such covariates as the dividend price ratio, earnings price ratio and interest rates. Fama and French (1988) for example reject the hypothesis that the dividend price ratio cannot predict returns. Similar issues arise in testing term structure models of the interest rate, where the lagged interest rate spread is used to predict interest rates. Fama (1975) uses interest rates to predict inflation. Hall (1978) examines the predictability of changes in consumption with lagged income and stock prices. Large literatures reexamining these findings has followed these seminal papers. In each of these cases, the regressor is typically a persistent or trending variable in the sense that there is clearly a low frequency component to the data. Unit root or near unit root behavior affects the sampling distribution of least squares estimators for β and also hypothesis tests based on this estimate. We can estimate β by OLS, yielding T ∑
βˆ − β =
ym 1t −1 ε2t
t =2 T ∑
2 (ym 1t −1 )
t =2
−1 where ym t = yt − (T − 1) t =2 y t . Projecting ε2t onto ε1t yields ε2t = (Σ12 /Σ11 )ε1t + ε2.1t where E [ε1t ε2.1t ] = 0 by construction and the variance of ε2.1t is Σ2.1 = 2 Σ22 − Σ12 /Σ11 . Hence the above expression can be rewritten as
∑T
T ∑
βˆ − β =
T ∑
ym 1t −1 ε1t (Σ12 /Σ11 )
t =2 T ∑
+
T ∑
2 (ym 1t −1 )
t =2
. 2 (ym 1t −1 )
t =2 T ∑
= (ρˆ − ρ)(Σ12 /Σ11 ) +
ym 1t −1 ε2.1t
t =2
ym 1t −1 ε2.1t
t =2 T ∑
(2)
(
)
2 ym 1t −1
t =2
where ρˆ is the least squares estimator for ρ using data on y1t . A number of authors have pointed out issues with inference over β in regressions that take this form. Examining the predictability of changes in consumption using lagged income as a predictor, Mankiw and Shapiro (1986) showed in Monte Carlo experiments that hypothesis tests on β did not control size when standard critical values were employed. Stambaugh (1999) showed a similar result for prediction of returns with the dividend
G. Elliott / Journal of Econometrics 164 (2011) 79–91
tβ=β ⇒ δ τˆDF + (1 − δ 2 )1/2 φ ˆ 0
(3)
where τˆDF has a nonstandard asymptotic distribution and φ is a standard (mixed) normal random variable. When ρ = 1 (γ = 0), τˆDF is the usual Dickey and Fuller (1979) distribution for the t statistic testing for a unit root in y1t . For γ > 0 this distribution remains nonstandard with similar features to the distribution when ρ = 1. This distributional approximation explains clearly the Monte Carlo findings and suggests problems with standard inference in predictive regressions of this sort. First, as is well known, when ρ = 1 the distribution for τˆDF has a long left tail. Hence when δ < 0 this corresponds to size distortions using standard critical values in the right tail of tβ=β —when β is truly zero and the regressor ˆ 0
has no predictive power the regression is likely to find βˆ > 0 and statistically significant with a frequency greater than size. The larger in absolute value is the correlation the more likely we are to reject. The effects can be seen in Fig. 1. The figure shows the upper 95th percentiles of the distribution in (3) for δ = −0.95, −0.75 and −0.5 respectively as a function of γ . These percentiles would be the correct critical values for a one sided 5% test for β = 0. The problem of rejecting for t statistics greater than 1.645 for a 5% test — those used when using a normal approximation — can be seen from the figure. When ρ is sufficiently close to 1 and the sample size such that γ = T (ρ − 1) is sufficiently close to zero then for δ of the sizes often seen in application t statistics of greater than
1 In the asymptotic theory we employ triangular asymptotics, so ρ is ρ , we T follow the econometric literature and ignore the subscript to enhance readability.
3
2.8
2.6
95th percentile
price ratio, using one term expansions of the OLS estimator to attempt to explain the problems with inference. Since y1t −1 and ε2.1t are orthogonal by construction, it is apparent from Eq. (2) that the bias arises from the first term, (ρˆ − ρ)Σ12 /Σ11 , which is nonzero when both (a) there is persistence, so ρˆ is a biased estimator for ρ , and (b) when there is simultaneity, so Σ12 ̸= 0. Such problems have in many cases been characterized as a ‘small sample’ problem. The idea is that as the sample size increases, ρˆ is consistent for ρ and the problem disappears asymptotically. Such a result requires restricting the parameter space P away from the unit circle, for example in Amihud and Hurvich (2004) bias approximations for (ρˆ − ρ) are based on Edgeworth expansions that require this restriction. Whilst authors differ on how seriously to take the possibility of a unit root in the regressor (the failure to reject a unit root does not of course imply the presence of a unit root), roots close to one are often consistent with the data. So when the uncertainty includes a unit root, that there will always be some region of the parameter space of ρ near one for which problems of size control occur regardless of the sample size. To account for the possibility of unit roots and near unit roots, we model the region near one using the local to unity methods of Phillips (1987) where1 ρ = 1 − γ /T . These methods are known to provide much better asymptotic approximations for sample size and ρ pairs such that γ is in the range of 40 to a small positive number (including the unit root process smoothly with models near the unit root). In this approach we define the Brownian motion W1 (s) as the suitably scaled limit of the partial sum of the ¯ (s) is a demeaned Ornstein–Uhlenbeck process shocks to y1t and M driven by W1 (s) (these are defined more precisely in Section 3). Under such asymptotics it is well known that T (ρˆ − ρ) converges ¯ (s)). Hence the to a nonstandard distribution (a function of M bias remains asymptotically. Indeed, as shown in Elliott and Stock (1994) for this model
81
2.4
2.2
2
1.8 -5
0
5
10
γ
15
20
Fig. 1. Curves show 95th percentile of (3) as a function of γ for δ = −0.95 (solid line), δ = −0.75 (long dashes) and δ = −0.5 (short dashes).
2 and even near 3 that would lead one to reject the hypothesis of no predictability are actually consistent with the null hypothesis of no predictability. This translates to tests that are greatly oversized. The distortion is smaller the larger is γ (so y1t is less persistent), where the first term in (3) becomes closer to a standard normal distribution (Phillips, 1987). For smaller values for δ the extent of the oversize of the test statistic becomes smaller, when δ is close to zero there is little problem, with critical values close to their usual values. Fortunately, since δ is consistently estimable even without knowledge of ρ , in applications we are able to gauge the extent of the problem. The consistency of standard estimates for δ follows directly from Σ being consistently estimable in regression since the parameters converge at a fast enough rate so that estimation error is negligible asymptotically (this is of course why we are able to estimate the variance covariance matrix to normalize tstatistics in typical regression problems). Many of the methods discussed below are derived for known δ and work in practice by replacing this value with a consistent estimate. The discussion for the remainder of this section will assume this parameter is known. A number of approaches have been suggested to overcome this size distortion problem. An elegant approach is to condition on the sufficient statistic for ρ and construct a test for the null hypothesis based on the conditional likelihood. This approach is detailed in Moreira and Jansson (2006), who provide tests for the bivariate model case. The method involves a numerical integration technique to construct p-values for the test. A limitation of this approach is that it becomes very difficult to extend the method beyond the bivariate model. Stambaugh (1999) suggested a Bayesian approach, which avoids the requirement that the root be defined to a single number. Bootstrap approaches that involve estimating ρ in order to regenerate bootstrap samples of the data {y∗t }, though common in the applied literature, result in bootstrap distributions that differ from the sampling distribution for the statistics of interest (see Basawa et al., 1991) and hence do not control size asymptotically and thus do not solve the problem. An alternative approach that does control size reasonably well, and as such have been more often employed in the applied literature, is to examine the distribution of (3) for all ‘reasonable’ values for ρ and use the most conservative ones. Cavanagh et al. (1995) suggested such a test based on Bonferroni and Scheffe methods which incorporate sample information on likely values for ρ . Such a procedure results in a test that controls size but
82
G. Elliott / Journal of Econometrics 164 (2011) 79–91
results in biased tests. Essentially the critical values may be too conservative for many values of ρ and hence for a particular (ρ, δ) pair numerical size is much less than nominal size. This results in a biased test since there are local alternatives for which the test rejects less often than nominal size. Power is also affected, although these methods have reasonable power for many alternatives. We can rotate the system so that the model is
(1 − ρ L)(y1t − ϕ1 ) = ε1t y2t = ϕ2 + β y1t −1 + λ(1 − ρ L)y1t + ε2.1t where λ = Σ12 /Σ11 and ε2.1t is by construction orthogonal to ε1t . In this regression the orthogonality of the residuals means that the history of y1t is uncorrelated with the regression residuals. The infeasible SUR problem thus rotates the model so that the regression to be run is y2t = ϕ2 + β y1t −1 + λ(1 − ρ L)y1t −1 + ε2.1t .
(4)
In this regression we have that E [ε2.1t |y1t ] = 0 and hence the bias has been removed. The t statistic testing hypotheses on β now have an asymptotic mixed normal distribution regardless of ρ . Thus standard critical values from the normal distribution apply asymptotically. This is of course infeasible since constructing the additional regressor requires that we know ρ in order to construct the quasidifferenced regressor (1 − ρ L)y1t −1 . Campbell and Yogo (2006) utilize this regression with ρ known to obtain confidence intervals on βˆ given ρ . They then use a Bonferroni procedure which employs a DF-GLS test for a unit root (Elliott et al., 1996) and the largest bounds from each of the values for ρ in the confidence interval on ρ to construct confidence intervals for βˆ robust to the lack of knowledge over ρ . Plugging in an estimator ρˆ to construct the quasi difference results in a correctly centered t-statistic on β having the asymptotic distribution tβˆ
OLS =β0
⇒
δ τˆDF + φ (1 − δ 2 )1/2
(5)
which is again nonstandard when there is simultaneity. The infeasible SUR result, which results in standard inference on β , does give some insight into the problem. Standard inference would result if we could include the orthogonalizing variable (1 − ρ L)y1t for the regression, however estimates of ρ are not sufficiently precise, even asymptotically. Similar results will obtain for any estimator for ρ that converges at rate T . However this possibility of obtaining standard inference could still be useful if we had a proxy for the orthogonalizing variable (1 − ρ L)y1t for the regression. In this paper we take a different approach to the problem, one more aligned with solutions that arise from the economics behind the problem. Consider instead a stationary variable zt that is contemporaneously correlated with both y2t and (1 − ρ L)y2t . We could run the regression y2t = ϕ2 + β y1t −1 + α2 zt + ε˜ 2t where ε˜ 2t is the remainder of a projection of ε2t on zt . Since the correlation between zt and y2t is contemporaneous then α2 nonzero does not contradict theories that imply lack of predictability. Typically economic theory suggests many possible factors that are likely to be contemporaneously related to variables we are interested in predicting. Although it may be difficult to find predictors of stock market returns, explanations abound of news that contemporaneously affects market prices. The addition of stationary covariates to the model affects the limit distribution for the OLS estimator for β and its attendant t statistic. The distribution for the t statistic in this case is again
a linear combination of a nonstandard term and a mixed normal term, namely tβ=β ⇒ δ˜ τˆDF + (1 − δ˜ 2 )1/2 φ ˆ 0
(6)
where δ˜ is the correlation between ε˜ 1t and ε˜ 2t where ε˜ it , i = 1, 2 is the remainder of a projection of εit on zt . Hence the inclusion of zt in the regression changes the weights on the standard and nonstandard pieces. Thus for any zt this simply results in yet another mixture ˜ < |δ| the distribution which in aggregate is nonstandard. For |δ| effect of the nonstandard piece and hence the possibility of size distortions from using the normal are smaller. However it may be possible to find variables zt — the orthogonalizing covariates — for which δ˜ = 0 even though δ does not. In this case the weight on the nonstandard part of the distribution is zero and the t statistic has a mixed normal distribution with mean zero and variance unity. Hence standard inference with the usual critical values from the normal table is appropriate asymptotically. In Section 3 these points will be made precise in the general model. Similar problems in inference arise in the cointegrating model when variables do not have an exact unit root. Elliott (1998) examines this problem for estimators of the cointegrating vectors that are efficient when the roots are exactly on the unit circle. The sizes of the distortions can be even larger for the near cointegrating case than that for the predictive case. However a correct orthogonalization can remove the effect. This result is shown for general models in Section 4. 3. Standard inference for predictive regressions with near unit roots The general predictive model is
(1 − ρ L)(y1t − ϕ1 dt ) = u1t y2t = ϕ2 dt + β y1t −1 + ε2t zt = ϕ3 dt + u3t
(7)
where the observed data {y1t , y2t , zt } are such that y1t is kx1, y2t is univariate and zt is px1, dt is a vector of deterministic terms, A(L)u1t = ε1t and B(L)u3t = ε3t where B(L) has order q, εt = ′ ′ ′ (ε1t , ε2t , ε3t ) where E [εt εt′ ] = Σ . It will be useful to partition Σ into blocks Σij , i, j = 1, 2, 3 which correspond to each of the three elements of εt (so E [εi εj′ ] = Σij ). The unknown parameters are (ϕ1 , ϕ2 , ϕ3 , β, ρ, ξ , Σ ) and the parameters of the finite lag polynomials A(L) and B(L). The model is intended to capture the problem of forecasting a univariate outcome of interest y2t with a kx1 set of predictive regressors y1t −1 . These predictive regressors are trending regressors in the sense that the autoregressive coefficient that corresponds to their largest roots, here factored out into ρ , are close to one in the statistical sense of these roots being difficult to distinguish from either on the unit circle or just outside. We restrict ρ to be a kxk matrix which is diagonal2 with diagonal elements ρi for i = 1, . . . , k. In order to obtain asymptotic approximations for sampling distributions of the estimators and tests considered here, we make the following additional assumptions on the model. Condition 1. (a) The data {y1tT , y2tT , ztT } are a triangular array and ρ = 1 − T −1 Γ where Γ is a diagonal matrix with diagonal elements γi , i = 1, . . . , k.
2 This restriction ensures that the y variables do not have the property that their 1t second differences satisfy a Functional Central Limit Theorem (FCLT) with a positive but finite scale term, i.e. they are not I (2) or near I (2) random variables.
G. Elliott / Journal of Econometrics 164 (2011) 79–91
(b) E [εt |εt −1 , εt −2 , . . .] = 0, E [εt εt′ |εt −1 , εt −2 , . . .] = Σ and E [εitm |εt −1 , εt −2 , . . .] < K < ∞, m > 4 for i = 1, . . . , k + 1 + p. (c) A(1)−1 Σ11 A(1)−1′ has rank k. (d) A(L) and B(L) have all roots outside the unit circle. (e) T −1/2 ξ →p 0. (f) d¯ t (which is a function of dt ) is such that there exists a diagonal scaling matrix Υd,T which is only a function of T such that
Υd−,T1 d¯ t = T −1/2 d˜ t and T −1 s(λ) is nonstochastic.
∑T
t =1
d˜ t d˜ ′t →
1 0
s(λ)s(λ)′ dλ where
In Condition 1(a) we make formal the use of the local to unity asymptotic theory, following Phillips (1987). This is standard in this literature. Condition 1(b) ensures that the shocks driving the system are regular enough to allow asymptotic inference, in ∑[T ·] particular T −1/2 t =1 εt ,T ⇒ Σ 1/2 W (·) where W (λ) is a (k + 1 + p)x1 standard Brownian Motion with variance covariance matrix λIk+1+p . The requirement in Condition 1(c) is that there do not exist any linear combinations of y1t such that their partial sums would satisfy a functional central limit theorem. In the case where ρ = I this condition amounts to assuming that none of the elements of y1t are cointegrated with each other. The condition in (d) is required to restrict asymptotic inference on the coefficients of the zt covariates to be asymptotic normal converging at rate T 1/2 . This condition of course is the standard condition for asymptotic normal inference when dependence is linear. The initial condition for y1t is made negligible in (e), which is the standard assumption made in this literature. In particular the entire set of results for the predictive regression discussed above make this assumption or the more restrictive one that ξ = 0. Condition 1(f) limits the types of deterministic terms that are handled by the results below. The condition allows for a constant, polynomials in time, and many other types of deterministic terms that do not depend on estimated quantities. The key assumption is that s(λ) does not depend on estimated quantities but is instead known.3 Condition 1(a), (b), (d), (e), the first equation in (7) along with choosing Σ 1/2 to be lower block diagonal with partitions after the kth row and column yields the asymptotic approximation 1/2 T −1/2 u1[T λ] ⇒ A(1)−1 Σ11 M (λ) where M (λ) is a kx1 standard Ornstein–Uhlenbeck process which is a continuous function of W1 (λ) where W1 (λ) are the first k elements of W (λ). The choice of Σ 1/2 lower block diagonal ensures that M (λ) is independent of the remaining elements of W (λ). The standard approach to examining this problem, since it is a problem of examining the predictability of y1t is to estimate directly the prediction equation without regard to the rest of the system. The prediction regression is y2t = ϕ2′ dt + β ′ y1t −1 + ε2t .
(8)
The null hypothesis of interest is H0 : β = 0 with a two sided alternative. This can be tested using the usual Wald test. Theorem 1. When Condition 1 holds then under the null hypothesis testing β = 0 in the regression (8) the Wald test has the limiting distribution
∫ Wald ⇒
′ 1/2 ¯ MdW 1 δ + (1 − δ δ)
∫ ×
∫
′ 1/2 ¯ MdW 1 δ + (1 − δ δ) −1/2
−1/2
where δ = Σ11 Σ12 Σ22 projection of M (λ) on s(λ).
¯ MdW 2 ∫
′ ∫
¯ MdW 2
¯M ¯′ M
−1
¯ (λ) is the remainder of a and M
3 An implicit assumption is that the form of s(λ) is the same for both (8) and (9). The results are easily amended at the cost of additional notation when these differ.
83
This result extends that of Elliott and Stock (1994) to the case of more than a single regressor and also allows for a greater variety of deterministic terms. The basic point however is the same—when there is correlation between ε1t and ε2t so Σ12 is nonzero then the distribution of any test for the null hypothesis does not have the standard distribution and as a consequence there is typically a size distortion if standard critical values are used. Rejections of the null hypothesis of no predictability may in fact be due to this correlation. The regressors zt are available stationary covariates that may be correlated with the shocks to the prediction equation (if Σ23 is nonzero) and may be contemporaneously correlated with the shocks to the predictor variables (if the relevant element of Σ13 is nonzero). The regression to be run when the zt are included in the regression is y2t = ϕ¯ 2′ d¯ t + β ′ y1t −1 + α2′ Zt + ε˜ 2t
(9)
where Zt = (zt , zt −1 , . . . , zt −q ) , α2 = (α20 , α21 , . . . , α2q ) and we ′ ′ define ε˜ t = (˜ε1t , ε˜ 2t )′ as the remainder when we project (ε1t , ε2t ) onto ε3t . For partitions conformable with (y′1t , y2t , zt ) define the matrix ′
G=
Ik 0
′
′
−1 −Σ13 Σ33 −1 −Σ23 Σ33
0 1
′
′
′
′
′
−1 ′ then ε˜ t = Gεt and so the population values for α2i are Σ23 Σ33 Bi where Bi is the ith element of B(L). We consider the Wald statistic testing the same hypothesis as in Theorem 1.
Theorem 2. When Condition 1 holds then under the null hypothesis testing β = 0 in the regression (9)
∫
¯ ˜ ˜ ′ ˜ 1/2 MdW 1 δ + (1 − δ δ)
Wald ⇒
∫ ×
∫
¯ W ˜2 Md
¯ ˜ ˜ ′ ˜ 1/2 MdW 1 δ + (1 − δ δ)
−1/2
∫
′ ∫
¯ W ˜2 Md
¯M ¯′ M
−1
−1/2
˜ (λ) ˜ 12 Σ ˜ 22 , Σ ˜ = E [˜εt ε˜ t′ ] defined below and W ˜ 11 Σ where δ˜ = Σ is a standardized linear combination of W2 (λ) and W3 (λ). This theorem extends the results to the case where stationary ˜ (λ) is a linear variables are included in the regression. Since W ˜ ¯ (λ) function of W (λ) and W (λ) then W (λ) is independent of M 2 3 ¯ W ˜ 2 is mixed normal. This means that, just as in the then Md previous result, the distributions are a weighted average between a term that is mixed normal and a nonstandard term. The difference lies in the weighting—here the weighting on the nonstandard component is δ˜ in place of δ . The different weighting means that the extent to which the distribution differs from a standard one will differ. We can consider the impact of adding the additional covariates on the nuisance parameters though since
˜ 11 Σ ′ ˜ 12 Σ
˜ 12 Σ ˜ 22 Σ
=
−1 ′ Σ11 − Σ13 Σ33 Σ13 −1 ′ ′ Σ12 − Σ23 Σ33 Σ13
−1 ′ Σ12 − Σ13 Σ33 Σ23 . (10) −1 ′ Σ22 − Σ23 Σ33 Σ23
˜ > The inclusion of the extra stationary variables can result in |δ| |δ|, i.e. even larger size distortions that when zt is omitted, although it should be recalled that both δ˜ and δ are consistently estimable so this situation can easily be avoided in practice. But it is also possible that by including zt with the correct properties that δ˜ = 0 and standard inference on the OLS coefficients provides correct inference asymptotically. What is required is that the control variables zt be ‘orthogonalizing covariates’ and explain away the correlation between ε1t and ε2t . Mathematically
84
G. Elliott / Journal of Econometrics 164 (2011) 79–91
′ this means that Σ12 = Σ13 Σ33 Σ23 . This makes clear two necessary properties that must be satisfied by zt —they are (a) they must be stochastic and (b) both Σ13 and Σ23 must be nonzero. The second of these means that zt must covary with both the dependent variable and the quasi difference of the predictor variables. Even when δ˜ is nonzero it may be close enough to zero so that size distortions are small.
4. Standard inference for cointegrating relationships with near unit roots The general near cointegrating model is
(1 − ρ L)(y1t − ϕ1 dt ) = u1t y2t = ϕ2 dt + β y1t + u2t zt = ϕ3 dt + u3t
(11)
where ut = (u′1t , u′2t , u′3t )′ and A˜ (L)ut = εt . The observed data {y1t , y2t , zt } are such that y1t is kx1, y2t is rx1 (so there are r potential cointegrating vectors) and zt is px1, dt is a vector of ′ ′ ′ ′ deterministic terms, εt = (ε1t , ε2t , ε3t ) where E [εt εt′ ] = Σ , and the unknown parameters are (ϕ1 , ϕ2 , ϕ3 , β, ρ, ξ , Σ ) and the parameters of the finite (known) lag polynomial A˜ (L). This model can be rotated into the vector autoregressive model
1y1t 1y2t zt
=Ψ
y1t −1 y2t −1
1y1t −1 + Φ d¯ t + Π (L) 1y2t −1
+ εt
∗
(12)
z t −1
where Ψ is (k + r + p)x(k + r ) and d¯ t are the relevant deterministic terms that derive from (11). For deterministic terms that are a polynomial of time this is straightforward as d¯ t = dt , however if there are known structural breaks then this term can become more complicated. It is assumed throughout that the deterministic component is correctly specified in the sense that no terms implied by the original model have been omitted. In the case where ρ = Ik and we assume that the elements of y1t themselves do not cointegrate amongst each other, this is a general cointegrating system where if we partition Ψ = [Ψ1 Ψ2 ] where the partition is after the kth column then Ψ2 are the impact coefficients of the error correction terms (y2t −1 − β y1t −1 ) where β is the cointegrating vector. When the zt covariates are omitted a wide range of asymptotically equivalent methods are available under this assumption for estimation of and inference on the cointegrating vector (cf. Johansen, 1988, 1991; Stock and Watson, 1993; Saikkonen, 1991, 1992; Phillips and Hansen, 1990). Hypothesis tests on the cointegrating vector using these methods is particularly simple since t-statistics can be compared to the usual normal distribution and more general tests can be compared to the χ 2 distributions. Relaxing the condition that we know a priori that the roots are exactly on the unit circle (and maintaining the assumption that we ignore the zt covariates) Elliott (1998) shows that the efficient methods for estimating the cointegrating vector have a bias term and standard inference no longer applies except in the very special case where the correlation between u1t and u2t is zero at the zero frequency. For estimating βˆ in this regression we consider the estimator
ˆ ∗−1 Ψˆ 2 )−1 (Ψˆ 2′ Σ ˆ ∗−1 Ψˆ 1 ) βˆ = −(Ψˆ 2′ Σ
(13)
where the rows of Ψˆ = [Ψˆ 1 Ψˆ 2 ] are obtained from running each of the regressions in (12) equation by equation (as in the usual case ∑ ˆ ∗ = T −1 εˆ t∗ εˆ t∗′ where εˆ t∗ are the when we estimate a VAR) and Σ least squares residuals from the OLS regressions (12). This estimator is an extension of the method proposed by Saikkonen (1992) for the case where there are no stationary
variables. The Granger representation theorem shows that when ρ = Ik then cointegration implies that the matrix Ψ is reduced rank. This estimator can be shown to be asymptotically equivalent to the minimum distance estimator for the cointegrating vector when the restrictions implied by cointegration are imposed on the unrestricted VAR estimates. Further, following Saikkonen (1992) this estimator can be shown to be asymptotically equivalent to a maximum likelihood estimator for the same normalization of the cointegrating vector. We turn to the asymptotic properties of this estimator. Additional assumptions on the model are essentially the same as Condition 1, apart from (c) and (d). Condition 2. (a) The data {y1tT , y2tT , ztT } are a triangular array and ρ = 1 − T −1 Γ where Γ is a diagonal matrix with diagonal elements γi , i = 1, . . . , k. (b) E [εt |εt −1 , εt −2 , . . .] = 0, E [εt εt′ |εt −1 , εt −2 , . . .] = Σ and E [εitm |εt −1 , εt −2 , . . .] < K < ∞, m > 4 for i = 1, . . . , k + 1 + p. (c) A˜ 11 (1)−1 Σ11 A˜ 11 (1)−1′ has rank k where A˜ 11 (L) is the upper kxk block of A˜ (L). (d) A˜ (L) has all roots outside the unit circle. (e) T −1/2 ξ →p 0. (f) d¯ t (which is a function of dt ) is such that there exists a diagonal scaling matrix Υd,T which is only a function of T such that
Υd−,T1 d¯ t = T −1/2 d˜ t and T −1 s(λ) is nonstochastic.
∑T
t =1
d˜ t d˜ ′t →
1 0
s(λ)s(λ)′ dλ where
Theorem 3. For the model in (11) when ρ = Ik − T −1 Γ and Condition 2 holds then for the estimator defined in (13) then
˜ 2.1 ι′2 Ω −1/2′ T (βˆ − β) ⇒ Ω −1/2
× Ω11
∫
¯′ dW M
∫
¯M ¯′ M
−1
′ ˜ −1 ˜ 12 +Ω Ω11 Γ
(14)
˜ 2.1 = ˜ ij = Ωij − Ωi3 Ω33 Ω3j , Ω where Ω = A(1) Σ A(1) , Ω ′ ˜ −1 ˜ ¯ is as defined in the text after Condition 1. ˜ 22 − Ω ˜ 12 Ω Ω11 Ω12 , M −1
−1
−1′
This result extends the results of Saikkonen (1991) to the case where stationary covariates are included in the regression and used optimally to estimate the cointegrating vector under the ′ assumption that ρ = Ik . Since the leading rxk block of ι′2 Ω −1/2 is a ¯′ matrix of zeros this means that in the weighted average of dW M ′ ¯ a zero weight is given to dW1 M (which is a nonstandard piece) and the remaining pieces are mixed normal. Hence the first term in (14) has a mixed normal distribution. Linear hypotheses on β in the form Rvec(β) = r can be tested using the Wald test
ˆ ⊗ (Ψˆ 2′ Σ ˆ − r )′ R(M ˆ ∗−1 Ψˆ 2 )−1 )−1 R′ Wald = T 2 (Rvec(β)
−1
ˆ − r) × (Rvec(β) ˆ is the upper kxk block of the inverse of the regressors where M outer product scaled by T −2 . This is the standard form of the Wald ˆ ∗−1 Ψˆ 2 )−1 in place of the statistic apart from the scaling term (Ψˆ 2′ Σ variance covariance of the residuals. Theorem 4. Under the same conditions as Theorem 3 then under the null hypothesis that β is true ′ ∫ −1 ∫ ′ −1/2 ′ ˜ −1 ¯′ ¯M ¯′ ˜ 21./12 ˜ 12 Wald ⇒ RvecΩ dW M M Ω11 + Ω Ω11 Γ
−1/2′
R Ω11
∫
¯M ¯′ M
−1
−1/2
Ω11
−1 ˜ 2.1 R′ ⊗Ω
G. Elliott / Journal of Econometrics 164 (2011) 79–91
∫ −1 ∫ ′ −1/2 ′ ˜ −1 ¯′ ¯M ¯′ ˜ 21./12 ˜ 12 x RvecΩ dW M M Ω11 + Ω Ω11 Γ
˜ (λ) is a rx1 standard Brownian Motion independent of where W M (λ). This result extends the results of Elliott (1998) to hypothesis testing in the near cointegrating model when stationary variables are added to the system. The Wald test has a nonstandard ˜ ′ ˜ −1 distribution due to the term Ω 12 Ω′11 Γ . When this term is equal ¯M ¯ this distribution is equivalent to zero, then conditional on M to a χq2 where the number of degrees of freedom q is equal to the number of restrictions (row rank of R). The added stationary covariates zt with be orthogonalizing ′ ˜ 12 covariates when Ω = 0, i.e. the condition to be satisfied is that −1 ′ Ω12 − Ω13 Ω33 Ω23 = 0. This is the zero frequency analog of the same condition for the predictive regression—indeed because of the serial correlation structure of the model in Section 3 the zero frequency variance covariance matrix and the contemporaneous variance covariance matrix coincide, so the restriction is seen to be exactly the same as that for the previous section. When the control variables zt are orthogonalizing covariates then this result allows standard t tests to be conducted on β , even when we do not know the value for ρ . 5. Finding and interpreting orthogonalizing regressors In each of the above cases, predictive regressions and the near cointegrating regressions, the condition for the orthogonalizing covariates is essentially the same. We require that the orthogonalizing covariates zt be mean reverting and that they explain away the correlation between ε1t and ε2t . In the cointegrating case this becomes the zero frequency correlation. As with finding instruments in instrumental variables regression, candidate orthogonalizing covariates would generally arise from theoretical understanding of the problem from economic theory and the requirements of the candidate variables from econometric theory. Candidate variables would be those we think are contemporaneously correlated with the shocks to both the predictor variable and to the variable to be predicted. Consider the problem of predicting stock market returns with the log of the dividend price ratio. Here y1t = log(Dividends) − log(Price) and y2t is returns, which also depends on the price of the stock. With prices appearing in both, it is no surprise that δ < 0 and large, indeed δ is often estimated in the range of −0.75 to −0.95. In Table 1 p. 382 of Stambaugh (1999) they are less than −0.89. Candidate orthogonalizing regressors would be variables that are correlated with both the dividend price ratio and returns. Consider news variables, for example good macroeconomic news. This would likely have the effect of raising stock prices and hence returns, at the same time lowering the dividend price ratio as prices of the stock rise. Hence such a variable would go some way in explaining this negative correlation. Thus candidate variables for zt would come from theoretical considerations for any application. There is no requirement of understanding causal direction, since it is only the correlation that matters for the variable to be potentially useful. Finding such candidates should not be particularly difficult, in the sense that in economic problems contemporaneous correlations between variables are more the norm than orthogonality. 
Having the candidates actually explain all of the correlation so that δ˜ = 0 is likely more difficult (as is finding instruments that we are sure are uncorrelated with error terms) however the results of this paper show that this is a direction to consider in applied work.
85
Unlike the instrumental variables problem, in this problem we are able to test the assumptions of the model. In terms of mean reversion of zt , there exist many tests for this null hypothesis (see Stock, 1994, for a review). Consistent estimators for Ω or Σ can be computed using standard methods and hence can easily estimate both δ and δ˜ for any application. Consistency follows directly from the asymptotic negligibility of the regression estimates in estimating variance covariance matrices. Hence for any application one could avoid situations where adding zt to the regression makes the problem worse by not employing the ˜ is estimated to be significantly less than the methods unless |δ| estimate for |δ|. The effect of introducing zt to the prediction regression need not be to completely orthogonalize the regression. However as Theorem 2 above showed the effective distribution does change. ˜ is much smaller than |δ|, size distortions For situations where |δ| though not removed will be made much smaller. In such cases it may be a large enough difference at the calculated test statistic that the rejection becomes clear (since for no ρ would the critical value be so high). Hence rather than solving the inference problem, it may instead be that the introduction of zt merely increases power. This too is interesting, especially if it causes the t statistic to change in such a way that it is now large enough that it is beyond the critical values for all possible γ . The use of zt in the prediction regression or the cointegrating model improves power through a number of channels. Here we compare two possibilities—that ρ is known and zt are not employed versus the more realistic situation that ρ is unknown but also we have and use orthogonalizing covariates zt . Theorem 5. When there exist zt that satisfy the conditions for being orthogonalizing covariates then (a) (Predictive Case) Under the conditions of Theorem 2 then the OLS estimator for β employing zt as a predictor is asymptotically at least as efficient as the asymptotically efficient estimators for β when ρ is known and the information in zt is not employed. (b) (Cointegrating Case) Under the conditions of Theorem 3 then the estimator (13) is asymptotically at least as efficient as the asymptotically efficient estimators for β when ρ is known and the information in zt is not employed. In the case of the predictive regression, it is of course not possible that any of the methods discussed in the review in Section 2 that both control size for all ρ and allow uncertainty over this parameter to have better power than a test based on a ttest in the infeasible SUR regression given the efficiency properties of the latter regression for this model. Theorem 5 shows that when orthogonalizing covariates are available, then power using orthogonalizing covariates must be at least as good as the efficient infeasible SUR method that ignores the additional covariates. From this it follows directly that none of the methods presented in Section 2 can have better power against any alternative for β than the method suggested in this paper when zt is such that δ˜ = 0. A similar point holds in the cointegrating case. Finally, a comment on interpretation of the regression coefficient. Typically in regression we think of a coefficient such as β as the effect of y1t holding constant other variables in the regression. Thus it might seem that adding the ‘orthogonalizing’ regressor zt to the regression fundamentally alters the effect we are trying to estimate. 
This in turn implies that the correct interpretation of the difference in results from choosing to include or exclude the covariate zt may be that the effect being estimated is different in each of the regressions. In the prediction regression, supposing that we knew α2, this means that when zt is excluded we interpret β as the effect of y2t−1 on y1t, and when zt is included this coefficient is the
Table 1
Size properties for one sided test.

           Regression (8)            Regression (9)            Campbell Yogo
 Method    δ = −0.95  −0.75  −0.5    δ = −0.95  −0.75  −0.5    δ = −0.95  −0.75  −0.5
 ρ = 1     0.416      0.289  0.180   0.051      0.051  0.051   0.066      0.067  0.065
 ρ = 0.9   0.139      0.116  0.091   0.053      0.053  0.053   0.060      0.060  0.058
 ρ = 0.8   0.105      0.092  0.079   0.054      0.054  0.054   0.061      0.059  0.055

Notes: Size is 5% for a one sided upper tail test.
[Fig. 2 appears here: two panels plotting power (0 to 1) against the local alternative b (0 to 20), one panel for δ = −0.5 and one for δ = −0.75.]

Fig. 2. Local power. The solid line is the method of this paper, long dashes indicate power when ρ is known, short dashes indicate power for the Campbell and Yogo (2006) Bonferroni procedure.
effect of y2t−1 on (y2t − α2zt), a different question. For the cointegration case the same would be true for the effect of y1t on y2t.

Such an interpretation is, however, not correct. To see this, consider the cointegrating regression results. The interpretation of β is that it is the coefficient that allows the persistent component in y2t − βy1t to be annihilated. Adding a stationary variable (one with serial correlation properties such that its roots are far from the unit circle) cannot possibly change this interpretation of the coefficient, since it cannot affect this trending behavior in y1t and y2t. A similar argument holds for the prediction regression. Consider for the prediction regression what it must mean for β to be nonzero. Under this alternative hypothesis, by virtue of the serial correlation properties of y2t, there must be a highly serially correlated component in y1t. This can be true even if standard tests for unit roots or stationarity do not detect this component.4 Provided that the orthogonalizing regressor zt does not have such a low frequency component, it will not affect the relationship between the low frequency behavior of y2t and any equivalent low frequency behavior, or lack thereof, in y1t.

6. Numerical evaluation of the results

We turn to providing a numerical evaluation of the results. All results are presented for the predictive model, since results are qualitatively the same for the cointegrating model; in the cointegrating model, however, magnitudes can be larger (see Elliott, 1998), so the results presented here are conservative in this respect. For the Monte Carlo results we employ the model used in Campbell and Yogo (2006), which is the first two equations of (7) with k = 1 and dt = 1 (i.e. constants only; this is the leading applied case for this regression), ξ = 0 and A(L) = 1. Values for ρ and δ follow their paper as well. For the augmented model we add the third equation of (7) and set B(L) = 1 and |Σ13| = |Σ23| so that δ̃ = 0.

We report size results for testing the hypothesis that β = 0 against the alternative that β > 0 with a 5% test in Table 1. The first three columns of results give rejection rates for the least squares method (8) with the normal distribution used to construct the critical values for the test. The next three columns show size results for the method augmented by including zt in (9). The last three columns report size for the Campbell and Yogo (2006) Bonferroni method. In each case we used 100 observations and 10 000 Monte Carlo replications.

It is clear from Table 1 that the use of asymptotic normal critical values (the original Stambaugh problem) results in very large size distortions. This was also shown to be true asymptotically in Fig. 1 in Section 2. Having orthogonalizing covariates allows size to be controlled, and this is true for all values of ρ and δ. Comparing this to the size results for the method of Campbell and Yogo (2006), their method is slightly oversized, with the extent of the problem depending on both ρ and δ: the degree of oversize is increasing in |δ| and larger for ρ closer to one.

In Fig. 2 we examine power for testing the null of β = 0 against the alternative that β > 0 using the method of this paper, the infeasible approach of knowing ρ, and the Campbell and Yogo (2006) method.5 The model is the same as that for Table 1, except that we use 250 observations to ensure that all methods control size. The alternative is set to β = b/T for b given on the x-axis of the curves. The Bonferroni method of Campbell and Yogo (2006) does not require knowledge of ρ, and by construction gives up power for not knowing this nuisance parameter. Hence we expect from the results of Theorem 5 that the augmented regression (9) will result in a test that is more powerful than the infeasible known-ρ method from (4), which in turn will be more powerful than the Campbell and Yogo (2006) method. The first panel examines power for the model where γ = 10 and δ = −0.5; the second panel repeats this exercise for δ = −0.75. Similar unreported results hold for γ = 0 and γ = 20. In each of these figures the results we expect from Theorem 5 are borne out, with power of the method suggested in this paper being greater than that of the infeasible SUR method, which in turn is an upper bound on the ρ-unknown bivariate methods. Different choices of Σ for any δ can make power for the orthogonalizing regressor method and the infeasible SUR method closer or further apart.

4 Power of stationarity tests against alternatives where the integrated or near integrated component is small is low. Similarly, unit root tests over-reject when there is a small unit root component mixed in with a large degree of stationary noise.

5 Notice that the comparison with the Campbell and Yogo (2006) method involves different levels of information, so it is not a fair comparison from this point of view.
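To make the design concrete, the following minimal sketch simulates size for the one sided 5% t-test on β in regressions (8) and (9) under assumptions chosen for illustration: the leading case k = 1 with constants only, A(L) = B(L) = 1, Gaussian innovations, and a covariance matrix built with Σ12 = Σ13Σ23 so that δ̃ = 0. The particular Σ, the seed, and all names are assumptions, not the paper's code; the residual-covariance estimates of δ and δ̃ discussed in Section 5 are reported here from the innovations themselves, normalized by Σ11 and Σ22 only.

```python
import numpy as np

rng = np.random.default_rng(0)
T, reps, rho, delta = 100, 2000, 1.0, -0.75

# Innovation covariance with Sigma12 = Sigma13 * Sigma23, so delta-tilde = 0.
s13, s23 = np.sqrt(-delta), -np.sqrt(-delta)
Sigma = np.array([[1.0, delta, s13],
                  [delta, 1.0, s23],
                  [s13, s23, 1.0]])
C = np.linalg.cholesky(Sigma)

def tstat_last(y, X):
    """OLS t-statistic on the coefficient of the last column of X."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    V = (e @ e / (len(y) - X.shape[1])) * np.linalg.inv(X.T @ X)
    return b[-1] / np.sqrt(V[-1, -1])

rej8 = rej9 = 0
d_hat = dt_hat = 0.0
for _ in range(reps):
    eps = rng.standard_normal((T + 1, 3)) @ C.T        # (eps1, eps2, eps3)
    y1 = np.zeros(T + 1)
    for t in range(1, T + 1):                          # persistent predictor
        y1[t] = rho * y1[t - 1] + eps[t, 0]
    y2 = eps[1:, 1]                                    # beta = 0 under the null
    z, ones = eps[1:, 2], np.ones(T)
    rej8 += tstat_last(y2, np.column_stack([ones, y1[:-1]])) > 1.645
    rej9 += tstat_last(y2, np.column_stack([ones, z, y1[:-1]])) > 1.645
    S = np.cov(eps.T)                                  # innovation covariance
    d_hat += S[0, 1] / np.sqrt(S[0, 0] * S[1, 1])
    dt_hat += (S[0, 1] - S[0, 2] * S[1, 2] / S[2, 2]) / np.sqrt(S[0, 0] * S[1, 1])

print(f"size of (8): {rej8/reps:.3f}   size of (9): {rej9/reps:.3f}")
print(f"mean delta-hat: {d_hat/reps:.2f}   mean delta-tilde-hat: {dt_hat/reps:.2f}")
```

With ρ = 1 and δ = −0.75, the first rejection rate should be far above the nominal 5% (compare the 0.289 entry in Table 1), while the augmented regression (9) should be close to nominal size.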
7. Conclusion

This paper analyzes how control variables that are not persistent can be employed in linear regressions with persistent regressors to reduce or eliminate size distortions of hypothesis tests for the coefficients on these variables. Interest is focussed in particular on inference on the coefficients on the persistent covariates. It is shown, in the context of both predictive regressions (where a persistent regressor is employed to forecast a variable that may or may not be persistent) and near cointegrating regressions (where we suspect that the trends in each of the variables are annihilated by a linear combination but we are unsure whether the trend is an exact unit root process), that the addition of control variables to the regression may result in standard asymptotic analysis for the coefficients on the trending regressors even when we do not know the precise specification of the trend itself. Further, when such orthogonalizing control variables are available, inference is improved above and beyond that available from ignoring the control variables but knowing the exact nature of the trend. Finding candidate orthogonalizing covariates, discussed in Section 5, follows from economic theory and from considering which variables are likely to be correlated with the shocks driving the variables in the regression.

Acknowledgements

The author thanks participants at the Forecasting in Rio conference, Stanford, MIT, ANU, the Federal Reserve Board, UBC, Boston College, and Brown University, as well as the referees and editor for comments.

Appendix

For all results, ∫ refers to ∫_0^1 and ⇒ refers to weak convergence.

Lemma 1 (Asymptotic Approximations). For the model in (7) when Condition 1 holds, or (11) when Condition 2 holds, then

(a) T^−2 ∑ y_{1t−1} y′_{1t−1} ⇒ Ω11^{1/2} ∫ M(λ)M(λ)′ dλ Ω11^{1/2′}
(b) T^−1 ∑ y_{1t−1} ε′_t ⇒ Ω11^{1/2} ∫ M(λ) dW(λ)′ Σ^{1/2′}
(c) T^{−3/2} ∑ y_{1t−1} d̃′_t ⇒ Ω11^{1/2} ∫ M(λ)s(λ)′ dλ
(d) T^{−3/2} ∑ y_{1t−1} X′_t →p 0
(e) T^−1 ∑ d̃_t X′_t →p 0
(f) T^{−1/2} ∑ X_t ε′_t is Op(1),

where X_t are mean zero terms such that H(L)X_t is stationary with H(L) having a finite lag length.

Proof. The results in (a) and (c) follow directly from Condition 1(a) and (b), which yield T^{−1/2} y_{1[Tλ]} ⇒ A(1)^{−1} Σ11^{1/2} M(λ), and the continuous mapping theorem. For (b) the result follows from Phillips (1987).

The result in (d) is the result behind the familiar block diagonality of the denominator matrix in regression between the integrated and non-integrated pieces. To show the result it is sufficient to show it for an arbitrary element of the matrix; take the (j, r)th element, where j may equal r. We have

T^{−3/2} ∑_{t=2}^T y_{j1t−1} X_{rt} = T^{−3/2} ∑_{t=2}^T X_{rt} ∑_{s=2}^t ρ_j^{t−s−1} u_s ≤ T^{−1/2} max{ρ_j, ρ_j^{T−1}} · T^−1 ∑_{t=1}^T ∑_{s=1}^t X_t u_s.

Since A11(L) and B(L) have finite lags, T^−1 ∑_{t=1}^T ∑_{s=1}^t X_t u_s converges to Var(T^{−1/2} ∑_t X_t u_t). The result then follows from showing that T^{−1/2} max{ρ_j, ρ_j^{T−1}} → 0. For ρ_j ≤ 1 we have ρ_j > ρ_j^{T−1} and T^{−1/2} ρ_j → 0. For ρ_j = 1 + T^−1 γ_j > 0, by rearrangement T^{−1/2} ρ_j^{T−1} = T^{−1/2} − γ_j T^{−3/2} ∑_{s=0}^{T−1} ρ_j^s → 0 since T^−1 ∑_{s=0}^{T−1} ρ_j^s → ∫_0^1 e^{γ_j λ} dλ < ∞.

For (e), first consider any univariate mean zero random variable x_t with finite variance E[x_t²]. Since max_{t=1,...,T} d̃_{it} is finite by construction for all elements of the vector d̃_t, E[d̃_t x_t]² is finite and T^−1 ∑ d̃_t x_t →p 0 by Chebyshev. We will show that all elements of X_t satisfy the conditions on x_t, plus possibly a smaller order term. The first r×1 elements of X_t are u_{2t−1} = ι′_2 A(L)^{−1} ε_{t−1}, where ι_2 = (0′_{k×r}, I_r, 0′_{p×r})′, which are mean zero since E[ε_t] = 0. Since all roots of A(L) are outside the unit circle this also has finite variance, and so the result holds directly. For ∆(y_{1t−1} − φ_1 d_{t−1}), note that this is (1 − ρL)(y_{1t−1} − φ′_1 d_{t−1}) + (1 − ρ)(y_{1t−2} − φ′_1 d_{t−2}), which in turn is ε_{1t−1} + T^{−1/2}(T(1 − ρ)T^{−1/2}(y_{1t−2} − φ′_1 d_{t−2})). The second piece is of a smaller magnitude than the first; further, ε_{1t−1} has finite variance by Condition 1(a). For ∆(y_{2t−1} − φ_2 d_{t−1}), note that this equals β∆(y_{1t−1} − φ_1 d_{t−1}) + u_{2t−1} and hence has finite variance by the immediately preceding two results. For z_t − φ_3 d_t the result holds by Condition 1(a) directly. For greater lags the results hold for the reasons just given. Hence all elements of T^−1 ∑ d̃_t X_t converge to zero as stated.

For (f), as just shown in the proof of (e), each element of X_t has bounded variance. Since ε_t is a MDS with four moments, via any MDS central limit theorem this term has an asymptotic normal distribution and is thus Op(1). □

Proof of Theorem 1. Rewrite (8) so that stochastic terms have their deterministic components removed (throughout define y^d_{1t} = y_{1t} − φ_1 d_t, y^d_{2t} = y_{2t} − φ_2 d_t and z^d_t = z_t − φ_3 d_t), i.e.

y_{2t} = φ̄′_2 d̄_t + β′ y_{1t−1} + ε_{2t} = φ̃′_2 d_t + β′ y^d_{1t−1} + ε_{2t} = θ′ Q_t + ε_{2t},

where Q_t = [y^{d′}_{1t−1}, d̄′_t]′ and θ = (β′, φ̃′_2)′. Define ΥT to be a diagonal scaling matrix ΥT = diag[T^−1 I_k, Υ_{dT}], where Υ_{dT} is defined in Condition 1. The least squares estimator for θ, denoted θ̂, is such that

ΥT(θ̂ − θ) = (ΥT^−1 ∑_{t=2}^T Q_t Q′_t ΥT^−1)^{−1} ΥT^−1 ∑_{t=2}^T Q_t ε_{2t} = (ΥT^−1 ∑_{t=2}^T Q_t Q′_t ΥT^−1)^{−1} ΥT^−1 ∑_{t=2}^T Q_t ε′_t ι_2,

where ι_2 = (0′_k, 1)′ and 0_j denotes a j×1 vector of zeros.
For the denominator,

ΥT^−1 ∑ Q_t Q′_t ΥT^−1 =
[ T^−2 ∑ y^d_{1t−1} y^{d′}_{1t−1}    T^{−3/2} ∑ y^d_{1t−1} d̃′_t ]
[ ·                                  T^−1 ∑ d̃_t d̃′_t           ]
⇒
[ Ω11^{1/2} ∫MM′ Ω11^{1/2′}    Ω11^{1/2} ∫Ms′ ]
[ ∫sM′ Ω11^{1/2′}              ∫ss′            ],

and for the numerator,

ΥT^−1 ∑ Q_t ε′_t ι_2 = [ T^−1 ∑ y^d_{1,t−1} ε′_t ι_2 ; T^{−1/2} ∑ d̃_t ε′_t ι_2 ].

Using the results of Lemma 1 (noting that each of these pieces is Op(1)) and the continuous mapping theorem and some rearrangement we have

T(β̂ − β) ⇒ Ω11^{−1/2′} (∫M̄M̄′)^{−1} ∫M̄ dW′ Σ^{1/2′} ι_2.   (15)

The limit distribution for the numerator is a weighted average of ∫M̄ dW_1, which has a nonstandard distribution, and ∫M̄ dW_2, which is asymptotically mixed normal. The weights follow from the elements of ι′_2 Σ^{1/2} and are Σ′_{12} Σ11^{−1/2′} and Σ_{2.1}^{1/2} respectively.

The Wald test testing the null hypothesis that β = β_0 against β ≠ β_0 is given by

Wald = T² Σ̂22^{−1} (β̂ − β_0)′ [S_1 (ΥT^−1 ∑ Q_t Q′_t ΥT^−1)^{−1} S′_1]^{−1} (β̂ − β_0) = Σ̂22^{−1} [T(β̂ − β_0)]′ [S_1 (ΥT^−1 ∑ Q_t Q′_t ΥT^−1)^{−1} S′_1]^{−1} [T(β̂ − β_0)],

where S_1 = (I_k, 0_{k×dim(d̄_t)}) and Σ̂22 = T^−1 ∑_{t=2}^T (y_{2t} − θ̂′Q_t)², i.e. the sum of squared residuals of the regression (8) divided by the sample size. Consistency of the OLS estimates gives Σ̂22 = T^−1 ∑_{t=2}^T ε²_{2t} + op(1), which yields T^−1 ∑_{t=2}^T ε²_{2t} →p Σ22.

Hence, from (15) and the results for ΥT^−1 ∑ Q_t Q′_t ΥT^−1 given above, it follows from the continuous mapping theorem and some algebra that

Wald ⇒ Σ22^{−1} ι′_2 Σ^{1/2} ∫dW M̄′ (∫M̄M̄′)^{−1} ∫M̄ dW′ Σ^{1/2′} ι_2
= [∫M̄ dW_1 δ + (1 − δ′δ)^{1/2} ∫M̄ dW_2]′ (∫M̄M̄′)^{−1} [∫M̄ dW_1 δ + (1 − δ′δ)^{1/2} ∫M̄ dW_2],

where δ = Σ11^{−1/2} Σ12 Σ22^{−1/2} and we use

Σ_{2.1} = Σ22 − Σ′_{12} Σ11^{−1} Σ12 = Σ22 (1 − Σ22^{−1/2} Σ′_{12} Σ11^{−1} Σ12 Σ22^{−1/2′}),

so that Σ22^{−1/2} Σ_{2.1}^{1/2} = (1 − δ′δ)^{1/2} and the result is as stated in the text. □

Proof of Theorem 2. Rewrite (9) so that stochastic terms have their deterministic components removed, i.e.

y_{2t} = φ̄′_2 d̄_t + β′ y^d_{1t−1} + α′_2 Z^d_t + ε̃_{2t} = θ′ Q_t + ε̃_{2t},

where Q_t = [y^{d′}_{1t−1}, d̄′_t, Z^{d′}_t]′ and θ = (β′, φ̄′_2, α′_2)′. Define ΥT to be a diagonal scaling matrix ΥT = diag[T^−1 I_k, Υ_{dT}, T^{−1/2} I_{p(q+1)}]. The least squares estimator for θ, denoted θ̂, is such that

ΥT(θ̂ − θ) = (ΥT^−1 ∑_{t=2}^T Q_t Q′_t ΥT^−1)^{−1} ΥT^−1 ∑_{t=2}^T Q_t ε̃_{2t} = (ΥT^−1 ∑_{t=2}^T Q_t Q′_t ΥT^−1)^{−1} ΥT^−1 ∑_{t=2}^T Q_t ε′_t G′ ι_2.

Consider the denominator term

ΥT^−1 ∑ Q_t Q′_t ΥT^−1 =
[ T^−2 ∑ y^d_{1t−1} y^{d′}_{1t−1}    T^{−3/2} ∑ y^d_{1t−1} d̃′_t    T^{−3/2} ∑ y^d_{1t−1} Z′_t ]
[ ·                                  T^−1 ∑ d̃_t d̃′_t              T^−1 ∑ d̃_t Z′_t           ]
[ ·                                  ·                              T^−1 ∑ Z_t Z′_t            ]
⇒
[ Ω11^{1/2} ∫MM′ Ω11^{1/2′}    Ω11^{1/2} ∫Ms′    0   ]
[ ∫sM′ Ω11^{1/2′}              ∫ss′               0   ]
[ 0                            0                  Σ_Z ],

where Σ_Z = E[Z_t Z′_t]. For the numerator,

ΥT^−1 ∑ Q_t ε′_t G′ ι_2 = [ T^−1 ∑ y^d_{1,t−1} ε′_t G′ ι_2 ; T^{−1/2} ∑ d̃_t ε′_t G′ ι_2 ; T^{−1/2} ∑ Z_t ε′_t G′ ι_2 ].

Using the results of Lemma 1 (noting that each of these pieces is Op(1)) and the continuous mapping theorem and some rearrangement we have

T(β̂ − β) ⇒ Ω11^{−1/2′} (∫M̄M̄′)^{−1} ∫M̄ dW′ Σ^{1/2′} G′ ι_2.   (16)

The weights on dW are given by Σ^{1/2′} G′ ι_2, which are

ι′_2 G Σ^{1/2} = [ Σ̃′_{12} Σ̃11^{−1/2′}    Σ̃_{2.1}^{1/2}    0 ]

by judicious choice of Σ^{1/2} (i.e. the Choleski decomposition of the variance covariance matrix with the ordering (ε_{3t}, ε_{1t}, ε_{2t})), where Σ̃_{2.1} = Σ̃22 − Σ̃′_{12} Σ̃11^{−1} Σ̃12 and Σ̃ is as defined in (10). The numerator of the limit distribution has two components, i.e. we can write

∫M̄ dW′ Σ^{1/2′} G′ ι_2 = ∫M̄ dW̃_1′ Σ̃11^{−1/2} Σ̃12 + Σ̃_{2.1}^{1/2} ∫M̄ dW̃_2,

where the first term has a nonstandard distribution and the second is mixed normal. Hence we have as before

T(β̂ − β) ⇒ A(1)′^{−1} Σ11^{−1/2′} (∫M̄M̄′)^{−1} [ ∫M̄ dW̃_1 Σ̃11^{−1/2} (Σ12 − Σ13 Σ33^{−1} Σ′_{23}) + Σ̃22^{1/2} (1 − δ̃′δ̃)^{1/2} ∫M̄ dW̃_2 ].
By analogous results to those in Theorem 1, Σ̂22 = T^−1 ∑_{t=2}^T (y_{2t} − θ̂′Q_t)² = T^−1 ∑_{t=2}^T ε̃²_{2t} + op(1), and application of Chebyshev, using the assumption of fourth moments, yields T^−1 ∑ ε̃²_{2t} →p Σ22 − Σ23 Σ33^{−1} Σ′_{23} = Σ̃22.

The Wald test testing the hypothesis β = β_0 is

Wald = Σ̂22^{−1} (Rθ̂ − r)′ [R (∑ Q_t Q′_t)^{−1} R′]^{−1} (Rθ̂ − r) = Σ̂22^{−1} [T(β̂ − β_0)]′ [R (ΥT^−1 ∑ Q_t Q′_t ΥT^−1)^{−1} R′]^{−1} [T(β̂ − β_0)],

where R = (R_β, 0, 0) and r = β_0, with the partitions of R conformable with θ. Hence, using the asymptotic results for T(β̂ − β_0) and ΥT^−1 ∑ Q_t Q′_t ΥT^−1 given above, it follows from the continuous mapping theorem and some algebra that

Wald ⇒ Σ̃22^{−1} ι′_2 G Σ^{1/2} ∫dW M̄′ (∫M̄M̄′)^{−1} ∫M̄ dW′ Σ^{1/2′} G′ ι_2
= [∫M̄ dW̃_1 δ̃ + (1 − δ̃′δ̃)^{1/2} ∫M̄ dW̃_2]′ (∫M̄M̄′)^{−1} [∫M̄ dW̃_1 δ̃ + (1 − δ̃′δ̃)^{1/2} ∫M̄ dW̃_2],

as stated in the text. For the distribution to have an asymptotic mixed normal distribution we require the weight on ∫M̄ dW̃_1 to be zero, which occurs only if Σ′_{12} − Σ23 Σ33^{−1} Σ′_{13} = 0. Under the restriction Σ′_{12} − Σ23 Σ33^{−1} Σ′_{13} = 0,

T(β̂ − β) ⇒ Ω11^{−1/2′} (∫M̄M̄′)^{−1} ∫M̄ dW̃_2 Σ̃_{2.1}^{1/2} ∼a N(0, Σ̃22 Ω11^{−1/2′} (∫M̄M̄′)^{−1} Ω11^{−1/2})   (17)

conditional on M̄M̄′. When z_t satisfies the conditions for being an orthogonalizing regressor, i.e. Σ′_{12} − Σ23 Σ33^{−1} Σ′_{13} = 0 holds, then conditional on M̄M̄′ the Wald statistic is a χ²_k random variable. □

Proof of Theorem 3. Define the matrix

P =
[ I_k   0     0   ]
[ β     I_r   0   ]
[ 0     0     I_p ].

The model (11) can be written

P Ã(L) P^{−1} [ (1 − ρL) y^d_{1t} ; y^d_{2t} − β y^d_{1t} ; z^d_t ] = P ε_t = ε*_t.

On rearrangement this yields

P Ã(L) P^{−1} [ ∆(y^d_{1t}) + (I_k − ρ)(y^d_{1t−1}) ; ∆(y^d_{2t}) + (y^d_{2t−1}) − β(y^d_{1t−1}) + β(I − ρ)(y^d_{1t−1}) ; z^d_t ] = P ε_t = ε*_t.

Defining Ã(1) = ∑ Ã_i, where the Ã_i are the matrices of coefficients in Ã(L) and the summation is over all the lags (including the zero lag), and rearranging, we can write this as

[ ∆y_{1t} ; ∆y_{2t} ; z_t ] = Ψ̄_1 y^d_{1t−1} + Ψ_2 (y^d_{2t−1} − β y^d_{1t−1}) + Φ d_t + Π(L) [ ∆y^d_{1t−1} ; ∆y^d_{2t−1} ; z^d_{t−1} ] + ε*_t,

where Ψ̄_1 = P Ã(1) ι_1 (ρ − I_k) and Ψ_2 = −P Ã(1) ι_2, and we define the selector matrices ι_1 = (I_k, 0′_{k×r}, 0′_{p×r})′ and ι_2 = (0′_{k×r}, I_r, 0′_{p×r})′. This regression is equivalent to the regression (12), apart from rearranging the deterministic components so that each of the regressors has this term subtracted out, and we have rotated the term

Ψ [ y_{1t−1} ; y_{2t−1} ] = Ψ_1 y_{1t−1} + Ψ_2 y_{2t−1} = (Ψ_1 + Ψ_2 β) y_{1t−1} + Ψ_2 (y_{2t−1} − β y_{1t−1}) = Ψ̄_1 y_{1t−1} + Ψ_2 (y_{2t−1} − β y_{1t−1}),

where Ψ_1 is (k + r + p)×k, Ψ_2 is (k + r + p)×r and Ψ̄_1 = Ψ_1 + Ψ_2 β. The regression is in the Sims et al. (1990) format that allows limit results to be obtained. Define the matrix

X_t = [ (y_{2t−1} − φ_2 d_{t−1}) − β(y_{1t−1} − φ_1 d_{t−1}) ; ∆(y_{1t−1} − φ_1 d_{t−1}) ; ∆(y_{2t−1} − φ_2 d_{t−1}) ; z_{t−1} − φ_3 d_{t−1} ; ⋮ ],

where the remaining rows are the further lags of each of the detrended changes in ∆y_{1t}, ∆y_{2t} and z_t. Define Q_t = [y^{d′}_{1t−1}, d′_t, X′_t]′ and collect the coefficients of the regression into θ; the regression can be written

[ ∆y_{1t} ; ∆y_{2t} ; z_t ] = θ′ Q_t + ε*_t.

Define the diagonal scaling matrix ΥT, partitioned conformably with the rows of Q_t and equal to [T^−1 I_k, Υ_{rT}, T^{−1/2} I_g], where g is the number of rows in X_t. Then

ΥT(θ̂ − θ) = (ΥT^−1 ∑ Q_t Q′_t ΥT^−1)^{−1} ΥT^−1 ∑ Q_t ε*′_t = (ΥT^−1 ∑ Q_t Q′_t ΥT^−1)^{−1} ΥT^−1 ∑ Q_t ε′_t P′,

where the summations are over available observations once the (finite number of) lags have been accounted for. Consider the denominator term

ΥT^−1 ∑ Q_t Q′_t ΥT^−1 =
[ T^−2 ∑ y_{1t−1} y′_{1t−1}    T^{−3/2} ∑ y_{1t−1} d̃′_t    T^{−3/2} ∑ y_{1t−1} X′_t ]
[ ·                            T^−1 ∑ d̃_t d̃′_t             T^−1 ∑ d̃_t X′_t          ]
[ ·                            ·                             T^−1 ∑ X_t X′_t           ]
⇒
[ Ω11^{1/2} ∫M(λ)M(λ)′dλ Ω11^{1/2′}    Ω11^{1/2} ∫Ms(λ)′dλ    0 ]
[ ∫s(λ)M(λ)′dλ Ω11^{1/2′}              ∫s(λ)s(λ)′dλ           0 ]
[ 0                                    0                       K ],

where K is a (matrix) constant. The result for the (1, 1) and (1, 2) elements follows from Lemma 1(a) and (c) respectively. The result in the (2, 2) element of the matrix is Condition 1(e).
The result in the (3, 3) element follows from the stationarity of X_t, and that in the (2, 3) element from Lemma 1(e). For the numerator,

ΥT^−1 ∑_{t=2}^T Q_t ε′_t P′ = [ T^−1 ∑ y_{1t−1} ε′_t P′ ; T^{−1/2} ∑ d̃_t ε′_t P′ ; T^{−1/2} ∑ X_t ε′_t P′ ].

For the first of these terms, T^−1 ∑ y_{1t−1} ε′_t P′ ⇒ Ω11^{1/2} ∫M dW′ Σ^{1/2′} P′ from Lemma 1(b). Similarly, for the second term, T^{−1/2} ∑ d̃_t ε′_t P′ ⇒ ∫ s(λ) dW(λ)′ Σ^{1/2′} P′. Finally, we have from Lemma 1(f) that T^{−1/2} ∑ X_t ε′_t P′ is Op(1). By the asymptotic block diagonality of the denominator and the boundedness of the third term of the numerator, along with algebra as above, we have

T(Ψ̂̄′_1 − Ψ̄′_1) ⇒ Ω11^{−1/2′} (∫M̄M̄′)^{−1} ∫M̄ dW′ Σ^{1/2′} P′.

The convergence of T^−1 ∑ X_t X′_t to the constant K and the boundedness of T^{−1/2} ∑ X_t ε′_t P′ mean that the coefficients on X_t are all consistently estimated; hence Ψ̂_2 →p Ψ_2. Since all coefficients converge to their true values, Σ̂* = T^−1 ∑ ε̂*_t ε̂*′_t →p Σ* = P Σ P′, where ε̂*_t are the least squares residuals from the regression (12).

Define the (k + r)×(k + r) matrix Ω̃, where

Ω̃ =
[ Ω11 − Ω13 Ω33^{−1} Ω′_{13}    Ω12 − Ω13 Ω33^{−1} Ω′_{23} ]
[ Ω′_{12} − Ω23 Ω33^{−1} Ω′_{13}    Ω22 − Ω23 Ω33^{−1} Ω′_{23} ].

Notice that this is the inverse of the upper (k + r)×(k + r) block of Ω^{−1}. Recall Ψ_2 = P Ã(1) ι_2; then

(Ψ̂′_2 Σ̂*^{−1} Ψ̂_2) →p (ι′_2 Ã(1)′ P′ (P Σ P′)^{−1} P Ã(1) ι_2) = (ι′_2 Ã(1)′ Σ^{−1} Ã(1) ι_2).

From the definition of the suggested estimator (13),

T(β̂ − β)′ = −(Ψ̂′_2 Σ̂*^{−1} Ψ̂_2)^{−1} (Ψ̂′_2 Σ̂*^{−1} T(Ψ̂̄_1 + Ψ̂_2 β)′) = −(Ψ̂′_2 Σ̂*^{−1} Ψ̂_2)^{−1} Ψ̂′_2 Σ̂*^{−1} T(Ψ̂̄_1 − Ψ̄_1)′ − (Ψ̂′_2 Σ̂*^{−1} Ψ̂_2)^{−1} Ψ̂′_2 Σ̂*^{−1} T Ψ̄′_1
⇒ Ω̃_{2.1} ι′_2 Ω^{−1/2′} ∫dW M̄′ (∫M̄M̄′)^{−1} Ω11^{−1/2} + Ω̃′_{12} Ω̃11^{−1} Γ,

where −(Ψ′_2 Σ*^{−1} Ψ_2)^{−1} Ψ′_2 Σ*^{−1} T Ψ̄′_1 →p Ω̃′_{12} Ω̃11^{−1} Γ and M̂ is the upper left k×k block of ΥT^−1 ∑ Q_t Q′_t ΥT^−1. This distribution has a center that depends on the nuisance parameters of the model. When z_t are orthogonalizing covariates, Ω̃12 = 0 and this collapses to the distribution

T(β̂ − β)′ ⇒ Ω̃_{2.1} ι′_2 Ω^{−1/2′} ∫dW M̄′ (∫M̄M̄′)^{−1} Ω11^{−1/2},

which is free of nuisance parameters. The stochastic part of this result has a mixed normal distribution. The weights ι′_2 Ω^{−1/2′} on ∫dW M̄′ place zero weight on ∫dW_1 M̄, which is the element with a nonstandard distribution, where Ω_{3.21}^{−1} = Ω33^{−1}(I + [Ω′_{13}, Ω′_{23}]′ Ω̃ [Ω′_{13}, Ω′_{23}] Ω33^{−1}). It follows directly that the variance covariance matrix of this sum of mixed normals is ι′_2 Ω^{−1} ι_2 = Ω̃_{2.1}^{−1}.

Finally, ι′_2 Ω^{−1/2′} W(λ) is r×1 with variance covariance matrix ι′_2 Ω^{−1/2} Ω^{−1/2′} ι_2 λ = ι′_2 Ω^{−1} ι_2 λ = Ω̃_{2.1}^{−1} λ, and so is equivalent to Ω̃_{2.1}^{−1/2} W̃(λ), where W̃(λ) is a standard r×1 Brownian motion. This allows the simplified notation

T(β̂ − β)′ ⇒ Ω̃_{2.1}^{1/2} ∫dW̃ M̄′ (∫M̄M̄′)^{−1} Ω11^{−1/2}. □

Proof of Theorem 4. For Wald tests of hypotheses of the form Rvec(β) = r, the test statistic is

Wald = T² (Rvec(β̂) − r)′ [R (M̂ ⊗ (Ψ̂′_2 Σ̂*^{−1} Ψ̂_2)^{−1})^{−1} R′]^{−1} (Rvec(β̂) − r),

where M̂ is as defined above. Using the asymptotic results just given, it follows from the continuous mapping theorem and some algebra that

Wald ⇒ [Rvec(Ω̃_{2.1} ι′_2 Ω^{−1/2′} ∫dW M̄′ (∫M̄M̄′)^{−1} Ω11^{−1/2} + Ω̃′_{12} Ω̃11^{−1} Γ)]′ [R(Ω11^{−1/2′} (∫M̄M̄′)^{−1} Ω11^{−1/2} ⊗ Ω̃_{2.1}) R′]^{−1} [Rvec(Ω̃_{2.1} ι′_2 Ω^{−1/2′} ∫dW M̄′ (∫M̄M̄′)^{−1} Ω11^{−1/2} + Ω̃′_{12} Ω̃11^{−1} Γ)].

When z_t are orthogonalizing covariates, Ω̃12 = 0 and, using the simplified notation above,

Wald ⇒ [Rvec(Ω̃_{2.1}^{1/2} ∫dW̃ M̄′ (∫M̄M̄′)^{−1} Ω11^{−1/2})]′ [R(Ω11^{−1/2′} (∫M̄M̄′)^{−1} Ω11^{−1/2} ⊗ Ω̃_{2.1}) R′]^{−1} [Rvec(Ω̃_{2.1}^{1/2} ∫dW̃ M̄′ (∫M̄M̄′)^{−1} Ω11^{−1/2})].

Conditional on M̄M̄′, this distribution is equivalent to a χ²_q, where the number of degrees of freedom q equals the number of restrictions (the row rank of R). □
Proof of Theorem 5. We will prove the result for the cointegrating case; the result for the predictive case is identical, replacing Ω with Σ. Asymptotically efficient estimators for β in the absence of z_t are such that T(vec(β̂ − β)) is asymptotically mixed normal with variance covariance matrix Ω_{2.1} ⊗ (Ω11^{1/2} ∫M̄M̄′ Ω11^{1/2′})^{−1}. It follows directly from Theorem 4 that when ρ is unknown but we do have orthogonalizing covariates, T(vec(β̂ − β)) is asymptotically mixed normal with variance covariance matrix Ω̃22 ⊗ (Ω11^{1/2} ∫M̄M̄′ Ω11^{1/2′})^{−1}, where we have used the restriction Ω̃12 = 0 for z_t that satisfy the conditions for being orthogonalizing covariates. Hence the latter estimator is asymptotically more efficient if Ω_{2.1} − Ω̃22 is positive definite. We have

Ω_{2.1} − Ω̃22 = (Ω22 − Ω′_{12} Ω11^{−1} Ω12) − (Ω22 − Ω23 Ω33^{−1} Ω′_{23}) = Ω23 Ω33^{−1} Ω′_{23} − Ω′_{12} Ω11^{−1} Ω12 = Ω23 Ω33^{−1} (Ω33 − Ω′_{13} Ω11^{−1} Ω13) Ω33^{−1} Ω′_{23},

where the last line follows since for z_t to be an orthogonalizing covariate Ω12 = Ω13 Ω33^{−1} Ω′_{23}. Now Ω33 − Ω′_{13} Ω11^{−1} Ω13 is the variance covariance matrix of a zero frequency projection of z_t on u_{1t} and is hence positive definite, Ω33^{−1} is positive definite, and so this quadratic form is positive semi definite and the result holds. □

References

Amihud, Y., Hurvich, C., 2004. Predictive regression: a reduced bias estimation method. Journal of Financial and Quantitative Analysis 39, 813–841.
Basawa, I., Mallik, A., McCormick, W., Reeves, J., Taylor, R., 1991. Bootstrapping unstable first-order autoregressive processes. Annals of Statistics 19, 1098–1101.
Campbell, J., Yogo, M., 2006. Efficient tests of stock return predictability. Journal of Financial Economics 81, 27–60.
Cavanagh, C., Elliott, G., Stock, J., 1995. Inference in models with nearly integrated regressors. Econometric Theory 11, 1131–1147.
Dickey, D., Fuller, W., 1979. Distribution of the estimators for autoregressive time series with a unit root. Journal of the American Statistical Association 74, 427–431.
Elliott, G., 1998. The robustness of cointegration methods when regressors almost have unit roots. Econometrica 66, 149–158.
Elliott, G., Rothenberg, T., Stock, J., 1996. Efficient tests for an autoregressive unit root. Econometrica 64, 813–836.
Elliott, G., Stock, J., 1994. Inference in models with nearly integrated regressors. Econometric Theory 10, 672–700.
Fama, E., 1975. Short-term interest rates as predictors of inflation. American Economic Review 65, 269–282.
Fama, E., French, K., 1988. Dividend yield and expected stock returns. Journal of Financial Economics 22, 3–25.
Hall, R., 1978. Stochastic implications of the life cycle-permanent income hypothesis: theory and evidence. Journal of Political Economy 86, 971–987.
Johansen, S., 1988. Statistical analysis of cointegrating models. Journal of Economic Dynamics and Control 12, 231–254.
Johansen, S., 1991. Estimation and hypothesis testing of cointegration vectors in Gaussian vector autoregressive models. Econometrica 59, 1551–1580.
Lewellen, J., 2004. Predicting returns with financial ratios. Journal of Financial Economics 74, 209–235.
Mankiw, N., Shapiro, M., 1986. Do we reject too often: small sample properties of tests of rational expectations models. Economics Letters 20, 139–145.
Moreira, M., Jansson, M., 2006. Optimal inference in regression models with nearly integrated regressors. Econometrica 74, 681–714.
Phillips, P., 1987. Towards a unified asymptotic theory for autoregression. Biometrika 74, 535–547.
Phillips, P., Hansen, B., 1990. Statistical inference in instrumental variables regression with I(1) processes. Review of Economic Studies 57, 99–125.
Saikkonen, P., 1991. Asymptotically efficient estimation in cointegrating systems. Econometric Theory 7, 1–21.
Saikkonen, P., 1992. Estimation and testing of cointegrating systems by autoregressive approximation. Econometric Theory 8, 1–27.
Sims, C., Stock, J., Watson, M., 1990. Inference in linear time series models with some unit roots. Econometrica 58, 113–144.
Stambaugh, R., 1999. Predictive regressions. Journal of Financial Economics 54, 375–421.
Stock, J., 1994. Unit roots, structural breaks and trends. In: Engle, R., McFadden, D. (Eds.), Handbook of Econometrics, vol. 4. North Holland, New York, pp. 2740–2841.
Stock, J., Watson, M., 1993. A simple estimator of cointegrating vectors in higher order integrated systems. Econometrica 58, 1035–1056.
A semiparametric panel model for unbalanced data with application to climate change in the United Kingdom✩

Alev Atak a, Oliver Linton b,∗, Zhijie Xiao c,d,1

a Department of Economics, Queen Mary, University of London, Mile End Road, London E1 4NS, United Kingdom
b Department of Economics, London School of Economics, Houghton Street, London WC2A 2AE, United Kingdom
c Department of Economics, Boston College, Chestnut Hill, MA 02467, USA
d School of Economics and Management, Tsinghua University, Beijing, China

Article history: Available online 26 February 2011
JEL classification: C13; C14; C21; D24
Keywords: Global warming; Kernel estimation; Semiparametric; Trend analysis

Abstract: This paper is concerned with developing a semiparametric panel model to explain the trend in UK temperatures and other weather outcomes over the last century. We work with the monthly averaged maximum and minimum temperatures observed at the twenty six Meteorological Office stations. The data is an unbalanced panel. We allow the trend to evolve in a nonparametric way so that we obtain a fuller picture of the evolution of common temperature in the medium timescale. Profile likelihood estimators (PLE) are proposed and their statistical properties are studied. The proposed PLE has improved asymptotic properties compared with the sequential two-step estimators. Finally, forecasting based on the proposed model is studied.

1. Introduction

The partially linear regression model was introduced in Engle et al. (1986),

y = β⊤X + θ(Z) + ε,   (1)

where θ(·) is an unknown scalar function and ε is a zero mean error orthogonal to both X and θ(·). This model embodies a compromise between employing a general nonparametric specification g(X, Z), which, if the conditioning variables are high dimensional, would lead to serious loss of precision, and a fully parametric specification, which may result in badly biased estimators and inconsistent hypothesis tests. The implicit asymmetry between the effects of X and Z may be attractive when X consists of dummy or categorical variables, as in Stock (1989, 1991). This specification arises in various sample selection models; see Ahn and Powell (1993), Newey et al. (1990), and Lee et al. (1997). It is also the basis

✩ We thank a guest editor and two referees for helpful comments on an early version and Jaanaki Momaya for helpful research assistance.
∗ Corresponding author. E-mail addresses: [email protected] (A. Atak), [email protected] (O. Linton), [email protected] (Z. Xiao).
1 Tel.: +1 617 552 1709.
of a general specification test for functional form introduced in Delgado and Stengos (1994). The model has been used in a number of applications. We will use a panel-data version of this model to model climate change. The issue of global warming has received a great deal of attention recently. This paper is concerned with developing a semiparametric model to describe the trend in UK regional temperatures and other weather outcomes over the last century. The data we work with conditions the analysis we propose. We work with the monthly averaged maximum and minimum temperatures observed at the twenty six Meteorological Office stations. The data is an unbalanced panel. We propose a semiparametric partial linear panel model in which there is a common trend component that is allowed to evolve in a nonparametric way. This permits the most general possible pattern for the evolution of a common secular change in temperature. We also allow for a deterministic seasonal component in temperature, since we are working with monthly data. Gao and Hawthorne (2006) used a univariate partially linear model to explain annual global temperature in terms of a nonparametric time trend and a covariate, the southern oscillation index (SOI). They applied the existing theory to deduce the properties of their estimators and developed a new adaptive test of the shape of the trend function. See Campbell and Diebold (2005) for some alternative analysis of multivariate climate time series data. Pateiro-Lopez
and Gonzalez-Manteiga (2006) worked with a multivariate model with cross-sectionally correlated errors and different trends for each series. They establish distribution theory for the parametric components and derive the bias and variance of the nonparametric components. Their setting is similar to ours, except that we impose a common trend structure. Furthermore, the covariates in our parametric part are also common and deterministic, as they represent seasonality. Most importantly, we allow for an unbalanced dataset, which is important in applications. This difference has important implications for efficient estimation. The asymptotic framework we work with allows a non-trivial fraction of the data to be missing. We propose to use a profile likelihood method, which in the unbalanced case is different from the sequential two-step least squares method proposed by Robinson (1988) in the univariate case and employed by Pateiro-Lopez and Gonzalez-Manteiga (2006) in the multivariate case. This method is fully efficient in the Gaussian case, as established in Severini and Wong (1992). Finally, we allow for heteroskedasticity and serial correlation in the error terms. We apply our methods to the UK dataset. We show the nonparametric trend in comparison with a more standard parametric approach. In both cases there is an upward trend over the last twenty years that is statistically significant. We compare our results with those obtained by Gao and Hawthorne (2006). We also use our model to forecast future temperature.

2. Model and data

The subject in which we are interested is monthly temperatures {yit}, where i signifies different stations and t is the corresponding time when the temperature is recorded, t = 1, ..., T and i = 1, ..., n. In practice, there may be missing data in the sense that some stations began keeping records before other stations. In our application, Oxford started in 1857, while Cardiff Bute Park only began in 1977. So we suppose that station i starts at time ti, i = 1, ..., n; thus records for station i are only available from time ti to T. Order the stations by their starting point so that t1 ≤ t2 ≤ ··· ≤ tn < T. The complete record occurs after tn. At any point in time there are nt stations available, with nt varying from one to n. The most general model we consider is of the following form:

yit = αi + βi⊤ Dt + γi⊤ Xit + gi(t/T) + εit,

where i = 1, ..., n and t = ti, ..., T. Here Dt ∈ R^d is a vector of seasonal dummy variables, Xit is a vector of observed covariates, and the error terms εit satisfy E(εit|Xit) = 0 a.s. The functions gi(·) are unknown but smooth; they represent the trend in temperatures at location i. We shall further assume that gi(·) = g(·), so that there is a single common trend, which imposes a standard way of thinking about climate change. For simplicity we also dispense with the additional covariates X (in our application we are concerned with documenting the temperature record rather than assigning changes to particular causes). The parameter vector θ = (α1, ..., αn, β1⊤, ..., βn⊤)⊤ is unknown and describes the seasonal and level effects for the different locations. The model is not identified as it stands, since one can add a constant to each αi and subtract the same constant from g(·). For identification we suppose that ∑_{i=1}^n αi = 0, in which case the function g(·) represents the common level of average temperature relative to average seasonal variation. According to Wikipedia (2009): ''Climate change is any long-term significant change in the ''average weather'' of a region or the earth as a whole. Average weather may include average temperature, precipitation and wind patterns''. Our model directly permits the measuring of this average weather trend through the function g(·).
In doing the asymptotics we suppose that T → ∞ but n is fixed (in fact n = 26 in our application). In conclusion, the model we adopt for the application is as follows:

yit = αi + βi⊤ Dt + g(t/T) + εit,   (2)

where the error term may be heteroskedastic across i and serially correlated over time. Let βi⊤ = (βi1, ..., βid). We can write the model as

y = Aα + ∑_{j=1}^d Cj βj + Bg + ε,   (3)

where y, ε are the nT × 1 data and error vectors with zeros in place of missing observations, while α ∈ R^n, g = (g(1/T), ..., g(1))⊤ ∈ R^T, and βj = (β1j, ..., βnj) ∈ R^n. Here A, B are matrices of conformable dimensions of zeros and ones that reflect the commonality and the missingness as well; see below. The matrix Cj contains the dummy variable Dj. This representation is different from Eq. (2) of Pateiro-Lopez and Gonzalez-Manteiga (2006); it allows for the ''missingness'' of data in some observation units and preserves a simple algebraic structure that is useful in what follows. Suppose n = 2 and T = 3, and for simplicity that d = 0, i.e., no seasonal effect. Then
( y11 )   ( 1 0 )            ( 1 0 0 )           ( ε11 )
( y12 )   ( 1 0 ) ( α1 )     ( 0 1 0 ) ( g1 )    ( ε12 )
( y13 ) = ( 1 0 ) ( α2 )  +  ( 0 0 1 ) ( g2 )  + ( ε13 )
( y22 )   ( 0 1 )            ( 0 1 0 ) ( g3 )    ( ε22 )
( y23 )   ( 0 1 )            ( 0 0 1 )           ( ε23 ).
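As a check on this representation, here is a small sketch constructing the selector matrices A and B of (3) from the start dates ti for the unbalanced example above (n = 2, T = 3, station 2 starting at t = 2); the function name and layout are illustrative choices, not from the paper.

```python
import numpy as np

def design_matrices(T, starts):
    """Stack station blocks: A assigns alpha_i, B picks g(t/T) for observed t."""
    A_blocks, B_blocks = [], []
    n = len(starts)
    for i, t0 in enumerate(starts):          # t0 is the 1-based start time t_i
        rows = T - t0 + 1
        Ai = np.zeros((rows, n)); Ai[:, i] = 1.0
        Bi = np.zeros((rows, T))
        Bi[np.arange(rows), np.arange(t0 - 1, T)] = 1.0
        A_blocks.append(Ai); B_blocks.append(Bi)
    return np.vstack(A_blocks), np.vstack(B_blocks)

A, B = design_matrices(T=3, starts=[1, 2])
print(A)   # [[1 0] [1 0] [1 0] [0 1] [0 1]]
print(B)   # rows pick g1, g2, g3 for station 1 and g2, g3 for station 2
```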
3. Profile likelihood estimation

Our model may be estimated using different nonparametric methods. We consider in this paper the widely used kernel estimators. Specifically, we consider the Gaussian profile likelihood procedure for the general unbalanced case; see the additional discussion in Remarks 2 and 3 for advantages of using profile likelihood estimation. This in general leads to semiparametrically efficient estimators (Severini and Wong, 1992).

3.1. The estimator of g

We first define the local profile likelihood in the local parameter η ∈ R:

L(η; t/T) = ∑_{s=1}^T ∑_{i∈I_s} (yis − αi − βi⊤ Ds − η)² Kh((t − s)/T) = ∑_{i=1}^n ∑_{s=ti}^T (yis − αi − βi⊤ Ds − η)² Kh((t − s)/T),

where I_s denotes the set of stations available at time s, which is of cardinality n_s, and we assume the ordering of the stations is consistently chosen. Here K is a kernel function and h is a bandwidth, so that Kh(·) = K(·/h)/h. The first derivative with respect to η is given by

∂L(η; t/T)/∂η = −2 ∑_{s=1}^T ∑_{i∈I_s} (yis − αi − βi⊤ Ds − η) Kh((t − s)/T),
so that

η̂ = g_θ(t/T) = [ T^−1 ∑_{i=1}^n ∑_{s=ti}^T (yis − αi − βi⊤ Ds) Kh((t − s)/T) ] / [ T^−1 ∑_{i=1}^n ∑_{s=ti}^T Kh((t − s)/T) ]
= [ T^−1 ∑_{s=1}^T Kh((t − s)/T) ∑_{i∈I_s} (yis − αi − βi⊤ Ds) ] / [ T^−1 ∑_{s=1}^T Kh((t − s)/T) n_s ].

Notice that if we standardize the kernel so that T^−1 ∑_{s=1}^T Kh(u − s/T) = 1, then, when T is large, m_t = m, where m_t = T^−1 ∑_{i=1}^n ∑_{s=ti}^T Kh((s − t)/T), for all t with t_m/T < t/T < t_{m+1}/T.

3.2. The estimator of θ

The global profile likelihood in the parameter vector θ is given by

L(θ; g_θ) = ∑_{j=1}^n ∑_{t=tj}^T (yjt − αj − βj⊤ Dt − g_θ(t/T))².

We maximize this subject to the constraint that ∑_{i=1}^n αi = 0, equivalently finding the first order condition of the Lagrangian L(θ, λ) = L(θ; g_θ) + λ ∑_{i=1}^n αi. The first derivatives of L with respect to θ are:

∂L(θ; g_θ)/∂αi = 2 ∑_{j=1}^n ∑_{t=tj}^T εjt(θ) ∂εjt(θ)/∂αi;   ∂L(θ; g_θ)/∂βi = 2 ∑_{j=1}^n ∑_{t=tj}^T εjt(θ) ∂εjt(θ)/∂βi,

where εjt(θ) = yjt − αj − βj⊤ Dt − g_θ(t/T) and

∂εjt(θ)/∂αi = { −1 − ∂g_θ(t/T)/∂αi if j = i;  −∂g_θ(t/T)/∂αi else },
∂εjt(θ)/∂βi = { −Dt − ∂g_θ(t/T)/∂βi if j = i;  −∂g_θ(t/T)/∂βi else },

for i = 1, ..., n, where

∂g_θ(t/T)/∂αi = −(1/m_t) T^−1 ∑_{s=ti}^T Kh((t − s)/T) → { −1/m_t, i ≤ m_t; 0, i > m_t } as T → ∞,
∂g_θ(t/T)/∂βi = −(1/m_t) T^−1 ∑_{s=ti}^T Kh((t − s)/T) Ds → { −(1/12m_t) i11, i ≤ m_t; 0_11, i > m_t } as T → ∞,

do not depend on the unknown parameters. The profile likelihood equations are linear in θ and can be solved explicitly to give the constrained estimators θ̂. We then define the nonparametric estimator ĝ(u) = g_{θ̂}(u).

3.3. In matrix notation

We may re-write the vector of η̂ as

ĝ_θ = (g_θ(1/T), ..., g_θ(1))⊤ = (i_n⊤ ⊗ K)(y − Aα − ∑_{j=1}^d Cj βj),

where K is the T × T smoother matrix with typical element K_ts = Kh((t − s)/T)/(m_t T), and m_t = T^−1 ∑_{i=1}^n ∑_{s=ti}^T Kh((s − t)/T). In matrix notation the profile likelihood estimator solves

min_{θ: α⊤ i_n = 0} (y − Aα − ∑_{j=1}^d Cj βj − B ĝ_θ)⊤ (y − Aα − ∑_{j=1}^d Cj βj − B ĝ_θ),

or equivalently, since ĝ_θ is linear in y,

min_{θ: α⊤ i_n = 0} (ỹ − X̃θ)⊤ (ỹ − X̃θ),   (4)

where θ = (α⊤, β1⊤, ..., βd⊤)⊤ ∈ R^{n(d+1)} and X̃ = (Ã, C̃1, ..., C̃d) is nT by n(d + 1), with ỹ = My, Ã = MA, and C̃j = MCj, where M = I_{nT} − B(i_n⊤ ⊗ K). Ignoring the restriction, we can write the above first order conditions in the matrix form X̃⊤X̃θ = X̃⊤ỹ, except that X̃⊤X̃ is singular. Define q = (1, ..., 1, 0, ..., 0)⊤; then the linear restriction is represented as q⊤θ = 0. Define the matrix R, a k × (k − 1) matrix where k = n(d + 1), such that (q, R) is non-singular and R⊤q = 0 (Amemiya, 1985, Section 1.4). In this case we can take

R = [ R1 R2 ; R3 R4 ],   R1 = [ I_{n−1} ; −i_{n−1}′ ] (an n × (n − 1) matrix),   R4 = I_{nd×nd},

where i_{n−1} is the (n − 1) × 1 vector of ones, and R2, R3 are matrices of zeros of conformable dimensions. It follows that for the profile likelihood estimator subject to the linear restriction q⊤θ = 0, we have

θ̂ = R(R⊤X̃⊤X̃R)^{−1} R⊤X̃⊤ỹ,

where R⊤X̃⊤X̃R is non-singular.² Then

ĝ = (i_n⊤ ⊗ K)(y − Aα̂ − ∑_{j=1}^d Cj β̂j).
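The following sketch implements the estimator just described in the simplified case with no seasonal dummies (d = 0), so that θ = α and X̃ = Ã. The Gaussian kernel and all names are assumptions made for illustration, not the paper's code; it concentrates out g via the smoother matrix, imposes ∑_i αi = 0 through the reparameterization of footnote 2, and then recovers ĝ.

```python
import numpy as np

def profile_fit(y_list, starts, T, h):
    """Profile likelihood fit of y_it = alpha_i + g(t/T) + eps_it (d = 0).

    y_list[i] holds station i's data for t = starts[i], ..., T (1-based).
    Returns (alpha_hat, g_hat) with sum(alpha_hat) = 0.
    """
    n = len(y_list)
    y = np.concatenate([np.asarray(v, dtype=float) for v in y_list])
    t_obs = np.concatenate([np.arange(s, T + 1) for s in starts])
    station = np.concatenate([np.full(T - s + 1, i) for i, s in enumerate(starts)])
    A = np.zeros((y.size, n)); A[np.arange(y.size), station] = 1.0

    # Row t of the smoother S averages every observed (i, s) with weight
    # K_h((t - s)/T), normalized to sum to one (the paper's K_ts = K_h/(m_t T)).
    grid = np.arange(1, T + 1)[:, None] / T
    W = np.exp(-0.5 * ((grid - t_obs[None, :] / T) / h) ** 2)  # Gaussian kernel
    S = W / W.sum(axis=1, keepdims=True)                       # T x (num obs)

    # Concentrate out g: g_theta = S @ (y - A alpha), so premultiply by
    # M = I - P S, where P repeats the time-t smoother row for each observation.
    P = np.zeros((y.size, T)); P[np.arange(y.size), t_obs - 1] = 1.0
    y_t, A_t = y - P @ (S @ y), A - P @ (S @ A)

    # Impose sum(alpha) = 0 via alpha = R a with R = [I_{n-1}; -1'].
    R = np.vstack([np.eye(n - 1), -np.ones((1, n - 1))])
    a = np.linalg.lstsq(A_t @ R, y_t, rcond=None)[0]
    alpha = R @ a
    return alpha, S @ (y - A @ alpha)
```

In the full application the monthly dummies would enter exactly like A, via the matrices Cj of (3); the per-station block computation described in the next paragraph avoids ever forming nT × nT arrays.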
In computing the least squares estimators in our application we take some additional steps, because T is very large, 1858 in fact. We partition A = (A1⊤, ..., An⊤)⊤ and B = (B1⊤, ..., Bn⊤)⊤, where Aj and Bj are T × n and T × T matrices respectively. Then, for example, MA = A − ((B1 K ∑_{j=1}^n Aj)⊤, ..., (Bn K ∑_{j=1}^n Aj)⊤)⊤, where Bj K Aj is a T × n matrix. In this way one can avoid matrices of dimension nT × nT or even nT × T, which are too large to fit into memory.

² Note that R⊤α = (α1, ..., α_{n−1})⊤. We can interpret the above as a reparameterization to θ = (α1, ..., α_{n−1}, β1⊤, ..., βn⊤)⊤ with αn = −∑_{i=1}^{n−1} αi, and then changing A → A* in (3) to reflect the different structure. For example, in the special case given above, A* = (1, 1, 1, 0, −1, −1)⊤. Then compute θ̂ by an unconstrained regression.

4. Asymptotic properties

In this section we present the asymptotic properties of the estimators defined above. The following conditions are quite standard in kernel estimation. For the convenience of the asymptotic analysis, we introduce β-mixing (absolute regularity), which is defined as follows. A stationary process {(ξt, Ft), −∞ < t < ∞} is said to be β-mixing (or absolutely regular) if the mixing coefficient β(n), defined by

β(n) = E sup_{A∈F^∞_{t+n}} |P(A|F^t_{−∞}) − P(A)|,

converges to zero as n → ∞. β-mixing includes many linear and nonlinear time series models as special cases; see Doukhan (1994) for more discussion of mixing.
Assumption A.
1. For each i, εit is stationary β-mixing with mixing decay rate βit satisfying lim sup_t b^t max_{1≤i≤n} βit < ∞ for some b > 1; ∑_{h=−∞}^∞ E(εit εit+h) = ωi² and si² = ∑_{k=−∞}^∞ E(εit εi,t+12k), with 0 < ω ≤ min_{1≤i≤n} ωi ≤ max_{1≤i≤n} ωi ≤ ω̄ < ∞.
2. The function g: [0, 1] → R is continuously differentiable up to order τ ≥ p.
3. The kernel K has support [−1, 1], is symmetric about zero, and satisfies ∫K(u)du = 1. In addition, ∫u^j K(u)du = 0, j = 1, ..., p − 1, and ∫u^p K(u)du ≠ 0. Define μ_p(K) = ∫u^p K(u)du and ‖K‖²₂ = ∫K²(z)dz.
4. The bandwidth satisfies: (a) as T → ∞, h → 0, Th → ∞, and Th^{2p} → 0; (b) h = c_T T^{−1/(2p+1)} with 0 < lim inf c_T ≤ lim sup c_T < ∞.

Assumption A1 is a typical assumption in the time series literature and ensures that εit is stationary with weak dependence, so that appropriate limiting theory can be applied. This condition is useful in our technical development and no doubt could be replaced by a range of similar assumptions. Assumption A2 concerns the smoothness of the trend function and ensures a Taylor expansion to the appropriate order. Assumption A3 on the kernel function and Assumption A4 on the bandwidth are quite standard in nonparametric estimation: in part (a), the bandwidth is chosen to ensure root-T asymptotics for parametric quantities; in part (b), the bandwidth is chosen to be optimal for estimation of the nonparametric component.

The asymptotics depend on our assumptions about t1 ≤ t2 ≤ ··· ≤ tn. In the simplest case, when t1 ≤ t2 ≤ ··· ≤ tn are finite numbers, the asymptotic results are the same as those with complete data: the differences in the starting dates are asymptotically ignorable, and the asymptotic distributions are unaffected. We shall instead assume that ti → ∞ in such a way that

ti = ⌊ri T⌋,   (5)

where ri ∈ (0, 1) for i = 1, ..., n (and r_{n+1} = 1), in which case the starting times affect the estimators asymptotically.

To present the main result we need some notation. Let a_{kj} = ∑_{s=j}^n (r_{s+1} − r_s)/s^k, k = 1, 2, 3, 4, δi = (1 − ri − 2a_{1i} + a_{2i}), fi = (n + 2)a_{2,i} − 2a_{1,i} − na_{3,i}, and λi = (n²a_{4,i} − 4na_{3,i} + 4a_{2,i}), and let

Ωn = diag{δ1 ω1², ..., δi ωi², ..., δn ωn²},   Sn = diag{δ1 s1², ..., δi si², ..., δn sn²},   ∆n = diag{1, ..., 1 − ri, ..., 1 − rn}.

In addition, let An be the n × n symmetric matrix whose (i, j)th element is

[An]_{i,j} = λi ∑_{l=1}^{i−1} ωl² + ∑_{l=i+1}^n λl ωl²,   i = j;
[An]_{i,j} = fi ωj² + ωi² + λi ∑_{l≠j, l<i} ωl² + ∑_{l>i} λl ωl²,   j < i.

Gn is given in Box I. Then define the matrices

Q = [ ∆n + Gn                      (1/12)(∆n + Gn) ⊗ i11⊤ ]
    [ (1/12)(∆n + Gn) ⊗ i11        (1/12)∆n ⊗ I11 + (1/12²)Gn ⊗ J11 ],   (6)

Ω = [ Ωn + An                      (1/12)[Ωn + An] ⊗ i11⊤ ]
    [ (1/12)[Ωn + An] ⊗ i11        (1/12)Sn ⊗ I11 + (1/12²)An ⊗ J11 ],   (7)

where i11 is the 11 × 1 vector of ones and J11 = i11 i11⊤ is the 11 × 11 matrix of ones, and

g* = [ b ; (1/12) b ⊗ i11 ],   b = (b1, ···, bi, ···, bn)⊤,

bi = (1/p!) μ_p(K) ( ∑_{l=1}^n ∫_{r_l}^1 δ(s) g^{(p)}(s) ds − ∫_{r_i}^1 g^{(p)}(s) ds ),

where δ(s) is a weighting function on [0, 1] with δ(s) = 1/j if r_j < s < r_{j+1}, j = 1, 2, ..., n. We summarize the limiting distributions as follows.

Theorem 1. Suppose that Assumptions A1–A4 hold, and assume that the initial observation conditions are given by (5). Then, as T → ∞,

√T (R⊤θ̂ − R⊤θ + h^p (R⊤QR)^{−1} R⊤g*) ⇒ N(0, (R⊤QR)^{−1} R⊤ΩR (R⊤QR)^{−1}).

Remark 1. The asymptotic distribution of the profile likelihood estimator is complicated largely because of the unbalanced data structure, which affects the limiting distributions under our assumptions.

Remark 2. The partial linear model that we study in this paper may be estimated by other methods; see an early version of this paper, Atak et al. (2008), for studies of other methods. Compared with those estimators, the profile likelihood estimator is a joint estimation of the nonparametric and parametric parts, while the other estimators, such as the traditional methods used in the literature on partial linear regressions, are sequential two-step estimators. It is easy to see that the profile likelihood estimator has a smaller bias term than the two-step estimator.

Remark 3. Heteroskedasticity across i, weak correlation over t, and seasonality all affect the limiting results. These effects are reflected through ωi² and si² in the limits. If we consider the special case with complete data, where all observations start at t = 1, then ri = 0, i = 1, ..., n, r_{n+1} = 1, and we have δ(s) = 1/n for 0 < s < 1. Consequently

bi = (1/p!) μ_p(K) ( ∑_{l=1}^n ∫_0^1 (1/n) g^{(p)}(s) ds − ∫_0^1 g^{(p)}(s) ds ) = 0.

This cancelation occurs because of the recentering due to the parametric part of the model. Thus we have the following simplified asymptotic results for the profile likelihood estimator with complete data. Let

Q̄ = Σ_X − (1/n) Σ_X*,   Σ_X = [ I_n                    (1/12) I_n ⊗ i11⊤ ]
                                 [ (1/12) I_n ⊗ i11      (1/12) I_{11n}     ],   (8)
Box I. Gn is the n × n symmetric matrix with (i, j)th element

[Gn]_{i,j} = m a_{2,m} − 2a_{1,m} + ∑_{l=m+1}^n a_{2,l},   m = max(i, j),

so that the (1, 1) element is a_{21} − 2a_{11} + ∑_{l=2}^n a_{2l} and the final row and column have entries (n a_{2,n} − 2a_{1,n}).

Σ_X* = [ J_n                    (1/12) J_n ⊗ i11⊤ ]
        [ (1/12) J_n ⊗ i11      (1/12²) J_{11n}    ],

and Ω̄ is defined by the same formula (7) with

Ω̄n = diag{ (1 − 1/n)² ω1², ..., (1 − 1/n)² ωi², ..., (1 − 1/n)² ωn² },
S̄n = diag{ (1 − 1/n)² s1², ..., (1 − 1/n)² si², ..., (1 − 1/n)² sn² },

and the (i, j)th element of Ān given by

[Ān]_{i,j} = (1/n²) ∑_{l≠i} ωl²,   i = j;
[Ān]_{i,j} = −(1/n)(1 − 1/n)(ωj² + ωi²) + (1/n²) ∑_{l≠j,i} ωl²,   j < i.

Corollary 1. Suppose that Assumptions A1–A4 hold. In the case with complete data, the profile likelihood estimator has the following asymptotic distribution as T → ∞:

√T (R⊤θ̂ − R⊤θ) ⇒ N(0, (R⊤Q̄R)^{−1} R⊤Ω̄R (R⊤Q̄R)^{−1}).

If we further assume that the εit are i.i.d. with mean zero and variance σ², then Ω̄n = S̄n = (1 − 1/n)² σ² I_n, where I_n is the n-dimensional identity matrix, and the (i, j)th element of Ān is given by

[Ān]_{i,j} = (1/n)(1 − 1/n) σ²,   i = j;   [Ān]_{i,j} = −(1/n) σ²,   j ≠ i.

We next analyze the estimator of the trend function. The asymptotic results for this estimator are summarized in Theorem 2, whose proof is again given in the Appendix.

Theorem 2. Suppose that Assumptions A1–A4 hold, and assume that the initial observation conditions are given by (5). Then, as T → ∞,

√(Th) [ĝ(u) − g(u) − h^p b(u)] ⇒ N(0, (1/m) ω̄m² ‖K‖²₂),   for u ∈ [r_m, r_{m+1}), m = 1, ..., n − 1,
√(Th) [ĝ(u) − g(u) − h^p b(u)] ⇒ N(0, (1/n) ω̄² ‖K‖²₂),   for u > r_n,

where b(u) = (1/p!) g^{(p)}(u) μ_p(K), while ω̄m² = m^{−1} ∑_{i=1}^m ωi² and ω̄² = n^{−1} ∑_{i=1}^n ωi².

In the special case with complete data, we have the following special result.

Corollary 2. Suppose that Assumptions A1–A4 hold and all observations start at t = 1. Then, as T → ∞,

√(Th) [ĝ(u) − g(u) − h^p b(u)] ⇒ N(0, (1/n) ω̄² ‖K‖²₂).   (9)

Remark 4. It is possible to extend the above results to allow for cross-sectional dependence as well, since the CLT comes from the weak dependence in the large time series dimension. Suppose instead that εt = (ε1t, ..., εnt)⊤ = Ξ(t/T)^{1/2} ηt, where the vector ηt = (η1t, ..., ηnt)⊤ is stationary β-mixing with the same decay rate as in Assumption A1, while Ξ(u) is a symmetric positive definite matrix of smooth functions. Let Ψ(s) = E[ηt ηt+s⊤] and Ψ∞ = ∑_{s=−∞}^∞ Ψ(s). Then the asymptotic variance in (9) becomes ‖K‖²₂ i⊤ Ξ(u)^{1/2} Ψ∞ Ξ(u)^{1/2} i / n, where i = (1, 1, ..., 1)⊤. However, the results for θ̂ are much more complicated in this case.

Remark 5. One can also expect Theorem 2 to continue to hold in the case where n → ∞. In this case, the rate of convergence of ĝ(u) is of order 1/√(Tmh), and if u > r_n this rate is 1/√(Tnh). The precise rates attainable depend on the distribution of the sequence r1, r2, ... throughout [0, 1]. However, the asymptotic distribution is the same regardless of whether n is large or not. The corresponding results for θ̂ would have to be rethought in this case, because the dimension of this parameter vector increases.

5. Forecasting

In this section we consider forecasting based on the semiparametric model (2). In particular, we consider q-step forecasting, i.e. forecasting of y_{i,T+q} based on information up to time T. Our primary interest is to forecast y_{i,T+q} with finite q, although our analysis allows for forecasts with q → ∞ under an appropriate expansion rate for q. The common structure in our model allows us to exploit the forecasting gains entailed by these restrictions (a reduction in forecasting variance), which amount to homogeneity restrictions in a panel-data environment. These restrictions were found to be helpful in the empirical application of Hoogstrate et al. (2000) for GDP forecasts. In a recent paper, Issler and Lima (2009) provide a theoretical explanation of why these restrictions might work in practice.

Notice that y_{i,T+q} = αi + βi⊤ D_{T+q} + g(1 + q/T) + ε_{i,T+q}. Therefore, a simple forecast for y_{i,T+q} that ignores the error dynamics can be obtained based on estimators for αi, βi and a predictor of g(1 + q/T) using observations i = 1, ..., n and t ≤ T. Since estimators for αi, βi were studied in the previous sections, we study forecasting of g(1 + q/T) in this section and construct a predictor of y_{i,T+q} using the predicted g(1 + q/T). We are also interested in forecasting the average temperature, ȳ_{T+q} = ∑_{i=1}^n y_{i,T+q}/n, given by

ȳ_{T+q} = β̄⊤ D_{T+q} + g(1 + q/T) + ε̄_{T+q},   (10)

where β̄ = ∑_{i=1}^n βi/n and ε̄_{T+q} = ∑_{i=1}^n ε_{i,T+q}/n.
We first consider the simple case when the {εit}_t are martingale difference sequences. Since forecasting of g(1 + q/T) is the key issue, we note that E_T y_{i,T+q} = αi + βi⊤ D_{T+q} + g(1 + q/T), where E_T denotes conditional expectation given the data. We make the following assumptions to facilitate forecasting the common trend.

A1′ For each i, εit is a martingale difference sequence, E(εit²) = σi², and 0 < σ ≤ min_{1≤i≤n} σi ≤ max_{1≤i≤n} σi ≤ σ̄ < ∞.
A2′ The function g: [0, 1 + ϵ] → R, some ϵ > 0, is continuously differentiable up to order τ ≥ p.
A5 K is a one-sided kernel satisfying (a) K and K′ are continuous on [−1, 0]; (b) μ*₀(K) > 0 and μ*₀(K)μ*₂(K) − μ*₁(K)² > 0, where μ*_j(K) = ∫_{−1}^0 u^j K(u)du.
A6 The bandwidth h satisfies A4(a) and the bandwidth h₁ satisfies h/h₁ → 0 as T → ∞.

We construct a local polynomial predictor for g(1 + q/T). Notice that g(·) is a smooth function under Assumption A2′; therefore, when T → ∞ and q/T → 0, by a Taylor expansion of g(·) around u = 1 to the τth order (τ = p − 1),

g(1 + q/T) = ∑_{k=0}^τ (1/k!) g^{(k)}(1) (q/T)^k + o((q/T)^τ) = ∑_{k=0}^τ γ_k (q/T)^k + o((q/T)^τ).

As will become clearer later in this section, forecasting at time T is largely affected by the data close to time T. We let

ỹ_t = n^{−1} ∑_{i=1}^n (y_{it} − α̂_i − β̂_i⊤ D_t) = ȳ_t − β̄̂⊤ D_t,

for t_n ≤ t ≤ T. Let K(·) be a one-sided kernel whose properties are given in Assumption A5; we consider the following local polynomial estimation at the end point T:

min_γ ∑_{t=1}^T K((T − t)/(Th₁)) ( ỹ_t − ∑_{k=0}^τ γ_k ((t − T)/T)^k )²,   (11)

where h₁ is a bandwidth parameter satisfying Assumption A6. We summarize the asymptotic behavior of the local polynomial estimator (11) in the following theorem. Let

B(K) = (1/(τ + 1)!) g^{(τ+1)}(1) [ μ*_{τ+1}(K) ; μ*_{τ+2}(K) ; ... ; μ*_{2τ+1}(K) ],

M(K) = [ μ*₀(K)      μ*₁(K)      ···  μ*_τ(K)    ]
       [ μ*₁(K)      μ*₂(K)      ···  μ*_{τ+1}(K) ]
       [ ···         ···         ···  ···          ]
       [ μ*_τ(K)     μ*_{τ+1}(K) ···  μ*_{2τ}(K)  ],

V(K) = [ ν*₀(K)      ν*₁(K)      ···  ν*_τ(K)    ]
       [ ν*₁(K)      ν*₂(K)      ···  ν*_{τ+1}(K) ]
       [ ···         ···         ···  ···          ]
       [ ν*_τ(K)     ···         ···  ν*_{2τ}(K)  ],

where μ*_k(K) = ∫_{−1}^0 K(u)u^k du and ν*_j(K) = ∫_{−1}^0 u^j K²(u)du. Let also D_h = diag(1, h, ..., h^τ).

Theorem 3. Suppose that Assumptions A1, A2′, A3, A4, A5, and A6 hold. Then, as T → ∞,

√(Th) D_h (γ̂ − γ − h₁^{τ+1} M(K)^{−1} B(K)) ⇒ N(0, (1/n) σ̄² M(K)^{−1} V(K) M(K)^{−1}),

where σ̄² = n^{−1} ∑_{i=1}^n σi².

The above result indicates that the leading bias effect of the local polynomial estimation of (γ₀, γ₁, ..., γ_τ) is given by h^{τ+1} D_h M(K)^{−1} B(K), and the leading variance effect is given by ω̄² D_h^{−1} M(K)^{−1} V(K) M(K)^{−1} D_h^{−1}/(nTh). The local polynomial predictor for g(1 + q/T) is then

ĝ(1 + q/T) = ∑_{k=0}^τ γ̂_k (q/T)^k,

and our predictor for y_{i,T+q} is

ŷ_{i,T+q} = α̂_i + β̂_i⊤ D_{T+q} + ĝ(1 + q/T).   (12)

The forecast for average temperature is just the average forecast, so

ŷ̄_{T+q} = β̄̂⊤ D_{T+q} + ĝ(1 + q/T),   (13)

where β̄̂ = n^{−1} ∑_{i=1}^n β̂_i. The forecasting error is given in the following theorem. Let P_τ = (1, (q/Th), ..., (q/Th)^τ)⊤, and let E*_T denote asymptotic conditional expectation given the data.

Theorem 4. Suppose that Assumptions A1, A2′, A3, A4, and A5 hold. Then, as T → ∞, the forecasting bias in ŷ_{i,T+q} is given by

E*_T [ŷ_{i,T+q} − y_{i,T+q}] = b_g = h^{τ+1} [P_τ⊤ M(K)^{−1} B(K) + o(1)],

and the forecasting error variance in ŷ_{i,T+q} is given by

E*_T [(ŷ_{i,T+q} − E_T y_{i,T+q})²] = σi² + (1/(Tnh)) [P_τ⊤ M(K)^{−1} V(K) M(K)^{−1} P_τ + o(1)] σ̄²,

where σ̄² is defined in Theorem 3. For the forecast of average temperature, ŷ̄_{T+q}, the forecasting bias is the same as that of ŷ_{i,T+q} given by the above formula, and the forecasting error variance in ŷ̄_{T+q} is given by

E*_T [(ŷ̄_{T+q} − E*_T ȳ_{T+q})²] = (1/n) (1 + (1/(Th)) [P_τ⊤ M(K)^{−1} V(K) M(K)^{−1} P_τ + o(1)]) σ̄².

The results of Theorems 3 and 4 indicate that the forecasting error of ŷ_{i,T+q} is dominated by that of the local polynomial forecaster of g(1 + q/T). In particular, for the leading case of forecasting with finite q, the bias term is dominated by the first term in b_g, h^{τ+1} B₀, where B₀ is the first element of the (τ + 1)-vector M(K)^{−1} B(K). The forecasting error variance is dominated by σi² + V₀ σ̄²/(Tnh), where V₀ is the (1, 1) element of the matrix M(K)^{−1} V(K) M(K)^{−1}. Similar results can be obtained for the average temperature forecaster ŷ̄_{T+q}. These results also hold in more general cases as long as q/(Th) → 0.

If we allow q → ∞, the order of magnitude of the forecasting error is determined jointly by the bandwidth h and the forecasting distance q/T. In the case of ŷ_{i,T+q}, if q/(Th) → 0, the bias term is dominated by the first term in b_g, h^{τ+1} B₀, and the forecasting error variance is dominated by σi² + V₀ σ̄²/(Tnh), where B₀ and V₀ are defined as above. If q/(Th) → δ ∈ (0, ∞), the leading bias term is affected by all terms in b_g: h^{τ+1} ∆_τ⊤ M(K)^{−1} B(K), where ∆_τ = (1, δ, ..., δ^τ)⊤. The leading variance term is given by σi² + ∆_τ⊤ M(K)^{−1} V(K) M(K)^{−1} ∆_τ σ̄²/(Tnh). If q/(Th) → ∞, our theory is not applicable.
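As an illustration of the end-point local polynomial fit in (11), the sketch below estimates γ̂ by weighted least squares and evaluates ĝ(1 + q/T). The one-sided Epanechnikov-type weight, the sign convention with nonnegative argument (T − t)/(Th₁) (equivalent to the paper's kernel on [−1, 0] up to reflection), and all names are illustrative assumptions; the paper only requires Assumption A5.

```python
import numpy as np

def forecast_g(y_tilde, h1, tau, q):
    """Local polynomial forecast of g(1 + q/T) from the demeaned averages
    y_tilde[t-1] = ytilde_t, t = 1, ..., T, as in (11)."""
    T = y_tilde.size
    t = np.arange(1, T + 1)
    u = (T - t) / (T * h1)                              # nonnegative argument
    w = np.where(u <= 1.0, 0.75 * (1.0 - u**2), 0.0)    # one-sided weight
    x = (t - T) / T                                     # regressor (t - T)/T
    X = np.vander(x, N=tau + 1, increasing=True)        # [1, x, ..., x^tau]
    sw = np.sqrt(w)
    gamma = np.linalg.lstsq(X * sw[:, None], y_tilde * sw, rcond=None)[0]
    return gamma @ (q / T) ** np.arange(tau + 1)        # sum_k gamma_k (q/T)^k
```

Only observations with (T − t)/(Th₁) ≤ 1 receive positive weight, so the fit is driven by the data near the sample end point, as the discussion above requires.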
Remark 6. In the general case when $\{\varepsilon_{it}\}_t$ are weakly dependent, $E_T y_{i,T+q} = \alpha_i + \beta_i^{\top}D_{T+q} + g(1+q/T) + E_T\varepsilon_{i,T+q}$, where $E_T$ denotes conditional expectation given the data. Under our condition A1, $E_T\varepsilon_{i,T+q} \neq 0$ (although $E_T\varepsilon_{i,T+q} \to 0$ as $q \to \infty$). To forecast $E_T\varepsilon_{i,T+q}$, we should fit a time series model (say, an ARMA model as in Box and Jenkins) to the error term and use an existing forecasting method to construct a predictor. In this case, we may detrend and remove the seasonal components from $y_{i,t}$ using our estimates $\widehat{\alpha}_i$, $\widehat{\beta}_i$, and $\widehat{g}(t/T)$, i.e.,

$$\widehat{\varepsilon}_{i,t} = y_{i,t} - \widehat{\alpha}_i - \widehat{\beta}_i^{\top}D_t - \widehat{g}(t/T),$$

and then fit the estimated stochastic component $\widehat{\varepsilon}_{i,t}$ by an appropriate ARMA model to obtain a forecast of $\varepsilon_{i,T+q}$, say, $\widehat{E}_T\varepsilon_{i,T+q}$. A predictor for $y_{i,T+q}$ can then be constructed from the $\widehat{g}(1+q/T)$ obtained earlier in this section together with the other components, i.e.,
$$\widehat{y}_{i,T+q} = \widehat{\alpha}_i + \widehat{\beta}_i^{\top}D_{T+q} + \widehat{g}(1+q/T) + \widehat{E}_T\varepsilon_{i,T+q}.$$

In the AR(1) special case $\varepsilon_{i,t} = \rho\varepsilon_{i,t-1} + \eta_{it}$, where $\eta_{it}$ is i.i.d., we have $E_T\varepsilon_{i,T+q} = \rho^{q}\varepsilon_{i,T}$. More generally, for ARMA process errors one could use the standard linear forecasting techniques associated with Box and Jenkins. Alternatively, we may ignore the error dynamics and simply construct forecasts for $y_{i,T+q}$ and $y_{T+q}$ by (12) and (13). Such predictors are asymptotically equivalent to predictors that take into account the weak correlation in $\varepsilon_{i,t}$ for long-run forecasting (the case $q \to \infty$), but are less efficient for short-run forecasting than predictors that utilize the correlation property (a minimal sketch of this residual-based construction follows the data description below).

6. Application

Our dataset contains the average maximum temperature within a month (TMAX), the average minimum temperature within a month (TMIN), and the difference between the average maximum and minimum temperatures within a month (TRANGE), all measured in degrees Celsius, and also the number of hours of sunshine and the number of millimeters of rainfall. The primary data source is the Met Office web site for each of the twenty-six stations (Fig. 1).3 The first observations were taken in 1853 at Armagh and Oxford, so that we have a total of 1858 time series records. In the working paper version of this paper we provide the full results of a univariate parametric analysis based on a quadratic trend. This shows evidence of seasonality and an upward trend for all stations. There is also some evidence of serial correlation in the residuals but little evidence of GARCH effects. The error correlation does not affect the estimation of the regression coefficients and changes the standard errors only slightly. Similar results were obtained for both maximum and minimum temperature. We also report results for the range, which are somewhat different. Specifically, the trend coefficients are significant in only nine cases, with seven of those cases having a similar upward trend, whereas the other two actually have a negative trend in range. Range also has a significant seasonal effect and a significant autocorrelation coefficient in most cases. The results for sunshine hours are not as consistent as those for temperature. There are seven stations with significant trends, six of them increasing; many other stations have negative, albeit insignificant, trends. With rainfall, the trend is not significant at any station. One critique of such a parametric analysis is that the implied trend is a little unrealistic and poorly estimated. Extrapolating beyond the sample implies an outrageously high temperature twenty years from now, which is simply not credible. This is why we have advocated a semiparametric approach.
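As flagged in Remark 6, the following is a minimal sketch of the residual-based forecast: detrend with the estimated components, fit an AR(1) to the fitted residuals, and add the decayed residual forecast back to the predictor in (12). All function and variable names are our own illustration, not the authors' code.

```python
import numpy as np

def ar1_residual_forecast(eps_hat, q):
    # Fit eps_t = rho * eps_{t-1} + eta_t by least squares and forecast
    # E_T[eps_{T+q}] = rho^q * eps_T, the AR(1) special case of Remark 6.
    x, y = eps_hat[:-1], eps_hat[1:]
    rho = np.dot(x, y) / np.dot(x, x)
    return rho ** q * eps_hat[-1]

def forecast_with_error_dynamics(alpha_i, beta_i, D_future, g_future, eps_hat, q):
    # Predictor for y_{i,T+q}: estimated level and seasonal components plus
    # the trend forecast g_future and the AR(1) residual forecast.
    return alpha_i + beta_i @ D_future + g_future + ar1_residual_forecast(eps_hat, q)
```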
3 The data are available at http://www.metoffice.gov.uk/climate/uk/stationdata/.
Fig. 1.
We next present the results of the semiparametric analysis. In Tables 1 and 2 we give the estimated values of $\theta$ and the associated standard errors for TMAX and TMIN. The parameter values are strongly significant and show evidence of geographic variability in the level of temperature and in seasonality. These results are broadly consistent with the individual purely parametric results we gave in the working paper version. We present in Fig. 2(a) and (b) the implied trend from the parametric analysis. The jagged nature of the graph is caused by the introduction of new stations. Also note that the implied trend at the end of the period is quite extreme. Our results are somewhat different from those obtained in Gao and Hawthorne (2006), for example, since we find evidence of trend starting much later. In Fig. 3(a) and (b) we give the estimated nonparametric trend over the same period. The trend is much more moderate, especially at the end of the period. In Fig. 4(a) and (b) we give the trend just for the recent period by considering only the balanced subset of the data. Even though the nonparametric trend shows some variation, i.e., some downward movements, it generally climbs upward, and this is more pronounced after 1995. In both cases, balanced and unbalanced, we can reasonably claim that there is an upward trend in the TMAX and TMIN values. These estimates were implemented using a Gaussian kernel and Silverman's rule of thumb bandwidth (which in this case yields $h \simeq 0.05$). As we remarked in the text, the estimation of the common trend is purely local and unaffected by earlier data. The standard errors for the nonparametric estimators of TMAX and TMIN over the shown period are 0.476709 and 0.48602, respectively, indicating the level of significance of the estimated curves. We next present the results of an out-of-sample analysis. We compute the estimated forecast based on local linear smoothing. We report the absolute error for the $p$-step forecast, where $p = 1, 2, \ldots, 12$, forecasting out to one year ahead. The forecast errors given in Fig. 5(a) and (b) appear reasonable and are better than the corresponding parametric results, which substantially overpredict the temperature in this period.
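For concreteness, a minimal sketch of the bandwidth and kernel choice just described (our own illustration; the rescaled time index $u = t/T$ and the hypothetical names are assumptions):

```python
import numpy as np

def silverman_bandwidth(u):
    # Silverman's rule of thumb for a Gaussian kernel. With u = t/T on [0, 1]
    # and sample sizes in the low thousands, this lands near the h of about
    # 0.05 reported in the text.
    s = min(np.std(u, ddof=1),
            (np.percentile(u, 75) - np.percentile(u, 25)) / 1.34)
    return 1.06 * s * len(u) ** (-0.2)

def kernel_trend(u_grid, u_obs, y_detrended, h):
    # Local constant (Nadaraya-Watson) estimate of the common trend g(u)
    # with a Gaussian kernel.
    K = lambda z: np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    w = K((u_grid[:, None] - u_obs[None, :]) / h)
    return (w @ y_detrended) / w.sum(axis=1)
```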
Table 1
Maximum temperature nonparametric results: estimated intercepts (Alpha) by station and sample period.

Station             Time        Alpha
Aberporth           1942–2007   15.1545 (0.4872)
Armagh              1865–2007   15.3314 (0.4969)
Bradford            1908–2007   14.2064 (0.4793)
Braemar             1959–2007   12.4661 (0.5372)
Cambridge           1959–2007   16.2552 (0.5566)
Cardiff             1977–2007   15.6739 (0.5444)
Durham              1880–2007   14.4470 (0.5625)
Eastbourne          1959–2007   15.8710 (0.5545)
Greenwich           1959–2004   16.1333 (0.5408)
Hurn                1957–2007    8.7890 (0.5572)
Lerwick             1930–2007    6.1471 (0.5417)
Leuchars            1957–2007    6.9764 (0.5433)
Newton Rigg         1959–2007    5.2113 (0.5269)
Oxford              1853–2007    4.9145 (0.5429)
Paisley             1959–2007    3.7738 (0.5528)
Ringway             1949–2004    5.8577 (0.5275)
Ross-on-Wye         1930–2007    8.7682 (0.5342)
Shawbury            1957–2007   10.4095 (0.5520)
Sheffield           1883–2007   15.8198 (0.5470)
Southampton         1855–2004   12.7800 (0.5639)
St. Mawgan          1957–2007   11.7145 (0.5510)
Stornoway           1873–2007   13.0459 (0.5374)
Sutton Bonington    1959–2007   15.4447 (0.5577)
Tiree               1930–2007   25.9146 (0.5417)
Valley              1930–2007   25.8283 (0.5408)
Yeovilton           1964–2007   32.7795 (0.5249)

The values in the parentheses indicate the standard errors.
Fig. 2. Average trend by OLS. (a) Average trend in TMAX by OLS. (b) Average trend in TMIN by OLS.
Table 2
Minimum temperature nonparametric results: estimated intercepts (Alpha) by station and sample period.

Station             Time        Alpha
Aberporth           1942–2007     2.3700 (0.3929)
Armagh              1865–2007   −12.1000 (0.3936)
Bradford            1908–2007     2.4400 (0.3914)
Braemar             1959–2007   −23.2000 (0.4347)
Cambridge           1959–2007   −10.0000 (0.4446)
Cardiff             1977–2007        —
Durham              1880–2007   −10.6000 (0.4443)
Eastbourne          1959–2007    14.1000 (0.4424)
Greenwich           1959–2004     2.3300 (0.4361)
Hurn                1957–2007     0.6770 (0.4474)
Lerwick             1930–2007    14.1000 (0.4379)
Leuchars            1957–2007        —
Newton Rigg         1959–2007   −10.7000 (0.4287)
Oxford              1853–2007    11.1000 (0.4386)
Paisley             1959–2007    13.1000 (0.4404)
Ringway             1949–2004        —
Ross-on-Wye         1930–2007        —
Shawbury            1957–2007        —
Sheffield           1883–2007     1.4000 (0.4399)
Southampton         1855–2004     1.1700 (0.4456)
St. Mawgan          1957–2007   −25.3000 (0.4332)
Stornoway           1873–2007    −3.7600
Sutton Bonington    1959–2007   −10.7000 (0.4481)
Tiree               1930–2007    14.0500 (0.4362)
Valley              1930–2007    15.0000 (0.4365)
Yeovilton           1964–2007    15.2000 (0.4269)

The values in the parentheses indicate the standard errors.
Fig. 3. Trend by nonparametric method: unbalanced case. (a) Trend in TMAX. (b) Trend in TMIN.
Fig. 4. Trend by nonparametric method: balanced case. (a) Trend in TMAX. (b) Trend in TMIN.
Fig. 5. Forecasting by nonparametric method: absolute forecast error by horizon p = 1, ..., 12. (a) Absolute forecast error, TMAX. (b) Absolute forecast error, TMIN.
7. Conclusion

In conclusion, we have developed a semiparametric model that we think is appropriate for modeling the changes in temperatures observed at a cross section of locations. The model and methods are defined for the important practical case of unbalanced data. The methods we develop give similar results to a parametric analysis and help to confirm the main finding of a gradual upward trend in temperature in the UK, although the trend obtained by the nonparametric method is somewhat smaller than that from the parametric one.

Acknowledgements

The second author thanks the ESRC, the ERC, and the Leverhulme foundations for financial support. This paper was partly written while the second author held a Universidad Carlos III de Madrid-Banco Santander Chair of Excellence, and he thanks them for financial support. The third author gratefully acknowledges financial support from Boston College.

Appendix

A.1. Proof of theorems
Proof of Theorem 1. The first order condition (FOC) for $\widehat{\theta}$ with respect to $\alpha_i$ is

$$\frac{\partial L(\theta)}{\partial \alpha_i} = -\sum_{j\neq i}\sum_{t=t_j}^{T}\left(y_{jt} - \alpha_j - \beta_j^{\top}D_t - g_{\theta}(t/T)\right)\frac{\partial g_{\theta}(t/T)}{\partial \alpha_i} - \sum_{t=t_i}^{T}\left(y_{it} - \alpha_i - \beta_i^{\top}D_t - g_{\theta}(t/T)\right)\left(1 + \frac{\partial g_{\theta}(t/T)}{\partial \alpha_i}\right) = 0,$$

and with respect to $\beta_i$,

$$\frac{\partial L(\theta)}{\partial \beta_i} = -\sum_{j\neq i}\sum_{t=t_j}^{T}\left(y_{jt} - \alpha_j - \beta_j^{\top}D_t - g_{\theta}(t/T)\right)\frac{\partial g_{\theta}(t/T)}{\partial \beta_i} - \sum_{t=t_i}^{T}\left(y_{it} - \alpha_i - \beta_i^{\top}D_t - g_{\theta}(t/T)\right)\left(D_t + \frac{\partial g_{\theta}(t/T)}{\partial \beta_i}\right) = 0,$$

where

$$\frac{\partial g_{\theta}(t/T)}{\partial \alpha_i} = -\frac{1}{m_t}\frac{1}{T}\sum_{s=t_i}^{T}K_h((t-s)/T) \to \begin{cases}-\dfrac{1}{m_t}, & i \le m_t,\\[2pt] 0, & i > m_t,\end{cases} \qquad \frac{\partial g_{\theta}(t/T)}{\partial \beta_i} = -\frac{1}{m_t}\frac{1}{T}\sum_{s=t_i}^{T}K_h((t-s)/T)D_s \to \begin{cases}-\dfrac{1}{12 m_t}\,i_{11}, & i \le m_t,\\[2pt] 0_{11}, & i > m_t.\end{cases}$$

Thus, for $i = 1, \ldots, n$,

$$\sum_{l\neq i}\sum_{t=t_l}^{T}\left[y_{lt} - \alpha_l - \beta_l^{\top}D_t - \frac{1}{m_t T}\sum_{j=1}^{n}\sum_{s=t_j}^{T}\left(y_{js} - \alpha_j - \beta_j^{\top}D_s\right)K_h((t-s)/T)\right]\frac{\partial g_{\theta}(t/T)}{\partial \alpha_i} + \sum_{t=t_i}^{T}\left[y_{it} - \alpha_i - \beta_i^{\top}D_t - \frac{1}{m_t T}\sum_{j=1}^{n}\sum_{s=t_j}^{T}\left(y_{js} - \alpha_j - \beta_j^{\top}D_s\right)K_h((t-s)/T)\right]\left(1 + \frac{\partial g_{\theta}(t/T)}{\partial \alpha_i}\right) = 0,$$

$$\sum_{l\neq i}\sum_{t=t_l}^{T}\left[y_{lt} - \alpha_l - \beta_l^{\top}D_t - \frac{1}{m_t T}\sum_{j=1}^{n}\sum_{s=t_j}^{T}\left(y_{js} - \alpha_j - \beta_j^{\top}D_s\right)K_h((t-s)/T)\right]\frac{\partial g_{\theta}(t/T)}{\partial \beta_i} + \sum_{t=t_i}^{T}\left[y_{it} - \alpha_i - \beta_i^{\top}D_t - \frac{1}{m_t T}\sum_{j=1}^{n}\sum_{s=t_j}^{T}\left(y_{js} - \alpha_j - \beta_j^{\top}D_s\right)K_h((t-s)/T)\right]\left(D_t + \frac{\partial g_{\theta}(t/T)}{\partial \beta_i}\right) = 0.$$
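The displayed limit of $\partial g_{\theta}(t/T)/\partial\alpha_i$ can be checked numerically. A small sketch (our own construction, with illustrative kernel and staggered start dates, not part of the paper):

```python
import numpy as np

def dg_dalpha(i, t, T, h, start, kernel):
    # -(1/(m_t T)) * sum_{s=t_i}^T K_h((t-s)/T): derivative of the profiled
    # trend with respect to alpha_i; m_t is the number of stations observed
    # at time t, K_h(v) = K(v/h)/h.
    m_t = int(np.sum(start <= t))
    if start[i] > t:          # station i not yet observed: derivative is 0
        return 0.0
    s = np.arange(start[i], T + 1)
    Kh = np.array([kernel(v) for v in (t - s) / (T * h)]) / h
    return -np.sum(Kh) / (m_t * T)

# As T grows with h -> 0 and Th -> infinity, dg_dalpha tends to -1/m_t when
# station i is among the m_t observed stations, and to 0 otherwise, matching
# the limit used throughout the proof.
```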
Substituting the true model $y_{it} = \alpha_i + \beta_i^{\top}D_t + g(t/T) + \varepsilon_{it}$ into the above FOC, and noting that $y_{it} - \widehat{\alpha}_i - \widehat{\beta}_i^{\top}D_t = \varepsilon_{it} + g(t/T) - (\widehat{\alpha}_i - \alpha_i) - (\widehat{\beta}_i - \beta_i)^{\top}D_t$, we have, for $i = 1, \ldots, n$, that the corresponding FOC with respect to $\widehat{\alpha}_i$ is given by
$$\sum_{l\neq i}\sum_{t=t_l}^{T}\left[(\widehat{\alpha}_l - \alpha_l) + (\widehat{\beta}_l - \beta_l)^{\top}D_t - \frac{1}{m_t T}\sum_{j=1}^{n}\sum_{s=t_j}^{T}\left\{(\widehat{\alpha}_j - \alpha_j) + (\widehat{\beta}_j - \beta_j)^{\top}D_s\right\}K_h((t-s)/T)\right]\frac{\partial g_{\theta}(t/T)}{\partial \alpha_i}$$
$$+\ \sum_{t=t_i}^{T}\left[(\widehat{\alpha}_i - \alpha_i) + (\widehat{\beta}_i - \beta_i)^{\top}D_t - \frac{1}{m_t T}\sum_{j=1}^{n}\sum_{s=t_j}^{T}\left\{(\widehat{\alpha}_j - \alpha_j) + (\widehat{\beta}_j - \beta_j)^{\top}D_s\right\}K_h((t-s)/T)\right]\left(1 + \frac{\partial g_{\theta}(t/T)}{\partial \alpha_i}\right)$$
$$=\ \sum_{l\neq i}\sum_{t=t_l}^{T}\left[\varepsilon_{lt} - \frac{1}{m_t T}\sum_{j=1}^{n}\sum_{s=t_j}^{T}\varepsilon_{js}K_h((t-s)/T)\right]\frac{\partial g_{\theta}(t/T)}{\partial \alpha_i} + \sum_{l\neq i}\sum_{t=t_l}^{T}\left[g(t/T) - \frac{1}{m_t T}\sum_{j=1}^{n}\sum_{s=t_j}^{T}g(s/T)K_h((t-s)/T)\right]\frac{\partial g_{\theta}(t/T)}{\partial \alpha_i}$$
$$+\ \sum_{t=t_i}^{T}\left[\varepsilon_{it} - \frac{1}{m_t T}\sum_{j=1}^{n}\sum_{s=t_j}^{T}\varepsilon_{js}K_h((t-s)/T)\right]\left(1 + \frac{\partial g_{\theta}(t/T)}{\partial \alpha_i}\right) + \sum_{t=t_i}^{T}\left[g(t/T) - \frac{1}{m_t T}\sum_{j=1}^{n}\sum_{s=t_j}^{T}g(s/T)K_h((t-s)/T)\right]\left(1 + \frac{\partial g_{\theta}(t/T)}{\partial \alpha_i}\right),$$

and the corresponding FOC with respect to $\widehat{\beta}_i$ has the same structure with $\partial g_{\theta}(t/T)/\partial \alpha_i$ and $1 + \partial g_{\theta}(t/T)/\partial \alpha_i$ replaced by $\partial g_{\theta}(t/T)/\partial \beta_i$ and $D_t + \partial g_{\theta}(t/T)/\partial \beta_i$, respectively.
If we denote the terms given in Boxes II and III, then the FOC can be written as

$$C_T\begin{bmatrix}\sqrt{T}(\widehat{\alpha} - \alpha)\\ \sqrt{T}(\widehat{\beta} - \beta)\end{bmatrix} = \begin{bmatrix}d_a\\ d_A\end{bmatrix} + \begin{bmatrix}e_a\\ e_A\end{bmatrix}, \qquad C_T = \begin{bmatrix}C_{T,a} & C_{T,b}\\ C_{T,A} & C_{T,B}\end{bmatrix},\quad d_T = \begin{bmatrix}d_a\\ d_A\end{bmatrix},\quad e_T = \begin{bmatrix}e_a\\ e_A\end{bmatrix}. \qquad (14)$$

Thus the profile likelihood estimator subject to the linear restriction $q^{\top}\theta = 0$ satisfies

$$\sqrt{T}(\widehat{\theta} - \theta) = R\left(R^{\top}C_T R\right)^{-1}R^{\top}d_T + R\left(R^{\top}C_T R\right)^{-1}R^{\top}e_T,$$

where $R$ is the $K \times (K-1)$ normalized orthogonal complement of $q$. By the results of Lemmas 1 and 2, as $T \to \infty$,

$$C_{T,a} \Rightarrow \begin{bmatrix}c_{11} & \cdots & c_{1n}\\ \vdots & & \vdots\\ c_{n1} & \cdots & c_{nn}\end{bmatrix} = \Delta_n + G_n = C_n, \qquad C_{T,b} \to C_n \otimes \frac{1}{12}i_{11}^{\top}, \qquad C_{T,A} \to C_n \otimes \frac{1}{12}i_{11},$$

$$C_{T,B} \to \Delta_n \otimes \frac{1}{12}I_{11} + G_n \otimes \frac{1}{12^{2}}J_{11},$$

so that

$$C_T \to Q = \begin{bmatrix}\Delta_n + G_n & (\Delta_n + G_n)\otimes \frac{1}{12}i_{11}^{\top}\\ (\Delta_n + G_n)\otimes \frac{1}{12}i_{11} & \Delta_n \otimes \frac{1}{12}I_{11} + G_n \otimes \frac{1}{12^{2}}J_{11}\end{bmatrix}.$$

By Lemma 3, the bias terms are

$$d_T = -\sqrt{T}h^{p}\, b \otimes \begin{bmatrix}1\\ \frac{1}{12}i_{11}\end{bmatrix} + o(\sqrt{T}h^{p}), \qquad b = (b_1, \cdots, b_i, \cdots, b_n)^{\top}, \qquad b_i = \mu_p(K)\frac{1}{p!}\left[\sum_{l\neq i}\int_{r_l}^{1}\delta(s)g^{(p)}(s)\,ds - \int_{r_i}^{1}w(s)g^{(p)}(s)\,ds\right],$$

where $w(s)$ and $\delta(s)$ are weighting functions on $[0,1]$:

$$\delta(s) = \frac{1}{j}, \qquad w(s) = 1 - \delta(s) = 1 - \frac{1}{j}, \qquad \text{if } r_j < s < r_{j+1},\ j = 1, 2, \ldots, n.$$

By Lemma 4, the stochastic term $e_T$ converges in distribution to a multivariate normal with covariance matrix

$$\Omega = \begin{bmatrix}\Omega_{11} & \Omega_{12}\\ \Omega_{21} & \Omega_{22}\end{bmatrix} = \begin{bmatrix}\Omega_n + A_n & [\Omega_n + A_n]\otimes \frac{1}{12}i_{11}^{\top}\\ [\Omega_n + A_n]\otimes \frac{1}{12}i_{11} & S_n \otimes \frac{1}{12}I_{11} + A_n \otimes \frac{1}{12^{2}}J_{11}\end{bmatrix},$$

and the result of Theorem 1 follows.

Proof of Theorem 2. Consider

$$\widehat{g}_P(u) = \frac{\displaystyle\sum_{i=1}^{n}\sum_{s=t_i}^{T}\left(y_{is} - \widehat{\alpha}_i - \widehat{\beta}_i^{\top}D_s\right)K_h(u - s/T)}{\displaystyle\sum_{i=1}^{n}\sum_{s=t_i}^{T}K_h(u - s/T)}.$$

If $t_m/T < u < t_{m+1}/T$, then $\sum_{i=1}^{n}\sum_{t=t_i}^{T}K_h(u - t/T)/T = \sum_{i=1}^{m}\sum_{t=t_i}^{T}K([u - t/T]/h)/Th = m$. Therefore,

$$\widehat{g}_P(u) = \frac{1}{Tm}\sum_{i=1}^{m}\sum_{t=t_i}^{T}K_h(u - t/T)\left(y_{it} - \widehat{\alpha}_i - \widehat{\beta}_i^{\top}D_t\right) = \frac{1}{Tmh}\sum_{i=1}^{m}\sum_{t=t_i}^{T}K([u - t/T]/h)\left[g(t/T) + \varepsilon_{it} - (\widehat{\alpha}_i - \alpha_i) - (\widehat{\beta}_i - \beta_i)^{\top}D_t\right].$$
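In code, $\widehat{g}_P(u)$ is just a pooled Nadaraya-Watson smoother over all station-month pairs, with the unbalanced-panel denominator. A minimal sketch with hypothetical names (our own, under the assumption that station $i$ is observed from index `starts[i]` through $T$):

```python
import numpy as np

def g_profile(u, starts, y_detrended, T, h, kernel):
    # Pooled kernel estimator: sum_i sum_{s=t_i}^T K_h(u - s/T) * y~_is,
    # divided by the matching double sum of kernel weights. y_detrended[i]
    # holds y_is - alpha_i - beta_i' D_s for s = starts[i], ..., T.
    num = den = 0.0
    for t0, y in zip(starts, y_detrended):
        s = np.arange(t0, T + 1) / T
        w = np.array([kernel(v) for v in (u - s) / h])
        num += np.sum(w * y)
        den += np.sum(w)
    return num / den
```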
Here the blocks entering (14) are

$$C_{T,a} = \begin{bmatrix}C_{a,11} & \cdots & C_{a,1n}\\ \vdots & & \vdots\\ C_{a,n1} & \cdots & C_{a,nn}\end{bmatrix},\quad C_{T,b} = \begin{bmatrix}C_{b,11} & \cdots & C_{b,1n}\\ \vdots & & \vdots\\ C_{b,n1} & \cdots & C_{b,nn}\end{bmatrix},\quad C_{T,A} = \begin{bmatrix}C_{A,11} & \cdots & C_{A,1n}\\ \vdots & & \vdots\\ C_{A,n1} & \cdots & C_{A,nn}\end{bmatrix},\quad C_{T,B} = \begin{bmatrix}C_{B,11} & \cdots & C_{B,1n}\\ \vdots & & \vdots\\ C_{B,n1} & \cdots & C_{B,nn}\end{bmatrix},$$

$$d_a = \begin{bmatrix}d_{a,1}\\ \vdots\\ d_{a,n}\end{bmatrix},\qquad d_A = \begin{bmatrix}d_{A,1}\\ \vdots\\ d_{A,n}\end{bmatrix},\qquad e_a = \begin{bmatrix}e_{a,1}\\ \vdots\\ e_{a,n}\end{bmatrix},\qquad e_A = \begin{bmatrix}e_{A,1}\\ \vdots\\ e_{A,n}\end{bmatrix},$$

with (Box II)

$$C_{a,ii} = \frac{1}{T}\sum_{t=t_i}^{T}\left[1 - \frac{1}{m_t T}\sum_{s=t_i}^{T}K_h((t-s)/T)\right]\left(1 + \frac{\partial g_{\theta}(t/T)}{\partial \alpha_i}\right) - \frac{1}{T}\sum_{l\neq i}\sum_{t=t_l}^{T}\left[\frac{1}{m_t T}\sum_{s=t_i}^{T}K_h((t-s)/T)\right]\frac{\partial g_{\theta}(t/T)}{\partial \alpha_i},$$

$$C_{a,ij} = -\frac{1}{T}\sum_{l\neq i}\sum_{t=t_l}^{T}\left[\frac{1}{m_t T}\sum_{s=t_j}^{T}K_h((t-s)/T)\right]\frac{\partial g_{\theta}(t/T)}{\partial \alpha_i} - \frac{1}{T}\sum_{t=t_i}^{T}\left[\frac{1}{m_t T}\sum_{s=t_j}^{T}K_h((t-s)/T)\right]\left(1 + \frac{\partial g_{\theta}(t/T)}{\partial \alpha_i}\right) \quad (j \neq i),$$

$$C_{b,ii} = \frac{1}{T}\sum_{t=t_i}^{T}\left[D_t^{\top} - \frac{1}{m_t T}\sum_{s=t_i}^{T}D_s^{\top}K_h((t-s)/T)\right]\left(1 + \frac{\partial g_{\theta}(t/T)}{\partial \alpha_i}\right) - \frac{1}{T}\sum_{l\neq i}\sum_{t=t_l}^{T}\left[\frac{1}{m_t T}\sum_{s=t_i}^{T}D_s^{\top}K_h((t-s)/T)\right]\frac{\partial g_{\theta}(t/T)}{\partial \alpha_i},$$

$$C_{b,ij} = -\frac{1}{T}\sum_{l\neq i}\sum_{t=t_l}^{T}\left[\frac{1}{m_t T}\sum_{s=t_j}^{T}D_s^{\top}K_h((t-s)/T)\right]\frac{\partial g_{\theta}(t/T)}{\partial \alpha_i} - \frac{1}{T}\sum_{t=t_i}^{T}\left[\frac{1}{m_t T}\sum_{s=t_j}^{T}D_s^{\top}K_h((t-s)/T)\right]\left(1 + \frac{\partial g_{\theta}(t/T)}{\partial \alpha_i}\right) \quad (j \neq i),$$

$$d_{a,i} = \frac{1}{\sqrt{T}}\sum_{l\neq i}\sum_{t=t_l}^{T}\left[g(t/T) - \frac{1}{m_t T}\sum_{j=1}^{n}\sum_{s=t_j}^{T}g(s/T)K_h((t-s)/T)\right]\frac{\partial g_{\theta}(t/T)}{\partial \alpha_i} + \frac{1}{\sqrt{T}}\sum_{t=t_i}^{T}\left[g(t/T) - \frac{1}{m_t T}\sum_{j=1}^{n}\sum_{s=t_j}^{T}g(s/T)K_h((t-s)/T)\right]\left(1 + \frac{\partial g_{\theta}(t/T)}{\partial \alpha_i}\right),$$

$$e_{a,i} = \frac{1}{\sqrt{T}}\sum_{t=t_i}^{T}\left[\varepsilon_{it} - \frac{1}{m_t T}\sum_{s=t_i}^{T}\varepsilon_{is}K_h((t-s)/T)\right]\left(1 + \frac{\partial g_{\theta}(t/T)}{\partial \alpha_i}\right) + \frac{1}{\sqrt{T}}\sum_{j\neq i}\sum_{t=t_j}^{T}\varepsilon_{jt}\frac{\partial g_{\theta}(t/T)}{\partial \alpha_i} - \frac{1}{\sqrt{T}}\sum_{l\neq i}\sum_{t=t_l}^{T}\frac{1}{m_t T}\sum_{j=1}^{n}\sum_{s=t_j}^{T}K_h((t-s)/T)\,\varepsilon_{js}\,\frac{\partial g_{\theta}(t/T)}{\partial \alpha_i} - \frac{1}{\sqrt{T}}\sum_{t=t_i}^{T}\frac{1}{m_t T}\sum_{j\neq i}\sum_{s=t_j}^{T}K_h((t-s)/T)\,\varepsilon_{js}\left(1 + \frac{\partial g_{\theta}(t/T)}{\partial \alpha_i}\right).$$
Returning to the proof of Theorem 2, the last expression decomposes as

$$\widehat{g}_P(u) = \frac{1}{Tmh}\sum_{i=1}^{m}\sum_{t=t_i}^{T}K([u-t/T]/h)\,g(t/T) + \frac{1}{Tmh}\sum_{i=1}^{m}\sum_{t=t_i}^{T}K([u-t/T]/h)\,\varepsilon_{it} - \frac{1}{Tmh}\sum_{i=1}^{m}\sum_{t=t_i}^{T}K([u-t/T]/h)\,(\widehat{\alpha}_i - \alpha_i) - \frac{1}{Tmh}\sum_{i=1}^{m}\sum_{t=t_i}^{T}K([u-t/T]/h)\,(\widehat{\beta}_i - \beta_i)^{\top}D_t.$$
Similarly (Box III),

$$C_{A,ii} = \frac{1}{T}\sum_{t=t_i}^{T}\left[1 - \frac{1}{m_t T}\sum_{s=t_i}^{T}K_h((t-s)/T)\right]\left(D_t + \frac{\partial g_{\theta}(t/T)}{\partial \beta_i}\right) - \frac{1}{T}\sum_{l\neq i}\sum_{t=t_l}^{T}\left[\frac{1}{m_t T}\sum_{s=t_i}^{T}K_h((t-s)/T)\right]\frac{\partial g_{\theta}(t/T)}{\partial \beta_i},$$

$$C_{A,ij} = -\frac{1}{T}\sum_{l\neq i}\sum_{t=t_l}^{T}\left[\frac{1}{m_t T}\sum_{s=t_j}^{T}K_h((t-s)/T)\right]\frac{\partial g_{\theta}(t/T)}{\partial \beta_i} - \frac{1}{T}\sum_{t=t_i}^{T}\left[\frac{1}{m_t T}\sum_{s=t_j}^{T}K_h((t-s)/T)\right]\left(D_t + \frac{\partial g_{\theta}(t/T)}{\partial \beta_i}\right) \quad (j \neq i),$$

$$C_{B,ii} = \frac{1}{T}\sum_{t=t_i}^{T}\left[D_t^{\top} - \frac{1}{m_t T}\sum_{s=t_i}^{T}D_s^{\top}K_h((t-s)/T)\right]\left(D_t + \frac{\partial g_{\theta}(t/T)}{\partial \beta_i}\right) - \frac{1}{T}\sum_{l\neq i}\sum_{t=t_l}^{T}\left[\frac{1}{m_t T}\sum_{s=t_i}^{T}D_s^{\top}K_h((t-s)/T)\right]\frac{\partial g_{\theta}(t/T)}{\partial \beta_i},$$

$$C_{B,ij} = -\frac{1}{T}\sum_{l\neq i}\sum_{t=t_l}^{T}\left[\frac{1}{m_t T}\sum_{s=t_j}^{T}D_s^{\top}K_h((t-s)/T)\right]\frac{\partial g_{\theta}(t/T)}{\partial \beta_i} - \frac{1}{T}\sum_{t=t_i}^{T}\left[\frac{1}{m_t T}\sum_{s=t_j}^{T}D_s^{\top}K_h((t-s)/T)\right]\left(D_t + \frac{\partial g_{\theta}(t/T)}{\partial \beta_i}\right) \quad (j \neq i),$$

with $d_{A,i}$ and $e_{A,i}$ defined analogously to $d_{a,i}$ and $e_{a,i}$, replacing $\partial g_{\theta}(t/T)/\partial \alpha_i$ and $1 + \partial g_{\theta}(t/T)/\partial \alpha_i$ by $\partial g_{\theta}(t/T)/\partial \beta_i$ and $D_t + \partial g_{\theta}(t/T)/\partial \beta_i$.
For the first stochastic term,

$$\frac{1}{Tmh}\sum_{i=1}^{m}\sum_{t=t_i}^{T}K([u-t/T]/h)\,\varepsilon_{it} = \frac{1}{m}\sum_{i=1}^{m}\frac{1}{Th}\sum_{t=t_i}^{T}K([u-t/T]/h)\,\varepsilon_{it}.$$

Again, for each $i$, $\sum_t K([u-t/T]/h)\varepsilon_{it}$ is a weighted sum of weakly correlated random variables and a CLT applies,

$$\frac{1}{\sqrt{Th}}\sum_{t=t_i}^{T}K([u-t/T]/h)\,\varepsilon_{it} \Rightarrow \omega_i\|K\|_2\,\xi_i,$$

where the $\xi_i$ are standard normal. The second term is simply a kernel smoothed estimator of $g(u)$,

$$\frac{1}{Tmh}\sum_{i=1}^{m}\sum_{t=t_i}^{T}K([u-t/T]/h)\,g(t/T) = \frac{1}{m}\sum_{i=1}^{m}\frac{1}{Th}\sum_{t=t_i}^{T}K([u-t/T]/h)\left[g(u) + \sum_{j=1}^{p}\frac{1}{j!}\left(\frac{u - t/T}{h}\right)^{j}h^{j}g^{(j)}(u) + o(h^{p})\right] = g(u) + h^{p}g^{(p)}(u)\frac{1}{p!}\int_{0}^{1}z^{p}K(z)\,dz + o(h^{p}).$$

For the third and fourth terms,

$$\frac{1}{Tmh}\sum_{i=1}^{m}\sum_{t=t_i}^{T}K([u-t/T]/h)\,(\widehat{\alpha}_i - \alpha_i) = o_p\!\left(\frac{1}{\sqrt{Th}}\right), \qquad \frac{1}{Tmh}\sum_{i=1}^{m}\sum_{t=t_i}^{T}K([u-t/T]/h)\,(\widehat{\beta}_i - \beta_i)^{\top}D_t = o_p\!\left(\frac{1}{\sqrt{Th}}\right).$$

Notice that, under Assumption A5, the preliminary estimation of $\theta$ does not affect the first order asymptotics of this estimator. Thus for $t_m/T < u < t_{m+1}/T$, $m = 1, \ldots, n-1$,

$$\sqrt{Th}\left(\widehat{g}(u) - g(u) - h^{p}b(u)\right) \Rightarrow N\!\left(0,\ \frac{1}{m}\cdot\frac{1}{m}\sum_{i=1}^{m}\omega_i^{2}\,\|K\|_2^{2}\right),$$

and for $u > t_n/T$,

$$\sqrt{Th}\left(\widehat{g}(u) - g(u) - h^{p}b(u)\right) \Rightarrow N\!\left(0,\ \frac{1}{n}\cdot\frac{1}{n}\sum_{i=1}^{n}\omega_i^{2}\,\|K\|_2^{2}\right).$$
Proof of Theorems 3 and 4. Notice that when $q/T \to 0$, as $T \to \infty$, under Assumption A2′, by a Taylor expansion,

$$g(1 + q/T) = \sum_{k=0}^{\tau}\frac{1}{k!}g^{(k)}(1)\left(\frac{q}{T}\right)^{k} + o\!\left(\left(\frac{q}{T}\right)^{\tau}\right) = \sum_{k=0}^{\tau}\gamma_k\cdot\left(\frac{q}{T}\right)^{k} + o\!\left(\left(\frac{q}{T}\right)^{\tau}\right).$$

The local polynomial estimation at the end point $T$ is given as follows:

$$\widehat{\gamma} = \arg\min_{\gamma}\sum_{t=1}^{T}K\!\left(\frac{T-t}{Th}\right)\left(\tilde{y}_t - \gamma^{\top}x_t\right)^{2}, \qquad \gamma = \begin{bmatrix}\gamma_0\\ \vdots\\ \gamma_\tau\end{bmatrix},\quad x_t = \begin{bmatrix}1\\ \frac{t-T}{Th}\\ \vdots\\ \left(\frac{t-T}{Th}\right)^{\tau}\end{bmatrix}.$$

Notice that, although we have incomplete data, when we consider the end point $T$ and a neighborhood around $T$, observations from all $i$ are available. The local polynomial estimator can be written as

$$\widehat{\gamma} = \gamma + \left[\sum_{t=1}^{T}K\!\left(\frac{T-t}{Th}\right)x_t x_t^{\top}\right]^{-1}\sum_{t=1}^{T}K\!\left(\frac{T-t}{Th}\right)x_t\,\bar{\varepsilon}_t + \left[\sum_{t=1}^{T}K\!\left(\frac{T-t}{Th}\right)x_t x_t^{\top}\right]^{-1}\sum_{t=1}^{T}K\!\left(\frac{T-t}{Th}\right)x_t\,\frac{g^{(\tau+1)}(1)}{(\tau+1)!}\left(\frac{t-T}{T}\right)^{\tau+1} + o_p\!\left((Th)^{-1/2}\right),$$

where $\bar{\varepsilon}_t = n^{-1}\sum_{i=1}^{n}\varepsilon_{it}$, and the terms involving the preliminary estimation errors $(\widehat{\alpha}_i - \alpha_i)$ and $(\widehat{\beta}_i - \beta_i)^{\top}D_t$ are $o_p((Th)^{-1/2})$ by the results of Atak et al. (2008). By a Riemann sum approximation,

$$\frac{1}{Th}\sum_{t=1}^{T}K\!\left(\frac{T-t}{Th}\right)\left(\frac{t-T}{Th}\right)^{k} \to \int_{-1}^{0}K(u)u^{k}\,du = \mu_k^{*}(K),$$

so that

$$\frac{1}{Th}\sum_{t=1}^{T}K\!\left(\frac{T-t}{Th}\right)x_t x_t^{\top} \to \begin{bmatrix}\mu_0^{*}(K) & \cdots & \mu_\tau^{*}(K)\\ \vdots & & \vdots\\ \mu_\tau^{*}(K) & \cdots & \mu_{2\tau}^{*}(K)\end{bmatrix} = M(K), \qquad \frac{1}{Th}\sum_{t=1}^{T}K\!\left(\frac{T-t}{Th}\right)^{2}\left(\frac{t-T}{Th}\right)^{l+k} \to \int_{-1}^{0}K(u)^{2}u^{l+k}\,du = \nu_{l+k}^{*}(K).$$

For the variance term, with $\omega^{2} = n^{-1}\sum_{i=1}^{n}\omega_i^{2}$, since for each $i$,

$$E\left[\frac{1}{\sqrt{Th}}\sum_{t=1}^{T}K\!\left(\frac{T-t}{Th}\right)\left(\frac{t-T}{Th}\right)^{k}\varepsilon_{it}\right]\left[\frac{1}{\sqrt{Th}}\sum_{s=1}^{T}K\!\left(\frac{T-s}{Th}\right)\left(\frac{s-T}{Th}\right)^{l}\varepsilon_{is}\right] \to \sum_{j=-\infty}^{\infty}\gamma_{\varepsilon_i}(j)\int_{-1}^{0}K(u)^{2}u^{l+k}\,du = \omega_i^{2}\,\nu_{l+k}^{*}(K),$$

we have

$$\frac{1}{\sqrt{Th}}\sum_{t=1}^{T}K\!\left(\frac{T-t}{Th}\right)x_t\,\bar{\varepsilon}_t \Rightarrow N\!\left(0,\ \frac{1}{n}\omega^{2}V(K)\right),$$

and thus

$$\left[\frac{1}{Th}\sum_{t=1}^{T}K\!\left(\frac{T-t}{Th}\right)x_t x_t^{\top}\right]^{-1}\frac{1}{\sqrt{Th}}\sum_{t=1}^{T}K\!\left(\frac{T-t}{Th}\right)x_t\,\bar{\varepsilon}_t \Rightarrow M(K)^{-1}N\!\left(0,\ \frac{1}{n}\omega^{2}V(K)\right) = N\!\left(0,\ \frac{1}{n}\omega^{2}M(K)^{-1}V(K)M(K)^{-1}\right).$$

For the bias term, since $((t-T)/T)^{\tau+1} = h^{\tau+1}((t-T)/Th)^{\tau+1}$,

$$\frac{g^{(\tau+1)}(1)}{(\tau+1)!}\,\frac{1}{Th}\sum_{t=1}^{T}K\!\left(\frac{T-t}{Th}\right)x_t\left(\frac{t-T}{Th}\right)^{\tau+1} \to \frac{g^{(\tau+1)}(1)}{(\tau+1)!}\begin{bmatrix}\mu_{\tau+1}^{*}(K)\\ \mu_{\tau+2}^{*}(K)\\ \vdots\\ \mu_{2\tau+1}^{*}(K)\end{bmatrix} = B(K).$$

Thus

$$\sqrt{Th}\left(\widehat{\gamma} - \gamma - h^{\tau+1}M(K)^{-1}B(K)\right) \Rightarrow N\!\left(0,\ \frac{1}{n}\omega^{2}M(K)^{-1}V(K)M(K)^{-1}\right).$$

Notice that

$$\widehat{\gamma}_k - \gamma_k = h^{\tau-k+1}B_k + \frac{1}{\sqrt{T}h^{k+1/2}}U_k,$$

where $B_k$ is the $(k+1)$-th element of $M(K)^{-1}B(K)$ and $U_k = O_p(1)$, and our forecaster for $g(1+q/T)$ is given by $\widehat{g}(1+q/T) = \sum_{k=0}^{\tau}\widehat{\gamma}_k(q/T)^{k}$. Thus, the forecasting error is

$$\widehat{g}(1+q/T) - g(1+q/T) = \sum_{k=0}^{\tau}\left(\frac{q}{T}\right)^{k}h^{\tau-k+1}B_k + \frac{1}{\sqrt{Th}}\sum_{k=0}^{\tau}\left(\frac{q}{Th}\right)^{k}U_k + o\!\left(\left(\frac{q}{T}\right)^{\tau}\right),$$

whose order of magnitude is jointly determined by the bandwidth $h$ and the forecasting distance $q/T$. The bias and variance terms are given by

$$b_g = h^{\tau+1}\sum_{k=0}^{\tau}\left(\frac{q}{Th}\right)^{k}B_k, \qquad v_g = \frac{1}{\sqrt{Th}}\sum_{k=0}^{\tau}\left(\frac{q}{Th}\right)^{k}U_k.$$

Since the parameter estimates have smaller-order error, for any fixed $q$,

$$y_{i,T+q} - \widehat{y}_{i,T+q} = \varepsilon_{i,T+q} - (\widehat{\alpha}_i - \alpha_i) - (\widehat{\beta}_i - \beta_i)^{\top}D_{T+q} - \left[\widehat{g}(1+q/T) - g(1+q/T)\right] = \varepsilon_{i,T+q} - h^{\tau+1}B_0 - \frac{1}{\sqrt{Th}}U_0 + o_p\!\left(h^{\tau+1} + \frac{1}{\sqrt{Th}}\right).$$

Thus, the forecasting bias is of order $O(h^{\tau+1})$, with leading term $h^{\tau+1}B_0$, and the leading term of the forecasting variance is

$$\omega_i^{2} + \frac{1}{Tnh}\,V_0\,\omega^{2},$$

where $V_0$ is the $(1,1)$-element of the matrix $M(K)^{-1}V(K)M(K)^{-1}$.
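The Riemann sum limits used above are easy to check by simulation. A minimal sketch (our own, reusing the hypothetical one-sided kernel from an earlier snippet) that compares the sample design matrix with $M(K)$:

```python
import numpy as np

def sample_design_matrix(T, h, tau, kernel):
    # (1/Th) * sum_t K((T-t)/Th) x_t x_t', with
    # x_t = (1, (t-T)/Th, ..., ((t-T)/Th)^tau)'. By the Riemann sum argument
    # in the proof this converges to the moment matrix M(K).
    t = np.arange(1, T + 1)
    u = (t - T) / (T * h)
    w = np.array([kernel(v) for v in (T - t) / (T * h)])
    X = np.vander(u, tau + 1, increasing=True)
    return (X * w[:, None]).T @ X / (T * h)

# For, say, T = 10_000, h = 0.1 and tau = 1, the result should agree with
# M(K) computed by quadrature to a few digits.
```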
A.2. Lemmas

Lemma 1. For each $i$, as $T \to \infty$:

$$C_{a,ii} \to c_{ii} = 1 - r_i - 2a_{1i} + ia_{2i} + \sum_{l=i+1}^{n}a_{2l}, \qquad C_{b,ii} \to c_{ii}\,\frac{1}{12}i_{11}^{\top}, \qquad C_{A,ii} \to c_{ii}\,\frac{1}{12}i_{11},$$

$$C_{B,ii} \to C_{ii} = (1 - r_i)\frac{1}{12}I_{11} + \left(-2a_{1i} + ia_{2i} + \sum_{l=i+1}^{n}a_{2l}\right)\frac{1}{12^{2}}i_{11}i_{11}^{\top},$$

where, with $r_{n+1} = 1$, we write $a_{ki} = \sum_{l=i}^{n}(r_{l+1} - r_l)/l^{k}$ for $k = 1, 2, 3, 4$.

Lemma 2. For $i \neq j$, as $T \to \infty$:

$$C_{a,ij} \to c_{ij} = (\max(i,j) - 1)\,a_{2,\max(i,j)} - 2a_{1,\max(i,j)} + \sum_{l=\max(i,j)}^{n}a_{2l},$$

$$C_{b,ij} \to c_{ij}\,\frac{1}{12}i_{11}^{\top}, \qquad C_{A,ij} \to c_{ij}\,\frac{1}{12}i_{11}, \qquad C_{B,ij} \to c_{ij}\,\frac{1}{12^{2}}i_{11}i_{11}^{\top}.$$

Lemma 3. For each $i$, as $T \to \infty$:

$$d_{a,i} = -\sqrt{T}h^{p}\,b_i + o(\sqrt{T}h^{p}), \qquad d_{A,i} = -\sqrt{T}h^{p}\,b_i\,\frac{1}{12}i_{11} + o(\sqrt{T}h^{p}),$$

where

$$b_i = \mu_p(K)\frac{1}{p!}\left[\sum_{l\neq i}\int_{r_l}^{1}\delta(s)g^{(p)}(s)\,ds - \int_{r_i}^{1}w(s)g^{(p)}(s)\,ds\right],$$

and $w(s)$ and $\delta(s)$ are weighting functions on $[0,1]$:

$$\delta(s) = \frac{1}{j}, \qquad w(s) = 1 - \delta(s) = 1 - \frac{1}{j}, \qquad \text{if } r_j < s < r_{j+1},\ j = 1, 2, \ldots, n.$$

Lemma 4. For each $i$, as $T \to \infty$,

$$e_{a,i} \Rightarrow N(0, \sigma_a^{2}), \qquad e_{A,i} \Rightarrow N\!\left(0,\ \frac{1}{12}\sigma_{A1}^{2}I_{11} + \frac{1}{12^{2}}\sigma_{A2}^{2}J_{11}\right),$$

where

$$\sigma_a^{2} = \delta_i\,\omega_i^{2} + \sum_{j\neq i}\lambda_{\max(i,j)}\,\omega_j^{2}, \qquad \sigma_{A1}^{2} = \delta_i\,s_i^{2}, \qquad \sigma_{A2}^{2} = \sum_{j\neq i}\lambda_{\max(i,j)}\,\omega_j^{2},$$

with $\delta_i = 1 - r_i - 2a_{1i} + a_{2i}$, $\lambda_j = n^{2}a_{4,j} - 4na_{3,j} + 4a_{2,j}$, and $s_i^{2}$ denoting the long-run variance of $\{\varepsilon_{it}\}$ at the seasonal lags, $s_i^{2} = \sum_{k=-\infty}^{\infty}\gamma_{\varepsilon_i}(12k)$.
A.3. Proof of lemmas

Proof of Lemma 1. Notice that

$$\frac{1}{T}\sum_{t=t_i}^{T}\left[1 - \frac{1}{m_t T}\sum_{s=t_i}^{T}K_h((t-s)/T)\right]\left(1 + \frac{\partial g_{\theta}(t/T)}{\partial\alpha_i}\right) = \frac{1}{T}\sum_{t=t_i}^{T}\left(1 - \frac{1}{m_t}\right)^{2} + o(1),$$

since $(1/T)\sum_{s=t_i}^{T}K_h((t-s)/T) \to 1$ for $t \ge t_i$ and $\partial g_{\theta}(t/T)/\partial\alpha_i \to -1/m_t$, and

$$-\frac{1}{T}\sum_{l\neq i}\sum_{t=t_l}^{T}\left[\frac{1}{m_t T}\sum_{s=t_i}^{T}K_h((t-s)/T)\right]\frac{\partial g_{\theta}(t/T)}{\partial\alpha_i} = \frac{1}{T}\sum_{l<i}\sum_{t=t_i}^{T}\frac{1}{m_t^{2}} + \frac{1}{T}\sum_{l>i}\sum_{t=t_l}^{T}\frac{1}{m_t^{2}} + o(1).$$

Using $m_t = j$ for $t_j \le t < t_{j+1}$ and $t_j/T \to r_j$ (with $r_{n+1} = 1$), we obtain

$$C_{a,ii} \to (1 - r_i) - 2\sum_{j=i}^{n}\frac{1}{j}(r_{j+1} - r_j) + i\sum_{j=i}^{n}\frac{1}{j^{2}}(r_{j+1} - r_j) + \sum_{l=i+1}^{n}\sum_{k=l}^{n}\frac{1}{k^{2}}(r_{k+1} - r_k) = 1 - r_i - 2a_{1i} + ia_{2i} + \sum_{l=i+1}^{n}a_{2l}.$$

For $C_{b,ii}$ and $C_{A,ii}$ the same argument applies with $D_s$ entering the inner kernel sums; since $(1/T)\sum_{t}D_t^{\top} \to \frac{1}{12}i_{11}^{\top}$ and $(1/T)\sum_{s=t_i}^{T}D_s^{\top}K_h((t-s)/T) \to \frac{1}{12}i_{11}^{\top}$, we obtain $C_{b,ii} \to c_{ii}\frac{1}{12}i_{11}^{\top}$ and $C_{A,ii} \to c_{ii}\frac{1}{12}i_{11}$. For $C_{B,ii}$, using $(1/T)\sum_{t}D_t D_t^{\top} \to \frac{1}{12}I_{11}$ in the leading term and the factor $\frac{1}{12^{2}}i_{11}i_{11}^{\top}$ in the smoothed terms,

$$C_{B,ii} \to (1 - r_i)\frac{1}{12}I_{11} + \left(-2a_{1i} + ia_{2i} + \sum_{l=i+1}^{n}a_{2l}\right)\frac{1}{12^{2}}i_{11}i_{11}^{\top}. \qquad\square$$

Proof of Lemma 2. For $j > i$,

$$C_{a,ij} = -\frac{1}{T}\sum_{l\neq i}\sum_{t=t_l}^{T}\left[\frac{1}{m_t T}\sum_{s=t_j}^{T}K_h((t-s)/T)\right]\frac{\partial g_{\theta}(t/T)}{\partial\alpha_i} - \frac{1}{T}\sum_{t=t_i}^{T}\left[\frac{1}{m_t T}\sum_{s=t_j}^{T}K_h((t-s)/T)\right]\left(1 + \frac{\partial g_{\theta}(t/T)}{\partial\alpha_i}\right) \to (j-1)a_{2j} - 2a_{1j} + \sum_{l=j}^{n}a_{2l} = c_{ij},$$

and by symmetry the same limit holds with $\max(i,j)$ in place of $j$ when $i > j$. The limits of $C_{b,ij}$, $C_{A,ij}$ and $C_{B,ij}$ then follow as in the proof of Lemma 1, with the factors $\frac{1}{12}i_{11}^{\top}$, $\frac{1}{12}i_{11}$ and $\frac{1}{12^{2}}i_{11}i_{11}^{\top}$, respectively. $\square$

Proof of Lemma 3. Notice that, uniformly over interior points,

$$g(t/T) - \frac{1}{m_t T}\sum_{j=1}^{n}\sum_{s=t_j}^{T}g(s/T)K_h((t-s)/T) = -h^{p}\,\mu_p(K)\frac{1}{p!}\,g^{(p)}(t/T) + o(h^{p}),$$

so that

$$d_{a,i} = \frac{1}{\sqrt{T}}\sum_{l\neq i}\sum_{t=t_l}^{T}\left[g(t/T) - \frac{1}{m_t T}\sum_{j=1}^{n}\sum_{s=t_j}^{T}g(s/T)K_h((t-s)/T)\right]\frac{\partial g_{\theta}(t/T)}{\partial\alpha_i} + \frac{1}{\sqrt{T}}\sum_{t=t_i}^{T}\left[g(t/T) - \frac{1}{m_t T}\sum_{j=1}^{n}\sum_{s=t_j}^{T}g(s/T)K_h((t-s)/T)\right]\left(1 + \frac{\partial g_{\theta}(t/T)}{\partial\alpha_i}\right)$$

$$= -\sqrt{T}h^{p}\,\mu_p(K)\frac{1}{p!}\left[\sum_{l\neq i}\int_{r_l}^{1}\delta(s)g^{(p)}(s)\,ds - \int_{r_i}^{1}w(s)g^{(p)}(s)\,ds\right] + o(\sqrt{T}h^{p}) = -\sqrt{T}h^{p}\,b_i + o(\sqrt{T}h^{p}),$$

with $\delta(s)$ and $w(s)$ as defined in Lemma 3. The result $d_{A,i} = -\sqrt{T}h^{p}b_i\frac{1}{12}i_{11} + o(\sqrt{T}h^{p})$ follows in the same way. $\square$

Proof of Lemma 4. Under Assumption A1, write

$$e_{a,i} = \frac{1}{\sqrt{T}}\sum_{t=t_i}^{T}c_{it}\,\varepsilon_{it} + \sum_{j\neq i}\frac{1}{\sqrt{T}}\sum_{t}c_{ijt}\,\varepsilon_{jt},$$

where the weights $c_{it}$ and $c_{ijt}$ are piecewise constant over the blocks $t_l \le t < t_{l+1}$. For the own term, splitting the sum over these blocks,

$$\frac{1}{\sqrt{T}}\sum_{t=t_i}^{T}c_{it}\,\varepsilon_{it} \Rightarrow N_i\!\left(0,\ (1 - r_i - 2a_{1i} + a_{2i})\,\omega_i^{2}\right) = N_i(0, \delta_i\,\omega_i^{2}),$$

and for the cross terms,

$$\frac{1}{\sqrt{T}}\sum_{t}c_{ijt}\,\varepsilon_{jt} \Rightarrow N_j\!\left(0,\ \lambda_{\max(i,j)}\,\omega_j^{2}\right),$$

in either direction $j < i$ or $j > i$. Since the $\varepsilon_j$'s are independent across $j$, summing the variances gives $e_{a,i} \Rightarrow N(0, \sigma_a^{2})$ with $\sigma_a^{2}$ as stated in Lemma 4. Next, for $e_{A,i}$, the own term satisfies

$$\frac{1}{\sqrt{T}}\sum_{t=t_i}^{T}C_{it}\,\varepsilon_{it} \Rightarrow N_i\!\left(0,\ s_i^{2}\,\delta_i\,\frac{1}{12}I_{11}\right),$$

since, for each fixed $j$,

$$\frac{1}{T}\sum_{t=1}^{T}D_t D_{t+j}^{\top} \to \begin{cases}\dfrac{1}{12}I_{11}, & \text{if } j = 12k,\\[2pt] 0, & \text{if } j \neq 12k,\end{cases}$$

and the cross terms contribute $N_j(0, \lambda_{\max(i,j)}\omega_j^{2}\frac{1}{12^{2}}i_{11}i_{11}^{\top})$ components, which yields $e_{A,i} \Rightarrow N(0, \frac{1}{12}\sigma_{A1}^{2}I_{11} + \frac{1}{12^{2}}\sigma_{A2}^{2}J_{11})$.

Finally we analyze the covariance terms. For example, for $i < j$,

$$\mathrm{Cov}(e_{a,i}, e_{a,j}^{\top}) \to \left[(n+2)a_{2,\max(i,j)} - 2a_{1,\max(i,j)} - na_{3,\max(i,j)}\right]\left(\omega_i^{2} + \omega_j^{2}\right) + \sum_{l\neq i,j}\lambda_{\max(i,j,l)}\,\omega_l^{2},$$

$$\mathrm{Cov}(e_{a,i}, e_{A,j}^{\top}) \to \mathrm{Cov}(e_{a,i}, e_{a,j}^{\top})\,\frac{1}{12}i_{11}^{\top},$$

and similarly for $\mathrm{Cov}(e_{A,i}, e_{A,j}^{\top})$ with the factor $\frac{1}{12^{2}}J_{11}$. Thus, letting

$$\delta_i = 1 - r_i - 2a_{1i} + a_{2i}, \qquad f_i = (n+2)a_{2,i} - 2a_{1,i} - na_{3,i}, \qquad \lambda_i = n^{2}a_{4,i} - 4na_{3,i} + 4a_{2,i},$$

the covariance matrix of $e_T = (e_a^{\top}, e_A^{\top})^{\top}$ is given as in Box IV:

$$\mathrm{Var}(e_a) = \Omega_n + A_n, \qquad \mathrm{Cov}(e_a, e_A) = [\Omega_n + A_n] \otimes \frac{1}{12}i_{11}^{\top}, \qquad \mathrm{Var}(e_A) = S_n \otimes \frac{1}{12}I_{11} + A_n \otimes \frac{1}{12^{2}}J_{11},$$

where $\Omega_n + A_n$ has diagonal entries $\delta_i\omega_i^{2} + \sum_{j\neq i}\lambda_{\max(i,j)}\omega_j^{2}$, off-diagonal $(i,j)$ entries $f_{\max(i,j)}(\omega_i^{2} + \omega_j^{2}) + \sum_{l\neq i,j}\lambda_{\max(i,j,l)}\omega_l^{2}$, and $S_n = \mathrm{diag}(\delta_1 s_1^{2}, \ldots, \delta_n s_n^{2})$. This gives the covariance matrix $\Omega$ used in the proof of Theorem 1. $\square$
References

Ahn, H., Powell, J.L., 1993. Estimation of censored selection models with a nonparametric selection mechanism. Journal of Econometrics 58, 3–30.
Amemiya, T., 1985. Advanced Econometrics. Harvard University Press.
Atak, A., Linton, O., Xiao, Z., 2008. Working paper, Department of Economics, LSE.
Campbell, S.D., Diebold, F.X., 2005. Weather forecasting for weather derivatives. Journal of the American Statistical Association 100, 6–16.
Delgado, M.A., Stengos, T., 1994. Semiparametric specification testing of non-nested econometric models. Review of Economic Studies 61, 291–303.
Doukhan, P., 1994. Mixing: Properties and Examples. Lecture Notes in Statistics. Springer, New York.
Engle, R.F., Granger, C.W.J., Rice, J., Weiss, A., 1986. Semiparametric estimates of the relationship between weather and electricity sales. Journal of the American Statistical Association 81, 310–320.
Gao, J., Hawthorne, K., 2006. Semiparametric estimation and testing of the trend of temperature series. Econometrics Journal 9, 332–355.
Hoogstrate, A.J., Palm, F.C., Gerard, G.A., 2000. Pooling in dynamic panel-data models: an application to forecasting GDP growth rates. Journal of Business and Economic Statistics 18 (3), 274–283.
Issler, J.V., Lima, L.R., 2009. A panel data approach to economic forecasting: the bias-corrected average forecast. Journal of Econometrics 152, 153–164.
Lee, L., Rosenzweig, M., Pitt, M., 1997. The effects of improved nutrition, sanitation, and water quality on child health in high-mortality populations. Journal of Econometrics 77, 209–235.
Newey, W.K., Powell, J.L., Walker, J.R., 1990. Semiparametric estimation of selection models: some empirical results. American Economic Review 80, 324–328.
Pateiro-Lopez, B., Gonzalez-Manteiga, W., 2006. Multivariate partially linear models. Statistics & Probability Letters 76 (14), 1543–1549.
Robinson, P.M., 1988. Root-N-consistent semiparametric regression. Econometrica 56, 931–954.
Severini, T.A., Wong, W.H., 1992. Profile likelihood and conditionally parametric models. The Annals of Statistics 20, 1768–1802.
Stock, J.H., 1989. Nonparametric policy analysis. Journal of the American Statistical Association 84, 567–576.
Stock, J.H., 1991. Nonparametric policy analysis: an application to estimating hazardous waste cleanup benefits. In: Barnett, Powell, Tauchen (Eds.), Nonparametric and Semiparametric Methods in Econometrics and Statistics.
Wikipedia, 2009. Climate change. http://en.wikipedia.org/wiki/Climate_change.
Journal of Econometrics 164 (2011) 116–129
Model selection, estimation and forecasting in VAR models with short-run and long-run restrictions

George Athanasopoulos (a), Osmani Teixeira de Carvalho Guillén (b), João Victor Issler (c,*), Farshid Vahid (a)

(a) Department of Econometrics and Business Statistics, Monash University, Australia
(b) Banco Central do Brasil and Ibmec-RJ, Rio de Janeiro, Brazil
(c) Graduate School of Economics – EPGE, Getulio Vargas Foundation, Praia de Botafogo 190 s. 1111, Rio de Janeiro, RJ 22253-900, Brazil

* Corresponding author. Tel.: +55 21 3799 5833; fax: +55 21 2553 8821. E-mail address: [email protected] (J.V. Issler).

doi:10.1016/j.jeconom.2011.02.009
Article history: Available online 26 February 2011.
JEL classification: C32; C53.
Keywords: Reduced rank models; Model selection criteria; Forecasting accuracy.

Abstract: We study the joint determination of the lag length, the dimension of the cointegrating space and the rank of the matrix of short-run parameters of a vector autoregressive (VAR) model using model selection criteria. We suggest a new two-step model selection procedure which is a hybrid of traditional criteria and criteria with data-dependent penalties, and we prove its consistency. A Monte Carlo study explores the finite sample performance of this procedure and evaluates the forecasting accuracy of models selected by this procedure. Two empirical applications confirm the usefulness of the model selection procedure proposed here for forecasting. © 2011 Elsevier B.V. All rights reserved.

1. Introduction

There is a large body of literature on the effect of cointegration on forecasting. Engle and Yoo (1987) compare the forecasts generated from an estimated vector error correction model (VECM), assuming that the lag order and the cointegrating rank are known, with those from an estimated VAR in levels with the correct lag. They find that the VECM produces forecasts with smaller mean squared forecast errors (MSFE) only in the long run. Clements and Hendry (1995) note that Engle and Yoo's conclusion is not robust if the object of interest is differences rather than levels, and use this observation to motivate their alternative measures for comparing multivariate forecasts. Hoffman and Rasche (1996) confirm Clements and Hendry's observation using a real data set. Christoffersen and Diebold (1998) also use Engle and Yoo's setup, but argue against using a VAR in levels as a benchmark on the grounds that the VAR in levels not only does not impose cointegration, it does not impose any unit roots either. Instead, they compare the forecasts of a correctly specified VECM with forecasts from correctly specified univariate models, and find no advantage in MSFE for the VECM. They use this result as a motivation to suggest an alternative way of evaluating forecasts of a cointegrated system. Silverstovs et al. (2004) extend Christoffersen and Diebold's results to multicointegrated
* Corresponding author. Tel.: +55 21 3799 5833; fax: +55 21 2553 8821. E-mail address: [email protected] (J.V. Issler).
0304-4076/$ – see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.jeconom.2011.02.009
systems. Since the aforementioned papers condition on the correct specification of the lag length and cointegrating rank, they cannot provide an answer as to whether we should examine the cointegrating rank of a system in multivariate forecasting if we do not have any a priori reason to assume a certain form of cointegration. Lin and Tsay (1996) examine the effect on forecasting of the misspecification of the cointegrating rank. They determine the lag order using the AIC, and compare the forecasting performance of estimated models under all possible numbers of cointegrating vectors (0–4) in a four-variable system. They observe that, keeping the lag order constant, the model with the correct number of cointegrating vectors achieves a lower MSFE for long-run forecasts, especially relative to a model that over-specifies the cointegrating rank. Although Lin and Tsay do not assume the correct specification of the lag length, their study also does not address the uncertainty surrounding the number of cointegrating vectors in a way that can lead to a modelling strategy for forecasting possibly cointegrated variables. Indeed, the results of their example with real data, in which they determine the cointegrating rank using a sequence of hypothesis tests, do not accord with their simulation results. At the same time, there is an increasing amount of evidence of the advantage of considering rank restrictions for short-term forecasting in stationary VAR (and VARMA) models (see, for example, Ahn and Reinsel, 1988; Vahid and Issler, 2002; Athanasopoulos and Vahid, 2008). One feature of these papers is that they do not treat lag-length and rank uncertainty differently. Their quest is to identify the dimension of the most parsimonious state vector that can represent the dynamics of a system. Here, we
add the cointegrating rank to the menu of unknowns and evaluate model selection criteria that determine all of these unknowns simultaneously. Our goal is to determine a modelling strategy that is useful for multivariate forecasting. There are other papers in the literature that evaluate the performance of model selection criteria for determining lag-length and cointegrating rank, but they do not evaluate the forecast performance of the resulting models. Gonzalo and Pitarakis (1999) show that in large systems the usual model selection procedures may severely underestimate the cointegrating rank. Chao and Phillips (1999) show that the posterior information criterion (PIC) performs well in choosing the lag-length and the cointegrating rank simultaneously. In this paper we evaluate the performance of model selection criteria in the simultaneous choice of the lag-length p, the rank of the cointegrating space q, and the rank of the other parameter matrices r in a vector error correction model. We suggest a hybrid model selection strategy that selects p and r using a traditional model selection criterion, and then chooses q based on PIC. We then evaluate the forecasting performance of models selected using these criteria. Our simulations cover the three issues of model building, estimation, and forecasting. We examine the performance of model selection criteria that choose p, r and q simultaneously (IC(p, r, q)), and compare them with a procedure that chooses p using a standard model selection criterion (IC(p)) and determines the cointegrating rank using the sequence of likelihood ratio tests proposed by Johansen (1988). We provide a comparison of the forecasting accuracy of fitted VARs when only cointegration restrictions are imposed, when cointegration and short-run restrictions are jointly imposed, and when neither is imposed. These comparisons take into account the possibility of model misspecification in choosing the lag length of the VAR, the number of cointegrating vectors, and the rank of the other parameter matrices. In order to estimate the parameters of a model with both long-run and short-run restrictions, we propose a simple iterative procedure similar to the one proposed by Centoni et al. (2007). It is very difficult to claim that any result found in a Monte Carlo study is general, especially in multivariate time series. There are examples in the VAR literature of Monte Carlo designs that led to all model selection criteria overestimating the true lag in small samples, and therefore to the conclusion that the Schwarz criterion is the most accurate. The most important feature of these designs is that they have a strong propagation mechanism.¹ There are other designs with weak propagation mechanisms that result in all selection criteria underestimating the true lag, leading to the conclusion that AIC's asymptotic bias towards overestimating the true lag may actually be useful in finite samples (see Vahid and Issler, 2002, for references). We pay particular attention to the design of the Monte Carlo to make sure that we cover a wide range of data generating processes in terms of the strength of their propagation mechanisms. The outline of the paper is as follows. In Section 2 we study finite VARs with long-run and short-run restrictions and motivate their empirical relevance. In Section 3, we outline an iterative procedure for computing the maximum likelihood estimates of the parameters of a VECM with short-run restrictions.
We provide an overview of model selection criteria in Section 4, and in particular we discuss model selection criteria with data-dependent penalty functions. Section 5 describes our Monte Carlo design, Section 6 presents the simulation results, Section 7 contains two empirical applications, and Section 8 concludes.
1 Our measure of the strength of the propagation mechanism is proportional to the trace of the product of the variance of first differences and the inverse of the variance of innovations.
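In symbols (our paraphrase of this footnote, with $\Sigma_{\Delta y} = \operatorname{Var}(\Delta y_t)$ denoting the variance of first differences and $\Omega = \operatorname{Var}(\eta_t)$ the variance of the innovations):
$$\text{strength} \;\propto\; \operatorname{tr}\big(\Sigma_{\Delta y}\,\Omega^{-1}\big),$$
which equals $K$ when the differenced data are white noise and grows as the propagation mechanism strengthens.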
2. VAR models with long-run and short-run common factors

We start from the triangular representation of a cointegrated system used extensively in the cointegration literature (some early examples are Phillips and Hansen, 1990; Phillips and Loretan, 1991; Saikkonen, 1992). We assume that the $K$-dimensional time series
$$y_t = \begin{pmatrix} y_{1t} \\ y_{2t} \end{pmatrix}, \qquad t = 1, \ldots, T,$$
where $y_{1t}$ is $q \times 1$ (implying that $y_{2t}$ is $(K-q) \times 1$), is generated from
$$y_{1t} = \beta y_{2t} + u_{1t}, \qquad \Delta y_{2t} = u_{2t}, \qquad (1)$$
where $\beta$ is a $q \times (K-q)$ matrix of parameters, and
$$u_t = \begin{pmatrix} u_{1t} \\ u_{2t} \end{pmatrix}$$
is a strictly stationary process with mean zero and positive definite covariance matrix. This is a data generating process (DGP) for a system of $K$ cointegrated I(1) variables with $q$ cointegrating vectors, also referred to as a system of $K$ I(1) variables with $K-q$ common stochastic trends (some researchers also refer to this as a system of $K$ variables with $K-q$ unit roots, which can be ambiguous if used out of context, and we therefore do not use it here).² The extra feature that we add to this fairly general DGP is that $u_t$ is generated from a VAR of finite order $p$ and rank $r$ ($< K$). In empirical applications, the finite VAR($p$) assumption is routine. This is in contrast to the theoretical literature on testing for cointegration, in which $u_t$ is assumed to be an infinite VAR, and a finite VAR($p$) is used as an approximation (e.g. Saikkonen, 1992). Here, our emphasis is on building multivariate forecasting models rather than on hypothesis testing. The finite VAR assumption is also routine when the objective is studying the maximum likelihood estimator of the cointegrating vectors, as in Johansen (1988). The reduced rank assumption is considered for the following reasons. First, this assumption means that all serial dependence in the $K$-dimensional vector time series $u_t$ can be characterised by only $r < K$ serially dependent indices. This is a feature of most macroeconomic models, in which the short-run dynamics of the variables around their steady states are generated by a small number of serially correlated demand or supply shifters. Second, this assumption implies that there are $K-r$ linear combinations of $u_t$ that are white noise. Gourieroux and Peaucelle (1992) call such time series ''codependent'', and interpret the white noise combinations as equilibrium combinations among stationary variables. This is justified on the grounds that, although each variable has some persistence, the white noise combinations have no persistence at all. For instance, if an optimal control problem implies that the policy instrument should react to the current values of the target variables, then it is likely that there will be such a linear relationship between the observed variables up to a measurement noise. Finally, many papers in the multivariate time series literature provide evidence of the usefulness of reduced rank VARs for forecasting (see, for example, Velu et al., 1986; Ahn and Reinsel, 1988). Recently, Vahid and Issler (2002) have shown that failing to allow for the possibility of a reduced rank structure can lead to seriously misspecified vector autoregressive models that produce poor forecasts.
² While in theory every linear system of K cointegrated I(1) variables with q cointegrating vectors can be represented in this way, in practice the decision on how to partition the K variables into y1t and y2t is not trivial, because y1t must consist of variables that definitely have non-zero coefficients in the cointegrating relationships.
The dynamic equation for $u_t$ is therefore given by (all intercepts are suppressed to simplify the notation)
$$u_t = B_1 u_{t-1} + B_2 u_{t-2} + \cdots + B_p u_{t-p} + \varepsilon_t, \qquad (2)$$
where $B_1, B_2, \ldots, B_p$ are $K \times K$ matrices with $\operatorname{rank}[B_1\; B_2\; \cdots\; B_p] = r$, and $\varepsilon_t$ is an i.i.d. sequence with mean zero, positive definite variance–covariance matrix and finite fourth moments. Note that the rank condition implies that each $B_i$ has rank at most $r$, and that the intersection of the null-spaces of all the $B_i$ is a subspace of dimension $K-r$. The following lemma derives the vector error correction representation of this data generating process.

Lemma 1. The data generating process given by Eqs. (1) and (2) has a reduced rank vector error correction representation of the type
$$\Delta y_t = \gamma\,[I_q \;\; -\beta]\, y_{t-1} + \Gamma_1 \Delta y_{t-1} + \Gamma_2 \Delta y_{t-2} + \cdots + \Gamma_p \Delta y_{t-p} + \eta_t, \qquad (3)$$
in which $\operatorname{rank}[\Gamma_1\; \Gamma_2\; \cdots\; \Gamma_p] \le r$.
Proof. Refer to the working paper version of the current paper.
This lemma shows that the triangular DGP (1), under the assumption that the dynamics of its stationary component (i.e. ut) can be characterised by a small number of common factors, is equivalent to a VECM in which the coefficient matrices of lagged differences have reduced rank and their left null-spaces overlap. Hecq et al. (2006) call such a structure a VECM with weak serial correlation common features (WSCCF). We should note that the triangular structure (1) implies K − q common Beveridge–Nelson (BN) trends, but the reduced rank structure assumed for ut does not imply that deviations from the BN trends (usually referred to as BN cycles) can be characterised as linear combinations of r common factors. Vahid and Engle (1993) show that a DGP with common BN trends and cycles is a special case of the above under some additional restrictions, and is therefore a stricter form of comovement. Hecq et al. (2006) show that the uncertainty in determining the rank of the cointegrating space can adversely affect inference on common cycles, and they conclude that testing for weak serial correlation common features is a more accurate means of uncovering short-run restrictions in vector error correction models. Our objective is to come up with a model development methodology that allows for cointegration and weak serial correlation common features. For stationary time series, Vahid and Issler (2002) show that allowing for reduced rank models is beneficial for forecasting. For partially non-stationary time series, there is the added dimension of cointegration. Here, we examine the joint benefits of cointegration and short-run rank restrictions for forecasting partially non-stationary time series.
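To make the DGP concrete, the following minimal numpy sketch generates one sample from Eqs. (1) and (2) with (p0, r0, q0) = (1, 1, 2), the first configuration used in the Monte Carlo section; the particular parameter values are ours and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
K, q, r, T = 3, 2, 1, 200

# Equation (2): u_t = B1 u_{t-1} + eps_t with rank(B1) = r = 1.
A = np.array([[0.6], [0.5], [0.4]])          # K x r
G = np.array([[0.7, 0.5, 0.3]])              # r x K
B1 = A @ G                                   # rank one; its only non-zero eigenvalue is G @ A ~ 0.79 < 1

beta = np.array([[0.5], [1.0]])              # q x (K - q) cointegrating parameters

u = np.zeros((T, K))
y2 = np.zeros((T, K - q))
for t in range(1, T):
    u[t] = B1 @ u[t - 1] + rng.standard_normal(K)
    y2[t] = y2[t - 1] + u[t, q:]             # Delta y2_t = u2_t: common stochastic trend
y1 = y2 @ beta.T + u[:, :q]                  # y1_t = beta y2_t + u1_t, equation (1)
y = np.hstack([y1, y2])                      # K cointegrated I(1) series with q = 2
```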
3. Estimation of VARs with short-run and long-run restrictions

The maximum likelihood estimation of the parameters of a VAR written in error-correction form
$$\Delta y_t = \Pi y_{t-1} + \Gamma_1 \Delta y_{t-1} + \Gamma_2 \Delta y_{t-2} + \cdots + \Gamma_p \Delta y_{t-p} + \eta_t \qquad (4)$$
under the long-run restriction that the rank of $\Pi$ is $q$, the short-run restriction that the rank of $[\Gamma_1\; \Gamma_2\; \cdots\; \Gamma_p]$ is $r$, and the assumption of normality, is possible via a simple iterative procedure that uses the general principle of the estimation of reduced rank regression models (Anderson, 1951). Noting that the above model can be written as
$$\Delta y_t = \gamma \alpha' y_{t-1} + C\,[D_1 \Delta y_{t-1} + D_2 \Delta y_{t-2} + \cdots + D_p \Delta y_{t-p}] + \eta_t, \qquad (5)$$
where $\alpha$ is a $K \times q$ matrix of rank $q$ and $C$ is a $K \times r$ matrix of rank $r$, one realises that if $\alpha$ were known, $C$ and $D_i$, $i = 1, \ldots, p$, could be estimated using a reduced rank regression of $\Delta y_t$ on $\Delta y_{t-1}, \ldots, \Delta y_{t-p}$ after partialling out $\alpha' y_{t-1}$. Also, if the $D_i$, $i = 1, \ldots, p$, were known, then $\gamma$ and $\alpha$ could be estimated using a reduced rank regression of $\Delta y_t$ on $y_{t-1}$ after controlling for $\sum_{i=1}^{p} D_i \Delta y_{t-i}$. This points to an easy iterative procedure for computing maximum likelihood estimates of all parameters.
Step 0. Estimate $[\hat{D}_1, \hat{D}_2, \ldots, \hat{D}_p]$ from a reduced rank regression of $\Delta y_t$ on $(\Delta y_{t-1}, \ldots, \Delta y_{t-p})$, controlling for $y_{t-1}$. Recall that these estimates are simply the coefficients of the canonical variates corresponding to the $r$ largest squared partial canonical correlations (PCCs) between $\Delta y_t$ and $(\Delta y_{t-1}, \ldots, \Delta y_{t-p})$, controlling for $y_{t-1}$.

Step 1. Compute the PCCs between $\Delta y_t$ and $y_{t-1}$ conditional on $[\hat{D}_1 \Delta y_{t-1} + \hat{D}_2 \Delta y_{t-2} + \cdots + \hat{D}_p \Delta y_{t-p}]$. Take the $q$ canonical variates $\hat{\alpha}' y_{t-1}$ corresponding to the $q$ largest squared PCCs as estimates of the cointegrating relationships. Regress $\Delta y_t$ on $\hat{\alpha}' y_{t-1}$ and $[\hat{D}_1 \Delta y_{t-1} + \hat{D}_2 \Delta y_{t-2} + \cdots + \hat{D}_p \Delta y_{t-p}]$, and compute $\ln|\hat{\Omega}|$, the logarithm of the determinant of the residual variance matrix.

Step 2. Compute the PCCs between $\Delta y_t$ and $(\Delta y_{t-1}, \ldots, \Delta y_{t-p})$ conditional on $\hat{\alpha}' y_{t-1}$. Take the $r$ canonical variates $[\hat{D}_1 \Delta y_{t-1} + \hat{D}_2 \Delta y_{t-2} + \cdots + \hat{D}_p \Delta y_{t-p}]$ corresponding to the $r$ largest squared PCCs as estimates of $[D_1 \Delta y_{t-1} + D_2 \Delta y_{t-2} + \cdots + D_p \Delta y_{t-p}]$. Regress $\Delta y_t$ on $\hat{\alpha}' y_{t-1}$ and $[\hat{D}_1 \Delta y_{t-1} + \hat{D}_2 \Delta y_{t-2} + \cdots + \hat{D}_p \Delta y_{t-p}]$, and compute $\ln|\hat{\Omega}|$. If this differs from the corresponding value computed in Step 1, go back to Step 1. Otherwise, stop.

The value of $\ln|\hat{\Omega}|$ becomes smaller at each stage until it reaches its minimum, which we denote by $\ln|\hat{\Omega}_{p,r,q}|$. The values of $\hat{\alpha}$ and $[\hat{D}_1, \hat{D}_2, \ldots, \hat{D}_p]$ at the final stage are the maximum likelihood estimators of $\alpha$ and $[D_1, D_2, \ldots, D_p]$; a compact numerical sketch of this iteration is given below.
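The following numpy sketch illustrates Steps 0–2 under the simplifying assumption that intercepts are suppressed, as in the text; the function names (`partial_out`, `canon_weights`, `rr_vecm`) are ours, not from the paper, and the convergence check compares successive values of $\ln|\hat{\Omega}|$.

```python
import numpy as np

def partial_out(X, Z):
    """Residuals of X after an OLS projection on Z (partialling Z out)."""
    coef, *_ = np.linalg.lstsq(Z, X, rcond=None)
    return X - Z @ coef

def canon_weights(X, Y, k):
    """Weights (columns of Y) of the k canonical variates of Y most correlated with X."""
    Sxx, Syy, Sxy = X.T @ X, Y.T @ Y, X.T @ Y
    M = np.linalg.solve(Syy, Sxy.T) @ np.linalg.solve(Sxx, Sxy)
    vals, vecs = np.linalg.eig(M)
    idx = np.argsort(vals.real)[::-1][:k]   # k largest squared canonical correlations
    return vecs[:, idx].real

def rr_vecm(dy, ylag, W, q, r, tol=1e-10, max_iter=200):
    """Alternating reduced-rank estimation of the VECM (Steps 0-2 above).
    dy: T x K matrix of first differences; ylag: T x K matrix of lagged levels;
    W: T x Kp matrix of stacked lagged differences."""
    # Step 0: factor weights from the PCCs of dy and W, controlling for ylag
    V = canon_weights(partial_out(dy, ylag), partial_out(W, ylag), r)
    logdet_old = np.inf
    for _ in range(max_iter):
        F = W @ V                           # estimated short-run common factors
        # Step 1: q cointegrating vectors from the PCCs of dy and ylag, given F
        alpha = canon_weights(partial_out(dy, F), partial_out(ylag, F), q)
        ect = ylag @ alpha                  # error-correction terms alpha' y_{t-1}
        X = np.hstack([ect, F])
        resid = partial_out(dy, X)
        logdet = np.linalg.slogdet(resid.T @ resid / len(dy))[1]
        if abs(logdet_old - logdet) < tol:  # stop once ln|Omega_hat| no longer falls
            break
        logdet_old = logdet
        # Step 2: refresh the factor weights given the current cointegrating relations
        V = canon_weights(partial_out(dy, ect), partial_out(W, ect), r)
    coef, *_ = np.linalg.lstsq(X, dy, rcond=None)
    return alpha, V, coef, logdet
```

Each call to `canon_weights` solves the eigenvalue problem behind the partial canonical correlation analysis described in the steps, and the final least squares regression delivers the remaining coefficient estimates.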
The maximum likelihood estimates of the other parameters are simply the coefficient estimates of the final regression. Note that although $\gamma$ and $\alpha$, and also $C$ and $[D_1, D_2, \ldots, D_p]$, are only identified up to appropriate normalisations, the maximum likelihood estimates of $\Pi$ and $[\Gamma_1, \Gamma_2, \ldots, \Gamma_p]$ are invariant to the choice of normalisation. Therefore, the normalisation of the canonical correlation analysis is innocuous, and the ''raw'' estimates produced by this procedure can be linearly combined to produce any desired alternative normalisation. Also, the set of variables that are partialled out at each stage should include constants and other deterministic terms if needed.

4. Model selection

The modal strategy in applied work for modelling a vector of I(1) variables is to use a model selection criterion for choosing the lag length of the VAR, then test for cointegration conditional on the lag-order, and finally estimate the VECM. Hardly ever is any further step taken to simplify the model, and if any test of the adequacy of the model is undertaken, it is usually a system test. For example, to test the adequacy of the dynamic specification, additional lags of all variables are added to all equations, and a test of joint significance for K² parameters is used. For stationary time series, Vahid and Issler (2002) show that model selection criteria severely underestimate the lag order in weak systems, i.e. in systems where the propagation mechanism is weak. They also show that using model selection criteria (suggested in Lütkepohl, 1993, p. 202) to choose the lag order and rank simultaneously can remedy this shortcoming significantly.
In the context of VECMs, one can consider selecting (p, r) with these model selection criteria first, and then use a sequence of likelihood ratio tests to determine the rank of the cointegrating space q. Specifically, these are the analogues of the Akaike information criterion (AIC), the Hannan and Quinn criterion (HQ) and the Schwarz criterion (SC), and are defined as
$$\mathrm{AIC}(p, r) = T \sum_{i=K-r+1}^{K} \ln\big(1 - \hat{\lambda}_i^2(p)\big) + 2\big(r(K-r) + rKp\big) \qquad (6)$$
$$\mathrm{HQ}(p, r) = T \sum_{i=K-r+1}^{K} \ln\big(1 - \hat{\lambda}_i^2(p)\big) + 2\big(r(K-r) + rKp\big) \ln\ln T \qquad (7)$$
$$\mathrm{SC}(p, r) = T \sum_{i=K-r+1}^{K} \ln\big(1 - \hat{\lambda}_i^2(p)\big) + \big(r(K-r) + rKp\big) \ln T, \qquad (8)$$
where $K$ is the dimension of (number of series in) the system, $r$ is the rank of $[\Gamma_1\; \Gamma_2\; \cdots\; \Gamma_p]$, $p$ is the number of lagged differences in the VECM, $T$ is the number of observations, and the $\hat{\lambda}_i^2(p)$ are the sample squared PCCs between $\Delta y_t$ and the set of regressors $(\Delta y_{t-1}, \ldots, \Delta y_{t-p})$ after the linear influence of $y_{t-1}$ (and deterministic terms such as a constant term and seasonal dummies if needed) is removed from them, sorted from smallest to largest. Traditional model selection criteria are special cases of the above when the rank is assumed to be full, i.e. when $r$ is equal to $K$. Here, the question of the rank of $\Pi$, the coefficient of $y_{t-1}$ in the VECM, is set aside, and taking the linear influence of $y_{t-1}$ away from the dependent variable and the lagged dependent variables concentrates the likelihood on $[\Gamma_1\; \Gamma_2\; \cdots\; \Gamma_p]$. Then, conditional on the $p$ and $r$ that minimise one of these criteria, one can use a sequence of likelihood ratio tests to determine $q$. While in the proof of Theorem 2 we show that the estimators of $p$ and $r$ based on HQ and SC are consistent, the estimator of $q$ from the sequential testing method with a fixed level of significance is obviously not. Moreover, the asymptotic distribution of the likelihood ratio test statistic for $q$ conditional on the selected $p$ and $r$ may be far from that when the true $p$ and $r$ are known (Leeb and Pötscher, 2005). Here, we study model selection criteria which choose $p$, $r$ and $q$. We consider two classes of model selection criteria. First, we consider direct extensions of the AIC, HQ and SC to the case where the rank of the cointegrating space, which is the same as the rank of $\Pi$, is also a parameter to be selected by the criteria. Specifically, we consider
$$\mathrm{AIC}(p, r, q) = T \ln|\hat{\Omega}_{p,r,q}| + 2\big(q(K-q) + Kq + r(K-r) + rKp\big) \qquad (9)$$
$$\mathrm{HQ}(p, r, q) = T \ln|\hat{\Omega}_{p,r,q}| + 2\big(q(K-q) + Kq + r(K-r) + rKp\big) \ln\ln T \qquad (10)$$
$$\mathrm{SC}(p, r, q) = T \ln|\hat{\Omega}_{p,r,q}| + \big(q(K-q) + Kq + r(K-r) + rKp\big) \ln T, \qquad (11)$$
where $\ln|\hat{\Omega}_{p,r,q}|$ (the minimised value of the logarithm of the determinant of the variance of the residuals of the VECM of order $p$, with $\Pi$ having rank $q$ and $[\Gamma_1\; \Gamma_2\; \cdots\; \Gamma_p]$ having rank $r$) is computed by the iterative algorithm described in Section 3. Obviously, when $q = 0$ or $q = K$, we are back in the straightforward reduced rank regression framework, where one set of eigenvalue calculations for each $p$ provides the value of the log-likelihood function for $r = 1, \ldots, K$. Similarly, when $r = K$, we are back in the usual VECM estimation, and no iterations are needed. We also consider a model selection criterion with a data-dependent penalty function.
Such model selection criteria date back at least to Poskitt (1987), Rissanen (1987) and Wallace and Freeman (1987). The model selection criterion that we consider in this paper is closer to those inspired by the ''minimum description length (MDL)'' criterion of Rissanen (1987) and the ''minimum message length (MML)'' criterion of Wallace and Freeman (1987). Both of these criteria measure the complexity of a model by the minimum length of the uniquely decipherable code that can describe the data using the model. Rissanen (1987) establishes that the distance between the code length of any empirical model and that of the true DGP $P_\theta$ is at least as large as $\frac{1}{2}\ln|E_\theta(\mathrm{FIM}_M(\hat{\theta}))|$, where $\mathrm{FIM}_M(\hat{\theta})$ is the Fisher information matrix of model $M$ (i.e., $[-\partial^2 \ln l_M / \partial\theta\,\partial\theta']$, the second derivative of the log-likelihood function of model $M$) evaluated at $\hat{\theta}$, and $E_\theta$ is the mathematical expectation under $P_\theta$. Rissanen uses this bound as a penalty term to formulate the MDL model selection criterion,
$$\mathrm{MDL} = -\ln l_M(\hat{\theta}) + \frac{1}{2}\ln\big|\mathrm{FIM}_M(\hat{\theta})\big|.$$
Wallace and Freeman's MML is also based on coding and information theory, but is derived from a Bayesian perspective. The MML criterion is basically the same as the MDL plus an additional term, the prior density of the parameters evaluated at $\hat{\theta}$ (see Wallace, 2005, for more details and a summary of recent advances in this line of research). While the influence of this term is dominated by the other two terms as the sample size increases, it plays the important role of making the criterion invariant to arbitrary linear transformations of the regressors in a regression context. Based on their study of the asymptotic form of the Bayesian data density, Phillips (1996) and Phillips and Ploberger (1996) design the posterior information criterion (PIC), which is similar to the MML and MDL criteria. Their important contribution has been to show that such criteria can be applied to partially non-stationary time series as well.³ Chao and Phillips (1999) use the PIC for the simultaneous selection of the lag length and cointegration rank in VARs. There are practical difficulties in working with PIC that motivate simplifying this criterion. One difficulty is that $\mathrm{FIM}_M(\hat{\theta})$ must be derived and coded for all models considered (the Fisher information matrix for a reduced rank VECM is derived in Appendix A). A more important one is the large dimension of $\mathrm{FIM}_M(\hat{\theta})$. For example, if we want to choose the best VECM allowing for up to 4 lags in a six-variable system, we have to compute determinants of square matrices of dimensions as large as 180. These calculations are likely to push the boundaries of the numerical accuracy of computers, in particular when these matrices are ill-conditioned.⁴ This, and the favourable results of the HQ criterion in selecting the lag p and the rank of the stationary dynamics r, led us to consider a two-step procedure.
4.1. A two-step procedure for model selection

In the first step, the linear influence of $y_{t-1}$ is removed from $\Delta y_t$ and $(\Delta y_{t-1}, \ldots, \Delta y_{t-p})$, and HQ(p, r), as defined in (7), is used to determine p and r. Then PIC is calculated for the chosen values of p and r, for all q from 0 to K. This reduces the task to only K + 1 determinant calculations; a schematic sketch of the two steps follows.
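The sketch below assumes the squared PCCs for each candidate lag length have already been computed (for instance with the canonical-correlation routine sketched in Section 3); `step1_hq` and `step2_pic` are our illustrative names, and the PIC evaluation itself is left as a user-supplied function.

```python
import numpy as np

def step1_hq(T, K, pccs_by_p):
    """Step 1: minimise HQ(p, r) of eq. (7) over the (p, r) grid.
    pccs_by_p maps each candidate p to the K squared PCCs, sorted ascending."""
    best, best_pr = np.inf, None
    for p, lam2 in sorted(pccs_by_p.items()):
        for r in range(1, K + 1):
            fit = T * np.sum(np.log(1.0 - np.asarray(lam2)[K - r:]))  # r largest PCCs
            pen = 2.0 * (r * (K - r) + r * K * p) * np.log(np.log(T))
            if fit + pen < best:
                best, best_pr = fit + pen, (p, r)
    return best_pr

def step2_pic(pic_of_q, K):
    """Step 2: with (p, r) fixed, evaluate PIC for q = 0, ..., K and pick the
    minimiser -- only K + 1 determinant evaluations. `pic_of_q` is a callable
    returning PIC(p, r, q); its implementation (see Appendix A) is not sketched here."""
    return min(range(K + 1), key=pic_of_q)
```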
³ Ploberger and Phillips (2003) generalised Rissanen's result to show that, even for trending time series, the distance between any empirical model and $P_\theta$ is greater than or equal to $\frac{1}{2}\ln|E_\theta(\mathrm{FIM}_M)|$ almost everywhere on the parameter space. They use the outer-product formulation of the information matrix, which has the same expected value as the negative of the second derivative under $P_\theta$.
⁴ In our simulations, we came across one case where the computed determinant was a small negative number even though the matrix was symmetric positive definite. This happened using both GAUSS and MATLAB.
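A standard numerical safeguard against the problem described in footnote 4 is to compute the log-determinant from a Cholesky factor rather than calling a determinant routine directly; a small sketch (ours, not the authors' code):

```python
import numpy as np

def logdet_pd(A):
    """Log-determinant of a symmetric positive definite matrix via Cholesky.
    Avoids the rounding problems of forming det(A) for large, ill-conditioned A."""
    L = np.linalg.cholesky(A)
    return 2.0 * np.sum(np.log(np.diag(L)))
```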
Theorem 2. If the data generating process is
$$\Delta y_t = c + \Pi y_{t-1} + \Gamma_1 \Delta y_{t-1} + \Gamma_2 \Delta y_{t-2} + \cdots + \Gamma_{p_0} \Delta y_{t-p_0} + \eta_t$$
in which (i) all roots of the characteristic polynomial of the implied VAR for $y_t$ are on or outside the unit circle and all those on the unit circle are $+1$; (ii) the rank of $\Pi$ is $q_0 \le K$, which implies that $\Pi$ can be written as $\gamma\alpha'$ where $\gamma$ and $\alpha$ are full rank $K \times q_0$ matrices; (iii) $\gamma_\perp'\big(I_K - \sum_{i=1}^{p_0} \Gamma_i\big)\alpha_\perp$ has full rank, where $\gamma_\perp$ and $\alpha_\perp$ are full rank $K \times (K-q_0)$ matrices such that $\gamma_\perp'\gamma = \alpha_\perp'\alpha = 0$; (iv) the rank of $[\Gamma_1\; \Gamma_2\; \cdots\; \Gamma_{p_0}]$ is $r_0 \le K$; (v) the rank of $\Gamma_{p_0}$ is not zero; (vi) $E(\eta_t \mid F_{t-1}) = 0$ and $E(\eta_t\eta_t' \mid F_{t-1}) = \Omega$ positive definite, where $F_{t-1}$ is the $\sigma$-field generated by $\{\eta_{t-1}, \eta_{t-2}, \ldots\}$, and $E(\eta_{it}^4) < \infty$ for $i = 1, 2, \ldots, K$; and the maximum lag considered satisfies $p_{\max} \ge p_0$, then the estimators of p, r and q obtained from the two-step procedure explained above are consistent.

Proof. See Appendix B.
5. Monte-Carlo design

To make the Monte-Carlo simulation manageable, we use a three-dimensional VAR. We consider VARs in levels with lag lengths of 2 and 3, which translates into 1 and 2 lagged differences in the VECM. This choice allows us to study the consequences of both under- and over-parameterisation of the estimated VAR. For each (p0, r0, q0) we draw many sets of parameter values from the parameter space of cointegrated VARs with serial correlation common features that generate difference stationary data. In order to ensure that the DGPs considered do not lie in a subset of the parameter space that implies only very weak or only very strong propagation mechanisms, we choose 50 DGPs with system R²s (as defined in Vahid and Issler, 2002) that range between 0.3 and 0.65, with a median between 0.4 and 0.5, and 50 DGPs with system R²s that range between 0.65 and 0.9, with a median between 0.7 and 0.8. From each DGP, we generate 1000 samples of 100, 200 and 400 observations (the actual generated samples were longer, but the initial part of each sample is discarded to reduce the effect of initial conditions). In summary, our results are based on 1000 samples from each of 100 different DGPs (a total of 100,000 different samples) for each of T = 100, 200 or 400 observations. The Monte-Carlo procedure can be summarised as follows. Using each of the 100 DGPs, we generate 1000 samples (with 100, 200 and 400 observations). We record the lag length chosen by traditional (full-rank) information criteria, labelled IC(p) for IC = {AIC, HQ, SC}, and the corresponding lag length chosen by alternative information criteria, labelled IC(p, r, q) for IC = {AIC, HQ, SC, PIC, HQ-PIC}, where the last is the hybrid procedure we propose in Section 4.1. We should note that although we present the results averaged over all 100 DGPs, we have also analysed the results for the DGPs with low and high R²s separately. We indeed found that any advantage of model selection criteria with a relatively smaller (larger) penalty factor was accentuated when considering only DGPs with relatively weaker (stronger) propagation mechanisms. In order to save space we do not present these results here, but they are available upon request. For choices made using the traditional IC(p) criteria, we use Johansen's (1988, 1991) trace test at the 5% level of significance to select q, and then estimate a VECM with no short-run restrictions. For choices made using IC(p, r, q), we use the two-step procedure of Section 4.1 to obtain the triplet (p, r, q), and then estimate the resulting VECM with SCCF restrictions using the algorithm of
Section 3. For each case we record out-of-sample forecasting accuracy measures for up to 16 periods ahead. We then compare the out-of-sample forecasting accuracy measures for these two types of VAR models.

5.1. Measuring forecast accuracy

We measure the accuracy of forecasts using the traditional trace of the mean-squared forecast error matrix (TMSFE) and the determinant of the mean-squared forecast error matrix |MSFE| at different horizons. We also compute Clements and Hendry's (1993) generalised forecast error second moment (GFESM). GFESM is the determinant of the expected value of the outer product of the vector of stacked forecast errors of all future times up to the horizon of interest. For example, if forecasts up to h quarters ahead are of interest, this measure will be:
$$\mathrm{GFESM} = \left| E\begin{bmatrix} \tilde{\varepsilon}_{t+1} \\ \tilde{\varepsilon}_{t+2} \\ \vdots \\ \tilde{\varepsilon}_{t+h} \end{bmatrix}\begin{bmatrix} \tilde{\varepsilon}_{t+1} \\ \tilde{\varepsilon}_{t+2} \\ \vdots \\ \tilde{\varepsilon}_{t+h} \end{bmatrix}' \right|,$$
where $\tilde{\varepsilon}_{t+h}$ is the $K$-dimensional forecast error of our $K$-variable model at horizon $h$. This measure is invariant to elementary operations that involve different variables (TMSFE is not invariant to such transformations), and also to elementary operations that involve the same variable at different horizons (neither TMSFE nor |MSFE| is invariant to such transformations). In our Monte-Carlo, the above expectation is evaluated for every model by averaging over replications. There is one complication associated with simulating 100 different DGPs. Simple averaging across different DGPs is not appropriate, because the forecast errors of different DGPs do not have identical variance–covariance matrices. Lütkepohl (1985) normalises the forecast errors by their true variance–covariance matrix in each case before aggregating. Unfortunately, this would be a very time-consuming procedure for a measure like GFESM, which involves stacked errors over many horizons. Instead, for each information criterion, we calculate the percentage gain in forecasting measures, comparing the full-rank models selected by IC(p) with the reduced-rank models chosen by IC(p, r, q). This procedure is done at every iteration and for every DGP, and the final results are then averaged.

6. Monte-Carlo simulation results

6.1. Selection of lag, rank, and the number of cointegrating vectors

Simulation results are reported in ''three-dimensional'' frequency tables. The columns correspond to the percentage of times the selected models had cointegrating rank smaller than the true rank (q < q0), equal to the true rank (q = q0) and larger than the true rank (q > q0). The rows correspond to similar information about the rank of the short-run dynamics r. Information about the lag-length is provided within each cell, where the entry is disaggregated on the basis of p. The three numbers in each cell, from left to right, correspond to percentages with lag lengths smaller than, equal to, and larger than the true lag. The 'Total' column on the right margin of each table provides information about the marginal frequencies of p and r only. The row titled 'Total' on the bottom margin of each table provides information about the marginal frequencies of p and q only. Finally, the bottom right cell provides marginal information about the lag-length choice only. We report results for two sets of 100 DGPs. Table 1 summarises the model selection results for 100 DGPs that have one lag in
differences with a short-run rank of one and a cointegrating rank of two, i.e., (p0, r0, q0) = (1, 1, 2). Table 2 summarises the model selection results for 100 DGPs that have two lags in differences with a short-run rank of one and a cointegrating rank of one, (p0, r0, q0) = (2, 1, 1). These two groups of DGPs are contrasting in the sense that the second group has more severe restrictions than the first. The first three panels of each table correspond to model selection based on the traditional model selection criteria. The additional bottom row for each of these three panels provides information about the lag-length and the cointegrating rank when the lag-length is chosen using the simple version of that model selection criterion and the cointegrating rank is chosen using the Johansen procedure, in particular the sequential trace test at the 5% level with critical values adjusted for sample size. Comparing the rows labelled 'AIC + J', 'HQ + J' and 'SC + J', we conclude that the inference about q is not sensitive to whether the selected lag is correct or not. In Table 1 all three criteria choose the correct q approximately 54%, 59% and 59% of the time for sample sizes 100, 200 and 400, respectively. In Table 2 all three criteria choose the correct q approximately 70%, 82% and 82% of the time for sample sizes 100, 200 and 400, respectively. From the first three panels of Table 1 we can clearly see that traditional model selection criteria do not perform well in choosing p, r and q jointly in finite samples. The percentages of times the correct model is chosen are only 22%, 26% and 29% with the AIC, 39%, 52% and 62% with HQ, and 42%, 63% and 79% with SC, for sample sizes of 100, 200 and 400, respectively. Note that when we compare the marginal frequencies of (p, r), HQ is the most successful at choosing both p and r, a conclusion that is consistent with the results in Vahid and Issler (2002). The main reason for the failure to determine the triplet (p, r, q) correctly is the failure of these criteria to choose the correct q. Ploberger and Phillips (2003) show that the correct penalty for free parameters in the long-run parameter matrix is larger than the penalty considered by traditional model selection criteria. Accordingly, all three criteria are likely to over-estimate q in finite samples, and among them SC is likely to appear the most successful because it assigns a larger penalty to all free parameters, even though the penalty is still less than ideal. This is exactly what the simulations reveal. The fourth panel of Table 1 reports the results for PIC. The percentages of times the correct model is chosen increase to 52%, 77% and 92% for sample sizes of 100, 200 and 400, respectively. Comparing the margins, it becomes clear that this increased success relative to HQ and SC is almost entirely due to improved precision in the selection of q. PIC chooses q correctly 76%, 91% and 97% of the time for sample sizes 100, 200 and 400, respectively. Furthermore, for the selection of p and r only, PIC does not improve upon HQ. Similar conclusions can be reached from the results for the (2, 1, 1) DGPs presented in Table 2. We note that in this case, even though PIC improves on HQ and SC in choosing the number of cointegrating vectors, it does not improve on HQ or SC in choosing the exact model, because it severely underestimates p.
This echoes the findings of Vahid and Issler (2002) in the stationary case, that the Schwarz criterion (recall that the PIC penalty is of the same order as the Schwarz penalty in the stationary case) severely underestimates the lag length in small samples in reduced rank VARs. Our Monte-Carlo results show that the advantage of PIC over HQ and SC lies in the determination of the cointegrating rank. Indeed, HQ seems to have an advantage over PIC in selecting the correct p and r in small samples. These results, coupled with the practical difficulties in computing the PIC outlined in Section 4, motivated us to consider the two-step alternative procedure to improve the model selection task.
The final panels in Tables 1 and 2 summarise the performance of our two-step procedure. In both tables we can see that the hybrid HQ-PIC procedure improves on all other criteria in selecting the exact model. The improvement is a consequence of the advantage of HQ in selecting p and r better, and of PIC in selecting q better. Note that our hybrid procedure results in over-parameterised models more often than just using PIC as the model selection criterion. We examined whether this trade-off has any significant consequences for forecasting and found that it does not. In all simulation settings, models selected by the hybrid procedure with HQ-PIC as the model selection criterion forecast better than models selected by PIC. Again, we do not present these results here, but they are also available upon request.

6.2. Forecasts

Recall that the forecasting results are expressed as the percentage improvement in forecast accuracy measures of possibly rank-reduced models over the unrestricted VAR model in levels selected by SC. Also, note that the object of interest in this forecasting exercise is assumed to be the first difference of the variables, although GFESM gives a measure of accuracy that is the same for levels or differences. We label the models chosen by the hybrid procedure proposed in the previous section and estimated by the iterative process of Section 3 as VECM(HQ-PIC). We label the models estimated by the usual Johansen method, with AIC as the model selection criterion for the lag order, as VECM(AIC + J).
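As a concrete reference point, the three accuracy measures of Section 5.1 can be computed from an array of forecast errors as in the following sketch; the function name and the array layout are our own assumptions.

```python
import numpy as np

def accuracy_measures(errors):
    """errors: (n_reps, h, K) array of forecast errors over Monte Carlo replications,
    indexed by replication, horizon and variable. Returns TMSFE and |MSFE| per
    horizon and the GFESM up to horizon h, as defined in Section 5.1."""
    n, h, K = errors.shape
    tmsfe, dmsfe = [], []
    for j in range(h):
        msfe = errors[:, j, :].T @ errors[:, j, :] / n   # K x K MSFE matrix at horizon j+1
        tmsfe.append(np.trace(msfe))
        dmsfe.append(np.linalg.det(msfe))
    stacked = errors.reshape(n, h * K)                   # stack all horizons per replication
    gfesm = np.linalg.det(stacked.T @ stacked / n)       # slogdet is safer for large h*K
    return np.array(tmsfe), np.array(dmsfe), gfesm
```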
Table 3 presents the forecast accuracy improvements in the (1, 1, 2) setting. In terms of the trace and determinant of the MSFE matrix, there is some improvement in forecasts over unrestricted VAR models at all horizons. With only 100 observations, GFESM worsens for horizons 8 and longer. This means that if the object of interest were some combination of differences across different horizons (for example, the levels of all variables, or the levels of some variables and the first differences of others), there may not have been any improvement in the MSFE matrix. With 200 or more observations, all forecast accuracy measures show some improvement, with the more substantial improvements being for the one-step-ahead forecasts. Also note that the forecasts of the models selected by the hybrid procedure are almost always better than those produced by the model chosen by the AIC plus Johansen method, which only pays attention to the lag-order and long-run restrictions. Table 4 presents the forecast accuracy improvements in the (2, 1, 1) setting. This set of DGPs has more severe rank reductions than the (1, 1, 2) DGPs, and, as a result, the models selected by the hybrid procedure show more substantial improvements in forecasting accuracy over the VAR in levels, in particular for smaller sample sizes. Forecasts produced by the hybrid procedure are also substantially better than forecasts produced by the AIC + Johansen method, which does not incorporate short-run rank restrictions. Note that although the AIC + Johansen forecasts are not as good as the HQ-PIC forecasts, they are substantially better than the forecasts from unrestricted VARs at short horizons. Following a request from a referee, in Tables 3 and 4 we also present Diebold and Mariano (1995) tests of equal predictive accuracy, based on the TMSFE, between the rank-reduced specifications and the unrestricted VARs. In general the results are as expected. Models that incorporate reduced rank restrictions rarely forecast significantly worse than the unrestricted models; they either perform the same or significantly better than the unrestricted VARs.

7. Empirical examples

The techniques discussed in this paper are applied in two different forecasting exercises on two data sets. The first data set contains Brazilian inflation, as measured by three different
types of consumer-price indices, available on a monthly basis from 1994:9 to 2009:11, a span of more than 15 years (183 observations). It was extracted from IPEADATA, a public database of downloadable Brazilian data (http://www.ipeadata.gov.br/). The second data set consists of real US per-capita private output,⁵ personal consumption per-capita, and fixed investment per-capita, available on a quarterly basis from 1947:1 to 2009:3, a span of more than 62 years (251 observations). It was extracted from the FRED database of the Federal Reserve Bank of St. Louis (http://research.stlouisfed.org/fred2/). Considering that we keep 90 observations for forecast evaluation, the sizes of these data sets are close to the numbers of simulated observations in the Monte-Carlo exercise for T = 100 and T = 200, respectively.
This procedure is then repeated until the final estimation sample reaches 1994:9 through 2008:7, with 167
5 Private output is GNP minus federal government’s consumption and investment. 6 There are no metropolitan areas covered by CPI-IBGE that are not covered by CPI-FGV.
observations. We then have a total of 90 out-of-sample forecasts for each horizon (1–16 months ahead), which are used for forecast evaluation. Thus, the estimation sample varies from 78 to 167 observations and mimics closely the simulations labelled T = 100 in the Monte-Carlo exercise. Results of the exercise described above are presented in Table 5. For all horizons, there are substantial forecasting gains of the VECM(HQ-PIC) over the VAR in levels: for example, at 12 months (one year) ahead, TMSFE, |MSFE| and GFESM show gains of 33.6%, 38.4% and 120.3%, respectively. The VECM(AIC + J) forecasts are also better than the VAR in levels forecasts, but the improvements are not as large. The comparison between VECM(HQ-PIC) and VECM(AIC + J) shows gains for the former everywhere. Table 5 also includes the results of Diebold–Mariano tests for equality of the mean squared errors of each pair of forecasts for each individual series at the reported horizons. These are reported using three comma-separated symbols (one for each series) in parentheses below the TMSFE values. Each symbol indicates whether the null hypothesis of equality of mean squared forecast errors is rejected in favour of a one-sided alternative and, if so, the level of significance at which it is rejected. The results indicate that in this application, the VECM(HQ-PIC) forecast of inflation based on every one of the three series has a significantly lower mean squared error than the corresponding forecast from the VAR in log-levels. The test for the equality of the mean squared forecast errors of the VECM(HQ-PIC) and the VECM(AIC + J) rejects equality in favour of better VECM(HQ-PIC) forecasts at horizons 1, 4 and 8. It should be noted that there is no case where either the VAR in log-levels or the VECM(AIC + J) generates a significantly smaller MSE vis-à-vis the VECM(HQ-PIC) for any of the inflation series at any horizon. It is also worth reporting the choices of p, r, and q for the best models studied here as the estimation sample increases from 1994:9–2001:2 all the way to 1994:9–2008:7. While the VECM(HQ-PIC) chose p = 1, r = 1 or 2, and q = 0 most of the time (on rare occasions it chose p = 3 and q = 1), the VECM(AIC + J) chose p = 1 and q = 1 most of the time (on rare occasions it chose p = 5 and q = 0 or q = 3). Hence, the superior performance of the VECM(HQ-PIC) vis-à-vis the VECM(AIC + J) may be due either to imposing a reduced-rank structure or to ignoring potential cointegration relationships. This is especially true for the shorter horizons. If the coverage of the price indices were similar, then one would expect a single common trend (i.e. two cointegrating vectors) in this system. However, the definitions and coverage of these indices are substantially different, and our analysis shows that this creates very persistent differences in these series, suggesting that users of these series must pay careful attention to their definitions and choose the appropriate one for their purpose. Even if one believes that these persistent differences appear to be non-mean-reverting only because of the short span of the data and will eventually die out, our analysis shows that for forecasting purposes these differences are persistent enough that it is better to model them as unit roots rather than as stationary processes.⁷ This is consistent with the results of Stock (1996).
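Schematically, the expanding-window design used in both empirical applications looks as follows; `fit` and `forecast` stand in for the model-selection-plus-estimation and forecasting steps, and are hypothetical callables rather than routines from the paper.

```python
import numpy as np

def expanding_window_forecasts(y, first_end, horizon, fit, forecast):
    """Expanding-window evaluation: re-select and re-estimate the model each time
    one observation is added, then forecast up to `horizon` steps ahead.
    y: (T, K) data array; first_end: size of the initial estimation sample."""
    errors = {h: [] for h in range(1, horizon + 1)}
    T = len(y)
    for end in range(first_end, T - horizon):
        model = fit(y[:end])                  # re-choose p, r, q and re-estimate
        yhat = forecast(model, horizon)       # (horizon, K) array of forecasts
        for h in range(1, horizon + 1):
            errors[h].append(y[end + h - 1] - yhat[h - 1])
    return {h: np.array(e) for h, e in errors.items()}
```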
7 We imposed q = 2 and repeated the forecasting exercise. The resulting forecasts were not even as good as VECM(AIC + J) forecasts. Detailed results are not reported here to save space, but are available to interested readers.
7.2. Forecasting US macroeconomic aggregates

The second data set consists of the logarithms of real US per-capita private output (y), personal consumption per-capita (c), and fixed investment per-capita (i), extracted from the FRED database at a quarterly frequency⁸ from 1947:1 to 2009:3. Again, we compare the forecasting performance of (i) the VAR in log-levels, with lag length chosen by the standard Schwarz criterion; (ii) the VECM, using the standard AIC to choose the lag length and Johansen's test to choose the cointegrating rank; and (iii) the reduced rank model, with rank and lag length chosen simultaneously using the Hannan–Quinn criterion and the cointegrating rank chosen using PIC, estimated by the iterative process of Section 3. All forecast comparisons are made using the first difference of the log-levels of the data, i.e., using Δlog(yt), Δlog(ct), and Δlog(it). For all three models, the first estimation sample covers the period 1947:1 to 1983:2, a total of 146 observations. As before, we keep expanding the estimation sample until it reaches 1947:1–2005:3, with 235 observations. This produces a total of 90 out-of-sample forecasts for each horizon, which are used for forecast evaluation. Since the estimation sample varies from 146 to 235 observations, it corresponds closely to the simulations labelled T = 200 in the Monte-Carlo exercise. Results of the exercise described above are presented in Table 6. For all horizons, there are considerable forecasting gains for the VECM(HQ-PIC) over the VAR in levels: at 4 quarters (one year) ahead, TMSFE, |MSFE| and GFESM show gains of 56.3%, 83.5% and 134.7%, respectively. The forecasting gains of the VECM(AIC + J) over the VAR in levels, though statistically significant, are not as large, especially at short horizons. The comparison between VECM(HQ-PIC) and VECM(AIC + J) shows gains for the former in one-quarter to four-quarters-ahead forecasts. The Diebold–Mariano tests for equality of the mean squared forecast errors for each of the three series provide evidence that the HQ-PIC forecasts have significantly smaller mean squared errors than the VECM(AIC + J) forecasts at horizons 1 and 4. Finally, we investigate the final choices of p, r, and q as the estimation sample increases from 1947:1–1983:2 to 1947:1–2005:3. For the VECM(HQ-PIC) they are p = 1, r = 2, and q = 0 everywhere, while the VECM(AIC + J) chose p = 1 half of the time and p = 3 the other half, and q = 0 half of the time and q = 1 the other half. As in the previous example, the selected cointegrating ranks may not accord with theoretical priors. A theoretical real business cycle model hypothesises that technology shocks drive the only common stochastic trend in all real variables, and hence implies that y, c and i have two cointegrating vectors. What we learn from the data, though, is that even if this theory is correct, one or both of these cointegrating relationships must have such high persistence (roots close to unity) that for forecasting purposes it is best to model them as unit roots. If we impose q = 2, the forecasts are inferior even to the VECM(AIC + J) forecasts (detailed results not reported to save space).
8. Conclusion

Motivated by the results of Vahid and Issler (2002) on the success of the Hannan–Quinn criterion in selecting the lag length and rank in stationary VARs, and the results of Ploberger and Phillips (2003) and Chao and Phillips (1999) on the generalisation of Rissanen's theorem to trending time series and the success of PIC in selecting the cointegrating rank in VARs, we propose a combined HQ-PIC procedure for the simultaneous choice of the lag-length and the ranks of the short-run and long-run parameter matrices in
⁸ Using FRED's mnemonics (2010) for the series, the precise definitions are: PCECC96 (consumption), FPIC96 (investment), and GNP96 − FGCEC96 (output). The population series mnemonic is POP, which is available in FRED only from 1952 on. To get a complete series starting in 1947:1, it was spliced with the same series available in the DRI database, whose mnemonic is GPOP.
a VECM, and we prove its consistency. Our simulations show that this procedure is capable of selecting the correct model more often than alternatives such as pure PIC or SC. In this paper we also present forecasting results showing that models selected using this hybrid procedure produce better forecasts than unrestricted VARs selected by SC and cointegrated VAR models whose lag length is chosen by the AIC and whose cointegrating rank is determined by the Johansen procedure. We have chosen these two alternatives for forecast comparisons because we believe that these are the model selection strategies most often used in the empirical literature. However, we have considered several other alternative model selection strategies and the results are qualitatively the same: the hybrid HQ-PIC procedure leads to models that generally forecast better than VAR models selected using other procedures. A conclusion we would like to highlight is the importance of short-run restrictions for forecasting. We believe that there has been much emphasis in the literature on the effect of long-run cointegrating restrictions on forecasting. Given that long-run restrictions involve the rank of only one of the parameter matrices of a VECM, and that inference on this matrix is difficult because it involves inference about stochastic trends in the variables, it is puzzling that the forecasting literature has paid so much attention to cointegrating restrictions and relatively little attention to the lag-order and short-run restrictions in a VECM. The present paper fills this gap and highlights the fact that the lag-order and the rank of the short-run parameter matrices are also important for forecasting. Our hybrid model selection procedure and the accompanying simple iterative procedure for the estimation of a VECM with long-run and short-run restrictions provide a reliable methodology for developing multivariate autoregressive models that are useful for forecasting. How often restrictions of the type considered in this paper are present in VAR approximations to real-life data generating processes is an empirical question. Macroeconomic models in which trends and cycles in all variables are generated by a small number of dynamic factors fit into this category. Also, empirical papers that study either regions of the same country or similar countries in the same region often find these kinds of long-run and short-run restrictions. We illustrate the usefulness of the model-selection strategy discussed above in two empirical applications: forecasting Brazilian inflation and the growth rates of US macroeconomic aggregates. We find gains from imposing short- and long-run restrictions in VAR models, since the VECM(HQ-PIC) and the VECM(AIC + J) outperform the VAR in levels everywhere. Tests of equal variance confirm that these gains are significant. Moreover, ignoring short-run restrictions usually produces inferior forecasts with these data, since the VECM(HQ-PIC) outperforms the VECM(AIC + J) almost everywhere, although these gains are not always significant in tests of equal variance. It is true that discovering the ''true'' model is a different objective from model selection for forecasting. However, in the context of partially non-stationary variables, there are no theoretical results that lead us to a definite model selection strategy for forecasting.
Using a two-variable example, Elliott (2006) shows that, ignoring estimation uncertainty, whether or not considering cointegration will improve short-run or long-run forecasting depends on all parameters of the DGP, even the parameters of the covariance matrix of the errors. In addition, there is no theory that tells us whether finite sample biases of parameter estimates will help or hinder forecasting in partially non-stationary VARs. Given this state of knowledge, when one is given the task of selecting a single model for forecasting, it is reasonable to use a model selection criterion that is more likely to pick the ''true'' model, and in this paper we verify that VARs selected by our hybrid model selection strategy are likely to produce better forecasts than unrestricted VARs and VARs that only incorporate cointegration restrictions.
Acknowledgements
We would like to thank the Associate Editor, two anonymous referees, Heather Anderson, Taya Dumrongrittikul, Yin Liao, Shuping Shi and Wenying Yao for useful comments and suggestions, and Claudia F. Rodrigues for excellent research assistance. João Issler thanks CNPq, FAPERJ and INCT for financial support. George Athanasopoulos and Farshid Vahid acknowledge support from the Australian Research Council grant DP0984399.

Appendix A. The Fisher information matrix of the reduced rank VECM

Assuming that the first observation in the sample is labelled observation $-p+1$ and that the sample contains $T+p$ observations, we write the $K$-variable reduced rank VECM as
$$\Delta y_t = \gamma\,[I_q \;\; \beta]\,y_{t-1} + [I_r \;\; C]'\,[D_1 \Delta y_{t-1} + D_2 \Delta y_{t-2} + \cdots + D_p \Delta y_{t-p}] + \mu + e_t,$$
or in stacked form
$$\Delta Y = Y_{-1}\,[I_q \;\; \beta]'\,\gamma' + W D\,[I_r \;\; C] + \iota_T \mu' + E,$$
where
$$\Delta Y = \begin{bmatrix} \Delta y_1' \\ \vdots \\ \Delta y_T' \end{bmatrix}, \quad Y_{-1} = \begin{bmatrix} y_0' \\ \vdots \\ y_{T-1}' \end{bmatrix}, \quad E = \begin{bmatrix} e_1' \\ \vdots \\ e_T' \end{bmatrix} \quad (\text{all } T \times K),$$
$$W = [\Delta Y_{-1} \;\; \cdots \;\; \Delta Y_{-p}] \quad (T \times Kp), \qquad \Delta Y_{-j} = \begin{bmatrix} \Delta y_{1-j}' \\ \vdots \\ \Delta y_{T-j}' \end{bmatrix}, \qquad D = \begin{bmatrix} D_1' \\ \vdots \\ D_p' \end{bmatrix} \quad (Kp \times r),$$
$\iota_T$ is a $T \times 1$ vector of ones, and $Y^{(2)}_{-1}$ denotes the last $K-q$ columns of $Y_{-1}$ (i.e. the observations on $y_{2,t-1}$). When the $e_t$ are $N(0, \Omega)$ and serially uncorrelated, the log-likelihood function, conditional on the first $p$ observations being known, is
$$\ln l(\theta, \omega) = -\frac{KT}{2}\ln(2\pi) - \frac{T}{2}\ln|\Omega| - \frac{1}{2}\sum_{t=1}^{T} e_t'\Omega^{-1}e_t = -\frac{KT}{2}\ln(2\pi) - \frac{T}{2}\ln|\Omega| - \frac{1}{2}\operatorname{tr}\big(E\Omega^{-1}E'\big),$$
where
$$\theta = \big(\operatorname{vec}(\beta)',\; \operatorname{vec}(\gamma)',\; \operatorname{vec}(D)',\; \operatorname{vec}(C)',\; \mu'\big)'$$
is a $\big((K-q)q + Kq + Kpr + r(K-r) + K\big)$-dimensional vector of mean parameters, and $\omega = \operatorname{vech}(\Omega)$ is the $K(K+1)/2$ vector of unique elements of the variance matrix. The differential of the log-likelihood is (see Magnus and Neudecker, 1988)
$$d\ln l(\theta, \omega) = -\frac{T}{2}\operatorname{tr}\big(\Omega^{-1}d\Omega\big) + \frac{1}{2}\operatorname{tr}\big(\Omega^{-1}d\Omega\,\Omega^{-1}E'E\big) - \frac{1}{2}\operatorname{tr}\big(\Omega^{-1}E'dE\big) - \frac{1}{2}\operatorname{tr}\big(\Omega^{-1}dE'E\big)$$
$$= \frac{1}{2}\operatorname{tr}\big[\Omega^{-1}\big(E'E - T\Omega\big)\Omega^{-1}d\Omega\big] - \operatorname{tr}\big(\Omega^{-1}E'dE\big),$$
and the second differential is
$$d^2\ln l(\theta, \omega) = \operatorname{tr}\big[d\Omega^{-1}\big(E'E - T\Omega\big)\Omega^{-1}d\Omega\big] + \frac{1}{2}\operatorname{tr}\big[\Omega^{-1}\big(2E'dE - T\,d\Omega\big)\Omega^{-1}d\Omega\big] - \operatorname{tr}\big(d\Omega^{-1}E'dE\big) - \operatorname{tr}\big(\Omega^{-1}dE'dE\big).$$
Since we eventually want to evaluate the Fisher information matrix at the maximum likelihood estimator, where $\hat{E}'\hat{E} - T\hat{\Omega} = 0$ and $\hat{\Omega}^{-1}\hat{E}'\,\partial E/\partial\theta = 0$ (these are apparent from the first differential), we can delete these terms from the second differential and use $\operatorname{tr}(AB) = \operatorname{vec}(A')'\operatorname{vec}(B)$ to obtain
$$d^2\ln l(\theta, \omega) = -\frac{T}{2}\operatorname{tr}\big(\Omega^{-1}d\Omega\,\Omega^{-1}d\Omega\big) - \operatorname{tr}\big(\Omega^{-1}dE'dE\big) = -\frac{T}{2}(d\omega)'D_K'\big(\Omega^{-1}\otimes\Omega^{-1}\big)D_K\,d\omega - \big(\operatorname{vec}(dE)\big)'\big(\Omega^{-1}\otimes I_T\big)\operatorname{vec}(dE),$$
where $D_K$ is the duplication matrix. From the model, we can see that
$$dE = -Y_{-1}\,[0 \;\; d\beta]'\,\gamma' - Y_{-1}\,[I_q \;\; \beta]'\,d\gamma' - W\,dD\,[I_r \;\; C] - W D\,[0 \;\; dC] - \iota_T\,d\mu',$$
and therefore $\operatorname{vec}(dE)$ is as formulated in Box I:
$$\operatorname{vec}(dE) = -\big[\; \gamma' \otimes Y^{(2)}_{-1} \;\;\; I_K \otimes Y_{-1}[I_q \;\; \beta]' \;\;\; [I_r \;\; C]' \otimes W \;\;\; [0 \;\; I_{K-r}]' \otimes WD \;\;\; I_K \otimes \iota_T \;\big]\,d\theta. \qquad \text{(Box I)}$$
Hence, the elements of the Fisher information matrix are:
$$\mathrm{FIM}_{11} = \gamma\Omega^{-1}\gamma' \otimes Y^{(2)\prime}_{-1}Y^{(2)}_{-1}, \qquad \mathrm{FIM}_{12} = \gamma\Omega^{-1} \otimes Y^{(2)\prime}_{-1}Y_{-1}[I_q \;\; \beta]',$$
$$\mathrm{FIM}_{13} = \gamma\Omega^{-1}[I_r \;\; C]' \otimes Y^{(2)\prime}_{-1}W, \qquad \mathrm{FIM}_{14} = \gamma\Omega^{-1}[0 \;\; I_{K-r}]' \otimes Y^{(2)\prime}_{-1}WD, \qquad \mathrm{FIM}_{15} = \gamma\Omega^{-1} \otimes Y^{(2)\prime}_{-1}\iota_T,$$
$$\mathrm{FIM}_{22} = \Omega^{-1} \otimes [I_q \;\; \beta]Y_{-1}'Y_{-1}[I_q \;\; \beta]', \qquad \mathrm{FIM}_{23} = \Omega^{-1}[I_r \;\; C]' \otimes [I_q \;\; \beta]Y_{-1}'W,$$
$$\mathrm{FIM}_{24} = \Omega^{-1}[0 \;\; I_{K-r}]' \otimes [I_q \;\; \beta]Y_{-1}'WD, \qquad \mathrm{FIM}_{25} = \Omega^{-1} \otimes [I_q \;\; \beta]Y_{-1}'\iota_T,$$
$$\mathrm{FIM}_{33} = [I_r \;\; C]\Omega^{-1}[I_r \;\; C]' \otimes W'W, \qquad \mathrm{FIM}_{34} = [I_r \;\; C]\Omega^{-1}[0 \;\; I_{K-r}]' \otimes W'WD, \qquad \mathrm{FIM}_{35} = [I_r \;\; C]\Omega^{-1} \otimes W'\iota_T,$$
$$\mathrm{FIM}_{44} = [0 \;\; I_{K-r}]\Omega^{-1}[0 \;\; I_{K-r}]' \otimes D'W'WD, \qquad \mathrm{FIM}_{45} = [0 \;\; I_{K-r}]\Omega^{-1} \otimes D'W'\iota_T,$$
$$\mathrm{FIM}_{55} = \Omega^{-1} \otimes \iota_T'\iota_T = T\,\Omega^{-1}.$$

Appendix B. Proof of Theorem 2

The first three assumptions ensure that $\Delta y_t$ is covariance stationary and the $y_t$ are cointegrated with cointegrating rank $q_0$. These, together with assumption (vi), ensure that all sample means and covariances of $\Delta y_t$ consistently estimate their population
G. Athanasopoulos et al. / Journal of Econometrics 164 (2011) 116–129
[ (2) vec(dE ) = − γ ′ ⊗ Y−1
Iq
IK ⊗ Y − 1
β
Ir C′
⊗W
0
IK − r
125
]
⊗ WD
IK ⊗ ιT dθ .
Box I.
counterparts and the least squares estimator of parameters is consistent. Assumptions (iv) and (v) state that the true rank is r0 and the true lag-length is p0 (or the lag order of the implied VAR in levels is p0 + 1). For any (p, r ) pair, the second step of the analysis produces the least squares estimates of Γ1 , . . . , Γp with rank r when no restrictions are imposed on Π (Anderson, 1951). Reinsel (1997) contains many of the results that we use in this proof. Under the assumption of normality, these are the ML estimates of ˆ p,r Γ1 , . . . , Γp with rank r with Π unrestricted and the resulting Ω used in the HQ procedure is the corresponding ML estimate of Ω . Note that normality of the true errors is not needed for the proof. We use the results of Sims et al. (1990) who show that in the above model the least squares estimates of Γ1 , . . . , Γp have the standard asymptotic properties as in stationary VARs, in particular that they consistently estimate their population counterparts and that 1
their rate of convergence is the same as T − 2 . Let zt , zt −1 , . . . , zt −p denote 1yt , 1yt −1 , . . . , 1yt −p after the influence of the constant and yt −1 is removed from them and let Z , Z−1 , . . . , Z−p denote T × K matrices with zt′ , zt′−1 , . . . , zt′−p in their row t = 1, . . . , T (we assume that the sample starts from t = −pmax + 1), and let
.
.
.
.
Wp = [Z−1 .. · · · ..Z−p ] and Bp = [Γ1 .. · · · ..Γp ]′ . The estimated model in the second step can be written as: Z = Wp Bˆ p + Uˆ p where Uˆ p is the T × K matrix of residuals when the lag length is p. In an unrestricted regression ln | T1 Uˆ p′ Uˆ p | = ln | T1 (Z ′ Z − Z ′ Wp
(Wp′ Wp )−1 Wp′ Z )| = ln | T1 Z ′ Z | + ln |IK − (Z ′ Z )−1 Z ′ Wp (Wp′ Wp )−1 ∑K ˆ 2i (p)), where λˆ 21 (p) ≤ λˆ 22 (p) ≤ Wp′ Z | = ln | T1 Z ′ Z | + i=1 ln(1 − λ · · · ≤ λˆ 2K (p), the eigenvalues of (Z ′ Z )−1 Z ′ Wp (Wp′ Wp )−1 Wp′ Z are the ordered sample partial canonical correlations between 1yt and 1yt −1 , . . . , 1yt −p after the influence of a constant and yt −1 has been removed. Under the restriction that the rank of B is r, the log-determinant of the squared residuals matrix becomes ∑K ˆ 2i (p)). Further, note ln | T1 Uˆ p′ ,r Uˆ p,r | = ln | T1 Z ′ Z | + i=K −r +1 ln(1 − λ
.
that Wp = [Wp−1 ..Z−p ] and from the geometry of least squares we know Z ′ Wp (Wp′ Wp )−1 Wp′ Z = Z ′ Wp−1 (Wp′ −1 Wp−1 )−1 Wp′ −1 Z + ′ −1 ′ Z ′ Qp−1 Z−p (Z− Z−p Qp−1 Z where Qp−1 = IT − Wp−1 p Qp−1 Z−p )
(Wp′ −1 Wp−1 )−1 Wp′ −1 .
(i) Consider p = p0 and r = r0 − 1 : ln | T1 Uˆ p′ 0 ,r0 −1 Uˆ p0 ,r0 −1 | −
ˆ 2K −r +1 (p0 )). λˆ 2K −r +1 (p0 ) converges ln | T1 Uˆ p′ 0 ,r0 Uˆ p0 ,r0 | = − ln(1 − λ 0 0 in probability to its population counterpart, the r0 -th largest eigenvalue of Σz−1 B′p0 Σw Bp0 , where Σx denotes the population second moment of the vector x. This population canonical correlation is strictly greater than zero because Bp0 has rank r0 . Therefore p lim(ln | T1 Uˆ p′ 0 ,r0 −1 Uˆ p0 ,r0 −1 | − ln | T1 Uˆ p′ 0 ,r0 Uˆ p0 ,r0 |) =
Since the second matrix on the right side is positive semi-definite, ˆ 2i (p0 − 1) ≤ λˆ 2i (p0 ) for all i = 1, . . . , K .9 it follows that λ We know that the probability limits of the smallest K − r0 ˆ 2i (p0 ) are zero. Therefore, the probability limits of eigenvalues λ
ˆ 2i (p0 − 1) must also be zero. the smallest K − r0 eigenvalues λ Moreover, the trace of the matrix on the left is equal to the sum of the traces of the two matrices on the right of the equals sign. The probability limit of the last matrix on the right side ′ is Σz−1 Γp′0 Σz .w Γp0 where Σz .w = p lim( T1 Z− p0 Qp0 −1 Z−p0 ), and since rank(Γp0 ) > 0 by assumption, the probability limit of the trace of the second matrix on the right-hand side will be strictly positive (note that even when Γp0 is nilpotent (i.e. has all zero eigenvalues even though its rank is not zero), Σz−1 Γp′0 Σz .w Γp0 will not be nilpotent). Therefore it must be that p lim λˆ 2i (p0 − 1) < p lim λˆ2i (p0 ) for at least one i = r0 + 1, . . . , K.This implies that p lim ln T1 Uˆ p′ 0 −1,r0 Uˆ p0 −1,r0 − ln T1 Uˆ p′ 0 ,r0 Uˆ p0 ,r0 = ∑K 2 2 i=K −r0 +1 (ln(1 − λi (p0 − 1)) − ln(1 − λi (p0 ))) > 0. (i) and (ii), together with the fact that |Uˆ p′ ,r Uˆ p ,r | ≥ |Uˆ p′ ,r 1
1
1
1
2
2
Uˆ p2 ,r2 | whenever p1 ≤ p2 and r1 ≤ r2 (i.e., for all nested models the less restrictive cannot fit worse) imply that the probability limit of ln T1 Uˆ p′ 0 ,r0 Uˆ p0 ,r0 is strictly smaller than the probability limit of
ln T1 Uˆ p′ ,r Uˆ p,r for all (p ≤ p0 and r < r0 ) or (p < p0 and r ≤ r0 ).
Although the penalty favours the smaller models, the reward for parsimony increases at rate ln ln T while the reward for better fit increases at rate T and therefore dominates. Hence, the probability of choosing a model with (p ≤ p0 and r < r0 ) or (p < p0 and r ≤ r0 ) goes to zero asymptotically. (i′ ) In (i), replace p = p0 with p = p˜ ≥ p0 . The model now includes redundant lags whose true coefficients are zero and these coefficients are consistently estimated. Moreover, adding these zero parameters does not change the rank. Therefore all arguments in (i) apply to this case also, and we can therefore deduce that the probability of under-estimating r with this procedure goes to zero asymptotically. (ii′ ) In (ii), replace r = r0 with r = r˜ ≥ r0 . The model now does not impose all rank restrictions that the true data generating process includes, but the extra eigenvalues will converge to their true value of zero asymptotically and all arguments in (ii) apply to this case also. Therefore, we can conclude that the probability of underestimating p with this procedure goes to zero asymptotically. (iii) Consider p = p˜ ≥ p0 and r = r˜ ≥ r0 with at least one of the inequalities strict. These are all models that are larger than the true model and nest the true model. The probability limit of ln T1 Uˆ p˜′ ,˜r Uˆ p˜ ,˜r for these models is the same
− ln(1 − λ2K −r0 +1 (p0 )) > 0. (ii) Consider p = p0 − 1 and r = r0 :
9 Some textbooks define positive definiteness and associated inequalities concerning ordered eigenvalues for symmetric matrices only. Note that since the eigenvalues of any square matrix A is the same as the eigenvalues of GAG−1 for any invertible matrix G with the same dimensions as A (see Magnus and Neudecker,
(Z ′ Z )−1 Z ′ Wp0 (Wp′ 0 Wp0 )−1 Wp′ 0 Z
1988, Chapter 1) one can choose G = (Z ′ Z ) 2 and make all matrices on both sides of the inequality symmetric without changing any of their eigenvalues. Indeed this is a useful transformation for calculating canonical correlations because computer procedures for computation of eigenvalues of symmetric matrices are more accurate than those for general matrices.
= (Z ′ Z )−1 Z ′ Wp0 −1 (Wp′ 0 −1 Wp0 −1 )−1 Wp′ 0 −1 Z + (Z ′ Z )−1 Z ′ Qp0 −1 Z−p0 (Z−′ p0 Qp0 −1 Z−p0 )−1 Z−′ p0 Qp0 −1 Z .
1
126
G. Athanasopoulos et al. / Journal of Econometrics 164 (2011) 116–129
Table 1 Performance of IC(p, r , q) in a (1, 1, 2) design and its comparison with the usual application of the Johansen method. T = 100
T = 200
T = 400
q < q0
q = q0
q > q0
Total
q < q0
q = q0
q > q0
Total
q < q0
q = q0
q > q0
Total
AIC r < r0 r = r0 r > r0 Total AIC + J
0, 0, 0 0, 0, 0 0, 0, 0 0, 0, 0 1, 9, 1
2, 0, 0 0, 22, 9 0, 5, 3 2, 27, 12 10, 41, 3
4, 0, 0 0, 31, 13 0, 7, 4 4, 38, 17 6, 26, 3
6, 0, 0 0, 54, 23 0, 11, 6 6, 65, 29 17, 76, 7
0, 0, 0 0, 0, 0 0, 0, 0 0, 0, 0 0, 2, 0
1, 0, 0 0, 26, 7 0, 6, 2 1, 32, 9 4, 53, 3
1, 0, 0 0, 37, 10 0, 7, 3 1, 44, 13 2, 34, 2
2, 0, 0 0, 63, 17 0, 13, 5 2, 76, 22 6, 89, 5
0, 0, 0 0, 0, 0 0, 0, 0 0, 0, 0 0, 0, 0
0, 0, 0 0, 29, 6 0, 5, 2 0, 34, 8 1, 55, 3
0, 0, 0 0, 38, 10 0, 8, 2 0, 46, 12 1, 38, 2
1, 0, 0 0, 67, 15 0, 13, 4 1, 80, 19 2, 94, 4
HQ r < r0 r = r0 r > r0 Total HQ + J
0, 0, 0 0, 3, 0 0, 0, 0 0, 3, 0 2, 8, 0
10, 0, 0 0, 39, 3 0, 2, 0 10, 41, 4 20, 34, 0
8, 0, 0 0, 30, 3 0, 1, 0 8, 31, 3 14, 22, 0
19, 0, 0 0, 72, 6 0, 3, 0 19, 75, 6 37, 63, 0
0, 0, 0 0, 1, 0 0, 0, 0 0, 1, 0 0, 2, 0
5, 0, 0 0, 52, 2 0, 2, 0 5, 54, 2 11, 48, 0
3, 0, 0 0, 33, 1 0, 1, 0 3, 34, 1 8, 31, 0
8, 0, 0 0, 86, 3 0, 2, 0 8, 89, 3 19, 81, 0
0, 0, 0 0, 0, 0 0, 0, 0 0, 0, 0 0, 0, 0
2, 0, 0 0, 62, 1 0, 1, 0 2, 63, 1 4, 55, 0
1, 0, 0 0, 31, 1 0, 1, 0 1, 32, 1 3, 38, 0
3, 0, 0 0, 94, 2 0, 2, 0 3, 95, 2 7, 93, 0
SC r < r0 r = r0 r > r0 Total SC + J
3, 0, 0 0, 8, 0 0, 0, 0 3, 8, 0 4, 5, 0
24, 0, 0 0, 42, 1 0, 0, 0 24, 42, 1 31, 23, 0
9, 0, 0 0, 13, 0 0, 0, 0 9, 13, 0 24, 14, 0
36, 0, 0 0, 63, 1 0, 0, 0 36, 63, 1 58, 42, 0
0, 0, 0 0, 4, 0 0, 0, 0 0, 4, 0 0, 2, 0
15, 0, 0 0, 63, 0 0, 0, 0 15, 63, 0 22, 36, 0
4, 0, 0 0, 14, 0 0, 0, 0 4, 14, 0 16, 23, 0
19, 0, 0 0, 80, 0 0, 0, 0 19, 81, 0 38, 62, 0
0, 0, 0 0, 1, 0 0, 0, 0 0, 1, 0 0, 0, 0
7, 0, 0 0, 79, 0 0, 0, 0 7, 79, 0 11, 47, 0
1, 0, 0 0, 12, 0 0, 0, 0 1, 12, 0 9, 33, 0
8, 0, 0 0, 92, 0 0, 0, 0 8, 92, 0 20, 80, 0
PIC r < r0 r = r0 r > r0 Total
7, 0, 0 0, 14, 0 0, 0, 0 7, 14, 0
24, 0, 0 0, 52, 0 0, 0, 0 24, 52, 0
1, 0, 0 0, 2, 0 0, 0, 0 1, 2, 0
32, 0, 0 0, 68, 0 0, 0, 0 32, 68, 0
1, 0, 0 0, 6, 0 0, 0, 0 1, 6, 0
14, 0, 0 0, 77, 0 0, 0, 0 14, 77, 0
0, 0, 0 0, 2, 0 0, 0, 0 0, 2, 0
15, 0, 0 0, 85, 0 0, 0, 0 15, 85, 0
0, 0, 0 0, 2, 0 0, 0, 0 0, 2, 0
5, 0, 0 0, 92, 0 0, 0, 0 5, 92, 0
0, 0, 0 0, 1, 0 0, 0, 0 0, 1, 0
5, 0, 0 0, 95, 0 0, 0, 0 5, 95, 0
HQ-PIC r < r0 r = r0 r > r0 Total
4, 0, 0 0, 15, 1 0, 1, 0 4, 15, 1
14, 0, 0 0, 54, 5 0, 3, 0 14, 57, 5
1, 0, 0 0, 2, 0 0, 0, 0 1, 2, 0
19, 0, 0 0, 72, 6 0, 3, 0 19, 75, 6
0, 0, 0 0, 6, 0 0, 0, 0 0, 6, 0
8, 0, 0 0, 79, 3 0, 2, 0 8, 81, 3
0, 0, 0 0, 2, 0 0, 0, 0 0, 2, 0
8, 0, 0 0, 86, 3 0, 2, 0 8, 89, 3
0, 0, 0 0, 2, 0 0, 0, 0 0, 2, 0
3, 0, 0 0, 90, 2 0, 2, 0 3, 92, 2
0, 0, 0 0, 1, 0 0, 0, 0 0, 1, 0
3, 0, 0 0, 93, 2 0, 2, 0 3, 95, 2
Note: The total of the three entries a, b, c in each cell show the percentage of times the selected model falls in the category identified by the column and row labels. Entry a shows the percentage where p < p0 , b shows the percentage where p = p0 and c the percentage where p > p0 . The row labelled X + J shows this information for the method commonly used in practice, where the lag-length p is chosen by model selection criterion X, and then the Johansen procedure is used for determining q.
Table 2 Performance of IC(p, r , q) in a (2, 1, 1) design and its comparison with the usual application of the Johansen method. T = 100
T = 200
T = 400
q < q0
q = q0
q > q0
Total
q < q0
q = q0
q > q0
Total
q < q0
q = q0
q > q0
Total
AIC r < r0 r = r0 r > r0 Total AIC + J
1, 0, 0 0, 0, 1 1, 1, 3 2, 1, 4 2, 8, 1
1, 0, 0 1, 11, 4 1, 3, 2 3, 14, 6 23, 43, 4
1, 0, 0 4, 34, 14 2, 9, 6 7, 43, 20 6, 11, 2
3, 0, 0 6, 45, 19 3, 13, 11 12, 58, 30 32, 62, 6
0, 0, 0 0, 0, 1 0, 0, 3 0, 0, 4 0, 1, 0
1, 0, 0 1, 14, 4 1, 3, 3 3, 17, 7 11, 68, 4
1, 0, 0 2, 41, 11 1, 8, 5 4, 49, 16 4, 2, 13
2, 0, 0 3, 57, 15 2, 11, 10 7, 68, 25 13, 82, 5
0, 0, 0 0, 0, 1 0, 1, 2 0, 1, 3 0, 0, 0
0, 0, 0 0, 16, 3 1, 3, 3 1, 19, 6 4, 74, 4
1, 0, 0 1, 44, 10 1, 8, 5 3, 52, 15 1, 15, 1
1, 0, 0 1, 60, 14 2, 11, 11 4, 71, 25 4, 91, 5
HQ r < r0 r = r0 r > r0 Total HQ + J
0, 0, 0 0, 1, 1 0, 0, 1 0, 1, 2 3, 5, 0
3, 0, 0 8, 37, 4 1, 1, 1 12, 39, 5 47, 25, 0
3, 0, 0 6, 25, 3 1, 2, 1 10, 27, 4 14, 6, 0
6, 0, 0 14, 64, 8 3, 4, 3 23, 67, 10 64, 36, 0
0, 0, 0 0, 0, 0 0, 0, 1 0, 1, 1 0, 1, 0
2, 0, 0 3, 56, 3 0, 1, 1 6, 57, 3 32, 51, 0
1, 0, 0 2, 25, 2 1, 1, 1 4, 27, 3 7, 9, 0
3, 0, 0 6, 79, 5 1, 3, 3 10, 82, 8 39, 61, 0
0, 0, 0 0, 0, 0 0, 0, 1 0, 0, 1 0, 0, 0
1, 0, 0 2, 65, 2 0, 1, 1 3, 66, 3 12, 71, 0
0, 0, 0 1, 21, 2 0, 1, 2 1, 22, 4 3, 15, 0
1, 0, 0 2, 85, 4 1, 3, 4 4, 88, 8 15, 85, 0
SC r < r0 r = r0 r > r0 Total SC + J
2, 0, 0 2, 8, 0 0, 0, 0 4, 8, 0 5, 2, 0
10, 0, 0 21, 42, 1 0, 0, 0 32, 42, 1 62, 7, 0
3, 0, 0 3, 6, 0 0, 0, 0 7, 6, 0 22, 2, 0
15, 0, 0 26, 55, 2 1, 0, 0 42, 56, 2 89, 11, 0
0, 0, 0 0, 3, 0 0, 0, 0 0, 3, 0 0, 0, 0
7, 0, 0 12, 71, 1 0, 0, 0 19, 71, 1 55, 26, 0
1, 0, 0 1, 4, 0 0, 0, 0 2, 4, 0 14, 5, 0
8, 0, 0 13, 77, 1 1, 0, 0 22, 77, 1 69, 31, 0
0, 0, 0 0, 0, 0 0, 0, 0 0, 0, 0 0, 0, 0
3, 0, 0 8, 86, 0 0, 0, 0 11, 86, 1 34, 48, 0
0, 0, 0 0, 2, 1 0, 0, 0 0, 2, 1 8, 10, 0
4, 0, 0 4, 90, 1 0, 1, 0 8, 90, 1 41, 59, 0
PIC r < r0 r = r0 r > r0 Total
4, 0, 0 4, 3, 0 0, 0, 0 8, 3, 0
11, 0, 0 37, 28, 0 0, 0, 0 48, 28, 0
1, 0, 0 1, 1, 0 0, 0, 0 2, 1, 0
16, 0, 0 42, 41, 0 1, 0, 0 59, 41, 0
0, 0, 0 1, 4, 0 0, 0, 0 1, 4, 0
7, 0, 0 25, 62, 0 0, 0, 0 32, 62, 0
0, 0, 0 0, 0, 0 0, 0, 0 1, 0, 0
7, 0, 0 26, 66, 0 0, 0, 0 34, 66, 0
0, 0, 0 0, 0, 0 0, 0, 0 0, 0, 0
3, 0, 0 10, 87, 0 0, 0, 0 13, 87, 0
0, 0, 0 0, 0, 0 0, 0, 0 0, 0, 0
3, 0, 0 9, 88, 0 0, 0, 0 12, 88, 0
HQ-PIC r < r0 r = r0 r > r0 Total
1, 0, 0 2, 13, 3 1, 2, 2 4, 15, 5
5, 0, 0 12, 49, 4 2, 2, 1 19, 51, 5
0, 0, 0 0, 1, 0 0, 0, 0 1, 1, 0
6, 0, 0 14, 63, 7 2, 4, 3 23, 67, 10
0, 0, 0 0, 4, 1 0, 0, 1 1, 4, 2
3, 0, 0 5, 77, 4 1, 2, 2 9, 79, 6
0, 0, 0 0, 0, 0 0, 0, 0 0, 0, 0
3, 0, 0 6, 79, 5 1, 3, 3 10, 82, 8
0, 0, 0 0, 0, 0 0, 0, 0 0, 0, 0
1, 0, 0 2, 86, 4 1, 2, 4 4, 88, 8
0, 0, 0 0, 0, 0 0, 0, 0 0, 0, 0
1, 0, 0 2, 86, 4 1, 2, 4 4, 88, 8
Note: The total of the three entries a, b, c in each cell show the percentage of times the selected model falls in the category identified by the column and row labels. Entry a shows the percentage where p < p0 , b shows the percentage where p = p0 and c the percentage where p > p0 . The row labelled X + J shows this information for the method commonly used in practice, where the lag-length p is chosen by model selection criterion X, and then the Johansen procedure is used for determining q.
G. Athanasopoulos et al. / Journal of Econometrics 164 (2011) 116–129
127
Table 3 Percentage improvement in forecast accuracy measures for possibly reduced rank models over unrestricted VARs in a (1, 1, 2) setting. Horizon (h)
T = 100
T = 200
|MSFE|
TMSFE
GFESM
T = 400
TMSFE
|MSFE|
1.4
4.0
4.0
0.7
2.4
10.2
0.1
0.1
8.0
0.4
0.9
7.8
0.4
1.0
3.7
0.8
2.3
2.3
0.2
0.8
5.5
0.0
−0.2
4.2
0.2
0.5
4.1
0.3
0.7
1.5
GFESM
TMSFE
|MSFE|
0.9
2.7
2.7
0.3
1.1
6.3
0.1
0.5
6.8
0.1
0.2
6.6
0.1
0.2
7.2
GFESM
VECM(HQ-PIC) for all DGPs 1.4
1
(44,46,10)a
0.7
4
(23,77,0)
8
3.7
0.7
1.8
−7.2
0.5
−19.4
0.2
0.6
−31.3
0.9
2.3
2.3
0.4
0.6
2.0
0.5
1.4
−5.5
0.1
0.4
−12.5
0.1
0.4
−20.4
(3,93,4)
16
3.8
1.6
0.2
(19,80,1)
12
3.8
(5,94,1)
(49,49,2) (46,54,0) (5,91,4) (14,86,0) (18,82,0)
(53,47,0) (27,73,0) (4,96,0) (4,96,0) (4,95,1)
VECM(AIC + J) for all DGPs 1
(28,63,9)
4
(14,86,0)
8
(21,78,1)
12
(5,92,3)
16
(5,92,3)
(30,67,3) (13,86,1) (2,91,7) (12,88,0) (18,82,0)
0.4
1.0
1.0
0.1
0.4
2.2
0.1
0.2
1.9
0.0
−0.1
1.4
0.0
0.0
1.8
(27,71,2) (8,92,0) (2,98,0) (0,98,2) (3,97,0)
VECM(HQ-PIC) are models selected by the model selection process proposed in Section 4.1 and estimated by the algorithm proposed in Section 3. VECM(AIC + J) are estimated by the usual Johansen procedure with AIC as the model selection criterion for the lag length. a We perform Diebold and Mariano (1995) tests at the 5% level of significance for equal predictive accuracy between the reduced rank models and unrestricted VARs. For cell (x,y,z), y denotes the percentage of DGPs for which the null of equal forecast accuracy is not rejected and entries x and z denote the percentage of DGPs for which the null is rejected with a positive statistic (i.e., the reduced rank model is significantly more accurate than the unrestricted VAR) and a negative statistic (i.e., the reduced rank model is significantly less accurate than the unrestricted VAR) respectively.
that T ln T1 Uˆ p′ 0 ,r0 Uˆ p0 ,r0 − ln T1 Uˆ p˜′ ,˜r Uˆ p˜ ,˜r is the likelihood ratio
as the probability limit of ln T1 Uˆ p′ 0 ,r0 Uˆ p0 ,r0 . However, we know
produces a consistent estimator of q0 , and this completes the proof.
statistic of testing general linear restrictions that reduce the p˜ ,r˜ model to the p 0 , r0 model. Since these restrictions are true,
Remark 3. The above proof is not exclusive to HQ and applies c to any model selection criterion in which cT → ∞ and TT → 0 as T → ∞, where cT is the penalty for each additional parameter in the first stage of the procedure. The consistency of model selection criteria with this property for determining p in vector autoregressions has been established in Quinn (1980), and in autoregressions with unit roots in Paulsen (1984) and Tsay (1984). Consistency of such criteria for selection of cointegrating rank q and the lag order p has been established in Gonzalo and Pitarakis (1995) and Aznar and Salvador (2002). Consistency of PIC for selection of cointegrating rank q and the lag order p has been established in Chao and Phillips (1999). The contribution here is proving the consistency when r is added to the set of parameters to be estimated, and showing that this can be achieved with a twostep procedure.
T ln T1 Uˆ p′ 0 ,r0 Uˆ p0 ,r0 − ln T1 Uˆ p˜′ ,˜r Uˆ p˜ ,˜r
= Op (1). While the reward
for better fit from larger models is bounded in probability, the penalty terms for extra parameters increases without bound. Hence, the probability of choosing a larger model that nests the true model goes to zero asymptotically. This completes the proof that the first step of the procedure consistently estimates p0 and r0 . For the consistency of the second step estimator of q0 , we note that Chao and Phillips (1999) show that the PIC can be written as the sum (Chao and Phillips, 1999, express PIC as product of the likelihood and penalty term, here we refer to the logarithmic transformation of the PIC expressed in their paper) of two parts, one that comprises the log-likelihood of q given p and its associated penalty, and the other that comprises the log-likelihood of p without any restrictions on q and a penalty term involving the laglength. With similar steps one can write the PIC in our case as the sum of one part related to q given p and r and another that involves p and r. Hence, plugging in p and r that are estimated via another consistent procedure does not alter the consistency of the estimator of q. The main reason that the choice of p and r does not affect the consistency of q is that the smallest K − q0 sample squared canonical correlations between 1yt and yt −1 converge to zero in probability and the remaining q0 converge to positive limits, regardless of any finite stationary elements that are partialled out. Therefore, for a given (p, r ) when q < q0 , T times the difference in log-likelihood values dominates the penalty term, and hence the probability of underpredicting q goes to zero and T → ∞. Also, when q > q0 , T times the difference in log-likelihood values remains bounded in probability, but the magnitude of the penalty for lack of parsimony grows without bound as T → ∞, therefore the probability of overestimating q goes to zero asymptotically also. Note that the fact that the asymptotic distribution of the likelihood ratio statistic is not χ 2 or that it may depend on nuisance parameters does not matter. What is important is that it is Op (1). Hence the second step
Remark 4. As with all models selected with any consistent model selection criterion, the warning of Leeb and Potscher (2005) applies to models selected with our procedure as well in the sense that there is no guarantee that any inference made based on asymptotic distributions conditional on p, q, r selected by this procedure will necessarily be more accurate than that based on an unrestricted autoregression of order pmax . Remark 5. Let α˜ 1 be a full rank K × (K − r0 ) matrix such that α˜ 1′ [Γ1 Γ2 · · · Γp0 ] = 0. Such a matrix exists because rank [Γ1 Γ2 · · · Γp0 ] = r0 but it is not unique. We can augment α˜ 1 with r0 additional linearly independent vectors arranged as columns of matrix α˜ 2 to form a basis for Rn , and to achieve uniqueness we can choose these matrices such that
. . (α˜ 1 ..α˜ 2 )′ Ω (α˜ 1 ..α˜ 2 ) = IK . The DGP can be alternatively written as α˜ 1′ 1yt = c1 + Π(1) yt −1 + η(1),t α˜ 2′ 1yt = c2 + Π(2) yt −1 + Γ(2),1 1yt −1 + Γ(2),2 1yt −2 + · · · + Γ(2),p0 1yt −p0 + η(2),t
where for any vector or matrix X , X(i) = α˜ i′ X , i = 1, 2. While we have presented the model selection criteria as penalised
128
G. Athanasopoulos et al. / Journal of Econometrics 164 (2011) 116–129
Table 4 Percentage improvement in forecast accuracy measures for possibly reduced rank models over unrestricted VARs in a (2, 1, 1) setting. Horizon (h)
T = 100 TMSFE
T = 200
|MSFE|
GFESM
TMSFE
T = 400
|MSFE|
GFESM
TMSFE
|MSFE|
GFESM
VECM(HQ-PIC) for all DGPs 1 4 8 12 16
7.8
21.8
21.8
2. 2
8.1
37.8
1.0
2.7
38.5
0.4
0.8
29.8
0.8
1.8
25.5
(87,13,0)a (69,31,0) (24,76,0) (12,87,1) (16,84,0)
4.5
12.9
12.9
2.0
5.2
30.6
0.6
2.3
34.1
0.8
2.4
36.8
0.3
0.3
32.8
(90,10,0) (78,22,0) (22,78,0) (27,73,0) (16,59,25)
2.5
7.5
7.5
0.9
2.3
17.5
0.6
2.2
25.7
0.9
2.9
29.5
0.7
2.4
32.7
1.4
4.1
4.1
0.6
1.8
10.7
0.4
1.7
16.8
0.7
2.4
19.2
0.6
2.2
22.0
(95,5,0) (47,53,0) (32,68,0) (82,18,0) (39,61,0)
VECM(AIC + J) for all DGPs 1 4 8 12 16
5.4
14.1
14.1
1. 3
4.8
21.6
0.7
1.9
21.5
0.5
0.9
14.5
0.6
1.4
11.0
(81,19,0) (29,71,0) (15,85,0) (11,89,0) (13,87,0)
3.2
8.7
8.7
1.2
3.0
21.3
0.6
2.3
26.1
0.6
1.9
29.6
0.2
0.3
27.4
(81,19,0) (61,39,0) (23,77,0) (19,81,0) (16,84,0)
(72,28,0) (35,65,0) (14,86,0) (65,35,0) (38,62,0)
VECM(HQ-PIC) are models selected by the model selection process proposed in Section 4.1 and estimated by the algorithm proposed in Section 3. VECM(AIC + J) are estimated by the usual Johansen procedure with AIC as the model selection criterion for the lag length. a Refer to note in Table 3. Table 5 Percentage improvement in forecast accuracy measures for reduced ranked models and unrestricted VARs for Brazilian inflation. Horizon (h)
VECM(AIC + J) versus VAR in levels
VECM(HQ-PIC) versus VAR in levels TMSFE
|MSFE|
1
36.9
69.6
69.6
4
32.4
45.2
91.0
8
24.6
32.9
107.9
12
33.6
38.4
120.3
16
36.4 (–, ∗, ∗)
40.2
142.7
(∗∗, ∗∗, ∗∗) (∗∗, ∗∗, ∗∗) (∗, ∗∗, ∗∗) (∗, ∗∗, ∗∗)
GFESM
TMSFE
|MSFE|
18.1 (–, ∗, –) 11.8 (–, ∗∗, –) 15.1 (–, ∗∗, ∗∗) 25.6
22.8
29.0
(∗, ∗∗, ∗∗)
(∗∗, ∗∗, ∗)
VECM(HQ-PIC) versus VECM(AIC + J) TMSFE
|MSFE|
22.8
18.8
46.8
46.8
28.4
−10.6
20.6
16.8
101.6
26.2
−11.8
9.5
6.7
119.6
29.7
1.0
8.7
119.3
34.5
39.2
8.0 (–, –, –) 7.4 (–, –, –)
5.7
103.5
GFESM
(∗, ∗, ∗∗) (∗∗, ∗∗, ∗∗)
(∗∗, ∗, ∗∗)
GFESM
VECM(HQ-PIC) is the model selected by the model selection process proposed in Section 4.1 and estimated by the algorithm proposed in Section 3. VECM(AIC + J) is the model estimated by the usual Johansen procedure with AIC as the model selection criterion for the lag length. See Section 7 for further details. The triplet (·, ·, ·) presents the results of tests for equal mean squared forecast errors predicting 1 ln (CPI-IBGEt ) , 1 ln (CPI-FGVt ), and 1 ln (CPI-FIPEt ) respectively. The symbols **, * and – denote, respectively, significance at the 5% level, at the 10% level, and not significant at the 10% level. Table 6 Percentage improvement in forecast accuracy measures for reduced ranked models and unrestricted VARs for US macroeconomic aggregates. Horizon (h)
VECM(AIC + J) versus VAR in levels
VECM(HQ-PIC) versus VAR in levels TMSFE
|MSFE|
1
35.1
60.4
4
56.3
8
VECM(HQ-PIC) versus VECM(AIC + J)
TMSFE
|MSFE|
60.4
16.2
49.7
49.7
83.5
134.7
27.8
46.4
112.1
8.4
25.3
169.2
8.9
24.0
145.2
12
1.5
20.0
176.3
2.6
21.8
172.1
16
3.6
26.0
147.3
4.5
27.1
160.1
(∗∗, ∗∗, ∗∗)
(∗∗, ∗∗, ∗∗) (∗∗, ∗∗, –) (∗, ∗∗, –)
(∗∗, ∗∗, –)
GFESM
(∗∗, ∗∗, ∗) (∗∗, ∗∗, ∗)
(∗∗, ∗∗, –) (∗, ∗∗, –)
(∗∗, ∗∗, –)
GFESM
TMSFE 18.9 (–, ∗, ∗∗) 28.5
(∗∗, ∗∗, ∗∗) −0.5 (–, –, –) −1.1 (–, –, –) −0.9 (–, –, –)
|MSFE|
GFESM
10.7
10.7
37.1
22.6
1.3
24.0
−1.8
4.2
−1.1
−12.8
VECM(HQ-PIC) is the model selected by the model selection process proposed in Section 4.1 and estimated by the algorithm proposed in Section 3. VECM(AIC + J) is the model estimated by the usual Johansen procedure with AIC as the model selection criterion for the lag length. See Section 7 for further details. The triplet (·, ·, ·) presents the results of tests for equal mean squared forecast errors predicting 1 ln (yt ) , 1 ln (ct ), and 1 ln (it ) respectively. The symbols **, * and – denote, respectively, significance at the 5% level, at the 10% level, and not significant at the 10% level.
log-likelihoods and have referred to maximum likelihood estimators and likelihood ratio tests in our proof to conform with the previous literature, all arguments could be phrased in the context of GMM estimation of the above structural model and test statistics for testing overidentifying restrictions in the first block of this structure (Anderson and Vahid, 1998). Therefore, there is no need for any assumption of normality at any stage.
Appendix C. Tables See Tables 1–6. References Ahn, S.K., Reinsel, G.C., 1988. Nested reduced-rank autoregressive models for multiple time series. Journal of the American Statistical Association 83, 849–856.
G. Athanasopoulos et al. / Journal of Econometrics 164 (2011) 116–129 Anderson, T.W., 1951. Estimating linear restrictions on regression coefficients for multivariate normal distributions. Annals of Mathematical Statistics 22, 327–351. Anderson, H.M., Vahid, F., 1998. Testing multiple equation systems for common nonlinear components. Journal of Econometrics 84, 1–36. Athanasopoulos, G., Vahid, F., 2008. VARMA versus VAR for macroeconomic forecasting. Journal of Business and Economic Statistics 26, 237–252. Aznar, A., Salvador, M., 2002. Selecting the rank of the cointegation space and the form of the intercept using an information criterion. Econometric Theory 18, 926–947. Centoni, M., Cubbada, G., Hecq, A., 2007. Common shocks, common dynamics and the international business cycle. Economic Modelling 24, 149–166. Chao, J.C., Phillips, P.C.B., 1999. Model selection in partially nonstationary vector autoregressive processes with reduced rank structure. Journal of Econometrics 91, 227–271. Christoffersen, P.F., Diebold, F.X., 1998. Cointegration and long-horizon forecasting. Journal of Business and Economic Statistics 16, 450–458. Clements, M.P., Hendry, D.F., 1993. On the limitations of comparing mean squared forecast errors (with discussion). Journal of Forecasting 12, 617–637. Clements, M.P., Hendry, D.F., 1995. Forecasting in cointegrated systems. Journal of Applied Econometrics 10, 127–146. Diebold, F.X., Mariano, R.S., 1995. Comparing predictive accuracy. Journal of Business and Economic Statistics 13, 253–263. Elliott, G., 2006. Forecasting with trending data. In: Elliott, G., Granger, C., Timmermann, A. (Eds.), Handbook of Economic Forecasting, vol. 1. Elsevier, pp. 555–604 (Chapter 11). Engle, R.F., Yoo, S., 1987. Forecasting and testing in cointegrated systems. Journal of Econometrics 35, 143–159. Gonzalo, J., Pitarakis, J., 1995. Specification via model selection in vector error correction models. Economics Letters 60, 321–328. Gonzalo, J., Pitarakis, J., 1999. Dimensionality effect in cointegration tests. In: Engle, R., White, H. (Eds.), Cointegration, Causality and Forecasting: A Festschrift in Honour of Clive W.J. Granger. Oxford University Press, New York, pp. 212–229 (Chapter 9). Gourieroux, C., Peaucelle, I., 1992. Series codependantes application a l’hypothese de parite du pouvoir d’achat. Revue d’Analyse Economique 68, 283–304. Hecq, A., Palm, F., Urbain, J.-P., 2006. Common cyclical features analysis in VAR models with cointegration. Journal of Econometrics 132, 117–141. Hoffman, D.L., Rasche, R.H., 1996. Assessing forecast performancee in a cointegrated system. Journal of Applied Econometrics 11, 495–517. Johansen, S., 1988. Statistical analysis of cointegrating vectors. Journal of Economic Dynamics and Control 12, 231–254. Johansen, S., 1991. Estimation and hypothesis testing of cointegration vectors in Gaussian vector autoregressive models. Econometrica 59, 1551–1580. Leeb, H., Potscher, B.M., 2005. Model selection and inference: facts and fiction. Econometric Theory 21, 21–59.
129
Lin, J.L., Tsay, R.S., 1996. Cointegration constraints and forecasting: an empirical examination. Journal of Applied Econometrics 11, 519–538. Lütkepohl, H., 1985. Comparison of criteria for estimating the order of a vector autoregressive process. Journal of Time Series Analysis 9, 35–52. Lütkepohl, H., 1993. Introduction to Multiple Time Series Analysis, 2nd ed. SpringerVerlag, Berlin, Heidelberg. Magnus, J.R., Neudecker, H., 1988. Matrix Differential Calculus with Applications in Statistics and Econometrics. John Wiley and Sons, New York. Paulsen, J., 1984. Order determination of multivariate autoregressive time series with unit roots. Journal of Time Series Analysis 5, 115–127. Phillips, P.C.B., 1996. Econometric model determination. Econometrica 64, 763–812. Phillips, P.C.B., Hansen, B., 1990. Statistical inference in instrumental variables regression with I (1) processes. Review of Economic Studies 57, 99–125. Phillips, P.C.B., Loretan, M., 1991. Estimating long-run economic equilibria. Review of Economic Studies 58, 407–436. Phillips, P.C.B., Ploberger, W., 1996. An asymptotic theory of Bayesian inference for time series. Econometrica 64, 381–413. Ploberger, W., Phillips, P.C.B., 2003. Empirical limits for time series econometric models. Econometrica 71, 627–673. Poskitt, D.S., 1987. Precision, complexity and Bayesian model determination. Journal of the Royal Statistical Society. Series B 49, 199–208. Quinn, B.G., 1980. Order determination for a multivariate autoregression. Journal of the Royal Statistical Society. Series B 42, 182–185. Reinsel, G.C., 1997. Elements of Multivariate Time Series, 2nd ed. Springer-Verlag, New York. Rissanen, J., 1987. Stochastic complexity. Journal of the Royal Statistical Society. Series B 49, 223–239. Saikkonen, P., 1992. Estimation and testing of cointegrated systems by an autoregressive approximation. Econometric Theory 8, 1–27. Silverstovs, B., Engsted, T., Haldrup, N., 2004. Long-run forecasting in multicointegrated systems. Journal of Forecasting 23, 315–335. Sims, C.A., Stock, J.H., Watson, M.W., 1990. Inference in linear time series models with some unit roots. Econometrica 58, 113–144. Stock, J.H., 1996. VAR, error correction and pretest forecasts at long horizons. Oxford Bulletin of Economics and Statistics 58, 685–701. Tsay, R.S., 1984. Order selection in nonstationary autoregressive models. Annals of Statistics 12, 1425–1433. Vahid, F., Engle, R.F., 1993. Common trends and common cycles. Journal of Applied Econometrics 8, 341–360. Vahid, F., Issler, J.V., 2002. The importance of common cyclical features in VAR analysis: a Monte-Carlo study. Journal of Econometrics 109, 341–363. Velu, R.P., Reinsel, G.C., Wickern, D.W., 1986. Reduced rank models for multiple time series. Biometrika 73, 105–118. Wallace, C., 2005. Statistical and Inductive Inference by Minimum Message Length. Springer, Berlin. Wallace, C., Freeman, P., 1987. Estimation and inference by compact coding. Journal of the Royal Statistical Society. Series B 49, 240–265.
Journal of Econometrics 164 (2011) 130–141
Contents lists available at ScienceDirect
Journal of Econometrics journal homepage: www.elsevier.com/locate/jeconom
Optimal prediction pools John Geweke a,b,∗ , Gianni Amisano c,d a
University of Technology Sydney, Australia
b
University of Colorado, United States
c
European Central Bank, Germany
d
Universitá di Brescia, Italy
article
info
Article history: Available online 3 March 2011 JEL classification: C11 C53 Keywords: Forecasting Log scoring Model combination S&P 500 returns
abstract We consider the properties of weighted linear combinations of prediction models, or linear pools, evaluated using the log predictive scoring rule. Although exactly one model has limiting posterior probability, an optimal linear combination typically includes several models with positive weights. We derive several interesting results: for example, a model with positive weight in a pool may have zero weight if some other models are deleted from that pool. The results are illustrated using S&P 500 returns with six prediction models. In this example models that are clearly inferior by the usual scoring criteria have positive weights in optimal linear pools. © 2011 Elsevier B.V. All rights reserved.
1. Introduction and motivation The formal solutions of most decision problems in economics, in the private and public sectors as well as academic contexts, require probability distributions for magnitudes that are as yet unknown. Point forecasts are rarely sufficient. For econometric investigators whose work may be used by clients in different situations the mandate to produce predictive distributions is compelling. Increasing awareness of this context, combined with advances in modeling and computing, is leading to a sustained emphasis on these distributions in econometric research (Diebold et al., 1998; Christoffersen, 1998; Corradi and Swanson, 2006a,b; Gneiting et al., 2007). In many situations there are several models with predictive distributions available, leading naturally to questions of model choice or combination. While there is a large econometric literature on choice or combination of point forecasts, dating at least to Bates and Granger (1969) and extending through many more contributions reviewed recently by Timmermann (2006), the treatment of predictive density combination in the econometrics literature is much more limited. Granger et al. (1989) and Clements (2006) attacked the related problems of event and quantile forecast combination, respectively. Wallis (2005) was perhaps the first econometrician to take up combinations of predictive densities
explicitly. Hall and Mitchell (2007) is the closest precursor of the approach taken here. We consider the situation in which alternative models provide predictive distributions for a vector time series yt given its history Yt −1 = {yh , . . . , yt −1 }; h is a starting date for the time series, h ≤ 1. A prediction model A (for ‘‘assumptions’’) is a construction that produces a probability density for yt with respect to an appropriate measure ν from the history Yt −1 denoted p(yt ; Yt −1 , A). There are many kinds of prediction models. Leading examples begin with parametric conditional densities p(yt | Yt −1 , θA , A). Then, in a formal Bayesian approach p(yt ; Yt −1 , A) = p(yt | Yt −1 , A)
∫ =
p(yt | Yt −1 , θA , A)p(θA | Yt −1 , A)dθA ,
where p(θA | Yt −1 , A) is the posterior density p(θA | Yt −1 , A) ∝ p(θA | A)
t −1 ∏
p(ys | Ys−1 , θA , A)
Corresponding address: Centre for the Study of Choice, University of Technology Sydney, Ultimo NSW 2007, Australia. Tel.: +61 2 9514 9797; fax: +61 2 9514 7722. E-mail address:
[email protected] (J. Geweke). 0304-4076/$ – see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.jeconom.2011.02.017
(2)
s=1
and p(θA | A) is the prior density for θA . A non-Bayesian approach t −1
might construct the parameter estimates θA then t −1
∗
(1)
p(yt ; Yt −1 , A) = p(yt | Yt −1 , θA
, A).
= ft −1 (Yt −1 ) and (3)
The specific construction of p(yt ; Yt −1 , A) does not concern us: in the extreme, it could be entirely judgmental. What is critical is
J. Geweke, G. Amisano / Journal of Econometrics 164 (2011) 130–141
that it rely only on information available at time t − 1 and that it provide a mathematically complete predictive density for yt . The primitives are these predictive densities and the realizations of the time series yt , which we denote yot (o for ‘‘observed’’) in situations where the distinction between the random vector and its realization is important. This set of primitives is the one typically used in the few studies that have addressed these questions (e.g. Diebold et al., 1998, p. 879). As Gneiting et al. (2007, p. 244) notes, the assessment of a predictive distribution on the basis of p(yt ; Yt −1 , A) and yot only is consistent with the prequential principle of Dawid (1984). 1.1. Log scoring Our assessment of models and combinations of models will rely on the log predictive score function. For a sample YT = YoT the log predictive score function of a single prediction model A is LS (YoT , A) =
T −
log p(yot ; Yot−1 , A).
(4)
131
p(yt ; Yt −1 , A) only through p(yot ; Yot−1 , A) then it is said to be local (Bernardo, 1979). The only proper local scoring rule takes the form of (4).1 This was shown by de Finetti and Savage (1963) and Shuford et al. (1966) for the case in which the support of {yt } is a finite set of at least three discrete points; for further discussion see Winker (1969, p. 1075). It was shown for the case of continuously distributed {yt } by Bernardo (1979); for further discussion see Gneiting and Raftery (2007, p. 366). This study will consider alternative prediction models A1 , . . . , An . Propriety of the scoring rule is important in this context because it guarantees that if one of these models were to coincide with the true data generating process D, then that model would attain the maximum score as T → ∞. There is a long-standing literature on scoring rules for discrete outcomes and in particular for Bernoulli random variables (DeGroot and Fienberg, 1982; Clemen et al., 1995). However, as noted in the recent review article by Gneiting et al. (2007, p. 364) and Bremmes (2004) the literature on scoring rules for probabilistic forecasts of continuous variables is sparse.
t =1
In a full Bayesian approach p(yt ; Yt −1 , A) = p(yt | Yt −1 , A) and (4) becomes LS (YoT , A) =
T −
log p(yot | Yot−1 , A)
t =1
= log p(YoT | A) = log
∫
p(YoT , θA | A)dθA .
(5)
(Geweke, 2001, 2005, Section 2.6.2). In a parametric non-Bayesian approach (3) the log predictive score is LS (YoT , A) =
T −
t −1
log p(yot | Yot−1 , θA
, A)
(6)
This study explores using the log scoring rule (4) to evaluate combinations of probability densities p(yt | Yot−1 , Aj ) (j = 1, . . . , n). There are, of course, many ways in which these densities could be combined, or aggregated; see Genest et al. (1984) for a review and axiomatic approach. McConway (1981) showed that, under mild regularity conditions, if the process of combination is to commute with any possible marginalization of the distributions involved, then the combination must be linear. Moreover, such combinations are trivial to compute, both absolutely and in comparison with alternatives. Thus we study predictive densities of the form n −
t =1
which is smaller than the full-sample log-likelihood function T
evaluated at the maximum likelihood estimate θA . For some of the analytical results in this study we assume that there is a data generating process D that gives rise to the ergodic vector time series {yt }. That is, there is a true model D, but it is not necessarily one of the models under consideration. For most D and A ED [LS (YT , A)] =
1.2. Linear pooling
∫ − T
log p(yt ; Yt −1 , A) p(YT |D)dν(YT )
(7)
t =1
a.s.
T −1 LS (YT , A) → lim T −1 ED [LS (YT , A)] = LS (A; D). T →∞
i =1
(8)
Whenever we invoke a true model D, we shall assume that (8) is true for D and any model A under consideration. The log predictive score function is a measure of the out-ofsample prediction track record of the model. Other such scoring rules are, of course, possible, mean square prediction error being perhaps the most familiar. One could imagine using a scoring rule to evaluate the predictive densities provided by a modeler. Suppose that the modeler then produced predictive densities in such a way as to maximize the expected value of the scoring rule, the expectations being taken with respect to the modeler’s subjective probability distribution. The scoring rule is said to be proper if, in such a situation, the modeler is led to report a predictive density that is coherent and consistent with his subjective probabilities. (The term ‘‘proper’’ was coined by Winkler and Murphy (1968), but the general idea dates back at least to Brier (1950) and Good (1952).) If the scoring rule depends on YoT and
n −
wi = 1;
wi ≥ 0
i=1
(i = 1, . . . , n).
(9)
The restrictions on the weights wi are necessary and sufficient to assure that (9) is a density function for all values of the weights and all arguments of any density functions. We evaluate these densities using the log predictive score function T − t =1
exists and is finite. Given the ergodicity of {yt },
wi p(yt ; Yot−1 , Ai );
log
n −
wi p( ; yot
Yot−1
, Ai ) .
(10)
i =1
Combinations of subjective probability distributions are known as opinion pools, a term due to Stone (1961), and linear combinations are known as linear opinion pools (Bacharach, 1974). We use the term prediction pools to describe the setting specific to this study. While all models are based on opinions, only formal statistical models are capable of producing the complete predictive densities that, together with the data, constitute our primitives. Choice of weights in any combinations like (9) is widely regarded as a difficult and important question. This study uses past performance of the pool to select the weights; in the language of Jacobs (1995) the past constitutes the training sample for the present. Sections 3 and 5 show that this is easy to do. This study compares linear prediction pools using the log scoring rule. An optimal prediction pool is one with weights chosen so as to maximize (10).
1 This leaves aside positive linear transformations of (3), which are equivalent under expected utility.
132
J. Geweke, G. Amisano / Journal of Econometrics 164 (2011) 130–141
Hall and Mitchell (2007) proposed combining predictive probability densities by finding the nonnegative weights wi that maximize (10). The motivation of that study is asymptotic: as T → ∞, the weights so chosen are those that minimize the Kullback–Leibler directed distance from an assumed data generating process D to the model (9). Hall and Mitchell (2007) show that direct maximization of (10) is more reliable than some other methods, involving probability integral transforms, that have been proposed in the literature. The focus of our work is complementary and more analytical, and we also provide a larger-scale implementation of optimal pooling than does Hall and Mitchell (2007). The characteristics of optimal prediction pools turn out to be strikingly different from those that are constructed by means of Bayesian model averaging (which is always possible in principle and often in practice) as well as those that result from conventional frequentist testing (which is often problematic since the models are typically non-nested). Given a data generating process D that produces ergodic {yt } a limiting optimal prediction pool exists, and unless one of the models Aj coincides with D, several of the weights in this pool typically are positive. In contrast, the posterior probability of the model Aj with the smallest Kullback–Leibler directed distance from D will tend to one and all others to zero. Any frequentist procedure based on testing will have a similar property, but with a distance measure specific to the test. The contrast is rooted in the fact that Bayesian model averaging and frequentist tests are predicated on the belief that Aj = D for some j, whereas optimal prediction pools make no such assumption. If Aj ̸= D (j = 1, . . . , n) then it is possible that one model would dominate the optimal pool, but this result seems to us unusual and this supposition is supported in the examples studied here. Our findings show that optimal pools can, and do, perform substantially better than any of their constituent models as assessed by a log predictive scoring rule. We show that there must exist a model, not included in the pool, with a log predictive score at least as good as, and in general better than, that of the optimally scored prediction pool. The paper develops the basic ideas for a pool of two models (Section 2) and then applies them to prediction model pools for daily S&P 500 returns, 1972 through 2005 (Section 3). It then turns to the general case of pools of n models and studies how changes in the composition of the pool change the optimal weights (Section 4). Section 5 constructs an optimal pool of six alternative prediction models for the S&P 500 returns. Section 6 studies the implications of optimal prediction pools for the existence of prediction models as yet undiscovered that will compare favorably with those in the pool as assessed by a log predictive scoring rule. The final section concludes. 2. Pools of two models Consider the case of two competing prediction models A1 ̸= A2 . From (8) a.s.
T −1 [LS (YT , A1 ) − LS (YT , A2 )] → LS (A1 ; D) − LS (A2 ; D).
(11)
If A1 corresponds to the data generating process D, then in general LS (A1 ; D) − LS (A2 ; D) = LS (D; D) − LS (A2 ; D) ≥ 0 and the limiting value coincides with the Kullback–Leibler distance from D to A2 . If A2 also nests A1 then LS (A1 ; D) − LS (A2 ; D) = 0, but in most cases of interest LS (A1 ; D) ̸= LS (A2 ; D) and so if A1 = D then LS (A1 ; D) − LS (A2 ; D) > 0. These special cases are interesting and informative, but in application most econometricians would agree with the dictum of Box (1980) that all models are false. Indeed the more illuminating special case might be LS (A1 ; D) − LS (A2 ; D) = 0 when neither model Aj is nested in the other: then both A1 and A2 must be false.
In general LS (A1 ; D) − LS (A2 ; D) ̸= 0. For most prediction models constructed from parametric models of the time series {yt } a closely related implication is that one of the two models will almost surely be rejected in favor of the other as T → ∞. For example in the Bayesian approach (1) the Bayes factor in favor of one model over the other will converge to zero, and in the nonBayesian construction (3) the likelihood ratio test or another test t
appropriate to the estimates θAj will reject one model in favor of the other. We mention these well-known results here only to emphasize the contrast with those in the remainder of this section. Given the two prediction models A1 and A2 , the prediction pool A = {A1 , A2 } consists of all prediction models p(yt ; Yt −1 , A) = w p(yt ; Yt −1 , A1 )
+ (1 − w)p(yt ; Yt −1 , A2 ), w ∈ [0, 1].
(12)
The predictive log score function corresponding to given w ∈ [0, 1] is fT (w) =
T −
log[w p(yot ; Yot−1 , A1 ) + (1 − w)p(yot ; Yot−1 , A2 )]. (13)
t =1
The optimal prediction pool corresponds to wT∗ = arg maxw fT (w) in (13).2 The determination of such a pool was, of course, impossible for purposes of forming the elements w p(yt ; Yot−1 , A1 ) + (1 − w)p(yt ; Yot−1 , A2 ) (t = 1, . . . , T ) because it is based on the entire sample. But it is just as clear that weights w could be determined recursively at each date t based on information through t − 1. We shall see subsequently that the required computations are practical, and in the examples in the next section there is almost no difference between the optimal pool considered here and those created recursively when the two procedures are evaluated using a log scoring rule. The first two derivatives of fT are fT′ (w) =
T −
p(yot ; Yot−1 , A1 ) − p(yot ; Yot−1 , A2 )
, (14) wp(yot ; Yot−1 , A1 ) + (1 − w)p(yot ; Yot−1 , A2 ) ]2 T [ − p(yot ; Yot−1 , A1 ) − p(yot ; Yot−1 , A2 ) ′′ fT (w) = − w p(yot ; Yot−1 , A1 ) + (1 − w)p(yot ; Yot−1 , A2 ) t =1 < 0. t =1
(15) a.s.
For all w ∈ [0, 1], T −1 fT (w) → f (w). If lim T −1
T →∞
T −
ED [p(yt ; Yt −1 , A1 ) − p(yt ; Yt −1 , A2 )] ̸= 0
(16)
t =1
then f (w) is concave. The condition (16) does not necessarily hold, but it seems to us that the only realistic case in which it does not occurs when one of the models nests the other and the restrictions that create the nesting are correct for the pseudo-true parameter vector. We have in mind, here, prediction models A1 and A2 that are typically non-nested and, in fact, differ substantially in functional form for their predictive densities. Henceforth we shall assume that (16) is true. Given this assumption wT∗ = arg maxw fT (w) converges almost surely to the unique value w ∗ = arg maxw f (w).
2 The setup in (13) is formally similar to the nesting proposed by Quandt (1974) in order to test the null hypothesis A1 = D against the alternative A2 = D. (See also Gourieroux and Monfort (1989, Section 22.2.7).) That is not the objective here. Moreover, Quant’s test involves simultaneously maximizing the function in the parameters of both models and w , and is therefore equivalent to the attempt to estimate by maximum likelihood the mixture models discussed in Section 6; Quandt (1974) clearly recognizes the pitfalls associated with this procedure.
J. Geweke, G. Amisano / Journal of Econometrics 164 (2011) 130–141
Thus for a given data generating process D there is a unique, limiting optimal prediction pool. As shown in Hall and Mitchell (2007) this prediction pool minimizes the Kullback–Leibler directed distance from D to the prediction model (9). It will prove useful to distinguish between several kinds of prediction pools, based on the properties of fT . If wT∗ ∈ (0, 1) then A1 and A2 are each competitive in the pool {A1 , A2 }. Equivalently, fT′ (0) > 0 and fT′ (1) < 0. Beyond these inequalities and concavity there are no restrictions on fT (w) when the models are competitive. It is possible that LS (YoT ; A2 ) > LS (YoT ; A1 ) and yet A1 has larger weight in the optimal prediction pool. That this is so may be seen from the following example with T = 3. The values of p(yot ; Yot−1 , Aj ) are A1
A2
t =1
0.0298
0.9652
t =2
0.9160
0.1408
t =3
0.8105
0.2240
The log scores are LS (YoT ; A1 ) = −3.7180 and LS (YoT ; A2 ) = −3.4943, and the optimal weight is wT∗ = 0.5980: model A1 receives almost 60% of the weight despite having a lower log score. Similarly, in the hypothetical case LS (YoT ; A1 ) = LS (YoT ; A2 ) the optimal weight wT∗ can occur anywhere in (0, 1). By mild extension A1 and A2 are each competitive in the population pool {A1 , A2 } if w∗ ∈ (0, 1) If wT∗ = 1 then A1 is dominant in the pool {A1 , A2 } and A2 is ′
excluded in that pool;3 equivalently fT (1) ≥ 0, which amounts to T −1
T −
p(yot ; Yot−1 , A2 )/p(yot ; Yot−1 , A1 ) ≤ 1.
(17)
t =1
By extension, if w ∗ = 1 then A1 is dominant in the population pool and A2 is excluded in that pool. Computation of wT∗ is trivial. If fT′ (0) ≤ 0 then wT∗ = 0; if ′ fT (1) ≥ 0 then wT∗ = 1; if neither condition holds then wT∗ can be determined by successive bifurcation, Newton, Newton–Raphson, or other descent methods. Some special cases are interesting, not because they are likely to occur, but because they help to illuminate the relationship of prediction pools to concepts familiar from model comparison. First consider the hypothetical case A1 = D. Proposition 1. If A1 = D then A1 is dominant in the population pool {A1 , A2 } and f ′ (1) = 0. Proof. If A1 = D,
T →∞
provides some indication of the strength of the evidence that neither A1 = D nor A2 = D. There is a literature on testing that formalizes this idea in the context of (12); see Gourieroux and Monfort (1989, Chapter 22), and Quandt (1974). Our motivation is not to demonstrate that any prediction model is false; we know at the outset that this is the case. What is more important is that (12) evaluated at wT∗ provides a lower bound on the improvement in the log score predictive density that could be attained by models not in the pool, including models not yet discovered. We return to this point in Section 6. If w ∗ ∈ (0, 1) then for a sufficiently large sample size the optimal pool will have a log predictive score superior to that of eia.s.
ther A1 or A2 alone, and as sample size increases wT∗ → w ∗ . This is in marked contrast to conventional Bayesian model combination or non-Bayesian tests. Both will exclude one model or the other asymptotically, although the procedures are formally distinct. For Bayesian model combination the contrast is due to the fact that the conventional setup conditions on one of either D = A1 or D = A2 being true. As we have seen, in this case the posterior probability of A1 and wT∗ have the same limit. By formally admitting the contingency that A1 ̸= D and A2 ̸= D we change the conventional assumptions, leading to an entirely different result: even models that are arbitrarily inferior, as measured by Bayes factors, can substantially improve predictions from the superior model as indicated by a log scoring rule. For non-Bayesian testing the explanation is the same: since a true test rejects one model and accepts the other, it also conditions on one of either D = A1 or D = A2 being true. In the context of some other well-known optimization problems the result is not surprising. The early and influential work of Bates and Granger (1969) examined the optimal combination of two point forecasts in a linear model with a quadratic loss function. These weights always sum to one and both forecasts generally have non-zero weights (though one could be negative). Consider the condition that one forecast is the predictive mean in the data generating process. If this condition is true then that forecast’s weight is one asymptotically, the analogue of Proposition 1. If a Bayesian forecaster operating under quadratic loss believes the condition without doubt then asymptotically the forecaster will place all weight on one model whether the condition is actually true or not. A somewhat looser analogy can be found in portfolio theory: if asset A stochastically dominates asset B (the analogue of a superior log score) an investor will never select asset B if forced to hold all wealth in either A or B, but given the opportunity to construct a portfolio including both assets may well choose to include positive holdings of B as well as A. 3. Examples of two-model pools
T
f ′ (1) = lim T −1
133
− t =1
[ ED 1 −
p(yt ; Yt −1 , A2 ) p(yt ; Yt −1 , D)
]
= 0.
(18)
From (14) and the strict concavity of f it follows that A1 is dominant in the population pool. A second illuminating hypothetical case is LS (A1 ; D) = LS (A2 ; D). Given (16) then A1 ̸= D and A2 ̸= D in view of Proposition 1. The implication of this result for practical work is that if two non-nested models have roughly the same log score then neither is ‘‘true’’. Section 6 returns to this implication at greater length. Turning to the more realistic case LS (A1 ; D) ̸= LS (A2 ; D), w∗ ∈ (0, 1) implies also that A1 ̸= D and A2 ̸= D. In fact one never observes f , of course, but the familiar log scale of fT (w)
3 Dominance is a necessary condition for forecast encompassing (Chong and Hendry, 1986) asymptotically. But it is clearly weaker than forecast encompassing.
We illustrate some properties of two-model pools using daily percent log returns of the Standard and Poors (S&P) 500 index and six alternative models for these returns. All of the models used rolling samples of 1250 trading days, about five years.4 The first sample consisted of returns from January 3, 1972 (h = −1249, in the notation of the previous section) through December 14, 1976 (t = 0), and the first predictive density evaluation was for the return on December 15, 1976 (t = 1). The last predictive density evaluation was for the return on December 16, 2005 (T = 7324). Three of the models are estimated by maximum likelihood and predictive densities are formed by substituting the estimates
4 Posterior densities with full samples are similar to those for rolling samples; see Geweke and Amisano (2011). Because the work here entails the construction of 7,324 MCMC samples in the stochastic volatility and hierarchical Markov mixture models, and computing time is proportional to sample size in those models, we chose to carry out the exercise with rolling samples.
Table 1
Log predictive scores of the alternative models.

Gaussian     GARCH       EGARCH      t-GARCH     SV          HMNM
−10570.80    −9574.41    −9549.41    −9317.50    −9460.93    −9336.60
Fig. 1. Log predictive score as a function of model weight in some two-model pools of S&P 500 predictive densities 1976–2005. Prediction models are described in the text.
for the unknown parameters: a Gaussian i.i.d. model (''Gaussian,'' hereafter); a Gaussian generalized autoregressive conditional heteroscedasticity model with parameters p = q = 1, or GARCH(1,1) (''GARCH''); and a Gaussian exponential GARCH model with p = q = 1 (''EGARCH''). Three of the models formed full Bayesian predictive densities using MCMC algorithms: a GARCH(1,1) model with i.i.d. Student-t shocks (''t-GARCH''); the stochastic volatility model of Jacquier et al. (1994) (''SV''); and the hierarchical Markov normal mixture model with serial correlation and m1 = m2 = 5 latent states described in Geweke and Amisano (2011) (''HMNM''). Table 1 provides the log predictive score for each model. That for t-GARCH exceeds that of the nearest competitor, HMNM, by 19. Results for each are based on full Bayesian inference, but the log predictive scores are not the same as log marginal likelihoods because the early part of the data set is omitted and rolling rather than full samples are used. Nevertheless the difference between these two models strongly suggests that a formal Bayesian model comparison would yield overwhelming posterior odds in favor of t-GARCH. Of course the evidence against the other models in favor of t-GARCH is even stronger: 143 against SV, 232 against EGARCH, 257 against GARCH, and 1253 against Gaussian. Pools of two models, one of which is t-GARCH, reveal that t-GARCH is not dominant in all of these pools. Fig. 1 shows the function fT(w) for pools of two models, one of which is t-GARCH, with w denoting the weight on the t-GARCH predictive density. The vertical scale is the same in each panel. All functions fT(w) are, of course, concave. In the GARCH and t-GARCH pool fT(w) has an internal maximum at w = 0.944 with fT(0.944) = −9317.12, whereas fT(1) = −9317.50. This distinction is too subtle to be evident in the upper left panel, in which it appears that fT′(1) ≈ 0. For the EGARCH and t-GARCH pool, and for the HMNM and t-GARCH pool, the maximum is clearly internal. For the SV and
t-GARCH pool fT(w) is monotone increasing, with fT′(1) = 1.96. In the Gaussian and t-GARCH pool, not shown in Fig. 1, t-GARCH is again dominant with fT′(1) = 54.4. Thus while all two-model comparisons strongly favor t-GARCH, it is dominant only in the pool with Gaussian and the pool with SV. Fig. 2 portrays fT(w) for two-model pools consisting of HMNM and one other predictive density, with w denoting the weight on HMNM. The scale of the vertical axis is the same as in Fig. 1 in all panels except the upper left, which shows fT(w) in the two-model pool consisting of Gaussian and HMNM. The latter model nests the former, and it is dominant in this pool with fT′(1) = 108.3. In pools consisting of HMNM on the one hand and GARCH, EGARCH or SV on the other, the models are mutually competitive. Thus SV is excluded in a two-model pool with t-GARCH, but not in a two-model pool with HMNM. This is not a logical consequence of the fact that t-GARCH has a higher log predictive score than HMNM. Indeed, the optimal two-model pool for EGARCH and HMNM has a higher log predictive score than any two-model pool that includes t-GARCH, as is evident by comparing the lower left panel of Fig. 2 with all the panels of Fig. 1. Table 2 summarizes some key characteristics of all the two-model pools that can be created for these predictive densities. The entries above the main diagonal indicate the log score of the optimal linear pool of the two prediction models. The entries below the main diagonal indicate the weight wT∗ on the model in the row entry in the optimal pool. In each cell there is a pair of entries. The upper entry reflects pool optimization exactly as described in the previous section. In particular, the optimal prediction model weight is determined just once, on the basis of the predictive densities for all T data points. This scheme could not be used in practice because only past data are available for optimization. The lower entry in each pair reflects pool optimization using the predictive densities p(y_s^o; Y_{s−1}^o, A_j) (s = 1, …, t − 1) to form
Fig. 2. Log predictive score as a function of model weight in some two-model pools of S&P 500 predictive densities 1976–2005. Prediction models are described in the text.
Table 2
Optimal pools of two predictive models.

           Gaussian       GARCH               EGARCH              t-GARCH             SV                  HMNM
Gaussian   ·              −9539.72/−9541.42   −9505.57/−9507.73   −9317.50/−9318.65   −9460.45/−9461.99   −9336.60/−9337.48
GARCH      0.957/0.943    ·                   −9514.26/−9516.47   −9317.12/−9317.48   −9417.88/−9419.84   −9310.59/−9313.55
EGARCH     0.943/0.920    0.628/0.386         ·                   −9296.08/−9298.29   −9380.07/−9383.15   −9280.34/−9282.68
t-GARCH    1.000/0.984    0.944/0.931         0.677/0.861         ·                   −9317.50/−9318.15   −9284.72/−9287.28
SV         0.986/0.971    0.494/0.384         0.421/0.453         0.000/0.007         ·                   −9323.88/−9325.50
HMNM       1.000/0.996    0.628/0.611         0.529/0.670         0.289/0.307         0.713/0.787         ·

Entries above the diagonal are log scores of optimal pools; entries below the diagonal provide the weight of the model in that row in the optimal pool. In each pair the first entry reflects optimization using the entire sample and the second reflects continuous updating of weights using only the data available on each date; second entries below the diagonal are average weights over the sample.
the optimal pooled predictive density for yt. The log scores (above the main diagonal in Table 2) are the sums of the log scores for pools formed in this way. The weights (below the main diagonal in Table 2) are averages of the weights wt∗ taken across all T predictive densities. (For t = 1, w1∗ was arbitrarily set at 0.5.) For example, in the t-GARCH and HMNM pool, the log score using the optimal weight based on all T observations is −9284.72. If, instead, the optimal weight is recalculated in each period using only past predictive likelihoods, then the log score is −9287.28. The weight on the HMNM model is 0.289 in the former case, and the average weight on this model is 0.307 in the latter case. Note that in every case the log score is lower when it is determined using only past predictive likelihoods than when it is determined using the entire sample. But the values are, at most, about 3 points lower. The weights themselves show some marked differences; pools involving EGARCH seem to exhibit the largest contrasts. The fact that the two methods can produce substantial differences in weights, but the log scores are always nearly the same, is consistent with the small values of |fT′′(w)| in substantial neighborhoods of the optimal value of w evident in Figs. 1 and 2.
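The recursive scheme behind the lower entries of Table 2 can be sketched as follows. This is an illustrative Python implementation under our own naming, taking as input the two series of predictive density evaluations; it is not the authors' code.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def recursive_pool(p1, p2):
    """Two-model pool with weight w_t* re-optimized each period using only
    past predictive density evaluations (w_1* is arbitrarily set at 0.5).
    Returns the realized log score and the average weight on model 1."""
    T = len(p1)
    w = np.empty(T)
    w[0] = 0.5
    for t in range(1, T):
        # maximize sum_{s<t} log(v * p1_s + (1 - v) * p2_s) over v in [0, 1]
        res = minimize_scalar(
            lambda v: -np.sum(np.log(v * p1[:t] + (1.0 - v) * p2[:t])),
            bounds=(0.0, 1.0), method="bounded")
        w[t] = res.x
    log_score = float(np.sum(np.log(w * p1 + (1.0 - w) * p2)))
    return log_score, float(w.mean())
```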
Fig. 3 shows the evolution of the weight wt∗ in some two-model pools when pools are optimized using only past realizations of predictive densities. Not surprisingly wt∗ fluctuates violently at the start of the sample. Although the predictive densities are based on rolling five-year samples, wt∗ should converge almost surely to a limit under the conditions specified in Section 2. The HMNM and t-GARCH pool, upper left panel, might be interpreted as displaying this convergence, but the case for the pools involving EGARCH is not so strong. Whether or not Section 2 provides a good asymptotic paradigm for the behavior of wt∗ is beside the point, however. The important fact is that a number of pools of two models outperform the model that performs best on its own (t-GARCH), performance being assessed by the log scoring rule in each case. The best of these two-model pools (HMNM and EGARCH) does not even involve t-GARCH, and it outperforms t-GARCH by 37 points. These findings illustrate the fresh perspective brought to model combination by linear pools of prediction models. Extending pools to more than two models provides additional interesting insights.
Fig. 3. Evolution of model weights in some two-model pools of S&P 500 predictive densities 1976–2005. Prediction models are described in the text.
4. Pools of multiple models
Proposition 1 generalizes immediately to pools of multiple models.
In a prediction pool with n models the log predictive score function is

f_T(\mathbf{w}) = \sum_{t=1}^{T} \log\left[ \sum_{i=1}^{n} w_i\, p(y_t; Y_{t-1}, A_i) \right]  (19)

where w = (w_1, …, w_n)′, w_i ≥ 0 (i = 1, …, n) and \sum_{i=1}^{n} w_i = 1. Given our assumptions about the data generating process D,

T^{-1} f_T(\mathbf{w}) \xrightarrow{a.s.} \lim_{T\to\infty} T^{-1} \int \sum_{t=1}^{T} \log\left[ \sum_{i=1}^{n} w_i\, p(y_t; Y_{t-1}, A_i) \right] p(Y_T \mid D)\, d\nu(Y_T) = f(\mathbf{w}).  (20)
Denote p_{ti} = p(y_t^o; Y_{t-1}^o, A_i) (t = 1, …, T; i = 1, …, n). Substituting w_1 = 1 − \sum_{i=2}^{n} w_i,

\partial f_T(\mathbf{w})/\partial w_i = \sum_{t=1}^{T} \frac{p_{ti} - p_{t1}}{\sum_{j=1}^{n} w_j p_{tj}} \quad (i = 2, …, n);  (21)

and

\partial^2 f_T(\mathbf{w})/\partial w_i \partial w_j = -\sum_{t=1}^{T} \frac{(p_{ti} - p_{t1})(p_{tj} - p_{t1})}{\left[\sum_{k=1}^{n} w_k p_{tk}\right]^2} \quad (i, j = 2, …, n).  (22)
The n × n Hessian matrix \partial^2 f_T/\partial\mathbf{w}\partial\mathbf{w}' is non-positive definite for all w and, pathological cases aside, negative definite. Thus fT(w) is strictly concave on the unit simplex. Given the evaluations p_{ti} over the sample from the alternative prediction models, finding w_T^* = \arg\max_{\mathbf{w}} f_T(\mathbf{w}) is a straightforward convex programming problem.5 The limit f(w) is also concave in w and w^* = \arg\max_{\mathbf{w}} f(\mathbf{w}) = \lim_{T\to\infty} w_T^*.

5 The computations can be undertaken with widely available mathematical applications software. The results reported in the next section used the Matlab™ function fmincon, which implements the trust region method of Moré and Sorensen (1983).
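The convex program described in footnote 5 is simple to set up. The following is a minimal sketch, our own Python analogue of the fmincon setup rather than the authors' implementation, maximizing fT(w) over the unit simplex; the illustration at the bottom uses two simple fictitious models on simulated data.

```python
import numpy as np
from scipy.optimize import minimize

def optimal_pool(P):
    """w_T* = argmax_w sum_t log(sum_i w_i p_ti), w on the unit simplex.
    P is a T x n matrix of predictive density evaluations p_ti."""
    n = P.shape[1]
    res = minimize(lambda w: -np.sum(np.log(P @ w)),
                   np.full(n, 1.0 / n),              # start at equal weights
                   method="SLSQP",
                   bounds=[(0.0, 1.0)] * n,
                   constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}])
    return res.x, -res.fun

# Illustration: a misspecified Gaussian model and a Student-t model
from scipy.stats import norm, t
rng = np.random.default_rng(0)
y = rng.standard_t(df=5, size=1000)
P = np.column_stack([norm.pdf(y, scale=y.std()),   # misspecified Gaussian
                     t.pdf(y, df=5)])              # correctly specified
w_star, f_star = optimal_pool(P)
```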
Proposition 2. If A1 = D then A1 is dominant in the population pool {A1, …, Am} and

w^* = \bar{w}, \quad \bar{w}' = (1, 0, …, 0).  (23)

Proof. From (21),

\partial f(\mathbf{w})/\partial w_j \big|_{\mathbf{w}=\bar{w}} = \lim_{T\to\infty} T^{-1} \sum_{t=1}^{T} E_D\left[ \frac{p(y_t; Y_{t-1}, A_j)}{p(y_t; Y_{t-1}, D)} - 1 \right] = 0 \quad (j = 2, …, m)  (24)

and consequently \partial f(\mathbf{w})/\partial w_1|_{\mathbf{w}=\bar{w}} = 0 as well. From the concavity of f(w), w^* = \bar{w}. □

Extending the definitions of Section 2, models A1, …, Am (m < n) are jointly excluded in the pool {A1, …, An} if \sum_{i=1}^{m} w_{Ti}^* = 0; they are jointly competitive in the pool if 0 < \sum_{i=1}^{m} w_{Ti}^* < 1; and they jointly dominate the pool if \sum_{i=1}^{m} w_{Ti}^* = 1. Obviously any pool has a smallest dominant subset. A pool trivially dominates itself. There are several relations between exclusion, competitiveness and dominance that partially characterize the conditions in which a model enters an optimal pool.

Proposition 3. If {A1, …, Am} dominates the pool {A1, …, An} then {A1, …, Am} dominates {A1, …, Am, Aj1, …, Ajk} for all {j1, …, jk} ⊆ {m + 1, …, n}.

Proof. By assumption {Am+1, …, An} is excluded in the pool {A1, …, An}. The pool {A1, …, Am, Aj1, …, Ajk} imposes the constraints wi = 0 for all i > m, i ∉ {j1, …, jk}. Since {Am+1, …, An} was excluded in {A1, …, An} these constraints are not binding. Therefore {Aj1, …, Ajk} is excluded in the pool {A1, …, Am, Aj1, …, Ajk}. □
Thus a dominant subset of a pool is dominant in all subsets of the pool in which it is included.

Proposition 4. If {A1, …, Am} dominates all pools {A1, …, Am, Aj} (j = m + 1, …, n) then {A1, …, Am} dominates the pool {A1, …, An}.

Proof. The result is a consequence of the concavity of the objective function. The assumption implies that there exist weights w_2^*, …, w_m^* such that \partial f_T(w_2^*, …, w_m^*, w_j)/\partial w_j < 0 when evaluated at w_j = 0 (j = m + 1, …, n). Taken jointly these n − m conditions are necessary and sufficient for w_{m+1} = \cdots = w_n = 0 in the optimal pool created from the models {A1, …, An}. □

The converse of Proposition 4 is a special case of Proposition 3. Taken together these propositions provide an efficient means to show that a small group of models is dominant in a large pool.

Proposition 5. The set of models {A1, …, Am} is excluded in the pool {A1, …, An} if and only if Aj is excluded in each of the pools {Aj, Am+1, …, An} (j = 1, …, m).

Proof. This is an immediate consequence of the first-order conditions for exclusion, just as in the proof of Proposition 4. □

Proposition 6. If the model A1 is excluded in all pools (A1, Ai) (i = 2, …, n) then A1 is excluded in the pool (A1, …, An).

Proof. From (14) and the concavity of fT the assumption implies

T^{-1} \sum_{t=1}^{T} p_{t1}/p_{ti} \le 1 \quad (i = 2, …, n).  (25)

Let \bar{w}_i (i = 2, …, n) be the optimal weights in the pool (A2, …, An). From (21)

T^{-1} \sum_{t=1}^{T} \frac{p_{ti}}{\sum_{j=2}^{n} \bar{w}_j p_{tj}} = \lambda \quad \text{if } \bar{w}_i > 0 \quad (i = 2, …, n)  (26)

for some positive but unspecified constant λ. From (25) and Jensen's inequality

T^{-1} \sum_{t=1}^{T} \frac{p_{t1}}{\sum_{j=2}^{n} \bar{w}_j p_{tj}} < T^{-1} \sum_{t=1}^{T} \sum_{i=2}^{n} \bar{w}_i \frac{p_{t1}}{p_{ti}} < 1.  (27)

Suppose \bar{w}_i > 0. From (26)

T^{-1} \sum_{t=1}^{T} \frac{p_{ti}}{\sum_{j=2}^{n} \bar{w}_j p_{tj}} = T^{-1} \sum_{t=1}^{T} \sum_{\ell=2}^{n} \bar{w}_\ell \frac{p_{t\ell}}{\sum_{j=2}^{n} \bar{w}_j p_{tj}} = 1 \quad (i = 2, …, n).  (28)

From (27) and (28),

T^{-1} \sum_{t=1}^{T} \frac{p_{ti} - p_{t1}}{\sum_{j=2}^{n} \bar{w}_j p_{tj}} \ge 0 \quad (i = 2, …, n).  (29)

Since w_1 = 1 − \sum_{i=2}^{n} w_i, it follows from (21) that \partial f_T(\mathbf{w})/\partial w_1 \le 0 at the point w = (0, \bar{w}_2, …, \bar{w}_n)′. Because fT is concave this is necessary and sufficient for A1 to be excluded in the pool (A1, …, An). □

Proposition 6 shows that one can establish the exclusion of A1 in the pool {A1, …, An}, or for that matter any subset of the pool {A1, …, An} that includes A1, by showing that A1 is excluded in the two-model pools {A1, Ai} for all Ai that make up the larger pool. The converse of Proposition 6 is false. That is, a model can be excluded in a pool with three or more models, and yet be competitive in some (or even all) pairwise pools. Consider T = 2 and the following values of pti:

         A1     A2     A3
t = 1    0.4    0.1    1.0
t = 2    0.4    1.0    0.1

The model A1 is competitive in the pools {A1, A2} and {A1, A3} because in (14) fT′(0) > 0 and fT′(1) < 0 in each pool. In the optimal pool {A2, A3} the models A2 and A3 have equal weight, with \sum_{j=2}^{3} \bar{w}_j p_{tj} = 0.55 (t = 1, 2). The first-order conditions in (21) are \partial f_T(\mathbf{w})/\partial w_2 = \partial f_T(\mathbf{w})/\partial w_3 = 0.3/0.55 > 0 and therefore the constraint w1 ≥ 0 is binding in the optimal pool {A1, A2, A3}. The contours of the log predictive score function are shown in Fig. 4(a). Notice also in this example that LS(Y_T^o, A1) = −1.833 > −2.302 = LS(Y_T^o, A2) = LS(Y_T^o, A3), and thus the model with the highest log score can be excluded from the optimal pool. The same result holds in the population: the Kullback–Leibler distance from D to A1 may be less than the distance from D to Aj (j = 2, …, m) and yet A1 may be excluded in the population pool {A1, …, Am} so long as m > 2. If m = 2 then the model with the higher log score is always included in the optimal pool.

No significantly stronger version of Proposition 6 appears to be true. Consider the conjecture that if model A1 is excluded in one of the pools {A1, Ai} (i = 2, …, n), then A1 is excluded in the pool {A1, …, An}. The contrapositive of this claim is that if A1 is competitive in {A1, …, An} then it is competitive in {A1, Ai} (i = 2, …, n), and by extension A1 would be competitive in any subset of {A1, …, An} that includes A1. That this is not true may be seen from the following example with T = 4:

         A1     A2     A3
t = 1    0.8    0.9    1.3
t = 2    1.2    1.1    0.7
t = 3    0.9    1.0    1.1
t = 4    1.1    1.0    0.9

The optimal pool {A1, A2, A3} weights the models equally, as may be verified from (21). But A1 is excluded in the pool {A1, A2}: assigning weight w to A1, (14) shows

f_T'(0) = \frac{-0.1}{0.9} + \frac{0.1}{1.1} + \frac{-0.1}{1} + \frac{0.1}{1} < 0.  (30)

The contours of the log predictive score function are shown in Fig. 4(b).
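Both counterexamples are easy to verify numerically. The sketch below, our own code rather than anything from the paper, recovers the optimal weights for the T = 2 example and confirms that A1 is excluded from the three-model pool yet competitive in each pairwise pool.

```python
import numpy as np
from scipy.optimize import minimize

def pool_weights(P):
    """Optimal pool weights for a T x n matrix of density evaluations p_ti."""
    n = P.shape[1]
    res = minimize(lambda w: -np.sum(np.log(P @ w)),
                   np.full(n, 1.0 / n), method="SLSQP",
                   bounds=[(0.0, 1.0)] * n,
                   constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}])
    return res.x

P = np.array([[0.4, 0.1, 1.0],      # p_ti for the T = 2 counterexample
              [0.4, 1.0, 0.1]])
print(pool_weights(P))              # weight on A1 is driven to zero
print(pool_weights(P[:, [0, 1]]))   # A1 has positive weight in {A1, A2}
print(pool_weights(P[:, [0, 2]]))   # A1 has positive weight in {A1, A3}
```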
5. Multiple-model pools: an example

Using the same S&P 500 returns data set described in Section 3 it is easy to find the optimal linear pool of all six prediction models described in that section. (The optimization required 0.22 s using conventional Matlab software, illustrating the trivial computations required for log score optimal pooling once the predictive density evaluations are available.) The first line of Table 3 indicates the composition of the optimal pool and the associated log score. The EGARCH, t-GARCH and HMNM models are jointly dominant in this pool while Gaussian, GARCH and SV are excluded. In the optimal pool the highest weight is given to t-GARCH, the next highest to EGARCH, and the smallest positive weight to HMNM.
Fig. 4. Panel (a) is a counterexample to the converse of Proposition 6. Panel (b) is a counterexample to a conjectured strengthening of Proposition 6.
Fig. 5. Contours of the log score function (10) for the three models dominant in the six-model prediction pool for S&P 500 returns 1972–2005. Residual weight accrues to the HMNM model. The three small circles indicate optimal two-model pools.
Weights do not indicate a predictive model’s contribution to log score, however. The next three lines of Table 3 show the impact of excluding one of the models dominant in the optimal pool. The results show that HMNM makes the largest contribution to the optimal score, 31.25 points; EGARCH the next largest, 19.47 points; and t-GARCH the smallest, 15.51 points. This ranking strictly reverses the ranking by weight in the optimal pool. When EGARCH is removed GARCH enters the dominant pool with a small weight, whereas the same models are excluded in the optimal pool when either t-GARCH or HMNM is removed. These characteristics of the pool are evident in Fig. 5, which shows log predictive score contours for the dominant three-model pool on the unit simplex. Weights for EGARCH and t-GARCH are
shown explicitly on the horizontal and vertical axes, with residual weight on HMNM. Thus the origin corresponds to HMNM, the lower right vertex of the simplex to EGARCH, and the upper left vertex to t-GARCH. Values of the log score for the pool at those points can be read from Table 1. The small circles indicate optimal pools formed from two of the three models: EGARCH and HMNM on the horizontal axis, t-GARCH and HMNM on the vertical axis, and EGARCH and t-GARCH on the diagonal. Values of the log score for the pool at those points can be read from the last three entries in the last column of Table 3. The optimal pool is indicated by the asterisk. Moving away from this point, the log score function is much steeper moving toward the diagonal than toward either axis.
Fig. 6. Evolution of model weights in the six-model pool of S&P 500 predictive densities 1976–2005. Prediction models are described in the text.

Table 3
Optimal pools of 6 and 5 models.

Gaussian   GARCH    EGARCH   t-GARCH   SV       HMNM     log score
0.000      0.000    0.319    0.417     0.000    0.264    −9264.83
0.000      0.060    X        0.653     0.000    0.286    −9284.30
0.000      0.000    0.471    X         0.000    0.529    −9280.34
0.000      0.000    0.323    0.677     0.000    X        −9296.08

The first six columns provide the weights for the optimal pools and the last column indicates the log score of the optimal pool. ''X'' indicates that a model was not included in the pool.
This reflects the large contribution of HMNM to the log score relative to the other two models just noted. The optimal pool could not be used in actual prediction 1976–2005 because the weights draw on all of the returns from that period. As in Section 3, optimal weights can be computed each day to form a prediction pool for the next day. These weights are portrayed in Fig. 6. There is substantial movement in the weights, with a notable tendency for the weight on EGARCH to be increasing at the expense of t-GARCH even late in the period. Nevertheless the log score for the prediction model pool constructed in this way is −9267.82, just 3 points lower than the pool optimized over the entire sample. Moreover this value substantially exceeds the log score for any model over the same period, or for any optimal pool of two models (see Table 3). This insensitivity of the pool log score to substantial changes in the weights reflects the shallowness of the objective function near its mode: a pool with equal weights for the three dominant models has a log score of −9265.62, almost as high as that of the optimal pool. This leaves essentially no possible return (as measured by the log score) to more elaborate methods of combining models like bagging (Breiman, 1996) or boosting (Friedman et al., 2000). Whether these circumstances are typical can be established directly by applying the same kind of analysis undertaken in this section for the relevant data and models, a question left to future research. These results with weights reoptimized each day using only the data at hand establish the utility of optimal pooling with a log scoring rule. The comparisons made are entirely with respect to out-of-sample predictive performance, which is the final arbiter.
Moreover, the optimal pool devised in this way offers a substantial improvement on the best model in the pool. There are any number of potential improvements. For example, weights could be made functions of the recent history of yt; out-of-sample performance might be improved by shrinking the weights toward a common value; and the outcome of conventional sampling-theoretic tests of the almost sure limits of the weights might be harnessed to some improvement. These possibilities, and others, may be worthy of future research.

6. Pooling and model improvement

The linear pool {A1, A2} is superficially similar to a mixture of the same models. In fact the two are not the same, but there is an interesting relationship between their log predictive scores. Denote the mixture of A1 and A2

p(y_t \mid Y_{t-1}, \theta_{A_1}, \theta_{A_2}, w, A_{1\cdot 2}) = w\, p(y_t \mid Y_{t-1}, \theta_{A_1}) + (1 - w)\, p(y_t \mid Y_{t-1}, \theta_{A_2}).  (31)

Equivalently, there is an i.i.d. latent binomial random variable \tilde{w}_t, independent of Y_{t-1}, with P(\tilde{w}_t = 1) = w, such that y_t \sim p(y_t \mid Y_{t-1}, \theta_{A_1}) if \tilde{w}_t = 1 and y_t \sim p(y_t \mid Y_{t-1}, \theta_{A_2}) if \tilde{w}_t = 0. If the prediction model Aj is fully Bayesian (1) or utilizes maximum likelihood estimates in (3) then under weak regularity conditions

T^{-1} LS(Y_T, A_j) \xrightarrow{a.s.} \lim_{T\to\infty} T^{-1} \int \log p(Y_T \mid \theta^*_{A_j}, A_j)\, p(Y_T \mid D)\, d\nu(Y_T) = LS(A_j; D) \quad (j = 1, 2)  (32)
Fig. 7. Expected log scores for individual models, a linear model pool, a mixture model, and the data generating process. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
where

\theta^*_{A_j} = \arg\max_{\theta_{A_j}} \lim_{T\to\infty} T^{-1} \int \log p(Y_T \mid \theta_{A_j}, A_j)\, p(Y_T \mid D)\, d\nu(Y_T) \quad (j = 1, 2),  (33)

sometimes called the pseudo-true values of \theta_{A_1} and \theta_{A_2}. However \theta^*_{A_1} and \theta^*_{A_2} are not, in general, the pseudo-true values of \theta_{A_1} and \theta_{A_2} in the mixture model A_{1\cdot 2}, and w^* is not the pseudo-true value of w. These values are instead
(\theta^{**}_{A_1}, \theta^{**}_{A_2}, w^{**}) = \arg\max_{\theta_{A_1}, \theta_{A_2}, w} \lim_{T\to\infty} T^{-1} \int \sum_{t=1}^{T} \log\left[ w\, p(y_t \mid Y_{t-1}, \theta_{A_1}) + (1 - w)\, p(y_t \mid Y_{t-1}, \theta_{A_2}) \right] p(Y_T \mid D)\, d\nu(Y_T).  (34)

Let w^* = \arg\max_w f(w). Note that

\lim_{T\to\infty} T^{-1} \int \sum_{t=1}^{T} \log\left[ w^{**} p(y_t \mid Y_{t-1}, \theta^{**}_{A_1}) + (1 - w^{**}) p(y_t \mid Y_{t-1}, \theta^{**}_{A_2}) \right] p(Y_T \mid D)\, d\nu(Y_T)
\ge \lim_{T\to\infty} T^{-1} \int \sum_{t=1}^{T} \log\left[ w^* p(y_t \mid Y_{t-1}, \theta^*_{A_1}) + (1 - w^*) p(y_t \mid Y_{t-1}, \theta^*_{A_2}) \right] p(Y_T \mid D)\, d\nu(Y_T)
= f(w^*) \ge w^* LS(A_1; D) + (1 - w^*) LS(A_2; D).  (35)
Therefore the best log predictive score that can be obtained from a linear pool of the models A1 and A2 is a lower bound on the log predictive score of a mixture model constructed from A1 and A2 . This result clearly generalizes to pools and mixtures of n models. This result suggests that a formal mixture model of the constituents of the pool might be worthy of investigation, but itself does not sustain the conclusion that under a log scoring rule one should accordingly conduct inference in a mixture model and then construct predictive densities. First, the foregoing argument is entirely asymptotic and there is no implication
that realized log scores would be superior. Second, the technical demands for inference in a mixture of quite heterogeneous models substantially exceed those for the individual models, and the latter are generally available while (at least at present) the former are not. Nonetheless, this is a promising avenue for future research. To illustrate the relationship between the log scores of the optimal pool and the mixture model, suppose the data generating process D is yt ∼ N(1, 1) if yt−1 > 0 and yt ∼ N(−1, 1) if yt−1 < 0. In model A1, yt ∼iid N(µ, σ²) with µ ≥ 1, and in model A2, yt ∼iid N(µ, σ²) with µ ≤ −1. Corresponding to (33) the pseudo-true value of µ is 1 in A1 and −1 in A2; the pseudo-true value of σ² is 3 in both models. The expected log score, approximated by direct simulation, is −1.974 in both models. This value is indicated by the dashed (green) horizontal line in Fig. 7. The function f(w), also approximated by direct simulation, is indicated by the concave solid (red) curve in the same figure. The maximum, at w = 1/2, is f(w) = −1.866. Thus fT(w) would indicate that neither model could coincide with D, even for small T. The mixture model (31) will interpret the data as independent and identically distributed, and the pseudo-true values corresponding to (34) will be µ = 1 for one component, µ = −1 for the other, and σ² = 1 in both. The expected log score, approximated by direct simulation, is −1.756, indicated by the dotted (blue) horizontal line in Fig. 7. In the model A = D, yt | (yt−1, A) has mean −1 or 1, and variance 1. Its expected log score is −(1/2)[log(2π) + 1] = −1.419, indicated by the solid (black) horizontal line in the figure. The example illustrates that max f(w) can fall well short of the mixture model expected log score, and that the latter can, in turn, be much less than the data generating process expected log score. It is never possible to show that A = D: only to adduce evidence that A ≠ D.
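The simulation approximations in this example are easy to replicate. The following sketch is our own code, with the values reported in the text quoted as comments; it computes the four expected log scores by direct simulation under the stated pseudo-true parameter values.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 500_000

# Simulate the DGP: y_t ~ N(1,1) if y_{t-1} > 0, y_t ~ N(-1,1) otherwise
y = np.empty(T)
y[0] = rng.normal(1.0, 1.0)
for t in range(1, T):
    y[t] = rng.normal(1.0 if y[t - 1] > 0 else -1.0, 1.0)

def norm_logpdf(x, mu, var):
    return -0.5 * (np.log(2.0 * np.pi * var) + (x - mu) ** 2 / var)

lp1 = norm_logpdf(y, 1.0, 3.0)     # model A1 at its pseudo-true values
lp2 = norm_logpdf(y, -1.0, 3.0)    # model A2 at its pseudo-true values
print(lp1.mean(), lp2.mean())      # both close to the reported -1.974

pool = np.logaddexp(np.log(0.5) + lp1, np.log(0.5) + lp2)
print(pool.mean())                 # f(1/2), approximately -1.866

mix = np.logaddexp(np.log(0.5) + norm_logpdf(y, 1.0, 1.0),
                   np.log(0.5) + norm_logpdf(y, -1.0, 1.0))
print(mix.mean())                  # mixture model, approximately -1.756

mu_t = np.where(y[:-1] > 0, 1.0, -1.0)
print(norm_logpdf(y[1:], mu_t, 1.0).mean())   # DGP, approximately -1.419
```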
7. Conclusion

In any decision-making setting requiring prediction there will be competing models. If one is willing to condition on one of the models available being true, then econometric
theory is comparatively tidy. In both Bayesian and non-Bayesian approaches, it is typically the case that one of a fixed number of models will come to dominate as sample size increases without bound. At least in social science applications there is no reason to believe that any of the models under consideration is true, and in many instances there is ample evidence that none could be true. This study develops an approach to model combination designed for such settings. It shows that linear prediction pools generally yield superior predictions as assessed by a conventional log score function. (This finding does not depend on the existence of a true model.) An important characteristic of these pools is that prediction model weights do not necessarily tend to zero or one asymptotically, as is the case for posterior probabilities. (This result invokes the existence of a true model.) The example studied here involves six models and a large sample. One of these models has posterior probability very nearly one. Yet three of the six models in the pool have positive weights, all substantial. Optimal log scoring of prediction pools has three practical advantages. First, it is easy to do: compared with the cost of specifying the constituent models and conducting formal inference for each, it is practically costless. Second, the behavior of the log score as a function of model weights can show clearly that none of the models under consideration is true, or even close to true as measured by Kullback–Leibler distance. Third, linear prediction pools provide an easy way to improve predictions as assessed by the log score function. The example studied in this paper illustrates how acknowledging that all the available models are false can result in improved predictions, even as the search for better models goes on. The last result is especially important. Our examples showed how models that are clearly inferior to others in the pool nevertheless substantially improve prediction by being part of the pool rather than being discarded. The analytical results in Section 4 and the examples in Section 5 establish that the most valuable model in a pool need not be the one most strongly favored by the evidence interpreted under the assumption that one of several models is true. It seems to us that this is a lesson that should be heeded generally in decision making of all kinds.

Acknowledgements

This paper was originally prepared for the Forecasting in Rio Conference, Graduate School of Economics, Getulio Vargas Foundation, Rio de Janeiro, July 2008. We wish to acknowledge helpful comments from participants in numerous seminars and meetings, and in particular James Chapman, Frank Diebold, Joel Horowitz, James Mitchell, Luke Tierney, Mattias Villani, Kenneth Wallis, Robert Winkler and Arnold Zellner, and two referees. Responsibility for any errors or omissions in the paper is ours. Financial support from NSF grant SBR-0720547 is gratefully acknowledged. Portions of this paper are drawn from Geweke (2010) and used with the permission of the publisher, Princeton University Press.

References

Bacharach, J., 1974. Bayesian dialogues. Unpublished manuscript, Christ Church College, Oxford University.
Bates, J.M., Granger, C.W.J., 1969. The combination of forecasts. Operational Research Quarterly 20, 451–468.
Bernardo, J.M., 1979. Expected information as expected utility. The Annals of Statistics 7, 686–690.
Box, G.E.P., 1980. Sampling and Bayes inference in scientific modeling and robustness. Journal of the Royal Statistical Society Series A 143, 383–430.
Breiman, L., 1996. Bagging predictors. Machine Learning 26, 123–140.
Bremnes, J.B., 2004. Probabilistic forecasts of precipitation in terms of quantiles using NWP model output. Monthly Weather Review 132, 338–347.
Brier, G.W., 1950. Verification of forecasts expressed in terms of probability. Monthly Weather Review 78, 1–3.
Chong, Y.Y., Hendry, D.F., 1986. Econometric evaluation of linear macro-economic models. Review of Economic Studies 53, 671–690.
Christoffersen, P.F., 1998. Evaluating interval forecasts. International Economic Review 39, 841–862.
Clemen, R.T., Murphy, A.H., Winkler, R.L., 1995. Screening probability forecasts: contrasts between choosing and combining. International Journal of Forecasting 11, 133–146.
Clements, M.P., 2006. Evaluating the survey of professional forecasters probability distributions of expected inflation based on derived event probability forecasts. Empirical Economics 31, 49–64.
Corradi, V., Swanson, N.R., 2006a. Predictive density evaluation. In: Elliott, G., Granger, C.W.J., Timmermann, A. (Eds.), Handbook of Economic Forecasting. North-Holland, Amsterdam, pp. 197–284. Chapter 5.
Corradi, V., Swanson, N.R., 2006b. Predictive density and conditional confidence interval accuracy tests. Journal of Econometrics 135, 187–228.
Dawid, A.P., 1984. Statistical theory: the prequential approach. Journal of the Royal Statistical Society Series A 147, 278–292.
de Finetti, B., Savage, L.J., 1963. The elicitation of personal probabilities. Unpublished manuscript.
DeGroot, M.H., Fienberg, S.E., 1982. Assessing probability assessors: calibration and refinement. In: Gupta, S.S., Berger, J.O. (Eds.), Statistical Decision Theory and Related Topics III, vol. 1. Academic Press, New York, pp. 291–314.
Diebold, F.X., Gunther, T.A., Tay, A.S., 1998. Evaluating density forecasts with applications to financial risk management. International Economic Review 39, 863–883.
Friedman, J., Hastie, T., Tibshirani, R., 2000. Additive logistic regression: a statistical view of boosting. Annals of Statistics 28, 337–374.
Genest, C., Weerahandi, S., Zidek, J.V., 1984. Aggregating opinions through logarithmic pooling. Theory and Decision 17, 61–70.
Geweke, J., 2001. Bayesian econometrics and forecasting. Journal of Econometrics 100, 11–15.
Geweke, J., 2005. Contemporary Bayesian Econometrics and Statistics. Wiley, Hoboken.
Geweke, J., 2010. Complete and Incomplete Econometric Models. Princeton University Press, Princeton.
Geweke, J., Amisano, G., 2011. Hierarchical Markov normal mixture models with applications to financial asset returns. Journal of Applied Econometrics 26, 1–29.
Gneiting, T., Balabdaoui, F., Raftery, A.E., 2007. Probability forecasts, calibration and sharpness. Journal of the Royal Statistical Society Series B 69, 243–268.
Gneiting, T., Raftery, A.E., 2007. Strictly proper scoring rules, prediction and estimation. Journal of the American Statistical Association 102, 359–378.
Good, I.J., 1952. Rational decisions. Journal of the Royal Statistical Society Series B 14, 107–114.
Gourieroux, C., Monfort, A., 1989. Statistics and Econometric Models, vol. 2. Cambridge University Press, Cambridge.
Granger, C.W.J., White, H., Kamstra, M., 1989. Interval forecasting: an analysis based upon ARCH-quantile estimators. Journal of Econometrics 40, 87–96.
Hall, S.G., Mitchell, J., 2007. Combining density forecasts. International Journal of Forecasting 23, 1–13.
Jacobs, R.A., 1995. Methods for combining experts' probability assessments. Neural Computation 7, 867–888.
Jacquier, E., Polson, N.G., Rossi, P.E., 1994. Bayesian analysis of stochastic volatility models. Journal of Business and Economic Statistics 12, 371–389.
McConway, K.J., 1981. Marginalization and linear opinion pools. Journal of the American Statistical Association 76, 410–414.
Moré, J.J., Sorensen, D.C., 1983. Computing a trust region step. SIAM Journal on Scientific and Statistical Computing 3, 553–572.
Quandt, R.E., 1974. A comparison of methods for testing nonnested hypotheses. Review of Economics and Statistics 56, 92–99.
Shuford, E.H., Albert, A., Massengill, H.E., 1966. Admissible probability measurement procedures. Psychometrika 31, 125–145.
Stone, M., 1961. The opinion pool. Annals of Mathematical Statistics 32, 1339–1342.
Timmermann, A., 2006. Forecast combination. In: Elliott, G., Granger, C.W.J., Timmermann, A. (Eds.), Handbook of Economic Forecasting. North-Holland, Amsterdam, pp. 135–196. Chapter 4.
Wallis, K.F., 2005. Combining density and interval forecasts: a modest proposal. Oxford Bulletin of Economics and Statistics 67, 983–994.
Winkler, R.L., 1969. Scoring rules and the evaluation of probability assessors. Journal of the American Statistical Association 64, 1073–1078.
Winkler, R.L., Murphy, A.M., 1968. ''Good'' probability assessors. Journal of Applied Meteorology 7, 751–758.
Journal of Econometrics 164 (2011) 142–157
Quantile regression for dynamic panel data with fixed effects Antonio F. Galvao Jr. Department of Economics, University of Iowa, W334 Pappajohn Business Building, 21 E. Market Street, Iowa City, IA 52242, United States
Article history: Available online 2 March 2011
JEL classification: C14; C23
Keywords: Quantile regression; Dynamic panel; Fixed effects; Instrumental variables

Abstract
This paper studies a quantile regression dynamic panel model with fixed effects. Panel data fixed effects estimators are typically biased in the presence of lagged dependent variables as regressors. To reduce the dynamic bias, we suggest the use of the instrumental variables quantile regression method of Chernozhukov and Hansen (2006) along with lagged regressors as instruments. In addition, we describe how to employ the estimated models for prediction. Monte Carlo simulations show evidence that the instrumental variables approach sharply reduces the dynamic bias, and the empirical levels for prediction intervals are very close to nominal levels. Finally, we illustrate the procedures with an application to forecasting output growth rates for 18 OECD countries. © 2011 Elsevier B.V. All rights reserved.
1. Introduction

Recently, there has been a growing literature on estimation, testing, and prediction using semiparametric panel data models. Koenker (2004) introduced a general approach to estimation of quantile regression (QR) models for longitudinal data. Individual specific (fixed) effects are treated as pure location shift parameters common to all conditional quantiles and may be subject to shrinkage toward a common value as in the Gaussian random effects paradigm. Controlling for individual specific heterogeneity via fixed effects (FE) while exploring heterogeneous covariate effects within the QR framework offers a more flexible approach to the analysis of panel data than that afforded by the classical Gaussian fixed and random effects estimators. Recent work by Lamarche (2010) and Geraci and Bottai (2007) has elaborated on this form of penalized QR estimator. Abrevaya and Dahl (2008) have introduced an alternative quantile-estimation approach motivated by a correlated random-effects model. In econometric applications the modeling of dynamic relationships and the availability of panel data often suggest dynamic model specifications involving lagged dependent variables. Consistency of estimators in conventional dynamic panel data models depends critically on the assumptions about the initial conditions of the dynamic process. It has been recognized at least since Nickell (1981) that classical ordinary least squares (OLS) estimators in dynamic panel models with FE are seriously biased when the temporal dimension of the panel is short. An extensive literature, initiated by Anderson and Hsiao (1981, 1982) and Arellano and Bond (1991),
E-mail address: [email protected].
has explored instrumental variables (IV) approaches to attenuate the bias, showing that IV methods are able to produce consistent estimators that are independent of the initial conditions.1 Conventional QR estimation of dynamic panel data models with fixed effects suffers from similar bias effects as those seen in the least squares case when the time dimension is modest. Reliance on the existing least squares strategies for bias reduction is unsatisfactory in the QR setting for at least two reasons. First, differencing is inappropriate, either temporally or via the usual deviation from individual means (within) transformation. Linear transformations that are completely innocuous in the context of conditional mean models are highly problematic in conditional quantile models since they alter in a fundamental way what is being estimated.2 Second, the implementation of the IV method needs to be rethought. Fortunately, neither problem is insurmountable. There is no need to transform the QR model to compute the FE estimator. This is a computational convenience in the least squares case, but even when the number of FE is large, interior
1 See, e.g., Ahn and Schmidt (1995), Blundell and Bond (1998) and Bun and Carree (2005). More recently, a number of additional approaches have been proposed to reduce the bias in dynamic and nonlinear panels. These methods use asymptotic approximations derived as both the number of individuals, N, and the number of time series observations, T, go to infinity jointly; see e.g. Arellano and Hahn (2007) for a survey, and Hahn and Kuersteiner (2002), Alvarez and Arellano (2003), Hahn and Newey (2004) and Bester and Hansen (2009) for specific approaches.
point optimization methods using modern sparse linear algebra make direct estimation of the QR model quite efficient. This paper investigates estimation, inference, and prediction in a quantile regression formulation of the dynamic panel data model with individual specific intercepts. We find that conventional FE estimation of the QR specification suffers from similar bias problems as those of least squares estimation. To reduce the dynamic bias in the quantile regression FE estimator, we suggest the use of the IV quantile regression method of Chernozhukov and Hansen (2005, 2006, 2008) along with lagged (or lagged differences of the) regressors as instruments. Thus, the estimator combines the usual IV concept for dynamic panel data and the quantile regression IV framework. There is an emerging literature on forecasting with dynamic panel data (see e.g. Baltagi (2008)), and an important application of the proposed panel model is out-of-sample prediction. Most econometric forecasting methods using panels have focused on models for the conditional mean under Gaussian conditions. The quantile regression model has a significant advantage over these models, since it will be less sensitive to the tail behavior of the underlying random variables representing the forecasting variable of interest, and consequently will be less sensitive to outliers. Moreover, because of the heterogeneous nature of most of the variables of interest in economics, prediction using QR techniques is an important tool for applied work. The model proposed in this paper can also be used to predict the conditional density function of the variable of interest under weak assumptions. Monte Carlo studies are conducted to evaluate the finite sample properties of the estimator. The simulations show that the quantile regression FE estimator is significantly biased in the presence of lagged dependent variables, while the IV method sharply reduces the bias. The experiments suggest that the quantile regression IV approach for dynamic panel data turns out to be especially advantageous when innovations are heavy-tailed. In addition, we conduct simulations to evaluate the performance of the prediction intervals. The results provide evidence that the empirical levels approximate well the nominal or theoretical levels, and the length of the prediction interval decreases with the sample size. Finally, we illustrate the methods with an application to forecasting output growth rates for 18 OECD countries during the time period that includes the subprime economic crisis. We compute one-step-ahead prediction intervals and density functions. The results from estimation of an autoregressive dynamic model show that for most countries zero growth rate lies inside the interval with 0.90 nominal level. In a second round of estimation, we also include the stock price index as an exogenous covariate. In this case, the results show a negative lower bound for all countries, and the density forecasts have more mass on the left tail. The rest of the paper is organized as follows. Section 2 presents the quantile regression dynamic panel data IV with FE estimation and inference. In Section 3 we describe how to employ the estimated models for prediction. Section 4 describes the Monte Carlo experiments. In Section 5, we briefly illustrate the new approach. Finally, Section 6 concludes the paper.

2. The model and assumptions
2.1. Estimation

In this section we introduce estimation of the quantile regression dynamic panel data instrumental variables (QRPIV) estimator that includes individual specific FE. Consider the classical dynamic model for panel data with individual fixed effects3

y_{it} = \eta_i + \alpha y_{it-1} + x'_{it}\beta + u_{it}, \quad i = 1, …, N;\ t = 1, …, T,  (1)

where yit is the response variable, ηi denotes the individual FE, yit−1 is the lag of the response variable, xit is a p-vector of exogenous covariates, and uit is the innovation term. It is possible to write model (1) in a more concise matrix form as

y = Z\eta + \alpha y_{-1} + X\beta + u,  (2)

so that Z = I_N ⊗ ι_T, ι_T is a T × 1 vector of ones, and η = (η1, …, ηN)′ is the N × 1 vector of individual specific effects or intercepts. Note that Z represents an incidence matrix that identifies the N distinct individuals in the sample. The analogous version of model (1) for the τth conditional quantile function of the response of the tth observation on the ith individual, yit, can be represented as

Q_{y_{it}}(\tau \mid y_{it-1}, x_{it}) = \eta_i + \alpha(\tau) y_{it-1} + x'_{it}\beta(\tau).  (3)
In model (3) only the effects of the covariates (yit−1, xit) are allowed to depend upon the quantile, τ, of interest. The η's are intended to capture some individual specific source of variability, or ''unobserved heterogeneity,'' that was not adequately controlled for by other covariates. In most applications the time series dimension T is relatively small compared to the number of individuals N. Therefore, it might be difficult to estimate a τ-dependent distributional individual effect, and we restrict the estimates of the individual specific effects to be independent of τ across the quantiles. The restriction can be implemented by estimating the model for several quantiles simultaneously. Koenker (2004) introduced a general approach to estimation of quantile regression FE models for panel data. Applying this principle to Eq. (3), one would solve

(\hat\eta, \hat\alpha, \hat\beta) = \arg\min_{\eta,\alpha,\beta} \sum_{k=1}^{K}\sum_{i=1}^{N}\sum_{t=1}^{T} \upsilon_k\, \rho_{\tau_k}(y_{it} - \eta_i - \alpha(\tau_k) y_{it-1} - x'_{it}\beta(\tau_k)),

where ρτ(u) := u(τ − I(u < 0)) as in Koenker and Bassett (1978), and υk are the weights that control the relative influence of the K quantiles {τ1, …, τK} on the estimation of the ηi. However, we will see that the quantile regression FE estimator, as in the OLS case, is biased in the presence of lagged dependent variables as regressors. In least squares estimation of dynamic panel models it is evident that the unobserved initial values of the dynamic process induce a bias.4 For long panels, the effect associated with the initial conditions is seen to be O(T⁻¹) and therefore negligible. Later we evaluate the dynamic bias in the within-group and quantile regression FE estimators by means of Monte Carlo simulation. We find that the quantile regression FE estimator suffers from similar bias effects to those seen in the least squares case when T is moderate. Anderson and Hsiao (1981, 1982) and Arellano and Bond (1991), in the linear regression case, show that IV methods are able to produce consistent estimators for dynamic panel data models that are independent of the initial conditions. These estimators are based on the idea that lagged (or lagged differences of) the regressors are correlated with the included regressor but are uncorrelated with the innovations. Thus, valid instruments, wit, are available from inside the model and can be used to estimate the parameters of interest by IV methods. In this paper we use analogous reasoning for the construction of instruments. The problem of bias for the dynamic panel quantile regression can be ameliorated through the use of instrumental variables, w, that affect the determination of lagged y but are independent of

3 To simplify the presentation we focus on first-order autoregressive processes, since the main insights generalize in a simple way to higher-order cases.

4 See Hsiao (2003) and Arellano (2003) for more details. Nickell (1981) provides analytical calculations for the bias in the within-group estimator in a linear dynamic panel model.
innovations. Following Chernozhukov and Hansen (2006, 2008), and assuming the availability of instrumental variables, wit , we consider estimators defined as
\hat\alpha = \arg\min_{\alpha} \|\hat\gamma(\alpha)\|_A,  (4)

where

(\hat\eta(\alpha), \hat\beta(\alpha), \hat\gamma(\alpha)) = \arg\min_{\eta,\beta,\gamma} \sum_{k=1}^{K}\sum_{i=1}^{N}\sum_{t=1}^{T} \upsilon_k\, \rho_{\tau_k}(y_{it} - \eta_i - \alpha(\tau_k) y_{it-1} - x'_{it}\beta(\tau_k) - w'_{it}\gamma(\tau_k)),  (5)

with \|x\|_A = \sqrt{x'Ax}, and A is a positive definite matrix. Our final parameter estimates of interest are thus

\hat\theta(\tau) \equiv (\hat\alpha(\tau), \hat\beta(\tau)) \equiv (\hat\alpha(\tau), \hat\beta(\hat\alpha(\tau), \tau)).  (6)
The intuition underlying the estimator is that, since w is a valid instrument and is independent of u, it should have a zero coefficient. The estimator (6) finds parameter values for α and β through the inverse step (4) such that the value of the coefficient γ(α, τ) on w in the ordinary QR step (5) is driven as close to zero as possible. Hence, by minimizing the coefficient of the variable wit one can recover the estimator of α. Therefore, the bias generated by inclusion of yit−1 in Eq. (3) is reduced through the presence of IV, wit, that affect the determination of yit−1 but are independent of uit. Values of y lagged (or differenced) two periods or more and/or lags of the exogenous variable x affect the determination of lagged y but are independent of u, so they can potentially be used as instruments to estimate α and β by the QRPIV method. As suggested by Chernozhukov and Hansen (2008), in practice, a simple procedure is to let the instrument be either wit itself or the predicted value from a least squares projection of lagged y on w and x. The implementation of the QRPIV procedure is straightforward. Define the objective function

R_{NT}(\tau, \eta, \alpha, \beta, \gamma) := \sum_{k=1}^{K}\sum_{i=1}^{N}\sum_{t=1}^{T} \upsilon_k\, \rho_{\tau_k}(y_{it} - \eta_i - \alpha(\tau_k) y_{it-1} - x'_{it}\beta(\tau_k) - w'_{it}\gamma(\tau_k)),

where yit−1 is, in general, a dim(α)-vector of endogenous variables, ηi are the FE, xit is a dim(β)-vector of exogenous explanatory variables, and wit is a dim(γ)-vector of IV such that dim(γ) ≥ dim(α). For the special case of K = 1, one can use a grid search as follows:

(1) For a given quantile of interest τ, define a grid of values {αj, j = 1, …, J; |α| < 1}, and run the ordinary τ-quantile regression of (yit − αj yit−1) on (zit, wit, xit) to obtain coefficients
\hat\eta(\alpha_j, \tau), \hat\beta(\alpha_j, \tau) and \hat\gamma(\alpha_j, \tau); that is, for a given value of the autoregression structural parameter, say α, one estimates the ordinary panel QR to obtain

(\hat\eta_i(\alpha_j, \tau), \hat\beta(\alpha_j, \tau), \hat\gamma(\alpha_j, \tau)) := \arg\min_{\eta_i,\beta,\gamma} R_{NT}(\tau, \eta_i, \alpha_j, \beta, \gamma).

(2) To find an estimate for α(τ), choose \hat\alpha(\tau) as the value among {αj, j = 1, …, J} that makes \|\hat\gamma(\alpha_j, \tau)\| closest to zero. Formally, let

\hat\alpha(\tau) = \arg\min_{\alpha \in \mathcal{A}} [\hat\gamma(\alpha, \tau)]'\, \hat{A}(\tau)\, [\hat\gamma(\alpha, \tau)],

where A is a positive definite matrix. The estimate \hat\beta(\tau) is then given by \hat\beta(\hat\alpha(\tau), \tau), which leads to the estimates \hat\theta(\tau) = (\hat\alpha(\tau), \hat\beta(\tau)) = (\hat\alpha(\tau), \hat\beta(\hat\alpha(\tau), \tau)). For the case of K > 1, the optimization problem is very large, depending on the number of estimated quantiles. Therefore, instead of using a grid search, we use a numerical optimization function in R. As starting values, we use the estimates from the QR FE model without any instruments. The design matrix is [υ ⊗ (I_N ⊗ ι_T) ⋮ Υ ⊗ y₋₁ ⋮ Υ ⊗ X ⋮ Υ ⊗ w], where I_N is an N × N identity matrix, ι_T is a T × 1 vector of ones, and Υ is a K × K diagonal matrix with the weights υ. The response vector is ỹ = (υ ⊗ y). As Koenker (2004) observes, in typical applications the design matrix of the full problem is very sparse, i.e., has mostly zero elements, and this considerably reduces the computational effort and memory requirements in large problems.
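As an illustration of steps (1)–(2), the following is a minimal Python sketch of the K = 1 grid-search estimator on simulated data. The paper's own implementation uses Matlab/R; statsmodels' QuantReg and all names here are our choices, the exogenous covariates x are omitted for brevity, and y_{it−2} serves as the instrument.

```python
import numpy as np
from statsmodels.regression.quantile_regression import QuantReg

def qrpiv_grid(y, y_lag, W, Z, tau, alphas):
    """K = 1 grid search: for each alpha_j, run the tau-quantile regression of
    (y - alpha_j * y_lag) on (Z, W) and keep the alpha_j whose instrument
    coefficient gamma is closest to zero (weighting matrix A = identity)."""
    design = np.column_stack([Z, W])
    norms = []
    for a in alphas:
        res = QuantReg(y - a * y_lag, design).fit(q=tau)
        gamma = res.params[Z.shape[1]:]           # coefficients on W
        norms.append(float(gamma @ gamma))
    return alphas[int(np.argmin(norms))]

# Simulated dynamic panel: y_it = eta_i + 0.5 * y_it-1 + u_it
rng = np.random.default_rng(0)
N, T, alpha0 = 50, 20, 0.5
eta = rng.normal(size=N)
y = np.zeros((N, T + 2))
for t in range(1, T + 2):
    y[:, t] = eta + alpha0 * y[:, t - 1] + rng.normal(size=N)
y_it, y_l1, y_l2 = y[:, 2:].ravel(), y[:, 1:-1].ravel(), y[:, :-2].ravel()
Z = np.kron(np.eye(N), np.ones((T, 1)))           # incidence matrix I_N (x) iota_T
alpha_hat = qrpiv_grid(y_it, y_l1, y_l2[:, None], Z, tau=0.5,
                       alphas=np.linspace(0.2, 0.8, 61))
print(alpha_hat)                                   # close to 0.5 at the median
```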
The existence of the parameter ηi, whose dimension N tends to infinity, raises some new issues for the asymptotic analysis of the proposed estimator. As first noted by Neyman and Scott (1948), leaving the individual heterogeneity unrestricted in a nonlinear or dynamic model generally results in inconsistent estimators of the common parameters due to the incidental parameters problem; that is, noise in the estimation of the fixed effects when the time dimension is short results in inconsistent estimates of the common parameters due to the nonlinearity of the problem. In this respect, QR panel data estimation suffers from this problem. Koenker (2004) overcomes this difficulty by using large N and T asymptotics. We impose the following regularity conditions:

A1. The yit are independent across individuals, covariance stationary, with conditional distribution functions Fit, and differentiable conditional densities, 0 < fit < ∞, with bounded derivatives f′it, for i = 1, …, N and t = 1, …, T.

A2. Let y = (yit), Z = IN ⊗ ιT with ιT a T-vector of ones, y−1 = (yit−1) an NT × dim(α) matrix, X = (xit) an NT × dim(β) matrix, and W = (wit) an NT × dim(γ) matrix. For

\Pi(\eta, \alpha, \beta, \tau) := E[\upsilon(\tau - 1(y < Z\eta + y_{-1}\alpha + X\beta))\check{X}(\tau)],
\Pi(\eta, \alpha, \beta, \gamma, \tau) := E[\upsilon(\tau - 1(y < Z\eta + y_{-1}\alpha + X\beta + W\gamma))\check{X}(\tau)],
\check{X}(\tau) := [Z, W, X]',

the Jacobian matrices \partial\Pi(\eta, \alpha, \beta, \tau)/\partial(\eta, \alpha, \beta) and \partial\Pi(\eta, \alpha, \beta, \gamma, \tau)/\partial(\eta, \beta, \gamma) are continuous and have full rank uniformly over E × A × B × G × T. The parameter space E × A × B is a connected set. Moreover, the image of E × A × B under the map (η, α, β) → Π(η, α, β, τ) is simply connected.

A3. Denote Φ(τk) = diag(fit(ξit(τk))), where ξit(τk) = ηi + α(τk)yit−1 + x′itβ(τk) + w′itγ(τk), MZk = I − PZk and PZk = Z(Z′Φ(τk)Z)⁻¹Z′Φ(τk). Let X̃ = [W′, X′]′. Then the following matrices are positive definite:

J_\vartheta = \lim_{N,T\to\infty} \frac{1}{NT} \begin{bmatrix} \upsilon_1 \tilde{X}' M'_{Z_1} \Phi(\tau_1) M_{Z_1} \tilde{X} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \upsilon_k \tilde{X}' M'_{Z_k} \Phi(\tau_k) M_{Z_k} \tilde{X} \end{bmatrix},

J_\alpha = \lim_{N,T\to\infty} \frac{1}{NT} \begin{bmatrix} \upsilon_1 \tilde{X}' M'_{Z_1} \Phi(\tau_1) M_{Z_1} y_{-1} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \upsilon_k \tilde{X}' M'_{Z_k} \Phi(\tau_k) M_{Z_k} y_{-1} \end{bmatrix},

and

S = \lim_{N,T\to\infty} \frac{1}{NT} \begin{bmatrix} \Sigma_{11} \tilde{X}' M'_{Z_1} M_{Z_1} \tilde{X} & \cdots & \Sigma_{1k} \tilde{X}' M'_{Z_1} M_{Z_k} \tilde{X} \\ \vdots & \ddots & \vdots \\ \Sigma_{k1} \tilde{X}' M'_{Z_k} M_{Z_1} \tilde{X} & \cdots & \Sigma_{kk} \tilde{X}' M'_{Z_k} M_{Z_k} \tilde{X} \end{bmatrix},

where Σij = υi(τi ∧ τj − τiτj)υj. Now define [J̄′β, J̄′γ]′ as a partition of J⁻¹ϑ, and H = J̄′γ A[α(τk)] J̄γ. Then Jϑ is invertible, and J′α H Jα is also invertible.
A4. For all τ ∈ T = [c, 1 − c] with c ∈ (0, 1/2), (α(τ), β(τ)) ∈ int A × B, and A × B is compact and convex.

A5. max_it ‖yit‖ = O(√NT); max_it ‖xit‖ = O(√NT); max_it ‖wit‖ = O(√NT).

A6. T → ∞ as N → ∞, and N^a/T → 0 for some a > 0.

Condition A1 is a standard assumption in the quantile regression literature and imposes a restriction on the density function of yit. Condition A2 is important for identification of the parameters. The continuity and full rank conditions require that the instrument W impacts the conditional distribution of y at many relevant points. Assumption A3 states conditions for the matrices that guarantee asymptotic normality. A4 imposes compactness on the parameter space of α(τ). Such an assumption is needed since the objective function is not convex in α. Assumption A5 imposes bounds on the variables. Finally, condition A6 is the same assumption as in Koenker (2004) and Lamarche (2010). To further comment on the nature of the correlation between y−1 and W required by A2, note that, for a given quantile τ, by A1 we have that

\partial E[(\tau - 1(y < Z\eta + y_{-1}\alpha + X\beta))\check{X}(\tau)]/\partial(\eta, \alpha, \beta) = E[(Z', W', X')'\, \Phi\, (Z', y'_{-1}, X')].

Hence, the Jacobian in A2 takes the form of a density-weighted covariance matrix for the Z, y−1 and W variables, and A2 requires that this matrix have full rank. In addition, A2 imposes that global identifiability must hold; hence, the impact of W should be rich enough to guarantee that the equations are solved uniquely. We can now establish consistency and asymptotic normality of the estimator. Proofs appear in the Appendix.

Theorem 1. Given assumptions A1–A6, (η, α(τ), β(τ)) uniquely solves the equations E[υψ(y − Zη − y−1α − Xβ)X̌(τ)] = 0 over E × A × B, and θ(τ) = (α(τ), β(τ)) is consistently estimable.

The limiting distribution of the parameters of interest is given in Theorem 2.

Theorem 2. Under conditions A1–A6, for a given τ ∈ (0, 1), θ̂ converges to a Gaussian distribution:
√
d
NT (θˆ (τ ) − θ (τ )) → N (0, Ω (τ )), ′ ′
′
′
= (min(τ , τ ′ ) − τ τ ′ )E (VV ′ ), V = (υ1 X˜ ′ MZ′ 1 , . . . , ′ ′ ′ ˜ υk X MZk ) , K = (Jα′ HJα )−1 Jα H, H = J¯γ′ A[α(τ )]J¯γ , L = J¯β M, M = I − Jα K , [J¯β , J¯γ ] is a partition of Jϑ , Φ (τk ) = diag (fit (ξit (τk ))), X˜ = [W ′ , X ′ ]′ , and Jϑ and Jα are as defined in assumption A3. where S
Remark 1. When dim(γ ) = dim(α), the choice of A(α) does not affect the asymptotic variance. As in Chernozhukov and Hansen (2008), when dim(γ ) > dim(α), the choice of the weighting matrix A(α) generally matters, and it is important for efficiency. A natural choice for A(α) is given by the inverse of the covariance matrix of γˆ (α(τ ), τ ). Noticing that A(α) is equal to (J¯γ S J¯γ )−1 at α(τ ), it
√
follows that the asymptotic variance of by Ωα = (Jα′ J¯γ′ (J¯γ S J¯γ′ )−1 J¯γ J¯α )−1 .
NT (α(τ ˆ )−α(τ )) is given
The components of the asymptotic variance matrix that need to be estimated include Jϑ , Jα and S. The matrix S can be estimated by its sample counterpart Sˆ (τ , τ ′ ) = (min(τ , τ ′ ) − τ τ ′ )
N − T −
2NThn i=1 t =1
I (|ˆu(τj )| ≤ hn )X˜ MZj MZ′ j X˜ ′
ˆ j ) and hn is an appropriately where uˆ (τj ) ≡ y − Z η− ˆ α(τ ˆ j )y−1 −X β(τ chosen bandwidth. The estimator of Jˆαj is analogous to Jˆϑj . Using
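For concreteness, the following Python sketch illustrates the grid-search step of the estimation procedure and the sample counterparts Ŝ and Ĵ just described. It is a minimal sketch under stated assumptions: `QuantReg` from statsmodels is used as the inner quantile regression routine, the fixed effects are absorbed into the design for simplicity, and all array names (`y`, `y_lag`, `X`, `W`, `V_hat`, `u_hat`, `X_proj`) are illustrative inputs rather than objects defined in the paper.

```python
# Minimal sketch: grid search for alpha_hat(tau) and sample counterparts of
# S and the kernel Jacobian element; illustrative, not the paper's code.
import numpy as np
from statsmodels.regression.quantile_regression import QuantReg

def gamma_hat(alpha_j, tau, y, y_lag, X, W):
    """QR of y - alpha_j * y_lag on (intercept, X, W); returns the
    coefficient block on the instruments W."""
    design = np.column_stack([np.ones(len(y)), X, W])
    fit = QuantReg(y - alpha_j * y_lag, design).fit(q=tau)
    return fit.params[1 + X.shape[1]:]

def alpha_grid_search(alpha_grid, tau, y, y_lag, X, W, A=None):
    """Step (2): pick the alpha_j making the weighted norm of gamma_hat smallest."""
    best, best_norm = None, np.inf
    for a in alpha_grid:
        g = gamma_hat(a, tau, y, y_lag, X, W)
        Amat = np.eye(len(g)) if A is None else A   # weighting matrix A(alpha)
        val = float(g @ Amat @ g)
        if val < best_norm:
            best, best_norm = a, val
    return best

def S_hat(tau, tau_p, V_hat):
    """S_hat(tau, tau') = (min(tau, tau') - tau*tau') * (1/NT) * sum V V'."""
    NT = V_hat.shape[0]
    return (min(tau, tau_p) - tau * tau_p) * (V_hat.T @ V_hat) / NT

def J_hat(u_hat, X_proj, upsilon_j, h_n):
    """Indicator-kernel Jacobian: upsilon_j/(2 NT h_n) * sum 1(|u|<=h_n) x x'."""
    NT = len(u_hat)
    w = (np.abs(u_hat) <= h_n).astype(float)
    return upsilon_j * (X_proj.T @ (w[:, None] * X_proj)) / (2.0 * NT * h_n)
```

In practice the individual effects η_i would be estimated jointly and the regressors projected as in the definitions above; the sketch only conveys the mechanics of the two-step procedure.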
2.2. Inference

We now turn to inference in the quantile regression dynamic panel instrumental variables (QRPIV) model and suggest Wald and Kolmogorov–Smirnov type tests for general linear hypotheses. In the independent and identically distributed setting, the conditional quantile functions of the response variable given the covariates are all parallel, implying that covariate effects shift the location of the response distribution but do not change its scale or shape. However, slope estimates often vary across quantiles, so it is important to test for equality of slopes across quantiles. Wald tests designed for this purpose were suggested by Koenker and Bassett (1982) and Koenker and Machado (1999). A wide variety of tests can be formulated using variants of the proposed Wald test, from simple tests on a single QR coefficient to joint tests involving many covariates and distinct quantiles at the same time.

The Wald process and associated limiting theory provide a natural foundation for the hypothesis Rθ(τ) = r when r is known. Gutenbrunner and Jureckova (1992) show that the QR process is tight, and thus the limiting variate viewed as a function of τ is a Brownian bridge over τ ∈ T.⁵ Therefore, under the linear hypothesis H₀: Rθ(τ) = r, and letting Γ(τ) = (K′, L′)′E[VV′](K′, L′),

$$V_{NT} = \sqrt{NT}\,[R\Gamma(\tau)R']^{-1/2}\,(R\hat\theta(\tau) - r) \Rightarrow B_q(\tau), \tag{7}$$

where B_q(τ) represents a q-dimensional standard Brownian bridge. For any fixed τ, B_q(τ) is N(0, τ(1 − τ)I_q). The normalized Euclidean norm of B_q(τ), Q_q(τ) = ‖B_q(τ)‖/√(τ(1 − τ)), is generally referred to as a Kiefer process of order q. Thus, for given τ, the regression Wald process can be constructed as

$$W_{NT} = NT\,(R\hat\theta(\tau) - r)'\,[R\hat\Omega(\tau)R']^{-1}\,(R\hat\theta(\tau) - r), \tag{8}$$

where Ω̂ is a consistent estimator of Ω, which is given in Theorem 2. If we are interested in testing Rθ(τ) = r at a particular quantile τ = τ₀, a chi-square test can be conducted based on the statistic W_{NT}(τ₀). Under H₀, given a consistent estimate Ω̂(τ), the statistic W_{NT} is asymptotically χ²_q distributed, where q is the rank of the matrix R.

⁵ In a related result, Wei and He (2006) establish tightness of the QR process in the longitudinal data context with increasing parameter dimension; see, for instance, their Lemma 8.4.
The limiting distribution of the test is summarized as follows.

Theorem 3 (Wald Test). Under H₀: Rθ(τ) = r and conditions A1–A6, for fixed τ,

$$W_{NT}(\tau) \stackrel{a}{\sim} \chi^2_q.$$

Proof. The proof of Theorem 3 follows directly from Eq. (8) and Theorem 2.

It is important to note that, given a consistent estimate of the variance–covariance matrix, one can easily construct confidence interval estimates for the parameters of interest based on the asymptotic approximation given in Theorem 2, or by inverting the above Wald test.
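A hedged sketch of the Wald statistic of Eq. (8) and its chi-square p-value follows; `theta_hat` and `Omega_hat` are assumed to come from the estimation step above and are hypothetical inputs, not objects produced by the paper's code.

```python
# Sketch: Wald statistic of Eq. (8) and chi-square p-value for H0: R theta = r.
import numpy as np
from scipy import stats

def wald_test(theta_hat, Omega_hat, R, r, NT):
    diff = R @ theta_hat - r
    middle = np.linalg.inv(R @ Omega_hat @ R.T)
    W = NT * diff @ middle @ diff
    q = np.linalg.matrix_rank(R)                 # degrees of freedom
    p_value = 1.0 - stats.chi2.cdf(W, df=q)
    return W, p_value
```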
More general hypotheses are also easily accommodated by the Wald approach. Let ζ = (θ(τ₁)′, …, θ(τ_m)′)′ and define the null hypothesis as H₀: Rζ = r. The test statistic is the same Wald statistic as in Eq. (8). In this case Ω is the matrix with (k, l)th block

$$\Omega(\tau_k, \tau_l) = (K'(\tau_k), L'(\tau_k))'\, S(\tau_k, \tau_l)\, (K'(\tau_l), L'(\tau_l)),$$

where S(τ_k, τ_l) is as defined in Theorem 2. The statistic W_{NT} is still asymptotically χ²_q under H₀, where q is the rank of the matrix R. This formulation accommodates a wide variety of testing situations, from a simple test on a single QR coefficient to joint tests involving several covariates and distinct quantiles.

Another important class of tests involves Kolmogorov–Smirnov (KS) type tests, where the goal is to examine a property of the estimator over a range of quantiles τ ∈ T. Thus, if one is interested in testing Rθ(τ) = r over τ ∈ T, one may consider the KS type sup-Wald test. Following Koenker and Xiao (2006), we may construct a KS type test as

$$W^{KS}_{NT} = \sup_{\tau\in T} W_{NT}(\tau). \tag{9}$$

The limiting distribution of the KS test is given by

Theorem 4 (Kolmogorov–Smirnov Test). Under H₀ and conditions A1–A6,

$$W^{KS}_{NT} = \sup_{\tau\in T} W_{NT}(\tau) \Rightarrow \sup_{\tau\in T} Q^2_q(\tau).$$

Proof. The proof of Theorem 4 follows directly from the continuous mapping theorem and Eqs. (7) and (9).

Critical values for sup_{τ∈T} Q²_q(τ) have been tabulated by Andrews (1993).
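A sketch of the sup-Wald statistic of Eq. (9), computed over a grid of quantiles, is given below; `estimate_at` is a hypothetical callable returning (θ̂(τ), Ω̂(τ)) and stands in for the QRPIV fitting step, and the grid is an assumption of the sketch.

```python
# Sketch: KS-type sup-Wald statistic of Eq. (9) over a quantile grid.
import numpy as np

def sup_wald(estimate_at, R, r, NT, taus=np.arange(0.10, 0.91, 0.05)):
    """estimate_at(tau) -> (theta_hat, Omega_hat); returns sup_tau W_NT(tau)."""
    values = []
    for tau in taus:
        theta_hat, Omega_hat = estimate_at(tau)
        diff = R @ theta_hat - r
        W = NT * diff @ np.linalg.inv(R @ Omega_hat @ R.T) @ diff
        values.append(W)
    return max(values)   # compare with the Andrews (1993) critical values
```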
3. Prediction

In this section, we propose to use the quantile regression dynamic panel data instrumental variables model for prediction. We first suggest a procedure that uses the quantiles of the empirical distribution of the forecasts to construct prediction intervals. The QR model has a significant advantage over models based on the conditional mean: it is less sensitive to the tail behavior of the underlying random variables representing the forecasting variable of interest, and consequently less sensitive to observed outliers. Moreover, because of the heterogeneous nature of most variables of interest in economics, prediction using QR techniques is an important tool for applied work. Second, we construct conditional density forecasts of the variable of interest under weak assumptions; such conditional density forecasts can be based on an ensemble of the forecast paths.

The literature on prediction using QR is still growing. Koenker and Bassett (2010) adapt QR methods to the paired comparison framework; their model is capable of delivering estimates of the predictive conditional joint density. Giacomini and Komunjer (2005) construct a conditional quantile forecast encompassing test for the evaluation and combination of conditional quantile forecasts. Zhou and Portnoy (1996) suggest direct and studentization methods to construct confidence and prediction intervals for conditional quantile functions; in a companion paper, Zhou and Portnoy (1998) extend these results to a class of heteroscedastic linear models. Koenker and Zhao (1996) show that QR offers a natural approach to the construction of prediction intervals for ARCH-type models. Taylor and Bunn (1999) present an alternative QR approach for the construction of predictive distributions and intervals.
There is a substantial literature on forecasting with dynamic panel data. Most of it is based on models for the conditional mean, with the slope coefficients being either homogeneous or individually heterogeneous. Baltagi and Griffin (1997), Hsiao and Tahmiscioglu (1997), Baltagi et al. (2000), and Hoogstrate et al. (2000) present empirical applications and discuss and compare the forecast performance of homogeneous and individually heterogeneous slope estimators.⁶ However, as Baltagi (2008) observes, the consistent finding is that models with homogeneous slope coefficients perform better in forecasting, mostly due to their simplicity, parsimonious representation, and stability of parameter estimates.

An important application of dynamic panel data QR models is out-of-sample prediction. In many situations it is important to construct a prediction interval for a future response variable. The QRPIV model is able to examine how covariates influence the location, scale, and shape of the entire response distribution. Most econometric forecasting methods using panels have focused on models for the conditional mean under Gaussian conditions; the dynamic panel data QR models described above offer an opportunity to significantly expand the scope of forecasting applications. There are several potential applications: for instance, the model might be useful to predict the dynamic demand for electricity, natural gas, or cigarettes across US states. The model can also be used to forecast output growth, investment (based on Tobin's q theory), or exchange rates. The scope of potential examples is substantial.

One-step-ahead prediction of the quantile function of y_it for an individual i is immediately available from model (3). From a date T at which the information is available, using the estimated model, it is possible to generate a forecast for T + 1 as
$$\hat Q_{y_{iT+1}}(\tau \mid y_{iT}, x_{iT+1}) = \hat\eta_i + \hat\alpha(\tau)\,y_{iT} + x_{iT+1}'\hat\beta(\tau), \tag{10}$$

where the predictors (y_iT, x_{iT+1}) are known. Moreover, based on the results of the previous sections, for a fixed quantile τ, the one-step-ahead predictor follows a normal distribution. Let X̌ = [Z′, W′, X′]′, δ(τ) = (η, α, β)′, and define the standardization matrix D_NT = diag(√T, √NT, √NT). From a simple extension of the results in Theorem 2, we have that D_NT(δ̂(τ) − δ(τ)) →d N(0, Σ(τ)), where Σ(τ) = J(τ)^{-1}S(τ)J(τ)^{-1}, with J = E(X̌Φ[Z, y₋₁, X]) and S = τ(1 − τ)E[X̌X̌′]. Now, let Ẋ = [1, y_T, x_{T+1}]′ and note that Q̂_{y_{iT+1}}(τ) = Ẋ′δ̂(τ), such that (Q̂_{y_{iT+1}}(τ) − Q_{y_{iT+1}}(τ)) = Ẋ′(δ̂(τ) − δ(τ)) ∼a N(0, Ω(τ)), where Ω(τ) = Ẋ′D_NT^{-1}Σ(τ)D_NT^{-1}Ẋ. Given this distributional property, we can construct an interval for predictive quantiles.
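A minimal sketch of this construction, formalized in Corollary 1 below, is as follows; `eta_i`, `alpha_t`, `beta_t`, `Sigma_hat`, and `D_NT` are assumed to come from the estimation step (hypothetical inputs), and the shapes of `x_iT1` and `beta_t` must conform with the 3-block standardization above.

```python
# Sketch: one-step-ahead quantile forecast (Eq. (10)) and its
# normal-approximation interval as in Corollary 1 below.
import numpy as np
from scipy import stats

def one_step_interval(eta_i, alpha_t, beta_t, y_iT, x_iT1,
                      Sigma_hat, D_NT, lam=0.05):
    """Returns Q_hat(tau) and a (1 - 2*lam) interval around it."""
    x_dot = np.concatenate([[1.0, y_iT], np.atleast_1d(x_iT1)])
    Q_hat = eta_i + alpha_t * y_iT + np.atleast_1d(x_iT1) @ np.atleast_1d(beta_t)
    Dinv = np.linalg.inv(D_NT)
    Omega = x_dot @ Dinv @ Sigma_hat @ Dinv @ x_dot   # scalar variance
    a_NT = stats.norm.ppf(1 - lam) * np.sqrt(Omega)
    return Q_hat, (Q_hat - a_NT, Q_hat + a_NT)
```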
Corollary 1. Under the assumptions of Theorem 2, the (1 − 2λ)100% interval for the one-step-ahead conditional quantile is of the form I = (Q̂_{y_{iT+1}}(τ) − a_NT, Q̂_{y_{iT+1}}(τ) + a_NT), where a_NT = z_λΩ(τ)^{1/2} and z_λ denotes the (1 − λ)th standard normal percentile point.

The proof of the corollary is similar to that of Theorem 2, and it follows directly from the asymptotic normality of the estimator.

In many applications it is important to establish confidence intervals for s-step-ahead forecasts. However, deriving the distribution of the s-step-ahead predictor is a more difficult task.⁷ To solve this problem, Schmidt (1974), in a least squares dynamic model context, proposes to iterate the model and use the delta method to derive the distribution of the multi-period forecast. More recently, Chevillon and Hendry (2005) also apply iteration and the delta method to obtain the asymptotic distribution of iterated multi-step estimators for forecasting.⁸ The QR analogue of this technique could be applied to the τth conditional quantile estimate based on the s-step-ahead iteration of the QR model, and one could use the asymptotic result stated previously along with the delta method to derive the asymptotic distribution of the estimator, δ̂(τ)ˢ, and consequently of the conditional quantiles. However, when constructing an s-step-ahead forecast for a future response variable using QR models, it is important to consider all the potential trajectories of the corresponding quantiles rather than a fixed τ. Thus, as Koenker and Zhao (1996) suggest, for QR models a simulation based approach is a suitable tool for s-step-ahead forecasting; the approach we propose is based on this method. Let Q̂_{y_{iT+s}}(τ|·) denote the conditional quantile function of y_{iT+s} given the information up to time T + s − 1, for a given individual i. A draw from the one-step-ahead forecast distribution is given by ŷ_{iT+s} = Q̂_{y_{iT+s}}(U|·), where U is a uniformly distributed random variable on [0, 1]. Finally, we are able to construct a predictive interval by drawing repeatedly from the distribution of the forecasts and taking the selected quantiles of the empirical distribution. A number of other numerical approaches to s-step-ahead forecasting are available in the literature. Lin and Granger (1994) describe several numerical techniques for two-step forecasting with nonlinear models.⁹ In addition, Peters and Freedman (1985) show that using the bootstrap to construct multi-period forecasts is more reliable than asymptotics based on the delta method. More specifically, quantile regression offers an alternative approach to the construction of s-step-ahead prediction intervals. It is important to note that, if the parameters of the model were known exactly, the conditional quantile function itself could be used, and the interval
$$[Q_{y_{iT+s}}(\lambda/2),\ Q_{y_{iT+s}}(1-\lambda/2)]$$

would provide an exact 1 − λ level interval for an s-step-ahead forecast. The methods proposed by Koenker and Zhao (1996) and Zhou and Portnoy (1996) suggest the construction of a 1 − λ level interval for an s-step-ahead forecast as

$$[\hat Q_{y_{iT+s}}(\lambda/2 - h_n),\ \hat Q_{y_{iT+s}}(1-\lambda/2 + h_n)],$$

where h_n → 0 accounts for parameter uncertainty.¹⁰ However, an important practical aspect of constructing such a forecast interval involves computing Q̂_{y_{iT+s}}(·). This is straightforward in the one-step-ahead case, as shown in Eq. (10), but more problematic for s > 1. Geweke (1989) discusses a Bayesian approach to constructing predictive densities for ARCH linear models by means of Monte Carlo integration with importance sampling. Koenker and Zhao (1996) suggest a related method, based on a simple simulation approach, for the construction of out-of-sample predictions for quantile regression ARCH models.

A similar approach seems reasonable for the QRPIV problem. Let Q_{y_it}(τ | y_{it−1}, x_it) denote the conditional quantile function of y_it, for an individual i, given the information up to time t − 1. A draw from the s-step-ahead forecast distribution is given by

$$\hat y_{iT+s} = \hat Q_{y_{iT+s}}(U \mid y_{iT+s-1}, x_{iT+s}) = \hat\eta_i + \hat\alpha(U)\,\tilde y_{iT+s-1} + x_{iT+s}'\hat\beta(U), \tag{11}$$

where U is a uniformly distributed random variable on [0, 1], x_{iT+s} is given, and

$$\tilde y_{it} = \begin{cases} y_{it} & \text{if } t \le T,\\ \hat y_{it} & \text{if } t > T. \end{cases}$$

Applying (11) recursively we can compute a sample path of forecasts (ŷ_{iT+1}, ŷ_{iT+2}, …, ŷ_{iT+s})′. Repeatedly applying this procedure, say R times, one can compute the λ/2 − h_n and 1 − λ/2 + h_n quantiles of the empirical distribution of the forecasts and use them to construct the final prediction intervals.¹¹ It may appear that the use of (11) is computationally prohibitive, because it appears to require a quantile regression estimate for each possible realization of U ∈ (0, 1). However, the entire function Q_{y_iT}(τ|·) is easily computed by standard parametric linear programming techniques, yielding a piecewise constant function on a known grid that is then readily evaluated in the forecasting simulation.

The QRPIV model can also be used to estimate and predict the conditional density function. Consider, for instance, the one-step-ahead forecast. Given the parameter estimates, the τth conditional quantile function of y_{iT+1} for an individual i can be estimated by (10). Thus, given a family of estimated conditional quantile functions for an individual i, the conditional density of y_{iT+1} at values of the conditioning covariate can be estimated by the difference quotients

$$\hat f_{y_{iT+1}}(\tau \mid y_{iT}, x_{iT+1}) = \frac{2h_n}{\hat Q_{y_{iT+1}}(\tau + h_n \mid y_{iT}, x_{iT+1}) - \hat Q_{y_{iT+1}}(\tau - h_n \mid y_{iT}, x_{iT+1})}, \tag{12}$$

where h_n is a bandwidth which tends to zero as the sample size tends to infinity. Using the same principle, the s-step-ahead conditional density function is immediately available from (12).

Koenker and Bassett (2010) suggest an alternative use of QR methods to construct predictive densities. As described previously, given the estimated quantile regression model (11), it is possible to draw uniform random variables and evaluate the model. Repeated evaluations, say R times, yield a predictive distribution, and the density can then be estimated using standard kernel methods. This simulation method of producing predictive densities is closely related to the ''rearrangement'' methods for the monotonization of conditional quantile estimates introduced recently by Chernozhukov et al. (2010). Hence, the conditional density forecasts s steps ahead, for individual i, can be constructed from an ensemble of such forecast paths using the QRPIV model. In practice, the simulation method described previously produces R prediction samples of the conditional quantile function of interest for period T + s; one can then apply any standard nonparametric density estimator to predict the conditional density function s steps ahead. A similar method of constructing density forecasts from quantile regressions is proposed by Gaglianone and Lima (2009), who propose a model that generates an optimal individual forecast as a function of a common factor, the consensus forecast, and an idiosyncratic component that depends only on the individual loss function. The conditional density of the s-step-ahead forecast can be estimated from the conditional quantile functions,

$$\hat f_{y_{iT+s}}(\tau \mid \cdot) = \frac{\tau_k - \tau_{k-1}}{\hat Q_{y_{iT+s}}(\tau_k \mid \cdot) - \hat Q_{y_{iT+s}}(\tau_{k-1} \mid \cdot)},$$

for some appropriately chosen sequence of τ's. The conditional densities can also be estimated using the Epanechnikov kernel.
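The simulation scheme of Eq. (11) can be sketched as follows; `coef_at` is a hypothetical lookup returning (η̂_i, α̂(u), β̂(u)) from the piecewise-constant estimated quantile process, and all inputs are illustrative rather than objects from the paper's code.

```python
# Sketch: simulation-based s-step-ahead forecasting as in Eq. (11): draw
# U ~ Uniform(0,1), evaluate the estimated quantile function, iterate, and
# read prediction intervals off the empirical distribution of the paths.
import numpy as np

rng = np.random.default_rng(0)

def forecast_paths(coef_at, y_iT, x_future, s, R=500):
    """x_future: regressor vectors for T+1,...,T+s; returns an (R, s) array."""
    paths = np.empty((R, s))
    for r in range(R):
        y_prev = y_iT
        for j in range(s):
            u = rng.uniform()
            eta_i, alpha_u, beta_u = coef_at(u)
            y_prev = eta_i + alpha_u * y_prev + np.dot(x_future[j], beta_u)
            paths[r, j] = y_prev
    return paths

def prediction_interval(paths, lam=0.10, h_n=0.0):
    """Empirical (lam/2 - h_n, 1 - lam/2 + h_n) quantiles at horizon s;
    h_n is assumed small enough that lam/2 - h_n stays positive."""
    last = paths[:, -1]
    return (np.quantile(last, lam / 2 - h_n),
            np.quantile(last, 1 - lam / 2 + h_n))
```

A kernel density estimate of `paths[:, -1]` (for instance via `scipy.stats.gaussian_kde`) then gives the simulated s-step-ahead predictive density discussed above.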
6 See, e.g., Baltagi (2008) for a survey on forecasting with panel data and a more complete list of references.
7 Elliott and Timmermann (2008) discuss and illustrate multi-step forecasting in linear models.
8 When deriving the distribution of the s-step-ahead predictor for simple autoregressive and moving average linear regression models, it is customary to impose the assumption of normality of the innovations. In this case, it is relatively straightforward to derive the probability limit of the forecasts at any lead time (see, e.g., Box et al. (2008)). For QR we want to avoid specifying the distribution of the innovations.
9 In the QR literature, there is recent work based on numerical methods for predicting with dynamic models. Cai (2010) proposes forecasting methods for obtaining s-step-ahead predictive quantiles based on numerical and Monte Carlo methods, and Lee and Yang (2008) suggest bootstrap aggregating for multi-step forecast horizons in nonlinear models.
10 See Zhou and Portnoy (1996), Theorem 2.1, for a Bahadur representation, and Corollary 2.1 for asymptotic normality and confidence intervals.
11 We discuss the choice of h_n in the next section.
Table 1
Location-shift model: bias and RMSE of estimators for Normal (G) and t3 distributions (T = 10 and N = 50).

                      WG        OLS-IV    QRP                             QRPIV
                                          τ=0.25   τ=0.5    τ=0.75       τ=0.25   τ=0.5    τ=0.75
G   α=0.5   Bias    −0.0982   −0.0070   −0.1044  −0.0976  −0.0953      −0.0190  −0.0013   0.0097
            RMSE     0.103     0.082     0.113    0.109    0.107        0.093    0.089    0.088
    β=0.7   Bias     0.0349   −0.0006   −0.0019   0.0325   0.0683      −0.0158  −0.0015   0.0178
            RMSE     0.055     0.065     0.063    0.057    0.091        0.075    0.069    0.069
t3  α=0.5   Bias    −0.1432    0.0107   −0.0969  −0.1018  −0.0923      −0.0155  −0.0055   0.0056
            RMSE     0.153     0.151     0.109    0.114    0.106        0.098    0.095    0.094
    β=0.7   Bias     0.0635    0.0115   −0.0134   0.0364   0.0779      −0.0144  −0.0018   0.0166
            RMSE     0.106     0.117     0.074    0.065    0.111        0.083    0.073    0.092
4. Monte Carlo simulation

We use simulation experiments to assess the finite sample performance of the quantile regression estimator discussed in the previous sections. Two simple versions of the basic model in Eq. (1) are considered. In the first, the exogenous covariate x_it exerts a pure location-shift effect; in the second, x_it exerts both location and scale effects. In the former case the response y_it is generated by the model

$$y_{it} = \eta_i + \alpha y_{it-1} + \beta x_{it} + u_{it},$$

while in the latter case

$$y_{it} = \eta_i + \alpha y_{it-1} + \beta x_{it} + (\gamma x_{it})u_{it}.$$

We employ two different schemes to generate the disturbances u_it. Under Scheme 1 we generate u_it as N(0, σ²_u), and under Scheme 2 we generate u_it as a t-distribution with 3 degrees of freedom. The regressor x_it is generated according to x_it = μ_i + ζ_it, where ζ_it follows an ARMA(1,1) process,

$$(1 - \phi L)\zeta_{it} = \epsilon_{it} + \theta\epsilon_{it-1},$$

and ϵ_it follows the same distribution as u_it, that is, normal and t₃ for Schemes 1 and 2, respectively. In all cases we set ζ_{i,−50} = 0 and generate ζ_it for t = −49, −48, …, T; this ensures that the results are not unduly influenced by the initial values of the x_it process. In generating y_it we also set y_{i,−50} = 0 and discard the first 50 observations, using the observations t = 0 through T for estimation. The fixed effects μ_i and η_i are generated as

$$\mu_i = e_{1i} + T^{-1}\sum_{t=1}^{T}\epsilon_{it}, \quad e_{1i} \sim N(0, \sigma^2_{e_1}); \qquad \eta_i = e_{2i} + T^{-1}\sum_{t=1}^{T} x_{it}, \quad e_{2i} \sim N(0, \sigma^2_{e_2}).$$

In the simulations, we consider T = {10, 20, 50} and N = {50, 100}. We set the number of replications to 2000 and consider the following values for the remaining parameters: (α, β) = (0.5, 0.7), φ = 0.7, θ = 0.2, γ = 0.5, σ²_u = σ²_{e1} = σ²_{e2} = 1.
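A minimal sketch of this data-generating process is given below, assuming standard normal components for e_{1i} and e_{2i} and approximating the averages in the fixed-effect equations by the estimation-sample means; `simulate_panel` and its defaults are illustrative, not the paper's code.

```python
# Sketch of the simulation design: ARMA(1,1) regressor, correlated fixed
# effects, and a 50-period burn-in, for Scheme 1 (Gaussian) or Scheme 2 (t3).
import numpy as np

rng = np.random.default_rng(42)

def simulate_panel(N=50, T=10, alpha=0.5, beta=0.7, gamma=0.0,
                   phi=0.7, theta=0.2, t3=False, burn=50):
    """Generate (y, x); columns t = 0..T are kept after the burn-in."""
    draw = (lambda s: rng.standard_t(3, s)) if t3 else rng.standard_normal
    S = burn + T + 1
    eps, u = draw((N, S)), draw((N, S))
    zeta = np.zeros((N, S))
    for t in range(1, S):                 # (1 - phi L) zeta = eps + theta eps_{-1}
        zeta[:, t] = phi * zeta[:, t - 1] + eps[:, t] + theta * eps[:, t - 1]
    mu = rng.standard_normal(N) + eps[:, burn + 1:].mean(axis=1)   # e_1i + mean eps
    x = mu[:, None] + zeta
    eta = rng.standard_normal(N) + x[:, burn + 1:].mean(axis=1)    # e_2i + mean x
    y = np.zeros((N, S))
    for t in range(1, S):
        mult = gamma * x[:, t] if gamma != 0 else 1.0   # location-scale if gamma>0
        y[:, t] = eta + alpha * y[:, t - 1] + beta * x[:, t] + mult * u[:, t]
    return y[:, burn:], x[:, burn:]
```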
4.1. Bias and RMSE

In this section, the bias and root mean squared error (RMSE) of different estimators are presented. We evaluate four estimators: the within-groups estimator (WG), the OLS instrumental variables estimator (OLS-IV), the fixed effects panel quantile regression (QRP) proposed by Koenker (2004), and the fixed effects quantile regression dynamic panel instrumental variables estimator (QRPIV) proposed in this paper. The quantile regression based estimators are analyzed for quantiles τ = (0.25, 0.5, 0.75). For OLS-IV and QRPIV we considered two different instruments, y_{it−2} and x_{it−1}; the results are similar in both cases, and we present results for the x_{it−1} case only. We also consider different sample sizes in the experiments; however, due to space limitations we only report results for T = 10, N = 50. The results for the other sample sizes are similar. Tables 1 and 2 present bias and RMSE results for estimates of the autoregression coefficient, α, and the exogenous variable coefficient, β, for the location-shift and the location-scale-shift models, respectively.

Table 1 collects results for the location-shift model and shows that, in the Gaussian case (G), the autoregressive coefficient is biased downward for WG, while OLS-IV is approximately unbiased. Likewise, in the presence of lagged variables, α̂ is biased downward for QRP, and the QRPIV estimator largely eliminates the bias. Table 1 also reveals that β̂ is slightly biased in the WG and QRP cases, and unbiased for both OLS-IV and QRPIV. The results for the t₃ distribution show that the autoregressive estimates of WG and QRP are biased downward, while QRPIV and OLS-IV are approximately unbiased for both coefficients. Regarding the RMSE, in the Gaussian case the OLS-based estimators perform better than the respective quantile regression estimators. In contrast, for the heavier-tailed t₃ distribution of innovations, the RMSEs of the quantile estimators are smaller than those of their respective OLS-based counterparts.

The results for the location-scale-shift model are presented in Table 2 and are qualitatively similar to those in Table 1: QRPIV is approximately unbiased, with improved precision relative to OLS-IV in the t₃ case. Table 2 shows that in both distributional cases the WG and QRP estimators are biased downward while OLS-IV and QRPIV are approximately unbiased. The RMSEs present the same features as in the previous case: under non-Gaussian, heavier-tailed (t₃) conditions, the quantile regression estimators perform better than the least squares based estimators.

Finally, we examine the bias of the autoregressive estimator as α varies. We estimate the location model for the Normal and t₃ distribution cases, varying α ∈ (0, 1). The results, which are essentially the same across the two cases, are presented in Fig. 1. They show that for values of α very close to zero there is a very small positive bias, which disappears for most values of α. However, as the autoregression coefficient increases toward unity, the bias can be large. This result is in line with the literature on dynamic panel data models for the conditional mean, where for α ≈ 1 the instruments constructed from lagged variables tend to be weak.¹²

12 See, e.g., Arellano and Bover (1995) and Ahn and Schmidt (1995) for more details.

4.2. Prediction
First we evaluate the performance of the prediction interval in terms of empirical coverage probabilities and empirical lengths. We set the nominal level 1 − λ equal to 0.90, N = 50, and vary the time series dimension over T = {10, 20, 50}. We follow the approach proposed by Zhou and Portnoy (1996) and set the constant

$$h_n = z_\alpha\sqrt{\tilde x' D^{-1}\tilde x\,\tau(1-\tau)/NT},$$

for fixed x̃, with D = (NT)^{-1}∑_{t=1}^{T}∑_{i=1}^{N} x̃_it x̃′_it and τ = 0.5. We compute predictive intervals for (y_iT, y_{iT+1}, y_{iT+2}, y_{iT+3}). Eq. (11) is computed recursively from the location and location-scale models. Moreover, the number of draws from the uniform distribution is R = 500, and x_{iT+1} is the last observation in the sample.
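A short sketch of this bandwidth rule and of the coverage evaluation used in the experiments is given below; the array names are illustrative inputs under the stated assumptions, not objects from the paper's code.

```python
# Sketch: Zhou-Portnoy-style bandwidth and empirical coverage of the
# simulated predictive intervals.
import numpy as np
from scipy import stats

def h_n(x_tilde, D, tau, NT, alpha=0.05):
    """h_n = z_alpha * sqrt(x' D^{-1} x * tau (1 - tau) / NT)."""
    z = stats.norm.ppf(1 - alpha)
    return z * np.sqrt(x_tilde @ np.linalg.inv(D) @ x_tilde * tau * (1 - tau) / NT)

def empirical_coverage(lower, upper, realized):
    """Share of realized future values falling inside their intervals."""
    return np.mean((realized >= lower) & (realized <= upper))
```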
Table 2
Location-scale-shift model: bias and RMSE of estimators for Normal (G) and t3 distributions (T = 10 and N = 50).

                      WG        OLS-IV    QRP                             QRPIV
                                          τ=0.25   τ=0.5    τ=0.75       τ=0.25   τ=0.5    τ=0.75
G   α=0.5   Bias    −0.0967    0.0019   −0.0939  −0.0959  −0.0987      −0.0073  −0.0003   0.0069
            RMSE     0.099     0.078     0.104    0.106    0.109        0.083    0.083    0.086
    β=0.7   Bias     0.0454    0.0003   −0.0337   0.0180   0.0699      −0.0053  −0.0018   0.0172
            RMSE     0.077     0.080     0.081    0.085    0.091        0.082    0.084    0.086
t3  α=0.5   Bias    −0.1244    0.0173   −0.0881  −0.0894  −0.0905      −0.0152  −0.0044   0.0042
            RMSE     0.139     0.132     0.103    0.107    0.109        0.079    0.085    0.086
    β=0.7   Bias     0.0727    0.0104   −0.0415   0.0238   0.0905      −0.0186   0.0011   0.0125
            RMSE     0.132     0.153     0.081    0.063    0.119        0.087    0.088    0.096
Table 3
Location-shift model: empirical coverage probabilities and length for the predictive interval with nominal level 0.90.

                Normal                              t3
                yT      yT+1    yT+2    yT+3        yT      yT+1    yT+2    yT+3
CP      T = 10  0.929   0.863   0.832   0.828       0.907   0.852   0.844   0.829
        T = 20  0.903   0.875   0.861   0.841       0.879   0.861   0.857   0.837
        T = 50  0.896   0.888   0.858   0.849       0.887   0.879   0.858   0.851
Length  T = 10  3.076   3.074   4.608   5.395       4.877   4.870   7.353   8.698
        T = 20  2.920   3.657   4.374   5.106       4.192   4.188   6.312   7.422
        T = 50  2.927   2.927   4.093   4.830       4.058   4.062   6.127   7.088
Table 4
Location-scale-shift model: empirical coverage probabilities and length for the predictive interval with nominal level 0.90.

                Normal                               t3
                yT      yT+1    yT+2    yT+3         yT       yT+1     yT+2     yT+3
CP      T = 10  0.915   0.854   0.849   0.832        0.922    0.825    0.848    0.848
        T = 20  0.866   0.858   0.843   0.834        0.929    0.865    0.879    0.853
        T = 50  0.892   0.864   0.856   0.850        0.918    0.873    0.874    0.860
Length  T = 10  5.963   5.967   8.972   10.603       10.620   10.611   15.872   17.801
        T = 20  5.239   5.237   7.872   9.268        10.939   10.799   13.331   16.627
        T = 50  5.140   5.138   7.683   8.969        8.471    9.671    13.761   14.579
Fig. 1. Bias of the autoregressive coefficient. The solid line shows the bias for the Normal distribution case; the dashed line indicates the bias for the t3 distribution case.
Since there are N intervals, we present results averaged over the individuals, but the results do not change for a fixed individual i. The results for both distributions are summarized in Tables 3 and 4 for the location and location-scale cases, respectively.

Table 3 presents empirical coverage probabilities and lengths for the prediction interval with nominal level 0.90 in the location case. The simulation results show that, for small time series dimensions and in-sample prediction, the interval is slightly conservative in terms of coverage probabilities; the conservativeness disappears as the sample size grows. In addition, the results present evidence that, for a given sample size, the proposed method tends to underestimate the coverage levels at longer forecasting horizons, although this phenomenon also disappears for larger sample sizes. As the time series dimension increases, the coverage probabilities improve with respect to the nominal level 0.90. The results for the empirical length of the prediction interval show that, holding the time series sample size constant, the interval is wider for longer prediction horizons, but the length decreases when we fix the forecast horizon and increase the sample size.

In Table 4 we present results for the location-scale model. The results are qualitatively similar to those in Table 3. The empirical coverage probabilities for both the Normal and t₃ cases are close to the nominal 90%; for all out-of-sample cases, the coverage increases with the sample size. Regarding the length of the interval, as in Table 3, it decreases as the sample size increases and increases with the prediction horizon.

Second, we study the finite sample performance of the predictive conditional density method. We estimate the one-step-ahead density and compare it with the true underlying density function from the location model. In this simulation we use a simple Gaussian kernel smoothing technique with the default bandwidth in R. Fig. 2 presents results for four random subjects in one simulation run. The solid (black) line is the true conditional density; the dashed (red) line indicates the predictive one-step-ahead conditional density estimated by the kernel method. Fig. 2 shows that the conditional density estimator closely follows the true density function and thus performs well in terms of predicting the conditional density function. We also estimated the models using the Epanechnikov kernel, as suggested by Gaglianone and Lima (2009); the results remain essentially the same.
Fig. 2. Predictive density function. The solid (black) line shows the true conditional density; the dashed (red) line indicates the one-step-ahead predictive conditional density estimate using the Gaussian kernel method.

5. Application

Aggregate output is probably the most popular macroeconomic variable when it comes to forecasting. We illustrate the proposed approach with an application to forecasting gross domestic product (GDP) growth rates for 18 OECD (Organization for Economic Cooperation and Development) countries. Panel data techniques have been frequently applied to forecast multi-country output growth rates, and the use of such models to generate forecasts of GDP growth rates is important because it is crucial to take into account time-invariant unobserved heterogeneity across countries (fixed effects). Pure cross-section studies cannot control for unobserved country effects, whereas pure time-series studies cannot control for unobserved changes occurring over time. In addition, the use of QR methods is important to capture heterogeneity in the model, to provide more flexibility by avoiding distributional assumptions on the innovations, and to produce robust estimates.

There are several studies on forecasting growth rates for OECD countries using dynamic panel data. In a study of real gross national product growth rates of nine OECD countries, Garcia-Ferrer et al. (1987) showed that pooled estimates of a dynamic model with leading indicator variables provided superior forecasting results. Hoogstrate et al. (2000) investigate the improvement in forecasting performance from pooling techniques instead of single-country forecasts for N fixed and T large, using a set of dynamic regression equations and GDP growth rates of 18 OECD countries. Gavin and Theodorou (2005) use forecasting criteria to examine the macroeconomic behavior of 15 OECD countries; they find that a small set of variables and a simple common VAR model strongly support the hypothesis that many industrialized countries have similar macroeconomic dynamics.¹³ There is also a literature analyzing economic growth models using QR techniques (see, e.g., Crespo-Cuaresma et al. (2009), Canarella and Pollard (2004), and Mello and Novo (2002)), and there is recent work on forecasting economic growth rates with time-series QR models (see, e.g., Cai (2007, 2010)). However, to the best of our knowledge, there is a lack of literature on forecasting with panel data QR models.

13 In related work, Fok et al. (2005) use a panel of 48 US states to show that forecasts of aggregates, such as total output, can be improved by considering panel models of disaggregated series.

The data used in this analysis come from the International Monetary Fund's International Financial Statistics database and contain the following variables: (i) GDP; (ii) real stock prices; and (iii) a GDP deflator. We use data on the following 18 countries: Australia, Austria, Belgium, Canada, Denmark, Finland, France, Germany, Ireland, Italy, Japan, The Netherlands, Norway, Spain, Sweden, Switzerland, United Kingdom, and United States. The longest series contain annual data from 1948 to 2008, but data points are missing for some countries, making the panel unbalanced.

We construct one-step-ahead predictions using a simple autoregressive dynamic model and an alternative model including leading indicator (LI) variables. The following dynamic panel quantile regression model is employed to generate the output growth rate forecasts:

$$Q_{y_{it}}(\tau \mid \mathcal{F}_{it-1}) = \eta_i + \alpha(\tau)y_{it-1} + \beta_1(\tau)SR_{it-1} + \beta_2(\tau)SR_{it-2}, \tag{13}$$

where y_it denotes the first difference of the logarithm of real output and SR_it denotes the first difference of the log of a stock price index. We also estimate the least squares version of (13) for comparison. We use y_{it−2} as instruments for both models.

First, we estimate the dynamic autoregressive model without LI variables (β₁ = β₂ = 0). We compare the estimates of the autoregressive coefficient α for QRPIV, with τ = 0.5, and OLS-IV. The results are 0.403 (0.15) for QRPIV and 0.386 (0.07) for OLS-IV, with standard errors in parentheses; the point estimates for the conditional mean and median models are thus similar. Moreover, we construct prediction intervals for y_{iT+1} (2009) with nominal level 0.90. For QRPIV we use the simulation technique described in Section 3, and for OLS-IV we assume Gaussian innovations, for simplicity and so that we can later compare the fit of the predictive densities. The results for the lower bound (LB), upper bound (UB), and length (LG) of the intervals are presented in Table 5 (Model 1). Comparing the lengths, we observe larger variation in QRPIV than in OLS-IV, with Finland and Ireland presenting the largest lengths; for these two countries the growth rate of real GDP ranges from negative values to strong positive growth rates. A preliminary analysis of the QRPIV intervals shows that the countries with the best prognosis in terms of economic growth are France, Norway, and Spain. For all the other countries, the one-step-ahead intervals for the output growth rate have negative LB, for both QRPIV and OLS-IV. In addition, note that OLS-IV delivers negative estimates for all LB, and they are, in general, more negative than those of QRPIV.

Using these estimators we also construct densities for the one-step-ahead forecast. For QRPIV we estimate the conditional density functions using the simulation method described in Section 3, and for OLS-IV we make use of the normality assumption.¹⁴ The results are presented in Fig. 3 and show evidence of similar estimates of the peak of the distributions for almost all countries, with the exception of Belgium, Ireland, Italy, Norway, Spain, and Sweden. One important distinction is that the QRPIV model often has a higher peak at the center of the distribution than OLS-IV; this is due to the normality assumption for the latter. Another important difference between the two methods is that for several countries the QRPIV densities are asymmetric, a feature that cannot be captured with OLS-IV. Moreover, from the QRPIV results it can be seen that Finland and Japan have a considerable amount of probability mass on their left tails, in negative regions. Thus, according to these estimates, these countries were likely to face a recession.
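As an illustration of how the Table 5 entries could be produced, the sketch below simulates one-step-ahead draws from an estimated version of model (13) for one country and reads off LB, UB, and LG; `coef_at` is a hypothetical lookup into the estimated quantile process, as in Section 3, and is not the paper's code.

```python
# Sketch: one-step-ahead 0.90 interval for one country from model (13).
import numpy as np

rng = np.random.default_rng(1)

def country_interval(coef_at, y_T, SR_T, SR_Tm1, R=500, lam=0.10):
    """coef_at(u) -> (eta_i, alpha(u), b1(u), b2(u)); returns (LB, UB, LG)."""
    draws = np.empty(R)
    for r in range(R):
        eta_i, a_u, b1_u, b2_u = coef_at(rng.uniform())
        draws[r] = eta_i + a_u * y_T + b1_u * SR_T + b2_u * SR_Tm1
    lb, ub = np.quantile(draws, lam / 2), np.quantile(draws, 1 - lam / 2)
    return lb, ub, ub - lb
```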
14 There are several methods to compute density forecasts available in the literature; see, for instance, Tay and Wallis (2000). We impose normality for simplicity and comparison purposes. We compute standard errors for the individual specific effects as in Arellano (2003).
Table 5
One-step-ahead predictive intervals for the OECD growth rate of real GDP.

                  Model 1                                                 Model 2
                  QRPIV                   OLS-IV                          QRPIV                   OLS-IV
                  LB      UB     LG       LB      UB     LG               LB      UB     LG       LB      UB     LG
Australia        −0.001   0.075  0.076   −0.006   0.069  0.075           −0.022   0.052  0.074   −0.003   0.060  0.063
Austria          −0.005   0.052  0.058   −0.013   0.060  0.074           −0.036   0.065  0.100   −0.012   0.053  0.065
Belgium          −0.004   0.055  0.060   −0.014   0.059  0.073           −0.032   0.043  0.075   −0.014   0.050  0.064
Canada           −0.014   0.063  0.077   −0.015   0.060  0.075           −0.019   0.027  0.046   −0.005   0.055  0.060
Denmark          −0.016   0.055  0.071   −0.032   0.040  0.073           −0.048   0.004  0.053   −0.028   0.035  0.063
Finland          −0.026   0.087  0.113   −0.015   0.062  0.077           −0.031   0.049  0.080   −0.011   0.052  0.064
France            0.011   0.053  0.042   −0.014   0.060  0.074           −0.023   0.039  0.063   −0.010   0.053  0.063
Germany          −0.008   0.074  0.082   −0.020   0.055  0.075           −0.029   0.056  0.085   −0.018   0.046  0.064
Ireland          −0.007   0.102  0.110   −0.023   0.052  0.076           −0.062   0.027  0.089   −0.014   0.049  0.063
Italy            −0.011   0.039  0.050   −0.030   0.045  0.075           −0.048   0.032  0.080   −0.024   0.039  0.063
Japan            −0.019   0.068  0.086   −0.014   0.061  0.075           −0.055   0.039  0.094   −0.004   0.061  0.065
Netherlands       0.000   0.067  0.068   −0.007   0.065  0.073           −0.026   0.049  0.075   −0.009   0.056  0.065
Norway            0.003   0.054  0.051   −0.009   0.065  0.074           −0.027   0.042  0.069   −0.007   0.057  0.064
Spain             0.003   0.071  0.068   −0.008   0.065  0.074           −0.021   0.071  0.091   −0.005   0.059  0.064
Sweden           −0.008   0.051  0.058   −0.021   0.053  0.074           −0.042   0.021  0.063   −0.017   0.046  0.063
Switzerland      −0.011   0.057  0.068   −0.020   0.054  0.075           −0.020   0.043  0.063   −0.021   0.042  0.063
United Kingdom   −0.011   0.058  0.069   −0.018   0.055  0.073           −0.002   0.063  0.065   −0.004   0.059  0.063
United States    −0.009   0.069  0.079   −0.015   0.058  0.074           −0.029   0.055  0.083   −0.011   0.055  0.066
Fig. 3. One-step-ahead predictive density functions for Model 1. The solid line is the one-step-ahead predictive density based on quantile regression; the dashed line indicates the one-step-ahead predictive density estimate using the mean regression method.
In a second round of estimations, we include SR_it as a LI variable and estimate Eq. (13). First, we compare the autoregressive coefficients: for median QRPIV the coefficient is 0.526 (0.16), while for OLS-IV it is 0.185 (0.07). Thus, the results show a large difference between these coefficients, with the QRPIV estimate increasing and the OLS-IV estimate decreasing. Again, we focus on the one-step-ahead forecast and construct prediction intervals and density functions for each country in the sample. The stock price index data have a considerable number of missing values: Denmark has data only from 1996 to 2008, and Switzerland only from 1989 to 2008. In addition, for the UK we only have data from 1958 to 1998, so that forecasting one step ahead has a different interpretation than for the other countries.¹⁵ Table 5 (Model 2) presents the one-step-ahead prediction interval estimates for LB, UB, and LG.
15 Due to the missing data problem, we do not interpret results for the United Kingdom.
Relative to the previous model, without controlling for the stock price index variable, the forecasts for QRPIV are, on one hand, more pessimistic, with all countries presenting a negative lower bound and zero inside the interval. On the other hand, the intervals for OLS-IV are in general more optimistic, with less negative LB. For QRPIV, the lengths are similar, with the exception of Austria, Japan, and Spain, which present considerably wider intervals. For OLS-IV, the lengths are narrower and, once again, present less variation than those of QRPIV. Another important feature is that, in general, the intervals have a smaller UB for the GDP growth rate in both cases.

We also construct predictive density functions for this case. The results are presented in Fig. 4 and show that the models produce similar densities for Australia, Austria, Netherlands, Norway, Switzerland, and the UK. For the remaining countries, however, the results are quite different: the most prominent features are the higher peaks for QRPIV and the fact that the densities from the QRPIV model are asymmetric, with considerably more mass on the left tail, thus
Fig. 4. One-step-ahead predictive density functions for Model 2. The solid line is the one-step-ahead predictive density based on quantile regression; the dashed line indicates the one-step-ahead predictive density estimate using the mean regression method.
forecasting smaller growth rates than the OLS-IV models. Relative to the previous model, for QRPIV the left tail presents much more mass for most of the densities, and the mean is shifted to the left; most predictive densities therefore indicate a recession with higher probability than under the previous model. An interesting phenomenon occurs with the QRPIV densities of Belgium, Canada, Denmark, and Norway, which become negatively skewed in the second round of estimation.

This exercise illustrates forecasting with QRPIV and compares the results with a simple OLS-IV. The results suggest important differences; in particular, the density estimates for the model with LI variables differ considerably. In addition, the results show that presenting density forecasts is an important complement to interval forecasting, since the former conveys additional information about the distribution of probability mass. Furthermore, using QR models for this forecasting exercise requires no explicit distributional assumptions on the densities of the innovations.

6. Conclusion

This paper studies estimation, inference, and prediction in a quantile regression dynamic panel model with fixed effects. Standard approaches to estimating dynamic panel models with FE are typically biased in the presence of lagged dependent variables as regressors. To reduce the dynamic bias in the quantile regression FE estimator, we suggest the use of the instrumental variables quantile regression method of Chernozhukov and Hansen (2006) along with lagged regressors as instruments. In addition, we describe how to construct prediction intervals and conditional density functions using the proposed model. Monte Carlo simulations show that the quantile regression IV approach for dynamic panel data sharply reduces the dynamic bias and is especially advantageous when the innovations are heavy-tailed. Moreover, the empirical levels of the predictive intervals approximate the nominal levels fairly well. Finally, we illustrate the methods with an application to forecasting output growth rates for 18 OECD countries, computing one-step-ahead prediction intervals and density functions.
Acknowledgements

This paper is based on the first chapter of my doctoral thesis. I am deeply grateful to Roger Koenker for his continuing guidance and encouragement. The author would like to thank the Editor, Oliver Linton, and two anonymous referees for their valuable comments and suggestions. I am also grateful to Anil Bera, Xuming He, Ted Juhl, Gabriel Montes-Rojas, Steve Portnoy, Shinichi Sakata, and participants in seminars at Forecasting in Rio, the 2008 North American Summer Meeting of the Econometric Society, the 2007 Midwest Econometrics Group Meeting, University of Iowa, University of Wisconsin-Milwaukee, University of Illinois at Urbana-Champaign, and City University London for helpful comments and discussions. All remaining errors are my own.

Appendix

The next two lemmas help in the derivation of the results.

Lemma 1. Given assumptions A1–A6, (η, β(τ), α(τ)) uniquely solves the equations E[υψ(y − Zη − y₋₁α − Xβ)X̌(τ)] = 0 over E × A × B.

Proof. This is a simple application of Theorem 2 of Chernozhukov and Hansen (2006), which shows that (η, α(τ), β(τ)) uniquely solves the limit problem for each τ, that is, η(α*(τ)) = η, α*(τ) = α(τ), and β(α*(τ), τ) = β(τ). Let ψ(u) := (τ − I(u < 0)), Π(η, α, β, τ) := E[υψ(y − Zη − y₋₁α(τ) − Xβ(τ))X̌(τ)], and

$$H(\eta, \alpha, \beta, \tau) := \frac{\partial}{\partial(\eta, \alpha, \beta)}\,E[\upsilon\psi(y - Z\eta - y_{-1}\alpha(\tau) - X\beta(\tau))\check X(\tau)].$$

By assumption A2, H(η, α, β, τ) is continuous in (η, α, β) and of full rank, uniformly over E × A × B. Moreover, by A2, the image of the set E × A × B under the mapping (η, α, β) ↦ Π(η, α, β, τ) is simply connected. As in Chernozhukov and Hansen (2005), the application of Hadamard's global univalence theorem for general metric spaces yields that the mapping Π(·, ·, ·, τ) is a homeomorphism (one-to-one) between E × A × B and Π(E, A, B, τ), the image of E × A × B under Π(·, ·, ·, τ). Since (η, α, β) = (η, α(τ), β(τ)) solves the equation Π(η, α, β, τ) = 0, it is the only solution in E × A × B. This argument is valid for each τ ∈ T. The remainder of the argument is the same as in Chernozhukov and Hansen (2006), and (η(α(τ)) = η, α(τ), β(α(τ), τ) = β(τ)) is the only such solution for the problem.
Lemma 2. Let ξ_it(τ_k) = ηz_it + α(τ_k)y_{it−1} + β(τ_k)x_it + γ(τ_k)w_it, and u_it(τ_k) = y_it − ξ_it(τ_k). Let ϑ := (η, α, β, γ) be a parameter vector in V := E × A × B × G, and let

$$\delta = \begin{pmatrix}\delta_\eta\\ \delta_\alpha\\ \delta_\beta\\ \delta_\gamma\end{pmatrix} = \begin{pmatrix}\sqrt{T}\,(\hat\eta-\eta)\\ \sqrt{NT}\,(\hat\alpha(\tau_k)-\alpha(\tau_k))\\ \sqrt{NT}\,(\hat\beta(\tau_k)-\beta(\tau_k))\\ \sqrt{NT}\,(\hat\gamma(\tau_k)-\gamma(\tau_k))\end{pmatrix}.$$

Under conditions A1–A6,

$$\sup_{\vartheta\in V}\Bigl|(NT)^{-1}\sum_{k}\sum_{i}\sum_{t}\Bigl\{\rho_\tau\Bigl(u_{it}(\tau_k)-\frac{z_{it}\delta_\eta}{\sqrt T}-\frac{y_{it-1}\delta_\alpha}{\sqrt{NT}}-\frac{x_{it}\delta_\beta}{\sqrt{NT}}-\frac{w_{it}\delta_\gamma}{\sqrt{NT}}\Bigr)-\rho_\tau(u_{it}(\tau_k)) - E\Bigl[\rho_\tau\Bigl(u_{it}(\tau_k)-\frac{z_{it}\delta_\eta}{\sqrt T}-\frac{y_{it-1}\delta_\alpha}{\sqrt{NT}}-\frac{x_{it}\delta_\beta}{\sqrt{NT}}-\frac{w_{it}\delta_\gamma}{\sqrt{NT}}\Bigr)-\rho_\tau(u_{it}(\tau_k))\Bigr]\Bigr\}\Bigr| = o_p(1). \tag{14}$$
Proof. With some abuse of notation, let x̃_it = (y_{it−1}, x_it, w_it), β̃ = (α, β, γ), and ϑ = (β̃, η); let ‖·‖ denote the Euclidean norm. It is sufficient to show that, for any ϵ > 0,

$$P\Bigl(\sup_{\vartheta\in V}\Bigl|(NT)^{-1}\sum_k\sum_i\sum_t\Bigl\{\rho\Bigl(u_{it}-\tilde x_{it}\frac{\delta_{\tilde\beta}}{\sqrt{NT}}-z_{it}\frac{\delta_\eta}{\sqrt T}\Bigr)-\rho(u_{it})-E\Bigl[\rho\Bigl(u_{it}-\tilde x_{it}\frac{\delta_{\tilde\beta}}{\sqrt{NT}}-z_{it}\frac{\delta_\eta}{\sqrt T}\Bigr)-\rho(u_{it})\Bigr]\Bigr\}\Bigr|>\epsilon\Bigr)\to 0. \tag{15}$$

For each k, consider a partition of the parameter space Γ = V into K_N disjoint parts Γ₁, Γ₂, …, Γ_{K_N}, such that the diameter of each part is less than q_{NT} = ϵN^a/(12KC₁T). Let p_N be the dimension of ϑ, with p_N = O(N); then K_N ≤ (2√p_N/q_{NT} + 1)^{p_N} (cf. Wei and He (2006)). Let ζ_q ∈ Γ_q, q = 1, …, K_N, be fixed points. Then the left-hand side of (15) can be bounded by P₁ + P₂, where

$$P_1 = P\Bigl(\max_{1\le q\le K_N}\sup_{\vartheta\in\Gamma_q}(NT)^{-1}\Bigl|\sum_k\sum_i\sum_t\Bigl\{\Bigl[\rho\Bigl(u_{it}-\tilde x_{it}\tfrac{\delta_{\tilde\beta}}{\sqrt{NT}}-z_{it}\tfrac{\delta_\eta}{\sqrt T}\Bigr)-E\,\rho\Bigl(u_{it}-\tilde x_{it}\tfrac{\delta_{\tilde\beta}}{\sqrt{NT}}-z_{it}\tfrac{\delta_\eta}{\sqrt T}\Bigr)\Bigr]-\Bigl[\rho\Bigl(u_{it}-\tilde x_{it}\tfrac{\zeta_{q\tilde\beta}}{\sqrt{NT}}-z_{it}\tfrac{\zeta_{q\eta}}{\sqrt T}\Bigr)-E\,\rho\Bigl(u_{it}-\tilde x_{it}\tfrac{\zeta_{q\tilde\beta}}{\sqrt{NT}}-z_{it}\tfrac{\zeta_{q\eta}}{\sqrt T}\Bigr)\Bigr]\Bigr\}\Bigr|>\epsilon/2\Bigr)$$

and

$$P_2 = P\Bigl(\max_{1\le q\le K_N}(NT)^{-1}\Bigl|\sum_k\sum_i\sum_t\Bigl\{\rho\Bigl(u_{it}-\tilde x_{it}\tfrac{\zeta_{q\tilde\beta}}{\sqrt{NT}}-z_{it}\tfrac{\zeta_{q\eta}}{\sqrt T}\Bigr)-\rho(u_{it})-E\Bigl[\rho\Bigl(u_{it}-\tilde x_{it}\tfrac{\zeta_{q\tilde\beta}}{\sqrt{NT}}-z_{it}\tfrac{\zeta_{q\eta}}{\sqrt T}\Bigr)-\rho(u_{it})\Bigr]\Bigr\}\Bigr|>\epsilon/2\Bigr).$$

Therefore, it suffices to show that both P₁ and P₂ are o_p(1). Noting that |ρ(x + y) − ρ(x)| ≤ 2|y|, the quantity inside the probability defining P₁ is bounded by

$$\max_{1\le q\le K_N}\sup_{\vartheta\in\Gamma_q}(NT)^{-1}\sum_k\sum_i\sum_t\Bigl\{4\Bigl[\frac{\|\tilde x_{it}\|\,\|\delta_{\tilde\beta}-\zeta_{q\tilde\beta}\|}{\sqrt{NT}}+\frac{\|z_{it}\|\,\|\delta_\eta-\zeta_{q\eta}\|}{\sqrt T}\Bigr]+2E\Bigl[\frac{\|\tilde x_{it}\|\,\|\delta_{\tilde\beta}-\zeta_{q\tilde\beta}\|}{\sqrt{NT}}+\frac{\|z_{it}\|\,\|\delta_\eta-\zeta_{q\eta}\|}{\sqrt T}\Bigr]\Bigr\}\le 4q_{NT}+2E[q_{NT}]\le\frac{\epsilon}{2},$$

where the last inequality follows from assumptions A5–A6; this implies P₁ = o_p(1).

It remains to show that P₂ = o_p(1). If we write m_it = Σ_k[ρ(u_it − x̃_it ζ_{qβ̃}/√NT − z_it ζ_{qη}/√T) − ρ(u_it)], then

$$P_2 = P\Bigl(\max_{1\le q\le K_N}\Bigl|\frac{1}{NT}\sum_i\sum_t(m_{it}-Em_{it})\Bigr|>\frac{\epsilon}{2}\Bigr)\le\sum_{q=1}^{K_N}P\Bigl(\Bigl|\frac{1}{NT}\sum_i\sum_t(m_{it}-Em_{it})\Bigr|>\frac{\epsilon}{2}\Bigr).$$

For fixed i,

$$m_{it}\le\sup_{\zeta\in\Gamma}\sum_k\Bigl|\rho\Bigl(u_{it}-\tilde x_{it}\frac{\zeta_{q\tilde\beta}}{\sqrt{NT}}-z_{it}\frac{\zeta_{q\eta}}{\sqrt T}\Bigr)-\rho(u_{it})\Bigr|\le\sup_{\zeta\in\Gamma}\sum_k 2\Bigl|\tilde x_{it}\frac{\zeta_{q\tilde\beta}}{\sqrt{NT}}+z_{it}\frac{\zeta_{q\eta}}{\sqrt T}\Bigr|\le\sup_{\zeta\in\Gamma}\sum_k 2\Bigl[\frac{\|\tilde x_{it}\|\,\|\zeta_{q\tilde\beta}\|}{\sqrt{NT}}+\frac{\|z_{it}\|\,\|\zeta_{q\eta}\|}{\sqrt T}\Bigr]\le 2C_2,$$

and

$$\mathrm{Var}\Bigl(\sum_{t=1}^T m_{it}\Bigr)\le\sum_{t=1}^T\mathrm{Var}(m_{it})+\sum_{t_1<t_2}\bigl[\mathrm{Var}(m_{it_1})+\mathrm{Var}(m_{it_2})\bigr]\le T\sum_{t=1}^T\mathrm{Var}(m_{it}).$$

Using the fact that ‖ζ_q‖ < C for any q, and the independence across individuals in A1, we have

$$\mathrm{Var}(m_{it})\le E\Bigl(2\Bigl|\frac{\tilde x_{it}}{\sqrt{NT}}\zeta_{q\tilde\beta}+\frac{z_{it}}{\sqrt T}\zeta_{q\eta}\Bigr|\Bigr)^2\le C_3\|\zeta_{q\tilde\beta}\|+C_4\|\zeta_{q\eta}\|,$$

so that

$$\mathrm{Var}\Bigl(\sum_t m_{it}\Bigr)\le T\sum_{i=1}^N\sum_{t=1}^T Ey_{it}^2\le NT^2C_5+o(1).$$

By the assumption of independence, we can bound P₂ using Bernstein's inequality:

$$P_2\le 2K_N\exp\Bigl(-\frac{(\epsilon NT/2)^2}{\sum_t\sum_i\mathrm{Var}(m_{it})+\frac{TC_2}{3}\frac{\epsilon NT}{2}}\Bigr)\le 2\Bigl(\frac{2\sqrt{p_N}}{q_{NT}}+1\Bigr)^{p_N}\exp\Bigl(-\frac{(\epsilon NT/2)^2}{NT^2C_5+o(1)+\frac{TC_2}{3}\frac{\epsilon NT}{2}}\Bigr)=2\exp\Bigl(O(N)\ln\bigl(24\,O(N^{1/2-a}T)C_1/\epsilon+1\bigr)-\frac{(\epsilon NT/2)^2}{NT^2C_5+o(1)+\frac{TC_2}{3}\frac{\epsilon NT}{2}}\Bigr).$$

Hence, P₂ → 0 as N, T → ∞ with N^a/T → 0, and the lemma follows.

Let θ(τ) = (θ(τ₁)′, …, θ(τ_K)′)′. Now we prove Theorem 1.

Proof. The first part follows from Lemma 1. The second part is similar to Chernozhukov and Hansen (2006); we need to show that, under conditions A1–A6, θ̂(τ) = θ(τ) + o_p(1). Let

$$P = (\eta, \alpha, \beta, \gamma)\ \mapsto\ \rho_\tau(y - Z\eta - y_{-1}\alpha - X\beta - W\gamma)$$

and note that P is continuous. Therefore, uniform convergence is given by Lemma 2, so that sup_{θ∈Θ}|M_n − M| = o_p(1), where

$$M_n\equiv\frac{1}{NT}\sum_{k=1}^K\sum_{i=1}^N\sum_{t=1}^T\Bigl[\rho\Bigl(u_{it}-\tilde x_{it}\frac{\delta_{\tilde\beta}}{\sqrt{NT}}-z_{it}\frac{\delta_\eta}{\sqrt T}\Bigr)-\rho(u_{it})\Bigr]$$

and M = EM_n. Therefore, denoting ϑ = (η, β, γ), we have that ‖ϑ̂(α, τ) − ϑ(α, τ)‖ →p 0 (*), which implies that |‖γ̂(α, τ)‖ − ‖γ(α, τ)‖| →p 0, which by a simple argmax process over a compact set argument (assumption A4 and Corollary 3.2.3 in van der Vaart and Wellner (1996)) implies that ‖α̂(τ) − α(τ)‖ →p 0, which by (*) implies that ‖β̂(τ) − β(τ)‖ →p 0 and ‖γ̂(α̂, τ) − 0‖ →p 0. Therefore, ‖θ̂(τ) − θ(τ)‖ →p 0 and the theorem follows.
Now we prove Theorem 2.

Proof. Consider the following model: y_it = η_i + αy_{it−1} + βx_it + u_it. The objective function is

$$\min_{\eta_i,\alpha,\beta,\gamma}\ \sum_{k=1}^K\sum_{i=1}^N\sum_{t=1}^T\upsilon_k\,\rho_\tau\bigl(y_{it}-\eta_i-\alpha(\tau_k)y_{it-1}-\beta(\tau_k)x_{it}-\gamma(\tau_k)w_{it}\bigr).$$

Consider a collection of closed balls B_n(α(τ_k)), centered at α(τ_k), with radius π_n, and π_n → 0 slowly enough. Note that, for any α_n(τ_k) →p α(τ_k) (δ_α →p 0), we can write the objective function as

$$V_{NT}(\delta)=\sum_{k=1}^K\sum_{i=1}^N\sum_{t=1}^T\Bigl[\upsilon_k\rho_\tau\Bigl(y_{it}-\xi_{it}(\tau_k)-\frac{z_{it}\delta_\eta}{\sqrt T}-\frac{y_{it-1}\delta_\alpha}{\sqrt{NT}}-\frac{x_{it}\delta_\beta}{\sqrt{NT}}-\frac{w_{it}\delta_\gamma}{\sqrt{NT}}\Bigr)-\upsilon_k\rho_\tau(y_{it}-\xi_{it}(\tau_k))\Bigr],$$

where ξ_it(τ_k) = η_i + α(τ_k)y_{it−1} + β(τ_k)x_it + γ(τ_k)w_it, and

$$\delta_n=\begin{pmatrix}\delta_\eta\\\delta_\alpha\\\delta_\beta\\\delta_\gamma\end{pmatrix}=\begin{pmatrix}\sqrt T\,(\hat\eta(\alpha_n)-\eta)\\\sqrt{NT}\,(\alpha_n(\tau_k)-\alpha(\tau_k))\\\sqrt{NT}\,(\hat\beta(\alpha_n,\tau_k)-\beta(\tau_k))\\\sqrt{NT}\,(\hat\gamma(\alpha_n,\tau_k)-0)\end{pmatrix}.$$

Note that, for fixed (δ_α, δ_β, δ_γ), we can consider the behavior of δ_η. Let ψ(u) ≡ (τ − I(u < 0)) and, for each i,

$$g_i(\delta_\eta,\delta_\alpha,\delta_\beta,\delta_\gamma)=-\frac{1}{\sqrt T}\sum_{k=1}^K\sum_{t=1}^T\upsilon_k\,\psi_\tau\Bigl(y_{it}-\xi_{it}-\frac{\delta_\eta}{\sqrt T}-\frac{\delta_\alpha}{\sqrt{NT}}y_{it-1}-\frac{\delta_\beta}{\sqrt{NT}}x_{it}-\frac{\delta_\gamma}{\sqrt{NT}}w_{it}\Bigr).$$

For fixed δ_β, δ_γ, sup_{τ∈T}‖α_n(τ) − α(τ)‖ →p 0, and K > 0,

$$\sup_{\|\delta\|\le K}\bigl\|g_i(\delta_{\eta_i},\delta_\alpha,\delta_\beta,\delta_\gamma)-g_i(0,0,0,0)-E[g_i(\delta_\eta,\delta_\alpha,\delta_\beta,\delta_\gamma)-g_i(0,0,0,0)]\bigr\|=o_p(1).$$

Expanding, we have

$$E[g_i(\delta_\eta,\delta_\alpha,\delta_\beta,\delta_\gamma)-g_i(0,0,0,0)]=-\frac{1}{\sqrt T}\sum_{k=1}^K\sum_{t=1}^T\upsilon_k\Bigl[\tau_k-F\Bigl(\xi_{it}+\frac{\delta_\eta}{\sqrt T}+\frac{\delta_\alpha}{\sqrt{NT}}y_{it-1}+\frac{\delta_\beta}{\sqrt{NT}}x_{it}+\frac{\delta_\gamma}{\sqrt{NT}}w_{it}\Bigr)\Bigr]=\frac{1}{\sqrt T}\sum_{k=1}^K\sum_{t=1}^T\upsilon_k f_{it}(\xi_{it}(\tau_k))\Bigl[\frac{\delta_\eta}{\sqrt T}+\frac{\delta_\alpha}{\sqrt{NT}}y_{it-1}+\frac{\delta_\beta}{\sqrt{NT}}x_{it}+\frac{\delta_\gamma}{\sqrt{NT}}w_{it}\Bigr]+R_{it}.$$

Optimality of δ̂_{η_i} implies that g_i(δ_η, δ_α, δ_β, δ_γ) = o(T^{-1}), and thus

$$\hat\delta_\eta=-\bar f_i^{\,-1}\,T^{-1}\sum_{k=1}^K\sum_{t=1}^T\upsilon_k f_{it}(\xi_{it}(\tau_k))\Bigl[\frac{\delta_\alpha}{\sqrt{NT}}y_{it-1}+\frac{\delta_\beta}{\sqrt{NT}}x_{it}+\frac{\delta_\gamma}{\sqrt{NT}}w_{it}\Bigr]-\bar f_i^{\,-1}\,T^{-1/2}\sum_{k=1}^K\sum_{t=1}^T\upsilon_k\psi_\tau(y_{it}-\xi_{it}(\tau_k))+R_{it},$$

where f̄_i = T^{-1}∑_k∑_t υ_k f_it and R_it is the remainder term for each i. Substituting the δ̂_η's, we denote

$$G(\delta_\alpha,\delta_\beta,\delta_\gamma)=-\frac{1}{\sqrt{NT}}\sum_{k=1}^K\sum_{i=1}^N\sum_{t=1}^T\upsilon_k X_{it}\,\psi_\tau\Bigl(y_{it}-\xi_{it}-\frac{\hat\delta_\eta}{\sqrt T}-\frac{\delta_\alpha}{\sqrt{NT}}y_{it-1}-\frac{\delta_\beta}{\sqrt{NT}}x_{it}-\frac{\delta_\gamma}{\sqrt{NT}}w_{it}\Bigr),$$

where X_it = (x_it′, w_it′)′, δ_α(α_n) = √NT(α_n(τ_k) − α(τ_k)), δ_β(α_n) = √NT(β̂(α_n, τ_k) − β), and δ_γ(α_n) = √NT(γ̂(α_n, τ_k) − 0). As in Koenker (2004),

$$\sup_{\|\delta\|\le K}\bigl\|G(\delta_\alpha,\delta_\beta,\delta_\gamma)-G(0,0,0)-E[G(\delta_\alpha,\delta_\beta,\delta_\gamma)-G(0,0,0)]\bigr\|=o_p(1),$$

and at the minimizer, G(δ̂_α, δ̂_β, δ̂_γ) = o((NT)^{-1}). Expanding, as above,

$$E[G(\delta_\alpha,\delta_\beta,\delta_\gamma)-G(0,0,0)]=\frac{1}{\sqrt{NT}}\sum_k\sum_i\sum_t\upsilon_k X_{it}f_{it}\Bigl[\frac{\delta_\alpha}{\sqrt{NT}}y_{it-1}+\frac{\delta_\beta}{\sqrt{NT}}x_{it}+\frac{\delta_\gamma}{\sqrt{NT}}w_{it}+\frac{\hat\delta_\eta}{\sqrt T}\Bigr]+o_p(1)\\=\frac{1}{NT}\sum_k\sum_i\sum_t\upsilon_k X_{it}f_{it}\Bigl[\Bigl(y_{it-1}-\bar f_i^{\,-1}T^{-1}\sum_k\sum_t\upsilon_k f_{it}y_{it-1}\Bigr)\delta_\alpha+\Bigl(x_{it}-\bar f_i^{\,-1}T^{-1}\sum_k\sum_t\upsilon_k f_{it}x_{it}\Bigr)\delta_\beta+\Bigl(w_{it}-\bar f_i^{\,-1}T^{-1}\sum_k\sum_t\upsilon_k f_{it}w_{it}\Bigr)\delta_\gamma\Bigr]\\\quad-\frac{1}{\sqrt{NT}}\sum_k\sum_i\sum_t\upsilon_k X_{it}f_{it}\,\bar f_i^{\,-1}\,T^{-1/2}\sum_k\sum_t\upsilon_k\psi(y_{it}-\xi_{it})-\frac{1}{\sqrt{NT}}\sum_k\sum_i\sum_t\upsilon_k X_{it}f_{it}R_{it}/\sqrt T+o_p(1),$$

where the order of the final term is controlled by the bound on the derivative of the conditional density.
It is important to note that G(δ̂_α, δ̂_β, δ̂_γ) = 0, and then E[G(δ_α, δ_β, δ_γ) − G(0, 0, 0)] = G(0, 0, 0). Let Φ = diag(f_it(ξ_it(τ_k))), and let Ψ_τ be an NT-vector with elements ψ_τ(y_it − ξ_it(τ)). Define δ_ϑ = (δ_β, δ_γ); then, omitting the subscript k, we have

$$\upsilon(X'M_Z\Phi M_Z Y_{-1}/NT)\delta_\alpha + \upsilon(X'M_Z\Phi M_Z X/NT)\delta_\vartheta - \upsilon X'P_Z\Psi/\sqrt{NT} = -\upsilon X'\Psi/\sqrt{NT} + R_{nT},$$
$$J_\vartheta\delta_\vartheta = \bigl[-(\upsilon X'M_Z\Psi/\sqrt{NT} - R_{nT}) - J_\alpha\delta_\alpha\bigr],$$
$$\hat\delta_\vartheta = J_\vartheta^{-1}\bigl[-(\upsilon X'M_Z\Psi/\sqrt{NT} - R_{nT}) - J_\alpha\delta_\alpha\bigr],$$

where M_Z = I − P_Z, P_Z = Z(Z'ΦZ)^{-1}Z'Φ, J_ϑ and J_α are as in assumption A3, and

$$R_{nT} = \frac{1}{\sqrt{NT}}\sum_{k=1}^K\sum_{i=1}^N\sum_{t=1}^T \upsilon_k X_{it}f_{it}R_{it}/\sqrt T + o_p(1).$$

Let [J̄_β′, J̄_γ′]′ be the conformable partition of J_ϑ^{-1}; then

$$\hat\delta_\gamma = \bar J_\gamma\bigl[-(\upsilon X'M_Z\Psi/\sqrt{NT} - R_{nT}) - J_\alpha\delta_\alpha\bigr], \tag{16}$$
$$\hat\delta_\beta = \bar J_\beta\bigl[-(\upsilon X'M_Z\Psi/\sqrt{NT} - R_{nT}) - J_\alpha\delta_\alpha\bigr]. \tag{17}$$

The remainder term R_nT has a dominant component that comes from the Bahadur representation of the η's. By A1 and A5, we have, for a generic constant C₁,

$$R_{nT} = T^{-1/4}\,\frac{C_1K}{\sqrt N}\sum_{i=1}^N R_{i0} + o_p(1).$$

The analysis of Knight (2001) shows that the summands converge in distribution; that is, as T → ∞ the remainder term T^{1/4}R_it →d R_{i0}, where the R_{i0} are functionals of Brownian motions with finite second moments. Therefore, independence of the y_it and condition A6 ensure that the contribution of the remainder is negligible. Thus (16) and (17) simplify to

$$\hat\delta_\gamma = \bar J_\gamma\bigl[-(\upsilon X'M_Z\Psi/\sqrt{NT}) - J_\alpha\delta_\alpha\bigr],$$
$$\hat\delta_\beta = \bar J_\beta\bigl[-(\upsilon X'M_Z\Psi/\sqrt{NT}) - J_\alpha\delta_\alpha\bigr].$$

By consistency, with probability tending to one,

$$\hat\delta_\alpha = \arg\min_{\delta_\alpha\in B_n(\alpha(\tau))}\ \hat\delta_\gamma(\delta_\alpha)'A\,\hat\delta_\gamma(\delta_\alpha),$$
assuming that δ̂_γ′Aδ̂_γ is continuous in δ_α, and from the first-order condition

$$\hat\delta_\alpha = -[J_\alpha'\bar J_\gamma'A\bar J_\gamma J_\alpha]^{-1}[J_\alpha'\bar J_\gamma'A\bar J_\gamma\,(\upsilon X'M_Z\Psi/\sqrt{NT})].$$

Substituting δ̂_α back into δ_β we obtain

$$\hat\delta_\beta = -\bar J_\beta\bigl[(I - J_\alpha[J_\alpha'\bar J_\gamma'A\bar J_\gamma J_\alpha]^{-1}J_\alpha'\bar J_\gamma'A\bar J_\gamma)\,(\upsilon X'M_Z\Psi/\sqrt{NT})\bigr].$$

It is also important to analyze δ̂_γ. Replacing δ̂_α in δ_γ,

$$\hat\delta_\gamma = -\bar J_\gamma\bigl[(I - J_\alpha[J_\alpha'\bar J_\gamma'A\bar J_\gamma J_\alpha]^{-1}J_\alpha'\bar J_\gamma'A\bar J_\gamma)\,(\upsilon X'M_Z\Psi/\sqrt{NT})\bigr].$$

By condition A3, using the fact that J̄_γJ_α is invertible,

$$\hat\delta_\gamma = 0 + O_p(1) + o_p(1).$$

Let Ψ_k = diag(ψ_{τ_k}(y_it − ξ_it(τ_k))). Notice that Ψ_k 1_{NT}1′_{NT}Ψ_l = (τ_k ∧ τ_l − τ_kτ_l)I_{NT}, and that conditions A1–A6 imply a central limit theorem. Thus, neglecting the remainder term, and using the definitions of δ_α and δ_β, we have

$$\sqrt{NT}\,(\hat\theta - \theta)\xrightarrow{d} N(0, \Omega), \qquad \Omega = (K', L')'\,S\,(K', L'),$$
where $S = (\tau \wedge \tau' - \tau\tau')\,E(VV')$, $V = X' M_Z$, $K = (J_\alpha' H J_\alpha)^{-1} J_\alpha' H$, $H = \bar J_\gamma' A[\alpha(\tau)]\bar J_\gamma$, $L = \bar J_\beta M$, and $M = I - J_\alpha K$.

References

Abrevaya, J., Dahl, C.M., 2008. The effects of birth inputs on birthweight: evidence from quantile estimation on panel data. Journal of Business and Economic Statistics 26, 379–397.
Ahn, S.C., Schmidt, P., 1995. Efficient estimation of models for dynamic panel data. Journal of Econometrics 68, 5–27.
Alvarez, J., Arellano, M., 2003. The time series and cross-section asymptotics of dynamic panel data estimators. Econometrica 71, 1121–1159.
Anderson, T.W., Hsiao, C., 1981. Estimation of dynamic models with error components. Journal of the American Statistical Association 76, 598–606.
Anderson, T.W., Hsiao, C., 1982. Formulation and estimation of dynamic models using panel data. Journal of Econometrics 18, 47–82.
Andrews, D.W.K., 1993. Tests for parameter instability and structural change with unknown change point. Econometrica 61, 821–856.
Arellano, M., 2003. Panel Data Econometrics. Oxford University Press, New York.
Arellano, M., Bond, S., 1991. Some tests of specification for panel data: Monte Carlo evidence and an application to employment equations. Review of Economic Studies 58, 277–297.
Arellano, M., Bover, O., 1995. Another look at the instrumental variable estimation of error-components models. Journal of Econometrics 68, 29–51.
Arellano, M., Hahn, J., 2007. Understanding bias in nonlinear panel models: some recent developments. In: Blundell, R., Newey, W., Persson, T. (Eds.), Advances in Economics and Econometrics, Ninth World Congress, vol. III. Cambridge University Press, pp. 381–409.
Baltagi, B.H., 2008. Forecasting with panel data. Journal of Forecasting 27, 153–173.
Baltagi, B.H., Griffin, J.M., 1997. Pooled estimators vs. their heterogeneous counterparts in the context of dynamic demand for gasoline. Journal of Econometrics 77, 303–327.
Baltagi, B.H., Griffin, J.M., Xiong, W., 2000. To pool or not to pool: homogeneous versus heterogeneous estimators applied to cigarette demand. Review of Economics and Statistics 82, 117–126.
Bester, C.A., Hansen, C., 2009. A penalty function approach to bias reduction in non-linear panel models with fixed effects. Journal of Business and Economic Statistics 27, 131–148.
Blundell, R., Bond, S.R., 1998. Initial conditions and moment restrictions in dynamic panel data models. Journal of Econometrics 87, 115–143.
Box, G.E.P., Jenkins, G.M., Reinsel, G.C., 2008. Time Series Analysis: Forecasting and Control, 4th edn. Wiley, New Jersey.
Bun, M.J., Carree, M.A., 2005. Bias-corrected estimation in dynamic panel data models. Journal of Business and Economic Statistics 23, 200–210.
Cai, Y., 2007. A quantile approach to US GNP. Economic Modelling 24, 969–979.
Cai, Y., 2010. Forecasting for quantile self-exciting threshold autoregressive time series models. Biometrika 97, 199–208.
Canarella, G., Pollard, S., 2004. Parameter heterogeneity in the neoclassical growth model: a quantile regression approach. Journal of Economic Development 29, 1–32.
Chernozhukov, V., Fernandez-Val, I., Galichon, A., 2010. Quantile and probability curves without crossing. Econometrica 78, 1093–1125.
Chernozhukov, V., Hansen, C., 2005. An IV model of quantile treatment effects. Econometrica 73, 245–261.
Chernozhukov, V., Hansen, C., 2006. Instrumental quantile regression inference for structural and treatment effects models. Journal of Econometrics 132, 491–525.
Chernozhukov, V., Hansen, C., 2008. Instrumental variable quantile regression: a robust inference approach. Journal of Econometrics 142, 379–398.
Chevillon, G., Hendry, D.F., 2005. Non-parametric direct multi-step estimation for forecasting economic processes. International Journal of Forecasting 21, 201–218.
Crespo-Cuaresma, J., Foster, N., Stehrer, R., 2009. The determinants of regional economic growth by quantile. wiiw Working Paper.
Elliott, G., Timmermann, A., 2008. Economic forecasting. Journal of Economic Literature 46, 3–56.
Fok, D., van Dijk, D., Franses, P.H., 2005. Forecasting aggregates using panels of nonlinear time series. International Journal of Forecasting 21, 785–794.
Gaglianone, W.P., Lima, L.R., 2009. Constructing density forecasts from quantile regressions. Mimeo.
Garcia-Ferrer, A., Highfield, R.A., Palm, F., Zellner, A., 1987. Macroeconomic forecasting using pooled international data. Journal of Business and Economic Statistics 5, 53–67.
Gavin, W.T., Theodorou, A.T., 2005. A common model approach to macroeconomics: using panel data to reduce sampling error. Journal of Forecasting 24, 203–219.
Geraci, M., Bottai, M., 2007. Quantile regression for longitudinal data using the asymmetric Laplace distribution. Biostatistics 8, 140–154.
Geweke, J., 1989. Exact predictive densities for linear models with ARCH disturbances. Journal of Econometrics 40, 63–86.
Giacomini, R., Komunjer, I., 2005. Evaluation and combination of conditional quantile forecasts. Journal of Business and Economic Statistics 23, 416–431.
Gutenbrunner, C., Jureckova, J., 1992. Regression rank scores and regression quantiles. The Annals of Statistics 20, 305–330.
Hahn, J., Kuersteiner, G.M., 2002. Asymptotically unbiased inference for a dynamic panel model with fixed effects when both n and T are large. Econometrica 70, 1639–1657.
Hahn, J., Newey, W., 2004. Jackknife and analytical bias reduction for nonlinear panel models. Econometrica 72, 1295–1319.
Hoogstrate, A.J., Palm, F.C., Pfann, G.A., 2000. Pooling in dynamic panel data models: an application to forecasting GDP growth rates. Journal of Business and Economic Statistics 18, 274–283.
Hsiao, C., 2003. Analysis of Panel Data, 2nd edn. Cambridge University Press, Cambridge.
Hsiao, C., Tahmiscioglu, A.K., 1997. A panel analysis of liquidity constraints and firm investment. Journal of the American Statistical Association 92, 455–465.
Knight, K., 2001. Comparing conditional quantile estimators: first and second order considerations. Mimeo.
Koenker, R., 2004. Quantile regression for longitudinal data. Journal of Multivariate Analysis 91, 74–89.
Koenker, R., Bassett, G.W., 1978. Regression quantiles. Econometrica 46, 33–49.
Koenker, R., Bassett, G.W., 1982. Robust tests for heteroscedasticity based on regression quantiles. Econometrica 50, 43–61.
Koenker, R., Bassett, G.W., 2010. March madness, quantile regression bracketology and the Hayek hypothesis. Journal of Business and Economic Statistics 28, 26–35.
Koenker, R., Hallock, K., 2000. Quantile regression: an introduction. Manuscript, University of Illinois at Urbana-Champaign.
Koenker, R., Machado, J.A.F., 1999. Goodness of fit and related inference processes for quantile regression. Journal of the American Statistical Association 94, 1296–1310.
Koenker, R., Xiao, Z., 2006. Quantile autoregression. Journal of the American Statistical Association 101, 980–990.
Koenker, R., Zhao, Q., 1996. Conditional quantile estimation and inference for ARCH models. Econometric Theory 12, 793–813.
Lamarche, C., 2010. Robust penalized quantile regression estimation for panel data. Journal of Econometrics 157, 396–408.
Lee, T.-H., Yang, Y., 2008. Bagging binary and quantile predictors for time series: further issues. In: Wohar, M.E., Rapach, D.E. (Eds.), Forecasting in the Presence of Structural Breaks and Model Uncertainty. North Holland, pp. 477–534.
Lin, J.-L., Granger, C.W.J., 1994. Forecasting from non-linear models in practice. Journal of Forecasting 13, 1–9.
Mello, M., Novo, A., 2002. The new empirics of economic growth: quantile regression estimation of growth equations. University of Illinois, Mimeo.
Neyman, J., Scott, E.L., 1948. Consistent estimates based on partially consistent observations. Econometrica 16, 1–32.
Nickell, S., 1981. Biases in dynamic models with fixed effects. Econometrica 49, 1417–1426.
Peters, S.C., Freedman, D.A., 1985. Using the bootstrap to evaluate forecasting equations. Journal of Forecasting 4, 251–262.
Powell, J.L., 1986. Censored regression quantiles. Journal of Econometrics 32, 143–155.
Schmidt, P., 1974. The asymptotic distribution of forecasts in the dynamic simulation of an econometric model. Econometrica 42, 303–309.
Tay, A.S., Wallis, K.F., 2000. Density forecasting: a survey. Journal of Forecasting 19, 235–254.
Taylor, J.W., Bunn, D.W., 1999. A quantile regression approach to generating prediction intervals. Management Science 45, 225–237.
van der Vaart, A., Wellner, J.A., 1996. Weak Convergence and Empirical Processes. Springer-Verlag, New York.
Wei, Y., He, X., 2006. Conditional growth charts. The Annals of Statistics 34, 2069–2097.
Zhou, K.Q., Portnoy, S.L., 1996. Direct use of regression quantiles to construct confidence sets in linear models. The Annals of Statistics 24, 287–306.
Zhou, K.Q., Portnoy, S.L., 1998. Statistical inference on heteroscedastic models based on regression quantiles. Journal of Nonparametric Statistics 9, 239–260.
Journal of Econometrics 164 (2011) 158–172
Understanding models' forecasting performance

Barbara Rossi a,∗, Tatevik Sekhposyan b
a Department of Economics, Duke University, 213 Social Sciences, P.O. Box 90097, Durham, NC 27708, USA
b International Economic Analysis Department, Bank of Canada, 234 Wellington Street, Ottawa, ON K1A 0G9, Canada
∗ Corresponding author. Tel.: +1 919 660 1801; fax: +1 919 684 8974. E-mail address: [email protected] (B. Rossi).

Article history: Available online 1 March 2011
JEL classification: C22; C52; C53
Keywords: Forecasting; Forecast evaluation; Instabilities; Over-fitting; Exchange rates
doi:10.1016/j.jeconom.2011.02.020

Abstract: We propose a new methodology to identify the sources of models' forecasting performance. The methodology decomposes the models' forecasting performance into asymptotically uncorrelated components that measure instabilities in the forecasting performance, predictive content, and overfitting. The empirical application shows the usefulness of the new methodology for understanding the causes of the poor forecasting ability of economic models for exchange rate determination.
© 2011 Elsevier B.V. All rights reserved.
1. Introduction

This paper has two objectives. The first objective is to propose a new methodology for understanding why models have different forecasting performance. Is it because the forecasting performance has changed over time? Or is it because estimation uncertainty makes models' in-sample fit uninformative about their out-of-sample forecasting ability? We identify three possible sources of models' forecasting performance: predictive content, over-fitting, and time-varying forecasting ability. Predictive content indicates whether the in-sample fit predicts out-of-sample forecasting performance. Over-fitting is a situation in which a model includes irrelevant regressors, which improve the in-sample fit of the model but penalize the model in an out-of-sample forecasting exercise. Time-varying forecasting ability might be caused by changes in the parameters of the models, as well as by unmodeled changes in the stochastic processes generating the variables. Our proposed technique decomposes existing measures of forecasting performance into these components in order to understand why a model forecasts better than its competitors. We also propose tests for assessing the significance of each component. Thus, our methodology suggests constructive ways of improving models' forecasting ability.

The second objective is to apply the proposed methodology to study the performance of models of exchange rate determination
in an out-of-sample forecasting environment. Explaining and forecasting nominal exchange rates with macroeconomic fundamentals has long been a struggle in the international finance literature. We apply our methodology in order to better understand the sources of the poor forecasting ability of the models. We focus on models of exchange rate determination in industrialized countries such as Switzerland, United Kingdom, Canada, Japan, and Germany. Similarly to Bacchetta et al. (2010), we consider economic models of exchange rate determination that involve macroeconomic fundamentals such as oil prices, industrial production indices, unemployment rates, interest rates, and money differentials. The empirical findings are as follows. Exchange rate forecasts based on the random walk are superior to those of economic models on average over the out-of-sample period. When trying to understand the reasons for the inferior forecasting performance of the macroeconomic fundamentals, we find that lack of predictive content is the major explanation for the lack of short-term forecasting ability of the economic models, whereas instabilities play a role especially for medium term (one-year ahead) forecasts. Our paper is closely related to the rapidly growing literature on forecasting (see Elliott and Timmermann, 2008, for reviews and references). In their seminal works, Diebold and Mariano (1995) and West (1996) proposed tests for comparing the forecasting ability of competing models, inspiring a substantial research agenda that includes West and McCracken (1998), McCracken (2000), Clark and McCracken (2001, 2005, 2006), Clark and West (2006) and Giacomini and White (2006), among others. None of these works, however, analyze the reasons why the best model outperforms its competitors. In a recent paper, Giacomini and
Rossi (2009) propose testing whether the relationship between in-sample predictive content and out-of-sample forecasting ability for a given model has worsened over time, and refer to such situations as Forecast Breakdowns. Interestingly, Giacomini and Rossi (2009) show that Forecast Breakdowns may happen because of instabilities and over-fitting. However, while their test detects both instabilities and over-fitting, in practice it cannot identify the exact source of the breakdown. The main objective of our paper is instead to decompose the forecasting performance into components that shed light on the causes of the models' forecast advantages or disadvantages. We study this decomposition in the same framework as Giacomini and Rossi (2010), which allows for time variation in the forecasting performance, and discuss its implementation in fixed rolling window as well as recursive window environments.

The rest of the paper is organized as follows. Section 2 describes the framework and the assumptions, Section 3 discusses the new decomposition proposed in this paper, and Section 4 presents the relevant tests. Section 5 provides a simple Monte Carlo simulation exercise, whereas Section 6 analyzes the causes of the poor performance of models for exchange rate determination. Section 7 concludes.

2. The framework and assumptions

Since the works by Diebold and Mariano (1995), West (1996) and Clark and McCracken (2001), it has become common to compare models according to their forecasting performance in a pseudo out-of-sample forecasting environment. Let $h \geq 1$ denote the (finite) forecast horizon. We are interested in evaluating the performance of $h$-steps ahead forecasts for the scalar variable $y_t$ using a vector of predictors $x_t$, where forecasts are obtained via a direct forecasting method.1 We assume the researcher has $P$ out-of-sample predictions available, where the first out-of-sample prediction is based on a parameter estimated using data up to time $R$, the second prediction is based on a parameter estimated using data up to $R + 1$, and the last prediction is based on a parameter estimated using data up to $R + P - 1 = T$, where $R + P + h - 1 = T + h$ is the size of the available sample. The researcher is interested in evaluating the models' pseudo out-of-sample forecasting performance, and in comparing it with a measure of in-sample fit, so that the out-of-sample forecasting performance can ultimately be decomposed into components measuring the contributions of time variation, over-fitting, and predictive content.

Let $\{L_{t+h}(\cdot)\}_{t=R}^{T}$ be a sequence of loss functions evaluating $h$-steps ahead out-of-sample forecast errors. This framework is general enough to encompass: (i) measures of absolute forecasting performance, where $L_{t+h}(\cdot)$ is the forecast error loss of a model; (ii) measures of relative forecasting performance, where $L_{t+h}(\cdot)$ is the difference of the forecast error losses of two competing models; this includes, for example, the measures of relative forecasting performance considered by Diebold and Mariano (1995) and West (1996); (iii) measures of regression-based predictive ability, where $L_{t+h}(\cdot)$ is the product of the forecast error of a model and possible predictors; this includes, for example, Mincer and Zarnowitz's (1969) measures of forecast efficiency. For an overview and discussion of more general regression-based tests of predictive ability, see West and McCracken (1998).
1 That is, h-steps ahead forecasts are directly obtained by using estimates from the direct regression of the dependent variable on the regressors lagged h-periods.
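Concretely, the direct method described in the footnote amounts to regressing $y_{j+h}$ on $x_j$ and applying the fitted coefficient to the latest predictor value. The following minimal sketch (ours, not the authors' code; the function name and toy data are illustrative) makes the timing of the pseudo out-of-sample scheme explicit:

```python
import numpy as np

def direct_forecast(y, x, t, h):
    """Direct h-step forecast of y[t+h] made at time t: regress y[j+h]
    on x[j] over all pairs observed by time t, then apply the fit to x[t]."""
    X = x[: t - h + 1]                       # regressors x_0, ..., x_{t-h}
    Y = y[h : t + 1]                         # targets    y_h, ..., y_t
    alpha, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return x[t] @ alpha

# Toy exercise: R in-sample observations, P out-of-sample forecasts.
rng = np.random.default_rng(0)
h, R, P = 12, 100, 200
T = R + P - 1
x = rng.normal(size=(T + h, 2))
y = x @ np.array([0.5, -0.2]) + rng.normal(size=T + h)   # toy data only
forecasts = [direct_forecast(y, x, t, h) for t in range(R - 1, T)]
```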
To illustrate, we provide examples of all three measures of predictive ability. Consider an unrestricted model specified as $y_{t+h} = x_t'\alpha + \varepsilon_{t+h}$, and a restricted model, $y_{t+h} = \varepsilon_{t+h}$. Let $\hat\alpha_t$ be an estimate of the regression coefficient of the unrestricted model at time $t$ using all the observations available up to time $t$, i.e.
$$\hat\alpha_t = \Big(\sum_{j=1}^{t-h} x_j x_j'\Big)^{-1} \sum_{j=1}^{t-h} x_j y_{j+h}.$$

(i) Under a quadratic loss function, the measures of absolute forecasting performance for models 1 and 2 are the squared forecast errors, $L_{t+h}(\cdot) = (y_{t+h} - x_t'\hat\alpha_t)^2$ and $L_{t+h}(\cdot) = y_{t+h}^2$, respectively.
(ii) Under the same quadratic loss function, the measure of relative forecasting performance of the models in (i) is $L_{t+h}(\cdot) = (y_{t+h} - x_t'\hat\alpha_t)^2 - y_{t+h}^2$.
(iii) An example of a regression-based predictive ability test is the test of zero mean prediction error, where, for the unrestricted model, $L_{t+h}(\cdot) = y_{t+h} - x_t'\hat\alpha_t$.

Throughout this paper, we focus on measures of relative forecasting performance. Let the two competing models be labeled 1 and 2; they may be nested or non-nested. Model 1 is characterized by parameters $\alpha$ and model 2 by parameters $\gamma$. We consider two estimation frameworks: a fixed rolling window and an expanding (or recursive) window.

2.1. Fixed rolling window case

In the fixed rolling window case, the model's parameters are estimated using samples of $R$ observations dated $t - R + 1, \ldots, t$, for $t = R, R+1, \ldots, T$, where $R < \infty$. The parameter estimates for model 1 are obtained by $\hat\alpha_{t,R} = \arg\min_a \sum_{j=t-R+1}^{t} L_j^{(1)}(a)$, where $L^{(1)}(\cdot)$ denotes the in-sample loss function for model 1; similarly, the parameters for model 2 are $\hat\gamma_{t,R} = \arg\min_g \sum_{j=t-R+1}^{t} L_j^{(2)}(g)$. At each point in time $t$, the estimation generates a sequence of $R$ in-sample fitted errors, denoted by $\{\eta_{1,j}(\hat\alpha_{t,R}), \eta_{2,j}(\hat\gamma_{t,R})\}_{j=t-R+1}^{t}$; among these $R$ fitted errors, we use the last in-sample fitted errors at time $t$, $(\eta_{1,t}(\hat\alpha_{t,R}), \eta_{2,t}(\hat\gamma_{t,R}))$, to evaluate the models' in-sample fit at time $t$, $\hat L_t^{(1)}(\hat\alpha_{t,R})$ and $\hat L_t^{(2)}(\hat\gamma_{t,R})$. For example, for the unrestricted model considered previously, $y_{t+h} = x_t'\alpha + \varepsilon_{t+h}$, under a quadratic loss we have $\hat\alpha_{t,R} = \big(\sum_{j=t-h-R+1}^{t-h} x_j x_j'\big)^{-1}\big(\sum_{j=t-h-R+1}^{t-h} x_j y_{j+h}\big)$, for $t = R, R+1, \ldots, T$. The sequence of in-sample fitted errors at time $t$ is $\{\eta_{1,j}(\hat\alpha_{t,R})\}_{j=t-R+1}^{t} = \{y_j - x_{j-h}'\hat\alpha_{t,R}\}_{j=t-R+1}^{t}$, of which we use the last in-sample fitted error, $\eta_{1,t}(\hat\alpha_{t,R}) = y_t - x_{t-h}'\hat\alpha_{t,R}$, to evaluate the in-sample loss at time $t$: $\hat L_t^{(1)}(\hat\alpha_{t,R}) \equiv (y_t - x_{t-h}'\hat\alpha_{t,R})^2$. Thus, as the rolling estimation is performed over the sample for $t = R, R+1, \ldots, T$, we collect a series of in-sample losses, $\{\hat L_t^{(1)}(\hat\alpha_{t,R}), \hat L_t^{(2)}(\hat\gamma_{t,R})\}_{t=R}^{T}$.

We consider the loss functions $\hat L_{t+h}^{(1)}(\hat\alpha_{t,R})$ and $\hat L_{t+h}^{(2)}(\hat\gamma_{t,R})$ to evaluate the out-of-sample predictive ability of direct $h$-step ahead forecasts for models 1 and 2 made at time $t$. For example, for the unrestricted model considered previously ($y_{t+h} = x_t'\alpha + \varepsilon_{t+h}$), under a quadratic loss, the out-of-sample multi-step direct forecast loss at time $t$ is $\hat L_{t+h}^{(1)}(\hat\alpha_{t,R}) \equiv (y_{t+h} - x_t'\hat\alpha_{t,R})^2$. As the rolling estimation is performed over the sample, we collect a series of out-of-sample losses, $\{\hat L_{t+h}^{(1)}(\hat\alpha_{t,R}), \hat L_{t+h}^{(2)}(\hat\gamma_{t,R})\}_{t=R}^{T}$.

The loss function used for estimation need not be the same loss function used for forecast evaluation; however, to ensure a meaningful interpretation of the models' in-sample performance as a proxy for the out-of-sample performance, we require the loss function used for estimation to be the same as the loss used for forecast evaluation. Assumption 4(a) in Section 3 will formalize this requirement.
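A minimal sketch (ours, not the authors' code) of the rolling-window bookkeeping just described: for each $t$, the model is re-estimated on the most recent $R$ pairs, and both the last in-sample squared fitted error and the $h$-step out-of-sample squared forecast error are recorded. The loop starts once a full window of $(x_j, y_{j+h})$ pairs is available, an indexing assumption on our part.

```python
import numpy as np

def rolling_losses(y, x, R, h):
    """Collect in-sample losses L̂_t and out-of-sample losses L̂_{t+h}
    for one model under quadratic loss, using rolling windows of size R."""
    T = len(y) - h
    L_in, L_out = [], []
    for t in range(R + h - 1, T):
        j = np.arange(t - h - R + 1, t - h + 1)      # estimation sample indices
        a, *_ = np.linalg.lstsq(x[j], y[j + h], rcond=None)
        L_in.append((y[t] - x[t - h] @ a) ** 2)      # last in-sample fitted error
        L_out.append((y[t + h] - x[t] @ a) ** 2)     # h-step out-of-sample error
    return np.array(L_in), np.array(L_out)
```

Differencing these two series across two competing models yields the loss differences $\hat L_t$ and $\hat L_{t+h}$ used in what follows.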
Let $\theta \equiv (\alpha', \gamma')'$ be the $(p \times 1)$ parameter vector, and let $\hat\theta_{t,R} \equiv (\hat\alpha_{t,R}', \hat\gamma_{t,R}')'$, $\hat L_{t+h}(\hat\theta_{t,R}) \equiv \hat L_{t+h}^{(1)}(\hat\alpha_{t,R}) - \hat L_{t+h}^{(2)}(\hat\gamma_{t,R})$ and $\hat L_t(\hat\theta_{t,R}) \equiv \hat L_t^{(1)}(\hat\alpha_{t,R}) - \hat L_t^{(2)}(\hat\gamma_{t,R})$. For notational simplicity, in what follows we drop the dependence on the parameters, and simply use $\hat L_{t+h}$ and $\hat L_t$ to denote $\hat L_{t+h}(\hat\theta_{t,R})$ and $\hat L_t(\hat\theta_{t,R})$, respectively.2 We make the following assumption.

Assumption 1. Let $\hat l_{t+h} \equiv (\hat L_{t+h}, \hat L_t \hat L_{t+h})'$, $t = R, \ldots, T$, and $Z_{t+h,R} \equiv \hat l_{t+h} - E(\hat l_{t+h})$.
(a) $\Omega_{roll} \equiv \lim_{T\to\infty} \mathrm{Var}\big(P^{-1/2}\sum_{t=R}^{T} Z_{t+h,R}\big)$ is positive definite;
(b) for some $r > 2$, $\|[Z_{t+h,R}, \hat L_t]\|_r < \Delta < \infty$;3
(c) $\{y_t, x_t\}$ are mixing with either $\phi$ of size $-r/2(r-1)$ or $\alpha$ of size $-r/(r-2)$;
(d) for $k \in [0, 1]$, $\lim_{T\to\infty} E[W_P(k)W_P(k)'] = kI_2$, where $W_P(k) = P^{-1/2}\,\Omega_{roll}^{-1/2}\sum_{t=R}^{R+[kP]} Z_{t+h,R}$;
(e) $P \to \infty$ as $T \to \infty$, whereas $R < \infty$ and $h < \infty$.
Remarks. Assumption 1 is useful for obtaining the limiting distribution of the statistics of interest, and provides the high-level conditions that guarantee that a Functional Central Limit Theorem holds, as in Giacomini and Rossi (2010). Assumptions 1(a)–(c) impose moment and mixing conditions to ensure that a Multivariate Invariance Principle holds (Wooldridge and White, 1988). In addition, Assumption 1(d) imposes global covariance stationarity. The assumption allows the competing models to be either nested or non-nested and to be estimated with general loss functions. This generality has the trade-off of restricting the estimation to a fixed rolling window scheme, and Assumption 1(e) ensures that the parameter estimates are obtained over an asymptotically negligible in-sample fraction of the data ($R$).4

Proposition 1 (Asymptotic Results for the Fixed Rolling Window Case). For every $k \in [0, 1]$, under Assumption 1:
$$\frac{1}{\sqrt{P}} \sum_{t=R}^{R+[kP]} \Omega_{roll}^{-1/2}\,[\hat l_{t+h} - E(\hat l_{t+h})] \Rightarrow W(k), \qquad (1)$$
where $W(\cdot)$ is a $(2 \times 1)$ standard vector Brownian Motion.

Comment. A consistent estimate of $\Omega_{roll}$ can be obtained by
$$\hat\Omega_{roll} = \sum_{i=-q(P)+1}^{q(P)-1} (1 - |i/q(P)|)\, P^{-1} \sum_{t=R}^{T} \hat l^{d}_{t+h}\, \hat l^{d\,\prime}_{t+h-i}, \qquad (2)$$
where $\hat l^{d}_{t+h} \equiv \hat l_{t+h} - P^{-1}\sum_{t=R}^{T}\hat l_{t+h}$ and $q(P)$ is a bandwidth that grows with $P$ (Newey and West, 1987).

2.2. Expanding window case

In the expanding (or recursive) window case, the forecasting environment is the same as in Section 2.1, with the following exception. The parameter estimates for model 1 are obtained by $\hat\alpha_{t,R} = \arg\min_a \sum_{j=1}^{t} L_j^{(1)}(a)$; similarly, the parameters for model 2 are $\hat\gamma_{t,R} = \arg\min_g \sum_{j=1}^{t} L_j^{(2)}(g)$. Accordingly, the first prediction is based on a parameter vector estimated using data from 1 to $R$, the second on a parameter vector estimated using data from 1 to $R+1, \ldots,$ and the last on a parameter vector estimated using data from 1 to $R+P-1 = T$. Let $\hat\theta_{t,R}$ denote the estimate of the parameter $\theta$ based on data from period $t$ and earlier. For example, for the unrestricted model considered previously, $y_{t+h} = x_t'\alpha + \varepsilon_{t+h}$, we have $\hat\alpha_{t,R} = \big(\sum_{j=1}^{t-h} x_j x_j'\big)^{-1}\sum_{j=1}^{t-h} x_j y_{j+h}$ for $t = R, R+1, \ldots, T$, $\hat L_{t+h}(\hat\alpha_{t,R}) \equiv (y_{t+h} - x_t'\hat\alpha_{t,R})^2$ and $\hat L_t(\hat\alpha_{t,R}) \equiv (y_t - x_{t-h}'\hat\alpha_{t,R})^2$. Finally, let $L_{t+h} \equiv L_{t+h}(\theta^*)$ and $L_t \equiv L_t(\theta^*)$, where $\theta^*$ is the pseudo-true parameter value. We make the following assumption.

Assumption 2. Let $\hat l_{t+h} \equiv (\hat L_{t+h}, \hat L_t \hat L_{t+h})'$ and $l_{t+h} \equiv (L_{t+h}, L_t L_{t+h})'$.
(a) $R, P \to \infty$ as $T \to \infty$, $\lim_{T\to\infty}(P/R) = \pi \in [0, \infty)$, and $h < \infty$.
(b) In some open neighborhood $N$ around $\theta^*$, and with probability one, $L_{t+h}(\theta)$ and $L_t(\theta)$ are measurable and twice continuously differentiable with respect to $\theta$. In addition, there is a constant $K < \infty$ such that for all $t$, $\sup_{\theta \in N} |\partial^2 l_t(\theta)/\partial\theta\,\partial\theta'| < M_t$ with $E(M_t) < K$.
(c) $\hat\theta_{t,R}$ satisfies $\hat\theta_{t,R} - \theta^* = J_t H_t$, where $J_t$ is $(p \times q)$, $H_t$ is $(q \times 1)$, $J_t \stackrel{a.s.}{\to} J$ with $J$ of rank $p$; $H_t = t^{-1}\sum_{s=1}^{t} h_s$ for a $(q \times 1)$ orthogonality condition vector $h_s \equiv h_s(\theta^*)$ with $E(h_s) = 0$.
(d) Let $D_{t+h} \equiv \partial l_{t+h}(\theta)/\partial\theta\,\big|_{\theta=\theta^*}$, $D \equiv E(D_{t+h})$ and $\xi_{t+h} \equiv [\mathrm{vec}(D_{t+h})', l_{t+h}', h_t', L_t]'$. Then: (i) for some $d > 1$, $\sup_t E\|\xi_{t+h}\|^{4d} < \infty$; (ii) $\xi_{t+h} - E(\xi_{t+h})$ is strong mixing, with mixing coefficients of size $-3d/(d-1)$; (iii) $\xi_{t+h} - E(\xi_{t+h})$ is covariance stationary; (iv) let $\Gamma_{ll}(j) = E(l_{t+h} - E(l_{t+h}))(l_{t+h-j} - E(l_{t+h}))'$, $\Gamma_{lh}(j) = E(l_{t+h} - E(l_{t+h}))(h_{t-j} - E(h_t))'$, $\Gamma_{hh}(j) = E(h_t - E(h_t))(h_{t-j} - E(h_t))'$, $S_{ll} = \sum_{j=-\infty}^{\infty}\Gamma_{ll}(j)$, $S_{lh} = \sum_{j=-\infty}^{\infty}\Gamma_{lh}(j)$, $S_{hh} = \sum_{j=-\infty}^{\infty}\Gamma_{hh}(j)$, and
$$S = \begin{pmatrix} S_{ll} & S_{lh}J' \\ J S_{lh}' & J S_{hh} J' \end{pmatrix}.$$
Then $S_{ll}$ is positive definite.
2 Even if we assume that the loss function used for estimation is the same as the loss function used for forecast evaluation, the different notation for the estimated in-sample and out-of-sample losses ($\hat L_t$ and $\hat L_{t+h}$) is necessary to reflect that they are evaluated at parameters estimated at different points in time.
3 Hereafter, $\|\cdot\|$ denotes the Euclidean norm.
4 Note that researchers might also be interested in a rolling window estimation case where the size of the rolling window is ''large'' relative to the sample size. In this case, researchers may apply Clark and West's (2006, 2007) test and obtain results similar to those above. However, using Clark and West's test would have the drawback of eliminating the over-fitting component, which is the focus of this paper.
Assumption 2(a) allows both $R$ and $P$ to grow as the total sample size grows; this is a special feature of the expanding window case.5 Assumption 2(b) ensures that the relevant losses are well-approximated by smooth quadratic functions in a neighborhood of the parameter vector, as in West (1996). Assumption 2(c) is specific to the recursive window estimation procedure, and requires that the parameter estimates be obtained over recursive windows of data.6 Assumptions 2(d)(i)–(iv) impose moment conditions that ensure the validity of a Central Limit Theorem, as well as covariance stationarity for technical convenience. As in West (1996), positive definiteness of $S_{ll}$ rules out the nested model case.7 Thus, for nested model comparisons, we recommend the rolling window scheme discussed in Section 2.1.
5 This also implies that the probability limit of the parameter estimates is constant, at least under the null hypothesis.
6 In our example, $J_t = \big(\frac{1}{t}\sum_{j=1}^{t-h} x_j x_j'\big)^{-1}$ and $H_t = \frac{1}{t}\sum_{j=1}^{t-h} x_j y_{j+h}$.
7 The extension of the expanding window setup to nested models would require using non-normal distributions for which the Functional Central Limit Theorem cannot easily be applied.
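Purely as an illustration of footnote 6 (our sketch, not the paper's code), the expanding-window estimate can be written in the $J_t H_t$ form of Assumption 2(c); the $1/t$ factors cancel in the product and are kept only to mirror the notation:

```python
import numpy as np

def recursive_alpha(y, x, t, h):
    """Expanding-window estimate alpha_t = J_t H_t, with
    J_t = (t^{-1} sum_{j<=t-h} x_j x_j')^{-1} and
    H_t = t^{-1} sum_{j<=t-h} x_j y_{j+h}, as in footnote 6."""
    X, Y = x[: t - h + 1], y[h : t + 1]      # all pairs available at time t
    n = t + 1                                # observations through 0-based index t
    J = np.linalg.inv(X.T @ X / n)
    H = X.T @ Y / n
    return J @ H                             # equals the OLS estimate
```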
There are two important cases where the asymptotic theory for the recursive window case simplifies; these are cases in which the estimation uncertainty vanishes asymptotically. A first case is when $\pi = 0$, which implies that estimation uncertainty is irrelevant since an arbitrarily large number of observations is used for estimating the models' parameters ($R$) relative to the number used to estimate $E(l_{t+h})$ ($P$). The second case is when $D = 0$; a leading case that ensures $D = 0$ is quadratic loss (i.e. the Mean Squared Forecast Error) with i.i.d. errors. Proposition 2 considers the case of vanishing estimation uncertainty. Proposition 5 in Appendix A provides results for the general case in which either $\pi \neq 0$ or $D \neq 0$. Both propositions are formally proved in Appendix B.

Proposition 2 (Asymptotic Results for the Recursive Window Case). For every $k \in [0, 1]$, under Assumption 2, if $\pi = 0$ or $D = 0$ then
$$\frac{1}{\sqrt{P}} \sum_{t=R}^{R+[kP]} \Omega_{rec}^{-1/2}\,[l_{t+h} - E(l_{t+h})] \Rightarrow W(k),$$
where $\Omega_{rec} = S_{ll}$, and $W(\cdot)$ is a $(2 \times 1)$ standard vector Brownian Motion.

Comment. Upon mild strengthening of the assumptions on $h_t$ as in Andrews (1991), a consistent estimate of $\Omega_{rec}$ can be obtained by
$$\hat\Omega_{rec} = \sum_{i=-q(P)+1}^{q(P)-1} (1 - |i/q(P)|)\, P^{-1} \sum_{t=R}^{T} \hat l^{d}_{t+h}\, \hat l^{d\,\prime}_{t+h-i}, \qquad (3)$$
where $\hat l^{d}_{t+h} \equiv \hat l_{t+h} - P^{-1}\sum_{t=R}^{T}\hat l_{t+h}$ and $q(P)$ is a bandwidth that grows with $P$ (Newey and West, 1987).
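Eqs. (2) and (3) are the same Bartlett-kernel (Newey–West) long-run variance estimator applied to the demeaned loss series; a minimal sketch (ours), with the bandwidth $q$ supplied by the user:

```python
import numpy as np

def omega_hat(l, q):
    """Newey-West estimate of the long-run covariance of the (P x 2)
    series l, as in Eqs. (2)-(3): Bartlett weights (1 - |i|/q) on the
    lag-i autocovariances of the demeaned series."""
    P = l.shape[0]
    ld = l - l.mean(axis=0)                  # demeaned series l^d
    omega = ld.T @ ld / P                    # lag-0 autocovariance
    for i in range(1, q):
        w = 1.0 - i / q
        gamma_i = ld[i:].T @ ld[:-i] / P     # lag-i autocovariance
        omega += w * (gamma_i + gamma_i.T)   # add lags i and -i
    return omega
```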
3. Understanding the sources of models’ forecasting performance: a decomposition
Existing forecast comparison tests, such as Diebold and Mariano (1995) and West (1996), inform the researcher only about which model forecasts best, and do not shed any light on why that is the case. Our main objective, instead, is to decompose the sources of the out-of-sample forecasting performance into uncorrelated components that have a meaningful economic interpretation and might provide constructive insights for improving models' forecasts. The out-of-sample forecasting performance of competing models can be attributed to model instability, over-fitting, and predictive content. Below we elaborate on each of these components in more detail.

We measure time variation in models' relative forecasting performance by averaging relative predictive ability over rolling windows of size $m$, as in Giacomini and Rossi (2010), where $m < P$ satisfies Assumption 3.

Assumption 3. $\lim_{T\to\infty}(m/P) \to \mu \in (0, \infty)$ as $m, P \to \infty$.

We define predictive content as the correlation between the in-sample and out-of-sample measures of fit. When the correlation is small, the in-sample measures of fit have no predictive content for the out-of-sample, and vice versa. An interesting case occurs when the correlation is strong but negative: the in-sample predictive content is then strong yet misleading for the out-of-sample. We define over-fitting as a situation in which a model fits well in-sample but loses predictive ability out-of-sample; that is, a situation in which in-sample measures of fit fail to be informative regarding the out-of-sample predictive content.

To capture predictive content and over-fitting, we consider the following regression:
$$\hat L_{t+h} = \beta \hat L_t + u_{t+h}, \quad t = R, R+1, \ldots, T. \qquad (4)$$
Let $\hat\beta \equiv \big(\frac{1}{P}\sum_{t=R}^{T}\hat L_t^2\big)^{-1}\frac{1}{P}\sum_{t=R}^{T}\hat L_t \hat L_{t+h}$ denote the OLS estimate of $\beta$ in regression (4), and let $\hat\beta\hat L_t$ and $\hat u_{t+h}$ denote the corresponding fitted values and regression errors, so that $\hat L_{t+h} = \hat\beta\hat L_t + \hat u_{t+h}$. In addition, regression (4) does not include a constant, so that the error term measures the average out-of-sample losses not explained by in-sample performance. The average Mean Square Forecast Error (MSFE) difference can then be decomposed as
$$\frac{1}{P}\sum_{t=R}^{T}\hat L_{t+h} = \hat B_P + \hat U_P, \qquad (5)$$
where $\hat B_P \equiv \hat\beta\,\frac{1}{P}\sum_{t=R}^{T}\hat L_t$ and $\hat U_P \equiv \frac{1}{P}\sum_{t=R}^{T}\hat u_{t+h}$. $\hat B_P$ can be interpreted as the component that was predictable on the basis of the in-sample relative fit of the models (predictive content), whereas $\hat U_P$ is the component that was unexpected (over-fitting). The following example provides more details on the interpretation of the two components.

Example. Let the true data generating process (DGP) be $y_{t+h} = \alpha + \varepsilon_{t+h}$, where $\varepsilon_{t+h} \sim \mathrm{i.i.d.}\,N(0, \sigma^2)$. We compare the forecasts of $y_{t+h}$ from two nested models made at time $t$, based on parameter estimates obtained via the fixed rolling window approach. The first (unrestricted) model includes a constant only, so that its forecast is $\hat\alpha_{t,R} = \frac{1}{R}\sum_{j=t-h-R+1}^{t-h} y_{j+h}$, $t = R, R+1, \ldots, T$; the second (restricted) model sets the constant to zero, so that its forecast is zero. Consider the (quadratic) forecast error loss difference, $\hat L_{t+h} \equiv \hat L_{t+h}^{(1)}(\hat\alpha_{t,R}) - \hat L_{t+h}^{(2)}(0) \equiv (y_{t+h} - \hat\alpha_{t,R})^2 - y_{t+h}^2$, and the (quadratic) in-sample loss difference $\hat L_t \equiv \hat L_t^{(1)}(\hat\alpha_{t,R}) - \hat L_t^{(2)}(0) \equiv (y_t - \hat\alpha_{t,R})^2 - y_t^2$. Let $\beta \equiv E(\hat L_{t+h}\hat L_t)/E(\hat L_t^2)$. It can be shown that8
$$\beta = \big(\alpha^4 + 4\sigma^2\alpha^2 + (4\sigma^4 + 2\sigma^2\alpha^2)/R\big)^{-1}\big(\alpha^4 - 3\sigma^4/R^2\big). \qquad (6)$$

When the models are nested, in small samples $E(\hat L_t) = -(\alpha^2 + \sigma^2/R) < 0$, as the in-sample fit of the larger model is always better than that of the smaller one. Consequently, $E(\hat B_P) = \beta E(\hat L_t) = 0$ only when $\beta = 0$. The calculations show that the numerator of $\beta$ has two distinct components: the first, $\alpha^4$, is an outcome of the mis-specification in model 2; the other, $3\sigma^4/R^2$, changes with the sample size and ''captures'' estimation uncertainty in model 1. When the two components are equal to each other, the in-sample loss differences have no predictive content for the out-of-sample. When the mis-specification component dominates, the in-sample loss differences provide information content for the out-of-sample. On the other hand, when $\beta$ is negative, the in-sample fit has predictive content for the out-of-sample, but it is misleading, being driven primarily by estimation uncertainty.

For any given value of $\beta$, $E(\hat B_P) = \beta E(\hat L_t) = -\beta(\alpha^2 + \sigma^2/R)$, where $\beta$ is defined in Eq. (6). By construction, $E(\hat U_P) = E(\hat L_{t+h}) - E(\hat B_P) = (\sigma^2/R - \alpha^2) - E(\hat B_P)$. Similar to the case of $\hat B_P$, the component designed to measure over-fitting is affected by both mis-specification and estimation uncertainty. One should note that, for $\beta > 0$, the mis-specification component affects both $E(\hat B_P)$ and $E(\hat U_P)$ in a similar direction, while estimation uncertainty moves them in opposite directions: estimation uncertainty penalizes the predictive content $\hat B_P$ and makes the unexplained component $\hat U_P$ larger.9
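As a numerical check on the example (a simulation sketch of ours, not from the paper; $h = 1$ and all names are illustrative), one can generate the DGP, form the loss differences, and recover $\hat\beta$, $\hat B_P$ and $\hat U_P$ from the no-constant regression (4):

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, sigma, R, P = 0.5, 1.0, 100, 5000
T = R + P - 1
y = alpha + sigma * rng.normal(size=T + 1)           # y_{t+1} = alpha + eps_{t+1}

L_in, L_out = [], []
for t in range(R - 1, T):
    a = y[t - R + 1 : t + 1].mean()                  # rolling-mean estimate
    L_in.append((y[t] - a) ** 2 - y[t] ** 2)         # in-sample loss difference
    L_out.append((y[t + 1] - a) ** 2 - y[t + 1] ** 2)
L_in, L_out = np.array(L_in), np.array(L_out)

beta_hat = (L_in @ L_out) / (L_in @ L_in)            # OLS slope in regression (4)
B_P = beta_hat * L_in.mean()                         # predictive content component
U_P = L_out.mean() - B_P                             # over-fitting component
# By construction L_out.mean() = B_P + U_P, as in Eq. (5); beta_hat should
# approach the population beta of Eq. (6) as P grows.
```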
8 $E(\hat L_{t+h}\hat L_t) = E\big[\big(-2\epsilon_{t+h}\sum_{j=t-h-R+1}^{t-h}\epsilon_{j+h}/R + (\sum_{j=t-h-R+1}^{t-h}\epsilon_{j+h})^2/R^2 - \alpha^2 - 2\alpha\epsilon_{t+h}\big)\big(-2\epsilon_t\sum_{j=t-h-R+1}^{t-h}\epsilon_{j+h}/R + (\sum_{j=t-h-R+1}^{t-h}\epsilon_{j+h})^2/R^2 - \alpha^2 - 2\alpha\epsilon_t\big)\big] = \alpha^4 - 3\sigma^4/R^2$, where the derivation of the last part relies on the normality and i.i.d. assumptions on the error terms, in that $E(\epsilon_t) = 0$, $E(\epsilon_t^2) = \sigma^2$, $E(\epsilon_t^3) = 0$, $E(\epsilon_t^4) = 3\sigma^4$ and, for $j > 0$, $E(\epsilon_{t+j}^2\epsilon_t^2) = \sigma^4$ and $E(\epsilon_{t+j}\epsilon_t) = 0$. In addition, it is useful to note that $E\big[(\sum_{j=1}^{t}\epsilon_j)^4\big] = tE(\epsilon_t^4) + 3t(t-1)\sigma^4 = 3t^2\sigma^4$, $E\big[(\sum_{j=1}^{t}\epsilon_j)^3\big] = 0$, and $E\big[\epsilon_t(\sum_{j=1}^{t}\epsilon_j)^3\big] = E(\epsilon_t^4) + 3(t-1)\sigma^4$. The expression for the denominator can be derived similarly.
9 Note that $E(\hat L_{t+h}) = \sigma^2/R - \alpha^2$, whereas $E(\hat L_t) = -\sigma^2/R - \alpha^2$. Thus, the same estimation uncertainty component $\sigma^2/R$ penalizes model 2 in-sample and at the same time improves model 2's performance out-of-sample (relative to model 1). This result was shown by Hansen (2008), who explores its implications for information criteria. This paper differs from Hansen (2008) in a fundamental way by focusing on a decomposition of the out-of-sample forecasting ability into separate and economically meaningful components.
To ensure a meaningful interpretation of models' in-sample performance as a proxy for the out-of-sample performance, we assume that the loss used for in-sample fit evaluation is the same as that used for out-of-sample forecast evaluation. This assumption is formalized in Assumption 4(a). Furthermore, our proposed decomposition depends on the estimation procedure. Parameters estimated in expanding windows converge to their pseudo-true values so that, in the limit, expected out-of-sample performance is measured by $\bar L_{t+h} \equiv E(L_{t+h})$ and expected in-sample performance by $\bar L_t \equiv E(L_t)$; we also define $\overline{LL}_{t,t+h} \equiv E(L_t L_{t+h})$. However, for parameters estimated in fixed rolling windows the estimation uncertainty remains asymptotically relevant, so that expected out-of-sample performance is measured by $\bar L_{t+h} \equiv E(\hat L_{t+h})$, expected in-sample performance by $\bar L_t \equiv E(\hat L_t)$, and $\overline{LL}_{t,t+h} \equiv E(\hat L_t \hat L_{t+h})$. We focus on the relevant case where the models have different in-sample performance. This assumption is formalized in Assumption 4(b); it is always satisfied in small samples for nested models, and it is trivially satisfied for measures of absolute performance.

Assumption 4. (a) For every $t$, the loss used for estimation and in-sample fit evaluation coincides with the loss used for forecast evaluation; (b) $\lim_{T\to\infty}\frac{1}{P}\sum_{t=R}^{T}\bar L_t \neq 0$.

Let $\lambda \in [\mu, 1]$. For $\tau = [\lambda P]$ we propose to decompose the out-of-sample loss function differences $\{\hat L_{t+h}\}_{t=R}^{T}$, calculated in rolling windows of size $m$, into their difference relative to the average loss, $\hat A_{\tau,P}$, an average forecast error loss expected on the basis of the in-sample performance, $\hat B_P$, and an average unexpected forecast error loss, $\hat U_P$. We thus define:
$$\hat A_{\tau,P} = \frac{1}{m}\sum_{t=R+\tau-m}^{R+\tau-1}\hat L_{t+h} - \frac{1}{P}\sum_{t=R}^{T}\hat L_{t+h}, \qquad \hat B_P = \frac{1}{P}\sum_{t=R}^{T}\hat\beta\hat L_t, \qquad \hat U_P = \frac{1}{P}\sum_{t=R}^{T}\hat u_{t+h},$$
where $\hat B_P$ and $\hat U_P$ are estimated from regression (4). The following assumption states the hypotheses that we are interested in.

Assumption 5. Let $A_{\tau,P} \equiv E(\hat A_{\tau,P})$, $B_P \equiv \beta\bar L_t$, and $U_P \equiv \bar L_{t+h} - \beta\bar L_t$. The following null hypotheses hold:
$$H_{0,A}: A_{\tau,P} = 0 \quad \text{for all } \tau = m, m+1, \ldots, P, \qquad (7)$$
$$H_{0,B}: B_P = 0, \qquad (8)$$
$$H_{0,U}: U_P = 0. \qquad (9)$$

Proposition 3 provides the decomposition for both rolling and recursive window estimation schemes.

Proposition 3 (The Decomposition). Let either (a) [Fixed Rolling Window Estimation] Assumptions 1, 3 and 4 hold; or (b) [Recursive Window Estimation] Assumptions 2–4 hold and either $D = 0$ or $\pi = 0$. Then, for $\lambda \in [\mu, 1]$ and $\tau = [\lambda P]$:
$$\frac{1}{m}\sum_{t=R+\tau-m}^{R+\tau-1}\big[\hat L_{t+h} - \bar L_{t+h}\big] = (\hat A_{\tau,P} - A_{\tau,P}) + (\hat B_P - B_P) + (\hat U_P - U_P). \qquad (10)$$
Lt +h , σB2 ≡ limT →∞ Var(P 1/2 BP ), σU2 ≡
(i,j),rec denote the (i-th, j-th) element of Ω roll and Ω rec , for Ω roll and Ω rec defined in Eq. (3). defined in Eq. (2) and for Ω Proposition 4 provides test statistics for evaluating the significance of the three components in decomposition (10). Proposition 4 (Significance Tests). Let either: (a) [Fixed Rolling Window Estimation] Assumptions 1, 3 and 4 hold, and σA2 = 2 2 Ω(1,1),roll , σB = ΦP Ω(2,2),roll ; or (b) [Recursive Window Estima(1,1),rec , (2,2),rec , tion] Assumptions 2–4 hold, and σA2 = Ω σB2 = ΦP2 Ω and either π = 0 or D = 0. In addition, let Aτ ,P , BP , UP be defined as in Proposition 3, and the tests be defined as:
ΓP
(B)
ΓP the same time improves model 2’s performance out-of-sample (relative to model 1). This result was shown by Hansen (2008), who explores its implications for information criteria. This paper differs by Hansen (2008) in a fundamental way by focusing on a decomposition of the out-of-sample forecasting ability into separate and economically meaningful components.
t =R
limT →∞ Var(P 1/2 UP ), and σA2 , σB2 , σU2 be consistent estimates of 2 2 2 (i,j),roll σA , σB and σU (such as described in Proposition 4). Also, let Ω
(A)
(10)
∑T
(U )
ΓP
≡
sup
τ =m,...,P
√ ≡
√ −1 | P σA Aτ ,P |,
P σB−1 BP ,
√ ≡
P σU−1 UP ,
where ΦP ≡ Then:
∑ T 1 P
t =R
t L
∑ T 1 P
t =R
2t L
−1
, and σU2 = σA2 − σB2 .
Table 1
Critical values for the $\Gamma_P^{(A)}$ test.

µ      α = 0.10   α = 0.05   α = 0.025   α = 0.01
0.10   9.844      10.496     11.094      11.858
0.15   7.510      8.087      8.612       9.197
0.20   6.122      6.609      7.026       7.542
0.25   5.141      5.594      5.992       6.528
0.30   4.449      4.842      5.182       5.612
0.35   3.855      4.212      4.554       4.941
0.40   3.405      3.738      4.035       4.376
0.45   3.034      3.333      3.602       3.935
0.50   2.729      2.984      3.226       3.514
0.55   2.449      2.700      2.913       3.175
0.60   2.196      2.412      2.615       2.842
0.65   1.958      2.153      2.343       2.550
0.70   1.720      1.900      2.060       2.259
0.75   1.503      1.655      1.804       1.961
0.80   1.305      1.446      1.575       1.719
0.85   1.075      1.192      1.290       1.408
0.90   0.853      0.952      1.027       1.131

Note: The table reports critical values $k_\alpha$ for the test statistic $\Gamma_P^{(A)}$ at significance levels α = 0.10, 0.05, 0.025, 0.01.
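The entries in Table 1 are, per Proposition 4(i) below, quantiles of a functional of Brownian motion (Eq. (12)); a simulation sketch (ours, not the authors' code) that approximates $k_\alpha$ on a discrete grid:

```python
import numpy as np

def k_alpha(mu, alpha, n=1000, reps=10000, seed=0):
    """Approximate the critical value solving Eq. (12): the (1-alpha)-quantile
    of sup over lambda in [mu, 1] of |[W(lambda) - W(lambda - mu)]/mu - W(1)|
    for a standard Brownian motion W, simulated on an n-point grid."""
    rng = np.random.default_rng(seed)
    m = int(mu * n)
    sups = np.empty(reps)
    for r in range(reps):
        W = np.cumsum(rng.normal(scale=np.sqrt(1.0 / n), size=n))
        lagged = np.concatenate(([0.0], W[: n - m]))   # W(lambda - mu), W(0) = 0
        sups[r] = np.abs((W[m - 1 :] - lagged) / mu - W[-1]).max()
    return np.quantile(sups, 1 - alpha)

# k_alpha(0.50, 0.05) should be close to the 2.984 reported in Table 1.
```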
(i) Under $H_{0,A}$,
$$\sqrt{P}\,\hat\sigma_A^{-1}\hat A_{\tau,P} \Rightarrow \frac{1}{\mu}\big[W_1(\lambda) - W_1(\lambda - \mu)\big] - W_1(1), \qquad (11)$$
for $\lambda \in [\mu, 1]$ and $\tau = [\lambda P]$, where $W_1(\cdot)$ is a standard univariate Brownian motion. The critical values for significance level $\alpha$ are $\pm k_\alpha$, where $k_\alpha$ solves
$$\Pr\Big\{\sup_{\lambda\in[\mu,1]}\big|[W_1(\lambda) - W_1(\lambda - \mu)]/\mu - W_1(1)\big| > k_\alpha\Big\} = \alpha. \qquad (12)$$
Table 1 reports the critical values $k_\alpha$ for typical values of $\alpha$.
(ii) Under $H_{0,B}$: $\Gamma_P^{(B)} \Rightarrow N(0, 1)$; under $H_{0,U}$: $\Gamma_P^{(U)} \Rightarrow N(0, 1)$.

The null hypothesis of no time variation ($H_{0,A}$) is rejected when $\Gamma_P^{(A)} > k_\alpha$; the null hypothesis of no predictive content ($H_{0,B}$) is rejected when $|\Gamma_P^{(B)}| > z_{\alpha/2}$, where $z_{\alpha/2}$ is the $(\alpha/2)$-th percentile of a standard normal; the null hypothesis of no over-fitting ($H_{0,U}$) is rejected when $|\Gamma_P^{(U)}| > z_{\alpha/2}$.

Appendix B provides the formal proof of Proposition 4. In addition, Proposition 7 in Appendix A generalizes the recursive estimation case to either $\pi \neq 0$ or $D \neq 0$.

5. Monte Carlo analysis

The objective of this section is twofold. First, we evaluate the performance of the proposed method in small samples; second, we examine the role of the $\hat A_{\tau,P}$, $\hat B_P$, and $\hat U_P$ components in the proposed decomposition. The Monte Carlo analysis focuses on the rolling window scheme used in the empirical application. The number of Monte Carlo replications is 5000.

We consider two Data Generating Processes (DGPs): the first is the simple example discussed in Section 3; the second is tailored to match the empirical properties of the Canadian exchange rate and money differential data considered in Section 6. Let
$$y_{t+h} = \alpha_t x_t + e_{t+h}, \quad t = 1, 2, \ldots, T, \qquad (13)$$
where $h = 1$ and either: (i) (IID) $e_t \sim \mathrm{i.i.d.}\,N(0, \sigma_\varepsilon^2)$, $\sigma_\varepsilon^2 = 1$, $x_t = 1$; or (ii) (Serial Correlation) $e_t = \rho e_{t-1} + \varepsilon_t$, $\rho = -0.0073$, $\varepsilon_t \sim \mathrm{i.i.d.}\,N(0, \sigma_\varepsilon^2)$, $\sigma_\varepsilon^2 = 2.6363$, $x_t = b_1 + \sum_{s=2}^{6} b_s x_{t-s+1} + v_t$, $v_t \sim \mathrm{i.i.d.}\,N(0, 1)$ independent of $\varepsilon_t$, and $b_1 = 0.1409$, $b_2 = -0.1158$, $b_3 = 0.1059$, $b_4 = 0.0957$, $b_5 = 0.0089$, $b_6 = 0.1412$.

We compare the following two nested models' forecasts for $y_{t+h}$:
$$\text{Model 1 forecast: } \hat\alpha_t x_t; \qquad \text{Model 2 forecast: } 0,$$
where model 1 is estimated by OLS in rolling windows of fixed size, $\hat\alpha_t = \big(\sum_{j=t-h-R+1}^{t-h} x_j^2\big)^{-1}\big(\sum_{j=t-h-R+1}^{t-h} x_j y_{j+h}\big)$ for $t = R, \ldots, T$.
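A data-generating sketch (ours, not the authors' code) for the two designs in Eq. (13), using the AR coefficients listed above; the function name and defaults are illustrative:

```python
import numpy as np

def simulate_dgp(T, alpha_path, serial=False, seed=0):
    """Generate (y, x) from Eq. (13) with h = 1: y_{t+1} = alpha_t x_t + e_{t+1}.
    serial=False: e ~ iid N(0,1) and x_t = 1; serial=True: AR(1) errors and
    the autoregressive x_t process with the coefficients reported in the text."""
    rng = np.random.default_rng(seed)
    if not serial:
        x = np.ones(T)
        e = rng.normal(size=T)
    else:
        rho, s2 = -0.0073, 2.6363
        b = np.array([0.1409, -0.1158, 0.1059, 0.0957, 0.0089, 0.1412])
        eps = rng.normal(scale=np.sqrt(s2), size=T)
        v = rng.normal(size=T)
        e, x = np.zeros(T), np.zeros(T)
        for t in range(T):
            e[t] = (rho * e[t - 1] if t > 0 else 0.0) + eps[t]
            lags = np.array([x[t - s + 1] if t - s + 1 >= 0 else 0.0
                             for s in range(2, 7)])
            x[t] = b[0] + b[1:] @ lags + v[t]
    y = np.full(T, np.nan)
    y[1:] = alpha_path[:-1] * x[:-1] + e[1:]   # y_{t+1} = alpha_t x_t + e_{t+1}
    return y, x

# DGP 2 example: time-varying coefficient path alpha_t = a + b cos(1.5 pi t/T)(1 - t/T)
T = 400
t = np.arange(T)
alpha_path = 0.1 + 0.5 * np.cos(1.5 * np.pi * t / T) * (1 - t / T)
y, x = simulate_dgp(T, alpha_path, serial=True)
```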
We consider the following cases. The first DGP (DGP 1) is used to evaluate the size properties of our procedures in small samples. We let $\alpha_t = \alpha$, where: (i) for the IID case, $\alpha = \sigma_e R^{-1/2}$ satisfies $H_{0,A}$, $\alpha = \sigma_e(3R^{-2})^{1/4}$ satisfies $H_{0,B}$, and $\alpha = \frac{1}{R}\sigma_e\big(\sqrt{R^2 - 3R + 1} + 1\big)^{1/2}$ satisfies $H_{0,U}$;10 (ii) for the correlated case, the values of $\alpha \in \mathbb{R}$ satisfying the null hypotheses were calculated via Monte Carlo approximations, with $\sigma_e^2 = \sigma_\varepsilon^2/(1 - \rho^2)$.11

A second DGP (DGP 2) evaluates the power of our procedures in the presence of time variation in the parameters: we let $\alpha_t = \alpha + b\cos(1.5\pi t/T)(1 - t/T)$, where $b = \{0, 0.1, \ldots, 1\}$. A third DGP (DGP 3) evaluates the power of our tests against stronger predictive content: we let $\alpha_t = \alpha + b$, for $b = \{0, 0.1, \ldots, 1\}$. Finally, a fourth DGP (DGP 4) evaluates our procedures against over-fitting: we let model 1 include $(p - 1)$ redundant regressors, so that its forecast for $y_{t+h}$ is $\hat\alpha_t x_t + \sum_{s=1}^{p-1}\hat\gamma_{s+1,t} x_{s,t}$, where $x_{1,t}, \ldots, x_{p-1,t}$ are $(p - 1)$ independent standard normal random variables, whereas the true DGP remains (13) with $\alpha_t = \alpha$, and $\alpha$ takes the same values as in DGP 1.

The results of the Monte Carlo simulations are reported in Tables 2–5. In all tables, Panel A reports results for the IID case and Panel B reports results for the serial correlation case. Asymptotic variances are estimated with Newey and West's (1987) procedure, Eq. (2), with $q(P) = 1$ in the IID case and $q(P) = 2$ in the serial correlation case.

First, we evaluate the small sample properties of our procedure. Table 2 reports empirical rejection frequencies for DGP 1 for the tests described in Proposition 4, considering a variety of out-of-sample ($P$) and estimation window ($R$) sizes; the tests have a nominal level of 0.05, and $m = 100$. Under DGP 1, $\hat A_{\tau,P}$, $\hat B_P$, and $\hat U_P$ should all be statistically insignificantly different from zero, and Table 2 shows that the rejection frequencies of our tests are indeed close to the nominal level, even in small samples, although serial correlation introduces mild size distortions.

Second, we study the significance of each component in DGPs 2–4. DGP 2 allows the parameter to change over time in a way that, as $b$ increases, instabilities become more important. Table 3 shows that the $\Gamma_P^{(A)}$ test has power against instabilities. The $\Gamma_P^{(B)}$ and $\Gamma_P^{(U)}$ tests are not designed to detect instabilities, and therefore their empirical rejection rates are close to the nominal size.
10 The values of $\alpha$ satisfying the null hypotheses can be derived from the example in Section 3.
11 In detail, let $\delta_t = \big(\sum_{j=t-h-R+1}^{t-h} x_j^2\big)^{-1}\sum_{j=t-h-R+1}^{t-h} x_j\varepsilon_{j+h}$. Then $\alpha = \sqrt{E(\delta_t^2)}$ is the value that satisfies $H_{0,A}$ when $\rho$ is small, where the expectation is approximated via Monte Carlo simulations. Similarly, $H_{0,B}$ sets $\alpha = \big[-\frac{1}{2A}\big(B - \sqrt{B^2 - 4AD}\big)\big]^{1/2}$, where $A = E(x_t^2 x_{t-h}^2)$, $B = 2[E(\delta_t e_t x_{t-h} x_t^2) - E(\delta_t^2 x_t^2 x_{t-h}^2)]$, $C = -2E(\delta_t^2 e_t x_{t-h} x_t^2) = 0$, and $D = E(\delta_t^4 x_t^2 x_{t-h}^2) - 2E(\delta_t^3 e_t x_{t-h} x_t^2)$. $H_{0,U}$ instead sets $\alpha \in \mathbb{R}$ such that $G\alpha^6 + H\alpha^4 + L\alpha^2 + M = 0$, where $G = E(x_{t-h}^2)E(x_t^2 x_{t-h}^2) - E(x_t^2)E(x_{t-h}^4)$; $H = 2E(x_{t-h}^2)[E(\delta_t e_t x_{t-h} x_t^2) - E(\delta_t^2 x_t^2 x_{t-h}^2)] - 2E(x_t^2)[2E(x_{t-h}^3 e_t \delta_t) - E(x_{t-h}^4 \delta_t^2) + 2E(e_t^2 x_{t-h}^2)] + E(x_t^2 x_{t-h}^2)[2E(e_t x_{t-h}\delta_t) - E(x_{t-h}^2\delta_t^2)] + E(x_{t-h}^4)E(\delta_t^2 x_t^2)$; $L = [2E(\delta_t e_t x_{t-h} x_t^2) - 2E(\delta_t^2 x_t^2 x_{t-h}^2)][2E(e_t x_{t-h}\delta_t) - E(x_{t-h}^2\delta_t^2)] + E(x_{t-h}^2)[E(\delta_t^4 x_t^2 x_{t-h}^2) - 2E(\delta_t^3 e_t x_{t-h} x_t^2) - 4E(e_t^2 x_{t-h}^2\delta_t^2) + 4E(x_{t-h}^3 e_t\delta_t^3) - E(x_{t-h}^4\delta_t^4)] + E(\delta_t^2 x_t^2)[4E(x_{t-h}^3 e_t\delta_t) - 2E(x_{t-h}^4\delta_t^2) + 4E(e_t^2 x_{t-h}^2)]$; and $M = E(\delta_t^2 x_t^2)[4E(e_t^2 x_{t-h}^2\delta_t^2) - 4E(x_{t-h}^3 e_t\delta_t^3) + E(x_{t-h}^4\delta_t^4)] + [E(\delta_t^4 x_t^2 x_{t-h}^2) - 2E(\delta_t^3 e_t x_{t-h} x_t^2)][2E(e_t x_{t-h}\delta_t) - E(x_{t-h}^2\delta_t^2)]$. We used 5000 replications.
Table 2
DGP 1: size results.

            Panel A. IID                   Panel B. Serial corr.
R     P     Γ_P^(A)  Γ_P^(B)  Γ_P^(U)      Γ_P^(A)  Γ_P^(B)  Γ_P^(U)
20    150   0.02     0.07     0.03         0.02     0.08     0.04
20    200   0.01     0.06     0.03         0.01     0.08     0.04
20    300   0.01     0.05     0.03         0.01     0.07     0.03
50    150   0.03     0.07     0.03         0.02     0.08     0.04
50    200   0.02     0.06     0.04         0.01     0.08     0.04
50    300   0.03     0.06     0.03         0.02     0.08     0.03
100   150   0.04     0.06     0.05         0.03     0.09     0.04
100   200   0.03     0.07     0.04         0.02     0.08     0.04
100   300   0.03     0.06     0.04         0.03     0.07     0.04
200   150   0.04     0.06     0.06         0.03     0.08     0.05
200   200   0.03     0.06     0.05         0.03     0.08     0.05
200   300   0.04     0.06     0.04         0.04     0.07     0.04

Note: The table reports empirical rejection frequencies of the test statistics Γ_P^(A), Γ_P^(B), Γ_P^(U) for various window and sample sizes (see DGP 1 in Section 5 for details). m = 100. Nominal size is 0.05.

Table 3
DGP 2: time variation case.

            Panel A. IID                   Panel B. Serial corr.
b           Γ_P^(A)  Γ_P^(B)  Γ_P^(U)      Γ_P^(A)  Γ_P^(B)  Γ_P^(U)
0           0.07     0.06     0.04         0.06     0.07     0.04
0.1         0.06     0.06     0.05         0.06     0.07     0.04
0.2         0.07     0.07     0.05         0.06     0.07     0.05
0.3         0.08     0.07     0.06         0.06     0.07     0.05
0.4         0.10     0.07     0.06         0.06     0.07     0.06
0.5         0.14     0.07     0.07         0.07     0.08     0.06
0.6         0.19     0.07     0.07         0.09     0.08     0.06
0.7         0.23     0.07     0.08         0.10     0.08     0.06
0.8         0.28     0.07     0.08         0.13     0.08     0.07
0.9         0.35     0.07     0.09         0.15     0.08     0.07
1           0.41     0.07     0.09         0.18     0.08     0.07

Note: The table reports empirical rejection frequencies of the test statistics Γ_P^(A), Γ_P^(B), Γ_P^(U) in the presence of time variation in the relative performance (see DGP 2 in Section 5). R = 100, P = 300, m = 60. The nominal size is 0.05.

Table 4
DGP 3: stronger predictive content case.

            Panel A. IID                   Panel B. Serial corr.
b           Γ_P^(A)  Γ_P^(B)  Γ_P^(U)      Γ_P^(A)  Γ_P^(B)  Γ_P^(U)
0           0.03     0.06     0.04         0.03     0.07     0.04
0.1         0.05     0.06     0.11         0.04     0.07     0.05
0.2         0.05     0.07     0.58         0.04     0.07     0.20
0.3         0.04     0.10     0.94         0.04     0.07     0.50
0.4         0.04     0.16     1            0.04     0.08     0.79
0.5         0.04     0.29     1            0.04     0.11     0.95
0.6         0.04     0.47     1            0.04     0.14     0.99
0.7         0.04     0.68     1            0.04     0.18     1
0.8         0.04     0.86     1            0.04     0.24     1
0.9         0.04     0.96     1            0.04     0.32     1
1           0.04     0.99     1            0.04     0.40     1

Note: The table reports empirical rejection frequencies of the test statistics Γ_P^(A), Γ_P^(B), Γ_P^(U) in the case of an increasingly stronger predictive content of the explanatory variable (see DGP 3 in Section 5). R = 100, P = 300, m = 100. Nominal size is 0.05.
DGP 3 is a situation in which the parameters are constant and the information content of model 1 becomes progressively better as $b$ increases (the model's performance equals its competitor's, in expectation, when $b = 0$). Since the parameters are constant and there are no other instabilities, the $\hat A_{\tau,P}$ component should not be significantly different from zero, whereas the $\hat B_P$ and $\hat U_P$ components should become different from zero when $b$ is sufficiently different from zero. These predictions are supported by the Monte Carlo results in Table 4. DGP 4 is a situation in which model 1 includes an increasing number of irrelevant regressors ($p$). Table 5 shows that, as $p$ increases, the estimation uncertainty caused by the increase in the
Table 5
DGP 4: over-fitting case.

            Panel A. IID                   Panel B. Serial corr.
p           Γ_P^(A)  Γ_P^(B)  Γ_P^(U)      Γ_P^(A)  Γ_P^(B)  Γ_P^(U)
0           0.03     0.06     0.04         0.03     0.07     0.04
1           0.03     0.06     0.08         0.02     0.06     0.08
2           0.02     0.06     0.15         0.02     0.07     0.14
5           0.02     0.07     0.41         0.01     0.07     0.42
10          0.01     0.08     0.80         0.01     0.09     0.80
15          0.01     0.12     0.95         0.01     0.13     0.95
20          0.02     0.17     1            0.01     0.19     0.99
25          0.01     0.24     1            0.01     0.27     1
30          0.02     0.34     1            0.02     0.37     1
35          0.02     0.46     1            0.02     0.49     1
40          0.02     0.60     1            0.02     0.63     1

Note: The table reports empirical rejection frequencies of the test statistics Γ_P^(A), Γ_P^(B), Γ_P^(U) in the presence of over-fitting (see DGP 4 in Section 5), where p is the number of redundant regressors included in the largest model. R = 100, P = 200, m = 100. Nominal size is 0.05.
number of parameters starts penalizing model 1: its out-of-sample performance relative to its in-sample predictive content worsens, and $\hat B_P$ and $\hat U_P$ become significantly different from zero. On the other hand, $\hat A_{\tau,P}$ is not significantly different from zero, as there is no time variation in the models' relative loss differentials.12

6. The relationship between fundamentals and exchange rate fluctuations

This section analyzes the link between macroeconomic fundamentals and nominal exchange rate fluctuations using the new tools proposed in this paper. Explaining and forecasting nominal exchange rates has long been a struggle in the international finance literature. Since Meese and Rogoff (1983a,b) first established that the random walk generates the best exchange rate forecasts, the literature has yet to find an economic model that can consistently produce good in-sample fit and outperform a random walk in out-of-sample forecasting, at least at short to medium horizons (see e.g. Engel et al., 2008). In their papers, Meese and Rogoff (1983a,b) conjectured that sampling error, model mis-specification and instabilities are possible explanations for the poor forecasting performance of the economic models. We therefore apply the methodology presented in Section 2 to better understand why the economic models' performance is poor.

We reconsider the forecasting relationship between exchange rates and economic fundamentals in a multivariate regression with growth rate differentials of the following country-specific variables relative to their US counterparts: money supply ($M_t$), the industrial production index ($IPI_t$), the unemployment rate ($UR_t$), and the lagged interest rate ($R_{t-1}$), in addition to the growth rate of oil prices (whose level is denoted by $OP_t$). In addition, we separately consider the predictive ability of the commodity price index (CP) growth rate for the Canadian exchange rate. We focus on monthly data from 1975:9 to 2008:9 for a few industrialized countries: Switzerland, the United Kingdom, Canada, Japan, and Germany. The data are collected from the IFS, OECD, and Datastream, as well as country-specific sources; see Appendix C for details.

We compare one-step ahead forecasts for the following models. Let $y_t$ denote the growth rate of the exchange rate at time $t$, $y_t = \ln(S_t/S_{t-1})$. The ''economic'' model is specified as:
$$\text{Model 1}: \quad y_t = \alpha' x_t + \epsilon_{1,t}, \qquad (14)$$
where $x_t$ is the vector containing $\ln(M_t/M_{t-1})$, $\ln(IPI_t/IPI_{t-1})$, $\ln(UR_t/UR_{t-1})$, $\ln(R_{t-1}/R_{t-2})$, and $\ln(OP_t/OP_{t-1})$, and $\epsilon_{1,t}$ is the error term.

12 Unreported results show that the magnitude of the correlation among the three components is very small.
In the case of commodity prices, $x_t$ contains only the growth rate of the commodity price index. The benchmark is a simple random walk:
$$\text{Model 2}: \quad y_t = \varepsilon_{2,t}, \qquad (15)$$
where $\varepsilon_{2,t}$ is the error term.

First, we follow Bacchetta et al. (2010) and show how the relative forecasting performance of the competing models is affected by the choice of the rolling estimation window size $R$. Let $R = 40, \ldots, 196$. Given that our total sample size is fixed ($T = 396$) and the estimation window size varies, the out-of-sample period $P(R) = T + h - R$ is not constant throughout the exercise. Let $\bar P = \min P(R) = 200$ denote the minimum common out-of-sample period across the various estimation window sizes. One could then proceed in two ways. One way is to average the first $\bar P$ out-of-sample forecast losses, $\frac{1}{\bar P}\sum_{t=R}^{R+\bar P}\hat L_{t+h}$. Alternatively, one can average the last $\bar P$ out-of-sample forecast losses, $\frac{1}{\bar P}\sum_{t=T-\bar P+1}^{T}\hat L_{t+h}$. The difference is that in the latter case the out-of-sample forecasting period is the same despite the differences in estimation window size, which allows for a fair comparison of the models over the same forecasting period.

Figs. 1 and 2 show the average out-of-sample forecasting performance of the economic model in Eq. (14) relative to the benchmark, Eq. (15). The horizontal axes report $R$; the vertical axes report the ratio of the mean square forecast errors (MSFEs). Fig. 1 depicts $MSFE_{Model}/MSFE_{RW}$, where $MSFE_{Model} = \frac{1}{\bar P}\sum_{t=R}^{R+\bar P}\epsilon_{1,t+h}^2(\hat\alpha_{t,R})$ and $MSFE_{RW} = \frac{1}{\bar P}\sum_{t=R}^{R+\bar P}\varepsilon_{2,t+h}^2$, whereas Fig. 2 depicts the ratio of $MSFE_{Model} = \frac{1}{\bar P}\sum_{t=T-\bar P+1}^{T}\epsilon_{1,t+h}^2(\hat\alpha_{t,R})$ to $MSFE_{RW} = \frac{1}{\bar P}\sum_{t=T-\bar P+1}^{T}\varepsilon_{2,t+h}^2$.
United Kingdom
∑R+P t =R
than one, then the economic model is performing worse than the benchmark on average. The figures show that the forecasting performance of the economic model is inferior to that of the random walk for all the countries except Canada. However, as the estimation window size increases, the forecasting performance of the models improves. The degree of improvement deteriorates when the models are compared over the same out-of-sample period, as shown in Fig. 2.13 For example, in the case of Japan, the model’s average out-of-sample forecasting performance is similar to that of the random walk starting at R = 110 when compared over the same out-of-sample forecast periods, while otherwise (as shown in Fig. 1), its performance becomes similar only for R = 200. Fig. 1 reflects Bacchetta et al. (2010) findings that the poor performance of economic models is mostly attributed to overfitting, as opposed to parameter instability. Fig. 2 uncovers instead that the choice of the window is not crucial, so that over-fitting is a concern only when the window is too small. Next, we decompose the difference between the MSFEs of the two models (14) and (15) calculated over rolling windows of size m = 100 into the components Aτ ,P , BP , and UP , as described in the Section 3. Negative MSFE differences imply that the economic model (14) is better than the benchmark model (15).14 In addition to the relative forecasting performance of one-step ahead forecasts, we also consider one-year-ahead forecasts in a direct multistep forecasting exercise. More specifically, we consider the following ‘‘economic’’ and benchmark models: Model 1—multistep: yt +h = α(L)xt + ϵ1,t +h
(16)
13 Except for Germany, whose time series is heavily affected by the adoption of the Euro. 14 The size of the window is chosen to strike a balance between the size of the out-of-sample period (P) and the total sample size in the databases of the various countries. To make the results comparable across countries, we keep the size of the window constant across countries and set it equal to m = 100, which results in a sequence of 186 out-of-sample rolling loss differentials for each country.
One-month ahead
Canada
DMW
1.110 2.375
3.800
2.704*
−0.658
1.077 2.095*
1.788 2.593*
3.166
4.595*
(A)
DMW (A)
DMW (A)
ΓP (B) ΓP (U ) ΓP Japan
DMW (A)
ΓP (B) ΓP (U ) ΓP Germany
Commodity prices
One-year ahead multistep
ΓP (B) ΓP (U ) ΓP ΓP (B) ΓP (U ) ΓP
∑R+P t =R
165
DMW (A)
1.420
0.600
1.362
2.074* 0.335
2.283* −0.831
3.923
4.908*
2.594*
−2.167*
0.279 1.409
1.029 0.677
2.541
3.222
2.034*
−1.861
1.251 1.909
1.677 1.290
ΓP (B) ΓP (U ) ΓP
2.188
1.969
−2.247*
0.109
1.945
DMW
−1.116
1.285 1.420
ΓP (B) ΓP (U ) ΓP
2.842
2.300
−2.241* −1.352
−1.459
(A)
1.656 (A)
( B)
(U )
Note. The table reports the estimated values of the statistics ΓP , ΓP , ΓP . DMW denotes the Diebold and Mariano (1995) and West (1996) test statistic. * Denotes significance at the 5% level. Significance of the DMW test follows from Giacomini and White’s (2006) critical values. The results are based on window sizes R = m = 100.
where $y_{t+h}$ is the $h$-period-ahead rate of growth of the exchange rate at time $t$, defined by $y_{t+h} = [\ln(S_{t+h}/S_t)]/h$, and $x_t$ is as defined previously. $\alpha(L)$ is a lag polynomial such that $\alpha(L)x_t = \sum_{j=1}^{p}\alpha_j x_{t-j+1}$, where $p$ is selected recursively by BIC, and $\epsilon_{1,t+h}$ is the $h$-step-ahead error term. The benchmark model is the random walk:

Model 2—multistep: $y_{t+h} = \varepsilon_{2,t+h}$, \qquad (17)
where $\varepsilon_{2,t+h}$ is the $h$-step-ahead error term. We focus on one-year-ahead forecasts by setting h = 12 months. Fig. 3 plots the estimated values of $A_{\tau,P}$, $B_P$ and $U_P$ for the decomposition in Proposition 3 for one-step-ahead forecasts (Eqs. (14) and (15)), and Fig. 4 plots the same decomposition for multi-step-ahead forecasts (Eqs. (16) and (17)). In addition, the first column of Table 6 reports the test statistics for assessing the significance of each of the three components, $\Gamma_P^{(A)}$, $\Gamma_P^{(B)}$ and $\Gamma_P^{(U)}$, as well as the Diebold and Mariano (1995) and West (1996) test, labeled ``DMW''. Overall, according to the DMW test, there is almost no evidence that economic models forecast exchange rates significantly better than the random walk benchmark. This is the well-known ``Meese and Rogoff puzzle''. It is interesting, however, to look at our decomposition to understand the causes of the poor forecasting ability of the economic models. Figs. 3 and 4 show empirical evidence of time variation in $A_{\tau,P}$, signaling possible instability in the relative forecasting performance of the models. Table 6 shows that such instability is statistically insignificant for one-month-ahead forecasts across the countries. Instabilities, however, become statistically significant for one-year-ahead forecasts in some countries. The figures uncover that the economic model was forecasting significantly better than the benchmark in the late 2000s for Canada and in the early 1990s for the UK. For the one-month-ahead forecasts, the $B_P$ component is mostly positive, except for the case of Germany. In addition, the test statistic indicates that the component is statistically significant for Switzerland, Canada, Japan and Germany. We conclude that the lack of out-of-sample predictive ability is related to a lack of in-sample predictive content. For Germany, even though the predictive content is statistically significant, it is misleading (since $B_P < 0$). For the UK, instead, the poor forecasting performance of the economic model is attributed to over-fitting. As we move to one-year-ahead forecasts, the evidence of predictive content becomes weaker. Interestingly, for Canada, there is statistical evidence in favor of predictive content when we forecast with the large economic model. We conclude that lack of predictive content is the major explanation for the lack of one-month-ahead forecasting ability of the economic models, whereas time variation is mainly responsible for the lack of one-year-ahead forecasting ability.

Fig. 1. Out-of-sample fit in data: $\mathrm{MSFE}_{\mathrm{Model}}/\mathrm{MSFE}_{\mathrm{RW}}$, P = 200, non-overlapping. Note: The figures report the MSFE of one-month-ahead exchange rate forecasts for each country from the economic model relative to a random walk benchmark, as a function of the estimation window (R), based on the first P = 200 out-of-sample periods following the estimation period.
Fig. 2. Out-of-sample fit in data: $\mathrm{MSFE}_{\mathrm{Model}}/\mathrm{MSFE}_{\mathrm{RW}}$, P = 200, overlapping. Note: The figures report the MSFE of one-month-ahead exchange rate forecasts for each country from the economic model relative to a random walk benchmark, as a function of the estimation window (R), based on the last P = 200 out-of-sample periods, which are overlapping for estimation windows of different sizes.
Fig. 3. Decomposition for one-step-ahead forecasts. Note: The figures report $A_{\tau,P}$, $B_P$, and $U_P$ and 5% significance bands for $A_{\tau,P}$. The results are based on window sizes R = m = 100.
Fig. 4. Decomposition for one-year-ahead direct multistep forecasts. Note: The figures report $A_{\tau,P}$, $B_P$, and $U_P$ and 5% significance bands for $A_{\tau,P}$. The results are based on window sizes R = m = 100.
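For concreteness, here is a minimal sketch of the DMW statistic used throughout Table 6, with the long-run variance of the loss differential estimated by a Newey–West (Bartlett-kernel) weighted sum of autocovariances. This is our illustration under simplifying assumptions, not the authors' code; the two loss series are supplied by the reader.

```python
import numpy as np

def dmw_statistic(loss_model, loss_bench, bandwidth=None):
    """Diebold-Mariano-West statistic for equal predictive ability.
    loss_model, loss_bench: (P,) arrays of out-of-sample losses
    (e.g., squared forecast errors). Negative values favor the model."""
    d = np.asarray(loss_model) - np.asarray(loss_bench)
    P = d.size
    q = bandwidth if bandwidth is not None else int(np.floor(P ** (1 / 3)))
    d_c = d - d.mean()
    # Newey-West estimate of the long-run variance of the loss differential
    lrv = np.mean(d_c ** 2)
    for s in range(1, q + 1):
        w = 1.0 - s / (q + 1)
        lrv += 2.0 * w * np.mean(d_c[s:] * d_c[:-s])
    return d.mean() / np.sqrt(lrv / P)
```

With a fixed rolling estimation window, as in Giacomini and White's (2006) framework, the resulting statistic can be compared with standard normal critical values.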
7. Conclusions

This paper proposes a new decomposition of measures of relative out-of-sample predictive ability into uncorrelated components related to instabilities, in-sample fit, and over-fitting. In addition, the paper provides tests for assessing the significance of each of the components. The methods proposed in this paper have the advantage of identifying the sources of a model's superior forecasting performance and might provide valuable information for improving forecasting models.

Acknowledgements

We thank the editors, two anonymous referees, and participants of the 2008 Midwest Econometrics Group, the EMSG at Duke University, and the 2010 NBER-NSF Time Series Conference for comments. Barbara Rossi gratefully acknowledges support by NSF grant 0647627. The views expressed in this paper are those of the authors. No responsibility should be attributed to the Bank of Canada.

Appendix A. Additional theoretical results

This appendix contains theoretical propositions that extend the recursive window estimation results to situations where $\pi \neq 0$ or $D \neq 0$.

Proposition 5 (Asymptotic Results for the General Recursive Window Case). For every $k \in [0,1]$, under Assumption 2: (a) if $\pi = 0$ then
$$\frac{1}{\sqrt{P}}\sum_{t=R}^{R+[kP]}\Omega_{rec}^{-1/2}\,[l_{t+h} - E(l_{t+h})] \Rightarrow W(k),$$
where $\Omega_{rec} = S_{ll}$; and (b) if $S$ is p.d. then
$$\frac{1}{\sqrt{P}}\sum_{t=R}^{R+[kP]}\Omega(k)_{rec}^{-1/2}\,[l_{t+h} - E(l_{t+h})] \Rightarrow W(k),$$
$$\Omega(k)_{rec} \equiv \begin{bmatrix} I & D \end{bmatrix}\begin{bmatrix} S_{ll} & \Upsilon S_{lh}J' \\ \Upsilon J S_{lh}' & 2\Upsilon J S_{hh}J' \end{bmatrix}\begin{bmatrix} I \\ D' \end{bmatrix}, \qquad (18)$$
and $W(\cdot)$ is a $(2\times 1)$ standard vector Brownian motion, $\Upsilon \equiv 1 - \frac{\ln(1+k\pi)}{k\pi}$, $k\pi \in [0,\pi]$.

Comment to Proposition 5. Upon mild strengthening of the assumptions on $h_t$ as in Andrews (1991), a consistent estimate of $\Omega(k)_{rec}$ can be obtained by
$$\widehat{\Omega}(k)_{rec} = \begin{bmatrix} I & \widehat{D} \end{bmatrix}\begin{bmatrix} \widehat{S}_{ll} & \Upsilon\widehat{S}_{lh}\widehat{J}' \\ \Upsilon\widehat{J}\widehat{S}_{lh}' & 2\Upsilon\widehat{J}\widehat{S}_{hh}\widehat{J}' \end{bmatrix}\begin{bmatrix} I \\ \widehat{D}' \end{bmatrix}, \qquad (19)$$
where
$$\begin{bmatrix} \widehat{S}_{ll} & \widehat{S}_{lh} \\ \widehat{S}_{hl} & \widehat{S}_{hh} \end{bmatrix} = \sum_{i=-q(P)+1}^{q(P)-1}\left(1 - \left|\frac{i}{q(P)}\right|\right)P^{-1}\sum_{t=R}^{T}\widehat{\varsigma}_{t+h}\,\widehat{\varsigma}_{t+h-i}',$$
$\widehat{\varsigma}_{t+h} = (\,\widehat{l}_{t+h},\,\widehat{h}_t')'$, $\widehat{D} = P^{-1}\sum_{t=R}^{T}\left.\frac{\partial l_{t+h}}{\partial\theta}\right|_{\theta=\widehat{\theta}_{t,R}}$, $\widehat{J} = J_T$, and $q(P)$ is a bandwidth that grows with $P$ (Newey and West, 1987).

Proposition 6 (The Decomposition: General Expanding Window Estimation). Let Assumptions 2–4 hold, $\lambda \in [\mu,1]$, $\Omega(\lambda)_{(i,j),rec}^{-1/2}$ denote the $(i,j)$-th element of $\Omega(\lambda)_{rec}^{-1/2}$, defined in Eq. (19), and for $\tau = [\lambda P]$:
$$\frac{1}{m}\left[\sum_{t=R}^{R+\tau-1}\Omega(\lambda)_{(1,1),rec}^{-1/2}\widehat{L}_{t+h} - \sum_{t=R}^{R+\tau-m-1}\Omega(\lambda-\mu)_{(1,1),rec}^{-1/2}\widehat{L}_{t+h}\right] = [\widehat{A}_{\tau,P} - E(\widehat{A}_{\tau,P})] + [\widehat{B}_P - E(\widehat{B}_P)] + [\widehat{U}_P - E(\widehat{U}_P)], \qquad (20)$$
for $\tau = m, m+1, \ldots, P$, where
$$\widehat{A}_{\tau,P} \equiv \frac{1}{m}\left[\sum_{t=R}^{R+\tau-1}\Omega(\lambda)_{(1,1),rec}^{-1/2}\widehat{L}_{t+h} - \sum_{t=R}^{R+\tau-m-1}\Omega(\lambda-\mu)_{(1,1),rec}^{-1/2}\widehat{L}_{t+h}\right] - \frac{1}{P}\sum_{t=R}^{T}\Omega(1)_{(1,1),rec}^{-1/2}\widehat{L}_{t+h},$$
$\widehat{B}_P \equiv \Omega(1)_{(1,1),rec}^{-1/2}B_P$, and $\widehat{U}_P \equiv \Omega(1)_{(1,1),rec}^{-1/2}U_P$. Under Assumption 5, where $A_{\tau,P} \equiv E(\widehat{A}_{\tau,P})$, $\widehat{A}_{\tau,P}$, $\widehat{B}_P$, $\widehat{U}_P$ are asymptotically uncorrelated, and provide a decomposition of the rolling average (standardized) out-of-sample measure of forecasting performance, Eq. (20).

Comment. Note that, when $D = 0$, so that estimation uncertainty is not relevant, (20) is the same as (10) because $\Omega(\lambda)_{rec}$ does not depend on $\lambda$.

Proposition 7 (Significance Tests: Expanding Window Estimation). Let Assumptions 2–4 hold, $\widehat{\sigma}_A^2 = \widehat{\Omega}(1)_{(1,1),rec}$, $\widehat{\sigma}_{A,\lambda}^2 = \widehat{\Omega}(\lambda)_{(1,1),rec}$, $\widehat{\sigma}_{A,\lambda-\mu}^2 = \widehat{\Omega}(\lambda-\mu)_{(1,1),rec}$, for $\widehat{\Omega}_{rec}$ defined in Eq. (19), $\widehat{\sigma}_B^2 = \widehat{\Phi}_P^2\,\widehat{\Omega}(1)_{(2,2),rec}$, $\widehat{\sigma}_U^2 = \widehat{\sigma}_A^2 - \widehat{\sigma}_B^2$,
$$\widetilde{A}_{\tau,P} \equiv \frac{1}{m}\left[\widehat{\sigma}_{A,\lambda}^{-1}\sum_{t=R}^{R+\tau-1}\widehat{L}_{t+h} - \widehat{\sigma}_{A,\lambda-\mu}^{-1}\sum_{t=R}^{R+\tau-m-1}\widehat{L}_{t+h}\right] - \widehat{\sigma}_A^{-1}\frac{1}{P}\sum_{t=R}^{T}\widehat{L}_{t+h},$$
$\widetilde{B}_P \equiv \widehat{\sigma}_A^{-1}\widehat{B}_P$, $\widetilde{U}_P \equiv \widehat{\sigma}_A^{-1}\widehat{U}_P$, and the tests be defined as:
$$\Gamma_P^{(A)} \equiv \sup_{\tau = m,\ldots,P}\widetilde{A}_{\tau,P}, \quad \Gamma_P^{(B)} \equiv \sqrt{P}\,(\widehat{\sigma}_A/\widehat{\sigma}_B)\,\widetilde{B}_P, \quad \Gamma_P^{(U)} \equiv \sqrt{P}\,(\widehat{\sigma}_A/\widehat{\sigma}_U)\,\widetilde{U}_P.$$
Then: (i) Under $H_{0,A}$, where $A_{\tau,P} \equiv E(\widehat{A}_{\tau,P})$: $\Gamma_P^{(A)} \Rightarrow \sup_{\lambda\in[\mu,1]}\frac{1}{\mu}[W_1(\lambda) - W_1(\lambda-\mu)] - W_1(1)$, where $\tau = [\lambda P]$, $m = [\mu P]$ and $W_1(\cdot)$ is a standard univariate Brownian motion. The critical values for significance level $\alpha$ are $\pm k_\alpha$, where $k_\alpha$ solves (12). Table 1 reports the critical values $k_\alpha$. (ii) Under $H_{0,B}$: $\Gamma_P^{(B)} \Rightarrow N(0,1)$; under $H_{0,U}$: $\Gamma_P^{(U)} \Rightarrow N(0,1)$.

Appendix B. Proofs

This appendix contains the proofs of the theoretical results in the paper.

Lemma 1 (The Mean Value Expansion). Under Assumption 2, we have
$$\frac{1}{\sqrt{P}}\sum_{t=R}^{R+[kP]}\widehat{l}_{t+h} = \frac{1}{\sqrt{P}}\sum_{t=R}^{R+[kP]}l_{t+h} + D\,\frac{1}{\sqrt{P}}\sum_{t=R}^{R+[kP]}JH_t + o_p(1). \qquad (21)$$
Proof of Lemma 1. Consider the following mean value expansion of $\widehat{l}_{t+h} = l_{t+h}(\widehat{\theta}_{t,R})$ around $\theta^*$:
$$l_{t+h}(\widehat{\theta}_{t,R}) = l_{t+h} + D_{t+h}(\widehat{\theta}_{t,R} - \theta^*) + r_{t+h}, \qquad (22)$$
where the $i$-th element of $r_{t+h}$ is $0.5\,(\widehat{\theta}_{t,R} - \theta^*)'\,\frac{\partial^2 l_{i,t+h}(\ddot{\theta}_{t,R})}{\partial\theta\partial\theta'}\,(\widehat{\theta}_{t,R} - \theta^*)$ and $\ddot{\theta}_{t,R}$ is an intermediate point between $\widehat{\theta}_{t,R}$ and $\theta^*$. From (22) we have
$$\frac{1}{\sqrt{P}}\sum_{t=R}^{R+[kP]}\widehat{l}_{t+h} = \frac{1}{\sqrt{P}}\sum_{t=R}^{R+[kP]}l_{t+h} + \frac{1}{\sqrt{P}}\sum_{t=R}^{R+[kP]}D_{t+h}\big(\widehat{\theta}_{t,R} - \theta^*\big) + \frac{1}{\sqrt{P}}\sum_{t=R}^{R+[kP]}r_{t+h}.$$
Eq. (21) follows from: (a) $\frac{1}{\sqrt{P}}\sum_{t=R}^{R+[kP]}r_{t+h} = o_p(1)$ by West (1996, Lemma A2(b), A5) with $P$ replaced by $R+[kP]$ and $[kP]/P \underset{T\to\infty}{\to} k$; and (b) using $\widehat{\theta}_{t,R} - \theta^* = J_t H_t$,
$$\frac{1}{\sqrt{P}}\sum_{t=R}^{R+[kP]}D_{t+h}\big(\widehat{\theta}_{t,R} - \theta^*\big) = \frac{1}{\sqrt{P}}\sum_{t=R}^{R+[kP]}D_{t+h}J_tH_t = \frac{1}{\sqrt{P}}\sum_{t=R}^{R+[kP]}DJH_t + \frac{1}{\sqrt{P}}\sum_{t=R}^{R+[kP]}(D_{t+h}-D)JH_t + \frac{1}{\sqrt{P}}\sum_{t=R}^{R+[kP]}D(J_t-J)H_t + \frac{1}{\sqrt{P}}\sum_{t=R}^{R+[kP]}(D_{t+h}-D)(J_t-J)H_t.$$
We have that, in the last equality: (i) the second term is $o_p(1)$, since $\frac{1}{[kP]}\sum_{t=R}^{R+[kP]}(D_{t+h}-D) \overset{p}{\to} 0$ by Assumption 2(d), $\frac{1}{\sqrt{[kP]}}\sum_{t=R}^{R+[kP]}H_t = O_p(1)$, and Lemma A4(a) in West (1996); (ii) similar arguments show that the third and fourth terms are $o_p(1)$ by Lemma A4(b,c) in West (1996).

Lemma 2 (Joint Asymptotic Variance of $l_{t+h}$ and $H_t$). For every $k\in[0,1]$:
(i) $\lim_{T\to\infty}\mathrm{Var}\left(\frac{1}{\sqrt{P}}\sum_{t=R}^{R+[kP]}JH_t\right) = \begin{cases} 0 & \text{if } \pi = 0, \\ k\,(2\Upsilon JS_{hh}J') & \text{if } \pi > 0; \end{cases}$
(ii) $\mathrm{Var}\left(\frac{1}{\sqrt{P}}\sum_{t=R}^{R+[kP]}l_{t+h}\right) \to kS_{ll}$;
(iii) $\lim_{T\to\infty}\mathrm{Cov}\left(\frac{1}{\sqrt{P}}\sum_{t=R}^{R+[kP]}l_{t+h},\,\frac{1}{\sqrt{P}}\sum_{t=R}^{R+[kP]}J_tH_t\right) = \begin{cases} 0 & \text{if } \pi = 0, \text{ by West (1994),} \\ k\Upsilon S_{lh} & \text{if } \pi > 0, \text{ by Lemma A5 in West (1994).} \end{cases}$
Hence
$$\lim_{T\to\infty}\mathrm{Var}\begin{pmatrix} \frac{1}{\sqrt{P}}\sum_{t=R}^{R+[kP]}l_{t+h} \\ \frac{1}{\sqrt{P}}\sum_{t=R}^{R+[kP]}J_tH_t \end{pmatrix} = k\begin{pmatrix} S_{ll} & \Upsilon S_{lh}J' \\ \Upsilon JS_{lh}' & 2\Upsilon JS_{hh}J' \end{pmatrix} \ \text{if } \pi > 0, \quad \text{and} \quad = k\begin{pmatrix} S_{ll} & 0 \\ 0 & 0 \end{pmatrix} \ \text{if } \pi = 0.$$

Proof of Lemma 2. (a) follows from Assumption 2(b) and Eq. (4.1)(b) in West (1996, p. 1081). To prove (b), note that by Assumption 2(c) the $\pi = 0$ case follows from Lemma A5 in West (1996). The result for the $\pi > 0$ case follows from $\lim_{T\to\infty}\mathrm{Var}\left(\frac{1}{\sqrt{[kP]}}\sum_{t=R}^{R+[kP]}H_t\right) = 2[1-(k\pi)^{-1}\ln(1+k\pi)]S_{hh} = 2\Upsilon S_{hh}$, together with West (1996).

Lemma 3 (Asymptotic Variance of $\widehat{l}_{t+h}$). For $k\in[0,1]$, and $\Omega(k)_{rec}$ defined in Eq. (19):
$$\lim_{T\to\infty}\mathrm{Var}\left(\frac{1}{\sqrt{P}}\sum_{t=R}^{R+[kP]}\widehat{l}_{t+h}\right) = kS_{ll} \ \text{if } \pi = 0, \quad \text{and} \quad = k\,\Omega(k)_{rec} \ \text{if } \pi > 0.$$

Proof of Lemma 3. From Lemma 1, we have
$$\frac{1}{\sqrt{P}}\sum_{t=R}^{R+[kP]}[\widehat{l}_{t+h} - E(l_{t+h})] = \frac{1}{\sqrt{P}}\sum_{t=R}^{R+[kP]}[l_{t+h} - E(l_{t+h})] + D\,\frac{1}{\sqrt{P}}\sum_{t=R}^{R+[kP]}JH_t + o_p(1).$$
That the variance is as indicated above follows from Lemma 2 and
$$\lim_{T\to\infty}\mathrm{Var}\left(\frac{1}{\sqrt{P}}\sum_{t=R}^{R+[kP]}\widehat{l}_{t+h}\right) = \begin{bmatrix} I & D \end{bmatrix}\lim_{T\to\infty}\mathrm{Var}\begin{pmatrix} \frac{1}{\sqrt{P}}\sum_{t=R}^{R+[kP]}l_{t+h} \\ \frac{1}{\sqrt{P}}\sum_{t=R}^{R+[kP]}JH_t \end{pmatrix}\begin{bmatrix} I \\ D' \end{bmatrix},$$
which equals $k\,\Omega(k)_{rec}$ if $\pi > 0$, and equals $kS_{ll}$ if $\pi = 0$.

Proof of Proposition 1. Since $Z_{t+h,R}$ is a measurable function of $\widehat{\theta}_{t,R}$, which includes only a finite ($R$) number of lags (leads) of $\{y_t, x_t\}$, and $\{y_t, x_t\}$ are mixing, $Z_{t+h,R}$ is also mixing, of the same size as $\{y_t, x_t\}$. Then $W_P \Rightarrow W$ by Corollary 4.2 in Wooldridge and White (1988).

Proof of Propositions 2 and 5. Let $m_{t+h}(k) = \Omega(k)_{rec}^{-1/2}[\widehat{l}_{t+h} - E(l_{t+h})]$ if $\pi > 0$ (where $\Omega(k)_{rec}$ is positive definite since $S$ is positive definite), and $m_{t+h}(k) = S_{ll}^{-1/2}[l_{t+h} - E(l_{t+h})]$ if $\pi = 0$ or $D = 0$. That $\frac{1}{\sqrt{P}}\sum_{s=1}^{[kP]}m_{s+R+h}(k)$ satisfies Assumption D.3 in Wooldridge and White (1988) follows from Lemma 3. The limiting variance also follows from Lemma 3, and is full rank by Assumption 2(d). Weak convergence of the standardized partial sum process then follows from Corollary 4.2 in Wooldridge and White (1988). The convergence can be converted to uniform convergence as follows: first, an argument similar to that in Andrews (1993, p. 849, lines 4–18) ensures that assumption (i) in Lemma A4 in Andrews (1993) holds; then, Corollary 3.1 in Wooldridge and White (1988) can be used to show that $m_{t+h}(k)$ satisfies assumption (ii) in Lemma A4 in Andrews (1993). Uniform convergence then follows by Lemma A4 in Andrews (1993).
Proof of Proposition 3. Let $W(\cdot) = [W_1(\cdot), W_2(\cdot)]'$ denote a two-dimensional vector of independent standard univariate Brownian motions. (a) For the fixed rolling window estimation, let $\Omega_{(i,j),roll}$ denote the $i$-th row and $j$-th column element of $\Omega_{roll}$, and let $B(\cdot) \equiv \Omega_{roll}^{1/2}W(\cdot)$. Note that
$$\frac{1}{m}\sum_{t=R+\tau-m}^{R+\tau-1}\widehat{L}_{t+h} = \left[\frac{1}{m}\sum_{t=R+\tau-m}^{R+\tau-1}\widehat{L}_{t+h} - \frac{1}{P}\sum_{t=R}^{T}\widehat{L}_{t+h}\right] + \frac{1}{P}\sum_{t=R}^{T}\widehat{L}_{t+h}, \quad \text{for } \tau = m,\ldots,P,$$
where, from Proposition 1 and Assumptions 3 and 5, $\frac{1}{\sqrt{P}}\sum_{t=R}^{T}(\widehat{L}_{t+h} - L_{t+h}) \Rightarrow B_1(1)$, and
$$\frac{\sqrt{P}}{m}\sum_{t=R+\tau-m}^{R+\tau-1}(\widehat{L}_{t+h} - L_{t+h}) - \frac{1}{\sqrt{P}}\sum_{t=R}^{T}(\widehat{L}_{t+h} - L_{t+h}) \Rightarrow \frac{1}{\mu}[B_1(\lambda) - B_1(\lambda-\mu)] - B_1(1),$$
for $\mu$ defined in Assumption 3. By construction, $\widehat{B}_P$ and $\widehat{U}_P$ are asymptotically uncorrelated under either $H_{0,B}$ or $H_{0,U}$,15 and $\frac{1}{P}\sum_{t=R}^{T}\widehat{L}_{t+h} = \widehat{B}_P + \widehat{U}_P$.

Note that $\widehat{B}_P = \widehat{\beta}\,\frac{1}{P}\sum_{t=R}^{T}\widetilde{L}_t = (\widehat{\beta}-\beta)\frac{1}{P}\sum_{t=R}^{T}\widetilde{L}_t + \beta\,\frac{1}{P}\sum_{t=R}^{T}\widetilde{L}_t$. Thus, under $H_{0,B}$:
$$\widehat{B}_P - B_P = (\widehat{\beta}-\beta)\frac{1}{P}\sum_{t=R}^{T}\widetilde{L}_t + \beta\frac{1}{P}\sum_{t=R}^{T}\widetilde{L}_t = \frac{1}{P}\sum_{t=R}^{T}\widetilde{L}_t\left[\frac{1}{P}\sum_{t=R}^{T}\widetilde{L}_t^2\right]^{-1}\frac{1}{P}\sum_{t=R}^{T}\widetilde{L}_t\widehat{L}_{t+h} = \widehat{\Phi}_P\,\frac{1}{P}\sum_{t=R}^{T}\widetilde{L}_t\widehat{L}_{t+h}, \qquad (23)$$
where the last equality follows from the fact that Assumption 4(b) and $H_{0,B}$: $\beta\,\frac{1}{P}\sum_{t=R}^{T}\widetilde{L}_t = 0$ imply $\beta = 0$, and $\widehat{\Phi}_P \equiv \frac{1}{P}\sum_{t=R}^{T}\widetilde{L}_t\left[\frac{1}{P}\sum_{t=R}^{T}\widetilde{L}_t^2\right]^{-1}$. Note that $\widehat{U}_P = \frac{1}{P}\sum_{t=R}^{T}\widehat{L}_{t+h} - \widehat{B}_P$, thus $\widehat{U}_P - U_P = \frac{1}{P}\sum_{t=R}^{T}(\widehat{L}_{t+h} - L_{t+h}) - (\widehat{B}_P - B_P)$. Then, from Assumptions 1 and 5, we have
$$\begin{pmatrix} \sqrt{P}(\widehat{A}_{\tau,P} - A_{\tau,P}) \\ \sqrt{P}(\widehat{B}_P - B_P) \\ \sqrt{P}(\widehat{U}_P - U_P) \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & \widehat{\Phi}_P \\ 0 & 1 & -\widehat{\Phi}_P \end{pmatrix}\begin{pmatrix} \frac{\sqrt{P}}{m}\sum_{t=R+\tau-m}^{R+\tau-1}(\widehat{L}_{t+h}-L_{t+h}) - \frac{1}{\sqrt{P}}\sum_{t=R}^{T}(\widehat{L}_{t+h}-L_{t+h}) \\ \frac{1}{\sqrt{P}}\sum_{t=R}^{T}(\widehat{L}_{t+h}-L_{t+h}) \\ \frac{1}{\sqrt{P}}\sum_{t=R}^{T}\widetilde{L}_t\widehat{L}_{t+h} \end{pmatrix} \Rightarrow \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & \Phi \\ 0 & 1 & -\Phi \end{pmatrix}\begin{pmatrix} \frac{1}{\mu}[B_1(\lambda)-B_1(\lambda-\mu)] - B_1(1) \\ B_1(1) \\ B_2(1) \end{pmatrix}, \qquad (24)$$
where $\frac{1}{P}\sum_{t=R}^{T}\widetilde{L}_t \overset{p}{\to} \lim_{T\to\infty}\frac{1}{P}\sum_{t=R}^{T}L_t \neq 0$ by Assumption 4(b), and $\widehat{\Phi}_P \overset{p}{\to} \Phi$ by Assumption 1, where $\Phi \equiv \lim_{T\to\infty}\frac{1}{P}\sum_{t=R}^{T}L_t\left[\lim_{T\to\infty}\frac{1}{P}\sum_{t=R}^{T}E(\widetilde{L}_t^2)\right]^{-1}$. Note that
$$\mathrm{Cov}\left(\frac{1}{\mu}[B_1(\lambda)-B_1(\lambda-\mu)] - B_1(1),\ \Phi B_2(1)\right) = \frac{1}{\mu}\mathrm{Cov}\big(B_1(\lambda), \Phi B_2(1)\big) - \frac{1}{\mu}\mathrm{Cov}\big(B_1(\lambda-\mu), \Phi B_2(1)\big) - \Omega_{(1,2),roll}\Phi = \Omega_{(1,2),roll}\Phi\left(\frac{\lambda}{\mu} - \frac{\lambda-\mu}{\mu} - 1\right) = 0.$$
It follows that $[\widehat{A}_{\tau,P} - A_{\tau,P}]$ and $[\widehat{B}_P - B_P]$ are asymptotically uncorrelated. A similar proof shows that $[\widehat{A}_{\tau,P} - A_{\tau,P}]$ and $[\widehat{U}_P - U_P]$ are asymptotically uncorrelated:
$$\mathrm{Cov}\left(\frac{1}{\mu}[B_1(\lambda)-B_1(\lambda-\mu)] - B_1(1),\ B_1(1) - \Phi B_2(1)\right) = \frac{1}{\mu}\mathrm{Cov}\big(B_1(\lambda), B_1(1)\big) - \frac{1}{\mu}\mathrm{Cov}\big(B_1(\lambda-\mu), B_1(1)\big) - \Omega_{(1,1),roll} - \frac{1}{\mu}\mathrm{Cov}\big(B_1(\lambda), \Phi B_2(1)\big) + \frac{1}{\mu}\mathrm{Cov}\big(B_1(\lambda-\mu), \Phi B_2(1)\big) + \Omega_{(1,2),roll}\Phi$$
$$= \Omega_{(1,1),roll}\left(\frac{\lambda}{\mu} - \frac{\lambda-\mu}{\mu} - 1\right) - \Omega_{(1,2),roll}\Phi\left(\frac{\lambda}{\mu} - \frac{\lambda-\mu}{\mu} - 1\right) = 0.$$
Thus, $\sup_{\tau=m,\ldots,P}[\widehat{A}_{\tau,P} - A_{\tau,P}]$ is asymptotically uncorrelated with $\widehat{B}_P - B_P$ and $\widehat{U}_P - U_P$. (b) For the recursive window estimation, the results follow directly from the proof of Proposition 6 by noting that, when $D = 0$, $\Omega(\lambda)_{rec}$ is independent of $\lambda$. Thus (20) is exactly the same as (10).

15 Let $L = [L_R, \ldots, L_T]'$, $\mathbf{L} = [L_{R+h}, \ldots, L_{T+h}]'$, $P_P = L(L'L)^{-1}L'$, $M_P \equiv I - P_P$, $B = P_P\mathbf{L}$, $U = M_P\mathbf{L}$. Then $\widehat{B}_P = P^{-1}\iota'B$ and $\widehat{U}_P = P^{-1}\iota'U$, where $\iota$ is a $(P\times 1)$ vector of ones. Note that, under either $H_{0,B}$ or $H_{0,U}$, either $\widehat{B}_P$ or $\widehat{U}_P$ have zero mean. Also note that $\mathrm{Cov}(\widehat{B}_P, \widehat{U}_P) = E(\widehat{B}_P'\widehat{U}_P) = P^{-2}E(B'\iota\iota'U) = P^{-1}E(B'U) = P^{-1}E(\mathbf{L}'P_PM_P\mathbf{L}) = 0$. Therefore, $\widehat{B}_P$ and $\widehat{U}_P$ are asymptotically uncorrelated.

Proof of Proposition 4. (a) For the fixed rolling window estimation case, we first show that $\widehat{\sigma}_A^2$, $\widehat{\sigma}_B^2$, $\widehat{\sigma}_U^2$ are consistent estimators for $\sigma_A^2$, $\sigma_B^2$, $\sigma_U^2$. From (23), it follows that
$$\sigma_B^2 \equiv \lim_{T\to\infty}\mathrm{Var}(P^{1/2}\widehat{B}_P) = \lim_{T\to\infty}\widehat{\Phi}_P^2\,\mathrm{Var}\left(\frac{1}{\sqrt{P}}\sum_{t=R}^{T}\widetilde{L}_t\widehat{L}_{t+h}\right) = \Phi^2\lim_{T\to\infty}\mathrm{Var}\left(\frac{1}{\sqrt{P}}\sum_{t=R}^{T}\widetilde{L}_t\widehat{L}_{t+h}\right),$$
which can be consistently estimated by $\widehat{\sigma}_B^2 = \widehat{\Phi}_P^2\,\widehat{\Omega}_{(2,2),roll}$. Also, the proof of Proposition 3 ensures that, under either $H_{0,B}$ or $H_{0,U}$, $\widehat{B}_P$ and $\widehat{U}_P$ are uncorrelated. Then, $\widehat{\sigma}_U^2 = \widehat{\sigma}_A^2 - \widehat{\sigma}_B^2$ is a consistent estimate of $\mathrm{Var}(P^{1/2}\widehat{U}_P)$. Consistency of $\widehat{\sigma}_A^2$, $\widehat{\sigma}_B^2$, $\widehat{\sigma}_U^2$ then follows from standard arguments (see Newey and West, 1987). Part (i) follows directly from the proof of Proposition 3, Eq. (24). Part (ii) follows from $\sqrt{P}\widehat{B}_P \Rightarrow \sigma_B W_2(1)$ and thus $\sqrt{P}\widehat{B}_P/\widehat{\sigma}_B \Rightarrow W_2(1) \equiv N(0,1)$. Similar arguments apply for $\widehat{U}_P$. (b) For the recursive window estimation case, the result follows similarly.
Proof of Proposition 6. Consistent with the definition in the proof of Propositions 2 and 5, let $m_{t+h}(k) = \Omega(k)_{rec}^{-1/2}[\widehat{l}_{t+h} - E(l_{t+h})]$ and $M_{t+h}(k) = \Omega(k)_{(1,1),rec}^{-1/2}[\widehat{L}_{t+h} - L_{t+h}]$. Proposition 5 and Assumptions 3 and 5, where $A_{\tau,P} \equiv E(\widehat{A}_{\tau,P})$, imply
$$\begin{pmatrix} \frac{1}{m}\left[\sum_{t=R}^{R+\tau-1}m_{t+h}(\lambda) - \sum_{t=R}^{R+\tau-m-1}m_{t+h}(\lambda-\mu)\right] - \frac{1}{\sqrt{P}}\sum_{t=R}^{T}m_{t+h}(1) \\ \frac{1}{\sqrt{P}}\sum_{t=R}^{T}m_{t+h}(1) \end{pmatrix} \Rightarrow \begin{pmatrix} \frac{1}{\mu}\big(W(\lambda) - W(\lambda-\mu)\big) - W(1) \\ W(1) \end{pmatrix}.$$
Assumptions 4(b) and 5, where $A_{\tau,P} \equiv E(\widehat{A}_{\tau,P})$, and Eqs. (20), (23) and (24) imply:
$$\begin{pmatrix} \widehat{A}_{\tau,P} - E(\widehat{A}_{\tau,P}) \\ \sqrt{P}\,[\widehat{B}_P - E(\widehat{B}_P)] \\ \sqrt{P}\,[\widehat{U}_P - E(\widehat{U}_P)] \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & \widehat{\Phi}_P\,\Omega(1)_{(1,1),rec}^{-1/2} \\ 0 & 1 & -\widehat{\Phi}_P\,\Omega(1)_{(1,1),rec}^{-1/2} \end{pmatrix}\begin{pmatrix} \frac{\sqrt{P}}{m}\left[\sum_{t=R}^{R+\tau-1}M_{t+h}(\lambda) - \sum_{t=R}^{R+\tau-m-1}M_{t+h}(\lambda-\mu)\right] - \frac{1}{\sqrt{P}}\sum_{t=R}^{T}M_{t+h}(1) \\ \frac{1}{\sqrt{P}}\sum_{t=R}^{T}(\widehat{L}_{t+h} - L_{t+h}) \\ \frac{1}{\sqrt{P}}\sum_{t=R}^{T}\widetilde{L}_t\widehat{L}_{t+h} \end{pmatrix}$$
$$\Rightarrow \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & \Phi \\ 0 & 1 & -\Phi \end{pmatrix}\begin{pmatrix} \frac{1}{\mu}\left[\Omega(\lambda)_{(1,1),rec}^{-1/2}B_1(\lambda) - \Omega(\lambda-\mu)_{(1,1),rec}^{-1/2}B_1(\lambda-\mu)\right] - \Omega(1)_{(1,1),rec}^{-1/2}B_1(1) \\ \Omega(1)_{(1,1),rec}^{-1/2}B_1(1) \\ \Omega(1)_{(1,1),rec}^{-1/2}B_2(1) \end{pmatrix},$$
since
$$\widehat{\Phi}_P \equiv \frac{1}{P}\sum_{t=R}^{T}\widetilde{L}_t\left[\frac{1}{P}\sum_{t=R}^{T}\widetilde{L}_t^2\right]^{-1} \overset{p}{\to} \Phi \equiv \lim_{T\to\infty}\frac{1}{P}\sum_{t=R}^{T}L_t\left[\lim_{T\to\infty}\frac{1}{P}\sum_{t=R}^{T}E(\widetilde{L}_t^2)\right]^{-1}.$$
Also, note that
$$\mathrm{Cov}\left(\frac{1}{\mu}\left[\Omega(\lambda)_{(1,1),rec}^{-1/2}B_1(\lambda) - \Omega(\lambda-\mu)_{(1,1),rec}^{-1/2}B_1(\lambda-\mu)\right] - \Omega(1)_{(1,1),rec}^{-1/2}B_1(1),\ \Phi\,\Omega(1)_{(1,1),rec}^{-1/2}B_2(1)\right)$$
$$= \frac{1}{\mu}\mathrm{Cov}\left(W_1(\lambda),\,\Phi\,\Omega(1)_{(1,1),rec}^{-1/2}\Omega(1)_{(2,2),rec}^{-1/2}W_2(1)\right) - \frac{1}{\mu}\mathrm{Cov}\left(W_1(\lambda-\mu),\,\Phi\,\Omega(1)_{(1,1),rec}^{-1/2}\Omega(1)_{(2,2),rec}^{-1/2}W_2(1)\right) - \mathrm{Cov}\left(W_1(1),\,\Phi\,\Omega(1)_{(1,1),rec}^{-1/2}\Omega(1)_{(2,2),rec}^{-1/2}W_2(1)\right)$$
$$= \Phi\left(\frac{\lambda}{\mu} - \frac{\lambda-\mu}{\mu} - 1\right)\Omega(\lambda)_{(1,2),rec}\,\Omega(\lambda)_{(1,1),rec}^{-1/2}\,\Omega(\lambda)_{(2,2),rec}^{-1/2} = 0.$$
It follows that $\widehat{A}_{\tau,P} - E(\widehat{A}_{\tau,P})$ is asymptotically uncorrelated with $\widehat{B}_P - E(\widehat{B}_P)$. Similar arguments to those in the proof of Proposition 3 establish that $\widehat{A}_{\tau,P} - E(\widehat{A}_{\tau,P})$ is asymptotically uncorrelated with $\widehat{U}_P - E(\widehat{U}_P)$.

Proof of Proposition 7. The proof follows from Proposition 6 and arguments similar to those in the proof of Proposition 4.

Appendix C. Data description

1. Exchange rates. We use the bilateral end-of-period exchange rates for the Swiss franc (CHF), Canadian dollar (CAD), and Japanese yen (JPY). We use the bilateral end-of-period exchange rates for the German mark (DEM–EUR) using the fixed conversion factor adjusted euro rates after 1999. The conversion factor is 1.95583 marks per euro. For the British pound (GBP), we use the US dollar per British pound rate to construct the British pound to US dollar rate. The series are taken from the IFS and correspond to lines ``146..AE.ZF...'' for the CHF, ``112..AG.ZF...'' for the GBP, ``156..AE.ZF...'' for the CAD, ``158..AE.ZF...'' for the JPY, ``134..AE.ZF...'' for the DEM, and ``163..AE.ZF...'' for the EUR.
2. Money supply. The money supply data for the US, Japan, and Germany are measured as seasonally adjusted values of M1 (IFS line items 11159MACZF..., 15859MACZF..., and 13459MACZF..., respectively). The seasonally adjusted value of the M1 money supply for the Euro Area is taken from Eurostat and used as the value for Germany after 1999. The money supply for Canada is the seasonally adjusted value of the Narrow Money (M1) Index from the OECD Main Economic Indicators (MEI). The money supply for the UK is measured by the seasonally adjusted series of Average Total Sterling notes taken from the Bank of England. We use the IFS line item ``14634...ZF...'' as the money supply value for Switzerland. The latter is not seasonally adjusted, and we seasonally adjust the data using monthly dummies.
3. Industrial production. We use the seasonally adjusted value of the industrial production index taken from the IFS; it corresponds to line items ``11166..CZF...'', ``14666..BZF...'', ``11266..CZF...'', ``15666..CZF...'', ``15866..CZF...'', and ``13466..CZF...'' for the US, Switzerland, United Kingdom, Canada, Japan, and Germany, respectively.
4. Unemployment rate. The unemployment rate corresponds to the seasonally adjusted value of the ``Harmonised Unemployment Rate'' taken from the OECD Main Economic Indicators for all countries except Germany. For Germany we use the value from Datastream (mnemonic WGUN%TOTQ), which covers the unemployment rate of West Germany only over time.
5. Interest rates. The interest rates are taken from the IFS and correspond to line items ``11160B..ZF...'', ``14660B..ZF...'', ``11260B..ZF...'', ``15660B..ZF...'', ``15860B..ZF...'', and ``13460B..ZF...'' for the US, Switzerland, United Kingdom, Canada, Japan, and Germany, respectively.
6. Commodity prices. The average crude oil price is taken from IFS line item ``00176AAZZF...''. The country-specific ``Total, all commodities'' index for Canada is from the CANSIM database.
References

Andrews, D.W.K., 1991. Heteroskedasticity and autocorrelation consistent covariance matrix estimation. Econometrica 59 (3), 817–858.
Andrews, D.W.K., 1993. Tests for parameter instability and structural change with unknown change point. Econometrica 61 (4), 821–856.
Bacchetta, P., van Wincoop, E., Beutler, T., 2010. Can parameter instability explain the Meese–Rogoff puzzle? In: Reichlin, L., West, K.D. (Eds.), NBER International Seminar on Macroeconomics 2009. University of Chicago Press, Chicago, pp. 125–173.
Clark, T.E., McCracken, M.W., 2001. Tests of equal forecast accuracy and encompassing for nested models. Journal of Econometrics 105 (1), 85–110.
Clark, T.E., McCracken, M.W., 2005. The power of tests of predictive ability in the presence of structural breaks. Journal of Econometrics 124 (1), 1–31.
Clark, T.E., McCracken, M.W., 2006. The predictive content of the output gap for inflation: resolving in-sample and out-of-sample evidence. Journal of Money, Credit & Banking 38 (5), 1127–1148.
Clark, T.E., West, K.D., 2006. Using out-of-sample mean squared prediction errors to test the martingale difference hypothesis. Journal of Econometrics 135 (1–2), 155–186.
Clark, T.E., West, K.D., 2007. Approximately normal tests for equal predictive accuracy in nested models. Journal of Econometrics 138 (1), 291–311.
Diebold, F.X., Mariano, R.S., 1995. Comparing predictive accuracy. Journal of Business and Economic Statistics 13 (3), 253–263.
Elliott, G., Timmermann, A., 2008. Economic forecasting. Journal of Economic Literature 46 (1), 3–56.
Engel, C., Mark, N.C., West, K.D., 2008. Exchange rate models are not as bad as you think. NBER Macroeconomics Annual 2007 (22), 381–441.
Giacomini, R., Rossi, B., 2009. Detecting and predicting forecast breakdowns. Review of Economic Studies 76 (2), 669–705.
Giacomini, R., Rossi, B., 2010. Forecast comparisons in unstable environments. Journal of Applied Econometrics 25 (4), 595–620.
Giacomini, R., White, H., 2006. Tests of conditional predictive ability. Econometrica 74 (6), 1545–1578.
Hansen, P.R., 2008. In-sample fit and out-of-sample fit: their joint distribution and its implications for model selection. Mimeo.
McCracken, M., 2000. Robust out-of-sample inference. Journal of Econometrics 99 (2), 195–223.
Meese, R.A., Rogoff, K., 1983a. Empirical exchange rate models of the seventies: do they fit out of sample? Journal of International Economics 14, 3–24.
Meese, R.A., Rogoff, K., 1983b. The out-of-sample failure of empirical exchange rate models: sampling error or misspecification? In: Frenkel, J.A. (Ed.), Exchange Rates and International Macroeconomics. University of Chicago Press, Chicago, pp. 67–112.
Mincer, J.A., Zarnowitz, V., 1969. The evaluation of economic forecasts. In: Mincer, J.A. (Ed.), Economic Forecasts and Expectations: Analysis of Forecasting Behavior and Performance. NBER, pp. 1–46.
Newey, W.K., West, K.D., 1987. A simple, positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix. Econometrica 55 (3), 703–708.
West, K.D., 1994. Asymptotic inference about predictive ability. An additional appendix. Mimeo.
West, K.D., 1996. Asymptotic inference about predictive ability. Econometrica 64 (5), 1067–1084.
West, K.D., McCracken, M.W., 1998. Regression-based tests of predictive ability. International Economic Review 39 (4), 817–840.
Wooldridge, J.M., White, H., 1988. Some invariance principles and central limit theorems for dependent heterogeneous processes. Econometric Theory 4 (2), 210–230.
Journal of Econometrics 164 (2011) 173–187

Variable selection, estimation and inference for multi-period forecasting problems✩

M. Hashem Pesaran a,b,∗, Andreas Pick c,d, Allan Timmermann e,f

a Cambridge University, United Kingdom
b University of Southern California, United States
c Erasmus University Rotterdam, De Nederlandsche Bank, The Netherlands
d CIMF, United Kingdom
e UC San Diego, United States
f CREATES, Denmark

Article history: Available online 1 March 2011.
JEL classification: C22; C32; C52; C53.
Keywords: Direct forecasts; Iterated forecasts; Factor-augmented VARs; SURE estimation; Akaike information criterion.

Abstract: This paper conducts a broad-based comparison of iterated and direct multi-period forecasting approaches applied to both univariate and multivariate models in the form of parsimonious factor-augmented vector autoregressions. To account for serial correlation in the residuals of the multi-period direct forecasting models we propose a new SURE-based estimation method and modified Akaike information criteria for model selection. Empirical analysis of the 170 variables studied by Marcellino, Stock and Watson (2006) shows that information in factors helps improve forecasting performance for most types of economic variables, although it can also lead to larger biases. It also shows that SURE estimation and finite-sample modifications to the Akaike information criterion can improve the performance of the direct multi-period forecasts. © 2011 Elsevier B.V. All rights reserved.

✩ We are grateful to three anonymous referees as well as the co-editor, Oliver Linton, for helpful comments. We also thank Alessio Sancetta and seminar participants at the 2008 Rio Forecasting Conference at Fundacao Getulio Vargas, the NBER-NSF Time Series conference at UC Davis, the NESG meeting at KU Leuven, DNB, the European Central Bank, UCSD and the University of Toronto for comments and suggestions on the paper. This paper was written while Pick was Sinopia Research Fellow at the University of Cambridge. He acknowledges financial support from Sinopia, quantitative specialist of HSBC Global Asset Management. Timmermann acknowledges support from CREATES, funded by the Danish National Research Foundation. The opinions expressed in this paper do not necessarily reflect those of DNB.
∗ Corresponding author at: Cambridge University, United Kingdom. E-mail address: [email protected] (M.H. Pesaran).
0304-4076/$ – see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.jeconom.2011.02.018

1. Introduction

Economists are commonly asked to forecast uncertain outcomes multiple periods ahead in time. For example, when the economy is in a recession a policy maker would want to know when a recovery begins and so is interested in forecasts of output growth at multiple horizons. Similarly, fixed-income investors are interested in comparing forecasts of spot rates multiple periods ahead in time against current long-term interest rates in order to arrive at an optimal investment strategy, and stock market investors could consider the effect of demographic variables on expected returns and risks at both medium and long investment horizons (Favero and Tamoni, 2010).

Two very different strategies have been proposed for generating multi-period forecasts. The first approach is to estimate a dynamic model for data observed at the highest available frequency, e.g. monthly, and then use the chain rule to generate forecasts at longer horizons. Under this iterated or indirect approach, the model specification is the same across all forecast horizons; only the number of iterations changes with the horizon. Univariate ARMA models, or their multivariate VARMA equivalent, are usually used in the iterations. The second approach is to estimate a separate model for each forecast horizon, regressing future realizations on current information. Such direct forecasts dispense with the need for forward iteration. Under this approach, both the model specification and estimates can vary across different forecast horizons.

Both approaches have advantages and drawbacks. For a given model specification the iterated approach leads to more efficient parameter estimates since it includes data recorded at the highest available frequency and so uses the largest available sample size. If the model is misspecified, due, for example, to an omitted variable or because of an incorrect lag order, iterating the model multiple
steps ahead can either attenuate or reinforce existing biases. Direct forecasts are less efficient, but also more likely to be robust to model misspecification as they are typically linear projections of current realizations on past data. Direct forecasts introduce new problems, however, due to the overlap in data when the forecast horizon exceeds a single period, which affects the covariance of the forecast errors. Given the importance of the horizon to many forecasting problems, it is not surprising that a substantial theoretical literature has considered the multi-step forecasting problem, including Bao (2007); Brown and Mariano (1989); Clements and Hendry (1998); Cox (1961); Findley (1983); Hoque et al. (1988); Ing (2003); Schorfheide (2005) and Ullah (2004), with Bhansali (1999) and Chevillon (2007) providing surveys. This literature has examined the bias-efficiency trade-off in the context of specific models such as stationary first-order or higher-order autoregressive models. Whether the direct or iterated approach can be expected to produce the best forecasts generally depends on the sample size, forecast horizon, the (unknown) underlying data generating process and the methods used to select lag length for the forecasting models (Ing, 2003). In general, no approach can be shown to uniformly dominate the other so, as pointed out by Marcellino et al. (2006) (MSW, henceforth), the relative merit of the iterated versus direct forecast methods could vary across different economic variables and so is ultimately an empirical question. For multivariate forecast models additional issues complicate the comparison of the direct and iterated forecast approaches. First, it becomes important how the potentially high-dimensional variable selection search is conducted and, by extension, how multistep forecasts of any additional predictor variables are generated under the iterated approach. For multivariate specifications of even modest dimension, a global model specification search very rapidly becomes intractable unless the problem is further constrained. With $d$ potential regressors, there are $2^d$ different linear models, and with $d$ easily in the hundreds it is infeasible to evaluate every possible model. To deal with this dimensionality problem, we propose a factor-augmented VAR approach to iterated forecasting that builds on the work by Bernanke et al. (2005) and Stock and Watson (2005). This limits the model specification search to consider inclusion of only a few common factors extracted from different categories of economic variables. In addition to past values of the predicted variable itself, relatively few potential predictors therefore need to be considered. A second issue that has not previously received much attention in this context is the serial correlation in the errors of the direct forecast models, which arises due to the use of overlapping data. This raises issues at both the estimation and model selection stages. In the estimation stage we propose a SURE estimation approach that reorganizes the data in non-overlapping blocks of observations spaced apart by the length of the forecast horizon. We show how to compute the resulting covariance matrix under this approach, which holds the potential of efficiency gains over conventional direct forecasts. In the model selection stage we propose modifications to the Akaike information criterion that account for serial correlation in residuals from the forecast models.
Monte Carlo simulations confirm that the modifications to the AIC and the SURE estimation approach both lead to improvements in the performance of the direct forecast models. In an empirical exercise we consider the 170 variables studied by MSW. We confirm their finding that the iterated forecasts are best overall among the univariate forecasting methods, particularly at long horizons where the inefficiency of the direct forecasting method is most prominent. Furthermore, we find that forecasts generated by factor-augmented VARs generally perform better than the univariate forecasts, an important exception being variables tracking prices and wages. This suggests that it is
helpful to extend the forecasting models beyond purely univariate schemes and include the multivariate information embedded in common factors. Among the direct forecasts, in the majority of cases the modified Akaike information criteria and the SURE approach both help improve forecasting performance. In summary, the main contributions of the paper are as follows. First, we propose a factor-augmented forecast approach that extends univariate iterated methods to the multivariate setting. Second, we propose a new SURE estimation method that accounts for the data overlap that arises at multi-period horizons under the direct forecast approach. Third, we extend the AIC to account for the overlap introduced by the direct forecasting method which affects the covariance of the forecast errors and so can lead to different models being chosen in small samples. Fourth, we study the forecasting performance of both extant and new estimation, model selection, and forecasting methods through Monte Carlo simulations. Finally, we present an empirical application that considers recursively generated forecasts of the economic variables included in the study by MSW and extends their study to a multivariate setting. The outline of the paper is as follows. Section 2 sets up the multi-period forecasting problem for univariate and multivariate cases, while Section 3 deals with model selection and estimation issues. Section 4 presents Monte Carlo results, while Section 5 describes our empirical findings using the Marcellino et al. data set. Section 6 concludes. 2. Methods for multi-period forecasting Suppose a forecaster is interested in predicting a K × 1 vector of variables yt +h = (y1,t +h , y2,t +h , . . . , yK ,t +h )′ by means of their own past values and the past values of an additional set of M potentially relevant predictor variables, xt = (x1t , x2t , . . . , xMt )′ . Typically K is small, often one or two, but M could be very large. The forecaster’s horizon, h, could be a single period, h = 1, or could involve several periods, h > 1. Iterated forecasts use a single model fitted to the shortest horizon and then iterate on this model to obtain multi-step forecasts. Direct forecasts regress realizations h periods into the future on current information and, therefore, estimate and select a separate forecasting model for each horizon. For purposes of calculating one-step-ahead forecasts under the iterated approach, the regressors are treated as conditional information and so how they are generated is not a concern. This also holds under the direct forecasting approach irrespective of the forecast horizon. In contrast, when applying the iterated forecasting approach to multi-period horizons, h > 1, the regressors themselves need to be predicted since such values in turn are required to predict future values of yt . We next show how this can be done using factor-augmented VAR models. 2.1. Multi-step forecasts with factor-augmented VARs In cases where M is relatively small one approach is to treat all variables simultaneously, i.e., model (y′t , x′t )′ jointly. Multi-period forecasts of yt can then be obtained by iterating on a VAR of the form
$$\begin{pmatrix} y_t \\ x_t \end{pmatrix} = \begin{pmatrix} \mu_y \\ \mu_x \end{pmatrix} + \Lambda(L)\begin{pmatrix} y_{t-1} \\ x_{t-1} \end{pmatrix} + \psi_t, \qquad (1)$$
where $\Lambda(L)$ is a matrix lag polynomial of finite order. In the common situation where M is large while the time series dimension of the data is limited, this approach is unlikely to be successful due to the high dimension of $\Lambda(L)$, particularly the parts tracking dynamics in the large-dimensional vector $x_t$. To deal with this issue, a conditional factor-augmentation approach can be used. Under this approach, the large-dimensional
$x_t$-vector is condensed into a subset of factors, $f_t$, of dimension $m < M$, that summarize the salient features of the large-dimensional data. A factor-augmented VAR (FAVAR) based on the variables $z_t = (y_t', f_t')'$ can then be used:
$$z_t = \mu_z + \begin{pmatrix} A_p(L) & B_q(L) \\ 0 & D_s(L) \end{pmatrix} z_{t-1} + \xi_t, \qquad (2)$$
where the finite-order matrix lag polynomials are
$$A_p(L) = A_0 + A_1 L + \cdots + A_{p-1}L^{p-1}, \quad B_q(L) = B_0 + B_1 L + \cdots + B_{q-1}L^{q-1}, \quad D_s(L) = D_0 + D_1 L + \cdots + D_{s-1}L^{s-1}.$$
Notice the asymmetric treatment of $y_t$ and $f_t$ under this approach: future values of the factors are generated using only current and past values of the factors themselves. The y-variables are therefore not used to predict the factors, while the factors are used to predict the y-variables.

For illustration, suppose that the $K\times 1$ vector of target variables, $y_t$, is generated according to the following factor model
$$y_t = \mu + Ay_{t-1} + Bf_{t-1} + u_t, \quad t = 1, 2, \ldots, T, \quad u_t \sim \text{i.i.d.}(0, \Sigma), \qquad (3)$$
where $f_t$ is a vector of unobserved common factors, while $\mu$, $A$, $B$, and $\Sigma$ are unknown coefficient matrices. Using this model to predict $y_{t+h}$ given information at time $t$ requires a forecast of the factors whenever $h \geq 2$. Despite the dynamic nature of the above model, we follow Stock and Watson (2002) and estimate $f_t$ by the principal component (PC) procedure, although one could equally employ the dynamic factor approach of Forni et al. (2005). A key question when using factor models is the choice of the number of factors. This can be determined, for example, by using the information criteria (IC) proposed by Bai and Ng (2002). In practice there is considerable uncertainty surrounding the number of factors to be used, and as pointed out by Bai and Ng (2009), the use of IC is based on the assumption that the factors are ordered as predictors of the regressors $x_t$, an ordering that might not be appropriate for predicting $y_t$. In view of these concerns we adopt a hierarchical approach where we first divide all variables into economically distinct groups and then select the first PC from each of the categories. All the computations are carried out recursively with rolling estimation windows of length $w$, so no future information is used in the construction of the factors. We denote the recursively estimated PCs by $\hat{f}_t$, $t = R, R+1, \ldots, T-h$, where $t$ is the point in time where the factors are computed, $R \geq w$ is the time at which the first forecast is made, and $T$ is the total sample length. Hence, at time $t$ we use data over the sample $t-w+1, t-w+2, \ldots, t$ to extract $\hat{f}_t$.

Only those factors that help predict $y_t$ are relevant and should be included in the model. This may be a subset, $\hat{f}_{1t}$, of the full set of factors, $\hat{f}_t$, under consideration. One then has to choose whether to use the full set of factors $\hat{f}_t = (\hat{f}_{1t}, \hat{f}_{2t})$ to predict the subset of factors, $\hat{f}_{1t}$, selected when forecasting $y_{t+h}$, or whether to use only lagged values of $\hat{f}_{1t}$ to predict their future values. The choice could depend on the number of available factors. In the empirical application below, where we consider five factors, we use all factors to forecast future values of $\hat{f}_{1t}$.

To generate iterated multivariate forecasts we first select the relevant subset of factors and the lag lengths $p$ and $q$ using the conditional model:
$$y_t = \mu_y + A_p(L)y_{t-1} + B_q(L)\hat{f}_{t-1} + u_t. \qquad (4)$$
We determine the lag orders $p$ and $q$ and the subset of $m_1$ factors from the total set of $m$ factors by applying IC to the likelihood of $y_t$. For simplicity, we obtain forecasts only from the full set of $m = m_1 + m_2$ factors. Hence we first select a VAR($s$) model in $\hat{f}_t$, where the value of $s$ is determined by some IC, again applied recursively:
$$\hat{f}_t = \mu_f + D_s(L)\hat{f}_{t-1} + \varepsilon_t. \qquad (5)$$
Next, to compute iterated $h$-step-ahead forecasts of $y_t$, we simply combine the conditional and marginal models:
$$y_t = \mu_y + A_p(L)y_{t-1} + B_q(L)\hat{f}_{t-1} + u_t, \qquad \hat{f}_t = \mu_f + D_s(L)\hat{f}_{t-1} + \nu_t,$$
which, recalling that $z_t = (y_t', \hat{f}_t')'$ and $\xi_t = (u_t', \nu_t')'$, is consistent with Eq. (2). Notice that the selection of factors in the conditional model (4) is reflected in zero-restrictions on $B_1, B_2, \ldots, B_q$, where the columns corresponding to the factors that are not selected are set to zeros. The factor-augmented VAR in $z_t$ can then readily be iterated forward.1 Direct $h$-step-ahead forecasts of $y_t$ are based on the following specification: $y_t = \mu_{yh} + A_{ph}(L)y_{t-h} + B_{qh}(L)\hat{f}_{t-h} + u_{th}$. Univariate forecasts arise as a special case of these multivariate forecasts.

1 Alternatively, one could model only the factors that have been selected in the conditional specification in Eq. (4). This may be less efficient than using the full VAR in Eq. (5), but a smaller number of parameters need to be estimated from a finite number of observations. It is therefore not clear a priori which approach will perform better.
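To make the iterated scheme concrete, here is a minimal Python sketch (our illustration under simplifying assumptions — one lag in both blocks, pre-extracted factors, and no model selection step — not the authors' code). It estimates the conditional and marginal models by OLS and iterates the implied FAVAR forward h steps.

```python
import numpy as np

def ols(Y, X):
    """Least squares coefficients for Y = X B + error (one column of B per equation)."""
    B, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return B

def favar_iterated_forecast(y, f, h):
    """y: (T, K) targets; f: (T, m) estimated factors; returns the h-step forecast of y.
    Conditional model: y_t = c_y + A y_{t-1} + B f_{t-1} + u_t
    Marginal model:    f_t = c_f + D f_{t-1} + v_t  (factors follow their own VAR(1))."""
    T = y.shape[0]
    Xy = np.column_stack([np.ones(T - 1), y[:-1], f[:-1]])
    Cy = ols(y[1:], Xy)                 # stacked [c_y; A'; B']
    Xf = np.column_stack([np.ones(T - 1), f[:-1]])
    Cf = ols(f[1:], Xf)                 # stacked [c_f; D']
    y_cur, f_cur = y[-1], f[-1]
    for _ in range(h):                  # chain rule: iterate the one-step maps
        y_next = Cy.T @ np.r_[1.0, y_cur, f_cur]
        f_next = Cf.T @ np.r_[1.0, f_cur]
        y_cur, f_cur = y_next, f_next
    return y_cur
```

The asymmetric treatment of $y_t$ and $\hat{f}_t$ described above shows up in the code: the factor recursion never uses lagged values of y.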
3. Estimation and model selection

Two important econometric issues arise in the context of multi-period forecasting. First, the direct forecast models introduce overlaps in the observations and give rise to a particular dependence structure which, if imposed on the estimation, could lead to efficiency gains. Second, for both iterated and direct forecasts, how a particular model is chosen by the forecaster is of great importance given the large dimension of the set of potentially relevant predictor variables. In this section we discuss both issues, first presenting a new SURE estimation procedure that may lead to efficiency gains, and next considering a variety of information criteria, including modified ones that deal with serial dependence in the errors.

3.1. Estimation

Overlaps in the data associated with the direct forecast models introduce serial dependence in the errors. Even if the underlying errors are serially uncorrelated, the errors associated with an h-period overlap typically follow an MA(h−1) process. This suggests that estimating VARMA models could be beneficial. We do not follow this direction for two reasons. First, estimation of VARMA models is not a very common undertaking in the forecasting literature and thus goes against our focus on evaluating forecasting methods in common use. Second, VARMA models have stability and convergence problems for the types of multivariate models considered in our paper, which can be of large dimension, and require extensive specification searches; see Athanasopoulos and Vahid (2008) for further discussion of these points. Instead, we propose using SURE estimation, which leads to some efficiency gains as it exploits the information in the MA structure of the error, even if it implies a more heavily parameterized model.

Consider estimation of the following direct forecast model
$$y_t = \beta' z_{t-h} + u_t, \quad t = 1, 2, \ldots, T, \qquad (6)$$
and suppose that $u_t$ follows an $(h-1)$-order moving average process
$$u_t = \varepsilon_t + \theta_1\varepsilon_{t-1} + \theta_2\varepsilon_{t-2} + \cdots + \theta_{h-1}\varepsilon_{t-h+1}, \quad \varepsilon_t \sim \text{i.i.d.}(0, \sigma^2). \qquad (7)$$
The regressors, $z_{t-h}$, include $y_{t-h}, y_{t-h-1}, \ldots, y_{t-h-p}$ for some order $p$, and estimated factors dated $t-h$ or earlier, i.e. $\hat{f}_{t-h}, \ldots, \hat{f}_{t-h-q}$. When the direct regression is derived from an underlying VAR in $y_t$ and $\hat{f}_t$, the regression coefficients, $\beta$, and the MA coefficients $\theta = (\theta_1, \theta_2, \ldots, \theta_{h-1})'$ are related, and fully efficient estimation of Eq. (6) must allow for such cross-parameter restrictions. These types of restrictions can be implemented by assuming that $\beta$ and $\theta$ are both functions of a set of deeper parameters, $\phi$, with estimation of $\beta(\phi)$ and $\theta(\phi)$ carried out directly in terms of $\phi$. Imposing such restrictions on the system of direct forecasting models simply helps recover the original parameter values from the iterated forecasting model. Since the latter efficiently uses data at the highest frequency, maximum likelihood estimation from joint estimation of the mutually consistent h-step direct forecasting models is identical to estimation of the parameter estimates from the one-step-ahead model, which of course is far more easily achieved. In scenarios where the direct forecast would dominate the iterated forecast, e.g., because the forecasting model is misspecified, using the parameters of the iterated forecast model for the covariance matrix may be undesirable. As for the model parameters, it may therefore be better to estimate the covariance matrix directly without imposing the parameter restrictions of the iterated forecast.

Asymptotically efficient estimates of $\beta$ can be computed by applying maximum likelihood directly to the overlapping regressions for $t = 1, 2, \ldots, T$, allowing for the MA($h-1$) process of the errors. Alternatively, one could consider estimating $\beta$ from pooled regressions of $h$ non-overlapping regressions. Suppose that the estimation sample, $w$, is an exact multiple of $h$ and set $n = w/h$. Let $\tilde{z}_{ij} = y_{t-w+j+(i-1)h}$ and decompose the overlapping regressions in Eq. (6) into the following $h$ non-overlapping regressions
$$\tilde{z}_{ij} = \beta' w_{i-1,j} + v_{ij}, \quad \text{for } i = 1, 2, \ldots, n, \text{ and } j = 1, 2, \ldots, h, \qquad (8)$$
where $w_{i-1,j} = z_{t-w+j+(i-1)h-h}$, and $v_{ij} = u_{t-w+j+(i-1)h}$. When $h = 2$, we have two non-overlapping regressions: one for the odd observations, $\tilde{z}_{i1}$, and another for the even observations, $\tilde{z}_{i2}$, for $i = 1, 2, \ldots, n$. For each $j$ the errors $v_{ij}$ are serially uncorrelated across $i$, and the least squares regression of $\tilde{z}_j = (\tilde{z}_{1j}, \tilde{z}_{2j}, \ldots, \tilde{z}_{nj})'$ on $W_j = (w_{0j}, w_{1j}, \ldots, w_{n-1,j})'$ yields a consistent estimate of $\beta$, which we denote by $\hat{\beta}_j$. However, this estimate is not efficient, and a pooled estimate that utilizes all the $h$ non-overlapping regressions can be more efficient. The estimates across the $h$ non-overlapping regressions can be pooled in a number of ways, e.g., with a simple or a weighted average of $\hat{\beta}_j$, over $j = 1, 2, \ldots, h$, with weights based on the relative precision of the different estimates. Alternatively, the $h$ regressions in Eq. (8) can be viewed as a set of seemingly unrelated regression equations (SURE), allowing for the cross dependence of the errors $v_{ij}$ across $j$ for each $i$. Specifically, consider the regressions
$$\tilde{z}_j = W_j\beta + v_j, \quad \text{for } j = 1, 2, \ldots, h, \qquad (9)$$
which in stacked form can be written as $\tilde{z} = W\beta + v$, where $\tilde{z} = (\tilde{z}_1', \tilde{z}_2', \ldots, \tilde{z}_h')'$, $W = (W_1', W_2', \ldots, W_h')'$, and $v = (v_1', v_2', \ldots, v_h')'$. To derive the covariance matrix of $v$, we first note that $v_j = (u_{t-w+j}, u_{t-w+j+h}, u_{t-w+j+2h}, \ldots, u_{t-w+j+(n-1)h})'$, and denote the autocovariance of $\{u_t\}$, which follows an MA($h-1$) process, by $\gamma(s)$, where $\gamma(s) = \gamma(-s) = 0$ for $s \geq h$. It is now easily seen that $E(v_jv_j') = \gamma(0)I_n$, where $I_n$ is an identity matrix of order $n$. Similarly, for $s \neq r$ we have $E(v_rv_s')$ as given in Box I. Therefore, the covariance matrix of $v$ is given by
$$\Sigma_v(\theta) = \begin{pmatrix} \gamma(0)I_n & \Psi_{12} & \Psi_{13} & \cdots & \Psi_{1h} \\ \Psi_{21} & \gamma(0)I_n & \Psi_{23} & \cdots & \Psi_{2h} \\ \vdots & & \ddots & & \vdots \\ \Psi_{h-1,1} & \Psi_{h-1,2} & \cdots & \gamma(0)I_n & \Psi_{h-1,h} \\ \Psi_{h1} & \Psi_{h2} & \cdots & \Psi_{h,h-1} & \gamma(0)I_n \end{pmatrix}. \qquad (10)$$
The log-likelihood function of the SURE specification is now given by
$$\ell(\beta,\theta) \propto -\frac{1}{2}\ln|\Sigma_v(\theta)| - \frac{1}{2}(\tilde{z} - W\beta)'\Sigma_v^{-1}(\theta)(\tilde{z} - W\beta). \qquad (11)$$
The unknown parameters can be obtained by maximization of the log-likelihood function:
$$\hat{\beta}_{SURE} = \left(W'\Sigma_v^{-1}(\theta)W\right)^{-1}W'\Sigma_v^{-1}(\theta)\tilde{z}. \qquad (12)$$
This maximum likelihood procedure is equivalent to the maximum likelihood estimation of the original overlapping regression in Eq. (6) with errors following the MA($h-1$) process (7). To see this, let $y = (y_1, y_2, \ldots, y_T)'$, and note that since the elements of $\tilde{z}$ are selected from the elements of $y$ without repetition, there exists a non-singular $T\times T$ selection matrix $P$ such that $\tilde{z} = Py$. Hence the log-likelihood functions based on $y$ and $\tilde{z}$ must be the same, and the covariance matrix in Eq. (10) is equivalent to using a GLS covariance matrix after reordering the observations.

Maximum likelihood estimation of $\beta$ under the SURE approach suggests intermediate procedures that are computationally less extensive and relatively easy to implement. One possible approach would be to estimate $\Sigma_v(\theta)$ using consistent estimates of $\gamma(s)$ obtained from $\hat{u}_t = y_t - \hat{\beta}_{LS}'z_{t-h}$, where $\hat{\beta}_{LS}$ is the first-stage least squares estimate of $\beta$ computed from the overlapping regressions. Hence2
$$\hat{\gamma}(s) = \frac{\sum_{t=1}^{T}\hat{u}_t\hat{u}_{t-s}}{T}, \quad \text{for } s = 0, 1, 2, \ldots, h. \qquad (13)$$
Alternatively, we could apply the standard SURE estimation to Eq. (9) subject to the restrictions $\beta_j = \beta$ for all $j = 1, 2, \ldots, h$ and use the covariance matrix $\Sigma_h \otimes I_{T/h}$, where $\Sigma_h$ is the covariance matrix of the SURE system (9) without accounting for the restrictions in Eq. (10). This procedure provides an efficient way of pooling the $h$ different consistent estimates of $\beta$ but does not take account of the specific form of the cross dependence of the errors from the different non-overlapping regressions. Monte Carlo experiments not reported here but available from the authors suggest that this leads to an inferior forecast performance, and it will therefore not be further considered. In the Monte Carlo simulations and empirical analysis we use the first approach and the likelihood function in Eq. (11) to calculate the information criteria. $\gamma(s)$ is estimated using least squares residuals as in Eq. (13) and down-weighted by $1 - \frac{s}{h}$. To robustify the calculation of the forecasts, one can use the SURE approach in the model selection stage, but then use OLS estimation for the selected model. Numbers reported in subsequent tables are based on this combination, but results are very similar without this modification.

2 For relatively large values of $h$, the estimates of $\hat{\gamma}(s)$ can be down-weighted using Bartlett or Parzen weights.
Box I. If $0 < s - r < h$,
$$E(v_rv_s') \equiv \Psi_{rs} = \begin{pmatrix} \gamma(s-r) & 0 & \cdots & 0 & 0 \\ \gamma(h-s+r) & \gamma(s-r) & \cdots & 0 & 0 \\ 0 & \gamma(h-s+r) & \ddots & \vdots & \vdots \\ \vdots & & \ddots & \gamma(s-r) & 0 \\ 0 & 0 & \cdots & \gamma(h-s+r) & \gamma(s-r) \end{pmatrix},$$
while if $0 < r - s < h$,
$$E(v_rv_s') \equiv \Psi_{rs} = \begin{pmatrix} \gamma(r-s) & \gamma(h-r+s) & 0 & \cdots & 0 \\ 0 & \gamma(r-s) & \gamma(h-r+s) & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & 0 & \cdots & \gamma(r-s) & \gamma(h-r+s) \\ 0 & 0 & \cdots & 0 & \gamma(r-s) \end{pmatrix}.$$
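As one concrete reading of Eqs. (10) and (13) and Box I, the sketch below (our illustration, not the authors' code) estimates γ(s) from first-stage OLS residuals and assembles the hn × hn covariance matrix Σ_v element by element. The banded Ψ_rs structure follows directly from the time spacing: element i of block j corresponds to u at date t − w + j + (i − 1)h, so the lag between entry (i, k) of block (r, s) is (r − s) + (i − k)h, and γ vanishes at lags of h or more.

```python
import numpy as np

def autocov(u_hat, h):
    """gamma(s) = (1/T) sum_t u_t u_{t-s}, for s = 0..h, from first-stage residuals."""
    T = len(u_hat)
    return np.array([u_hat[s:] @ u_hat[:T - s] / T for s in range(h + 1)])

def sure_covariance(gamma, n, h):
    """Assemble Sigma_v (Eq. (10)): an h x h grid of n x n blocks.
    Entry (i, k) of block (r, s) is gamma(|(r - s) + (i - k) * h|) when that
    lag is below h (MA(h-1) errors), and zero otherwise."""
    S = np.zeros((n * h, n * h))
    for r in range(h):
        for s in range(h):
            for i in range(n):
                for k in range(n):
                    lag = abs((r - s) + (i - k) * h)
                    if lag < h:
                        S[r * n + i, s * n + k] = gamma[lag]
    return S
```

With Σ_v in hand, the feasible GLS/SURE estimator of Eq. (12) is obtained as β̂ = (W′Σ_v⁻¹W)⁻¹W′Σ_v⁻¹z̃; the quadruple loop is written for clarity rather than speed.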
3.2. Model selection

We consider two model selection criteria, namely the AIC and BIC. Both are commonly used in forecasting studies and have well-known properties: AIC achieves a good approximate model as the sample size expands even if the true model is not contained in the set of models under consideration. However, it is not a consistent criterion and so does not select the true model with probability one, asymptotically, if it happens to be included in the search. In contrast, BIC is a consistent model selection criterion if the true model is one of the models under consideration. The iterated and the direct forecast models can be selected either on the basis of single equations or using a system of equations. For the direct forecasting models an additional decision has to be made whether to correct for the overlap in the observations that affects the sample covariance matrix of the forecast errors.

3.2.1. Iterated forecasts

Individual models in the specification search select different subsets of $z_{t-1} = (y_{t-1}', \hat{f}_{t-1}')'$. We consider the following standard criteria. First, we apply AIC or BIC recursively to the single equation ($K = 1$) containing the variable of interest, i.e., at time $t$, for a particular model we have:
$$\text{AIC}_{tw} = \ln[\hat{u}_{tw}'\hat{u}_{tw}/(w-1)] + \frac{2d}{w-1}, \quad \text{BIC}_{tw} = \ln[\hat{u}_{tw}'\hat{u}_{tw}/(w-1)] + \frac{d\ln(w-1)}{w-1}, \qquad (14)$$
where $d$ is the dimension of the vector containing the subset of $z_{t-1}$ selected by the model under consideration, $\hat{u}_{tw}$ is a $(w-1)\times 1$ vector of estimated residuals from the rolling-window estimation with typical element $\hat{u}_\tau = y_{1\tau} - \hat{\beta}_t'z_{\tau-1}$, for $\tau = t-w+2, t-w+3, \ldots, t$, and $\hat{\beta}_t$ is estimated using the most recent $w$ observations up to period $t$.3 Similarly, for $K > 1$, we have
$$\text{AIC}_{tw} = \ln|\hat{\Sigma}_{tw}| + \frac{2Kd}{w-1}, \quad \text{BIC}_{tw} = \ln|\hat{\Sigma}_{tw}| + \frac{Kd\ln(w-1)}{w-1}, \qquad (15)$$
where $\hat{\Sigma}_{tw}$ is the $K\times K$ estimated error covariance matrix
$$\hat{\Sigma}_{tw} = \frac{1}{w-1}\left[Y_{tw}'Y_{tw} - Y_{tw}'Z_{t-1,w}(Z_{t-1,w}'Z_{t-1,w})^{-1}Z_{t-1,w}'Y_{tw}\right],$$
$Y_{tw}$ is the $(w-1)\times K$ matrix of stacked $y_\tau = (y_{1\tau}, y_{2\tau}, \ldots, y_{K\tau})'$, and $Z_{t-1,w}$ is the observation matrix formed by stacking $z_{t-1,w}$ over the $w-1$ observations indexed by $\tau = t-w+2, t-w+3, \ldots, t$. These are all standard expressions.

3 For simplicity, we refer to $z_{\tau-1}$ as comprising both the most recent lag as well as any additional lags. This notation therefore corresponds to using the companion form of the model.

3.2.2. Direct forecasts

The direct forecast models that are not based on the SURE system in Eq. (9) are selected based on either AIC or BIC,
$$\text{AIC}_{twh} = \ln[\hat{u}_{twh}'\hat{u}_{twh}/(w-h)] + \frac{2d}{w-h}, \quad \text{BIC}_{twh} = \ln[\hat{u}_{twh}'\hat{u}_{twh}/(w-h)] + \frac{d\ln(w-h)}{w-h}, \qquad (16)$$
where $\hat{u}_{twh}$ is a $(w-h)\times 1$ vector of estimated residuals with typical element $\hat{u}_\tau = y_{1\tau} - \hat{\beta}_t'z_{\tau-h}$, for $\tau = t-w+1+h, \ldots, t$, when $K = 1$, or the equivalent of Eq. (15) when $K > 1$. For $h > 1$, the overlap in the forecasts will produce autocorrelation in the residuals. This should be accounted for in small samples when calculating the information criteria. Here we consider two ways to calculate the corrections for the case where $K = 1$. To motivate these, assume that the $h$-step forecast model takes the form in Eq. (6) estimated using the observations $\tau = t-w+1, t-w+2, \ldots, t$, and denote the least squares criterion for estimating $\beta$ by
$$Q(\beta) = \frac{1}{2\sigma_u^2}(y_{twh} - Z_{twh}\beta)'(y_{twh} - Z_{twh}\beta),$$
where $\sigma_u^2 = \mathrm{Var}(u_t)$, $y_{twh} = (y_{t-w+1+h}, \ldots, y_t)'$, and $Z_{twh} = (z_{t-w+1}, \ldots, z_{t-h})'$. Let $Q_0(\beta) = E[Q(\beta)]$, where expectations are taken conditional on $Z_{twh}$ and with respect to the true conditional density of $y_{twh}$, and denote the $j$th derivative of $Q(\beta)$ by $Q^{(j)}(\beta)$. Moreover, let $\beta_0$ be the true value of $\beta$, and its estimate given by $\hat{\beta} = \arg\min_\beta[Q(\beta)]$. Using these notations, we have that
$$Q_0^{(1)}(\beta_0) = 0, \quad Q^{(1)}(\hat{\beta}) = 0, \quad Q^{(2)}(\beta) = Q_0^{(2)}(\beta) = \frac{1}{\sigma_u^2}Z_{twh}'Z_{twh}.$$
Define $Q_0(\hat{\beta})$ as the loss incurred by using the estimated parameter $\hat{\beta}$ instead of the unknown true parameter, $\beta_0$. Note that $Q_0(\hat{\beta})$ is obtained by replacing $\beta$ in $Q_0(\beta)$ with $\hat{\beta}$. It is easily verified that, because the operations of taking expectations and replacing an unknown parameter with its estimator do not commute, in general, $E[Q_0(\hat{\beta})] \neq E[Q(\hat{\beta})]$.

Consider the following second-order Taylor expansions
$$Q_0(\hat{\beta}) = Q_0(\beta_0) + \frac{1}{2}(\hat{\beta}-\beta_0)'Q_0^{(2)}(\beta_0)(\hat{\beta}-\beta_0),$$
$$Q(\hat{\beta}) = Q(\beta_0) + (\hat{\beta}-\beta_0)'Q^{(1)}(\beta_0) + \frac{1}{2}(\hat{\beta}-\beta_0)'Q_0^{(2)}(\beta_0)(\hat{\beta}-\beta_0).$$
These expressions are exact for the quadratic loss function considered here for convenience, but can be seen to hold approximately for a general log-likelihood specification, with remainder terms of lower order in $t$, under certain standard regularity conditions. Taking expectations of these two equations yields
$$E[Q_0(\hat{\beta})] = Q_0(\beta_0) + \frac{1}{2}E[(\hat{\beta}-\beta_0)'Q_0^{(2)}(\beta_0)(\hat{\beta}-\beta_0)],$$
and
$$E[Q(\hat{\beta})] = Q_0(\beta_0) + E[(\hat{\beta}-\beta_0)'Q^{(1)}(\beta_0)] + \frac{1}{2}E[(\hat{\beta}-\beta_0)'Q_0^{(2)}(\beta_0)(\hat{\beta}-\beta_0)].$$
Therefore,
$$E[Q_0(\hat{\beta}) - Q(\hat{\beta})] = -E[(\hat{\beta}-\beta_0)'Q^{(1)}(\beta_0)].$$
Furthermore, $(\hat{\beta}-\beta_0) = -[Q^{(2)}(\beta_0)]^{-1}Q^{(1)}(\beta_0)$, and hence (for a given set of regressors)
$$-E[(\hat{\beta}-\beta_0)'Q^{(1)}(\beta_0)] = E[\mathrm{tr}(\sigma_u^{-2}u_{twh}'Z_{twh}(Z_{twh}'Z_{twh})^{-1}Z_{twh}'u_{twh})] = \mathrm{tr}\left[\sigma_u^{-2}(Z_{twh}'Z_{twh})^{-1}Z_{twh}'E(u_{twh}u_{twh}')Z_{twh}\right],$$
where $u_{twh} = (u_{t-w+1+h}, \ldots, u_t)'$. If the errors $u_{twh}$ were i.i.d., this would give the standard penalty term $K$. However, for overlapping forecasts the errors will be autocorrelated and the expression will not collapse to $K$ in small samples. Building on this result, we first consider a band-diagonal modified AIC that takes the form
$$\text{AIC}_{\hat{\Pi}_S} = \ln[\hat{u}_{twh}'\hat{u}_{twh}/(w-h)] + \frac{2\,\mathrm{tr}\,\hat{\Pi}_S}{w-h}, \qquad (17)$$
where
$$\hat{\Pi}_S = (Z_{twh}'Z_{twh})^{-1}Z_{twh}'S_hZ_{twh}/h.$$
$S_h$ is a matrix with $h$ on the diagonal, $h-1$ on the first diagonal above and below the main diagonal, $h-2$ on the second diagonal above and below the main diagonal, etc., i.e.
$$S_h = \begin{pmatrix} h & h-1 & h-2 & \cdots & 0 & 0 \\ h-1 & h & h-1 & & & 0 \\ h-2 & h-1 & h & \ddots & & \vdots \\ \vdots & & \ddots & \ddots & h-1 & h-2 \\ 0 & & & h-1 & h & h-1 \\ 0 & 0 & \cdots & h-2 & h-1 & h \end{pmatrix}.$$
This formulation aims at capturing the MA($h-1$) form of the error process in the overlapping regressions but assumes that the serial correlation in the underlying (non-overlapping) observations is negligible. The second approach uses an estimated covariance matrix. In particular, we use the Newey and West (1987) covariance matrix to obtain the correction. This yields the modified AIC:
$$\text{AIC}_{\tilde{\Pi}} = \ln[\hat{u}_{twh}'\hat{u}_{twh}/(w-h)] + \frac{2\,\mathrm{tr}\,\tilde{\Pi}}{w-h}, \qquad (18)$$
where $\tilde{\Pi} = \hat{\Sigma}_{zz}^{-1}\tilde{\Sigma}_{zz}$, $\hat{\Sigma}_{zz} = \hat{\sigma}_u^2\,Z_{twh}'Z_{twh}/(w-h)$, and $\tilde{\Sigma}_{zz}$ is the long-run variance of the residuals as estimated by the Newey–West covariance matrix with the bandwidth set to $\min(h, w^{1/3})$. When selecting the models based on the SURE system in Eq. (9), standard formulations of AIC and BIC with likelihood (11) can be used. The unknown parameters of the covariance matrix need not be included in the penalty term because models are compared only for the same forecast horizon, so the number of parameters in the covariance matrix will be the same across models. Schorfheide (2005) proposes a related selection criterion based on the weighted sum of mean squared prediction errors adjusted for a term that penalizes for estimation inefficiencies resulting from high-dimensional forecasting models. Although Schorfheide's criterion is closely related to the AIC, the penalty term in his approach captures prediction risk and is different from the penalty term used here.
4. Monte Carlo simulations
ˆ − Q (β)] ˆ = −E[(βˆ − β0 )′ Q (1) (β0 )]. E[Q0 (β)
We next turn to Monte Carlo simulations as a means to evaluate the performance of the various model selection and estimation approaches under two data generating processes (DGPs). For both DGPs we consider situations with a single target variable, K = 1, and m factors which yields the following VAR model for zt = (y1t , f1t , . . . , fmt )′ :
Furthermore,
(βˆ − β0 ) = −[Q (2) (β0 )]−1 Q (1) (β0 ), and hence (for a given set of regressors)
−E[(βˆ − β0 )′ Q (1) (β0 )]
′
= E[tr(σu−2 u′t wh Zt wh (Zt wh Zt wh )−1 Zt wh ut wh )] ′ = tr σu−2 (Zt wh Zt wh )−1 Z′t wh E ut wh u′t wh Zt wh , where ut wh = (ut −w+1+h , . . . , ut )′ . If the errors ut wh were i.i.d., this would give the standard penalty term K . However, for overlapping forecasts the errors will be autocorrelated and the expression will not collapse to K in small samples. Building on this result, we first consider a band diagonal modified AIC that takes the form
ˆ ′t wh uˆ t wh /(w − h)] + AICΠˆ S = ln[u
ˆS 2tr 5
w−h
,
(17)
where
ˆ S = (Z′t wh Zt wh )−1 Z′t wh Sh Zth /h. 5 Sh is a matrix with h on the diagonal, h − 1 on the first diagonal above and below the main diagonal, h − 2 on the second diagonal above and below the main diagonal, etc., i.e. h
h − 1 h − 2 Sh = .. . 0 0
h−1 h
h−2 h−1
h−1
1
··· ··· ..
··· ···
. h−1 h−2
h−1 h h−1
0 0
.. . . h − 2 h−1 h
This formulation aims at capturing the MA(h − 1) form of the error process in the overlapping regressions but assumes that the serial
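To make the correction concrete, a minimal sketch (ours, with illustrative function names, assuming NumPy) that builds S_h and evaluates the band diagonal modified AIC of Eq. (17) is:

```python
import numpy as np

def band_matrix(h: int, n: int) -> np.ndarray:
    """(n x n) matrix S_h: h on the main diagonal, h - k on the k-th
    diagonals above and below it, and zeros beyond the (h-1)-th."""
    k = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])
    return np.where(k < h, h - k, 0).astype(float)

def modified_aic_band(resid: np.ndarray, Z: np.ndarray, h: int) -> float:
    """Band diagonal modified AIC of Eq. (17) for an h-step direct
    regression: resid is the (w-h)-vector of residuals, Z the
    (w-h) x K regressor matrix Z_twh."""
    n = len(resid)                                     # n = w - h
    S = band_matrix(h, n)
    Pi_S = np.linalg.solve(Z.T @ Z, Z.T @ S @ Z) / h   # (Z'Z)^{-1} Z' S_h Z / h
    return np.log(resid @ resid / n) + 2.0 * np.trace(Pi_S) / n
```

For h = 1 the matrix S_1 is the identity, tr(Π̂_S) = K, and the criterion collapses to the standard AIC, consistent with the derivation above; the Newey–West version of Eq. (18) replaces the band matrix by a long-run covariance estimate.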
4. Monte Carlo simulations

We next turn to Monte Carlo simulations as a means to evaluate the performance of the various model selection and estimation approaches under two data generating processes (DGPs). For both DGPs we consider situations with a single target variable, K = 1, and m factors, which yields the following VAR model for z_t = (y_{1t}, f_{1t}, ..., f_{mt})′:

\[ \begin{pmatrix} y_{1t}^{(b)} \\ f_t^{(b)} \end{pmatrix} = \begin{pmatrix} \alpha\rho & \alpha\gamma' \\ 0 & A \end{pmatrix} \begin{pmatrix} y_{1,t-1}^{(b)} \\ f_{t-1}^{(b)} \end{pmatrix} + \varepsilon_t^{(b)}, \qquad \varepsilon_t^{(b)} \sim N(0, \Sigma_{m+1}). \qquad (19) \]
Here b = 1, 2, ..., B tracks the replications in the Monte Carlo experiments. We set Σ_{m+1} to a block diagonal matrix, where the first block corresponds to the target variable and the second block to the m × m covariance matrix of the factors. The goodness of fit of the prediction equation (the first row of Eq. (19)) is controlled by the parameters α, ρ, γ, A and Σ_{m+1}. We set α such that the population R² for y_{1t}, denoted R²_y, is either 0.2 or 0.8, representing low and high predictability scenarios, respectively. Assuming that the eigenvalues of A lie inside the unit circle, it is readily seen that Cov(f_t^{(b)}) = I_m provided that the part of the covariance matrix Σ_{m+1} that corresponds to the factors is set to (I_m − AA′). From Eq. (19) we see that

\[ R_y^2 = \frac{\alpha^2\rho^2 + \alpha^2\gamma'\gamma/\sigma_{\varepsilon_1}^2}{1 + \alpha^2(\gamma'\gamma)/\sigma_{\varepsilon_1}^2}, \]

where σ²_{ε1} is the variance of the innovation to y_{1t}. For a given choice of R²_y, ρ and γ we then have (setting σ²_{ε1} = 1)

\[ \alpha^2 = \frac{R_y^2}{\rho^2 + \gamma'\gamma(1 - R_y^2)}. \]
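A sketch of how a single replication of Eq. (19) can be generated, calibrating α from R²_y as above, follows (our illustration; the parameter values in the example call anticipate DGP 1 described below):

```python
import numpy as np

def simulate_dgp19(R2y, rho, gamma, A, T, seed=0):
    """Simulate one replication of the VAR in Eq. (19) for
    z_t = (y_1t, f_t')'. alpha is calibrated so that the prediction
    equation has population R^2 equal to R2y (sigma_eps1^2 = 1)."""
    rng = np.random.default_rng(seed)
    m = len(gamma)
    alpha = np.sqrt(R2y / (rho**2 + gamma @ gamma * (1.0 - R2y)))
    # coefficient matrix C = [alpha*rho, alpha*gamma'; 0, A]
    C = np.zeros((m + 1, m + 1))
    C[0, 0], C[0, 1:] = alpha * rho, alpha * gamma
    C[1:, 1:] = A
    # block-diagonal Sigma_{m+1}: Var(eps_1t) = 1; the factor block is
    # I_m - A A' so that Cov(f_t) = I_m (assumed positive definite)
    Sigma = np.zeros((m + 1, m + 1))
    Sigma[0, 0] = 1.0
    Sigma[1:, 1:] = np.eye(m) - A @ A.T
    L = np.linalg.cholesky(Sigma)
    z = np.zeros((T, m + 1))
    for t in range(1, T):
        z[t] = C @ z[t - 1] + L @ rng.standard_normal(m + 1)
    return z

# Example call with the DGP 1 parameters given below (m = 2):
z = simulate_dgp19(R2y=0.2, rho=0.8, gamma=np.array([0.5, 0.5]),
                   A=np.diag([0.5, 0.8]), T=60 + 24)
```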
In practice, the factors are unobserved and forecasters will extract estimates of these from a panel of observed variables. We generate Mj variables for each factor j, j = 1, 2, . . . , m, in a hierarchical
fashion, which corresponds to the estimation of the factors in the empirical analysis:

\[ x_{jit}^{(b)} = \lambda_{ji} f_{jt}^{(b)} + \psi_{jit}^{(b)}, \qquad \psi_{jit}^{(b)} \sim N(0, 1), \quad i = 1, 2, \ldots, M_j,\ j = 1, 2, \ldots, m, \]

where M_j = 30 for all j. λ_{ji} is set such that R²_{jx} = 0.5 in the low predictability scenario and 0.8 in the high predictability scenario:

\[ R_{jx}^2 = \frac{\lambda_j'\lambda_j}{1 + \lambda_j'\lambda_j}, \qquad j = 1, 2, \ldots, m, \]

where λ_j = (λ_{j1}, λ_{j2}, ..., λ_{jM_j})′. Data are generated for window sizes of w = 60, 120, and 240 observations, which in turn are used to compute forecasts for period w + h. We consider forecast horizons of h = 1, 3, 6, 12, and 24 periods.

Data generating process 1

The first DGP assumes that all variables contain useful information for predicting the variable of interest (always the first variable) and so the one-step-ahead forecast should select all variables. Moreover, we set m = 2 and choose the remaining parameters as follows:

\[ C \equiv \begin{pmatrix} \rho & \gamma' \\ 0 & A \end{pmatrix} = \begin{pmatrix} 0.8 & 0.5 & 0.5 \\ 0 & 0.5 & 0 \\ 0 & 0 & 0.8 \end{pmatrix}, \]

so that both f_1 and f_2 help predict y_1, but f_1 and f_2 are in turn not themselves predictable by means of past values of y_1. Moreover, f_2 is quite persistent while f_1 is not, suggesting that for large values of h, f_2 should play more of a role in forecasting y_1 than f_1.

Data generating process 2

Under the second DGP, iterated multi-step forecasts can be expected to be inefficient because they select models that produce good one-step-ahead forecasts and factors in the DGP are only helpful for longer horizon forecasts. Specifically, the parameters for the second DGP are set to m = 3 and

\[ C \equiv \begin{pmatrix} \rho & \gamma' \\ 0 & A \end{pmatrix} = \begin{pmatrix} 0.1 & 0.5 & 0 & 0 \\ 0 & 0.2 & 0.6 & 0 \\ 0 & 0 & 0.2 & 0.6 \\ 0 & 0 & 0 & 0.75 \end{pmatrix}. \]

Notice that f_1 helps predict y_1, but f_1 is in turn not itself predictable by means of past values of y_1. Moreover, f_2 neither predicts nor is predicted by y_1 but f_2 predicts f_1 and therefore may help predict y_1 over medium horizons. Finally, the most persistent factor, f_3, indirectly helps predict y_1 through its ability to predict f_2.

4.1. Forecasts

We generate forecasts from both univariate and multivariate models. The univariate forecasts are based on AR models with lag length up to p_max = 12. The multivariate models consider all regressors in the DGPs with the maximum lag length restricted to p_max = 2. In each case forecasts are based on the model selected by one of the criteria discussed in the previous section. Iterated forecasts are then calculated as follows:

\[ \hat Z_{t+h|t}^{(b)*} = \left[ I_{m+1} + \hat C^{(b)} + \cdots + (\hat C^{(b)})^{h-1} \right] \hat\mu + (\hat C^{(b)})^{h} Z_t^{(b)*}, \qquad (20) \]

where Z_t^{(b)*} = z_t^{(b)*} if z_t^{(b)*} includes y_{1t}^{(b)}, or Z_t^{(b)*} = (y_{1t}^{(b)}, z_t^{(b)*}) if z_t^{(b)*} does not include y_{1t}^{(b)}. Here z_t^{(b)*} is the subset of z_t^{(b)} chosen in the model selection procedure, z_t^{(b)} = (y_{1t}^{(b)}, f̂_{1t}^{(b)}, f̂_{2t}^{(b)}, ..., f̂_{mt}^{(b)}), and f̂_{jt}^{(b)} is the first principal component extracted from the set of M_j regressors, x_{jit}^{(b)}, i = 1, 2, ..., M_j. Ĉ^{(b)} is the estimate of C defined above for the bth Monte Carlo replication. The iterated h-step ahead forecast of y_{1t} is denoted by ŷ^{(b)}_{1,t+h|t}. Direct forecasts are obtained from

\[ \tilde y_{1,t+h|t}^{(b)} = \hat\mu_h + \hat\beta_h' z_{ht}^{(b)*}, \qquad (21) \]

where z_{ht}^{(b)*} is the subset of regressors selected for the h-step ahead forecast. Forecast errors are calculated as

\[ \hat e_{t+h}^{(b)} = y_{1,t+h}^{(b)} - \hat y_{1,t+h|t}^{(b)}, \qquad \tilde e_{t+h}^{(b)} = y_{1,t+h}^{(b)} - \tilde y_{1,t+h|t}^{(b)}. \qquad (22) \]

Forecasting performance is measured by the mean squared forecast error (MSFE) computed as

\[ \mathrm{MSFE} = \frac{1}{B} \sum_{b=1}^{B} \left( e_{t+h}^{(b)} \right)^2, \qquad (23) \]

where e^{(b)}_{t+h} is either ê^{(b)}_{t+h} or ẽ^{(b)}_{t+h}.
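In code, the two forecasting schemes in Eqs. (20) and (21) and the MSFE criterion in Eq. (23) amount to the following (a minimal sketch, ours; the least squares estimates Ĉ, μ̂ and β̂_h are assumed to be computed elsewhere):

```python
import numpy as np

def iterated_forecast(Z_t, C_hat, mu_hat, h):
    """Iterated h-step forecast of Eq. (20):
    (I + C + ... + C^{h-1}) mu + C^h Z_t."""
    k = len(Z_t)
    acc, Ck = np.zeros((k, k)), np.eye(k)
    for _ in range(h):
        acc += Ck            # accumulates I + C + ... + C^{h-1}
        Ck = Ck @ C_hat      # ends as C^h
    return acc @ mu_hat + Ck @ Z_t

def direct_forecast(z_ht, mu_hat_h, beta_hat_h):
    """Direct h-step forecast of Eq. (21)."""
    return mu_hat_h + beta_hat_h @ z_ht

def msfe(errors):
    """Eq. (23): average squared forecast error over replications."""
    e = np.asarray(errors)
    return np.mean(e ** 2)
```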
4.2. Summary of Monte Carlo results

Results from the Monte Carlo simulations are reported in Tables 1 and 2. To study how the degree of predictability affects the findings, each table contains a panel with R²y = 0.2 and R²x = 0.5,
and a panel with R2y = R2x = 0.8. The former is closer to the empirical results that we obtain, while the second scenario is more relevant for highly persistent variables. First, consider the results in Table 1 for the data generated under DGP 1. For both the univariate models and the FAVARs, the iterated approach dominates the direct approach. This is a robust finding that holds across estimation sample sizes (w = 60, 120, and 240), information criteria (AIC and BIC), and forecast horizons (h = 3, 6, 12, and 24). In this case the performance of the different methods can largely be explained by the effect of parameter estimation error. The relative performance of the direct to the iterated forecasts improves with the length of the estimation window, w , because it becomes less costly to use an inefficient estimation method in the larger samples. Conversely, for a fixed estimation window, w , the relative performance of the direct approach worsens as h increases because fewer observations are effectively available to estimate the parameters of the direct forecast model. Turning to a comparison of the univariate and multivariate forecasts, under the iterated approach FAVAR models selected by the simple AIC or BIC produce better forecasts on average than their univariate counterparts. The dominance of the iterated FAVAR models selected by the AIC is more pronounced in the high predictability scenario (R2y = 0.8) where the parameters of the model are more precisely estimated. Interestingly, the direct forecasts from the FAVAR models only dominate the direct univariate forecasts in the high predictability scenario. For the longer horizons (i.e., h ≥ 6), the modifications to the AIC work well as the data overlap becomes more pronounced and serial correlation in the errors is attenuated. In particular, the band diagonal modified AIC, defined by Eq. (17), improves over the simple direct AIC in three of six of the FAVAR cases shown in Table 1 when h = 6, while this number rises to five of six cases when h = 24. Interestingly, the band diagonal approach is generally more successful at reducing the MSFE values than the Newey–West approach. The SURE approach provides an even better forecast
Table 1
Forecast performance of iterated and direct methods: Monte Carlo results under DGP 1.
[Entries: MSFE values at forecast horizons h = 3, 6, 12 and 24 for AR and FAVAR models with estimation windows w = 60, 120 and 240, under the two panels R²y = 0.2, R²x = 0.5 and R²y = R²x = 0.8; methods: iterated AIC, direct AIC, mod. AIC(NW), mod. AIC(diag), AIC(SURE), iterated BIC, direct BIC, BIC(SURE).]
The table reports the MSFE for the different forecasting methods. The forecasts labeled ‘mod. AIC(diag)’ and ‘mod. AIC(NW)’ are based on the modified AIC with band diagonal or Newey–West covariance matrices. Forecasts ‘AIC(SURE)’ and ‘BIC(SURE)’ are based on models that allow for autocorrelation in the likelihood function and use OLS estimates once the model has been selected. w is the length of the estimation window. AR refers to univariate autoregressive models while FAVAR refers to factor-augmented vector autoregressions. The Monte Carlo results are based on 10,000 iterations. Details of DGPs 1 and 2 are given in Section 4.
performance and improves in four of six cases over the simple direct AIC when h = 6 and in all six cases when h = 24.

Turning to the second DGP, Table 2 now shows a few cases where the direct FAVAR forecasts produce better performance than the best iterated forecasts. It is clear from this DGP, however, that the degree of model misspecification has to be quite large for this to happen. The table also shows continued gains from using the SURE approach and the band diagonal covariance adjustment to the conventional AIC.

5. Empirical results

In their empirical analysis, Marcellino et al. (2006) (MSW) found that iterated univariate forecasts generally outperform direct univariate forecasts. Furthermore, they found that the relative performance of the iterated univariate forecasts improves with the forecast horizon. MSW studied univariate and bivariate VARs with lag orders either fixed or selected by AIC or BIC. Apart from the search over lag orders, they did not, however, conduct a broad model specification search involving multivariate models. Hence, it remains to be seen whether their findings change under a broader model specification search. We consider this question using the same data as in the MSW study, which comprises 170 US macroeconomic time series measured at the monthly frequency over the period 1959–2002 (528 months).4

5.1. Data transformations

Following MSW, all variables are transformed by differencing a suitable number of times to achieve stationarity for estimation and model selection.5 In a second step, forecasts are transformed back to levels and compared to level variables. We briefly explain how the forecasts are computed under the direct and iterated approaches using the autoregressive models as an illustration. Denote the variables in levels by x_t and differenced variables by y_t. AR forecasts can be computed as follows. Under the iterated approach, y_{t+h} is predicted and x̂_{t+h|t} is constructed from x_t and ŷ_{t+h|t}. The forecasts from the AR model are based on the sample y_{t−w+1}, y_{t−w+2}, ..., y_t.

4 We are grateful to Mark Watson for making this data set publicly available.
5 We ignore structural breaks. See Pesaran and Timmermann (2005) for an analysis of this in the case of forecasts from autoregressive models.
Table 2
Forecast performance of iterated and direct methods: Monte Carlo results under DGP 2.
[Entries: MSFE values at forecast horizons h = 3, 6, 12 and 24 for AR and FAVAR models with estimation windows w = 60, 120 and 240, under the two panels R²y = 0.2, R²x = 0.5 and R²y = R²x = 0.8; methods as in Table 1.]
See the footnote to Table 1.
Here y_t = x_t if the variable is I(0), y_t = Δx_t if the variable is I(1), and y_t = Δ²x_t if the variable is I(2). For each variable we use the same order of integration as MSW. Under the iterated approach the forecast of x_{t+h} is constructed from the forecast of y_{t+h}, x_t, and Δx_t as follows:

\[ \hat x_{t+h|t} = \begin{cases} \hat y_{t+h|t} & \text{if } x_t \text{ is I(0)} \\ x_t + \sum_{i=1}^{h} \hat y_{t+i|t} & \text{if } x_t \text{ is I(1)} \\ x_t + h\Delta x_t + \sum_{i=1}^{h}\sum_{j=1}^{i} \hat y_{t+j|t} & \text{if } x_t \text{ is I(2)} \end{cases}. \]

Similarly, under the direct approach, the forecast of x_{t+h} is constructed from the forecast of y_{t+h}, x_t, and Δx_t as follows:

\[ \hat x_{t+h|t} = \begin{cases} \hat y_{t+h|t} & \text{if } x_t \text{ is I(0)} \\ x_t + \hat y_{t+h|t} & \text{if } x_t \text{ is I(1)} \\ x_t + h\Delta x_t + \hat y_{t+h|t} & \text{if } x_t \text{ is I(2)} \end{cases}. \]
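The mapping back to levels is mechanical; a sketch (ours) covering both approaches:

```python
import numpy as np

def levels_iterated(x_t, dx_t, y_path, order):
    """Map iterated forecasts of the differenced series back to levels.
    y_path holds (y_{t+1|t}, ..., y_{t+h|t}); order is the integration
    order (0, 1 or 2) used for the variable, as in Section 5.1."""
    y_path = np.asarray(y_path)
    h = len(y_path)
    if order == 0:
        return y_path[-1]
    if order == 1:
        return x_t + y_path.sum()
    # I(2): x_t + h * dx_t + sum_{i=1}^h sum_{j=1}^i y_{t+j|t}
    return x_t + h * dx_t + sum(y_path[:i].sum() for i in range(1, h + 1))

def levels_direct(x_t, dx_t, y_h, h, order):
    """Direct counterpart: only the h-step forecast y_{t+h|t} is available."""
    if order == 0:
        return y_h
    if order == 1:
        return x_t + y_h
    return x_t + h * dx_t + y_h
```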
5.2. Setup

Forecasting is performed recursively, beginning in 1979M1 (with a minimum of w observations available before forecasting) and running until the end of the sample, 2002M12. This yields up to 286 forecasts for h = 3, and so on. Forecasts are reported for horizons of h = 3, 6, 12, and 24 months. Two window lengths are used for
estimation, namely w = 120 and w = 240, that is, 10 and 20 years of data. Fixing the window length allows us to better understand the role of estimation error in the relative performance of the various approaches.

To address the effect of model selection on the multivariate forecasts, we extract factors from the 170 series arranged into five groups, namely (A) one factor for ‘‘income, output, sales, capacity utilization’’ (38 variables); (B) one factor for ‘‘employment and unemployment’’ (27 variables); (C) one factor for ‘‘construction, inventories, and orders’’ (37 variables); (D) one factor for ‘‘interest rates and asset prices’’ (33 variables); and (E) one factor for ‘‘nominal prices, wages, and money’’ (35 variables). To avoid any look-ahead biases, the factors are estimated recursively. We then obtain forecasts of the factors from VARs fitted to all five factors with lag orders chosen by AIC or BIC and p_max = 2. The search over FAVAR models is thus conducted over specifications that include the target variable's own lags as well as those of the factors.

The space of models is limited as follows. For the univariate autoregressive models the possible lag lengths are p = 0, 1, 2, ..., 12, where p = 0 is an intercept-only model. For the factor-augmented VAR models we search across five factors with zero, one or two lags in addition to an intercept. For computational simplicity the lag length is restricted to be the same for y_it and f̂_it.
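A minimal sketch (ours) of the recursive factor-extraction step, taking the first principal component of each group and using only data up to the forecast origin:

```python
import numpy as np

def first_pc(X):
    """First principal component of a (T x N) panel, computed from
    standardized data and normalized to unit variance."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    vals, vecs = np.linalg.eigh(np.cov(Xs, rowvar=False))
    f = Xs @ vecs[:, -1]          # eigenvector with largest eigenvalue
    return f / f.std()

def recursive_factors(groups, t_end):
    """One factor per variable group, estimated with data up to t_end
    only, so no look-ahead bias enters the forecasts. `groups` is a
    list of (T x N_g) arrays, one per category (A)-(E)."""
    return np.column_stack([first_pc(X[:t_end]) for X in groups])
```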
Table 3
Average forecasting performance measured by the MSFE relative to the corresponding value generated by the univariate iterated forecast models selected by the AIC.

                    AR, w = 120                   AR, w = 240
Forecast horizon    3      6      12     24       3      6      12     24
iterated AIC        1.000  1.000  1.000  1.000    0.962  0.945  0.952  0.994
direct AIC          1.015  1.018  1.084  1.208    0.964  0.945  0.987  1.087
mod. AIC(diag)      1.013  1.017  1.106  1.202    0.964  0.944  0.982  1.097
mod. AIC(NW)        1.003  1.015  1.086  1.227    0.957  0.939  0.986  1.090
AIC(SURE)           1.017  1.028  1.081  1.190    0.985  0.997  1.058  1.106
iterated BIC        1.009  1.037  1.053  1.049    1.007  1.042  1.085  1.148
direct BIC          1.009  1.017  1.077  1.157    0.970  0.953  0.989  1.081
BIC(SURE)           1.007  1.026  1.087  1.150    0.993  1.024  1.069  1.093

                    FAVAR, w = 120                FAVAR, w = 240
iterated AIC        0.977  1.032  1.075  1.069    0.974  1.028  1.058  1.117
direct AIC          0.985  1.045  1.129  1.252    0.949  0.995  1.049  1.134
mod. AIC(diag)      0.984  1.038  1.108  1.204    0.947  0.993  1.035  1.103
mod. AIC(NW)        0.981  1.046  1.133  1.259    0.948  0.992  1.046  1.141
AIC(SURE)           0.978  1.048  1.133  1.227    0.940  0.983  1.038  1.109
iterated BIC        0.974  1.015  1.046  1.086    0.983  1.041  1.083  1.156
direct BIC          0.981  1.018  1.106  1.223    0.960  0.998  1.043  1.117
BIC(SURE)           0.965  1.020  1.105  1.216    0.941  0.980  1.029  1.098

The table reports the MSFE of the different forecasts relative to the MSFE of the iterated AR forecast based on the models selected by AIC with w = 120, where w is the length of the estimation window. ‘AR’ results are based on univariate autoregressive models and ‘FAVAR’ results are based on multivariate factor-augmented VAR models. The MSFEs are calculated only for those periods where forecasts from all methods are available. The forecasts labeled ‘mod. AIC(diag)’ and ‘mod. AIC(NW)’ are based on the modified AIC with band diagonal or Newey–West covariance matrices. Forecasts ‘AIC(SURE)’ and ‘BIC(SURE)’ are based on models that allow for autocorrelation in the likelihood function and use OLS estimates once the model has been selected. Averages are computed across all 170 series in the Marcellino et al. (2006) data set.
Table 4
Proportion of variables for which individual forecast methods generate a lower MSFE than the corresponding values generated by the univariate iterated forecast models selected by the AIC.

                    AR, w = 120                   AR, w = 240
Forecast horizon    3      6      12     24       3      6      12     24
direct AIC          0.441  0.429  0.218  0.176    0.518  0.506  0.359  0.235
mod. AIC(diag)      0.435  0.394  0.235  0.153    0.547  0.506  0.365  0.212
mod. AIC(NW)        0.500  0.418  0.229  0.165    0.547  0.506  0.365  0.235
AIC(SURE)           0.447  0.353  0.194  0.159    0.318  0.412  0.265  0.212
iterated BIC        0.541  0.535  0.482  0.471    0.512  0.424  0.388  0.424
direct BIC          0.412  0.418  0.200  0.165    0.447  0.429  0.324  0.241
BIC(SURE)           0.476  0.365  0.194  0.171    0.406  0.412  0.271  0.235

                    FAVAR, w = 120                FAVAR, w = 240
iterated AIC        0.641  0.541  0.500  0.488    0.659  0.576  0.565  0.600
direct AIC          0.565  0.441  0.341  0.212    0.665  0.518  0.435  0.429
mod. AIC(diag)      0.565  0.465  0.371  0.271    0.647  0.535  0.441  0.459
mod. AIC(NW)        0.600  0.418  0.300  0.212    0.647  0.547  0.435  0.441
AIC(SURE)           0.582  0.412  0.306  0.194    0.671  0.547  0.418  0.447
iterated BIC        0.724  0.618  0.565  0.500    0.635  0.553  0.588  0.594
direct BIC          0.612  0.500  0.335  0.212    0.647  0.529  0.424  0.429
BIC(SURE)           0.641  0.476  0.341  0.218    0.676  0.571  0.441  0.447

The table reports the proportion of series for which the iterated AR forecasts based on the models selected by AIC have larger MSFEs than forecasts based on the respective information criteria. Hence, values above 0.5 suggest that the respective method dominates univariate AIC forecasts for a majority of variables. For a description of the model selection methods see the footnote to Table 3. Proportions are computed as averages across all 170 series in the Marcellino et al. (2006) data set.
5.3. Forecasting performance

Empirical results are summarized in Tables 3–9. We present MSFE values averaged across all 170 variables (Table 3) as well as subsets of these (Table 8). Since these could be dominated by extreme values for individual variables, we also report the proportion of cases (again out of the 170 variables) where a modeling approach either dominates the benchmark univariate iterated forecasting model selected by the AIC (Table 4), or an approach is best overall for a given variable (Table 5). To evaluate statistical significance, we conduct pairwise comparisons of the forecast precision of various approaches against the benchmark univariate iterated models selected by the AIC using the approach suggested by Giacomini and White (2006) (Table 7). Finally, to understand whether the performance of a given forecasting approach is driven by its bias or by imprecision in the forecasts, we report separately the magnitude of the squared bias component in the MSFE (Tables 6 and 9).
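The bias component can be computed directly from the forecast errors; a minimal sketch (ours), relying on the standard decomposition MSFE = squared bias + error variance:

```python
import numpy as np

def squared_bias_ratio(errors, benchmark_msfe):
    """Squared bias of a forecast-error series as a ratio of the
    benchmark MSFE, the quantity reported in Tables 6 and 9."""
    e = np.asarray(errors)
    return e.mean() ** 2 / benchmark_msfe
```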
5.3.1. Univariate models

Table 3 shows the relative forecasting performance (measured by MSFE) of the iterated and direct methods averaged over all 170 variables included in the MSW data. The iterated univariate forecasts based on models selected by the AIC are better on average than the direct ones, particularly at long horizons (h = 12 and 24 months) and when the estimation window is short (w = 120). Conversely, with a longer estimation window (w = 240) the direct univariate forecasting models selected by the more parsimonious BIC perform better than the iterated forecasts selected by this criterion. To understand this, recall that the iterated forecasts are more efficient and therefore tend to have a lower estimation error. Such errors are most important for large models (AIC penalizes large models less than the BIC) and when the sample size is short.6
6 Univariate models selected by the AIC on average include three or four variables for the short estimation sample and four or five variables in the longer sample. For the univariate models selected by the BIC, this number declines to only one or two variables in the small estimation sample and two variables on average in the longer sample.
Table 5
Proportion of variables for which individual forecast methods produce the lowest MSFE values.

                    AR, w = 120                   AR, w = 240
Forecast horizon    3      6      12     24       3      6      12     24
iterated AIC        0.041  0.094  0.124  0.206    0.071  0.094  0.129  0.165
direct AIC          0.018  0.029  0.029  0.006    0.035  0.035  0.035  0.024
mod. AIC(diag)      0.041  0.029  0.006  0.018    0.041  0.041  0.035  0.000
mod. AIC(NW)        0.071  0.065  0.035  0.041    0.041  0.076  0.035  0.018
AIC(SURE)           0.041  0.018  0.006  0.006    0.029  0.029  0.000  0.024
iterated BIC        0.065  0.106  0.135  0.106    0.053  0.035  0.059  0.024
direct BIC          0.024  0.024  0.047  0.012    0.006  0.024  0.006  0.018
BIC(SURE)           0.029  0.029  0.012  0.029    0.035  0.065  0.006  0.000

                    FAVAR, w = 120                FAVAR, w = 240
iterated AIC        0.247  0.171  0.188  0.247    0.106  0.129  0.176  0.247
direct AIC          0.012  0.018  0.041  0.018    0.024  0.024  0.047  0.076
mod. AIC(diag)      0.018  0.035  0.053  0.059    0.024  0.035  0.053  0.094
mod. AIC(NW)        0.029  0.041  0.024  0.012    0.053  0.029  0.041  0.029
AIC(SURE)           0.059  0.047  0.006  0.012    0.118  0.065  0.053  0.047
iterated BIC        0.182  0.188  0.229  0.171    0.165  0.218  0.253  0.141
direct BIC          0.047  0.047  0.041  0.053    0.053  0.035  0.029  0.059
BIC(SURE)           0.076  0.059  0.024  0.006    0.147  0.065  0.041  0.035

For each forecast horizon and window length, the table reports the proportion of variables for which the respective forecast methods generate the lowest MSFE value. For the model selection methods see the footnote to Table 3. Proportions are computed across all 170 series in the Marcellino et al. (2006) data set.
Table 6
Average ratio of squared bias measured relative to the MSFE of the iterated univariate forecast models selected by the AIC.

                    AR, w = 120                   AR, w = 240
Forecast horizon    3      6      12     24       3      6      12     24
iterated AIC        0.014  0.028  0.049  0.090    0.019  0.040  0.074  0.154
direct AIC          0.012  0.023  0.048  0.114    0.018  0.037  0.079  0.194
mod. AIC(diag)      0.012  0.023  0.052  0.120    0.018  0.037  0.079  0.195
mod. AIC(NW)        0.012  0.023  0.049  0.114    0.019  0.038  0.078  0.195
AIC(SURE)           0.013  0.024  0.048  0.116    0.025  0.040  0.081  0.202
iterated BIC        0.025  0.050  0.081  0.129    0.041  0.092  0.166  0.294
direct BIC          0.016  0.029  0.056  0.117    0.022  0.042  0.085  0.198
BIC(SURE)           0.018  0.031  0.059  0.122    0.032  0.050  0.094  0.213

                    FAVAR, w = 120                FAVAR, w = 240
iterated AIC        0.036  0.068  0.106  0.163    0.056  0.117  0.206  0.365
direct AIC          0.027  0.043  0.070  0.152    0.038  0.061  0.104  0.226
mod. AIC(diag)      0.026  0.041  0.073  0.164    0.038  0.065  0.106  0.215
mod. AIC(NW)        0.026  0.043  0.070  0.152    0.037  0.061  0.104  0.230
AIC(SURE)           0.025  0.041  0.069  0.151    0.035  0.055  0.097  0.206
iterated BIC        0.033  0.063  0.098  0.152    0.058  0.124  0.217  0.374
direct BIC          0.028  0.042  0.073  0.157    0.041  0.067  0.109  0.227
BIC(SURE)           0.025  0.041  0.073  0.157    0.037  0.061  0.101  0.205

The table reports the squared bias of the different forecasts as a ratio of the corresponding MSFE of the iterated AR forecast based on the models selected by AIC with w = 120, where w is the length of the estimation window. For the model selection methods see the footnote to Table 3. Averages are computed across all 170 series in the Marcellino et al. (2006) data set.
For the short estimation window (w = 120), the iterated models selected by the AIC deliver the best average forecasting performance among the univariate models. For the longer estimation window (w = 240), however, the best forecasting performance for horizons of h = 3 and h = 6 months is produced by the direct forecast model selected by the AIC modified by using a Newey–West covariance matrix. Once again the univariate iterated approach based on the AIC dominates on average when h = 12 and 24 months. For the direct univariate forecast models there is only very limited evidence that the SURE estimation approach helps reduce average MSFE values.

The average MSFE values reported in Table 3 may be dominated by the most volatile variables and could provide an incomplete picture of relative forecasting performance. To deal with this, Table 4 shows the proportion of the 170 variables for which the
iterated univariate AR forecasts based on models selected by the AIC generate a larger MSFE than the various alternatives. We use the iterated univariate forecasts selected by the AIC as our benchmark given the earlier evidence that this approach generally selects good univariate models, a finding corroborated by the results reported by MSW. Among the univariate forecasts, for the short estimation window (w = 120), only the iterated forecasts based on models selected by the BIC produce a majority of cases that outperform the iterated AIC, and only then for h = 3 or h = 6 months. With a longer estimation window (w = 240) the direct forecasts based on the AIC, whether modified or not, also produce lower average MSFE values for the majority of variables at horizons of three and six months. Table 5 shows the proportion of cases (averaged across the 170 variables) where each of the respective methods produces the lowest MSFE value. Among the univariate approaches only the iterated AIC and the iterated BIC produce a sizable proportion of variables with the lowest MSFE value, particularly for the short estimation window (w = 120) and at the longest forecast horizons.
Table 7
Pairwise forecast comparison tests.

                    AR, w = 120                                          AR, w = 240
Forecast horizon    3          6          12         24                  3          6          12         24
direct AIC          5.3(3.5)   4.7(2.9)   12.4(0.6)  22.4(0.0)           1.8(4.7)   4.1(4.1)   9.4(1.8)   8.8(1.2)
mod. AIC(diag)      4.7(3.5)   6.5(2.9)   17.1(0.6)  18.8(1.2)           1.2(5.3)   4.1(5.3)   9.4(1.8)   9.4(1.2)
mod. AIC(NW)        5.9(3.5)   4.1(3.5)   11.8(0.6)  21.8(0.6)           3.5(5.3)   4.1(5.9)   8.2(2.4)   9.4(1.2)
AIC(SURE)           8.2(3.5)   7.1(1.8)   14.7(0.6)  21.8(0.6)           12.4(0.0)  7.6(2.4)   13.5(0.0)  10.6(1.2)
iterated BIC        11.2(2.9)  12.4(3.5)  6.5(2.4)   5.9(2.4)            18.8(2.4)  18.2(2.9)  12.4(1.2)  10.6(1.2)
direct BIC          10.6(1.8)  8.2(1.2)   15.3(0.0)  18.2(0.6)           10.6(1.8)  4.1(1.2)   5.3(0.6)   10.0(1.8)
BIC(SURE)           8.2(2.9)   12.4(1.2)  17.1(0.6)  17.1(0.6)           12.9(4.1)  18.8(1.2)  13.5(0.0)  11.2(2.4)

                    FAVAR, w = 120                                       FAVAR, w = 240
iterated AIC        10.6(11.2) 12.4(8.8)  8.8(8.2)   7.1(4.7)            12.4(11.2) 12.4(10.6) 11.2(7.6)  7.6(11.2)
direct AIC          8.2(8.2)   9.4(5.3)   11.2(2.4)  11.8(0.0)           10.6(10.0) 8.8(7.1)   8.8(6.5)   10.0(7.1)
mod. AIC(diag)      6.5(9.4)   8.2(5.3)   10.0(1.8)  8.8(0.0)            10.0(9.4)  10.0(7.6)  9.4(6.5)   6.5(5.3)
mod. AIC(NW)        7.6(9.4)   8.2(4.7)   11.2(2.4)  14.1(0.0)           9.4(10.6)  8.2(6.5)   9.4(6.5)   9.4(7.1)
AIC(SURE)           7.1(9.4)   10.6(5.9)  14.1(1.8)  12.9(0.0)           9.4(14.1)  9.4(7.6)   9.4(6.5)   9.4(7.1)
iterated BIC        12.4(12.4) 11.8(10.0) 8.2(7.6)   8.2(5.3)            12.4(13.5) 11.8(12.4) 10.6(10.0) 8.2(9.4)
direct BIC          8.2(8.2)   8.2(8.2)   10.0(2.4)  12.4(0.0)           11.2(12.9) 9.4(8.8)   9.4(6.5)   8.8(5.9)
BIC(SURE)           7.6(10.0)  10.6(7.1)  8.8(2.4)   10.0(0.0)           10.0(14.1) 9.4(8.2)   10.0(5.9)  10.0(5.3)

The table reports the results of the Giacomini and White (2006) test with null hypothesis that the iterated AR forecast based on AIC and the method in the respective row are equally accurate. Reported are the proportions of rejections at the 5% level for the two-sided test where the iterated AR(AIC) forecast has the lowest MSFE and in brackets the proportion of rejections when the respective forecasting method has the lower MSFE. For the model selection methods, see the footnote to Table 3.
Table 8
Average forecasting performance measured by the MSFE relative to the corresponding value generated by the univariate iterated forecasting models selected by the AIC in sub-categories of the Marcellino et al. (2006) data.

Categories (A)–(D) (averages over 135 series)
                    AR, w = 120                   AR, w = 240
Forecast horizon    3      6      12     24       3      6      12     24
iterated AIC        1.000  1.000  1.000  1.000    0.964  0.945  0.944  0.966
direct AIC          1.024  1.027  1.096  1.185    0.969  0.949  0.983  1.042
mod. AIC(diag)      1.021  1.026  1.122  1.163    0.969  0.949  0.978  1.051
mod. AIC(NW)        1.008  1.023  1.099  1.201    0.960  0.941  0.981  1.046
AIC(SURE)           1.023  1.036  1.088  1.161    0.974  0.969  0.992  1.048
iterated BIC        0.984  0.993  0.997  0.995    0.960  0.953  0.953  0.968
direct BIC          0.999  1.011  1.074  1.124    0.963  0.949  0.972  1.012
BIC(SURE)           0.996  1.015  1.076  1.113    0.962  0.966  0.980  1.009

                    FAVAR, w = 120                FAVAR, w = 240
iterated AIC        0.928  0.954  0.978  0.982    0.886  0.888  0.872  0.877
direct AIC          0.945  0.983  1.069  1.231    0.881  0.909  0.977  1.094
mod. AIC(diag)      0.946  0.985  1.065  1.165    0.878  0.903  0.959  1.043
mod. AIC(NW)        0.942  0.989  1.074  1.241    0.877  0.903  0.971  1.099
AIC(SURE)           0.939  0.989  1.081  1.201    0.872  0.898  0.965  1.066
iterated BIC        0.927  0.937  0.942  1.001    0.890  0.888  0.875  0.895
direct BIC          0.943  0.961  1.051  1.189    0.887  0.905  0.961  1.061
BIC(SURE)           0.931  0.967  1.061  1.183    0.872  0.890  0.949  1.045

Category (E) (averages over 35 series)
                    AR, w = 120                   AR, w = 240
iterated AIC        1.000  1.000  1.000  1.000    0.954  0.946  0.986  1.101
direct AIC          0.983  0.984  1.038  1.296    0.943  0.927  1.004  1.258
mod. AIC(diag)      0.982  0.984  1.048  1.352    0.942  0.926  0.999  1.275
mod. AIC(NW)        0.983  0.984  1.036  1.326    0.946  0.933  1.002  1.259
AIC(SURE)           0.997  0.997  1.057  1.299    1.029  1.108  1.312  1.330
iterated BIC        1.106  1.205  1.270  1.260    1.191  1.384  1.595  1.842
direct BIC          1.046  1.038  1.088  1.284    0.997  0.971  1.053  1.346
BIC(SURE)           1.047  1.067  1.129  1.295    1.111  1.246  1.410  1.420

                    FAVAR, w = 120                FAVAR, w = 240
iterated AIC        1.168  1.332  1.449  1.401    1.315  1.566  1.773  2.041
direct AIC          1.139  1.281  1.360  1.332    1.210  1.328  1.323  1.291
mod. AIC(diag)      1.132  1.245  1.273  1.356    1.212  1.337  1.328  1.335
mod. AIC(NW)        1.131  1.264  1.361  1.330    1.222  1.334  1.334  1.301
AIC(SURE)           1.129  1.274  1.333  1.327    1.203  1.309  1.320  1.275
iterated BIC        1.155  1.316  1.443  1.413    1.341  1.631  1.884  2.162
direct BIC          1.129  1.237  1.320  1.353    1.239  1.358  1.357  1.333
BIC(SURE)           1.098  1.228  1.276  1.346    1.205  1.328  1.334  1.299

The table reports the MSFE of the different forecasts as a ratio of the MSFE of the iterated AR forecast based on the model selected by AIC with w = 120, where w is the length of the estimation window. ‘AR’ results are based on univariate autoregressive models and ‘FAVAR’ results are based on multivariate factor-augmented VAR models. The MSFEs are calculated only for those periods where forecasts from all methods are available. For details, see the footnote to Table 3.
Table 9
Average ratio of squared bias measured relative to the MSFE of the iterated univariate forecasts selected by the AIC in sub-categories of the Marcellino et al. (2006) data.
[Entries: squared bias as a ratio of the benchmark MSFE at forecast horizons h = 3, 6, 12 and 24 for AR and FAVAR models with w = 120 and w = 240, reported separately for categories (A)–(D) (averages over 135 series) and category (E) (averages over 35 series); methods as in Table 3.]
The table reports the squared bias of the different forecasts as a ratio of the MSFE of the iterated AR forecast based on the model selected by AIC with w = 120, where w is the length of the estimation window. For the model selection methods, see the footnote to Table 3.
From a theoretical perspective it is unclear whether the iterated approach leads to greater (squared) biases than the direct approach. To shed empirical light on this issue, Table 6 reports the squared forecast bias as a ratio of the MSFE of the benchmark iterated univariate forecast models selected by the AIC. For all methods the squared bias grows as a proportion of the MSFE of the benchmark model when the forecast horizon is extended. Interestingly, at short horizons (h ≤ 6) the squared bias component of the iterated forecasts based on models selected by the AIC is slightly larger than that of the direct approaches, while conversely the relative bias of the iterated AIC models is smaller at the two longest horizons (h = 12, 24). The iterated forecast models selected by the BIC generate a comparatively large bias that exceeds that generated by the direct forecast models selected by the BIC, suggesting that the parsimony of these models comes at the expense of a larger bias.

5.3.2. Multivariate models

Turning to the factor-augmented VAR models, Table 3 shows that the iterated forecasts continue to do better on average than the direct FAVAR forecasts when a short estimation window (w = 120) is used. These results are frequently overturned, however, when the long estimation window (w = 240) is used. In the latter case, the direct forecasting method is better for forecast horizons of h = 3, 6 and 12 months irrespective of which information criterion
is used, and also for h = 24 months under the diagonal modified AIC method or the SURE estimation approach.7 For the factor-augmented models, Table 3 shows that the band diagonal modification to the AIC helps improve the average performance of the direct forecasts across all horizons and for both estimation windows (w = 120 or w = 240). The Newey–West modification is less consistent in improving on the conventional AIC method. Moreover, in contrast with the univariate models, the SURE approach generally improves the direct FAVAR forecasts, particularly with a long estimation window (w = 240). Comparing the average forecasting performance across both univariate and multivariate models, at the shortest horizon (h = 3) the SURE method combined with the BIC produces the lowest average MSFE values when w = 120. Similarly, when w = 240, the direct forecast models estimated by SURE produce the best performance for h = 3 and h = 6 months. In all other cases, the univariate iterated forecast models selected by the AIC produce the lowest average MSFE values. These results are somewhat
dominated by extreme cases, however. Table 4 shows that for a majority of the 170 variables the iterated FAVAR forecasts based on models selected by AIC or BIC produce lower MSFE values than the univariate iterated forecasting models selected by AIC.

Table 5 shows additional evidence that the iterated FAVAR models perform well. The multivariate iterated forecasting models selected by the AIC or BIC produce the lowest MSFE values for around 40% of the 170 variables. This is a greater share than that recorded by other methods, although the SURE approach also performs well at the shortest horizon (h = 3).

Table 6 shows that the squared bias associated with the FAVAR models tends to exceed the squared bias found for the univariate models. Moreover, for the multivariate models the iterated approach tends to produce larger biases than the direct forecast approach.

In conclusion, for the majority of variables the iterated FAVAR models generate smaller forecast errors than the best univariate approach. There is less evidence in favor of the direct forecasting models. Only for w = 240 and h = 3 or 6 months do we find that the direct approach leads to comparable forecasting performance. Overall, these findings demonstrate the value from utilizing multivariate information and also provide evidence that our proposed refinements to standard information criteria work in many cases.

7 As expected, the multivariate models include more predictor variables than their univariate counterparts. Under the AIC, on average five to seven regressors get included in the small sample, rising to six to eight variables in the larger sample. Once again, the BIC leads to somewhat smaller models with three to four predictor variables.

5.4. Model comparisons

Table 7 provides test results based on a formal comparison of the benchmark univariate iterated model selected by the AIC with the alternative approaches listed in each row. Tests are based on the methodology advocated by Giacomini and White (2006), which is ideally suited for our purpose since we are conducting pairwise model comparisons and use rolling-window estimators. The table lists the percentage of model comparisons for which the null of equal predictive accuracy is rejected in a two-sided test conducted at a 5% significance level against the alternative that the univariate iterated models selected by the AIC are best or, conversely, that the alternative model is best (listed in brackets).

The percentage of cases where the iterated univariate AIC method dominates other univariate forecast methods generally grows with the forecast horizon and is around 5%–15% when h = 3 or h = 6 months and 10%–20% for h = 12 or h = 24 months. We find far fewer cases where the iterated univariate forecasts selected by the AIC are rejected in favor of alternative univariate methods. These test results provide statistical evidence that the univariate iterated AIC approach frequently performs significantly better than the other univariate methods. Hence there is little evidence to prefer alternative univariate methods.

Turning to the factor-augmented models, the evidence is generally less clear-cut, with the proportion of significant cases where the iterated univariate AIC forecasts are preferred over other approaches, such as the direct AIC forecasts, generally being more balanced, at least at the short horizon. In most cases, however, the iterated univariate AIC forecasts continue to reject alternative approaches more often than they are themselves rejected. Exceptions to this are the iterated FAVAR models based on either the AIC or BIC, which reject about as often as they themselves get rejected by the univariate iterated models based on the AIC.
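For a pairwise MSFE comparison with rolling-window estimators, the unconditional version of this test can be sketched as follows (our illustration; Giacomini and White's conditional test additionally interacts the loss differential with test functions, whereas a constant instrument reduces it to a Diebold–Mariano-type statistic):

```python
import numpy as np

def equal_accuracy_tstat(e1, e2, h):
    """t-statistic for the null of equal predictive accuracy between
    two h-step forecast error series from rolling-window schemes,
    based on the squared-error loss differential with a Bartlett/HAC
    long-run variance (bandwidth h - 1); approximately N(0,1) under
    the null."""
    d = np.asarray(e1) ** 2 - np.asarray(e2) ** 2   # loss differential
    n = len(d)
    u = d - d.mean()
    lrv = u @ u / n
    for lag in range(1, h):
        w = 1.0 - lag / h                            # Bartlett weight
        lrv += 2.0 * w * (u[lag:] @ u[:-lag]) / n
    return d.mean() / np.sqrt(lrv / n)
```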
5.5. Results by variable categories

The empirical results turn out to be quite similar for four of the five categories of economic variables, namely (A) income, output, sales and capacity utilization, (B) employment and unemployment, (C) construction, inventories and orders, and (D) interest rates and asset prices. In contrast, quite different results are obtained for
the fifth category, (E), nominal prices, wages and money. For this reason, Tables 8 and 9 present separate results averaged across variables in categories A–D versus category E variables. Table 8 shows that the benefit from using the multivariate factor-based approach comes out very strongly for the first four categories. For these variables, across almost all sample sizes and forecast horizons, the models selected by the multivariate iterated AIC or BIC produce lower MSFE values than the univariate iterated AIC approach. Among the direct FAVAR forecasts the modified AIC and SURE methods perform quite well, particularly with the long estimation window (w = 240). Across the first four categories of variables, the multivariate iterated approach based on the BIC performs best for the shortest estimation window (w = 120) when h = 3, 6 or 12 months. When w = 240, the iterated FAVAR approach based on the AIC generates the best results on average, except for when h = 3 where the SURE approach is best. Table 9 shows that forecasts based on these methods are only modestly biased. In contrast, for the final group of variables, (E) nominal prices, wages and money, the FAVAR approach strongly underperforms against the univariate iterated AIC models. Iterated FAVAR models are particularly poor and are outperformed by their direct counterparts. This suggests that the iterated FAVAR models are heavily biased, a conjecture that is confirmed in Table 9 which reveals massive biases for the iterated FAVAR models at long horizons. The biases associated with the direct forecast models are much smaller.
6. Conclusion

We compare the performance of iterated and direct forecasts generated by univariate and multivariate (factor-augmented VAR) models. Our simulations and empirical results show an interesting interaction between the length of the estimation window, how strongly a particular model selection method penalizes the inclusion of additional variables, the forecast horizon, the method used to estimate model parameters and the relative performance of the direct versus iterated approaches.

Like Marcellino et al. (2006), our results suggest that there is no single dominant approach and that the best forecasting method varies considerably across economic variables. The iterated factor-augmented VAR approach performs considerably better than the best univariate forecasting approach for variables tracking income, output, employment, construction, interest rates and asset prices. Conversely, the univariate iterated models dominate among variables tracking nominal prices, wages and money. For such variables the factor-augmented iterated models produce heavily biased forecasts.

Our empirical and simulation results suggest that the degree of model misspecification has to be quite large for the direct forecasts to start dominating the iterated forecasts, and that the forecasts generated by autoregressive models of low order – whether factor-augmented or not – are difficult to beat for most economic variables. This is a result of the (squared) bias component generally playing a relatively minor role relative to the importance of parameter estimation error in the composition of MSFE values. Consistent with this, the iterated forecasting approach performs particularly well relative to the direct approach when the sample size is small, when using an information criterion such as the AIC that does not penalize additional parameters too heavily, and when the forecast horizon gets large.
References

Athanasopoulos, G., Vahid, F., 2008. VARMA versus VAR for macroeconomic forecasting. Journal of Business and Economic Statistics 26, 237–251.
Bai, Jushan, Ng, Serena, 2002. Determining the number of factors in approximate factor models. Econometrica 70, 161–221.
Bai, Jushan, Ng, Serena, 2009. Boosting diffusion indices. Journal of Applied Econometrics 24, 607–629.
Bao, Y., 2007. Finite sample properties of forecasts from the stationary first-order autoregressive model under a general error distribution. Econometric Theory 23, 767–773.
Bernanke, Ben S., Boivin, Jean, Eliasz, Piotr, 2005. Measuring the effect of monetary policy: a factor-augmented vector autoregressive (FAVAR) approach. Quarterly Journal of Economics 120, 387–422.
Bhansali, Rajendra J., 1999. Parameter estimation and model selection for multistep prediction of a time series: a review. In: Ghosh, Subir (Ed.), Asymptotics, Non-Parametrics and Time Series. Marcel Dekker, New York, pp. 201–225.
Brown, Bryan W., Mariano, Roberto S., 1989. Measures of deterministic prediction bias in nonlinear models. International Economic Review 30, 667–684.
Chevillon, Guillaume, 2007. Direct multi-step estimation and forecasting. Journal of Economic Surveys 21, 746–785.
Clements, Michael, Hendry, David, 1998. Forecasting Economic Time Series. Cambridge University Press, Cambridge.
Cox, David R., 1961. Prediction by exponentially weighted moving averages and related methods. Journal of the Royal Statistical Society, Series B 23, 414–422.
Favero, Carlo A., Tamoni, Andrea, 2010. Demographics and the term structure of stock market risk. Mimeo, Bocconi University.
Findley, David F., 1983. On the use of multiple models for multi-period forecasting. In: Proceedings of Business and Economic Statistics. American Statistical Association, pp. 528–531.
Forni, Mario, Hallin, Marc, Lippi, Marco, Reichlin, Lucrezia, 2005. The generalized dynamic factor model, one-sided estimation and forecasting. Journal of the American Statistical Association 100, 830–840.
Giacomini, Raffaella, White, Halbert, 2006. Tests of conditional predictive ability. Econometrica 74, 1545–1578.
Hoque, A., Magnus, J.R., Pesaran, B., 1988. The exact multi-period mean squared forecast error for the first-order autoregressive model. Journal of Econometrics 39, 327–346.
Ing, Ching-Kang, 2003. Multistep prediction in autoregressive processes. Econometric Theory 19, 254–279.
Marcellino, Massimiliano, Stock, James H., Watson, Mark W., 2006. A comparison of direct and iterated multistep AR methods for forecasting macroeconomic time series. Journal of Econometrics 135, 499–526.
Newey, Whitney K., West, Kenneth D., 1987. A simple, positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix. Econometrica 55, 703–708.
Pesaran, M. Hashem, Timmermann, Allan, 2005. Small sample properties of forecasts from autoregressive models under structural breaks. Journal of Econometrics 129, 183–217.
Schorfheide, Frank, 2005. VAR forecasting under misspecification. Journal of Econometrics 128, 99–136.
Stock, James H., Watson, Mark W., 2002. Forecasting using principal components from a large number of predictors. Journal of the American Statistical Association 97, 1167–1179.
Stock, James H., Watson, Mark W., 2005. Implications of dynamic factor models for VAR analysis. Mimeo, Princeton University.
Ullah, Aman, 2004. Finite Sample Econometrics. Oxford University Press, Oxford.
Journal of Econometrics 164 (2011) 188–205
A two-step estimator for large approximate dynamic factor models based on Kalman filtering✩

Catherine Doz a,b, Domenico Giannone c,d, Lucrezia Reichlin e,d,∗

a Paris School of Economics, France
b University Paris 1 Panthéon-Sorbonne, France
c Université Libre de Bruxelles, ECARES, Belgium
d CEPR, United Kingdom
e London Business School, United Kingdom
Article info

Article history: Available online 16 March 2011.
JEL classification: C51; C32; C33.
Keywords: Factor models; Kalman filter; Principal components; Large cross-sections.

Abstract

This paper shows consistency of a two-step estimation of the factors in a dynamic approximate factor model when the panel of time series is large (n large). In the first step, the parameters of the model are estimated from an OLS on principal components. In the second step, the factors are estimated via the Kalman smoother. The analysis develops the theory for the estimator considered in Giannone et al. (2004) and Giannone et al. (2008) and for the many empirical papers using this framework for nowcasting. © 2011 Elsevier B.V. All rights reserved.
1. Introduction

A very recent development of the forecasting literature has been the design of statistical models for ‘‘nowcasting’’. Nowcasting is the forecast of GDP for the recent past, the present and the near future. Since GDP is published with a long delay and is only available at quarterly frequency, many institutions are concerned with the problem of exploiting monthly information in order to obtain an early estimate of last quarter and current quarter GDP as well as a forecast for one quarter ahead. When one exploits information from many monthly variables in real time, a key issue is that at the end of the sample information is incomplete, since data are released at non-synchronized dates and, as a consequence, the panel of monthly data has a jagged/ragged edge.
✩ We would like to thank the editor, two anonymous referees, Ursula Gather, Marco Lippi and Ricardo Mestre for helpful suggestions, and seminar participants at the International Statistical Institute in Berlin, 2003, the European Central Bank, 2003, the Statistical Institute at the Catholic University of Louvain-la-Neuve, 2004, the Institute for Advanced Studies in Vienna, 2004, and the Department of Statistics in Madrid, 2004. ∗ Corresponding address: London Business School, Regents Park, NW1 4SA London, United Kingdom. Tel.: +44 0 20 7000 8435; fax: +44 0 20 7000 7001. E-mail address:
[email protected] (L. Reichlin).
0304-4076/$ – see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.jeconom.2011.02.012
A seminal paper in the nowcasting literature is Giannone et al. (2008). The authors develop a model for nowcasting quarterly GDP using a large number of monthly releases. The proposed framework consists in bridging quarterly GDP with common factors extracted from the large set of monthly variables. In order to deal with jagged edges, the factor model is estimated in a two-step estimation procedure based on principal components and the Kalman filter. The model was first implemented at the Board of Governors of the Federal Reserve and then at the European Central Bank (Angelini et al., 2011; Banbura and Rünstler, 2007; Rünstler et al., 2009; ECB, 2008). The method has also been used for other economies, including France (Barhoumi et al., 2010), Ireland (D'Agostino et al., 2008), New Zealand (Matheson, 2010), Norway (Aastveit and Trovik, 2008) and Switzerland (Siliverstovs and Kholodilin, 2010). Although the good empirical performance of the model has been extensively documented, the theoretical characteristics of the estimator have never been studied. The goal of this paper is to fill this gap.

We consider a ‘‘large’’ panel of time series and assume that it can be represented by an approximate factor structure whereby the dynamics of each series is split in two orthogonal components: one capturing the bulk of cross-sectional comovements, driven by few common factors, and the other composed of poorly cross-correlated elements. This model has been introduced by
C. Doz et al. / Journal of Econometrics 164 (2011) 188–205
Chamberlain and Rothschild (1983) and generalized to a dynamic framework by Forni et al. (2000), Forni and Lippi (2001) and Stock and Watson (2002a,b). As in many other papers in the literature, this paper studies the estimation of the common factors and consistency and rates for the size of the cross-section n and the sample size T going to infinity. The literature has extensively studied the particular case in which the factors are estimated by principal components (Bai, 2003; Bai and Ng, 2002; Forni et al., 2005, 2009; Stock and Watson, 2002a,b). In this paper, we study consistency properties of the two-step estimator considered by Giannone et al. (2004) and Giannone et al. (2008). We parametrize the dynamic of the factors as in Forni et al. (2009). In the first step, we estimate the parameters of the model by simple least squares by treating the principal components as if they were the true common factors. In the second step, the estimated parameters are used to project onto the observations. We consider three cases, each corresponding to an estimator under different forms of misspecification: factor dynamics, idiosyncratic heteroscedasticity and idiosyncratic dynamics (principal components); factor and idiosyncratic dynamics (reweighted principal components); idiosyncratic dynamics only (Kalman smoother). Each projection corresponds to a different two-step estimator whereby the first step involves the estimation of the parameters and the second step the application of the Kalman smoother. We prove consistency for such estimators and design a Monte Carlo exercise that allows us to study the behavior of our estimators in small samples. Beside allowing the treatment of unbalanced panels, the Kalman smoother may help achieving possible efficiency improvements. Moreover, ‘‘cleaning’’, through the second step, the estimate of the factors, allows a better reconstruction of the common shocks considered in the structural factor model by Giannone et al. (2004). Finally, such parametric approach allows us to easily evaluate uncertainty in the estimates of the factors as shown in both the papers just cited. Let us finally note that similar reasoning to that applied to this paper can be applied to use principal components to initialize the algorithm for maximum likelihood estimation. We study consistency of maximum likelihood estimator in a separate paper Doz et al. (2006a). The paper is organized as follows. Section two introduces models and assumptions. Section three analyzes the projections, for known parameters, and for the different misspecified model assumptions: we show that the extracted factors are root n consistent in each case. Section four contains the main propositions which show consistency and (n, T ) rates for the two-step estimators. Section five presents the Monte Carlo exercise and report results for an empirical application on nowcasting. Section six concludes. Proofs are gathered in the Appendix. 2. The models We consider the following model: Xt = Λ∗0 Ft + ξt where Xt = (x1t , . . . , xnt )′ is a (n × 1) stationary process. Λ∗0 = (λ∗0,ij ) is the n × r matrix of factor loadings. Ft = (f1t , . . . , frt )′ is a (r × 1) stationary process (common factors). ξt = (ξ1t , . . . , ξnt )′ is a (n × 1) stationary process (idiosyncratic component). (Ft ) and (ξt ) are two independent processes.
189
Note that Xt , Λ∗0 , ξt depend on n but, in this paper, we drop the subscript for the sake of simplicity. The general idea of the model is that the observable variables can be decomposed in two orthogonal unobserved processes: the common component driven by few common shocks which captures the bulk of the covariation between the time series and the idiosyncratic component which is driven by n shocks generating dynamics which is series specific or local. We have the following decomposition of the covariance matrix of the observables:
Σ0 = Λ∗0 Φ0∗ Λ∗′ 0 + Ψ0 where Ψ0 = E[ξt ξt′ ] and Φ0∗ = E[Ft Ft′ ]. It is well known that the factors are defined up to a pre-multiplication by an invertible matrix, so that it is possible to choose Φ0∗ = Ir : we will maintain this assumption throughout the paper. Even in this case, the factors are defined up to a pre-multiplication by an orthogonal matrix, a point that we make more precise below. We also have the following decomposition of the auto-covariance matrix of order h of the observables:
Σ0 (h) = Λ∗0 Φ0∗ (h)Λ∗′ 0 + Ψ0 (h) where Σ0 (h) = E[Xt Xt′−h ], Φ0∗ (h) = E[Ft Ft′−h ], and Ψ0 (h) = E[ξt ξt′−h ]. This decomposition extends the previous one, if we adopt the following notations:
Σ0 (0) = Σ0 ,
Φ0∗ (0) = Ir ,
and Ψ0 (0) = Ψ0 .
Remark 1. Bai (2003), Bai and Ng (2002) and Stock and Watson (2002a) consider also some form of non-stationarity. Here, we do not do it for simplicity. The main arguments used in what follows still hold under the assumption of weak time dependence of the common and the idiosyncratic component. More precisely, we make the following set of assumptions: (A1) For any n, (Xt ) is a stationary process with zero mean and finite second-order moments. (A2) The xit ’s have uniformly bounded variance: ∃M /∀(i, t )Vxit = σ0,ii ≤ M. (A3) – (Ft ) and (ξt ) are independent processes. ∑+∞ – (Ft ) admits a Wold representation: Ft = C0 (L)εt = k=0 ∑+∞ Ck εt −k such that: k=0 ‖Ck ‖ < +∞, and εt is stationary at order four. – For any∑n, (ξt ) admits a Wold representation: ξt = D0 (L) ∑+∞ +∞ vt = D v where k t − k k=0 k=0 ‖Dk ‖ < +∞ and vt is a strong white noise such that: ∃M /∀(n, i, t )Evit4 ≤ M. Note that (vt ) and D0 (L) are not nested matrices: when n increases because a new observation is added to Xt , a new observation is also added to ξt but the innovation process and the filter D0 (L) entirely change. A convenient way to parametrize the dynamics is to further assume that the common factors following a VAR process so that the following assumption is added to (A3) (see Forni et al., 2009, for a discussion): (A3′ ) The factors admit a VAR representation: A∗0 (L)Ft = ut where A∗0 (z ) ̸= 0 for |z | ≤ 1 and A∗0 (0) = Ir . 2 j=1 Eξit , and in the whole pa¯ per, A0 (L), Ψ0 , D0 (L), ψ0 denote the true values of the parameters.
¯0 = For any n, we denote by ψ
1 n
∑n
∗
Given the size of the cross-section n, the model is identified provided that the number of common factors (r) is small with respect to the size of the cross-section (n), and the idiosyncratic component is orthogonal at all leads and lags, i.e. D0 (L) is a diagonal matrix (exact factor model). This version of the model was
190
C. Doz et al. / Journal of Econometrics 164 (2011) 188–205
proposed by Engle and Watson (1981) and they estimated it by maximum likelihood.1 In what follows, we will not impose such restriction and work under the assumption of some form of weak correlation among idiosyncratic components (approximate factor model) as in the n large, new generation factor literature. There are different ways to impose identifying assumptions that restrict the cross-correlation of the idiosyncratic elements and preserve the commonality of the common component as n increases. We will assume that the Chamberlain and Rothschild (1983)’s conditions are satisfied and we will extend some of these conditions in order to fit the dynamic case. More precisely, denoting by λmin (A) and λmax (A) the smallest and the greatest eigenvalues of a matrix A, and
1/2
by ‖A‖ = λmax (A′ A) , we make the following assumptions. We suppose that the common component is pervasive, in the following sense:
∗ (CR1) lim infn→∞ 1n λmin (Λ∗′ 0 Λ0 ) > 0.
We also suppose, as in Forni et al. (2004), that all the eigenvalues ∗ of Λ∗′ 0 Λ0 diverge at the same rate, which is equivalent to the following further assumption: ∗ (CR2) lim supn→∞ 1n λmax (Λ∗′ 0 Λ0 ) is finite.
We suppose that the cross-sectional time autocorrelation of the idiosyncratic component can only have a limited amount: (CR3) lim supn→∞
∑
h∈Z
‖Ψ0 (h)‖ is finite.
We also make the two following technical assumptions: (CR4) infn λmin (Ψ0 ) = λ > 0. ∗ 2 (A4) Λ∗′ 0 Λ0 has distinct eigenvalues. It must be emphasized that: – assumption (CR3) extends the Chamberlain and Rothschild ¯ supn ‖Ψ0 ‖ < λ¯ and is (1983)’s following condition: ∃λ/ achieved as soon as the two∑following assumptions are made: ∃M /∀n‖E[vt vt′ ]‖ ≤ M and +∞ k=0 ‖Dk ‖ ≤ M – assumption (CR4) was made by Chamberlain and Rothschild (1983): it ensures that the idiosyncratic component does not tend to a degenerate random variable when n goes to infinity. Remark 2. These assumptions are slightly different than those introduced by Stock and Watson (2002a) and Bai and Ng (2002) but have a similar role. They have been generalized for the dynamic case by Forni et al. (2000) and Forni and Lippi (2001). As we said before, the common factors, and the factor loadings, are identified up to a normalization. In order to give a precise statement of the consistency results in our framework, we will use here a particular normalization. Let us define: – D0 as the diagonal matrix whose diagonal entries are the ∗ eigenvalues of Λ∗′ 0 Λ0 in decreasing order, – Q0 as the matrix of a set of unitary eigenvectors associated with D0 , – Λ0 = Λ∗0 Q0 , so that Λ′0 Λ0 = D0 and Λ0 Λ′0 = Λ∗0 Λ∗′ 0 , −1/2
– P0 = Λ0 D0 – Gt = Q0′ Ft .
so that P0′ P0 = Ir ,
With these new notations, the model can also be written as follows:
1 Identification conditions for the model for a fixed cross-sectional dimensions (n) are studied in Geweke and Singleton (1980). 2 This assumption is usual in this framework, and is made to avoid useless mathematical complications. However, in the case of multiple eigenvalues, the results would remain unchanged.
Xt = Λ0 Gt + ξt .
(2.1)
We then have: E[Gt Gt ] = Ir , and E[Gt Gt −h ] = Φ0 (h) = Q0′ Φ0∗ (h)Q0 for any h. It then follows that: ′
′
′ Σ0 = Λ∗0 Λ∗′ 0 + Ψ0 = Λ0 Λ0 + Ψ0 ′ and that, for any h: Σ0 (h) = Λ∗0 Φ0∗ (h)Λ∗′ 0 + Ψ0 (h) = Λ0 Φ0 (h)Λ0 + Ψ0 (h). Note that, in the initial representation of the model, the matrices Λ∗0 are supposed to be nested (when an observation is added to Xt , a line is added to the matrix Λ∗0 ), and that it is not the case for the Λ0 matrices. However, as Q0 is a rotation matrix, Gt and Ft have the same range, likewise Λ0 and Λ∗0 have the same range.3 In addition, assumptions (A1)–(A4) and (CR1)–(CR4) are satisfied if we replace Λ∗0 with Λ0 , and Ft with Gt . If also assumption (A3′ ) holds, then Gt also has a VAR representation. Indeed, as Q0 Gt = Ft , we have: A∗0 (L)Gt = ut , and Q0′ A∗0 (L)Gt = Q0′ ut . We then can write
A0 (L)Gt = wt , with A0 (L) = Q0′ A∗0 (L)Q0 , wt = Q0′ ut , A0 (z ) ̸= 0 for |z | ≤ 1, and A0 (0) = Ir . Throughout the paper, we concentrate on consistent estimation of Gt rather than Ft , which means that we make explicit which rotation of the factors we are estimating. 3. Approximating projections and population results The true model underlying the data can be defined as Ω = {Λ∗ , A∗ (L), D(L)} or equivalently as Ω = {Λ, A(L), D(L)}. If this true model were known, the best approximation of Gt as a linear function of the observables X1 , . . . , XT would be Gt |T = ProjΩ [Gt |Xs , s ≤ T ]. If the model is Gaussian, i.e. if ut and vt are normally distributed, then ProjΩ [Gt |Xs , s ≤ T ] = EΩ [Gt |Xs , s ≤ T ]. Moreover, if the projection is taken under the true parameter values, Ω0 = {Λ0 , A0 (L), D0 (L)}, then we have optimality in mean square sense. In what follows, we propose to compute other projections of Gt , which are associated with models which are misspecified as well, but which are likely to be closer to the real model underlying the data. We show that, although not optimal, these projections also give consistent approximations of Gt , under our set of assumptions. The simplest projection is obtained under the triple Ω0R1 =
Λ 0 , Ir ,
ψ¯ 0 In , that is under an approximating model according
to which the common factors are white noise with covariance Ir and the idiosyncratic components are cross-sectionally indepen¯ 0 . We have dent homoscedastic white noises with variance ψ
−1
ProjΩ R1 [Gt |Xs , s ≤ T ] = EΩ R1 [Gt Xt′ ] EΩ R1 [Xt Xt′ ] 0
0
0
= Λ′0 Λ0 Λ′0 + ψ¯ 0 In
−1
Xt
Xt .
Simple calculations show that, when Ψ0R is an invertible matrix of order n:
Λ0 Λ′0 + Ψ0R
−1
−1 ′ −1 −1 −1 −1 = Ψ0R − Ψ0R Λ0 Λ′0 Ψ0R Λ0 + Ir Λ0 Ψ0R .
3 It is worth noting that Q is uniquely defined up to a sign change of its columns 0 and that Gt is uniquely defined up to a sign change of its components (this will ∗ be used below). Indeed, as Λ∗′ 0 Λ0 is supposed to have distinct eigenvalues, Q0 is uniquely defined up to a sign change of its columns. Then, if ∆ is a diagonal matrix whose diagonal terms are ±1, and if Q0 is replaced by Q0 ∆, Λ0 is replaced by Λ0 ∆ and Gt is replaced by 1Gt .
C. Doz et al. / Journal of Econometrics 164 (2011) 188–205
¯ 0 In , the previous expression Applying this formula with Ψ0R = ψ can then be written as follows: ¯ 0−1 Λ0 + Ir ProjΩ R1 [Gt |Xs , s ≤ T ] = Λ′0 ψ
−1
0
Λ′0 ψ¯ 0−1 Xt
−1 ′ Λ0 Xt = Λ′0 Λ0 + ψ¯ 0 Ir which is, by assumption (CR1), asymptotically equivalent to the OLS regression of Xt on the factor loadings Λ0 . It is clear that, under conditions (CR1) and (CR3), such simple OLS regression provides a consistent estimate of the unobserved common factors as the cross-section becomes large.4 In particular, m.s.
ProjΩ R1 [Gt |Xs , s ≤ T ] −→ Gt 0
as n → ∞.
Indeed, given the factor model representation, and the definition of Λ0 , we have
−1 ′ −1 ′ ′ Λ0 Λ0 Gt Λ0 Xt = Λ′0 Λ0 + ψ¯ 0 Ir Λ0 Λ0 + ψ¯ 0 Ir ′ −1 ′ + Λ0 Λ0 + ψ¯ 0 Ir Λ0 ξ t . Under (CR1), the first term converges to the unobserved com-
−1
¯ 0 Ir Λ′0 Λ0 → Ir , as n → ∞. mon factors Gt , since Λ′0 Λ0 + ψ The last term converges to zero in mean square since, by assumptions (CR1)–(CR3):
−1 ′ ′ ′ −1 Λ′0 Λ0 + ψ¯ 0 Ir Λ0 ξt ξt Λ0 Λ0 Λ0 + ψ¯ 0 Ir −1 ′ −1 Λ0 Λ0 Λ′0 Λ0 + ψ¯ 0 Ir →0 ≤ λmax (Ψ0 ) Λ′0 Λ0 + ψ¯ 0 Ir as n → ∞.
EΩ0
If we denote Gt /T ,R1 = ProjΩ R1 [Gt |Xs , s ≤ T ], we then have 0
Gt /T ,R1 − Gt = OP
1
√
n
as n → ∞.
This simple estimator is the most efficient one if the true model is Ω0 = Ω0R1 : this is the model which is implicitly assumed in the probabilistic principal components framework, i.e. a static model with i.i.d. idiosyncratic terms. However, if there are dynamics in the common factors (A0 (L) ̸= Ir ) and if the idiosyncratic
¯ 0 In ), components have dynamics or are not spherical (D0 (L) ̸= ψ this approach still gives a consistent estimate of the unobserved common factors, as n → ∞. If the size of the idiosyncratic component is not the same across series, another estimator can be obtained by exploiting such heterogeneity and giving less weight to series with larger idiosyncratic component. Denoting Ψ0d = diag(ψ0,11 , . . . , ψ0,nn ), this can be done by running the projection under the triple Ω0R2
=
1/2 Λ0 Ir Ψ0d
, ,
.
Using the same formula as we used in the previous case, with
Ψ0R = Ψ0d instead of Ψ0 = ψ¯ 0 In , the following estimated factors are as follows:
−1 ProjΩ R2 [Gt |Xs , s ≤ T ] = Λ′0 Λ0 Λ′0 + Ψ0d Xt 0 ′ −1 −1 ′ −1 = Λ0 Ψ0d Λ0 + Ir Λ0 Ψ0d Xt . This estimator is used in the traditional (exact) factor analysis framework for static data, where it is assumed that Ω0R2 is the true model underlying the data. It is obtained as the previous one, up to the fact that Xt and, of course, Λ0 have been weighted, with weight given by ψ0,11 , . . . , ψ0,nn . If the true model is Ω0R2 , this
estimator will be more efficient than the previous one, for a given n. On the other hand, if Ω0R2 is not the true model, it is straightforward to obtain the same consistency result as in the previous case, under assumptions (CR1)–(CR4). If Gt /t ,R2 := ProjΩ R2 [Gt |Xs , s ≤ T ], then 0
m.s.
Gt /T ,R2 −→ Gt
and Gt /T ,R2 − Gt = OP
1
√
n
as n → ∞.
Note that in traditional factor models, where n is considered fixed, the factors are indeterminate and can only be approximated with an approximation error that depends inversely on the signal-tonoise variance ratio. The n large analysis shows that under suitable conditions, the approximation error goes to zero for n large. For a given n, further efficiency improvements could be obtained by non-diagonal weighting scheme, i.e. by running the projection under the triple
1/2
Λ 0 , I r , Ψ0
. This might be em-
pirically relevant since, although limited asymptotically by assumption (CR3), the idiosyncratic cross-sectional correlation may affect results in finite sample. We will not consider such projections since non-diagonal weighting schemes raise identifiability problems in finite samples and would practically require the estimation of too many parameters. Indeed, there is no satisfactory way to fully parametrize parsimoniously the DGP of the idiosyncratic component since in most applications the cross-sectional items have no natural order. On the other hand, the estimators considered above do not take into consideration the dynamics of the factors and the idiosyncratic component. For this reason, the factors are extracted by projecting only on contemporaneous observations. Since the model can be written in a state space form, we propose to compute projections under more general dynamic structures using Kalman smoothing techniques. Two particular cases in which the Kalman smoother can be used to exploit the dynamics of the common factors are
Ω0R3 = Λ0 , A0 (L), ψ¯ 0 In 1/2 Ω0R4 = Λ0 , A0 (L), Ψ0d . In both cases, the state-space form of the model under assumption (A3′ ) is Gt Gt −1 0 .
Xt = Λ0
0
···
..
+ ξt
Gt −p+1 Gt Gt −1
A01 Ir
. . .
= . . .
Gt −p+1
0
A02 0
··· ···
0
···
.. .
A0p Gt −1 0 Gt −2
.. .. . .
Ir
Gt −p
Ir
0 + . w t . . . 0
In the measurement equation, the covariance matrix of ξt is sup¯ 0 In in Ω0R3 framework, whereas it is supposed to be equal to ψ 1/2
posed to be equal to Ψ0d in Ω0R4 framework. Under such parametrizations, the computational complexity of the Kalman smoothing techniques depends mainly on the dimension of the transition equation which, under the parametrizations above, is independent of n and depends only on the number of the common factors. In both frameworks, the Kalman smoother computes Gt /T ,R := ProjΩ R [Gt |Xs , s ≤ T ], with R = R3 or R4. We want to show that 0
4 Note that here the term consistency could be misleading since we suppose that the parameters of the model are known. We will consider the case of joint estimation of parameters and factors in the next section.
191
this gives a consistent estimate of Gt even if Ω0R is misspecified, due to the fact that the true matrix Ψ0 is a non-diagonal matrix and the idiosyncratic components are autocorrelated. In both cases,
192
C. Doz et al. / Journal of Econometrics 164 (2011) 188–205
for given values of the parameters, the smoother is computed iteratively, for each value of t. However, in order to prove our consistency result, we will not use these recursive formulas but directly use the general form of Gt /T ,R . In order to do this, we introduce the following notations:
′
′
′
– XT = X1′ , . . . , XT′ , GT = G′1 , . . . , G′T , ZT = ξ1′ , . . . ξT′ , – E denotes the expectation of a random variable, under the true model Ω0 , – EΩ R denotes the expectation of a random variable, when Ω0R is
0
the model which is considered, – when (Yt ) is a stationary process: ΓY (h) = E (Yt Yt′−h ) and ΓY ,R (h) = EΩ R (Yt Yt′−h ), 0
′
– when (Yt ) is a stationary process and YT = Y1′ , . . . , YT′ , we denote
ΣY = E (YT Y′T ) and ΣY ,R = EΩ R (YT Y′T ), 0
– U′t is the (r × rT ) matrix defined by U′t = (0, . . . , Ir , 0 . . . 0). With these notations: XT = (IT ⊗ Λ0 )GT + ZT , and : Gt /T ,R = EΩ R (Gt X′T )(EΩ R (XT X′T ))−1 XT 0 0
= EΩ R (Gt X′T )ΣX−,1R XT . 0
Note that, when R = R3 or R4 the DGP of (Gt ) is supposed to be correctly specified, so that ΣG = ΣG,R . On the contrary, ΣZ ,R , is not equal to ΣZ , and we have
ΣZ ,R = IT ⊗ Ψ0,R ¯ 0 In and Ψ0,R4 = Ψ0d = diag(ψ0,11 , . . . , ψ0,nn ). with Ψ0,R3 = ψ Our consistency result is based on the following lemma: Lemma 1. Under assumptions (A1)–(A4), (A3′ ), (CR1)–(CR4), the following properties hold for R = R3, or R4: (i) Gt /T ,R = U′t ΣG,R (IT ⊗ Λ′0 )ΣX−,1R XT . (ii) ΣG,R = ΣG , ‖ΣG ‖ = O(1) and ‖ΣG−1 ‖ = O(1). (iii) ‖ΣZ ‖ = O(1), ΣZ ,R = O(1) and ‖ΣZ−,R1 ‖ = O(1). Proof. See Appendix A.1.
It is worth noting that the last result of this lemma comes from assumption (CR3), which states the limitation of the crosssectional autocorrelation. This assumption is crucial to obtain the following consistency result: Proposition 1. Under assumptions (A1)–(A4), (A3′ ), and (CR1)– (CR4), if Gt /T ,R = ProjΩ R [Gt |Xs , s ≤ T ] with R = R3, or R4, then: 0
m.s.
Gt /T ,R −→ Gt
and
Gt /T ,R − Gt = OP
Proof. See Appendix A.1.
1
√
n
as n → ∞.
In summary, the factors can be consistently estimated, as n become larger, by simple static projection of the observable on the factor loadings. However, it is also possible to exploit the cross-sectional heteroscedasticity of the idiosyncratic components through weighted regressions (parametrization Ω0R2 ), and the dynamics of the factors, through the Kalman smoother (parametrizations Ω0R3 and Ω0R4 ). This property may be particularly useful in the case of unbalanced panel of data. Individual idiosyncratic dynamics could also be taken into account when performing the projections. This would require to specify an autoregressive model for the idiosyncratic components or a reparametrization of the model as in Quah and Sargent (2004), to capture idiosyncratic dynamics by including lagged observable variable.
4. A two-step estimation procedure The discussion in the previous section assumed that the parameters were known and focused on the extraction of the factors. In this section, we propose a two-step procedure in order to estimate the factors when the parameters of the model are unknown. In the first step, preliminary estimators of the factors, and estimators of the parameters of the model, are computed from a principal component analysis (PCA). In the second step, we take the heteroscedasticity of the idiosyncratic components and/or the dynamics of the common factors into account, along the same lines as what we did in the previous section. The true values of the parameters are now replaced by their PCA estimates, and the dynamics of the factors are estimated from the associated preliminary estimates of the factors. As we said in the previous section, the estimation of the full model is not feasible since it is not possible to fully parametrize parsimoniously the DGP of the idiosyncratic component. However, we have seen that, if the factor loadings were known, the factors could be consistently estimated by a Kalman smoother, even if the projections were not computed under the correct specification. We show below that robustness with respect to misspecification still holds if the parameters are estimated by PCA. More precisely, our procedure can be defined as follows. For each of the approximating model Ω Ri , i = 1–4, that we have defined in the previous section, we replace the true parameters by estimated values in the following way. In the four cases, the ˆ obtained factor loadings matrix Λ0 is replaced by the matrix Λ by PCA, and the variance–covariance matrix which is specified for the idiosyncratic component is also directly obtained from the PCA estimates. For cases Ω Ri , i = 3 and 4, where the factors are supposed to follow a VAR model, we use the preliminary estimates ˆ t of the factors obtained by PCA and estimate the VAR coefficients G ˆ t on its own past, following Forni et al. (2009). by OLS regression of G We thus define, for each of the approximating model Ω Ri , i = 1–4, ˆ Ri , i = 1–4. In each case, an associated setup, which we denote Ω we compute a new estimation of the factor Gt , which we denote ˆ Ri [Gt |Xs , s ≤ T ]. ˆ t /T ,Ri and which is equal to Proj G Ω
ˆ t /T ,Ri is a consistent estimator of Gt , In order to prove that G we proceed in three steps. In the first step, we show that under our set of assumptions, principal components give consistent estimators of the span of the common factors, and of associated factors loadings, when both the cross-section and the sample size go to infinity. This result has been shown by Forni et al. (2009). Similar results, under alternative assumptions, have been derived Bai (2003), Bai and Ng (2002) and Stock and Watson (2002a). However, we give our own proof of these results in Appendix A.2, because we need intermediate results in order to prove the other ˆ t /T ,R1 and Gˆ t /T ,R2 propositions of this section. The consistency of G directly follow from the consistency of PCA. We then show that the estimates we propose for the dynamics of the factors are also consistent estimates when both the cross-section and the sample ˆ t /T ,Ri , i = size go to infinity. Finally, we derive the consistency of G 3 and 4. Let us then first study PCA estimates and the consistency of ∑ ˆ t /T ,R1 and Gˆ t /T ,R2 . If we denote by S = 1 Tt=1 Xt Xt′ the empirical G T variance–covariance matrix of the data, by dˆ j the jth eigenvalue of S, in decreasing order of magnitude,5 by pˆj the relative unitary
ˆ the (r × r ) diagonal matrix eigenvector, and if we denote by D
5 It is always assumed that those eigenvalues are all distinct, in order to avoid useless mathematical complications. Under assumption (A6), this will be asymptotically true, due to the fact that S converges to Σ0 .
C. Doz et al. / Journal of Econometrics 164 (2011) 188–205
with diagonal elements dˆ j , j = 1 . . . r, and Pˆ := pˆ 1 , . . . , pˆ r , the associated PCA estimates are given by
ˆ t = Dˆ −1/2 Pˆ ′ Xt G ˆ = Pˆ Dˆ 1/2 . Λ The consistency results are the following: Proposition 2. If assumptions (CR1)–(CR4), (A1)–(A4) and (A3′ ) hold, then Λ0 can be defined6 so that the following properties hold:
ˆ t − Gt = OP (i) G
√1 n
+ OP
ˆ ij − λ0,ij = OP (ii) For any i, j: λ
, as n, T → ∞. √1 √1 . + O P n T
√1 T
ˆΛ ˆ ′ then, for any (i, j): ψˆ ij − ψ0,ij = OP (iii) If Ψˆ = S − Λ OP
√1 T
√1
n
+
.
Proof. See Appendix A.2.
These consistency results can be interpreted as follows: ‘‘the bias arising from this misspecification of the data generating process of the idiosyncratic component and the dynamic properties of the factors is negligible if the cross-sectional dimension is large enough, under the usual set of assumptions’’. If we denote
1 1 1 ˆΛ ˆ ′) = ˆ ′Λ ˆ) ψˆ¯ = trΨˆ = tr(S − Λ trS − tr(Λ n n n 1 ˆ = trS − trD n
ˆΛ ˆ ′ ) = diag(ψˆ 11 , . . . , ψˆ nn ) Ψˆ d = diagΨˆ = diag(S − Λ ˆ R1 = Λ ˆ , Ir , ψˆ¯ In and Ω ˆ R2 = Λ ˆ , Ir , Ψˆ d . we can define Ω We then obtain
ˆ t /T ,R1 G
ˆ t /T ,R2 G
′ˆ ˆ¯ I −1 Λ ˆ R1 [Gt |Xs , s ≤ T ] = Λ ˆ ˆ ′ Xt Λ + = Proj ψ r Ω −1 ˆ 1/2 Pˆ ′ Xt = Dˆ + ψˆ¯ Ir D −1 ˆ R2 [Gt |Xs , s ≤ T ] = Λ ˆ ′ Ψˆ d−1 Xt ˆ ′ Ψˆ d−1 Λ ˆ + Ir Λ = Proj Ω
and the consistency of these two approximations of Gt directly follows from the consistency of PCA. We get Corollary 1. Under the same assumptions as in Proposition 2:
ˆ t /T ,R1 − Gt = OP (i) G
√1
ˆ t /T ,R2 − Gt = OP (ii) G
√1 n
n
Proof. See Appendix A.2.
+ OP
√1
+ OP
√1 T
, as n, T → ∞.
, as n, T → ∞.
T
ˆ t /T ,R1 = (Dˆ + Two remarks are in order. First, we see that G ˆ −1 ˆ ˆ ˆ ˆ ¯ ψ Ir ) DGt so that Gt /T ,R1 and Gt are equal up to a scale coeffi-
193
maximum likelihood estimator in a situation in which the probability model is not correctly specified: the true model satisfies conditions (CR1)–(CR4), is dynamic and approximate, while the approximating model is restricted to be static and the idiosyncratic component to be spherical. This is what White (1982) named as quasi-maximum likelihood estimator. This remark opens the way to the study of QML estimators in less restricted frameworks than Ω R1 , like for instance Ω R4 : such an estimator is studied in Doz et al. (forthcoming).
−1
ˆ t /T ,R2 = Λ ˆ ′ Ψˆ d−1 Λ ˆ + Ir ˆ ′ Ψˆ d−1 Xt is asymptotiSecond, G Λ cally equivalent to principal components on weighted observations, where the weights are the inverse of the standard deviation of the estimated idiosyncratic components. This estimator has been considered in Forni and Reichlin (2001), Boivin and Ng (2006) and Forni et al. (2005). Let us now turn to the Ω Ri , i = 3 and 4 frameworks, where the dynamics of the factors are taken into account in the second step of the procedure. As suggested by Forni et al. (2005), the VAR ˆ t , on its coefficients A0 (L) can be estimated by OLS regression of G own past. More precisely, the following OLS regression: ˆ t = Aˆ 1 Gˆ t −1 + · · · + Aˆ p Gˆ t −p + w G ˆt gives consistent estimates of the A0,k matrices. Proposition 3. Under the same assumptions as in Proposition 2, the following properties hold: (i) If Γˆ Gˆ (h) denotes the sample autocovariance of order h of the ∑T 1 ˆ ˆ′ estimated principal components: Γˆ Gˆ (h) = T − t =h+1 Gt Gt −h , h then for any h:
Γˆ Gˆ (h) − Φ0 (h) = OP
1 n
+ OP
1
√
T
and the result is uniform in h, h ≤ p. (ii) For any s = 0, . . . , p: Aˆ s − A0,s = OP Proof. See Appendix A.3.
1 n
+ OP
√1 T
.
If we denote by Aˆ (L) the associate estimates of A0 (L), we are then able to define:
R3 ˆ ˆ ˆ Ω = Λ, A(L), ψˆ¯ In ˆ , Aˆ (L), (diag(ψˆ 11 , . . . , ψˆ nn ))1/2 ˆ R4 = Λ Ω and to compute two new estimates of the factors:
ˆ Ri [Gt |Xs , s ≤ T ], ˆ t /T ,Ri = Proj G Ω
i = 3 and 4.
These two estimates are obtained with one run of the Kalman smoother and they take the estimated dynamics of the common factors into account:
cient on each component, and are asymptotically equal when n goes to infinity. This can be linked to a well-known result since principal components are known to be equal, up to a scale coefficient, to the maximum likelihood estimates of the parameters in the Ω R1 framework, under a Gaussian assumption.7 Hence, principal components can be seen as an asymptotic equivalent of the
ˆ t /T ,R3 is obtained without reweighting the data: it exploits the – G common factor dynamics but does not take the non-sphericity of the idiosyncratic component into account. ˆ t /T ,R4 exploits the dynamics of the common factors and the – G non-sphericity of the idiosyncratic component: it has been proposed by Giannone et al. (2008) and applied by Giannone et al. (2004).
6 As Λ is defined up to a sign change of its columns, and G is defined up to the 0 t sign of its components, the consistency result holds up to a given value of these signs. 7 See e.g. Doz et al. (forthcoming) for the calculation of the ML estimator in the
Consistency of these two new estimates of the common factors, follows from the consistency of the associated population estimates (Proposition 1), the consistency of PCA (Proposition 2), and the consistency of the autoregressive parameters estimates ˆ R3 and Ω ˆ R4 frame(Proposition 3). The proofs are identical in the Ω R works. We then denote by Ω0 the model under consideration, and
exact static factor model framework.
194
C. Doz et al. / Journal of Econometrics 164 (2011) 188–205
ˆ R the associated set of parameters, and we denote Gˆ t /T ,R = by Ω ProjΩˆ R [Gt |Xs , s ≤ T ] the associated estimation of the common factor. Note that, like in the previous section, our consistency proof will not rely on the Kalman smoother iterative formulas, but on the ˆ t /T ,R . As we have seen before that Gt /T ,R = direct computation of G −1 ′ ′ Ut ΣG,R (IT ⊗ Λ0 )ΣX ,R XT , we now have ˆ t /T ,R = U′t Σ ˆ G,R (IT ⊗ Λ ˆ ′ )Σ ˆ X−,1R XT G ˆ X ,R = (IT ⊗ Λ ˆ )Σ ˆ G,R (IT ⊗ Λ ˆ )+(IT ⊗ Ψˆ R ) and Σ ˆ G,R is obtained where Σ from the estimated VAR coefficients. In particular, we have the following property: Proposition 4. Under the same assumptions as in Proposition 2, the following properties hold:
ˆ G,R − ΣG,R ‖ = OP (i) ‖Σ
1 n
+ OP
√1 T
.
ˆ G,R ‖ = OP (1), ‖Σ ˆ G−,R1 ‖ = OP (1) and ‖Σ ˆ G−,R1 − ΣG−,R1 ‖ = (ii) ‖Σ OP
1 n
+ OP
√1 T
ˆ R [Gt |Xs , s ≤ T ] with R = ˆ t /T ,R = Proj Proposition 5. Denote G Ω T R3, and R4. If limsup n3 = O(1), the following result holds under assumptions (CR1)–(CR4), (A1)–(A4) and (A3′ ): 1
√
n
5.1. Monte Carlo The model from which we simulate is standard in the literature. A similar model has been used, for example, in Stock and Watson (2002a). Let us define it below (in what follows, in order to have simpler notations, we drop the zero subscript for the true value of the parameters which we had previously used to study the consistency of the estimates).
∑r
We can then obtain our consistency result:
We now propose two exercises aimed at assessing the performances of our estimator. In the first subsection, we present results from a simulation study under different hypothesis on the data generating process and under the assumption of unbalanced panel which is the typical realistic situation faced by the nowcaster. In the second, we report results on euro area data, from a study conducted in the context of a project of the Euro system of central banks (Rünstler et al., 2009).
∗ – xit = j=1 λij fjt + ξit , i = 1, . . . , n, in vector notation Xt = Λ∗ Ft + ξt . – λ∗ij i.i.d. N (0, 1), i = 1, . . . , n; j = 1, . . . , r.
.
Proof. See Appendix A.3.
ˆ t /T ,R − Gt = OP G
5. Empirics
+ OP
Proof. See Appendix A.3.
1
√
T
as n, T → ∞.
The procedure outlined above can be summarized in the following way: first, we estimate the parameters and the factors through principal components; second we estimate the dynamics of the factors from these preliminary estimates of the factors; finally, we re-estimate the common factors according to the selected approximating model. What if we iterate such procedure? From the new estimated factors, we can estimate a new set of parameters which in turn can then be used to re-estimate the common factors and so on. If, at each iteration the leastsquares estimates of the parameters are computed using expected sufficient statistics, then such iterative procedure is nothing that the EM algorithm by Dempster et al. (1977) and introduced in small-scale dynamic factor models by Engle and Watson (1981). Quah and Sargent (2004) used such algorithm for large crosssections, but their approach was disregarded in the subsequent literature. The algorithm is very powerful since at each step the likelihood increases, and hence, under regularity conditions, it converges to the maximum likelihood solution. For details about the estimation with state space models see Engle and Watson (1981) and Quah and Sargent (2004). The algorithm is feasible for large cross-sections for two reasons. First, as stressed above, its complexity is mainly due to the number of factors, which in our framework is independent of the size of the cross-section and typically very small. Second, since the algorithm is initialized with consistent estimates (principal component), the number of iterations required for convergence is expected to be limited, in particular, when the cross-section is large. The asymptotic properties of quasi-maximum likelihood estimates for large crosssection and under an approximate factor structure is developed in Doz et al. (2006a).
– A(L)Ft =ut , with ut i.i.d. N (0, (1 − ρ 2 )Ir ); i, j = 1, . . . , r aij (L) =
1 − ρL 0
if i = j if i ̸= j.
– D(L)ξt =vt with vt i.i.d. N (0, T ) dij (L) =
βi 1−βi
(1 − dL) 0
if i = j ; if i ̸= j
i, j = 1, . . . , n
αi = j=1 λij with βi i.i.d. U([u, 1 − u]). √ Tij = αi αj τ |i−j| (1 − d2 ), i, j = 1, . . . , n. ∑r
∗2
Note that we allow for instantaneous cross-correlation between the idiosyncratic elements. Since T is a Toeplitz matrix, the cross-correlation among idiosyncratic elements is limited and it is easily seen that assumption (A) (ii) is satisfied. The coefficient τ controls for the amount of cross-correlation. The exact factor model corresponds to τ = 0. The coefficient βi is the ratio between the variance of the idiosyncratic component ξit and the variance of xit . It is also known as the noise-to-signal ratio. In our simulation this ratio is uniformly distributed with an average of 50%. If u = 0.5, then the standardized observations have cross-sectionally homoscedastic idiosyncratic components. Note that if τ = 0, d = 0, our approximating model is well specified (with the usual notational convention that 00 = 1) and hence the approximating model R4 is well specified. If τ = 0, d = 0, ρ = 0, we have a static exact factor model with heteroscedastic idiosyncratic component and model R2 is correctly specified while principal components are not the most efficient estimator for finite n. Finally, if τ = 0, d = 0, u = 1/2, we have a spherical, static factor model on standardized variables, situation in which the approximating model R1 is correctly specified and principal components on standardized variables provide the most efficient, maximum likelihood, estimates. We generate the model for different sizes of the cross-section, n = 10, 25, 50, 100, and for sample size T = 50, 100. We perform 2500 Monte Carlo repetitions. We draw 50 times the parameters βi , i = 1, . . . , n, and λ∗ij , i = 1, . . . , n; j = 1, . . . , r. Then, for each draw of the parameters, we generate 50 times the shocks ut and ξt . As stressed in the introduction, an advantage of having a parametrized model is that it is possible to extract the common factors from panel with missing data at the end of the sample due to the unsynchronous data releases (see Giannone et al., 2004, 2008, for an application to real-time nowcasting and forecasting output and
C. Doz et al. / Journal of Econometrics 164 (2011) 188–205
inflation). To study the performance of our models, for each sample size T and cross-sectional dimension n, we generate the data under the following pattern of data availability, n xit available for t = 1, . . . , T − j if i ≤ (j + 1) 5 that is all the variables are observed for t = 1, . . . , T − 4, we name this a balanced panel; 80% of the data are available at time T − 3; 60% are available at time T − 2; 40% are available at time T − 1; 20% are available at time T . ˆ , Aˆ (L) and ψˆ ii , i = At each repetition, the parameters Λ 1, . . . , n are estimated on the balanced part of the panel, xit , i = 1, . . . , n, t = 1, . . . , T − 4. Data are standardized so as to have mean zero and variance equal to one. Such standardization is typically applied in empirical analysis since principal components are not scale invariant. We consider the factor extraction under the approximating models studied in the previous section and summarized below.
R1 ˆ ˆ Ω = Λ, Ir , ψˆ¯ In . ˆ , Ir , (diag(ψˆ 11 , . . . , ψˆ nn ))1/2 . ˆ R2 = Λ Ω ˆ R3 = Λ ˆ , Aˆ (L), ψˆ¯ In . Ω ˆ R4 = Λ ˆ , Aˆ (L), (diag(ψˆ 11 , . . . , ψˆ nn ))1/2 . Ω We compute the estimates by applying the Kalman smoother ˆ R [Gt |Xs , s ≤ T ], ˆ t /T ,R = Proj using the estimated parameters: G Ω for R = R1–R4. The pattern of data availability can be taken into account when estimating the common factors, by modifying the idiosyncratic variance when performing the projections:
• if xit is available, then Eξit2 is set equal to ψˆ¯ for the projections ˆ ii for the projections R2, and R4 R1, R3 and to ψ • if xit is not available, then Eξit2 is set equal to +∞. The estimates of the common factor can hence be computed running the Kalman smoother with time-varying parameters (see Giannone et al., 2004, 2008). We measure the performance of the different estimators as follows:
′ ∆t ,R = Trace Ft − Qˆ R′ Gˆ t /T ,R Ft − Qˆ R′ Gˆ t /T ,R where Qˆ R is the OLS coefficient from the regression of Ft on ˆ t /T ,R estimated using observations up to time T − 4, that is: G Qˆ R =
∑T −4 t =1
ˆ ′t /T ,R Ft G
∑ T −4 t =1
ˆ t /T ,R Gˆ ′t /T ,R G
−1
. This OLS regression is
performed since the common factors are identified only up to a ˆ t / T ,R rotation. Indeed, we know from the previous sections that G is a consistent estimator of Gt = Q ′ Ft , where Q is a rotation matrix such that Q ′ Λ∗′ ΛQ is diagonal, with diagonal terms in decreasing order. Thus, it can be easily checked that, as E Ft Ft′ = Ir , Qˆ R is a consistent estimator of
plim
T −4 1−
T t =1
= plim
T −4 1−
T t =1
= plim
ˆ ′t /T ,R Ft G
T −4 1−
T t =1
T −4 1−
T t =1
Ft G′t
T −4 1−
T t =1
′
Ft Ft
−1 ˆ t /T ,R Gˆ ′t /T ,R G
QQ
′
−1 Gt G′t
T −4 1−
T t =1
−1 ′
Gt Gt
Q
=Q ˆ t /T ,R is a consistent estimator of Ft . so that Qˆ R′ G
195
We compute the distance for each repetition and then compute ¯ t ,R ). the averages (∆ Table 1 summarizes the results of the Monte Carlo experiment for one common factors r = 1 and the following specification: ρ = 0.9, d = 0.5, τ = 0.5, u = 0.1. We report the following measures of performance for the last five observations to analyze how data availability affects the estimates. The Kalman filter with cross-sectional heteroscedas¯ T −j,R4 . The ticity R4 is used as a benchmark and we report ∆ smaller the measure, the more accurate are the estimates of the ¯ T −j,R4 /∆ ¯ T −j,R1 , ∆ ¯ T −j,R4 / common factors. In addition, we report ∆ ¯ ¯ ¯ ∆T −j,R2 , ∆T −j,R4 /∆T −j,R3 . A number smaller then 1 indicates that the projection under R4 is more accurate. Results show five main features:
¯ T −j,R4 decreases as n and T increase, that is the 1. For any j fixed, ∆ precision of the estimated common factors increases with the size of the cross-section n and the sample size T . ¯ T −j,R4 increases as j decreases, 2. For any combination of n and T , ∆ reflecting the fact that the more numerous are the available data, the higher the precision of the common factor estimates. ¯ T −j,R4 < ∆ ¯ T −j,R3 < ∆ ¯ T −j,R2 < ∆ ¯ T −j,R1 , for all n, T , j. 3. ∆ This result indicates that the less miss-specified is the model used for the projection, the more accurate are the estimated factors. This suggests that taking into account cross-sectional heteroscedasticity and the dynamic of the common factors helps extracting the common factor. ¯ T −j,R4 /∆ ¯ T −j,R (for R = R1–R3) 4. For any combination of n and T , ∆ decreases as j decreases. That is, the efficiency improvement is more relevant when it is harder to extract the factors (i.e. the less numerous are the available data). ¯ T −j,R4 /∆ ¯ T −j,R tends to one, for all j and for 5. As n, T increase ∆ R = R1–R3; that is the performance of the different estimators tends to become very similar. This reflects the fact that all the estimates are consistent for large cross-sections. Summarizing, the two-steps estimator of approximate factor models works well in finite sample. Because it models explicitly dynamics and cross-sectional heteroscedasticity, it dominates principal components. It is particularly relevant when the factor extraction is difficult, that is, when the available data are less numerous. 5.2. Empirics Here, we report the results from an out-of-sample evaluation of the model for the euro area from the study by Rünstler et al. (2009). Several studies have been evaluating our methodology empirically (Angelini et al., 2011; Barhoumi et al., 2010; D’Agostino et al., 2008; Matheson, 2010; Aastveit and Trovik, 2008; Siliverstovs and Kholodilin, 2010). We have chosen this particular study for two reasons. First, it reports results from a project involving many institutions, all users of short-term forecasting tools. Second, the study considers several models and, for each of them, several specifications and reports results, for each of the models, for the preferred specification. Below we report the subset of results for the euro area as a whole. More results and details on the implementation can be found in the paper. The exercise by Rünstler et al. (2009) aims at forecasting economic activity in the euro area by bridging quarterly GDP growth with common factors extracted from 85 monthly series. The evaluation sample is from 2000 Q1 to 2005 Q4. The exercise is a so-called pseudo-real-time evaluation where the model is estimated recursively and the publication lags in the individual monthly series are taken into account by considering a sequence of forecasts which replicate the flow of monthly information that arrives within a quarter. Precisely, the authors consider a sequence
196
C. Doz et al. / Journal of Econometrics 164 (2011) 188–205
Table 1 Monte Carlo evaluation. j
T=50
T=100
n=5
n = 10
n = 25
n = 50
n = 100
n=5
n = 10
n = 25
n = 50
n = 100
0.28 0.28 0.27 0.27 0.28
0.34 0.36 0.37 0.40 0.48
0.23 0.24 0.26 0.29 0.35
0.19 0.19 0.20 0.21 0.25
0.18 0.18 0.18 0.18 0.21
0.17 0.17 0.17 0.17 0.19
0.99 0.99 0.99 0.98 0.98
0.95 0.93 0.90 0.84 0.73
0.94 0.93 0.91 0.85 0.75
0.96 0.96 0.95 0.92 0.85
0.98 0.98 0.97 0.95 0.92
0.99 0.98 0.98 0.97 0.96
1.00 1.00 1.00 1.00 1.00
0.96 0.95 0.93 0.86 0.75
0.98 0.96 0.95 0.89 0.78
0.99 0.99 0.99 0.97 0.91
1.00 1.00 1.00 0.99 0.97
1.00 1.00 1.00 1.00 1.00
1.00 1.00 0.98 0.97 0.96
0.97 0.97 0.96 0.96 0.96
0.97 0.96 0.96 0.96 0.94
0.98 0.98 0.97 0.96 0.95
0.99 0.98 0.98 0.97 0.96
∆j,R4 : evaluation of the Kalman filter with cross-sectional heteroscedasticity
−4 −3 −2 −1 0
0.45 0.45 0.47 0.50 0.57
0.35 0.35 0.36 0.39 0.44
0.30 0.30 0.30 0.31 0.34
0.29 0.28 0.28 0.29 0.30
∆j,R4 /∆j,R1 : relative performances of simple principal components
−4 −3 −2 −1 0
0.97 0.95 0.92 0.88 0.80
0.97 0.95 0.93 0.89 0.82
0.98 0.97 0.97 0.95 0.90
0.99 0.98 0.98 0.97 0.95
∆j,R4 /∆j,R2 : relative performances of weighted principal components
−4 −3 −2 −1 0
0.98 0.96 0.94 0.90 0.81
0.98 0.97 0.96 0.92 0.84
0.99 0.99 0.99 0.98 0.94
1.00 1.00 1.00 0.99 0.98
∆j,R4 /∆j,R3 : relative performances of the Kalman filter with cross-sectional homoscedasticity
−4 −3 −2 −1 0
1.00 0.99 0.99 0.98 0.97
0.99 0.99 0.98 0.98 0.98
0.99 0.98 0.98 0.98 0.99
0.99 0.99 0.98 0.98 0.97
1.00 0.99 0.99 0.99 0.98
Table 2 Out-of-sample evaluation—Root mean squared errors relative to constant growth. Horizon
Preceding quarter (backcast)
Current quarter (nowcast)
One quarter ahead (forecast)
Average across horizons
AR VAR KF PC
0.82 0.81 0.71 0.78
0.91 0.89 0.76 0.86
1 0.98 0.78 0.90
0.92 0.90 0.75 0.85
of eight forecasts for GDP growth in a given quarter and, for each forecast, they replicate the real-time data release pattern found in the dataset at the time in which the forecasts are made (see also Giannone et al., 2008). The exercise is pseudo-real time since, because of the lack of real time vintages, they use revised data and hence data revision are not taken into account. Table 2 presents the results for the evaluation of the accuracy of predictions at different horizons. Results are expressed in terms of root mean squared errors (RMSE) relative to a naive benchmark of constant GDP growth. The predictions are produced for previous quarter (backcast), current quarter (nowcast) and one quarter ahead (forecasts). The average RMSE across horizons is also reported. Beside our model (KF), we describe performance of a univariate autoregressive model (AR), the average of bivariate quarterly vector auto regression (VAR) models, principal component forecasts (PC) where the EM algorithm developed by Stock and Watson (2002a) is used to deal with missing observations at the end of the sample (jagged edge). A number below one indicates an improvement with respect to the naive forecast. As mentioned, for all these models, the authors consider different criteria for the specifications of the number of lags and factors. The Table report results for their preferred specifications. Results are self-explanatory and point to some advantage of KF with respect to other models. 6. Conclusions We have shown (n, T ) consistency and rates of common factors estimated via a two-step procedure whereby, in the first step,
the parameters of a dynamic approximate factor model are first estimated by a OLS regression of the variables on principal components and, in the second step, given the estimated parameters, the factors are estimated by the Kalman smoother. This procedure allows us to take into account, in the estimation of the factors, both factor dynamics and idiosyncratic heteroscedasticity, features that are likely to be relevant in the panels of data typically used in empirical applications in macroeconomics. We show that it is consistent, even if the models which is used in the Kalman smoother is misspecified. This consistency result is confirmed in a Monte Carlo exercise, which also shows that our approach improves the estimation of the factors when n is small. The parametric approach studied in this paper provides the theory for two applications of factor models in large cross-sections: treatment of unbalanced panels (Giannone et al., 2004, 2008) and estimation of shocks in structural factor models (Giannone et al., 2004). The first application, emphasized in this paper, is a key element for nowcasting economic activity. Appendix A.1. Consistency of Kalman smoothing: population results Proof of Lemma 1. (i) As Xt = Λ0 Gt + ξt , we get: XT = (IT ⊗ Λ0 )GT + ZT . It then immediately follows from assumption (A3) that EΩ R (Gt X′T ) = EΩ R (Gt G′T )(IT ⊗ Λ′0 ) = U′t ΣG,R (IT ⊗ Λ′0 ) 0
0
C. Doz et al. / Journal of Econometrics 164 (2011) 188–205
and : Gt /T ,R = EΩ R (Gt XT )(EΩ R (XT XT )) 0 0 ′
′
= Ut ΣG,R (IT ⊗ ′
−1
Second, we know that: A0 (L)Gt = wt . As (Gt ) is a stationary process, if we denote: W0 = E [wt wt′ ] we have, for any ω ∈ [−π , +π ]:
XT
Λ0 )ΣX−,1R XT . ′
(ii) We have already noted that, when R = R3 or R = R4, the model is correctly specified for (Gt ), so that ΣGR = ΣG . For any ω ∈ [−π , +π ], let us now denote by SG (ω) the spectral density matrix of (Gt ) calculated in ω. In order to show the two announced properties, we first show that if
SG (ω) =
x′ ΣG x =
x′t ΓG (t − τ )xτ =
t =1 τ =1
T − T −
x′t Φ0 (t − τ )xτ .
t =1 τ =1
We thus get
−
x′ ΣG x =
x′t
1≤t ,τ ≤T
+π
∫
+π
∫
=
x′t SG (ω)xτ e−iω(t −τ )
+π
−
= −π
′ −i ω t
xt e
SG (ω)
−
iωτ
xτ e
dω
1≤τ ≤T
1 ≤t ≤T
2 − ′ −i ω t xt e ∈ m dω, M −π 1≤t ≤T ∫ +π − 2 x′t e−iωt dω . × −π 1≤t ≤T ∫
+π
Now
2 ∫ +π ∫ +π − − ′ −i ω t ′ −iωt −iωτ xt e xt e xτ e dω dω = −π 1≤t ≤T −π 1≤t ,τ ≤T − ∫ +π = x′t xτ e−iω(t −τ ) dω 1≤t ,τ ≤T
= 2π
−
−π
x′t xt = 2π
1 ≤t ≤T
−
‖xt ‖2 = 2π.
1 ≤t ≤T
We thus obtain that any eigenvalue of ΣG belongs to [2π m, 2π M ], which gives the announced result. Let us now show that m > 0 and M < ∞, which will prove that
‖ΣG ‖ = λmax (ΣG ) ≤ 2π M < ∞ and ‖ΣG−1 ‖ =
1
λmin (ΣG )
≤
1 2π m
< ∞.
First, it is clear, from assumption (A3), that for any ω ∈ [−π , +π ]:
+∞ +∞ 1 − 1 − iωh ‖SG (ω)‖ = Φ0 (h)e ≤ ‖Φ (h)‖ < +∞ 2π h=−∞ 0 2π h=−∞ so that: M < +∞.
2π 1
.
x′ A0 (eiω )
W0 A′0 (e−iω )
−1
−1
x¯
1 λmin (W0 ) 2π
λmin (SG (ω)) ≥
dω
1≤t ,τ ≤T
−π
−1
α0
so that
−π
−
∫
SG (ω)e−iω(t −τ ) dω xτ
If we denote α0 = Maxω∈[−π ,+π ] ‖A0 (eiω )‖2 , we know that α0 is finite and we get x′ SG (ω)¯x ≥
W0 A′0 (e−iω )
−1 ′ −iω −1 ‖¯x λmin (W0 )‖x′ A0 (eiω ) A0 (e ) 2π −1 1 λmin (W0 )λmin A′0 (e−iω )A0 (eiω ) ≥ 2π 1 λmin (W0 ) = 2π λmax A′0 (e−iω )A0 (eiω ) 1 λmin (W0 ) = . 2π ‖A0 (eiω )‖2
‖xt ‖2 = 1, we can write
T − T −
−1
1
≥
then: 2π m ≤ λmin (ΣG ) and 2π M ≥ λmax (ΣG ). In order to show this property, we generalize to the rdimensional process (Gt ) the proof which is given by Brockwell et al. (1991) (Proposition 4.5.3) in the univariate case. If x = (x′1 , . . . , x′T )′ is a non-random vector of RrT such that t =1
A0 (eiω )
For any x ∈ Cn such that ‖x‖2 = 1, we then have:
M = Maxω∈[−π,+π] λmax (SG (ω))
∑T
1
2π
x′ SG (ω)¯x =
m = Minω∈[−π,+π] λmin (SG (ω)) and
‖x‖2 =
197
1 λmin (W0 ) 2π
α0
.
(iii) For any ω ∈ [−π , +π ], let us now denote by Sξ (ω) the spectral density matrix of (ξt ) calculated in ω. If x = (x1 , . . . , xn )′ is a nonrandom vector of Cn such that: ‖x‖2 = x′ x¯ = 1, we have x′ Sξ (ω)¯x =
1 − ′ 1 − ′ x Γξ (h)eiωh x¯ = x Ψ0 (h)eiωh x¯ 2π h∈Z 2π h∈Z
so that
|x′ Sξ (ω)¯x| ≤
1 − 2π h∈Z
|x′ Ψ0 (h)¯x| ≤
1 − 2π h∈Z
‖Ψ0 (h)‖.
¯ such that, for any n: From assumption (CR3), we can define λ ∑ ¯ ‖ Ψ ( h )‖ < λ . 0 h∈Z ¯ , so We thus have, for any ω ∈ [−π , +π ]: λmax Sξ (ω) ≤ 1 λ 2π
that we finally get Maxω∈[−π ,+π ] λmax (Sξ (ω)) ≤
1 2π
¯ λ.
¯. Applying the same result as in (ii), we then obtain: ‖ΣZ ‖ ≤ λ Further, ‖ΣZ ,R ‖ = ‖IT ⊗ Ψ0,R ‖ = ‖Ψ0,R ‖, so that ‖ΣZ ,R ‖ ≤ λ¯ . Finally, ‖ΣZ−,R1 ‖ = ‖IT ⊗ Ψ0−,R1 ‖ = ‖Ψ0−,R1 ‖. As ‖Ψ0−,R1 ‖ = λmax (Ψ0−,R1 ) = λ (1Ψ ) , it follows from assumption (CR4) that min
‖ΣZ−,R1 ‖ = O(1).
0,R
Proof of Proposition 1. It follows from assumption (A3) that
ΣX ,R = EΩ R (XT X′T ) = (IT ⊗ Λ0 )EΩ R (GT G′T )(IT ⊗ Λ′0 ) + EΩ R (ZT Z′T ) 0
0
0
= (IT ⊗ Λ0 )ΣG,R (IT ⊗ Λ′0 ) + ΣZ ,R . Further, as (ξt ) is supposed to be a white noise in both Ω R3 and Ω R4 specifications, we also have ΣZ ,R = IT ⊗ Ψ0R .
198
C. Doz et al. / Journal of Econometrics 164 (2011) 188–205
Using the same kind of formula as the formula, we have used to calculate Σ0−1 , it can be easily checked that
ΣX−,1R = ΣZ−,R1 − ΣZ−,R1 (IT ⊗ Λ0 ) −1 (IT ⊗ Λ′0 )ΣZ−,R1 . × ΣG−,R1 + (IT ⊗ Λ′0 )ΣZ−,R1 (IT ⊗ Λ0 )
−1 U′t (ΣG−,R1 + IT ⊗ M0 )−1 ΣG−,R1 (IT ⊗ M0−1 Λ′0 Ψ0R )XT
= G2t /T ,R + G3t /T ,R
Using the fact that ΣZ−,R1 = IT ⊗ Ψ0−,R1 , we then get
(IT ⊗
Λ0 )ΣX−,1R ′
=
=
=
Turning to the second term of the summation, it can in turn be decomposed in two parts. Indeed, as XT = (IT ⊗ Λ0 )GT + ZT , we can write
with −1 G2t /T ,R = U′t (ΣG−,R1 + IT ⊗ M0 )−1 ΣG−,R1 (IT ⊗ M0−1 Λ′0 Ψ0R )
IT ⊗ Λ0 Ψ0−,R1 − IT ⊗ Λ′0 Ψ0−,R1 Λ0 × (ΣG−,R1 + IT ⊗ Λ′0 Ψ0−,R1 Λ0 )−1 IT ⊗ Λ′0 Ψ0−,R1 −1 ΣG,R + IT ⊗ Λ′0 Ψ0−,R1 Λ0 − IT ⊗ Λ′0 Ψ0−,R1 Λ0 × (ΣG−,R1 + IT ⊗ Λ′0 Ψ0−,R1 Λ0 )−1 IT ⊗ Λ′0 Ψ0−,R1 −1 ΣG−,R1 (ΣG−,R1 + IT ⊗ Λ′0 Ψ0−,R1 Λ0 )−1 IT ⊗ Λ′0 Ψ0R . ′
× (IT ⊗ Λ0 )GT −1 −1 −1 ΣG,R (IT ⊗ (Λ′0 Ψ0R Λ0 )−1 = U′t ΣG−,R1 + IT ⊗ M0 −1 × Λ′0 Ψ0R Λ0 )GT −1 −1 −1 ′ = Ut ΣG,R + IT ⊗ M0 ΣG , R G T
Using Lemma 1(i), we thus obtain
and
−1 −1 Gt /T ,R = U′t (ΣG−,R1 + IT ⊗ Λ′0 Ψ0R Λ0 )−1 (IT ⊗ Λ′0 Ψ0R )XT .
−1 G3t /T ,R = U′t (ΣG−,R1 + IT ⊗ M0 )−1 ΣG−,R1 (IT ⊗ M0−1 Λ′0 Ψ0R )ZT .
Before proving the proposition, let us first recall a relation, which we use in that proof as well as in others. If A and B are two square invertible matrices, it is possible to write: B−1 − A−1 = B−1 (A − B)A−1 , so that the relation:
We can write
(A + H )−1 = A−1 − (A + H )−1 HA−1
(R)
also gives a Taylor expansion of the inversion operator at order zero when H is small with respect to A. −1 Using relation (R), and denoting M0 = Λ′0 Ψ0R Λ0 , we then get
−1 × ΣG−,R1 (IT ⊗ M0−1 Λ′0 Ψ0R )XT −1 = M0−1 Λ′0 Ψ0R Xt − U′t (ΣG−,R1 + IT ⊗ M0 )−1
−1 −1 G1t /T ,R = (Λ′0 Ψ0R Λ0 )−1 Λ′0 Ψ0R Xt
−1 × ΣG−,R1 ΣG−,R1 + IT ⊗ M0 Ut .
−1 −1 −1 −1 ΣG−,R1 + IT ⊗ M0 ΣG,R ΣG,R + IT ⊗ M0 −2 ≤ ‖ΣG−,R1 ‖ ΣG−,R1 + IT ⊗ M0 .
E ‖G2t /T ,R ‖2 ≤ ‖ΣG−,R1 ‖tr U′t (IT ⊗ M0−2 )Ut
G2t /T ,R
Λ0 Ψ0R ξt ‖ ] 2
= E tr(Λ0 Ψ0R Λ0 ) Λ0 Ψ0R ξt ξt Ψ0R Λ0 (Λ0 Ψ0R Λ0 ) −1 −1 −1 −1 = tr (Λ′0 Ψ0R Λ0 )−1 Λ′0 Ψ0R Ψ0 Ψ0R Λ0 (Λ′0 Ψ0R Λ0 )−1 . −1/2
As Ψ0R
−1
′
−1/2
Ψ0 Ψ0R
−1
≤
−1
′
λmax (Ψ0 ) I , λmin (Ψ0R ) n
′
−1
−1
′
−1
−1
−1
′
−1
−1
m.s.
−→ Gt and
G1t /T ,R
1 n2
.
−→ 0 and
−1
−1
G2t /T ,R
= OP
1 n
.
If we use the same type of properties that we have used for the study of G2t /T ,R , we can write
(CR1) and (CR2). We have thus obtained: G1t /T ,R
m.s.
E ‖G3t /T ,R ‖2
we get ′
(Λ0 Ψ0R Λ0 ) Λ0 Ψ0R Ψ0 Ψ0R Λ0 (Λ0 Ψ0R Λ0 ) λmax (Ψ0 ) ′ −1 ≤ ( Λ Ψ Λ 0 ) −1 λmin (Ψ0R ) 0 0R −1 −1 so that E ‖(Λ′0 Ψ0R Λ0 )−1 Λ′0 Ψ0R ξt ‖2 = OP 1n by assumptions ′
−1 = tr U′t (ΣG−,R1 + IT ⊗ M0 )−1 ΣG−,R1 (IT ⊗ M0−1 Λ′0 Ψ0R )ΣZ −1 −1 −1 −1 −1 × (IT ⊗ Ψ0R Λ0 M0 )ΣG,R (ΣG,R + IT ⊗ M0 ) Ut . We thus get −1 −1 E ‖G3t /T ,R ‖2 ≤ ‖ΣG−,R1 (IT ⊗ M0−1 Λ′0 Ψ0R )ΣZ (IT ⊗ M0−1 Λ′0 Ψ0R )
× ΣG−,R1 ‖tr U′t (ΣG−,R1 + IT ⊗ M0 )−2 Ut = Gt + OP
≤
It then follows from assumptions (CR1) and (CR2) and from Lemma 1(ii) that:
with
= ‖ΣG−,R1 ‖tr M0−2 = O
−1 −1 = Gt + (Λ′0 Ψ0R Λ0 )−1 Λ′0 Ψ0R ξt
−1
IT ⊗ M0−1 . We then get
−1 −1 = (Λ′0 Ψ0R Λ0 )−1 Λ′0 Ψ0R (Λ0 Gt + ξt )
E [‖(Λ0 Ψ0R Λ0 )
−1
Let us denote G1t /T ,R the first term of the previous summation. We can write
−1
Now, ΣG−,R1 + IT ⊗ M0 ≥ IT ⊗ M0 so that: ΣG−,R1 + IT ⊗ M0
−1 )XT . × ΣG−,R1 (IT ⊗ M0−1 Λ′0 Ψ0R
′
As ΣG−,R1 ≤ λmax (ΣG−,R1 )IrT , with λmax (ΣG−,R1 ) = ‖ΣG−,R1 ‖, we have
−1 = U′t (IT ⊗ M0−1 Λ′0 Ψ0R )XT − U′t (ΣG−,R1 + IT ⊗ M0 )−1
−1
E ‖G2t /T ,R ‖2 = tr U′t ΣG−,R1 + IT ⊗ M0
−1 )XT × ΣG−,R1 (IT ⊗ M0−1 ))(IT ⊗ Λ′0 Ψ0R
−1
−1
ΣG−,R1 E GT G′T −1 × ΣG−,R1 ΣG−,R1 + IT ⊗ M0 Ut .
As E GT G′T = ΣG = ΣG,R , we then get
Gt /T ,R = U′t (IT ⊗ M0−1 − (ΣG−,R1 + IT ⊗ M0 )−1
′
E ‖G2t /T ,R ‖2 = tr U′t ΣG−,R1 + IT ⊗ M0
1
√
n
.
−1 2 ≤ ‖ΣG−,R1 ‖2 ‖(IT ⊗ M0−1 Λ′0 Ψ0R )‖ ‖ΣZ ‖tr ′ −1 −2 × Ut (ΣG,R + IT ⊗ M0 ) Ut .
C. Doz et al. / Journal of Econometrics 164 (2011) 188–205
From Lemma 1(ii) and (iii) we know that ‖ΣG−,R1 ‖ = O(1) and ‖ΣZ ‖ = O(1). Further, using assumptions (CR1) and (CR2), we can write, as before tr U′t (ΣG−,R1 + IT ⊗ M0 )−2 Ut ≤ tr U′t (IT ⊗ M0 )−2 Ut
= tr(M0 ) = O −2
1 n2
199
so that
T 1 − 1 ′ G G − Ir = OP √ and T t =1 t t T T 1 − n ′ . ξt ξt − Ψ0 = OP √ T t =1 T
and
1
−1 −1 ‖IT ⊗ M0−1 Λ′0 Ψ0R ‖ = ‖M0−1 Λ′0 Ψ0R ‖=O √
n
∑ T
It also follows from these assumptions that T1
.
OP
It then follows that E ‖G3t /T ,R ‖2 = O n13 , so that 1 m.s. G3t /T ,R −→ 0 and G3t /T ,R = OP √
A.2. Consistency of PCA

Before proving Proposition 2, we need to establish some preliminary lemmas.

Lemma 2. Under assumptions (CR1)–(CR3), (A1)–(A5), the following properties hold, as $n, T \to \infty$:
(i) $\frac{1}{n}\|S - \Lambda_0\Lambda_0'\| = O\big(\frac{1}{n}\big) + O_P\big(\frac{1}{\sqrt T}\big)$
(ii) $\frac{1}{n}\|\hat D - D_0\| = O\big(\frac{1}{n}\big) + O_P\big(\frac{1}{\sqrt T}\big)$
(iii) $n\|\hat D^{-1} - D_0^{-1}\| = O_P\big(\frac{1}{n}\big) + O_P\big(\frac{1}{\sqrt T}\big)$
(iv) $D_0\hat D^{-1} = I_r + O_P\big(\frac{1}{n}\big) + O_P\big(\frac{1}{\sqrt T}\big)$.

Proof. (i) $\frac{1}{n}\|S - \Lambda_0\Lambda_0'\| \le \frac{1}{n}\|S - \Sigma_0\| + \frac{1}{n}\|\Sigma_0 - \Lambda_0\Lambda_0'\|$. As $\Sigma_0 = \Lambda_0^*\Lambda_0^{*\prime} + \Psi_0 = \Lambda_0\Lambda_0' + \Psi_0$, we have by assumption (CR2): $\frac{1}{n}\|\Sigma_0 - \Lambda_0\Lambda_0'\| = \frac{1}{n}\|\Psi_0\| = O\big(\frac{1}{n}\big)$.

We also have
\[
S = \frac{1}{T}\sum_{t=1}^T X_tX_t' = \Lambda_0\Big(\frac{1}{T}\sum_{t=1}^T G_tG_t'\Big)\Lambda_0' + \Lambda_0\Big(\frac{1}{T}\sum_{t=1}^T G_t\xi_t'\Big) + \Big(\frac{1}{T}\sum_{t=1}^T \xi_tG_t'\Big)\Lambda_0' + \frac{1}{T}\sum_{t=1}^T \xi_t\xi_t',
\]
so that
\[
\frac{1}{n}(S - \Sigma_0) = \frac{1}{n}\Lambda_0\Big(\frac{1}{T}\sum_{t=1}^T G_tG_t' - I_r\Big)\Lambda_0' + \frac{1}{n}\Lambda_0\Big(\frac{1}{T}\sum_{t=1}^T G_t\xi_t'\Big) + \frac{1}{n}\Big(\frac{1}{T}\sum_{t=1}^T \xi_tG_t'\Big)\Lambda_0' + \frac{1}{n}\Big(\frac{1}{T}\sum_{t=1}^T \xi_t\xi_t' - \Psi_0\Big).
\]
Then, using assumptions (A3) and (CR3) and a multivariate extension of the proof given in the univariate case by Brockwell and Davis (1991, pp. 226–227), it is possible to show that
\[
E\Big\|\frac{1}{T}\sum_{t=1}^T G_tG_t' - I_r\Big\|^2 = O\Big(\frac{1}{T}\Big) \quad\text{and}\quad E\Big\|\frac{1}{T}\sum_{t=1}^T \xi_t\xi_t' - \Psi_0\Big\|^2 = O\Big(\frac{n^2}{T}\Big).
\]
It also follows from these assumptions that
\[
\Big\|\frac{1}{T}\sum_{t=1}^T G_tG_t' - I_r\Big\| = O_P\Big(\frac{1}{\sqrt T}\Big) \quad\text{and}\quad \Big\|\frac{1}{T}\sum_{t=1}^T \xi_t\xi_t' - \Psi_0\Big\| = O_P\Big(\frac{n}{\sqrt T}\Big).
\]
As $(G_t)$ and $(\xi_t)$ are two independent processes, we have
\[
E\Big\|\frac{1}{T}\sum_{t=1}^T G_t\xi_t'\Big\|^2 \le E\,\mathrm{tr}\Big(\frac{1}{T^2}\sum_{t,s} G_t\xi_t'\xi_sG_s'\Big) = \frac{1}{T^2}\sum_{t,s} E(\xi_t'\xi_s)E(G_s'G_t) = \frac{1}{T^2}\sum_{t,s} \mathrm{tr}\big(\Psi_0(s-t)\big)\,\mathrm{tr}\big(\Phi_0(t-s)\big)
\]
\[
\le \frac{1}{T^2}\sum_{t,s} \big|\mathrm{tr}\big(\Psi_0(s-t)\big)\big|\,\big|\mathrm{tr}\big(\Phi_0(t-s)\big)\big| \le \frac{nr}{T^2}\sum_{t,s} \|\Psi_0(s-t)\|\,\|\Phi_0(t-s)\| = \frac{nr}{T}\sum_{h=-T+1}^{T-1}\Big(1 - \frac{|h|}{T}\Big)\|\Psi_0(h)\|\,\|\Phi_0(-h)\|
\]
\[
\le \frac{nr}{T}\sum_{h\in\mathbb Z}\|\Psi_0(h)\|\,\|\Phi_0(-h)\| \le \frac{nr}{T}\,\mathrm{Max}_{h\in\mathbb Z}\|\Psi_0(h)\|\sum_{h\in\mathbb Z}\|\Phi_0(h)\|.
\]
We thus obtain $E\big\|\frac{1}{T}\sum_{t=1}^T G_t\xi_t'\big\|^2 = O\big(\frac{n}{T}\big)$, and the result follows.

(ii) $\hat D$ is the diagonal matrix of the $r$ first eigenvalues of $S$, in decreasing order. $D_0$ is a diagonal matrix which is equal to $\Lambda_0'\Lambda_0$; it is then also equal to the diagonal matrix of the $r$ first eigenvalues of $\Lambda_0\Lambda_0'$ in decreasing order. Further, if we denote by $\lambda_1(A) \ge \lambda_2(A) \ge \cdots \ge \lambda_n(A)$ the ordered eigenvalues of a symmetric matrix $A$, we can write, from Weyl's theorem, that for any $j = 1,\ldots,r$:
\[
|\lambda_j(S) - \lambda_j(\Lambda_0\Lambda_0')| \le \|S - \Lambda_0\Lambda_0'\|
\]
(see for instance Horn and Johnson (1990, p. 181)). The result then immediately follows from (i).

(iii) By assumptions (CR1) and (CR2), we know that $\frac{1}{n}D_0 = O(1)$ and that $\big(\frac{1}{n}D_0\big)^{-1} = O(1)$. It then results from (ii) that the eigenvalues of $\frac{1}{n}\hat D$ and of $\big(\frac{1}{n}\hat D\big)^{-1}$ are $O_P(1)$, so that $\frac{1}{n}\hat D = O_P(1)$ and $\big(\frac{1}{n}\hat D\big)^{-1} = O_P(1)$. The result then follows from (ii) and from the decomposition:
\[
n(\hat D^{-1} - D_0^{-1}) = \Big(\frac{1}{n}\hat D\Big)^{-1} - \Big(\frac{1}{n}D_0\Big)^{-1} = \Big(\frac{1}{n}\hat D\Big)^{-1}\Big(\frac{1}{n}D_0 - \frac{1}{n}\hat D\Big)\Big(\frac{1}{n}D_0\Big)^{-1} = O_P\Big(\frac{1}{n}\Big) + O_P\Big(\frac{1}{\sqrt T}\Big).
\]

(iv) $D_0\hat D^{-1} = I_r + \frac{D_0}{n}\Big[\big(\frac{1}{n}\hat D\big)^{-1} - \big(\frac{1}{n}D_0\big)^{-1}\Big]$. The result then follows from (iii) and assumption (CR2). $\square$
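For readers who want to see Lemma 2 at work, the following small simulation sketch (ours; illustrative sizes, Gaussian data, $\Psi_0 = I_n$) computes the scaled distance $\frac{1}{n}\|S - \Lambda_0\Lambda_0'\|$ of part (i) and checks the Weyl inequality used in part (ii).

```python
import numpy as np

rng = np.random.default_rng(1)
n, T, r = 100, 500, 2

Lam0 = rng.standard_normal((n, r))   # loadings with O(1) rows, so ||Lam0||^2 = O(n)
G = rng.standard_normal((T, r))      # factors with V(G_t) = I_r
xi = rng.standard_normal((T, n))     # idiosyncratic noise, Psi_0 = I_n
X = G @ Lam0.T + xi

S = X.T @ X / T                      # sample covariance matrix
M = Lam0 @ Lam0.T

# Lemma 2(i): (1/n)||S - Lam0 Lam0'|| is small for large n and T
print(np.linalg.norm(S - M, 2) / n)

# Weyl: |lambda_j(S) - lambda_j(Lam0 Lam0')| <= ||S - Lam0 Lam0'|| for the top r eigenvalues
eig_S = np.sort(np.linalg.eigvalsh(S))[::-1][:r]
eig_M = np.sort(np.linalg.eigvalsh(M))[::-1][:r]
print(np.max(np.abs(eig_S - eig_M)), np.linalg.norm(S - M, 2))
```

Increasing n and T in this sketch shrinks the first printed quantity, in line with the $O(1/n) + O_P(1/\sqrt T)$ rate.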
Lemma 3. Let us denote $\hat A = \hat P'P_0$, with $\hat A = (\hat a_{ij})_{1\le i,j\le r}$. The following properties hold:
(i) $\hat a_{ij} = O_P\big(\frac{1}{n}\big) + O_P\big(\frac{1}{\sqrt T}\big)$ for $i \ne j$.
(ii) $\hat a_{ii}^2 = 1 + O_P\big(\frac{1}{n}\big) + O_P\big(\frac{1}{\sqrt T}\big)$ for $i = 1,\ldots,r$.

Proof. (i) As $S\hat P = \hat P\hat D$, we have $\hat P = S\hat P\hat D^{-1}$ and
\[
\hat P'P_0 = \hat D^{-1}\hat P'SP_0 = \hat D^{-1}\hat P'(S - \Lambda_0\Lambda_0')P_0 + \hat D^{-1}\hat P'\Lambda_0\Lambda_0'P_0.
\]
As $\Lambda_0 = P_0D_0^{1/2}$ and $P_0'P_0 = I_r$, we have $\Lambda_0\Lambda_0'P_0 = P_0D_0$. We then get
\[
\hat P'P_0 = \Big(\frac{\hat D}{n}\Big)^{-1}\hat P'\Big(\frac{S - \Lambda_0\Lambda_0'}{n}\Big)P_0 + \Big(\frac{\hat D}{n}\Big)^{-1}\hat P'P_0\,\frac{D_0}{n}.
\]
As we saw in Lemma 2, assumptions (CR1) and (CR2) imply that $\frac{D_0}{n}$ and $\big(\frac{D_0}{n}\big)^{-1}$ are $O(1)$ and that $\frac{\hat D}{n}$ and $\big(\frac{\hat D}{n}\big)^{-1}$ are $O_P(1)$. As $\hat P'\hat P = I_r$ and $P_0'P_0 = I_r$, it follows that $\hat P'P_0 = O_P(1)$. Thus, Lemma 2(i) and (iii) imply that
\[
\hat P'P_0 = \Big(\frac{\hat D}{n}\Big)^{-1}\hat P'P_0\,\frac{D_0}{n} + O_P\Big(\frac{1}{n}\Big) + O_P\Big(\frac{1}{\sqrt T}\Big),
\]
or equivalently, using Lemma 2(ii), that $\hat A = D_0^{-1}\hat AD_0 + O_P\big(\frac{1}{n}\big) + O_P\big(\frac{1}{\sqrt T}\big)$. For any $i$ and $j$, the previous relation states that
\[
\hat a_{ij} = \frac{d_{0,jj}}{d_{0,ii}}\,\hat a_{ij} + O_P\Big(\frac{1}{n}\Big) + O_P\Big(\frac{1}{\sqrt T}\Big).
\]
For $i \ne j$ we know, from assumption (A4), that $d_{0,jj} \ne d_{0,ii}$. We then obtain
\[
\hat a_{ij} = O_P\Big(\frac{1}{n}\Big) + O_P\Big(\frac{1}{\sqrt T}\Big) \quad\text{for } i \ne j.
\]

(ii) To study the asymptotic behavior of $\hat a_{ii}$, let us now use the relation $\hat D = \hat P'S\hat P$, which implies, together with Lemma 2(i), that
\[
\frac{\hat D}{n} = \hat P'\frac{S}{n}\hat P = \hat P'\frac{\Lambda_0\Lambda_0'}{n}\hat P + O_P\Big(\frac{1}{n}\Big) + O_P\Big(\frac{1}{\sqrt T}\Big),
\]
or, equivalently, that
\[
\frac{\hat D}{n} = \hat P'P_0\,\frac{D_0}{n}\,P_0'\hat P + O_P\Big(\frac{1}{n}\Big) + O_P\Big(\frac{1}{\sqrt T}\Big) = \hat A\,\frac{D_0}{n}\,\hat A' + O_P\Big(\frac{1}{n}\Big) + O_P\Big(\frac{1}{\sqrt T}\Big).
\]
Thus, for $i = 1,\ldots,r$:
\[
\frac{\hat d_{ii}}{n} = \sum_{k=1}^r \frac{d_{0,kk}}{n}\,\hat a_{ik}^2 + O_P\Big(\frac{1}{n}\Big) + O_P\Big(\frac{1}{\sqrt T}\Big),
\]
and it then follows from Lemma 2(ii) that
\[
\frac{d_{0,ii}}{n}\big(1 - \hat a_{ii}^2\big) = \sum_{k\ne i}\frac{d_{0,kk}}{n}\,\hat a_{ik}^2 + O_P\Big(\frac{1}{n}\Big) + O_P\Big(\frac{1}{\sqrt T}\Big).
\]
From result (i), we know that $\hat a_{ik} = O_P\big(\frac{1}{n}\big) + O_P\big(\frac{1}{\sqrt T}\big)$ for $i \ne k$. As $\frac{D_0}{n} = O(1)$ and $\big(\frac{d_{0,ii}}{n}\big)^{-1} = O(1)$, it then follows that $\hat a_{ii}^2 = 1 + O_P\big(\frac{1}{n}\big) + O_P\big(\frac{1}{\sqrt T}\big)$ for $i = 1,\ldots,r$. $\square$
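Lemma 3 (and the sign convention adopted in Lemma 4 below) can be illustrated numerically. The following sketch (ours; Gaussian data with well-separated loading scales, in the spirit of assumption (A4)) computes $\hat A = \hat P'P_0$ and flips column signs so that its diagonal is positive; the resulting matrix is close to $I_r$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, T, r = 100, 500, 2

# distinct column scales so the two population eigenvalues are well separated
Lam0 = rng.standard_normal((n, r)) * np.array([2.0, 1.0])
X = rng.standard_normal((T, r)) @ Lam0.T + rng.standard_normal((T, n))

def top_eigvecs(A, r):
    """Orthonormal eigenvectors associated with the r largest eigenvalues."""
    w, v = np.linalg.eigh(A)
    return v[:, np.argsort(w)[::-1][:r]]

P0 = top_eigvecs(Lam0 @ Lam0.T, r)      # population eigenvectors
P_hat = top_eigvecs(X.T @ X / T, r)     # sample (PCA) eigenvectors

A_hat = P_hat.T @ P0
A_hat *= np.sign(np.diag(A_hat))[:, None]  # resolve the column-sign indeterminacy
print(np.round(A_hat, 3))                  # close to I_r: diagonal near 1, off-diagonal small
```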
−1/2 ˆ 1/2 ˆ −1/2 1/2 Finally, τin′ P0 D0 P0′ Pˆ − D0 D D = τin′ Λ0 D0 P0′ Pˆ −
Lemma 4. Under assumptions (CR1)–(CR4), (A1)–(A4), P0 and Pˆ can be defined so as the following properties hold, as n, T → ∞:
n
ˆ′ ˆ ˆ ˆ −1/2 = As PP = Ir , we know that P = OP (1). Then, using D
.
d0,ii n
1
. As this is true for any x ∈ R , it then follows that
D P P0 n0 P0′ P D A n0 A′ OP 1n
= ˆ
n
n
j =1
=
1
It then follows from (i) that x′ (Pˆ − P0 )′ (Pˆ − P0 )x = OP
E‖τin′ (S − Σ0 )‖2 =
T
1
for i = 1, . . . , r .
√
We then obtain from Lemma 2(i) that Pˆ ′ P0 = Ir +OP
√
It then follows from Lemma 2(ii) that OP
1
+ OP
n
1
1/2
√1 T
For any i and j the previous relation states that aˆ ij =
ˆ −1 and Σ0 = P0 D0 P0′ + Ψ0 , so that (iii) We have Pˆ = S Pˆ D
n
1
x′ (Pˆ − P0 )′ (Pˆ − P0 )x = x′ (2Ir − Pˆ ′ P0 − P0′ Pˆ )x.
As we saw in Lemma 2, assumptions (CR1) and (CR2) imply that
1
−1/2 ˆ 1/2
D0
D
ˆ −1/2 . D
C. Doz et al. / Journal of Econometrics 164 (2011) 188–205
201
1
1
As Vxit = ‖τin′ Λ0 ‖2 + ψ0,ii , it follows from assumption (A2) that τin′ Λ0 = O(1). −1/2 ˆ 1/2 Further, P0′ Pˆ − D0 D = OP 1n + OP √1T by ˆ −1/2 = OP √1 and D10/2 = Lemma 2(iv) and Lemma 4(i). As D n √
ˆ −1/2 Pˆ ′ P0⊥ P0′ ⊥ ξt = OP D
1 1 −1/2 τin′ P0 D0 P0′ Pˆ − D0 Dˆ 1/2 Dˆ −1/2 = OP + OP √
n 1 1 1− ˆ ¯ ψ= ψ0,ii + OP √ + OP √ = OP (1).
O
n , it then follows that
n
which completes the proof.
T
T
1 n
+
Then, applying Lemma 2(iv) a second time, and using the fact that Gt = OP (1), we get −1/2
1/2
D0 Gt = OP
1 n
+ OP
1
√
T
.
ˆ −1/2 Pˆ ′ ξt , let us first decompose ξt as ξt = In order to study D P0 P0′ ξt +P0⊥ P0′ ⊥ ξt where P0⊥ is a (n×(n−r )) matrix whose columns form an orthonormal basis of the orthogonal space of P0 . We then obtain ˆ −1/2 Pˆ ′ ξt = Dˆ −1/2 Pˆ ′ P0 P0′ ξt + Dˆ −1/2 Pˆ ′ P0⊥ P0′ ⊥ ξt . D
√
First, let us note that P0′ ξt = OP (1) and that P0′ ⊥ ξt = OP ( n). Indeed, we can write E ‖P0′ ξt ‖
2
= E ξt′ P0 P0′ ξt = E tr P0′ ξt ξt′ P0 ′ = tr P0 Ψ0 P0 ≤ r λ1 (Ψ0 ) = O(1) ′ 2 and E ‖P0⊥ ξt ‖ = E tr P0′ ⊥ ξt ξt′ P0⊥ ′ = tr P0⊥ Ψ0 P0⊥ ≤ (n − r )λ1 (Ψ0 ) = O(n). ˆ −1 = OP 1 , we then get from As Lemma 2(iii) implies that D n
Lemma 4(i) that
ˆ −1/2 Pˆ ′ P0 P0′ ξt = OP D
1
√
n
.
Pˆ ′ P0⊥ = OP
1 n
+ OP
1
√
T
.
ˆ −1 , we can write Pˆ ′ P0⊥ = Dˆ −1 Pˆ ′ SP0⊥ . Indeed, if we use Pˆ = S Pˆ D As P0 and Λ0 have the same range, P0′ ⊥ Λ0 = 0, so that we also have ˆ −1 Pˆ ′ (S − Λ0 Λ′0 )P0⊥ = Pˆ ′ P0⊥ = D
−1 ˆ D n
S − Λ0 Λ0 Pˆ ′ P0⊥ . ′
n
As P0⊥ P0⊥ = In−r , we have P0⊥ = O(1). It then follows from Lemma 2(i) and (ii) that ′
Pˆ ′ P0⊥ = OP
1 n
+ OP
ˆ −1/2 = OP Then, as D that
1
√
T
−1
ˆ 1/2 Pˆ ′ Xt = Dˆ + ψˆ¯ Ir D
−1
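Before turning to the proof of Proposition 2, here is a numerical sketch (ours; illustrative sizes) of the principal-components estimates $\hat G_t = \hat D^{-1/2}\hat P'X_t$ that the proposition studies. The loadings are generated in the normalized form $\Lambda_0 = P_0D_0^{1/2}$ used throughout this appendix, with well-separated diagonal $D_0$ as in assumption (A4).

```python
import numpy as np

rng = np.random.default_rng(3)
n, T, r = 200, 500, 2

# loadings in the normalized form Lam0 = P0 D0^{1/2}, D0 diagonal with separated entries
P0, _ = np.linalg.qr(rng.standard_normal((n, r)))
D0 = np.diag(np.array([2.0, 1.0]) * n)
Lam0 = P0 @ np.sqrt(D0)

G = rng.standard_normal((T, r))
X = G @ Lam0.T + rng.standard_normal((T, n))

w, v = np.linalg.eigh(X.T @ X / T)
idx = np.argsort(w)[::-1][:r]
P_hat, D_hat = v[:, idx], np.diag(w[idx])

G_hat = X @ P_hat / np.sqrt(np.diag(D_hat))   # row t is G_hat_t' = X_t' P_hat D_hat^{-1/2}
G_hat *= np.sign(np.sum(G_hat * G, axis=0))   # resolve the sign indeterminacy
print(np.sqrt(np.mean((G_hat - G) ** 2)))     # shrinks as n and T grow
```

The printed root mean squared error decreases as n and T increase, consistent with the $O_P(1/\sqrt n) + O_P(1/\sqrt T)$ rate established below.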
Proof of Proposition 2. We can write
\[
\hat G_t - G_t = \hat D^{-1/2}\hat P'X_t - G_t = \hat D^{-1/2}\hat P'(\Lambda_0G_t + \xi_t) - G_t
= \big(\hat D^{-1/2}\hat P'P_0D_0^{1/2} - I_r\big)G_t + \hat D^{-1/2}\hat P'\xi_t
= \hat D^{-1/2}\big(\hat P'P_0 - \hat D^{1/2}D_0^{-1/2}\big)D_0^{1/2}G_t + \hat D^{-1/2}\hat P'\xi_t.
\]
Lemma 2(iv) and Lemma 4(i) give: $\hat P'P_0 - \hat D^{1/2}D_0^{-1/2} = O_P\big(\frac{1}{n}\big) + O_P\big(\frac{1}{\sqrt T}\big)$. Then, applying Lemma 2(iv) a second time, and using the fact that $G_t = O_P(1)$, we get
\[
\hat D^{-1/2}\big(\hat P'P_0 - \hat D^{1/2}D_0^{-1/2}\big)D_0^{1/2}G_t = O_P\Big(\frac{1}{n}\Big) + O_P\Big(\frac{1}{\sqrt T}\Big).
\]
In order to study $\hat D^{-1/2}\hat P'\xi_t$, let us first decompose $\xi_t$ as $\xi_t = P_0P_0'\xi_t + P_{0\perp}P_{0\perp}'\xi_t$, where $P_{0\perp}$ is an $(n\times(n-r))$ matrix whose columns form an orthonormal basis of the orthogonal space of $P_0$. We then obtain
\[
\hat D^{-1/2}\hat P'\xi_t = \hat D^{-1/2}\hat P'P_0P_0'\xi_t + \hat D^{-1/2}\hat P'P_{0\perp}P_{0\perp}'\xi_t.
\]
First, let us note that $P_0'\xi_t = O_P(1)$ and that $P_{0\perp}'\xi_t = O_P(\sqrt n)$. Indeed, we can write
\[
E\|P_0'\xi_t\|^2 = E(\xi_t'P_0P_0'\xi_t) = E\,\mathrm{tr}(P_0'\xi_t\xi_t'P_0) = \mathrm{tr}(P_0'\Psi_0P_0) \le r\lambda_1(\Psi_0) = O(1)
\]
and
\[
E\|P_{0\perp}'\xi_t\|^2 = E\,\mathrm{tr}(P_{0\perp}'\xi_t\xi_t'P_{0\perp}) = \mathrm{tr}(P_{0\perp}'\Psi_0P_{0\perp}) \le (n-r)\lambda_1(\Psi_0) = O(n).
\]
As Lemma 2(iii) implies that $\hat D^{-1} = O_P\big(\frac{1}{n}\big)$, we then get from Lemma 4(i) that
\[
\hat D^{-1/2}\hat P'P_0P_0'\xi_t = O_P\Big(\frac{1}{\sqrt n}\Big).
\]
In order to study the second term, let us first show that
\[
\hat P'P_{0\perp} = O_P\Big(\frac{1}{n}\Big) + O_P\Big(\frac{1}{\sqrt T}\Big).
\]
Indeed, if we use $\hat P = S\hat P\hat D^{-1}$, we can write $\hat P'P_{0\perp} = \hat D^{-1}\hat P'SP_{0\perp}$. As $P_0$ and $\Lambda_0$ have the same range, $P_{0\perp}'\Lambda_0 = 0$, so that we also have
\[
\hat P'P_{0\perp} = \hat D^{-1}\hat P'(S - \Lambda_0\Lambda_0')P_{0\perp} = \Big(\frac{\hat D}{n}\Big)^{-1}\hat P'\Big(\frac{S - \Lambda_0\Lambda_0'}{n}\Big)P_{0\perp}.
\]
As $P_{0\perp}'P_{0\perp} = I_{n-r}$, we have $P_{0\perp} = O(1)$. It then follows from Lemma 2(i) and (ii) that $\hat P'P_{0\perp} = O_P\big(\frac{1}{n}\big) + O_P\big(\frac{1}{\sqrt T}\big)$. Then, as $\hat D^{-1/2} = O_P\big(\frac{1}{\sqrt n}\big)$ and $P_{0\perp}'\xi_t = O_P(\sqrt n)$, it follows that
\[
\hat D^{-1/2}\hat P'P_{0\perp}P_{0\perp}'\xi_t = O_P\Big(\frac{1}{n}\Big) + O_P\Big(\frac{1}{\sqrt T}\Big),
\]
which completes the proof of the proposition. $\square$

Proof of Corollary 1. (i) As $\hat{\bar\psi} = \frac{1}{n}\mathrm{tr}\,\hat\Psi$, it follows from Proposition 2(iii) and assumption (CR3) that
\[
\hat{\bar\psi} = \frac{1}{n}\sum_{i=1}^n \psi_{0,ii} + O_P\Big(\frac{1}{\sqrt n}\Big) + O_P\Big(\frac{1}{\sqrt T}\Big) = O_P(1).
\]
We can write
\[
\hat G_{t/T,R_1} = \big(\hat D + \hat{\bar\psi}I_r\big)^{-1}\hat D^{1/2}\hat P'X_t = \big(\hat D + \hat{\bar\psi}I_r\big)^{-1}\hat D\,\hat G_t.
\]
Since $\hat D = O_P(n)$, the result then immediately follows from Proposition 2(i) and the fact that $\big(\hat D + \hat{\bar\psi}I_r\big)^{-1}\hat D = I_r + O_P\big(\frac{1}{n}\big)$.

In order to prove (ii), we first prove the following lemma, which we will also use in the proof of Proposition 5.

Lemma 5. Under assumptions (A1)–(A4), (CR1)–(CR4), if we denote $\Psi_{0R} = \bar\psi_0I_n$ or $\Psi_{0R} = \Psi_{0d}$, and $\hat\Psi_R = \hat{\bar\psi}I_n$ or $\hat\Psi_R = \hat\Psi_d$, the following properties hold:
(i) $(\hat P - P_0)'\Psi_{0R}^{-1}P_0 = O_P\big(\frac{1}{n}\big) + O_P\big(\frac{1}{\sqrt T}\big)$
(ii) $\hat P'\hat\Psi_R^{-1}\hat P - P_0'\Psi_{0R}^{-1}P_0 = O_P\big(\frac{1}{n}\big) + O_P\big(\frac{1}{\sqrt T}\big)$
(iii) $\|\hat P'\hat\Psi_R^{-1} - P_0'\Psi_{0R}^{-1}\| = O_P\big(\frac{1}{n}\big) + O_P\big(\frac{1}{\sqrt T}\big)$
(iv) $\|(\hat P'\hat\Psi_R^{-1}\hat P)^{-1}\hat P'\hat\Psi_R^{-1} - (P_0'\Psi_{0R}^{-1}P_0)^{-1}P_0'\Psi_{0R}^{-1}\| = O_P\big(\frac{1}{n}\big) + O_P\big(\frac{1}{\sqrt T}\big)$
(v) $\frac{1}{n}\|\hat\Lambda'\hat\Psi_R^{-1}\hat\Lambda - \Lambda_0'\Psi_{0R}^{-1}\Lambda_0\| = O_P\big(\frac{1}{n}\big) + O_P\big(\frac{1}{\sqrt T}\big)$
(vi) $\|(\hat\Lambda'\hat\Psi_R^{-1}\hat\Lambda)^{-1}\hat\Lambda'\hat\Psi_R^{-1} - (\Lambda_0'\Psi_{0R}^{-1}\Lambda_0)^{-1}\Lambda_0'\Psi_{0R}^{-1}\| = O_P\big(\frac{1}{n\sqrt n}\big) + O_P\big(\frac{1}{\sqrt{nT}}\big)$.

Proof. (i) Defining $P_{0\perp}$ as we did in the proof of Proposition 2, we can write
\[
(\hat P - P_0)'\Psi_{0R}^{-1}P_0 = (\hat P - P_0)'(P_0P_0' + P_{0\perp}P_{0\perp}')\Psi_{0R}^{-1}P_0
= (\hat P'P_0 - I_r)P_0'\Psi_{0R}^{-1}P_0 + \hat P'P_{0\perp}P_{0\perp}'\Psi_{0R}^{-1}P_0.
\]
We have seen before (see proof of Proposition 2) that $\hat P'P_{0\perp} = O_P\big(\frac{1}{n}\big) + O_P\big(\frac{1}{\sqrt T}\big)$. As $P_0'\Psi_{0R}^{-1}P_0$ and $P_{0\perp}'\Psi_{0R}^{-1}P_0$ are $O(1)$, the result then follows from Lemma 4(i).

(ii) $\hat P'\hat\Psi_R^{-1}\hat P - P_0'\Psi_{0R}^{-1}P_0 = \hat P'(\hat\Psi_R^{-1} - \Psi_{0R}^{-1})\hat P + \hat P'\Psi_{0R}^{-1}\hat P - P_0'\Psi_{0R}^{-1}P_0$. As $\|\hat P'(\hat\Psi_R^{-1} - \Psi_{0R}^{-1})\hat P\| \le \|\hat P\|^2\|\hat\Psi_R^{-1} - \Psi_{0R}^{-1}\| = \|\hat\Psi_R^{-1} - \Psi_{0R}^{-1}\|$, and as $\|\hat\Psi_R^{-1} - \Psi_{0R}^{-1}\| = \mathrm{Max}_{1\le i\le n}|\hat\psi_{ii}^{-1} - \psi_{0,ii}^{-1}|$, it follows from Proposition 2(iii) that
\[
\|\hat P'(\hat\Psi_R^{-1} - \Psi_{0R}^{-1})\hat P\| = O_P\Big(\frac{1}{n}\Big) + O_P\Big(\frac{1}{\sqrt T}\Big).
\]
Further,
\[
\|\hat P'\Psi_{0R}^{-1}\hat P - P_0'\Psi_{0R}^{-1}P_0\| = \|(\hat P - P_0)'\Psi_{0R}^{-1}P_0 + P_0'\Psi_{0R}^{-1}(\hat P - P_0) + (\hat P - P_0)'\Psi_{0R}^{-1}(\hat P - P_0)\|
\le 2\|(\hat P - P_0)'\Psi_{0R}^{-1}P_0\| + \|\Psi_{0R}^{-1}\|\,\|\hat P - P_0\|^2.
\]
It then follows from Lemma 4(ii), assumption (CR2), and Lemma 5(i) that
\[
\|\hat P'\Psi_{0R}^{-1}\hat P - P_0'\Psi_{0R}^{-1}P_0\| = O_P\Big(\frac{1}{n}\Big) + O_P\Big(\frac{1}{\sqrt T}\Big),
\]
so that (ii) follows.

(iii) In the same way, $\|\hat P'\hat\Psi_R^{-1} - P_0'\Psi_{0R}^{-1}\| \le \|\hat P'(\hat\Psi_R^{-1} - \Psi_{0R}^{-1})\| + \|(\hat P - P_0)'\Psi_{0R}^{-1}\|$, with
\[
\|\hat P'(\hat\Psi_R^{-1} - \Psi_{0R}^{-1})\| \le \|\hat\Psi_R^{-1} - \Psi_{0R}^{-1}\| = O_P\Big(\frac{1}{n}\Big) + O_P\Big(\frac{1}{\sqrt T}\Big)
\]
and
\[
\|(\hat P - P_0)'\Psi_{0R}^{-1}\| = \|(\hat P - P_0)'(P_0P_0' + P_{0\perp}P_{0\perp}')\Psi_{0R}^{-1}\|
\le \|\hat P'P_0 - I_r\|\,\|P_0'\Psi_{0R}^{-1}\| + \|\hat P'P_{0\perp}\|\,\|P_{0\perp}'\Psi_{0R}^{-1}\| = O_P\Big(\frac{1}{n}\Big) + O_P\Big(\frac{1}{\sqrt T}\Big).
\]

(iv) As $\|\Psi_0\| = O(1)$ by assumption (A4), we know from Proposition 2(iii) that $\|\hat\Psi_R^{-1}\| = O_P(1)$, so that $(\hat P'\hat\Psi_R^{-1}\hat P)^{-1} = O_P(1)$. We then can write
\[
\|(\hat P'\hat\Psi_R^{-1}\hat P)^{-1}\hat P'\hat\Psi_R^{-1} - (P_0'\Psi_{0R}^{-1}P_0)^{-1}P_0'\Psi_{0R}^{-1}\|
= \big\|(\hat P'\hat\Psi_R^{-1}\hat P)^{-1}(\hat P'\hat\Psi_R^{-1} - P_0'\Psi_{0R}^{-1}) + \big((\hat P'\hat\Psi_R^{-1}\hat P)^{-1} - (P_0'\Psi_{0R}^{-1}P_0)^{-1}\big)P_0'\Psi_{0R}^{-1}\big\|
\]
\[
= \big\|(\hat P'\hat\Psi_R^{-1}\hat P)^{-1}(\hat P'\hat\Psi_R^{-1} - P_0'\Psi_{0R}^{-1}) + (\hat P'\hat\Psi_R^{-1}\hat P)^{-1}\big[P_0'\Psi_{0R}^{-1}P_0 - \hat P'\hat\Psi_R^{-1}\hat P\big](P_0'\Psi_{0R}^{-1}P_0)^{-1}P_0'\Psi_{0R}^{-1}\big\|
\]
\[
\le \|(\hat P'\hat\Psi_R^{-1}\hat P)^{-1}\|\,\|\hat P'\hat\Psi_R^{-1} - P_0'\Psi_{0R}^{-1}\| + \|(\hat P'\hat\Psi_R^{-1}\hat P)^{-1}\|\,\|P_0'\Psi_{0R}^{-1}P_0 - \hat P'\hat\Psi_R^{-1}\hat P\|\,\|(P_0'\Psi_{0R}^{-1}P_0)^{-1}\|\,\|P_0'\Psi_{0R}^{-1}\|.
\]
The result then follows from (ii) and (iii).

(v) $\frac{1}{n}\hat\Lambda'\hat\Psi_R^{-1}\hat\Lambda = \frac{1}{n}\hat D^{1/2}\hat P'\hat\Psi_R^{-1}\hat P\hat D^{1/2} = \hat D^{1/2}D_0^{-1/2}\Big(\frac{1}{n}D_0^{1/2}\hat P'\hat\Psi_R^{-1}\hat PD_0^{1/2}\Big)D_0^{-1/2}\hat D^{1/2}$. The result then follows from Lemma 2(iv), Lemma 5(ii), and the fact that $\frac{D_0}{n} = O(1)$.

(vi) We can write
\[
(\hat\Lambda'\hat\Psi_R^{-1}\hat\Lambda)^{-1}\hat\Lambda'\hat\Psi_R^{-1} - (\Lambda_0'\Psi_{0R}^{-1}\Lambda_0)^{-1}\Lambda_0'\Psi_{0R}^{-1}
= \hat D^{-1/2}(\hat P'\hat\Psi_R^{-1}\hat P)^{-1}\hat P'\hat\Psi_R^{-1} - D_0^{-1/2}(P_0'\Psi_{0R}^{-1}P_0)^{-1}P_0'\Psi_{0R}^{-1}
\]
\[
= \hat D^{-1/2}\Big[(\hat P'\hat\Psi_R^{-1}\hat P)^{-1}\hat P'\hat\Psi_R^{-1} - \hat D^{1/2}D_0^{-1/2}(P_0'\Psi_{0R}^{-1}P_0)^{-1}P_0'\Psi_{0R}^{-1}\Big].
\]
The result then follows from (iv), Lemma 2(iv), and the fact that $\hat D^{-1/2} = O_P\big(\frac{1}{\sqrt n}\big)$. $\square$

Proof of Corollary 1. (ii) As $\hat G_{t/T,R_2} = (\hat\Lambda'\hat\Psi_d^{-1}\hat\Lambda + I_r)^{-1}\hat\Lambda'\hat\Psi_d^{-1}X_t = (\hat\Lambda'\hat\Psi_d^{-1}\hat\Lambda + I_r)^{-1}\hat\Lambda'\hat\Psi_d^{-1}(\Lambda_0G_t + \xi_t)$, we have
\[
\|\hat G_{t/T,R_2} - G_t\| \le \|(\hat\Lambda'\hat\Psi_d^{-1}\hat\Lambda + I_r)^{-1}\hat\Lambda'\hat\Psi_d^{-1} - (\Lambda_0'\Psi_{0d}^{-1}\Lambda_0 + I_r)^{-1}\Lambda_0'\Psi_{0d}^{-1}\|\,\|\Lambda_0G_t + \xi_t\|
\]
\[
+ \|(\Lambda_0'\Psi_{0d}^{-1}\Lambda_0 + I_r)^{-1}\Lambda_0'\Psi_{0d}^{-1}\Lambda_0 - I_r\|\,\|G_t\| + \|(\Lambda_0'\Psi_{0d}^{-1}\Lambda_0 + I_r)^{-1}\Lambda_0'\Psi_{0d}^{-1}\|\,\|\xi_t\|,
\]
with
• $\|(\hat\Lambda'\hat\Psi_d^{-1}\hat\Lambda + I_r)^{-1}\hat\Lambda'\hat\Psi_d^{-1} - (\Lambda_0'\Psi_{0d}^{-1}\Lambda_0 + I_r)^{-1}\Lambda_0'\Psi_{0d}^{-1}\| = O_P\big(\frac{1}{n\sqrt n}\big) + O_P\big(\frac{1}{\sqrt{nT}}\big)$ by Lemma 5(vi)
• $\|(\Lambda_0'\Psi_{0d}^{-1}\Lambda_0 + I_r)^{-1}\Lambda_0'\Psi_{0d}^{-1}\| = O\big(\frac{1}{\sqrt n}\big)$ by assumptions (CR1)–(CR4)
• $\|(\Lambda_0'\Psi_{0d}^{-1}\Lambda_0 + I_r)^{-1}\Lambda_0'\Psi_{0d}^{-1}\Lambda_0 - I_r\| = O\big(\frac{1}{n}\big)$ by assumptions (CR1)–(CR4)
• $\|G_t\| = O_P(1)$, $\|\xi_t\| = O_P(\sqrt n)$, and $\|\Lambda_0G_t + \xi_t\| \le \|\Lambda_0\|\,\|G_t\| + \|\xi_t\| = O_P(\sqrt n)$.

The result immediately follows. $\square$
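The estimate $\hat G_{t/T,R_2}$ of Corollary 1(ii) has a simple closed form. The following sketch (ours; the function name and inputs are hypothetical) implements the cross-sectional GLS step $(\hat\Lambda'\hat\Psi_d^{-1}\hat\Lambda + I_r)^{-1}\hat\Lambda'\hat\Psi_d^{-1}X_t$ for all $t$ at once, assuming a diagonal $\hat\Psi_d$.

```python
import numpy as np

def gls_factor_estimate(X, Lam_hat, psi_hat_diag):
    """Cross-sectional GLS step of Corollary 1(ii):
    G_hat_{t,R2} = (Lam' Psi_d^{-1} Lam + I_r)^{-1} Lam' Psi_d^{-1} x_t, for every t.
    X: (T, n) data, Lam_hat: (n, r) loadings, psi_hat_diag: (n,) diagonal of Psi_d."""
    r = Lam_hat.shape[1]
    W = Lam_hat / psi_hat_diag[:, None]               # Psi_d^{-1} Lam, diagonal Psi only
    A = np.linalg.solve(W.T @ Lam_hat + np.eye(r), W.T)
    return X @ A.T                                    # (T, r) matrix of estimates

# illustrative usage with the PCA quantities from the earlier sketch:
# Lam_hat = P_hat @ np.sqrt(D_hat)
# psi_hat = np.clip(np.diag(X.T @ X / T - Lam_hat @ Lam_hat.T), 1e-6, None)
# G_tilde = gls_factor_estimate(X, Lam_hat, psi_hat)
```

Relative to the plain PCA estimate, this step reweights the cross-section by the inverse idiosyncratic variances, which is what drives the sharper rates in the bullets above.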
A.3. Consistency of Kalman filtering ($\hat\Omega_{R_3}$ and $\hat\Omega_{R_4}$ framework)

Proof of Proposition 3. (i) Consider the sample autocovariance of the estimated principal components
\[
\hat\Gamma_{\hat G}(h) = \frac{1}{T-h}\sum_{t=h+1}^T \hat G_t\hat G_{t-h}' = \hat D^{-1/2}\hat P'S(h)\hat P\hat D^{-1/2},
\]
with $S(h) = \frac{1}{T-h}\sum_{t=h+1}^T X_tX_{t-h}'$. For any $h < T$, we can decompose $\hat\Gamma_{\hat G}(h)$ as
\[
\hat\Gamma_{\hat G}(h) = \hat D^{-1/2}\hat P'\Lambda_0\Phi_0(h)\Lambda_0'\hat P\hat D^{-1/2} + \hat D^{-1/2}\hat P'\big(S(h) - \Lambda_0\Phi_0(h)\Lambda_0'\big)\hat P\hat D^{-1/2}.
\]
First, we can write
\[
\hat D^{-1/2}\hat P'\Lambda_0\Phi_0(h)\Lambda_0'\hat P\hat D^{-1/2} = \hat D^{-1/2}\hat P'P_0D_0^{1/2}\Phi_0(h)D_0^{1/2}P_0'\hat P\hat D^{-1/2}.
\]
It then follows from Lemma 2(iv), Lemma 4(i), and the fact that $\Phi_0(h) = O(1)$ that
\[
\hat D^{-1/2}\hat P'\Lambda_0\Phi_0(h)\Lambda_0'\hat P\hat D^{-1/2} = \Phi_0(h) + O_P\Big(\frac{1}{n}\Big) + O_P\Big(\frac{1}{\sqrt T}\Big).
\]
Then, under assumptions (A3) and (CR3), it is possible to extend what has been done in Lemma 2(i) for $h = 0$, and to show that
\[
\frac{1}{n}\|S(h) - \Lambda_0\Phi_0(h)\Lambda_0'\| = O_P\Big(\frac{1}{n}\Big) + O_P\Big(\frac{1}{\sqrt T}\Big), \quad\text{uniformly in } h \le p.
\]
Indeed, if we decompose $S(h)$ as
\[
S(h) = \frac{1}{T-h}\Big(\Lambda_0\sum_{t=h+1}^T G_tG_{t-h}'\,\Lambda_0' + \Lambda_0\sum_{t=h+1}^T G_t\xi_{t-h}' + \sum_{t=h+1}^T \xi_tG_{t-h}'\,\Lambda_0' + \sum_{t=h+1}^T \xi_t\xi_{t-h}'\Big),
\]
we get
\[
\frac{1}{n}\big(S(h) - \Lambda_0\Phi_0(h)\Lambda_0'\big) = \frac{1}{n}\Lambda_0\Big(\frac{1}{T-h}\sum_{t=h+1}^T G_tG_{t-h}' - \Phi_0(h)\Big)\Lambda_0'
+ \frac{1}{n}\Big(\Lambda_0\frac{1}{T-h}\sum_{t=h+1}^T G_t\xi_{t-h}' + \frac{1}{T-h}\sum_{t=h+1}^T \xi_tG_{t-h}'\,\Lambda_0'\Big)
+ \frac{1}{n}\Big(\frac{1}{T-h}\sum_{t=h+1}^T \xi_t\xi_{t-h}' - \Psi_0(h)\Big) + \frac{1}{n}\Psi_0(h).
\]
Then, using assumptions (A3) and (CR3) and a multivariate extension of the proof given in the univariate case by Brockwell and Davis (1991, pp. 226–227), it is possible, as in Lemma 2(i), to show that
\[
E\Big\|\frac{1}{T-h}\sum_{t=h+1}^T G_tG_{t-h}' - \Phi_0(h)\Big\|^2 = O\Big(\frac{1}{T}\Big) \quad\text{and}\quad
E\Big\|\frac{1}{T-h}\sum_{t=h+1}^T \xi_t\xi_{t-h}' - \Psi_0(h)\Big\|^2 = O\Big(\frac{n^2}{T}\Big),
\]
so that
\[
\Big\|\frac{1}{T-h}\sum_{t=h+1}^T G_tG_{t-h}' - \Phi_0(h)\Big\| = O_P\Big(\frac{1}{\sqrt T}\Big) \quad\text{and}\quad
\Big\|\frac{1}{T-h}\sum_{t=h+1}^T \xi_t\xi_{t-h}' - \Psi_0(h)\Big\| = O_P\Big(\frac{n}{\sqrt T}\Big).
\]
Using the same kind of arguments as we have used in Lemma 2(i), it then also follows that
\[
\Big\|\frac{1}{T-h}\sum_{t=h+1}^T G_t\xi_{t-h}'\Big\| = O_P\Big(\frac{\sqrt n}{\sqrt T}\Big).
\]
From assumption (CR1), we also have $\|\Lambda_0\| = O(\sqrt n)$ and $\|\Psi_0(h)\| = O(1)$, so that
\[
\frac{1}{n}\|S(h) - \Lambda_0\Phi_0(h)\Lambda_0'\| = O_P\Big(\frac{1}{n}\Big) + O_P\Big(\frac{1}{\sqrt T}\Big).
\]
Finally, as
\[
\hat D^{-1/2}\hat P'\big(S(h) - \Lambda_0\Phi_0(h)\Lambda_0'\big)\hat P\hat D^{-1/2} = \Big(\frac{\hat D}{n}\Big)^{-1/2}\hat P'\Big(\frac{S(h) - \Lambda_0\Phi_0(h)\Lambda_0'}{n}\Big)\hat P\Big(\frac{\hat D}{n}\Big)^{-1/2}
\]
and $\big(\frac{\hat D}{n}\big)^{-1/2} = O_P(1)$, it follows that
\[
\hat D^{-1/2}\hat P'\big(S(h) - \Lambda_0\Phi_0(h)\Lambda_0'\big)\hat P\hat D^{-1/2} = O_P\Big(\frac{1}{n}\Big) + O_P\Big(\frac{1}{\sqrt T}\Big).
\]

(ii) Let us first recall that any VAR($p$) model can be written in a VAR(1) form. More precisely, if we denote $G_t^{(p)} = (G_t', G_{t-1}', \ldots, G_{t-p+1}')'$, we can write
\[
G_t^{(p)} = A_0^{(p)}G_{t-1}^{(p)} + w_t^{(p)},
\quad\text{with}\quad
A_0^{(p)} = \begin{pmatrix} A_{01} & A_{02} & \cdots & A_{0p} \\ I_r & 0 & \cdots & 0 \\ & \ddots & \ddots & \vdots \\ 0 & \cdots & I_r & 0 \end{pmatrix}
\quad\text{and}\quad w_t^{(p)} = (w_t', 0, \ldots, 0)'.
\]
If we denote $\Phi_0^{(p)} = E\big(G_t^{(p)}G_t^{(p)\prime}\big)$ and $\Phi_1^{(p)} = E\big(G_t^{(p)}G_{t-1}^{(p)\prime}\big)$, i.e.
\[
\Phi_0^{(p)} = \begin{pmatrix}
I_r & \Phi_0(1) & \cdots & \Phi_0(p-1) \\
\Phi_0'(1) & I_r & \cdots & \Phi_0(p-2) \\
\vdots & & \ddots & \vdots \\
\Phi_0'(p-1) & \Phi_0'(p-2) & \cdots & I_r
\end{pmatrix},
\qquad
\Phi_1^{(p)} = \begin{pmatrix}
\Phi_0(1) & \Phi_0(2) & \cdots & \Phi_0(p) \\
I_r & \Phi_0(1) & \cdots & \Phi_0(p-1) \\
\Phi_0'(1) & I_r & \cdots & \Phi_0(p-2) \\
\vdots & & \ddots & \vdots \\
\Phi_0'(p-2) & \Phi_0'(p-3) & \cdots & \Phi_0(1)
\end{pmatrix},
\]
we have
\[
A_0^{(p)} = \Phi_1^{(p)}\big(\Phi_0^{(p)}\big)^{-1}.
\]
We can define $\hat\Phi_0^{(p)}$ and $\hat\Phi_1^{(p)}$ having, respectively, the same form as $\Phi_0^{(p)}$ and $\Phi_1^{(p)}$, with $\Phi_0(k)$ replaced by $\hat\Gamma_{\hat G}(k)$ for any value of $k$. Then, we also have $\hat A^{(p)} = \hat\Phi_1^{(p)}\big(\hat\Phi_0^{(p)}\big)^{-1}$, where
\[
\hat A^{(p)} = \begin{pmatrix} \hat A_1 & \hat A_2 & \cdots & \hat A_p \\ I_r & 0 & \cdots & 0 \\ & \ddots & \ddots & \vdots \\ 0 & \cdots & I_r & 0 \end{pmatrix}.
\]
It thus follows from (i) that
\[
\|\Phi_0^{(p)} - \hat\Phi_0^{(p)}\| = O_P\Big(\frac{1}{n}\Big) + O_P\Big(\frac{1}{\sqrt T}\Big) \quad\text{and}\quad \|\Phi_1^{(p)} - \hat\Phi_1^{(p)}\| = O_P\Big(\frac{1}{n}\Big) + O_P\Big(\frac{1}{\sqrt T}\Big).
\]
We have
\[
\|A_0^{(p)} - \hat A^{(p)}\| \le \|\Phi_1^{(p)} - \hat\Phi_1^{(p)}\|\,\|(\Phi_0^{(p)})^{-1}\| + \|\hat\Phi_1^{(p)}\|\,\|(\Phi_0^{(p)})^{-1} - (\hat\Phi_0^{(p)})^{-1}\|.
\]
If we apply to the last term the relation (R) which has been introduced in the proof of Proposition 1, we then get
\[
\|A_0^{(p)} - \hat A^{(p)}\| = O_P\Big(\frac{1}{n}\Big) + O_P\Big(\frac{1}{\sqrt T}\Big).
\]
It then follows that $\|A_{0s} - \hat A_s\| = O_P\big(\frac{1}{n}\big) + O_P\big(\frac{1}{\sqrt T}\big)$ for any $s = 1,\ldots,p$.

As $A_0(L)G_t = w_t$ and $VG_t = I_r$, we know that $Vw_t = W_0$, with
\[
W_0 = E(w_tG_t') = I_r - \sum_{s=1}^p A_{0s}\Phi_0'(s).
\]
In the same way, we have $\hat Vw_t = \hat W = \hat\Gamma_{\hat G}(0) - \sum_{s=1}^p \hat A_s\hat\Gamma_{\hat G}'(s)$. We thus get
\[
\|W_0 - \hat W\| = \Big\|\big(I_r - \hat\Gamma_{\hat G}(0)\big) - \Big(\sum_{s=1}^p \hat A_s\hat\Gamma_{\hat G}'(s) - \sum_{s=1}^p A_{0s}\Phi_0'(s)\Big)\Big\|
\le \|I_r - \hat\Gamma_{\hat G}(0)\| + \sum_{s=1}^p \|\hat A_s - A_{0s}\|\,\|\hat\Gamma_{\hat G}(s)\| + \sum_{s=1}^p \|A_{0s}\|\,\|\hat\Gamma_{\hat G}(s) - \Phi_0(s)\|.
\]
It then follows from Proposition 3(i) and (ii) that
\[
\|W_0 - \hat W\| = O_P\Big(\frac{1}{n}\Big) + O_P\Big(\frac{1}{\sqrt T}\Big). \qquad\square
\]

Proof of Proposition 4. (i) If we denote by $S_{G,R}(\omega)$ and $\hat S_{G,R}(\omega)$ the spectral density matrices of $G_t$ under $\Omega_R$ and $\hat\Omega_R$, for $R = R_3$ and $R_4$, we can apply the same result as in Lemma 1(ii) and we get
\[
\|\hat\Sigma_{G,R} - \Sigma_{G,R}\| \le 2\pi\,\mathrm{Max}_{\omega\in[-\pi,+\pi]}\lambda_{\max}\big(\hat S_{G,R}(\omega) - S_{G,R}(\omega)\big),
\]
or $\|\hat\Sigma_{G,R} - \Sigma_{G,R}\| \le 2\pi\,\mathrm{Max}_{\omega\in[-\pi,+\pi]}\|\hat S_{G,R}(\omega) - S_{G,R}(\omega)\|$.

Turning now to the spectral density matrices, we have
\[
S_{G,R}(\omega) = \frac{1}{2\pi}A_0(e^{i\omega})^{-1}W_0\big(A_0'(e^{-i\omega})\big)^{-1}
\quad\text{and}\quad
\hat S_{G,R}(\omega) = \frac{1}{2\pi}\hat A(e^{i\omega})^{-1}\hat W\big(\hat A'(e^{-i\omega})\big)^{-1}.
\]
As $\|\big(A_0'(e^{-i\omega})\big)^{-1} - \big(\hat A'(e^{-i\omega})\big)^{-1}\| \le \|\big(A_0'(e^{-i\omega})\big)^{-1}\|\,\|A_0'(e^{-i\omega}) - \hat A'(e^{-i\omega})\|\,\|\big(\hat A'(e^{-i\omega})\big)^{-1}\|$, we have
\[
\mathrm{Max}_{\omega\in[-\pi,+\pi]}\big\|\big(A_0'(e^{-i\omega})\big)^{-1} - \big(\hat A'(e^{-i\omega})\big)^{-1}\big\| = O_P\Big(\frac{1}{n}\Big) + O_P\Big(\frac{1}{\sqrt T}\Big).
\]
Using the fact that $\|W_0 - \hat W\| = O_P\big(\frac{1}{n}\big) + O_P\big(\frac{1}{\sqrt T}\big)$, it then immediately follows that
\[
\mathrm{Max}_{\omega\in[-\pi,+\pi]}\|S_{G,R}(\omega) - \hat S_{G,R}(\omega)\| = O_P\Big(\frac{1}{n}\Big) + O_P\Big(\frac{1}{\sqrt T}\Big),
\]
which gives the desired result.

(ii) Since we know, from Lemma 1(ii), that $\|\Sigma_{G,R}\| = O(1)$, it follows from (i) that $\|\hat\Sigma_{G,R}\| = O_P(1)$. Further, $\|\hat\Sigma_{G,R}^{-1}\| = \frac{1}{\lambda_{\min}(\hat\Sigma_{G,R})}$, with $|\lambda_{\min}(\Sigma_{G,R}) - \lambda_{\min}(\hat\Sigma_{G,R})| \le \|\Sigma_{G,R} - \hat\Sigma_{G,R}\|$ by Weyl's theorem. It then follows from (i) and from Lemma 1(ii) that $\|\hat\Sigma_{G,R}^{-1}\| = O_P(1)$. Finally, as $\|\hat\Sigma_{\hat G,R}^{-1} - \Sigma_{G,R}^{-1}\| \le \|\hat\Sigma_{\hat G,R}^{-1}\|\,\|\hat\Sigma_{\hat G,R} - \Sigma_{G,R}\|\,\|\Sigma_{G,R}^{-1}\|$, we also obtain
\[
\|\hat\Sigma_{\hat G,R}^{-1} - \Sigma_{G,R}^{-1}\| = O_P\Big(\frac{1}{n}\Big) + O_P\Big(\frac{1}{\sqrt T}\Big). \qquad\square
\]
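The companion-form and spectral-density objects used in Propositions 3(ii) and 4 are mechanical to compute. The sketch below (ours, with hypothetical VAR coefficients) builds the companion matrix $A^{(p)}$ and evaluates $S_{G,R}(\omega) = \frac{1}{2\pi}A(e^{i\omega})^{-1}W\big(A'(e^{-i\omega})\big)^{-1}$ exactly as displayed above, with $A(z) = I_r - A_1z - \cdots - A_pz^p$.

```python
import numpy as np

def companion(A_list):
    """Stack VAR(p) coefficient matrices A_1..A_p (each r x r) into the
    VAR(1) companion matrix used in the proof of Proposition 3(ii)."""
    r, p = A_list[0].shape[0], len(A_list)
    A = np.zeros((r * p, r * p))
    A[:r, :] = np.hstack(A_list)
    A[r:, :-r] = np.eye(r * (p - 1))   # identity blocks on the sub-diagonal
    return A

def spectral_density(A_list, W, omega):
    """S_G(omega) = (1/2pi) A(e^{i omega})^{-1} W A'(e^{-i omega})^{-1}."""
    r = W.shape[0]
    Az = np.eye(r, dtype=complex)
    for s, As in enumerate(A_list, start=1):
        Az -= As * np.exp(1j * omega) ** s
    Ainv = np.linalg.inv(Az)
    return (Ainv @ W @ Ainv.conj().T) / (2 * np.pi)

A1 = np.array([[0.5, 0.1], [0.0, 0.3]])   # hypothetical stable VAR(1), r = 2
print(companion([A1]))
print(spectral_density([A1], np.eye(2), omega=0.0).real)
```

Replacing the population autocovariances by $\hat\Gamma_{\hat G}(k)$ in the Yule-Walker relation $A^{(p)} = \Phi_1^{(p)}(\Phi_0^{(p)})^{-1}$ gives the estimated coefficients, exactly as in the proof above.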
Proof of Proposition 5. As $G_{t/t,R} = \mathrm{Proj}_{\Omega_R}[G_t|X_t]$ and $\hat G_{t/t,R} = \mathrm{Proj}_{\hat\Omega_R}[G_t|X_t]$, they are obtained through the same formulas, so that, by construction,
\[
\hat G_{t/T,R} = U_t'\big(\hat\Sigma_{\hat G,R}^{-1} + I_T \otimes \hat\Lambda'\hat\Psi_R^{-1}\hat\Lambda\big)^{-1}\big(I_T \otimes \hat\Lambda'\hat\Psi_R^{-1}\big)X_T.
\]
Using relation (R) as in the proof of Proposition 1 (Taylor expansion at order zero), we obtain the same kind of decomposition for $\hat G_{t/T,R}$ as the one we have used to study $G_{t/T,R}$. Thus, if we denote $\hat M = \hat\Lambda'\hat\Psi_R^{-1}\hat\Lambda$, we can write $\hat G_{t/T,R} = \hat G_{1t/T,R} - \hat G_{2t/T,R} - \hat G_{3t/T,R}$, with
\[
\hat G_{1t/T,R} = U_t'\big(I_T \otimes \hat M^{-1}\big)\big(I_T \otimes \hat\Lambda'\hat\Psi_R^{-1}\big)X_T = \hat M^{-1}\hat\Lambda'\hat\Psi_R^{-1}X_t,
\]
\[
\hat G_{2t/T,R} = U_t'\big(\hat\Sigma_{\hat G,R}^{-1} + I_T \otimes \hat M\big)^{-1}\hat\Sigma_{\hat G,R}^{-1}\big(I_T \otimes \hat M^{-1}\hat\Lambda'\hat\Psi_R^{-1}\big)(I_T \otimes \Lambda_0)G_T,
\]
\[
\hat G_{3t/T,R} = U_t'\big(\hat\Sigma_{\hat G,R}^{-1} + I_T \otimes \hat M\big)^{-1}\hat\Sigma_{\hat G,R}^{-1}\big(I_T \otimes \hat M^{-1}\hat\Lambda'\hat\Psi_R^{-1}\big)Z_T.
\]
Let us study separately these three terms. If we compare the first term with $G_{1t/T,R}$, we get
\[
\hat G_{1t/T,R} - G_{1t/T,R} = \Big[(\hat\Lambda'\hat\Psi_R^{-1}\hat\Lambda)^{-1}\hat\Lambda'\hat\Psi_R^{-1} - (\Lambda_0'\Psi_{0R}^{-1}\Lambda_0)^{-1}\Lambda_0'\Psi_{0R}^{-1}\Big]X_t,
\]
with
\[
\|\hat G_{1t/T,R} - G_{1t/T,R}\| \le \big\|(\hat\Lambda'\hat\Psi_R^{-1}\hat\Lambda)^{-1}\hat\Lambda'\hat\Psi_R^{-1} - (\Lambda_0'\Psi_{0R}^{-1}\Lambda_0)^{-1}\Lambda_0'\Psi_{0R}^{-1}\big\|\,\|X_t\|.
\]
As $X_t = O_P(\sqrt n)$, it then follows from Lemma 5(vi) that
\[
\hat G_{1t/T,R} - G_{1t/T,R} = O_P\Big(\frac{1}{n}\Big) + O_P\Big(\frac{1}{\sqrt T}\Big).
\]
Finally, as $G_{1t/T,R} = G_t + O_P\big(\frac{1}{\sqrt n}\big)$, we get
\[
\hat G_{1t/T,R} = G_t + O_P\Big(\frac{1}{\sqrt n}\Big) + O_P\Big(\frac{1}{\sqrt T}\Big).
\]
In the same way, we can write $\hat G_{2t/T,R} - G_{2t/T,R} = U_t'\hat H(I_T \otimes \Lambda_0)G_T$ and $\hat G_{3t/T,R} - G_{3t/T,R} = U_t'\hat HZ_T$, with
\[
\hat H = \big(\hat\Sigma_{\hat G,R}^{-1} + I_T \otimes \hat M\big)^{-1}\hat\Sigma_{\hat G,R}^{-1}\big(I_T \otimes \hat M^{-1}\hat\Lambda'\hat\Psi_R^{-1}\big) - \big(\Sigma_{G,R}^{-1} + I_T \otimes M_0\big)^{-1}\Sigma_{G,R}^{-1}\big(I_T \otimes M_0^{-1}\Lambda_0'\Psi_{0R}^{-1}\big).
\]
We can also decompose $\hat H$ as $\hat H = \hat H_1 + \hat H_2 + \hat H_3$ with
\[
\hat H_1 = \big(\hat\Sigma_{\hat G,R}^{-1} + I_T \otimes \hat M\big)^{-1}\hat\Sigma_{\hat G,R}^{-1}\Big(I_T \otimes \big(\hat M^{-1}\hat\Lambda'\hat\Psi_R^{-1} - M_0^{-1}\Lambda_0'\Psi_{0R}^{-1}\big)\Big),
\]
\[
\hat H_2 = \big(\hat\Sigma_{\hat G,R}^{-1} + I_T \otimes \hat M\big)^{-1}\big(\hat\Sigma_{\hat G,R}^{-1} - \Sigma_{G,R}^{-1}\big)\big(I_T \otimes M_0^{-1}\Lambda_0'\Psi_{0R}^{-1}\big),
\]
\[
\hat H_3 = \Big[\big(\hat\Sigma_{\hat G,R}^{-1} + I_T \otimes \hat M\big)^{-1} - \big(\Sigma_{G,R}^{-1} + I_T \otimes M_0\big)^{-1}\Big]\Sigma_{G,R}^{-1}\big(I_T \otimes M_0^{-1}\Lambda_0'\Psi_{0R}^{-1}\big).
\]
We then get
\[
\|\hat H_1\| \le \big\|\big(\hat\Sigma_{\hat G,R}^{-1} + I_T \otimes \hat M\big)^{-1}\big\|\,\|\hat\Sigma_{\hat G,R}^{-1}\|\,\big\|I_T \otimes \big(\hat M^{-1}\hat\Lambda'\hat\Psi_R^{-1} - M_0^{-1}\Lambda_0'\Psi_{0R}^{-1}\big)\big\|
\le \|\hat M^{-1}\|\,\|\hat\Sigma_{\hat G,R}^{-1}\|\,\|\hat M^{-1}\hat\Lambda'\hat\Psi_R^{-1} - M_0^{-1}\Lambda_0'\Psi_{0R}^{-1}\|,
\]
\[
\|\hat H_2\| \le \big\|\big(\hat\Sigma_{\hat G,R}^{-1} + I_T \otimes \hat M\big)^{-1}\big\|\,\|\hat\Sigma_{\hat G,R}^{-1} - \Sigma_{G,R}^{-1}\|\,\|I_T \otimes M_0^{-1}\Lambda_0'\Psi_{0R}^{-1}\|
\le \|\hat M^{-1}\|\,\|\hat\Sigma_{\hat G,R}^{-1} - \Sigma_{G,R}^{-1}\|\,\|M_0^{-1}\Lambda_0'\Psi_{0R}^{-1}\|,
\]
\[
\|\hat H_3\| \le \big\|\big(\hat\Sigma_{\hat G,R}^{-1} + I_T \otimes \hat M\big)^{-1} - \big(\Sigma_{G,R}^{-1} + I_T \otimes M_0\big)^{-1}\big\|\,\|\Sigma_{G,R}^{-1}\|\,\|I_T \otimes M_0^{-1}\Lambda_0'\Psi_{0R}^{-1}\|
\]
\[
\le \big\|\big(\hat\Sigma_{\hat G,R}^{-1} + I_T \otimes \hat M\big)^{-1}\big\|\,\big\|\hat\Sigma_{\hat G,R}^{-1} + I_T \otimes \hat M - \Sigma_{G,R}^{-1} - I_T \otimes M_0\big\|\,\big\|\big(\Sigma_{G,R}^{-1} + I_T \otimes M_0\big)^{-1}\big\|\,\|\Sigma_{G,R}^{-1}\|\,\|M_0^{-1}\Lambda_0'\Psi_{0R}^{-1}\|
\]
\[
\le \|\hat M^{-1}\|\big(\|\hat\Sigma_{\hat G,R}^{-1} - \Sigma_{G,R}^{-1}\| + \|\hat M - M_0\|\big)\|M_0^{-1}\|\,\|\Sigma_{G,R}^{-1}\|\,\|M_0^{-1}\Lambda_0'\Psi_{0R}^{-1}\|.
\]
From Lemma 5(v), we get $\hat M^{-1} = O_P\big(\frac{1}{n}\big)$. Thus, applying Lemma 5(v) and (vi), and Proposition 4, we get that
\[
\|\hat H_i\| = O_P\Big(\frac{1}{n^2\sqrt n}\Big) + O_P\Big(\frac{1}{n\sqrt{nT}}\Big) \quad\text{for } i = 1, 2, 3,
\]
so that $\|\hat H\| = O_P\big(\frac{1}{n^2\sqrt n}\big) + O_P\big(\frac{1}{n\sqrt{nT}}\big)$.

As $E\|G_T\|^2 = E\big(\sum_{t=1}^T\|G_t\|^2\big) = rT$, we have $\|G_T\| = O_P(\sqrt T)$, so that
\[
\|\hat G_{2t/T,R} - G_{2t/T,R}\| \le \|U_t\|\,\|\hat H\|\,\|I_T \otimes \Lambda_0\|\,\|G_T\| = O_P\Big(\frac{\sqrt T}{n^2}\Big) + O_P\Big(\frac{1}{n}\Big).
\]
Similarly, $E\|Z_T\|^2 = E\big(\sum_{t=1}^T\|\xi_t\|^2\big) = T\,\mathrm{tr}(\Psi_0) = O(nT)$, so that
\[
\|\hat G_{3t/T,R} - G_{3t/T,R}\| \le \|U_t\|\,\|\hat H\|\,\|Z_T\| = O_P\Big(\frac{\sqrt T}{n^2}\Big) + O_P\Big(\frac{1}{n}\Big).
\]
Finally, as we know from the proof of Proposition 1 that $G_{2t/T,R} = O_P\big(\frac{1}{n}\big)$ and $G_{3t/T,R} = O_P\big(\frac{1}{n\sqrt n}\big)$, we get $\hat G_{2t/T,R} + \hat G_{3t/T,R} = O_P\big(\frac{\sqrt T}{n^2}\big) + O_P\big(\frac{1}{n}\big)$, so that
\[
\hat G_{t/T,R} = G_t + O_P\Big(\frac{1}{\sqrt n}\Big) + O_P\Big(\frac{1}{\sqrt T}\Big) + O_P\Big(\frac{\sqrt T}{n^2}\Big).
\]
If $\limsup T/n^3 = O(1)$, we then get
\[
\hat G_{t/T,R} = G_t + O_P\Big(\frac{1}{\sqrt n}\Big) + O_P\Big(\frac{1}{\sqrt T}\Big). \qquad\square
\]
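To close, a numerical sanity check (ours; illustrative sizes, i.i.d. standard factors so that $\Sigma_G = I_{rT}$) that the stacked formula for $G_{t/T,R}$ used at the start of this appendix agrees with the textbook Gaussian projection $E[G_T|X_T] = \mathrm{Cov}(G_T, X_T)\,\mathrm{Var}(X_T)^{-1}X_T$, as the Woodbury identity guarantees. The check compares the full stacked vector; extracting block $t$ with $U_t'$ is then immediate.

```python
import numpy as np

rng = np.random.default_rng(5)
n, T, r = 6, 4, 2

Lam = rng.standard_normal((n, r))
Psi = np.diag(rng.uniform(0.5, 1.5, n))
Sigma_G = np.eye(r * T)                  # i.i.d. standard factors, for simplicity

I_T = np.eye(T)
L = np.kron(I_T, Lam)                    # I_T (x) Lambda
P = np.kron(I_T, Psi)                    # I_T (x) Psi
x = rng.standard_normal(n * T)           # any stacked observation vector X_T

# stacked GLS form used throughout the appendix
M = np.kron(I_T, Lam.T @ np.linalg.inv(Psi) @ Lam)
g_gls = np.linalg.solve(np.linalg.inv(Sigma_G) + M,
                        np.kron(I_T, Lam.T @ np.linalg.inv(Psi)) @ x)

# textbook Gaussian projection E[G_T | X_T] = Cov(G,X) Var(X)^{-1} X_T
g_proj = Sigma_G @ L.T @ np.linalg.solve(L @ Sigma_G @ L.T + P, x)

print(np.max(np.abs(g_gls - g_proj)))    # agree up to rounding error
```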
References

Aastveit, Knut Are, Trovik, Tørres G., 2008. Nowcasting Norwegian GDP: the role of asset prices in a small open economy. Working Paper 2007/09. Norges Bank, January.
Angelini, Elena, Camba-Méndez, Gonzalo, Giannone, Domenico, Rünstler, Gerhard, Reichlin, Lucrezia, 2011. Short-term forecasts of Euro area GDP growth. Econometrics Journal, Royal Economic Society 14 (1), C25–C44.
Bai, Jushan, 2003. Inferential theory for factor models of large dimensions. Econometrica 71 (1), 135–171.
Bai, Jushan, Ng, Serena, 2002. Determining the number of factors in approximate factor models. Econometrica 70 (1), 191–221.
Banbura, Marta, Rünstler, Gerhard, 2007. A look into the factor model black box: publication lags and the role of hard and soft data in forecasting GDP. Working Paper Series 751. European Central Bank, May.
Barhoumi, Karim, Darné, Olivier, Ferrara, Laurent, 2010. Are disaggregate data useful for factor analysis in forecasting French GDP? Journal of Forecasting 29 (1–2), 132–144.
Boivin, Jean, Ng, Serena, 2006. Are more data always better for factor analysis? Journal of Econometrics 127 (1), 169–194.
Brockwell, Peter J., Davis, Richard A., 1991. Time Series: Theory and Methods. Springer.
Chamberlain, Gary, Rothschild, Michael, 1983. Arbitrage, factor structure and mean-variance analysis in large asset markets. Econometrica 51, 1305–1324.
D'Agostino, Antonello, McQuinn, Kieran, O'Brien, Derry, 2008. Now-casting Irish GDP. Research Technical Papers 9/RT/08. Central Bank & Financial Services Authority of Ireland (CBFSAI), November.
Dempster, A., Laird, N., Rubin, D., 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 39, 1–38.
Doz, Catherine, Giannone, Domenico, Reichlin, Lucrezia, 2006a. A two-step estimator for large approximate dynamic factor models based on Kalman filtering. Unpublished Manuscript. Université Libre de Bruxelles.
Doz, Catherine, Giannone, Domenico, Reichlin, Lucrezia, 2006b. A maximum likelihood approach for large approximate dynamic factor models. Review of Economics and Statistics (forthcoming).
ECB, 2008. Short-term forecasting in the Euro area. Monthly Bulletin, April, pp. 69–74.
Engle, Robert F., Watson, Mark, 1981. A one-factor multivariate time series model of metropolitan wage rates. Journal of the American Statistical Association 76 (376), 774–781.
Forni, Mario, Giannone, Domenico, Lippi, Marco, Reichlin, Lucrezia, 2009. Opening the black box: structural factor models with large cross sections. Econometric Theory 25 (5), 1319–1347.
Forni, Mario, Hallin, Marc, Lippi, Marco, Reichlin, Lucrezia, 2000. The generalized dynamic factor model: identification and estimation. Review of Economics and Statistics 82 (4), 540–554.
Forni, Mario, Hallin, Marc, Lippi, Marco, Reichlin, Lucrezia, 2004. The generalized dynamic factor model: consistency and rates. Journal of Econometrics 119 (2), 231–255.
Forni, Mario, Hallin, Marc, Lippi, Marco, Reichlin, Lucrezia, 2005. The generalized dynamic factor model: one-sided estimation and forecasting. Journal of the American Statistical Association 100, 830–840.
Forni, Mario, Lippi, Marco, 2001. The generalized dynamic factor model: representation theory. Econometric Theory 17, 1113–1141.
Forni, Mario, Reichlin, Lucrezia, 2001. Federal policies and local economies: Europe and the US. European Economic Review 45, 109–134.
Geweke, John F., Singleton, Kenneth J., 1980. Maximum likelihood confirmatory factor analysis of economic time series. International Economic Review 22, 37–54.
Giannone, Domenico, Reichlin, Lucrezia, Sala, Luca, 2004. Monetary policy in real time. In: Gertler, Mark, Rogoff, Kenneth (Eds.), NBER Macroeconomics Annual. MIT Press, pp. 161–200.
Giannone, Domenico, Reichlin, Lucrezia, Small, David, 2008. Nowcasting: the real-time informational content of macroeconomic data. Journal of Monetary Economics 55 (4), 665–676.
Horn, Roger A., Johnson, Charles R., 1990. Matrix Analysis. Cambridge University Press.
Matheson, Troy D., 2010. An analysis of the informational content of New Zealand data releases: the importance of business opinion surveys. Economic Modelling 27 (1), 304–314.
Quah, Danny, Sargent, Thomas J., 2004. A dynamic index model for large cross-sections. In: Stock, James, Watson, Mark (Eds.), Business Cycles. University of Chicago Press, pp. 161–200.
Rünstler, G., Barhoumi, K., Cristadoro, R., Den Reijer, A., Jakaitiene, A., Jelonek, P., Rua, A., Ruth, K., Benk, S., Van Nieuwenhuyze, C., 2009. Short-term forecasting of GDP using large monthly data sets: a pseudo real-time forecast evaluation exercise. Journal of Forecasting 28 (7), 595–611.
Siliverstovs, Boriss, Kholodilin, Konstantin A., 2010. Assessing the real-time informational content of macroeconomic data releases for now-/forecasting GDP: evidence for Switzerland. Discussion Papers of DIW Berlin 970.
Stock, James H., Watson, Mark W., 2002a. Forecasting using principal components from a large number of predictors. Journal of the American Statistical Association 97 (460), 1167–1179.
Stock, James H., Watson, Mark W., 2002b. Macroeconomic forecasting using diffusion indexes. Journal of Business and Economic Statistics 20 (2), 147–162.
White, Halbert, 1982. Maximum likelihood estimation of misspecified models. Econometrica 50 (1), 1–25.