The Econometrics Journal (2009), volume 12, pp. 187–207. doi: 10.1111/j.1368-423X.2008.00278.x
Non-parametric regression with a latent time series

OLIVER LINTON†, JENS PERCH NIELSEN‡ AND SØREN FEODOR NIELSEN§

†The London School of Economics and Political Science, Department of Economics, Houghton Street, London WC2A 2AE, United Kingdom. E-mail: [email protected]
‡Cass Business School, City University London, 106 Bunhill Row, London EC1Y 8TZ, United Kingdom. E-mail: [email protected]
§Department of Mathematical Sciences, University of Copenhagen, Universitetsparken 5, DK-2100 Copenhagen Ø, Denmark. E-mail: [email protected]

First version received: October 2007; final version accepted: November 2008
Summary: In this paper we investigate a class of semi-parametric models for panel data sets where the cross-section and time dimensions are large. Our model contains a latent time series that is to be estimated and perhaps forecasted along with a non-parametric covariate effect. Our model is motivated by the need to be flexible with regard to the functional form of covariate effects but also the need to be practical with regard to forecasting of time series effects. We propose estimation procedures based on local linear kernel smoothing; our estimators are all explicitly given. We establish the pointwise consistency and asymptotic normality of our estimators. We also show that the effects of estimating the latent time series can be ignored in certain cases.

Keywords: Forecasting, Kernel estimation, Panel data, Unit roots.
1. INTRODUCTION

Panel data are found in many contexts. Traditionally, they are associated with a series of household surveys conducted over time on the same individuals, for which the cross-sectional dimension is large and the time series dimension is short. Parametric methods appropriate for this kind of data can be found in Hsiao (1986). There has also been some work on semi-parametric models for such data, see e.g. Kyriazidou (1997), and on non-parametric additive models, see Porter (1996). The increase in the length of time series available for these data has led to some interest in the application of time series concepts; see e.g. Arellano (2003). More recently, there has been work on panel data with large cross-section and time series dimensions, especially in finance, where the data sets can be large along both dimensions, and in macro, where there are many series with modest-length time series. Some recent works include Phillips and Moon (1999), Bai and Ng (2002), Bai (2003, 2004) and Pesaran (2006). These authors have addressed a variety of issues
including non-stationarity, estimation of unobserved factors and model selection. They all work with essentially parametric models. In this paper we investigate a class of semi-parametric models for such data sets. Our model contains a latent time series that is to be estimated along with a non-parametric covariate effect. Our model is motivated by the need to be flexible with regard to the functional form of covariate effects but also the need to be practical with regard to forecasting of time series effects. In fact, our main contribution is to provide results that support subsequent time series analysis on the latent time series, and for this purpose it is desirable and important not to require the latent time series to be stationary. Our framework is consistent with the influential model of Carter and Lee (1992) for US mortality. Some other related works in econometrics include Connor and Linton (2002), who applied a similar model to a large financial panel data set. See also Fengler et al. (2007) and Mammen et al. (2006). We propose estimates of the non-parametric component and the latent time series that are based on least-squares objective functions and are defined in closed form. We establish the pointwise asymptotic distribution of our estimator of the non-parametric component and the joint distribution of the estimated latent time series in the case where the time series length is fixed. We then establish some properties in the case where the time series length increases to infinity at some rate. In many cases one wants to do further modelling of the latent time series with a view to forecasting future values. We prove that the estimated latent time series is close enough to the true latent time series that the estimation error can be ignored in such further analysis. We give an application on simulated data.

The paper is organized as follows. In Section 2 we describe our model, while in Section 3 we introduce our estimators of the key components. In Section 4 we give the asymptotic properties of the estimates, while in Section 5 we investigate the application of our estimates to further modelling strategies. We give some numerical evidence on the finite sample performance of our procedures in Section 6, while Section 7 concludes. All proofs are in the Appendix.
2. MODEL

We suppose that the data are generated as an unbalanced panel:
$$Y_{i,t} = \theta_t + g(X_{i,t}) + u_{i,t}, \qquad i = 1,\ldots,n_t,\ t = 1,\ldots,T, \tag{2.1}$$
where the unobserved errors $(u_{i,t})_{i,t}$ satisfy at least the conditional moment restriction $E[u_{i,t} \mid X_{i,t}, \theta_t] = 0$. Here, $(\theta_t)_t$ is an unobserved time series, while $(X_{i,t})_{i,t}$ are observed covariates. We shall assume throughout that $(\theta_t)_t$ is independent of the observed covariates and errors. The distribution theory requires additional conditions on the errors and the covariates to ensure that laws of large numbers and central limit theorems hold; we discuss this further below. The model is a semi-parametric panel data model and some aspects of this have been discussed recently in e.g. Fan and Li (2004), Mammen et al. (2006) and Fan et al. (2007), although our assumptions will be more general in some cases and our focus is different. In particular, the focus of our paper is on the latent time series $(\theta_t)_t$ itself. In practice we expect the distribution of observed and unobserved variables to change over time, and this is allowed for in our model. For example, we wish to allow the covariates to have potentially time-varying densities $f_t$, i.e.
$$X_{i,t} \sim f_t, \qquad i = 1,\ldots,n_t. \tag{2.2}$$
This is different from most previous treatments of this model.
The model can also be thought of as an additive non-parametric regression model in the covariates $t/T$ and $X_{i,t}$, except that the function $t \mapsto \theta_t$ is not assumed to be smooth or even continuous, so most extant theory for additive regression models cannot be applied. Our aim is to estimate the unknown smooth regression function $g(\cdot)$ and the time series $(\theta_t)_t$ from a sample $\{Y_{i,t}, X_{i,t},\ i = 1,\ldots,n_t,\ t = 1,\ldots,T\}$. We allow the data sets to be unbalanced: the number of observations in each time period, denoted $n_t$, and the number of time periods for each observation, denoted $T_i$, are allowed to vary freely but are assumed independent of all other randomness. Observe that the mean of $Y_{i,t}$ is
$$E[Y_{i,t}] = E[\theta_t] + \int g(x)\, f_t(x)\,dx. \tag{2.3}$$
Without further restrictions, the mean of the latent process $\{\theta_t\}$ and the function $g(\cdot)$ are not separately identified: clearly, we may subtract a constant from $\theta_t$ and add it to the function $g$ without changing the distribution of the observed data. In the context of additive models, e.g. Linton and Nielsen (1995), it is common to assume that $E[g(X)] = 0$. However, since we wish to allow for the possibility that the covariate distribution is non-stationary, this is not an attractive assumption. One could instead assume that e.g. $E[g(X_{i,1})] = 0$, which would be consistent with non-stationary covariates. We instead put restrictions on the process $\{\theta_t\}$. A restriction on the mean of $\theta_t$ would effectively rule out non-stationarity in that component. Therefore, we shall impose that $\theta_1 = 0$ (one could choose an arbitrary initial value instead, if this has a better interpretation). This is consistent with the process $\{\theta_t\}$ being a unit-root process starting from the origin. It also allows the process $\{\theta_t\}$ to be asymptotically stationary. We remark that there is an air of arbitrariness in the decomposition between $\theta_t$ and $g(X_{i,t})$, and whatever restriction is imposed cannot get around this. The quantity $\varphi_t = E[Y_{i,t} \mid \theta_t, X_{i,t}] = \theta_t + g(X_{i,t})$ is invariant to the choice of identifying restriction. However, $\varphi_t$ contains two sources of non-stationarity: $\theta_t$ and the changing mean of $g$ due to the changing covariate distribution. It is of interest to separate out these two sources of non-stationarity by examining $\theta_t$ and $f_t$ separately.

We close this section with some motivation for considering the model (2.1). The model captures the general idea of an underlying and unobserved trend modifying the effect of a covariate on a response. For example, suppose that the output of a firm $Q$ is determined by the inputs capital $K$ and labour $L$, but the production function $F$ is subject to technological change $a$ that affects all firms in the industry. This could be captured by the deterministic equation $Q = aF(K, L)$. Taking logs and adding a random error yields the specification (2.1) for $Y_{i,t} = \log Q_{i,t}$, $\theta_t = \log a_t$, and $g(\cdot) = \log F(\cdot)$. Note that $\partial \log Q/\partial \log a = 1$, and this specification imposes so-called Hicks-neutral technical change. In this case, the Total Factor Productivity or Solow residual is $\theta_t$, the part of growth not explainable by measurable changes in the inputs. In the popular special case where the production function is homothetic, one can replace $F(K_{i,t}, L_{i,t})$ by $f(X_{i,t})$, where $X_{i,t}$ is the scalar capital-to-labour ratio. Traditional econometric work chose particular functional forms for $F$, like Cobb–Douglas or CES, and made $\theta_t$ a polynomial function of time.
However, there is no general agreement on the form of production functions, see Jorgensen (1986), and so it is well motivated to treat $g$ as a non-parametric function. Likewise, it is restrictive to assume a particular form for how the technology should change, and so we do not restrict the relationship $t \mapsto \theta_t$. The model assumption that $\theta_1 = 0$ has a natural interpretation in this case, as it corresponds to $a_1 = 1$, in which case $Q_{i,1} = F(K_{i,1}, L_{i,1})$ is a baseline level of production.
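To make the structure of (2.1) concrete, the following sketch simulates a small unbalanced panel from the model with a random-walk latent series started at $\theta_1 = 0$. The particular choices of $g$, of the time-varying densities $f_t$ and of the error variance are illustrative assumptions, not part of the model.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 40                                   # number of time periods
n_t = rng.integers(50, 100, size=T)      # unbalanced cross-section sizes
g = lambda x: np.sin(np.pi * x)          # illustrative smooth covariate effect

# Latent time series: a unit-root process starting from the origin (theta_1 = 0).
theta = np.concatenate(([0.0], np.cumsum(rng.normal(0.0, 0.3, size=T - 1))))

X, Y, period = [], [], []
for t in range(T):
    # Time-varying covariate density f_t: a uniform whose support drifts with t.
    x = rng.uniform(-1 + 0.002 * t, 1 + 0.002 * t, size=n_t[t])
    u = rng.normal(0.0, 0.5, size=n_t[t])        # errors with E[u | X, theta] = 0
    X.append(x); Y.append(theta[t] + g(x) + u); period.append(np.full(n_t[t], t))

X, Y, period = map(np.concatenate, (X, Y, period))
```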
3. ESTIMATION

Our method for estimation of the unknown quantities is based on minimizing sample sums of squared residuals. This has several advantages: it leads to closed-form estimators; it only requires the conditional moment restriction $E[u \mid X, \theta] = 0$ for consistency; and it usually implies an efficient procedure under i.i.d. normal error terms, as has been noted in earlier work. We also adopt the local linear regression paradigm because of its many advantages; see Fan and Gijbels (1996). Extension to the local polynomial case is conceptually straightforward. Our estimation method is related to that considered in the paper of Mammen et al. (2006) except that we consider different identification restrictions, which leads to a slightly different procedure. They consider a more general model with multiple covariates that enter in an additive fashion, which makes their procedure more complicated to describe. Also, they do not provide results for estimation of the latent time series, which is perhaps the main contribution of this paper.

We estimate $(g(x), g'(x))$ for each $x$ in a set $\mathcal X$ and $\theta = (\theta_t)_{t=2,\ldots,T}$ by minimizing the following integrated weighted sum of squares:
$$\sum_{t=1}^{T}\sum_{i=1}^{n_t}\int \left(Y_{i,t} - \theta_t - g(x) - g'(x)(X_{i,t}-x)\right)^2 K\!\left(\frac{X_{i,t}-x}{h_t}\right) d\nu(x)$$
$$\quad = \int \left(Y - A\theta - B_x \begin{pmatrix} g(x)\\ g'(x)\end{pmatrix}\right)' K_{x,h}\left(Y - A\theta - B_x\begin{pmatrix} g(x)\\ g'(x)\end{pmatrix}\right) d\nu(x), \tag{3.1}$$
for some suitable measure $\nu$ concentrated on $\mathcal X$, where $\theta_1 = 0$, $Y = (Y_{i,t})_{i=1,\ldots,n_t,\,t=1,\ldots,T}$, while $A$ and $B_x$ are suitable 'design matrices' of dimension $N \times (T-1)$ and $N \times 2$ respectively, where $N = \sum_{t=1}^T n_t$. The rows of $B_x$ are of the form $[1 \;\; X_{i,t}-x]$, and the typical row of $A$ has a 1 in the $(t-1)$st place and zeros elsewhere if the row corresponds to an observation $Y_{i,t}$ from the $t$th time period for $t = 2,\ldots,T$; rows corresponding to observations $Y_{i,1}$ from the first time period have all elements equal to 0. Finally, $K_{x,h}$ is the $N \times N$ diagonal matrix with diagonal elements $K((X_{i,t}-x)/h_t)$, where $h_t$ is a bandwidth sequence and $K$ is a kernel function.

For any fixed value of $\theta$ the integrated sum of squares is minimized by minimizing the integrand. This leads to
$$\begin{pmatrix}\hat g(x)\\ \hat g'(x)\end{pmatrix} = \left(B_x' K_{x,h} B_x\right)^{-1} B_x' K_{x,h}(Y - A\theta). \tag{3.2}$$
We note that this is just the (pooled) local linear regression of $Y - A\theta$ on $X$ at the point $x$. Plugging this expression into (3.1) yields
$$\int \left(Y - A\theta - W_{x,h}(Y - A\theta)\right)' K_{x,h}\left(Y - A\theta - W_{x,h}(Y - A\theta)\right) d\nu(x) = (Y - A\theta)' \int K_{x,h}(I_N - W_{x,h})\, d\nu(x)\,(Y - A\theta),$$
where $W_{x,h} = B_x\left(B_x' K_{x,h} B_x\right)^{-1} B_x' K_{x,h}$. Minimizing this as a function of $\theta$ yields the weighted least-squares estimator
$$\hat\theta = \left(\int A' K_{x,h}(I_N - W_{x,h})\, d\nu(x)\, A\right)^{-1} \int A' K_{x,h}(I_N - W_{x,h})\, d\nu(x)\, Y = \left(\int A' K_{x,h}(I_N - W_{x,h}) A\, d\nu(x)\right)^{-1} \int A' K_{x,h}(I_N - W_{x,h}) Y\, d\nu(x). \tag{3.3}$$
The integrals in the matrices in (3.3) are one-dimensional and can be computed by standard numerical integration routines. Moreover, simple expressions can be given for the matrices $A' K_{x,h}(I_N - W_{x,h})A$ (see (A.8)) and $A' K_{x,h}(I_N - W_{x,h})Y$. Plugging (3.3) into (3.2) then gives the estimator of $g(x)$ (and $g'(x)$). It is worth noting that having derived the estimator of $\theta$ and the estimator of $g$ as the solution to a least-squares problem does not prevent us from using different $x$'s or another set of bandwidths (or even another choice of kernels) in the final estimation of $g$. This may be quite useful in some situations, perhaps especially when predicting future observations $Y_{T+s}$ corresponding to a new covariate value $x$.
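Since both $\hat\theta$ and $\hat g$ are available in closed form, the whole procedure can be coded directly. The sketch below approximates the $d\nu$-integrals in (3.3) on an equally spaced grid (so $\nu$ has a uniform density on the grid's range), uses an Epanechnikov kernel and a single bandwidth $h$ for all periods, and assumes the arrays X, Y, period from the simulation sketch above; all function names are ours.

```python
import numpy as np

def kernel(u):
    # Epanechnikov kernel: symmetric, second order, compact support.
    return np.where(np.abs(u) <= 1, 0.75 * (1.0 - u**2), 0.0)

def fit_latent_panel(X, Y, period, T, h, grid):
    """Closed-form estimators (3.2)-(3.3) with theta_1 = 0 and a common bandwidth."""
    N = len(Y)
    A = np.zeros((N, T - 1))             # rows from (0-based) period t >= 1 mark column t-1
    late = period >= 1
    A[np.where(late)[0], period[late] - 1] = 1.0

    S = np.zeros((T - 1, T - 1))         # accumulates int A'K(I-W)A dnu(x)
    v = np.zeros(T - 1)                  # accumulates int A'K(I-W)Y dnu(x)
    w = (grid[1] - grid[0]) / (grid[-1] - grid[0])   # dx times uniform density
    for x in grid:
        k = kernel((X - x) / h)
        B = np.column_stack([np.ones(N), X - x])
        BKB = B.T @ (k[:, None] * B)
        KA, KY = k[:, None] * A, k * Y
        BKA, BKY = B.T @ KA, B.T @ KY
        # A'K(I-W)A = A'KA - (A'KB)(B'KB)^{-1}(B'KA), and similarly for Y,
        # so the N x N smoother matrix W_{x,h} is never formed explicitly.
        S += (A.T @ KA - BKA.T @ np.linalg.solve(BKB, BKA)) * w
        v += (A.T @ KY - BKA.T @ np.linalg.solve(BKB, BKY)) * w
    theta_hat = np.concatenate(([0.0], np.linalg.solve(S, v)))   # theta_1 = 0

    resid = Y - theta_hat[period]        # plug theta_hat back into (3.2)
    g_hat = np.empty(len(grid))
    for m, x in enumerate(grid):
        k = kernel((X - x) / h)
        B = np.column_stack([np.ones(N), X - x])
        coef = np.linalg.solve(B.T @ (k[:, None] * B), B.T @ (k * resid))
        g_hat[m] = coef[0]               # local linear intercept = g_hat(x)
    return theta_hat, g_hat
```

For example, theta_hat, g_hat = fit_latent_panel(X, Y, period, T, h=0.25, grid=np.linspace(-0.9, 0.9, 61)) recovers both components in one pass over the grid.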
4. ASYMPTOTIC RESULTS

In this section we give some asymptotic properties of our estimators. Our main focus is on the estimation of the latent time series $(\theta_t)_t$, but we also provide results for the estimator of $g$. The first properties we give hold for the large-$N$, fixed-$T$ case. Here we give the joint asymptotic distribution of the estimation error for the time series. We then give some results for the case where both quantities grow. Here the focus is on sufficient conditions that allow us to apply standard asymptotic results from time series theory to the estimated time series.

4.1. Asymptotic results when T is fixed

We use the following regularity conditions, which as usual are sufficient but not necessary for our results.

ASSUMPTION 4.1.
(1) Suppose that $X_{i,t}$ are independent across $i$ and $t$, and identically distributed across $i$, while $u_{i,t} = \sigma_t(X_{i,t})\varepsilon_{i,t}$, where $\varepsilon_{i,t}$ are i.i.d. with mean zero and variance one and independent of $X_{i,t}$.
(2) Suppose that $d\nu(x) = \omega(x)\,dx$ for some density $\omega$ and that $\nu$ has compact support $\mathcal X$.
(3) Suppose that $g$ is twice differentiable on the compact set $\mathcal X \subset \cap_t \{x : f_t(x) > 0\}$ and satisfies $|g''(x) - g''(y)| \le C|x - y|$ for some constant $C$. The marginal densities $f_t$ are (uniformly over $t$) continuous and strictly positive throughout $\mathcal X$. The conditional variance functions $\sigma_t^2$ are (uniformly over $t$) continuous and strictly positive throughout $\mathcal X$.
(4) Suppose that $K$ is a Lipschitz-continuous density function symmetric about zero (a second-order kernel) with compact support. Define $\|K\|_2^2 = \int K(u)^2\,du$ and $\mu_j(K) = \int K(u)\,u^j\,du$.
(5) Suppose that $N = \sum_{s=1}^T n_s \to \infty$ such that $n_t/N \to \lambda_t \in [\underline\lambda, \bar\lambda] \subset (0,\infty)$ for each $t = 1,\ldots,T$.
(6) There exists a sequence $h = h(N)$ such that $h_t/h \to b_t$, where $b_t \in [\underline b, \bar b] \subset (0,\infty)$ for all $t$, while $h \to 0$ and $Nh^5 \to 0$.
We have maintained strong assumptions with regard to the errors. In principle, one can allow both cross-sectional dependence and time series dependence in the errors, and most of our results go through, with some modification of the limiting variances in some cases. However, note that the model itself induces cross-sectional and time series dependence in $Y_{i,t}$. We are assuming that the number of observations in each time period is of similar magnitude; this can be weakened but at the expense of a more complicated theory. It seems like a reasonable assumption to make here. In Assumption 4.1(6), $h$ may be chosen to be any of the bandwidths $h_1, \ldots, h_T$. Note that since the distribution of covariates and errors may differ from time period to time period, it may in practice be very useful to have different bandwidths in each time period. The other assumptions are quite standard in the non-parametric literature. In the setup of this section, where $T$ is fixed, the uniformity in $t$ required in Assumption 4.1(3) is just an assumption for each $t$. However, we will use the assumption again in Section 5, where $T \to \infty$ and some sort of uniformity is required.

We need to define some quantities that are important in the results. Define the $(T-1) \times (T-1)$ matrix $D(x)$ with elements
$$D(x)_{t,t'} = \begin{cases} f_{t+1}(x)\left(1 - \dfrac{\lambda_{t+1} b_{t+1} f_{t+1}(x)}{\sum_{s=1}^T \lambda_s b_s f_s(x)}\right) & \text{if } t = t', \\[2ex] -\dfrac{\sqrt{\lambda_{t+1} b_{t+1}}\, f_{t+1}(x)\, \sqrt{\lambda_{t'+1} b_{t'+1}}\, f_{t'+1}(x)}{\sum_{s=1}^T \lambda_s b_s f_s(x)} & \text{if } t \ne t'. \end{cases} \tag{4.1}$$
Under Assumptions 4.1(3) and (5), the matrix $D(x)$ is strictly positive definite for $x \in \mathcal X$: if we let $v$ be the $(T-1)$-vector with elements $v_2, v_3, \ldots, v_T$, where $v_t = \lambda_t b_t f_t(x)$ for $t = 1, \ldots, T$, and let $V = \operatorname{diag}(v)$, then $D(x)$ may be written as $\Lambda^{-1/2} B^{-1/2} \bar D(x) B^{-1/2} \Lambda^{-1/2}$, where $\Lambda = \operatorname{diag}\{\lambda_2, \ldots, \lambda_T\}$, $B = \operatorname{diag}\{b_2, \ldots, b_T\}$, and
$$\bar D(x) = V - \frac{1}{\sum_{s=1}^T v_s}\, v v'.$$
We note that
$$\bar D(x)^{-1} = V^{-1} + \frac{1}{v_1}\, i_{T-1} i_{T-1}',$$
which can easily be checked; see Berry et al. (2004) for some results on this type of matrix. In particular, $\bar D(x)^{-1}$, and therefore also $\bar D(x)$ and $D(x)$, are strictly positive definite. Define also the $(T-1) \times T$ matrix
$$\bar C(x) = \left[\,0 \mid I_{T-1}\,\right] - \frac{v\, i_T'}{\sum_{s=1}^T v_s}, \tag{4.2}$$
where $i_T = (1, 1, \ldots, 1)' \in \mathbb R^T$. Then let $C(x) = \Lambda^{-1/2} B^{-1/2} \bar C(x)\, \bar\Lambda^{1/2} \bar B^{1/2}$, where $\bar\Lambda = \operatorname{diag}\{\lambda_1, \ldots, \lambda_T\}$, $\bar B = \operatorname{diag}\{b_1, \ldots, b_T\}$, and define
$$\Sigma(x) = C(x)\, \bar\Sigma(x)\, C(x)', \tag{4.3}$$
where $\bar\Sigma(x) = \operatorname{diag}\{\sigma_1^2(x) f_1(x), \ldots, \sigma_T^2(x) f_T(x)\}$.
Let $\Lambda_T = \operatorname{diag}\{n_2, \ldots, n_T\}$ and $H_T = \operatorname{diag}\{h_2, \ldots, h_T\}$.

THEOREM 4.1. Suppose that Assumption 4.1 holds. Then
$$\Lambda_T^{1/2}\left(\hat\theta - \theta\right) \stackrel{d}{\longrightarrow} N\!\left(0, \Delta^{-1}\, \Sigma_\omega\, \Delta^{-1}\right), \tag{4.4}$$
where $\Delta = B^{-1/2}\int D(x)\,\omega(x)\,dx\; B^{1/2}$ and $\Sigma_\omega = \int \Sigma(x)\,\omega(x)^2\,dx$.
The asymptotic variance is a bit unusual for a semi-parametric quantity in that the bandwidth constant matrix $B$ enters the limiting variance. This is due to the fact that we have allowed different bandwidths in each time period; with a single choice of bandwidth this term cancels out. We discuss the form of the limiting variance further below. Consistent standard errors can be obtained by estimating the unknown quantities in the asymptotic variance by consistent estimators. A simpler approach is to work off the leading terms in the asymptotic expansion of the estimator as follows. Let
$$\hat\Delta = H_T^{-1/2}\Lambda_T^{-1/2}\int A' K_{x,h}(I_N - W_{x,h}) A\,\omega(x)\,dx\;\Lambda_T^{-1/2} H_T^{-1/2}, \qquad \hat W = H_T^{-1}\Lambda_T^{-1/2}\int A' K_{x,h}(I_N - W_{x,h})\,\omega(x)\,dx,$$
$$\hat\Sigma_\omega = \hat W\, \operatorname{diag}\{\hat u_{i,t}^2\}\, \hat W',$$
where $\hat u_{i,t} = Y_{i,t} - \hat\theta_t - \hat g(X_{i,t})$ are non-parametric residuals.

We conclude this section with a discussion of the limiting variance (4.4). Consider the special case where $\sigma_t^2(x) = \sigma^2(x)$ for all $t$, $f_t(x) = f(x)$ for all $t$, and $\lambda_t = 1/T$. Then
$$\Delta^{-1}\Sigma_\omega\Delta^{-1} = \frac{\int \sigma^2(x) f(x)\,\omega(x)^2\,dx}{\left(\int f(x)\,\omega(x)\,dx\right)^2}\left(I_{T-1} + i_{T-1} i_{T-1}'\right).$$
If we knew the function $g$, then we would estimate $\theta_t$ by
$$\tilde\theta_t = \frac{1}{n_t}\sum_{i=1}^{n_t}\left(Y_{i,t} - g(X_{i,t})\right), \qquad t = 1, \ldots, T, \tag{4.5}$$
which satisfies
$$\left(\sqrt{n_1}\left(\tilde\theta_1 - \theta_1\right), \ldots, \sqrt{n_T}\left(\tilde\theta_T - \theta_T\right)\right)' \stackrel{d}{\longrightarrow} N(0, \Omega),$$
where $\Omega = \operatorname{diag}\left(\int \sigma_t^2(x) f_t(x)\,dx\right)$. In the special case considered above, $\Omega = \int \sigma^2(x) f(x)\,dx\; I_T$. Of course this is an unfair comparison in view of the identification issue. If instead of knowing $g$ we know $g$ up to an additive constant $\alpha$, (4.5) would estimate $\theta_t + \alpha$ instead of $\theta_t$. Assuming as above that $\theta_1 = 0$, we would estimate $\theta_t$ by
$$\tilde\theta_t = \frac{1}{n_t}\sum_{i=1}^{n_t}\left(Y_{i,t} - g(X_{i,t})\right) - \frac{1}{n_1}\sum_{i=1}^{n_1}\left(Y_{i,1} - g(X_{i,1})\right), \qquad t = 2, \ldots, T,$$
with asymptotic distribution (in the special case)
$$\Lambda_T^{1/2}\left(\tilde\theta - \theta\right) \stackrel{d}{\longrightarrow} N\!\left(0, \int \sigma^2(x) f(x)\,dx\,\left(I_{T-1} + i_{T-1} i_{T-1}'\right)\right).$$
Observe that we may get arbitrarily close to this asymptotic variance by choosing $\mathcal X$ to be a large compact subset of $\{x : f(x) > 0\}$ and letting $\omega(x) = 1$ in Theorem 4.1. Thus, the lack of efficiency of our estimator of $\theta_t$ is due more to the unidentifiability than to the unknown regression function $g$.

It follows from Theorem 4.1 that we can write $\hat\theta = \theta + (\hat\theta - \theta)$, where the two terms on the right-hand side are asymptotically independent and the latter term is asymptotically $N\!\left(0, \Lambda_T^{-1/2}\Delta^{-1}\Sigma_\omega\Delta^{-1}\Lambda_T^{-1/2}\right)$-distributed. Hence, when $n_t$ is large we may either model the estimated time series and from this derive a model for the latent time series, or—if $n_t$ is sufficiently large so that the prediction error is negligible—use the estimated time series as if it were the latent time series.

4.2. Asymptotics for the estimator of g

THEOREM 4.2. Suppose that Assumption 4.1 holds. Then
$$\sqrt{Nh}\left(\hat g(x) - g(x) - \frac{h^2}{2}\,\mu_2\,\frac{\sum_{t=1}^T \lambda_t b_t^3 f_t(x)}{\sum_{t=1}^T \lambda_t b_t f_t(x)}\, g''(x)\right) \stackrel{d}{\longrightarrow} N\!\left(0, \|K\|_2^2\,\frac{\sum_{s=1}^T \lambda_s b_s \sigma_s^2(x) f_s(x)}{\left(\sum_{s=1}^T \lambda_s b_s f_s(x)\right)^2}\right),$$
provided $\sqrt{Nh}\,h^3 \to 0$ and $\sqrt{Nh}\,h^2 r_N \to 0$, where $r_N = \max_{s=1,\ldots,T}\left(h_s + \sqrt{\log n_s/(n_s h_s)}\right)$.

Consistent standard errors can be obtained by estimating the unknown quantities in the asymptotic variance in the usual way; see Fan and Gijbels (1996) and Fan and Yao (2003). In particular, we note that the constants $b_1, \ldots, b_T$ and $\lambda_1, \ldots, \lambda_T$ may in practice be replaced by $h_t/h$ and $n_t/N$, respectively. If we knew the process $\theta_t$ we would estimate the function $g$ from the pooled non-parametric regression of $Y_{i,t} - \theta_t$ on $X_{i,t}$. This satisfies the same CLT. In the special case where $\sigma_t^2(x) = \sigma^2(x)$ for all $t$, $f_t(x) = f(x)$ for all $t$, and $\lambda_t = 1/T$, the asymptotic variance is $T\|K\|_2^2\,\sigma^2(x)/f(x)$.
5. TIME SERIES ANALYSIS

If one observed the time series $\theta_t$, $t = 1, \ldots, T$, where $T$ is large, the usual econometric approach would be to specify a model for it, thereby enabling description and forecasting. For example, suppose that $\theta_t$ follows an ARIMA($p, d, q$) process with slowly varying mean,
$$A(L)(1 - L)^d\, \theta_t = \mu(t/T) + B(L)\,\sigma(t/T)\,\zeta_t,$$
where $\mu(\cdot)$ and $\sigma(\cdot)$ are smooth functions on $[0, 1]$, $\zeta_t$ is a white noise process, while $A(L) = \sum_{j=0}^p a_j L^j$ and $B(L) = \sum_{j=0}^q b_j L^j$ are lag polynomials with roots outside the unit circle. Here, $d$ is an integer denoting the order of non-stationarity. This is a convenient class of models for forecasting; it is just one (quite general) class of discrete-time models that allows a certain type of non-stationary behaviour, and others can be contemplated. The properties of estimators in such models generally rely on a long time series, so that $T \to \infty$.
Our previous results can be formally extended to this case, although in an extension of Theorem 4.1 one would have to consider finite-dimensional linear combinations of the expanding parameter vector $(\theta_t)_t$. Instead, we address the issue of the impact of estimating the time series $\theta_t$ on inference about the parameters that govern its dynamic evolution. Hansen et al. (2004) consider the general problem of using estimated values in time series models. They prove a general result that provided
$$\sum_{t=2}^T \left(\hat\theta_t - \theta_t\right)^2 \stackrel{p}{\longrightarrow} 0 \tag{5.1}$$
as $T \to \infty$, then we may use the estimated time series as if it were the true unobserved time series, for instance in estimation and unit-root testing, in the sense that using the estimated values leads to the same asymptotic distribution (for $T \to \infty$) as if the true values were used. It is understood that the limits here are taken pathwise, so that $N$ and $T$ approach infinity at some rate. We next show that this property also holds in our case with a non-parametric covariate effect.

As we now consider the case of $T \to \infty$ and $\min_{s=1,\ldots,T} n_s \to \infty$, we need additional assumptions. When $T \to \infty$ we must have $n_t/N \to 0$, if not for all then at least for some $t$. Thus we need to replace part (5) of Assumption 4.1. A natural assumption would be to let all ratios $n_t/N$ go to 0 at the same rate. Hence we will assume:

ASSUMPTION 5.1.
(1) Suppose that $n_t \to \infty$ for each $t$ and $T \to \infty$ such that there exists a sequence $\{\lambda_s^*\}$, bounded away from zero and infinity, such that as $T \to \infty$
$$\sup_{s=1,\ldots,T}\left|\frac{n_s}{N} - \frac{\lambda_s^*}{T}\right| = o(1/T). \tag{5.2}$$
(2) For each $x$, $\sum_{s=1}^T \frac{n_s}{N} f_s(x)$ has a limit, $f(x)$ say, as $N \to \infty$.
(3) $\int \sum_{s=1}^T \sigma_s^2(x)\,\frac{n_s}{N}\, f_s(x)\,dx$ is bounded as $N \to \infty$.

Note that under Assumption 5.1(1), the limit in Assumption 5.1(2) may be rewritten as
$$f(x) = \lim_{T\to\infty}\frac{1}{T}\sum_{s=1}^T \lambda_s^* f_s(x).$$
Moreover, under Assumption 4.1(3), $f(x) > 0$ for $x \in \mathcal X$. Under Assumptions 4.1(3) and 5.1(2), a sufficient condition for Assumption 5.1(3) is that $\sigma_t(x)$ is bounded (in $t$). This latter condition is almost implied by Assumption 4.1(3).

THEOREM 5.1. Suppose that (1)–(4) and (6) of Assumption 4.1 and Assumption 5.1 hold, and that $\log N/(Nh) = o(1)$, $Th^2 = o(1)$ and $T/(\sqrt{N}h) = o(1)$ as $N \to \infty$. Then (5.1) holds.

This shows that the estimation of $\theta_t$ does not affect the limiting distribution of the estimators of the parameters of the time series process or the tests. This means that standard errors can be constructed as if the $\theta_t$ were observed. Furthermore, under the strong exogeneity assumption, we can factor the likelihood so that our two-step approach to estimation of the parameters of $\theta_t$ does not lose information. Note that our result does not make any assumptions about the properties of the process $\theta_t$.
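As an illustration of this second-stage analysis, the sketch below fits an AR(1) by least squares to the estimated series—the same $\hat\rho = \sum_t \hat\theta_t\hat\theta_{t-1}/\sum_t \hat\theta_{t-1}^2$ used in the Monte Carlo section—and iterates it forward; the forecasting recursion is our illustrative choice, not something prescribed by the theorem.

```python
import numpy as np

def ar1_fit_forecast(theta_hat, horizon=5):
    """Least-squares AR(1) on the estimated latent series, then iterate forward.
    Theorem 5.1 justifies treating theta_hat as if it were the true series."""
    y_lag, y = theta_hat[:-1], theta_hat[1:]
    rho = (y_lag @ y) / (y_lag @ y_lag)
    path = [theta_hat[-1]]
    for _ in range(horizon):
        path.append(rho * path[-1])          # point forecasts of theta_{T+s}
    return rho, np.array(path[1:])

# rho_hat, forecasts = ar1_fit_forecast(theta_hat)   # theta_hat from the Section 3 sketch
```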
REMARK 5.1. In this asymptotic framework, we can revise the result of Theorem 4.2. For any $x$ such that $\lim_{T\to\infty}\frac{1}{T}\sum_{s=1}^T \lambda_s^* b_s f_s(x) > 0$, we have
$$\sqrt{nTh}\left(\hat g(x) - g(x) - \frac{h^2}{2}\,\mu_2\,\frac{\sum_{t=1}^T \lambda_t^* b_t^3 f_t(x)}{\sum_{t=1}^T \lambda_t^* b_t f_t(x)}\, g''(x)\right) \stackrel{d}{\longrightarrow} N\!\left(0, \|K\|_2^2\,\frac{\lim_{T\to\infty}\frac{1}{T}\sum_{s=1}^T \lambda_s^* b_s \sigma_s^2(x) f_s(x)}{\left(\lim_{T\to\infty}\frac{1}{T}\sum_{s=1}^T \lambda_s^* b_s f_s(x)\right)^2}\right),$$
provided $h$ is chosen to be of order $(NT)^{-1/5}$.
6. NUMERICAL RESULTS

In this section we present the results of a small simulation experiment. We generated data from the design $y_{it} = \theta_t + x_{it} + u_{it}$, where $u_{it} \sim N(0,1)$, $x_{it} \sim U[-1,1]$, and $\theta_t = \theta_{t-1} + \eta_t$, where $\eta_t \sim N(0, 0.1)$ and $\theta_1 = 0$, with all random variables mutually independent. This results in the regression function and the time-varying component having similar scale in most cases; see below. We take $T \in \{20, 40, 80\}$ and $n = n_t \in \{50, 100, 200\}$. The bandwidth was chosen by a Silverman rule-of-thumb procedure, specifically $h = 1.06\,\hat\sigma\,(nT)^{-1/5}$, where $\hat\sigma$ was the sample standard deviation of the covariates. This bandwidth is exactly optimal for the integrated mean squared error of a kernel density estimator when the underlying density is Gaussian. Obviously, it is not optimal for the problem at hand. However, it is so widely used, simple to implement and relatively robust that we decided on using it here. This means that the performance we report can likely be improved on by using a more time-consuming method like least-squares cross-validation.

We evaluate several performance measures:
$$L_{T2}(\hat\theta) = E\left[\sum_{t=2}^T\left(\hat\theta_t - \theta_t\right)^2\right]; \qquad L_\infty(\hat\theta) = E\left[\max_{2\le t\le T}\left|\hat\theta_t - \theta_t\right|\right];$$
$$L_2(\hat g) = \frac{1}{J}\sum_{j=1}^J E\left[\left(\hat g(U_j) - g(U_j)\right)^2\right]; \qquad L_\infty(\hat g) = E\left[\max_{1\le j\le J}\left|\hat g(U_j) - g(U_j)\right|\right],$$
where $U_j \sim U[-1,1]$ independent of the data. The expectations are computed by averaging over 100 simulation draws. We also evaluate the performance of the least-squares estimator of the autoregressive coefficient, $\hat\rho = \sum_t \hat\theta_t\hat\theta_{t-1}/\sum_t \hat\theta_{t-1}^2$; we show the standard deviation and bias. Our results are given in Table 1.
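A minimal sketch of one replication of this design, reusing fit_latent_panel from the Section 3 sketch; for simplicity we evaluate $\hat g$ on a fixed grid rather than at freshly drawn points $U_j$, which is our simplification rather than the paper's protocol. Averaging these quantities over replications gives the entries of Table 1.

```python
import numpy as np

rng = np.random.default_rng(1)
n, T, J = 100, 20, 101
N = n * T
period = np.repeat(np.arange(T), n)
X = rng.uniform(-1.0, 1.0, N)
theta = np.concatenate(([0.0], np.cumsum(rng.normal(0.0, np.sqrt(0.1), T - 1))))
Y = theta[period] + X + rng.normal(0.0, 1.0, N)        # design: g(x) = x

h = 1.06 * X.std() * N ** (-1 / 5)                     # Silverman rule of thumb
grid = np.linspace(-0.9, 0.9, J)
theta_hat, g_hat = fit_latent_panel(X, Y, period, T, h, grid)

LT2 = np.sum((theta_hat[1:] - theta[1:]) ** 2)
Linf_theta = np.max(np.abs(theta_hat[1:] - theta[1:]))
L2_g = np.mean((g_hat - grid) ** 2)                    # true g(x) = x on the grid
Linf_g = np.max(np.abs(g_hat - grid))
rho_hat = theta_hat[1:] @ theta_hat[:-1] / (theta_hat[:-1] @ theta_hat[:-1])
```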
Table 1. Simulation results

  n     T    LT2(θ̂)   L∞(θ̂)   L2(ĝ)    L∞(ĝ)    bias(ρ̂)   std(ρ̂)
  50    20   0.3181   0.2757   0.0024   0.0878   −0.3022   0.2601
  50    40   0.6870   0.3215   0.0016   0.0722   −0.1589   0.1832
  50    80   1.5581   0.3720   0.0009   0.0535   −0.1220   0.1356
  100   20   0.1635   0.2045   0.0014   0.0667   −0.1434   0.2473
  100   40   0.3761   0.2336   0.0010   0.0544   −0.0931   0.1501
  100   80   0.8382   0.2811   0.0005   0.0415   −0.0466   0.0832
  200   20   0.0932   0.1540   0.0009   0.0553   −0.0645   0.1913
  200   40   0.2184   0.1789   0.0005   0.0434   −0.0374   0.1173
  200   80   0.5151   0.2159   0.0003   0.0326   −0.0254   0.0696

The performance of $\hat\theta$ clearly improves with $n$ and gets worse with $T$. Note, however, that $L_{T2}(\hat\theta)$ roughly doubles and $L_\infty(\hat\theta)$ increases by a factor $\sqrt 2$ whenever $T$ doubles, as Theorem 4.1 would predict. Our asymptotics in Section 5 refer to the case where $T(n) \to \infty$ as $n \to \infty$, and so one should ideally choose a path through these numbers. Our impression is that the results roughly correspond to the predictions of our asymptotics. The performance of $\hat g$ seems to be much better, and it improves with both $n$ and $T$. Regarding $\hat\rho$, performance seems to improve primarily with $T$ (as expected), but there is also some improvement as $n$ increases, which reflects the reduction of the estimation error associated with the first stage. Note that even when the time series is observed and not estimated as here, $\hat\rho$ is negatively biased in finite samples. Figure 1 shows a typical outcome.
Figure 1. Actual time series (solid line) with estimated series (circles) for a case with n = 200, T = 40. C The Author(s). Journal compilation C Royal Economic Society 2009.
7. CONCLUSIONS

We have established the theoretical properties of our estimation procedures for the quantities of interest in this semi-parametric model for large panels. The simulation results generally support our asymptotic arguments.

The model can be extended in various ways. If the observed covariates $X$ are multi-dimensional, our results go through provided we use multi-dimensional kernels and multi-dimensional local linear estimation. In some multivariate cases one may wish to impose additional structure on the function $g$, such as additivity, index structure or partial linearity. Our methodology provides consistent estimation of the unrestricted function; the additional structure may be imposed afterwards; see e.g. Linton and Nielsen (1995). In some applications, one may also be concerned about individual effects; see Hsiao (1986). For example, suppose that
$$Y_{i,t} = \alpha_i + \theta_t + g(X_{i,t}) + u_{i,t},$$
for some unobserved individual-specific effect $\alpha_i$. One can estimate the parameter vector $(\alpha_i)_i$ jointly with $(\theta_t)_t$ and $g(\cdot)$ by minimizing the re-defined sum of squared residuals in (3.1) subject to the constraint that $\sum_{i=1}^n \alpha_i = 0$. However, with a large cross-section this may be computationally demanding. Alternatively, either differencing or deviation from the full mean eliminates the nuisance parameters and reduces the model to something very similar to (2.1).
ACKNOWLEDGMENTS Oliver Linton wishes to thank the ESRC and Leverhulme foundations for financial support. This paper was partly written while he was a Universidad Carlos III de Madrid-Banco Santander Chair of Excellence, and he thanks them for financial support.
REFERENCES

Arellano, M. (2003). Panel Data Econometrics. Oxford: Oxford University Press.
Bai, J. (2003). Inferential theory for factor models of large dimension. Econometrica 71, 135–71.
Bai, J. (2004). Estimating cross-section common stochastic trends in nonstationary panel data. Journal of Econometrics 122, 137–83.
Bai, J. and S. Ng (2002). Determining the number of factors in approximate factor models. Econometrica 70, 191–221.
Berry, S., O. B. Linton and A. Pakes (2004). Limit theorems for estimating the parameters of differentiated product demand systems. Review of Economic Studies 71, 613–54.
Carter, L. R. and R. D. Lee (1992). Modelling and forecasting U.S. mortality. Journal of the American Statistical Association 87, 659–71.
Connor, G. and O. Linton (2002). Semiparametric estimation of a characteristic-based factor model of common stock returns. Journal of Empirical Finance 14, 694–717.
Fan, J. and I. Gijbels (1996). Local Polynomial Modelling and Its Applications. Boca Raton: Chapman and Hall.
Fan, J. and R. Li (2004). New estimation and model selection procedures for semiparametric modeling in longitudinal data analysis. Journal of the American Statistical Association 99, 710–23.
Fan, J. and Q. Yao (2003). Nonlinear Time Series Analysis. Berlin: Springer.
Fan, J., T. Huang and R. Li (2007). Analysis of longitudinal data with semiparametric estimation of covariance function. Journal of the American Statistical Association 102, 632–41.
Fengler, M. R., W. K. Härdle and E. Mammen (2007). A semiparametric factor model for implied volatility surface dynamics. Journal of Financial Econometrics 5, 189–218.
Hansen, L. H., B. Nielsen and J. P. Nielsen (2004). Two sided analysis of variance with a latent time series. Working Paper 2004-W25, Nuffield College, Oxford University.
Hsiao, C. (1986). Analysis of Panel Data. Cambridge: Cambridge University Press.
Jorgensen, D. W. (1986). Econometric methods for modelling producer behaviour. In Z. Griliches and M. Intrilligator (Eds.), Handbook of Econometrics, Volume 3, 1842–915. Amsterdam: North-Holland.
Kyriazidou, E. (1997). Estimation of a panel data sample selection model. Econometrica 65, 1335–64.
Linton, O. B. and J. P. Nielsen (1995). A kernel method of estimating structured nonparametric regression based on marginal integration. Biometrika 82, 93–100.
Mammen, E., B. Støve and D. Tjøstheim (2006). Nonparametric additive models for panels of time series. Working Paper, University of Mannheim.
Pesaran, M. H. (2006). Estimation and inference in large heterogeneous panels with a multifactor error structure. Econometrica 74, 967–1012.
Phillips, P. C. B. and H. R. Moon (1999). Linear regression limit theory for nonstationary panel data. Econometrica 67, 1057–113.
Porter, J. (1996). Nonparametric regression estimation for a flexible panel data model. Unpublished Ph.D. thesis, Department of Economics, MIT.
APPENDIX

A.1. Lemmas

We start by noting that
$$\hat\theta_x = \theta + \left(A' K_{x,h}(I_N - W_{x,h}) A\right)^{-1} A' K_{x,h}(I_N - W_{x,h})\, Y^*,$$
where $Y^* = Y - A\theta$ is the vector with elements
$$Y_{i,t}^* = g(X_{i,t}) + \sigma_t(X_{i,t})\varepsilon_{i,t} = g(x) + g'(x)(X_{i,t} - x) + \left[g(X_{i,t}) - g(x) - g'(x)(X_{i,t} - x)\right] + \sigma_t(X_{i,t})\varepsilon_{i,t}.$$
Moreover, as $(I_N - W_{x,h})B_x = 0$,
$$A' K_{x,h}(I_N - W_{x,h})\, Y^* = A' K_{x,h}(I_N - W_{x,h})\left(g - B_x\begin{pmatrix}g(x)\\ g'(x)\end{pmatrix}\right) + A' K_{x,h}(I_N - W_{x,h})\, u \tag{A.1}$$
with $g = (g(X_{i,t}))_{i,t}$. The first term on the far right is the 'bias term', the second is the 'variance term'. Therefore,
$$H_T^{1/2}\Lambda_T^{1/2}\left(\hat\theta_x - \theta\right) = \left(H_T^{-1/2}\Lambda_T^{-1/2} A' K_{x,h}(I_N - W_{x,h}) A\,\Lambda_T^{-1/2} H_T^{-1/2}\right)^{-1} H_T^{-1/2}\Lambda_T^{-1/2} A' K_{x,h}(I_N - W_{x,h})\left(g - B_x\begin{pmatrix}g(x)\\ g'(x)\end{pmatrix}\right)$$
$$\quad + \left(H_T^{-1/2}\Lambda_T^{-1/2} A' K_{x,h}(I_N - W_{x,h}) A\,\Lambda_T^{-1/2} H_T^{-1/2}\right)^{-1} H_T^{-1/2}\Lambda_T^{-1/2} A' K_{x,h}(I_N - W_{x,h})\, u.$$
To prove our results we need the following two lemmas.

LEMMA A.1. Suppose that Assumption 4.1 holds. Then
$$\sup_{x\in\mathcal X}\left\| H_T^{-1/2}\Lambda_T^{-1/2} A' K_{x,h}(I_N - W_{x,h}) A\,\Lambda_T^{-1/2} H_T^{-1/2} - D(x)\right\| = o_p(1),$$
where for a matrix $W$, $\|W\| = \left(\operatorname{tr}(W'W)\right)^{1/2}$.

Proof: Letting
$$s_{j,t}(x) = \sum_{i=1}^{n_t}\left(X_{i,t} - x\right)^j K\!\left(\frac{X_{i,t}-x}{h_t}\right), \qquad j \in \mathbb N_0,\ t = 1,\ldots,T,$$
it is well known (Fan and Yao, 2003, theorem 5.3) that as $n_t \to \infty$ and $h_t \to 0$ such that $n_t h_t/\log n_t \to \infty$,
$$s_{j,t}(x) = N h^{j+1}\cdot b_t^{j+1}\lambda_t f_t(x)\left(\mu_j + O_P(r_N)\right) \tag{A.2}$$
with $r_N = \max_{s=1,\ldots,T}\left(h_s + \sqrt{\log n_s/(n_s h_s)}\right)$; the $O_P$-term is uniform in $x \in \mathcal X$. Note that by Assumption 4.1(4), $\mu_0 = 1$ and $\mu_1 = 0$. Put $s_j(x) = \sum_{t=1}^T s_{j,t}(x)$, $j \in \mathbb N_0$. It follows that the $2\times 2$ matrix $B_x' K_{x,h} B_x$ is
$$\begin{pmatrix} s_0(x) & s_1(x)\\ s_1(x) & s_2(x)\end{pmatrix} = \begin{pmatrix} Nh\sum_{t=1}^T b_t\lambda_t f_t(x)\left(1 + O_P(r_N)\right) & Nh^2\sum_{t=1}^T b_t^2\lambda_t f_t(x)\, O_P(r_N) \\[1ex] Nh^2\sum_{t=1}^T b_t^2\lambda_t f_t(x)\, O_P(r_N) & \mu_2 Nh^3\sum_{t=1}^T b_t^3\lambda_t f_t(x)\left(1 + O_P(r_N)\right)\end{pmatrix}. \tag{A.3}$$
Next we see that for $t = 1, \ldots, T-1$ the $t$th row of $A' K_{x,h} B_x$ is
$$\left(s_{0,t+1}(x)\quad s_{1,t+1}(x)\right) = Nh\, b_{t+1}\lambda_{t+1} f_{t+1}(x)\left(1 + O_P(r_N)\quad h\, O_P(r_N)\right), \tag{A.4}$$
so that the $t$th row of $A' K_{x,h} B_x\left(B_x' K_{x,h} B_x\right)^{-1}$ is
$$\left(\frac{b_{t+1}\lambda_{t+1} f_{t+1}(x)}{\sum_{s=1}^T b_s\lambda_s f_s(x)}\left(1 + O_P(r_N)\right)\quad O_P(r_N/h)\right). \tag{A.5}$$
Combining (A.4) and (A.5), $A' K_{x,h} W_{x,h} A$ is a $(T-1)\times(T-1)$ matrix with $(t, t')$-element given by
$$Nh\,\frac{b_{t+1}\lambda_{t+1} f_{t+1}(x)\; b_{t'+1}\lambda_{t'+1} f_{t'+1}(x)}{\sum_{s=1}^T b_s\lambda_s f_s(x)}\left(1 + O_P(r_N)\right). \tag{A.6}$$
The $(t,t)$-element in the diagonal matrix $A' K_{x,h} A$ is $\sum_{i=1}^{n_{t+1}} K\!\left(\frac{X_{i,t+1}-x}{h_{t+1}}\right) = s_{0,t+1}(x)$. Hence the matrix $A' K_{x,h}(I_N - W_{x,h}) A$ is a $(T-1)\times(T-1)$ matrix with diagonal elements
$$Nh\, b_{t+1}\lambda_{t+1} f_{t+1}(x)\left(1 - \frac{b_{t+1}\lambda_{t+1} f_{t+1}(x)}{\sum_{s=1}^T b_s\lambda_s f_s(x)}\right)\left(1 + O_P(r_N)\right)$$
and off-diagonal elements given by (A.6). Pre- and post-multiplying by $H_T^{-1/2}\Lambda_T^{-1/2}$ gives the desired result. □

LEMMA A.2. Suppose that Assumption 4.1 holds. Then the $t$th element of $H_T^{-1/2}\Lambda_T^{-1/2} A' K_{x,h}(I_N - W_{x,h})\left(g - B_x\left[g(x)\ g'(x)\right]'\right)$ is $o_P(1)$ uniformly in $x \in \mathcal X$.

Proof: Under our assumptions
$$B_x' K_{x,h}\left(g - B_x\begin{pmatrix}g(x)\\ g'(x)\end{pmatrix}\right) = \frac{g''(x)}{2}\begin{pmatrix}s_2(x)\\ s_3(x)\end{pmatrix} + \begin{pmatrix}O(s_3^*(x))\\ O(s_4(x))\end{pmatrix} = Nh^3\sum_{t=1}^T \lambda_t b_t^3 f_t(x)\begin{pmatrix}\dfrac{\mu_2 g''(x)}{2} + O(h b_t\mu_3^*)\\[1.5ex] \dfrac{h b_t\mu_3 g''(x)}{2} + O(h^2 b_t^2\mu_4)\end{pmatrix}\times\left(1 + O_P(r_N)\right), \tag{A.7}$$
where $\mu_3^* = \int |z|^3 K(z)\,dz$ and
$$s_3^*(x) = \sum_{t=1}^T\sum_{i=1}^{n_t}\left|X_{i,t} - x\right|^3 K\!\left(\frac{X_{i,t}-x}{h_t}\right) = Nh^4\sum_{t=1}^T \lambda_t b_t^4 f_t(x)\,\mu_3^*\left(1 + O(r_N)\right).$$
Combining (A.5) and (A.7), the $t$th element of the vector $A' K_{x,h} W_{x,h}\left(g - B_x\left[g(x)\ g'(x)\right]'\right)$ is
$$Nh^3\mu_2\,\frac{g''(x)}{2}\,\lambda_{t+1} b_{t+1} f_{t+1}(x)\,\frac{\sum_{s=1}^T \lambda_s b_s^3 f_s(x)}{\sum_{s=1}^T \lambda_s b_s f_s(x)}\left(1 + O_P(r_N)\right).$$
Similarly, the $t$th element of $A' K_{x,h}\left(g - B_x\left[g(x)\ g'(x)\right]'\right)$ is
$$\frac{g''(x)}{2}\, s_{2,t+1}(x) + O\!\left(s_{3,t+1}^*(x)\right) = Nh^3\mu_2\,\lambda_{t+1} b_{t+1}^3 f_{t+1}(x)\,\frac{g''(x)}{2}\left(1 + O_P(r_N)\right).$$
Therefore, the $t$th element of the vector $A' K_{x,h}(I_N - W_{x,h})\left(g - B_x\left[g(x)\ g'(x)\right]'\right)$ is
$$Nh^3\mu_2\,\frac{g''(x)}{2}\,\lambda_{t+1} b_{t+1} f_{t+1}(x)\left(b_{t+1}^2 - \frac{\sum_{s=1}^T \lambda_s b_s^3 f_s(x)}{\sum_{s=1}^T \lambda_s b_s f_s(x)}\right)\left(1 + O_P(r_N)\right).$$
Pre-multiplying by the diagonal matrix $H_T^{-1/2}\Lambda_T^{-1/2}$ we get the desired result. □
Proof of Theorem 4.1: First, write
$$\Lambda_T^{1/2}\left(\hat\theta - \theta\right) = \left(H_T^{-1}\Lambda_T^{-1/2}\int A' K_{x,h}(I_N - W_{x,h}) A\,\omega(x)\,dx\;\Lambda_T^{-1/2}\right)^{-1} H_T^{-1}\Lambda_T^{-1/2}\int A' K_{x,h}(I_N - W_{x,h})\left(g - B_x\begin{pmatrix}g(x)\\ g'(x)\end{pmatrix}\right)\omega(x)\,dx$$
$$\quad + \left(H_T^{-1}\Lambda_T^{-1/2}\int A' K_{x,h}(I_N - W_{x,h}) A\,\omega(x)\,dx\;\Lambda_T^{-1/2}\right)^{-1} H_T^{-1}\Lambda_T^{-1/2}\int A' K_{x,h}(I_N - W_{x,h})\, u\,\omega(x)\,dx.$$
From Lemma A.1 and the hypothesis on $\omega$, we have
$$H_T^{-1/2}\Lambda_T^{-1/2}\int A' K_{x,h}(I_N - W_{x,h}) A\,\omega(x)\,dx\;\Lambda_T^{-1/2} H_T^{-1/2} = \int D(x)\,\omega(x)\,dx + o_p(1),$$
whence
$$H_T^{-1}\Lambda_T^{-1/2}\int A' K_{x,h}(I_N - W_{x,h}) A\,\omega(x)\,dx\;\Lambda_T^{-1/2} = B^{-1/2}\int D(x)\,\omega(x)\,dx\; B^{1/2} + o_p(1).$$
Moreover,
$$H_T^{-1}\Lambda_T^{-1/2}\int A' K_{x,h}(I_N - W_{x,h})\left(g - B_x\begin{pmatrix}g(x)\\ g'(x)\end{pmatrix}\right)\omega(x)\,dx = o_P(1)$$
by Lemma A.2. The $t$th element of $A' K_{x,h}(I_N - W_{x,h})\, u$ is (using (A.5))
$$\sum_{i=1}^{n_{t+1}} K\!\left(\frac{X_{i,t+1}-x}{h_{t+1}}\right)\sigma_{t+1}(X_{i,t+1})\varepsilon_{i,t+1} - \sum_{s=1}^T\sum_{i=1}^{n_s} K\!\left(\frac{X_{i,s}-x}{h_s}\right)(X_{i,s}-x)\,\sigma_s(X_{i,s})\varepsilon_{i,s}\; O_P(r_N/h)$$
$$\quad - \frac{b_{t+1}\lambda_{t+1} f_{t+1}(x)}{\sum_{s=1}^T b_s\lambda_s f_s(x)}\sum_{s=1}^T\sum_{i=1}^{n_s} K\!\left(\frac{X_{i,s}-x}{h_s}\right)\sigma_s(X_{i,s})\varepsilon_{i,s}\left(1 + O_P(r_N)\right).$$
It follows that
$$H_T^{-1}\Lambda_T^{-1/2}\int A' K_{x,h}(I_N - W_{x,h})\,\omega(x)\,dx\times u = \int \bar C(x)\begin{pmatrix}\dfrac{1}{\sqrt{n_1}\,h_1}\sum_{i=1}^{n_1} K\!\left(\dfrac{X_{i,1}-x}{h_1}\right)\sigma_1(X_{i,1})\varepsilon_{i,1}\\ \vdots\\ \dfrac{1}{\sqrt{n_T}\,h_T}\sum_{i=1}^{n_T} K\!\left(\dfrac{X_{i,T}-x}{h_T}\right)\sigma_T(X_{i,T})\varepsilon_{i,T}\end{pmatrix}\omega(x)\,dx\times\left(1 + O_P(r_N)\right).$$
Define $(\bar c_t(x))_{t=1,\ldots,T} = \gamma'\bar C(x)$ for arbitrary vectors $\gamma \in \mathbb R^{T-1}$. Then
$$\sum_{t=1}^T\frac{1}{\sqrt{n_t}}\sum_{i=1}^{n_t}\int \bar c_t(x)\,\frac{1}{h_t}K\!\left(\frac{X_{i,t}-x}{h_t}\right)\omega(x)\,dx\;\sigma_t(X_{i,t})\varepsilon_{i,t} = \sum_{t=1}^T\frac{1}{\sqrt{n_t}}\sum_{i=1}^{n_t}\bar c_t(X_{i,t})\,\omega(X_{i,t})\,\sigma_t(X_{i,t})\varepsilon_{i,t} + o_p(1),$$
by changing variables and dominated convergence. Using standard arguments the vector $Z_n = (Z_{n1}, \ldots, Z_{nT})'$, where
$$Z_{nt} = \frac{1}{\sqrt{n_t\int \sigma_t^2(x)\,\bar c_t^2(x)\,\omega^2(x)\, f_t(x)\,dx}}\sum_{i=1}^{n_t}\bar c_t(X_{i,t})\,\omega(X_{i,t})\,\sigma_t(X_{i,t})\varepsilon_{i,t}, \qquad t = 1, \ldots, T,$$
is jointly asymptotically normal with mean zero and identity variance-covariance matrix. It follows that
$$\gamma' H_T^{-1}\Lambda_T^{-1/2}\int A' K_{x,h}(I_N - W_{x,h})\,\omega(x)\,dx\times u = i_T'\Gamma_\gamma^{1/2} Z_n + o_p(1) \stackrel{d}{\longrightarrow} N\!\left(0, i_T'\Gamma_\gamma i_T\right),$$
where $\Gamma_\gamma = \operatorname{diag}\left\{\int \sigma_t^2(x)\,\bar c_t^2(x)\,\omega^2(x)\, f_t(x)\,dx\right\}$. Hence $i_T'\Gamma_\gamma i_T$ equals
$$\sum_{t=1}^T\int \sigma_t^2(x)\,\bar c_t^2(x)\,\omega^2(x)\, f_t(x)\,dx = \gamma'\int \bar C(x)\,\bar\Sigma(x)\,\bar C(x)'\,\omega^2(x)\,dx\;\gamma = \gamma'\int \Sigma(x)\,\omega^2(x)\,dx\;\gamma.$$
Therefore, by the Cramér–Wold device,
$$H_T^{-1}\Lambda_T^{-1/2}\int A' K_{x,h}(I_N - W_{x,h})\,\omega(x)\,dx\times u \stackrel{d}{\longrightarrow} N\!\left(0, \int \Sigma(x)\,\omega^2(x)\,dx\right).$$
The result follows. □
Proof of Theorem 4.2: Let
$$\beta_T(x) = \frac{1}{2}\,\mu_2\,\frac{\sum_{t=1}^T \lambda_t b_t^3 f_t(x)}{\sum_{t=1}^T \lambda_t b_t f_t(x)}\, g''(x).$$
We see that
$$\sqrt{Nh}\left(\hat g(x) - g(x) - h^2\beta_T(x)\right) = \sqrt{Nh}\,[1\ 0]\left(B_x' K_{x,h} B_x\right)^{-1} B_x' K_{x,h}(Y - A\theta) - \sqrt{Nh}\, h^2\beta_T(x)$$
$$\quad - \sqrt{Nh}\,[1\ 0]\left(B_x' K_{x,h} B_x\right)^{-1} B_x' K_{x,h} A\,\Lambda_T^{-1/2}\;\Lambda_T^{1/2}\left(\hat\theta - \theta\right) = \sqrt{Nh}\,[1\ 0]\left(B_x' K_{x,h} B_x\right)^{-1} B_x' K_{x,h}(Y - A\theta) - \sqrt{Nh}\, h^2\beta_T(x) + o_P(1)$$
using the results of the previous section and (A.5). Note that $[1\ 0]\left(B_x' K_{x,h} B_x\right)^{-1} B_x' K_{x,h}(Y - A\theta)$ is the pooled local linear regression estimator at $x$ based on the independent data $Y_{i,t}^* = g(X_{i,t}) + \sigma_t(X_{i,t})\varepsilon_{i,t}$ and the covariates $X_{i,t}$. We may rewrite this as
$$[1\ 0]\left(B_x' K_{x,h} B_x\right)^{-1} B_x' K_{x,h} Y^* = [1\ 0]\left(B_x' K_{x,h} B_x\right)^{-1} B_x' K_{x,h} B_x\begin{pmatrix}g(x)\\ g'(x)\end{pmatrix} + [1\ 0]\left(B_x' K_{x,h} B_x\right)^{-1} B_x' K_{x,h}\left(g - B_x\begin{pmatrix}g(x)\\ g'(x)\end{pmatrix}\right) + [1\ 0]\left(B_x' K_{x,h} B_x\right)^{-1} B_x' K_{x,h}\, u.$$
The first term is $g(x)$. To find the second term we note that by (A.2)
$$[1\ 0]\left(B_x' K_{x,h} B_x\right)^{-1} = \frac{1}{Nh}\,\frac{1}{\sum_{s=1}^T b_s\lambda_s f_s(x)}\left[\left(1 + O_P(r_N)\right)\quad \frac{1}{h}\, O_P(r_N)\right].$$
Using this and (A.7) the second term becomes
$$\frac{\mu_2 Nh^3\sum_{t=1}^T \lambda_t b_t^3 f_t(x)\,\frac{g''(x)}{2}}{Nh\sum_{s=1}^T b_s\lambda_s f_s(x)}\left(1 + O_P(r_N)\right) + O_P\!\left(\frac{Nh^4}{Nh}\, r_N\right) = h^2\beta_T(x) + o_P\!\left(1/\sqrt{Nh}\right).$$
The final term is
$$\frac{1}{Nh}\,\frac{1}{\sum_{s=1}^T b_s\lambda_s f_s(x)}\sum_{t=1}^T\sum_{i=1}^{n_t} K\!\left(\frac{X_{i,t}-x}{h_t}\right)\sigma_t(X_{i,t})\varepsilon_{i,t}\left(1 + O_P(r_N)\right) + O_P\!\left(r_N/\sqrt{Nh}\right),$$
which using standard arguments is easily shown to be asymptotically normal with mean 0 and variance
$$\frac{1}{Nh}\,\|K\|_2^2\,\frac{\sum_{s=1}^T \lambda_s b_s\sigma_s^2(x) f_s(x)}{\left(\sum_{s=1}^T \lambda_s b_s f_s(x)\right)^2}. \qquad\square$$

Proof of Theorem 5.1: First note that in the asymptotic set-up of Theorem 5.1
$$s_{j,t}(x) = \frac{Nh^{j+1}}{T}\, b_t^{j+1}\lambda_t^* f_t(x)\left(\mu_j + O_P(r_N)\right),$$
where
$$r_N = \max_{s=1,\ldots,T}\left(h_s + \sqrt{\frac{\log n_s}{n_s h_s}}\right) = O\!\left(h + \sqrt{\frac{\log N}{Nh}}\right).$$
We have
$$\sum_{t=2}^T\left(\hat\theta_t - \theta_t\right)^2 = \int (Y^*)'(I_N - W_{x,h})K_{x,h} A\,\omega(x)\,dx\left(\int A' K_{x,h}(I_N - W_{x,h})A\,\omega(x)\,dx\right)^{-1}\left(\int A' K_{x,h}(I_N - W_{x,h})A\,\omega(x)\,dx\right)^{-1}\int A' K_{x,h}(I_N - W_{x,h})\,Y^*\,\omega(x)\,dx$$
$$\quad \le \frac{1}{\zeta_N^2}\int (Y^*)'(I_N - W_{x,h})K_{x,h} A\,\omega(x)\,dx\;\int A' K_{x,h}(I_N - W_{x,h})\,Y^*\,\omega(x)\,dx,$$
where
$$\zeta_N = \inf_{z:\, z'z = 1}\ z'\int A' K_{x,h}(I_N - W_{x,h})A\,\omega(x)\,dx\; z$$
is the smallest eigenvalue of $\int A' K_{x,h}(I_N - W_{x,h})A\,\omega(x)\,dx$. This may be bounded from below by $\int \zeta_N(x)\,\omega(x)\,dx$ with
$$\zeta_N(x) = \inf_{z:\, z'z = 1}\ z' A' K_{x,h}(I_N - W_{x,h})A\, z,$$
the smallest eigenvalue of
$$A' K_{x,h}(I_N - W_{x,h})A = \operatorname{diag}(s_0(x)) - \left[s_0(x)\ s_1(x)\right]\begin{pmatrix}s_0(x) & s_1(x)\\ s_1(x) & s_2(x)\end{pmatrix}^{-1}\begin{pmatrix}s_0(x)'\\ s_1(x)'\end{pmatrix}, \tag{A.8}$$
where $s_0(x) = (s_{0,2}(x), s_{0,3}(x), \ldots, s_{0,T}(x))'$ and $s_1(x) = (s_{1,2}(x), s_{1,3}(x), \ldots, s_{1,T}(x))'$. Hence, we need to bound
$$z' A' K_{x,h}(I_N - W_{x,h})A\, z = \left(\sum_{t=2}^T z_t^2 s_{0,t}(x) - \frac{\left(\sum_{t=2}^T z_t s_{0,t}(x)\right)^2}{s_0(x) - s_1(x)^2/s_2(x)}\right) + \frac{2 s_1(x)\left(\sum_{t=2}^T z_t s_{0,t}(x)\right)\left(\sum_{t=2}^T z_t s_{1,t}(x)\right) - s_0(x)\left(\sum_{t=2}^T z_t s_{1,t}(x)\right)^2}{s_2(x)\,s_0(x) - s_1(x)^2} \tag{A.9}$$
away from 0. The first term of (A.9) may be re-written as
$$\frac{\left(\sum_{s=2}^T s_{0,s}(x)\right)^2\left[\sum_{t=2}^T z_t^2\,\frac{s_{0,t}(x)}{\sum_{s=2}^T s_{0,s}(x)} - \left(\sum_{t=2}^T z_t\,\frac{s_{0,t}(x)}{\sum_{s=2}^T s_{0,s}(x)}\right)^2\right]}{s_0(x) - s_1(x)^2/s_2(x)} + \frac{s_{0,1}(x)}{s_0(x) - s_1(x)^2/s_2(x)}\sum_{t=2}^T z_t^2 s_{0,t}(x).$$
Of these two terms, the first one is non-negative and the second may be bounded from below by
$$\frac{s_{0,1}(x)}{s_0(x) - s_1(x)^2/s_2(x)}\min_{s=2,\ldots,T} s_{0,s}(x) = \frac{Nh}{T^2}\,\frac{b_1\lambda_1^* f_1(x)}{\frac{1}{T}\sum_{t=1}^T b_t\lambda_t^* f_t(x)}\,\min_{t=2,\ldots,T} b_t\lambda_t^* f_t(x)\left(1 + O_P(r_N)\right).$$
The second term of (A.9) is of order
$$\frac{\left(Nh^2/T\cdot O_P(r_N)\right)^2\; Nh/T\cdot\left(1 + O_P(r_N)\right)}{N^2h^4/T^2\cdot\left(1 + O_P(r_N)\right)} = O_P\!\left(Nh\, r_N^2/T\right),$$
which is of smaller order than the lower bound for the first term. It now follows that
$$\frac{T^2}{Nh}\,\zeta_N \ge \int \frac{b_1\lambda_1^* f_1(x)}{\frac{1}{T}\sum_{t=1}^T b_t\lambda_t^* f_t(x)}\,\min_{t=2,\ldots,T} b_t\lambda_t^* f_t(x)\,\omega(x)\,dx\;\left(1 + O_P(r_N)\right) + O_P\!\left(r_N^2\right),$$
which is bounded away from 0 by Assumptions 4.1(3) and (6), and 5.1(1) and (2). Thus we need to show that
$$\frac{T}{Nh}\int (Y^*)'(I_N - W_{x,h})K_{x,h} A\,\omega(x)\,dx\;\;\frac{T}{Nh}\int A' K_{x,h}(I_N - W_{x,h})\,Y^*\,\omega(x)\,dx \tag{A.10}$$
is $o_P(1)$. Using (A.1) it suffices to bound
$$\left(\frac{T}{Nh}\int A' K_{x,h}(I_N - W_{x,h})\left(g - B_x\begin{pmatrix}g(x)\\ g'(x)\end{pmatrix}\right)\omega(x)\,dx\right)'\left(\frac{T}{Nh}\int A' K_{x,h}(I_N - W_{x,h})\left(g - B_x\begin{pmatrix}g(x)\\ g'(x)\end{pmatrix}\right)\omega(x)\,dx\right) \tag{A.11}$$
and
$$\left(\frac{T}{Nh}\int A' K_{x,h}(I_N - W_{x,h})\, u\,\omega(x)\,dx\right)'\left(\frac{T}{Nh}\int A' K_{x,h}(I_N - W_{x,h})\, u\,\omega(x)\,dx\right). \tag{A.12}$$
Applying Lemma A.2 we see that the $t$th element of the vector in (A.11) (the 'bias term') is $O_P(n_{t+1} h_{t+1}^2 r_N)$, so that (A.11) is
$$O_P\!\left(\frac{T^2\sum_{t=2}^T\left(n_t h_t^2 r_N\right)^2}{N^2 h^2}\right) = O_P\!\left(T^2 h^2\,\frac{\sum_{t=2}^T n_t^2}{N^2}\, r_N^2\right) = O_P\!\left(T h^2 r_N^2\right) = o_P(1).$$
For (A.12) (the 'variance term') we write
$$\int A' K_{x,h}(I_N - W_{x,h})\, u\,\omega(x)\,dx = \int C^*(x)\begin{pmatrix}\sum_{i=1}^{n_1}\sigma_1(X_{i,1})\, K\!\left(\frac{X_{i,1}-x}{h_1}\right)\varepsilon_{i,1}\\ \vdots\\ \sum_{i=1}^{n_T}\sigma_T(X_{i,T})\, K\!\left(\frac{X_{i,T}-x}{h_T}\right)\varepsilon_{i,T}\end{pmatrix}\omega(x)\,dx\times\left(1 + O_P(r_N)\right),$$
where $C^*(x) = \left[0 \mid I_{T-1}\right] - \frac{v^*\, i_T'}{\sum_{s=1}^T v_s^*}$, with $v^* = (v_s^*)_{s=2,\ldots,T}$ and $v_s^* = \lambda_s^* b_s f_s(x)$ for $s = 1, \ldots, T$. Ignoring the remainder term we get
$$\int\begin{pmatrix}\sum_{i=1}^{n_2}\sigma_2(X_{i,2})\, K\!\left(\frac{X_{i,2}-x}{h_2}\right)\varepsilon_{i,2}\\ \vdots\\ \sum_{i=1}^{n_T}\sigma_T(X_{i,T})\, K\!\left(\frac{X_{i,T}-x}{h_T}\right)\varepsilon_{i,T}\end{pmatrix}\omega(x)\,dx - \int \frac{v^*}{\sum_{s=1}^T v_s^*}\sum_{t=1}^T\sum_{i=1}^{n_t}\sigma_t(X_{i,t})\, K\!\left(\frac{X_{i,t}-x}{h_t}\right)\varepsilon_{i,t}\,\omega(x)\,dx,$$
and since all terms have expectation 0, it suffices to show that
$$\frac{T^2}{N^2 h^2}\sum_{t=2}^T \operatorname{Var}\!\left(\sum_{i=1}^{n_t}\sigma_t(X_{i,t})\int K\!\left(\frac{X_{i,t}-x}{h_t}\right)\omega(x)\,dx\cdot\varepsilon_{i,t}\right) \tag{A.13}$$
and
$$\frac{T^2}{N^2 h^2}\sum_{s=2}^T \operatorname{Var}\!\left(\sum_{t=1}^T\sum_{i=1}^{n_t}\sigma_t(X_{i,t})\varepsilon_{i,t}\int \frac{v_s^*}{\sum_{s'=1}^T v_{s'}^*}\, K\!\left(\frac{X_{i,t}-x}{h_t}\right)\omega(x)\,dx\right) \tag{A.14}$$
go to 0. Here (A.13) may be bounded as follows:
$$\frac{T^2}{N^2 h^2}\sum_{t=2}^T \operatorname{Var}\!\left(\sum_{i=1}^{n_t}\sigma_t(X_{i,t})\int K\!\left(\frac{X_{i,t}-x}{h_t}\right)\omega(x)\,dx\cdot\varepsilon_{i,t}\right) \le \text{const}\;\frac{T^2}{N h^2}\sum_{t=1}^T \frac{n_t}{N}\int \sigma_t^2(x)\, f_t(x)\,dx = O\!\left(\frac{T^2}{N h^2}\right)$$
using Assumption 5.1(3), whereas (A.14) may be bounded as follows:
$$\frac{T^2}{N^2 h^2}\sum_{s=2}^T \operatorname{Var}\!\left(\sum_{t=1}^T\sum_{i=1}^{n_t}\sigma_t(X_{i,t})\varepsilon_{i,t}\int \frac{v_s^*}{\sum_{s'=1}^T v_{s'}^*}\, K\!\left(\frac{X_{i,t}-x}{h_t}\right)\omega(x)\,dx\right) \le \text{const}\;\frac{T}{N h^2}\sum_{t=1}^T \frac{n_t}{N}\int \sigma_t^2(x)\, f_t(x)\,dx = O\!\left(\frac{T}{N h^2}\right). \qquad\square$$
The Econometrics Journal (2009), volume 12, pp. 208–231. doi: 10.1111/j.1368-423X.2009.00286.x
Blockwise generalized empirical likelihood inference for non-linear dynamic moment conditions models

FRANCESCO BRAVO†

†Department of Economics and Related Studies, University of York, York YO10 5DD, UK. E-mail: [email protected]

First version received: July 2007; final version accepted: March 2009
Summary: This paper shows how the blockwise generalized empirical likelihood method can be used to obtain valid asymptotic inference in non-linear dynamic moment conditions models for possibly non-stationary weakly dependent stochastic processes. The results of this paper can be used to construct test statistics for overidentifying moment restrictions, for additional moments, and for parametric restrictions expressed in mixed implicit and constraint form. Monte Carlo simulations seem to suggest that some of the proposed test statistics have competitive finite sample properties.

Keywords: Blocking techniques, GMM estimators, Near-epoch dependence, Non-linear hypotheses, Overidentifying restrictions.
1. INTRODUCTION

Since Hansen's (1982) seminal paper, generalized method of moments (GMM) has been widely used in empirical economics and empirical finance—see the special issue of the Journal of Business and Economic Statistics, 2002, and especially the monograph of Hall (2005) for a survey of recent applications and developments of GMM. There exists, however, Monte Carlo evidence, see e.g. the special issue of the Journal of Business and Economic Statistics, 1996, showing that GMM estimators may be badly biased in finite samples, and that exact and nominal sizes of associated test statistics are often very different. This has led to the development of a number of alternative asymptotically equivalent methods, including continuous updating (CU) GMM (Hansen et al., 1996), the so-called efficient bootstrap for GMM (Brown and Newey, 2002), empirical likelihood (EL) (Qin and Lawless, 1994; Kitamura, 1997b) and exponential tilting (ET) (Imbens, 1997; Kitamura and Stutzer, 1997; Smith, 1997; Imbens et al., 1998; among others). Smith (2009) (note that a version of the paper was available in 2001) generalizes and extends some of these earlier contributions for weakly dependent data using a kernel function smoothing approach. As shown by Newey and Smith (2004) and Smith (2009), all of these methods share a common structure, being examples of the generalized empirical likelihood (GEL) method originally introduced by Smith (1997) as a quasi-likelihood-based alternative to GMM. Thus, GEL provides a natural framework to analyse a large number of alternatives to GMM.
GEL estimators are also characterized by a number of appealing theoretical properties compared to their GMM-based counterparts. First, as shown by Newey and Smith (2004) (see also Anatolyev, 2005), the second-order bias of GEL estimators lacks some of the elements characterizing that of efficient GMM estimators. Second, GEL estimators do not require explicit estimation of the efficient metric in the GMM criterion function. These two features suggest that GEL estimators might be less prone to bias than GMM. Third, GEL is a likelihood-like method, naturally allowing the construction of classical-type statistics such as likelihood ratio, score and Wald for overidentifying moment conditions, additional moment conditions and parametric restrictions.

This paper proposes to use blockwise GEL (BGEL) in the context of non-linear dynamic moment conditions models. The blocking technique, originally proposed for EL by Kitamura (1997b), preserves the dependence properties of the observations non-parametrically by appropriately choosing blocks of observations. This method is quite general and versatile, and can be used in situations where the parameter of interest is either from an unknown finite-dimensional distribution—as that considered in this paper—or from an unknown infinite-dimensional joint distribution (with the blocks-of-blocks procedure suggested by Politis and Romano, 1992).

This paper makes the following contributions. First, it shows that BGEL can be used to construct both misspecification and specification test statistics for non-linear dynamic moment conditions models of possibly non-stationary stochastic processes near-epoch dependent (NED) on an underlying mixing process. The NED condition is one of the most general and useful concepts of weak dependence for non-linear models that is available, and can be used to characterize a number of processes widely used in economics and finance, including autoregressive moving average (ARMA), autoregressive conditional heteroscedasticity (ARCH), generalized autoregressive conditional heteroscedasticity (GARCH), bilinear and threshold autoregressive processes. Thus the results of the paper generalize those of Kitamura (1997a,b), Smith (1997, 2009), Gregory et al. (2002) and Bravo (2005), among others. In particular, they are a direct extension of those of Kitamura (1997a) and Bravo (2005), who considered, respectively, blockwise EL test statistics for non-linear restrictions in moment conditions models with stationary strong mixing processes, and blockwise ET test statistics for non-linear restrictions in mixed form in linear regression models with stationary mixing processes. Allowing for non-stationarity is important because there exists large empirical evidence both in macroeconomics and finance documenting non-constant unconditional variances for a number of time series, including exchange rates, interest rates and international stock markets—see e.g. Pagan and Schwert (1990), Loretan and Phillips (1994) and Watson (1999). Therefore, the results of the paper could potentially be applied to a number of macroeconomic and international finance dynamic stochastic models.
For example, they could be used in the cash-in-advance model of exchange rate dynamics of Grilli and Roubini (1992), in the money-in-utility-function model for real balances demand of Holman (1998), in the money-in-utility-function model for currency substitution of Imrohoroglu (1994), in the non-linear expectations model of the term structure of Lee (1989), and in the non-linear uncovered parity model of Flood and Marron (2000) and Sarantis (2006).

Second, this paper provides Monte Carlo evidence about the finite sample properties of a number of GEL-based analogues to Hansen's (1982) J-statistic for overidentifying restrictions. We focus on the J-statistic partly because of its numerical simplicity, but, more importantly, because it has become the standard diagnostic test for model specification despite its well-documented finite sample overrejection problems. The model considered in the simulations is a non-linear dynamic instrumental variables regression where both the instruments and the
unobservable errors can potentially be non-stationary. We note that none of the above-mentioned papers on GEL can theoretically handle this model because of the non-stationarity. Furthermore, as far as we are aware, the Monte Carlo study of this paper is the first one assessing the finite sample impact of non-stationarity in the context of non-linear moment conditions models. Thus the Monte Carlo results of this paper are important because they provide new finite sample evidence, complementary to that, for example, of Gregory et al. (2002) and of Guggenberger and Smith (2008), about the effectiveness of GEL as an alternative to GMM in the context of non-stationary observations.

Third, this paper provides Monte Carlo evidence about the finite sample performance of the bootstrap J-statistic. The (block) bootstrap is a possible alternative to the methods of this paper. Goncalves and White (2004) show the asymptotic validity of the blockwise bootstrap for quasi-maximum likelihood estimators of non-linear dynamic models for the same type of NED processes considered in this paper. They also show the validity of suitable bootstrap analogues of Wald and Lagrange multiplier statistics for testing non-linear restrictions. Their results can be readily adapted to the dynamic non-linear moment conditions models considered in this paper to show the consistency of the resulting GMM estimators and related statistics. The Monte Carlo results seem to suggest that the bootstrap does not solve the finite sample problems of the J-statistic, especially for observations characterized by a high degree of persistence and certain forms of non-stationarity. This result is important because it shows that with non-stationary observations the bootstrap does not always provide the same type of accurate approximations as those given with stationary observations (see e.g. the Monte Carlo evidence provided by Goncalves and White, 2004).

It should be noted that this paper does not consider the important issue of weak identification—see e.g. Stock and Wright (2000)—which has received a great deal of interest in the econometric literature. Recently Otsu (2006) and Guggenberger and Smith (2008) have shown that it is possible to obtain valid asymptotic inference in the context of non-linear weakly identified dynamic moment conditions models using kernel-smoothed GEL-based test statistics. It is possible to show that the blocking method of this paper can be easily adapted to deal with weakly identified non-linear moment conditions models, and to construct blockwise analogues of the test statistics considered by Otsu (2006) and Guggenberger and Smith (2008).

The rest of the paper is structured as follows. Section 2 introduces the BGEL estimator. Sections 3 and 4, respectively, develop the necessary asymptotic theory and report the results of the Monte Carlo study. Section 5 contains some concluding remarks. All the proofs are contained in Appendix B.
2. BLOCKWISE GENERALIZED EMPIRICAL LIKELIHOOD

Let $\{z_{nt} : n, t \in \mathbb N\}$ denote an array of $\mathbb R^{d_z}$-valued random vectors defined on some probability space $(\Omega, \mathcal F, P)$. Let $\beta \in B \subset \mathbb R^k$ denote a parameter vector, and let $g(z_{nt}, \beta) : \mathbb R^{d_z} \times B \to \mathbb R^l$ ($l \ge k$) denote a vector of ($\mathcal F$\Borel-measurable for each $\beta \in B$) functions satisfying the moment condition
$$E[g(z_{nt}, \beta_0)] = 0 \quad \forall n, t, \tag{2.1}$$
where $\beta_0$ is the true unknown parameter.
BGEL for moment conditions models
211
Given an observed sample {z nt , t ≤ n, n ≥ 1}, a sequence of efficient GMM estimators βˆGMM := {βˆGMM,n : n ≥ 1} for β 0 is any sequence of random vectors such that ˆ ˜ −1 g( ˜ −1 g( ˆ n (β) ˆ n (β) ˆ βˆGMM ) := inf g(β) ˆ ˆ β), ˆ βˆGMM ) g( β∈B
ˆ with probability approaching 1 as n → ∞, where g(β) := nt=1 gnt (β)/n, gnt (β) = g(znt , β), ˜ is a consistent estimator of n (β0 ) := V [n1/2 g(β ˆ n (β) ˆ 0 )] with β˜ any preliminary n1/2 and consistent estimator. Under suitable regularity conditions, see e.g. Gallant and White (1988), it can be shown that a
n (β0 )−1/2 n1/2 (βˆGMM − β0 ) ∼ N (0, Ik ),
(2.2)
where n (β 0 ) := [G n (β 0 ) n (β 0 )−1 G n (β 0 )]−1 is the asymptotic covariance matrix of βˆGMM , a ˆ 0 )/∂β ] and ‘∼’ denotes asymptotically distributed as, see e.g. Gallant and Gn (β0 ) := E[∂ g(β White (1988, ch. 5). Note that G n (β 0 ) and n (β 0 ) are not assumed constant but may depend on n. This allows for fairly arbitrary heterogeneity in the sample—see Assumptions 3.2–3.4 and the related discussion in Section 3. 2.1. The BGEL estimator An alternative one-step method to estimate β 0 is to use GEL. A sequence of GEL estimators βˆGEL := {βˆGEL,n : n ≥ 1} for β 0 , as defined in Newey and Smith (2004), is any sequence of random vectors such that ˆ := inf sup Pˆρ (β, λ), Pˆρ (βˆGEL , λ) β∈B
(2.3)
ˆ n (β) λ∈
with probability approaching 1 as n → ∞, where Pˆρ (β, λ) = nt=1 ρ(λ gnt (β))/n, ρ(·) is a concave function on its domain V, an open interval containing 0, with derivatives ρ j (·) = ˆ n (β) := {λ : λ gnt (β) ∈ V, t ≤ n, n ≥ 1}. 1 Thus, the GEL estimator is the d j ρ(·)/d·, and solution to a saddle point problem, where the Rl -valued vector of unknown auxiliary (dual) parameters λ may be interpreted as a Lagrange multiplier for the sample moment condition n t=1 ρ1 (λ gnt (β))gnt (β) = 0. Special important cases of the GEL estimator include Owen’s (1988) EL for ρ(v) = log (1 − v) and V = (−∞, 1), Efron’s (1981) ET for ρ(v) = −exp(v) and all the members of the Cressie–Read family for ρ(v) = −(1 + γ v)(γ +1)/γ/(γ + 1) and γ ∈ R. When the observations are independent and identically distributed, Newey and Smith (2004) show that the GEL estimator is asymptotically normal with a covariance matrix equal to that of the efficient GMM estimator. With weakly dependent observations βˆGEL is still n1/2 consistent and asymptotically normal, but is less efficient than the efficient GMM estimator. More importantly GEL test statistics are no longer asymptotically chi-squared distributed. One way to solve this problem is to consider blocking techniques, as suggested by Kitamura (1997b). Alternatively, one can use kernel smoothing techniques, as suggested by Kitamura and Stutzer (1997) and Smith (1997, 2009), among others.
1 Sufficient conditions for the existence of a (measurable) sequence of such estimators are that Pˆ (β, λ), ˆ viewed as a function of × B → R, is continuous in β for each ω ∈ and is measurable for each fixed β ∈ B, and that B is compact. C The Author(s). Journal compilation C Royal Economic Society 2009.
212
F. Bravo
The idea behind the blocking techniques, which are also used in the bootstrap literature (see e.g. Politis and Romano, 1992), is to construct ‘new’ observations by considering blocks of the original observations, and base estimation and inference on the resulting sequence of blocks. This procedure preserves non-parametrically the dependent structure of the data, delivering therefore valid asymptotic inference. As in Kitamura (1997b), let l = l(n) and m = m(n) denote two integer functions of n such that 1 ≤ l ≤ m, and lim n→∞ m = ∞. Let b i,m,l = [z n,(i−1)l+1 , . . . , z n,(i−1)l+m ] be a block of m consecutive observations starting from n(i − 1)l + 1. Note that m is the block length and l is the separation between block starting points. Thus, if l = m the resulting sequence of blocks is non-overlapping, while if l = 1 it is fully overlapping. Define now the blockwise moment function ψ(bi,m,l , β) := ψni (β) =
m
g(zn,(i−1)l+j , β)/m,
(2.4)
j =1
and note that if (2.1) holds then E[ψ ni (β 0 )] = 0 ∀n, i. BGEL estimation and inference for β 0 is based on the BGEL criterion function Pˆρb (β, λ) :=
q
ρ(λ ψni (β))/q,
(2.5)
i=1
where q = (n − m)/l + 1 is the total number of blocks and · is the integer part function.
3. ASYMPTOTIC RESULTS 3.1. Asymptotic normality We begin this section with a set of regularity conditions sufficient for establishing consistency and asymptotic normality of the BGEL estimator: 2 A SSUMPTION 3.1. ρ(·) is twice continuously differentiable in an open neighbourhood of 0, and ρ k (0) = −1 for k = 1, 2. A SSUMPTION 3.2. (i) z nt is L 2 NED on the strong mixing process v t , (ii) v t is of size −2α/ (α − 2) where α > 2. A SSUMPTION 3.3. (i) The parameter space B is compact, (ii) β0 ∈ B is the unique solution to E[g nt (β)] = 0 ∀n, t, (iii) (a) g nt (β) is continuous a.s. on B ∀n, t, (b) g nt (β) is Lipschitz-L 1 a.s. on B ∀n, t, i.e. ∀β, β0 ∈ Bgnt (β) − gnt (β0 ) ≤ Lnt β − β0 a.s. where supn [ nt=1 E(Lnt )/n] = O(1), (c) E(supn,t supβ∈B gnt (β)3α ) < ∞, (d) g nt (β) is L 2 -NED on v t of size −2(α − 1)/(α − 2) uniformly on (B, κ) where κ is a convenient norm in Rk , (e) ˆ is O(1) and uniformly positive definite ∀β ∈ B. n (β) := V [n1/2 g(β)] A SSUMPTION 3.4. (i) β0 ∈ int(B), (ii) (a) g nt (β) is twice-continuously differentiable on B a.s. ∀n, t, (b) ∂g nt (β)/∂β and ∂ 2 g nt (β)/∂β ∂β j are both Lipschitz-L 1 a.s. on B ∀n, t (j = 1, 2 An array of possibly vector-valued random variables {x , n ∈ N, t ∈ Z} is L -NED on the stochastic basis {v , t ∈ Z} nt p t t+m t+m if (i) Ex nt p < ∞ ∀n, t and (ii) νm = supn,t xnt − E[xnt |Ft−m ]p → 0 as m → ∞, where Ft−m is the sigma field generated by v t=m , . . . , v t+m . If ν m = O(m−a−δ ) x nt is L p -NED of size −a. C The Author(s). Journal compilation C Royal Economic Society 2009.
BGEL for moment conditions models
213
. . . , k) (c) E(supn,t supβ∈B ∂gnt (β)/∂β 3α + ∂ 2 gnt (β)/∂β ∂βj α ) < ∞ (j = 1, . . . , k) (d) ∂g nt (β)/∂β is L 2 -NED on v t of size −2(α − 1)/(α − 2) uniformly on (B, κ) and ∂ 2 g t (β)/∂β ∂β j is L 2 -NED on v t of size −1/2 uniformly on (B, κ), (e) Gn (β0 ) := ˆ 0 )/∂β ] is O(1) and has uniformly full column rank. E[∂ g(β We now discuss the assumptions. Assumption 3.1 is standard in GEL literature (Newey and Smith, 2004). Assumption 3.2, together with 3.3(iii) and 3.4(ii)(d) and (e), allows for considerable dependency and heterogeneity in the sample. In particular, stationarity is not required, and thus processes with time varying as well as jumps in the unconditional variance are allowed. For example, processes of the form z nt = [w t , σ t ε t ] , where [w t , ε t ] is a vectorvalued weakly dependent process, ε t has zero mean and σ t is a (non-stochastic) strictly positive function ∀t, are explicitly allowed. These processes can be used to model linear and nonlinear dynamic regression models with non-stationary errors—see Phillips and Xu (2006) for a recent application to linear time-series models. Assumption 3.2 allows also for deterministically trending processes like those defined in Andrews and McDermott (1995). 3 On the other hand, unbounded deterministic trending processes as well as unit root processes are excluded by the dominance conditions (see Assumptions 3.3–3.4(c)). Assumptions 3.3 and 3.4 are sufficient to prove the consistency and asymptotic normality of the GEL estimator. They are stronger than those typically assumed in the GMM literature on non-linear dynamic models, because they require the existence and smoothness of the second derivative of the moment indicators (instead of just the first derivative), and the existence of the 3αth moment for the moment indicator and its first derivative (instead of the 2αth). However, Assumptions 3.3 and 3.4 allow us to use the same type of arguments used by Newey and Smith (2004) and Smith (2009), suitably adapted to possibly heterogeneous NED processes. In particular, we rely on the fact that blockwise moment indicators, their derivatives and covariances are asymptotically equivalent to the original ones provided the block size grows with the sample size at a certain rate. The results of this paper can then be obtained by using a standard uniform law of large numbers, central limit theorem and certain covariance inequalities for NED processes as given, respectively, by Gallant and White (1988, chs. 4 and 5) and Goncalves and White (2002). We note that, compared to the results of Smith (2009) and to a certain extent those of Kitamura (1997b), the results of this paper require significantly stronger regularity conditions in terms of more stringent mixing and moment conditions, as well as more smoothness (i.e. the Lipschitz condition) of the moment indicators and their first two derivatives. On the other hand, the mixing and moment conditions in Assumptions 3.2–3.4 can be weakened. For example, in the empirically relevant case of (possibly heterogeneous) strong mixing processes {zt , t ∈ N} Assumptions 3.3(iii)(c) and 3.4(ii)(c) can be weakened to E(supt,β∈B gt (β)2α ) < ∞, and E(supt,β∈B ∂gt (β)/∂β 2α + ∂ 2 gt (β)/∂β ∂βj α ) < ∞. Furthermore, if asymptotic 3 To be specific let z ∗ d nt = d(n t/n, v t ) where d(·, ·) : (0, ∞) → R and is a strong mixing process of size −2α/(α − 2) where α > 2, and let g st (β) : g(d(s, v t ), β) for s ∈ (0, n∗ ]. 
It is possible to show that under Assumptions 3.1–3.5 (and Assumptions 3.3 –3.4 , 3.3 of Section 3.2.2) all of the results of this paper are still valid using the same bounded trend asymptotics framework of Andrews and McDermott (1995), provided that we replace Assumption 3.2 with his Assumptions 3.2(a)–(b), 3.3(i)–(iii)(a), (iii)(d) with his assumptions 1(b), n∗ (c), (e), (f), 3(iii)(c), (iii)(e) with E(sups supβ∈B gst (β)3α ) < ∞, (β) := 0 V [gst (β)]ds/n∗ is positive 3α definite ∀β ∈ B, Assumption 3.4(ii)(c), (ii)(e) with E(sups supβ∈B ∂gst (β)/∂β + ∂ 2 gst (β)/∂β ∂βj α ) < ∞ (j = n∗ 1, . . . , k), G(β0 ) := 0 E[∂gst (β0 )/∂β ]ds/n∗ has full column rank. Similar modifications apply to Assumptions 3.3 , 3.3 and 3.4 .
C The Author(s). Journal compilation C Royal Economic Society 2009.
214
F. Bravo
stationarity is assumed, the mixing condition Assumption 3.2 can be weakened to z t being of size −α/(α − 2). It should also be noted that the results of this paper require the block size m to grow at the rate o(n1/2 ). This contrast with the results of both Kitamura and Stutzer (1997) and Smith (2009), in which the rate of growth is o(n1/2−ε ) and ε > 0 is related to the existence of certain moments of the moment indicators, and this has some interesting implications. For example, it is well known (see e.g. Politis and Romano, 1993) that for strong mixing processes the optimal (in terms of minimizing the asymptotic mean squared error) growth rate is O(n1/3 ). To achieve this rate both Kitamura (1997b) and Smith (2009) require the existence of at least six moments of the moment indicators, as opposed to the weaker 2α (α > 2) moments of this paper. The following theorem generalizes the results of Kitamura (1997b) and Smith (2009) to NED observations on an α-mixing process. T HEOREM 3.1. Assume Assumptions 3.1–3.4 hold. Then for m = o(n1/2 ) n (β0 )−1/2 0 Il 0 n1/2 (βˆGEL − β0 ) a ∼N , 1/2 ˆ 0 0 0 ϒn (β0 ) (n /m)λ
0 0
,
where n (β 0 )−1/2 is as in (2.2) and ϒ n (β 0 ) is a uniformly non-singular l × l matrix such that 0 Il−k 0 a 1/2 ˆ 0) ∼ N ϒn (β0 )n (β0 )n g(β , , 0 0 0 where n (β 0 ) = n (β 0 )−1 (I − G n (β 0 ) n (β 0 )G n (β 0 ) n (β 0 )−1 ). Let πˆ ni = ρ1 (λˆ ψni (βˆGEL ))
q
ρ1 (λˆ ψni (βˆGEL ))
(3.1)
i=1
denote the so-called implied (blockwise) probabilities. Estimators for the asymptotic covariance ˆ respectively, n (β 0 ) and n (β 0 ), can be constructed using matrices of βˆGEL and λ,
ˆ n (βˆGEL ) −1 , ˆ n (βˆGEL ) ˆ n (βˆGEL )−1 G ˆ n (βˆGEL ) = G
ˆ n (βˆGEL ) ˆ n (βˆGEL ) ˆ n (βˆGEL )−1 I − G ˆ n (βˆGEL )G ˆ n (βˆGEL )−1 , ˆ n (βˆGEL ) = ˆ ˆ ˆ where the blockwise sample analogues, that is n(·) = q n(·) and Gn(·) are either ˆ n(·) = q ∂ψni (·)/∂β q, or their blockwise implied probabilities m i=1 ψni (·)ψni (·)/q, and G i=1 ˆ nπˆ (·) obtained by replacing 1/q with πˆ ni . 4 ˆ nπˆ (·), G analogues The following theorem shows that both estimators can be used to obtain heteroscedasticity and autocorrelation (HAC)-consistent covariance matrix estimators that are alternative to the standard kernel based estimators typically used in the econometric literature (see e.g. Andrews, 1991; Newey and West, 1994). These estimators can be used to obtain blockwise versions of standard t- and Wald (or generalized Wald Szroeter, 1983) statistics for testing possibly nonlinear (implicit) hypotheses about β. 4 Alternative estimators for (β ) and (β ) can be based, respectively, on the upper left and lower right (multiplied n 0 n 0 ˆ ˆ by m) blocks of [∂ 2 Pˆρb (θ)/∂θ ∂θ ]−1 , where θ = [β , λ ] . The upper left block of [∂ 2 Pˆρb (θ)/∂θ ∂θ ]−1 can be interpreted as a generalization of the usual Hessian-based estimator for the covariance of maximum likelihood estimators in correctly specified parametric models. I would like to thank a referee for suggesting these estimators and the interpretation.
C The Author(s). Journal compilation C Royal Economic Society 2009.
BGEL for moment conditions models
215
T HEOREM 3.2. Under the same assumptions of Theorem 3.1 ˆ n (βˆGEL ) − n (β0 ) = op (1),
ˆ n (βˆGEL ) − n (β0 ) = op (1).
3.2. Inference In this section, we present a variety of classical-like BGEL test statistics for hypotheses tests in moment-based models as defined in (2.1). The statistics we consider are the BGEL distance (D ρ ), the Lagrange multiplier (LM ρ ), the score (S ρ ) and minimum chi-squared (MCρ ). D ρ statistics are based on differences in the BGEL criterion function between the unconstrained and constrained estimators. LM ρ and S ρ statistics are based on the deviations of the constrained parameters from values solving the unconstrained problem. Finally, MCρ statistics are based on differences between constrained and unconstrained BGEL estimators. Clearly, given the asymptotic equivalence between GMM and GEL estimators, all of these statistics may be evaluated at the efficient GMM estimator (or any other asymptotically equivalent estimator of β 0 ). 3.2.1. Overidentifying restrictions. We consider three statistics that can be used to assess the validity of the overidentifying moment conditions (2.1). As noted by Smith (1997), one can think of the validity of (2.1) as corresponding to the parametric restriction λ = 0. Thus classical-like BGEL statistics, similar to those suggested by Kitamura (1997b), Smith (1997, 2009) and Imbens et al. (1998) are ˆ − ρ(0)), LM ρ = (n/m2 )λˆ ˆ ˆ n (βˆGEL )λ, D ρ = 2cn (Pˆρb (βˆGEL , λ) q q ˆ n (βˆGEL )−1 Sρ = ψni (βˆGEL )/q 1/2 ψni (βˆGEL )/q 1/2 , i=1
(3.2)
i=1
ˆ n(·) is where c n = (q/mn) is a correction factor that account for the overlap in the blocks, as defined in Theorem 3.1, and serves as the generalized inverse of the estimated asymptotic ˆ covariance matrix of (n1/2 /m)λ. T HEOREM 3.3. Under the same assumptions of Theorem 1 and (2.1) a
D ρ , LM ρ , S ρ ∼ χ 2 (l − k). 3.2.2. Specification analysis. We consider as in Smith (1997) the same type of specifications tests based on additional moment conditions developed by Newey (1985). Let θ = [α , β ] where α is an Rp -valued vector of additional parameters, and suppose that there exists an Rs -valued (s ≤ p) vector of functions h(z nt , θ ) := h nt (θ ) satisfying E[hnt (θ0 )] = 0,
∀n, t.
(3.3)
The information contained in the additional set of moment conditions (3.3) can naturally be incorporated into BGEL estimation. To be specific let lnt (θ ) = [gnt (β) , hnt (θ ) ] C The Author(s). Journal compilation C Royal Economic Society 2009.
(3.4)
216
F. Bravo
denote the ‘augmented’moment function, and let n (θ0 ) = V [n1/2 lˆnt (θ0 )]. With a slight abuse of a notation, let ψni (θ ) = m j =1 ln,(i−1)l+j (θ )/m denote the blockwise version of l(·), and in analogy to (2.5) let Pˆρb (θ, λ, ϕ) =
q
a ρ(μ ψni (θ ))/q,
i=1
where μ = [λ , ϕ ] and ϕ is an Rs -valued vector of unknown auxiliary parameters associated with h nt (θ ). To establish the asymptotic normality of the resulting estimators we assume that A SSUMPTION 3.3 . (i) The parameter space = A × B is compact, (ii) θ 0 ∈ is the unique solution to E[l nt (θ )] = 0 ∀n, i, (iii) Assumption 3.3(iii)(a)–(e) hold with g tn (β) replaced by l nt (θ ). A SSUMPTION 3.4 . (i) θ 0 ∈ int(), (ii) Assumption 3.4(ii)(a)–(e) hold with g nt (β) replaced by l nt (θ ). As with the test statistics for overidentifying restrictions, classical-type test statistics for the additional moment conditions (3.3) may be constructed by imposing the restriction ϕ = 0 into the estimation of Pˆρb (θ, μ). 5 In addition to these test statistics, we consider a minimum chi-squared statistic based on the constrained and unconstrained estimators for μ. Let θ˜ and μ˜ = [λ˜ , 0 ] denote the restricted estimators of θ 0 and μ, and define
ˆ − Pˆρb (θ˜GEL , μ) ˜ , D ρ = 2cn Pˆρb (θˆGEL , μ) ˆ n (θˆGEL )(μ˜ − μ), ˆ ˆ MC ρ = (n/m2 )(μ˜ − μ) ˆ n (θ˜GEL )Sϕ ]−1 ϕ, LM ρ = (n/m2 )ϕ˜ [Sϕ ˜ Sρ =
q
ˆ n (θˆGEL )Sϕ sni (βˆGEL ) /q 1/2 Sϕ
i=1
q
sni (θˆGEL )/q 1/2 ,
i=1
ˆ n (θ˜GEL ) is a consistent estimator of where n (θ0 ) = n (θ0 )−1 (I − Ln (θ0 )n (θ0 )Ln (θ0 ) n (θ0 )−1 ), a ˆ n (θ0 ) = [Ln (θ0 ) n (θ0 )−1 Ln (θ0 )]−1 , sni (θˆGEL ) = ρ1 (λˆ ψni (βˆGEL ))ψni (θGEL ) and S ϕ = [0, I ] is a selection matrix such that S ϕ μ = ϕ.
T HEOREM 3.4. Assume Assumptions 3.1–3.2, and 3.3 –3.4 hold. Then under (3.3) a
D ρ , LM ρ , MC ρ , S ρ ∼ χ 2 (s). We now consider the following parametric null hypothesis expressed, as in Smith (1997), in the mixed implicit and constraint equation form q(α0 , β0 ) = 0,
r(α0 ) = 0,
(3.5)
5 Note that if one is interested in the full vector of moment conditions l (θ ) defined in (3.4) one can use exactly nt the same statistics D ρ , LM ρ and S ρ as in (3.2) with βˆGEL and λˆ replaced by θˆGEL and μ, ˆ respectively. Under the null hypothesis that E[l nt (θ 0 )] = 0 ∀n, t, the asymptotic distribution of the three test statistics is χ 2 (l + s − q − k). See the proof of Theorem 3.3 for more details.
C The Author(s). Journal compilation C Royal Economic Society 2009.
BGEL for moment conditions models
217
where q(·) and r(·) are Rq - and Rs -valued vectors of known functions and α is an Rs -valued vector of unknown parameters. The standard approach to deal with (3.5) is to define a constrained GEL estimator θ˜ = [β˜ , α˜ ] by optimizing the BGEL criterion subject to the restrictions. 6 A sequence of constrained BGEL estimators θ˜GEL := {θ˜GEL,n : n ≥ 1} for θ 0 , is any sequence of random vectors such that ˆ = inf sup Pˆρb (θ, λ) : q(α, β) = 0, r(α) = 0 Pˆρb (θ˜GEL , λ) θ∈
ˆ n (β) λ∈
with probability approaching 1 as n → ∞ where = A × B. To establish the asymptotic normality of the resulting estimators we assume that A SSUMPTION 3.3 . (i) The parameter space = A × B is compact, (ii) θ 0 ∈ is the unique solution to E[g nt (β)] = 0, q(θ ) = 0 and r(α) = 0 ∀n, t, Assumption 3.3(iii)(a)–(e) hold. A SSUMPTION 3.5. q(θ ) and r(α) are continuously differentiable functions of α and β in a neighbourhood N of θ 0 , and rank[∂q(α 0 , β 0 )/∂α ] = q and rank[∂r(α 0 )/∂α ] = s. The corresponding BGEL-based statistics for (3.5) are
ˆ − Pˆρb (θ, ˜ λ) ˜ , D ρ = 2cn Pˆρb (θˆ , λ) ˆ n (θ˜ )−1 Qβ (θ) ˜ ϕ, LM ρ = (n/m2 )ϕ˜ Qβ (θ˜ ) ˜ ˆ (Qβ (α, ˆ ˆ n (β) ˆ −1 Qβ (α, ˆ )−1 q(α, ˆ ˜ β) ˜ β) ˜ β) ˜ β), S ρ = nq(α, where Q β (·) = ∂q(·)/∂β . 7 T HEOREM 3.5. Assume that Assumptions 3.1–3.2, 3.3 and 3.4–3.5. Then under (2.1) and (3.5) a
D ρ , LM ρ , S ρ ∼ χ 2 (k + q − s).
4. MONTE CARLO EVIDENCE In this section, we consider instrumental variable estimation of the non-linear regression model ynt = exp(β10 + β20 xnt ) + unt , where we allow the regressor x nt and/or the unobservable error term u nt to be non-stationary weakly dependent processes. To be specific we assume that both x nt and u nt are stable autoregressive processes of order one, and use the same two specifications for the variance σ 2nt as those used by Phillips and Xu (2006), i.e.
σnt2 = σ12 + σ22 − σ12 I (t/n ≥ τ ) for τ ∈ (0, 1) (4.1)
σnt2 = σ12 + σ22 − σ12 (t/n).
6
Alternatively one can incorporate the restrictions directly into the BGEL criterion function as in Smith (1997, 2009). Other asymptotically equivalent Hausman-type test statistics, similar to those defined in Theorem 3.4, could be defined ˆ in terms of the differences λ˜ − λˆ and β˜ − β. 7
C The Author(s). Journal compilation C Royal Economic Society 2009.
218
F. Bravo
Both specifications are consistent with empirically relevant situations: the former corresponds to the case of an abrupt change in the variance (due e.g. to a sudden shock affecting the economy), the latter corresponds to the case of smooth trending variance (due e.g. to an economic cycle). The vector of instruments is w nt = [1, x nt , x n,t−1 , x n,t−2 ] so that the moment conditions model (2.1) is E[wnt (ynt − exp(β10 + β20 xnt ))] = 0.
(4.2)
To test the hypothesis that (4.2) is correctly specified we consider six test statistics for overidentifying restrictions: the BGEL distance D ρ as given in (3.2), the Lagrange multiplier LMπρˆ and score Sπρˆ , that is the implied probabilities analogues of LM ρ and S ρ based on πˆ ni defined in (3.1), Hansen’s (1982) J-statistic based on efficient GMM estimator βˆGMM , i.e. ˆ n (βˆGMM )−1 g( ˆ βˆGMM ), ˆ βˆGMM ) J = ng( and its bootstrapped version J ∗ . To implement the bootstrap we use the same blocks b i,m,l used in Section 2 to define the BGEL, and focus only on the fully overlapping scheme, i.e. l = 1. With this scheme the block bootstrap draws k = n/m blocks b∗i,m,1 randomly with replacement from the set of overlapping blocks [b 1,m,1 , . . . , b q,m,1 ] . Let g ∗ (b∗i,m,1 , β) := g ∗ni (β) denote the centred bootstrap moment indicators: centring is necessary here to obtain the asymptotic equivalence between the ∗ denote the efficient bootstrap GMM estimator, i.e. bootstrap and original J-statistic. Let βˆGMM any sequence of random vectors such that ∗ ∗ ˆ ∗n (β˜ ∗ )−1 gˆ ∗ (βˆGMM ˆ ∗n (β˜ ∗ )−1 gˆ ∗ (β) gˆ ∗ (βˆGMM ) ) := inf gˆ ∗ (β) β∈B
with bootstrap probability approaching 1 in probability as n → ∞, where β˜ ∗ is any preliminary n1/2 -consistent estimator bootstrap estimator. Then the bootstrap J ∗ -statistic is 8 ∗ ∗ ∗ ˆ ∗n (βˆGMM J ∗ = ngˆ ∗ (βˆGMM ) )−1 gˆ ∗ (βˆGMM ).
In the simulations we consider the ET (ρ = ET) and the Euclidean distance (ρ = EU) specifications of the BGEL criterion function. We chose the first specification because of its computational simplicity and numerical stability, while the second one was chosen because it effectively corresponds to Hansen et al. (1996) continuously updated GMM estimator. 9 To estimate n (·) we use the Newey–West estimator (Newey and West, 1987) for the J-statistic, and the block covariance with overlapping blocks (i.e. with l = 1) for the BGEL-based ρ ρ LMπˆ , Sπˆ and J ∗ -statistics. These estimators are asymptotically equivalent for m = o(n1/2 ) and have the same optimal length (bandwidth) parameter m∗ = γ n1/3 , for any choice of finite γ > 0. In the simulations, we consider the Newey and West (1994) non-parametric data-dependent method to choose γ . The method seems to perform reasonably well, even under non-stationarity. 8 The consistency of J ∗ can be shown using the same arguments as those used by Goncalves and White (2004). In ∗ particular, using their lemmas A2, A3 and A4 it is possible to show the consistency of βˆGMM and the asymptotic normality ∗ 1/2 ∗ ˆ of n gˆ (βGMM ) with probability approaching 1. Furthermore, a mean value expansion, their lemmas A4, A5, B2 and the ∗ ˆ ∗n (βˆ ∗ ), ˆ ∗n (β0 ) and E ∗ [ ˆ ∗∗ consistency of βˆGMM can be used to show that n (β0 )] converge in bootstrap probability, GMM ˆ ∗∗ ˆ ∗∗ ˆ∗ ˆ ∗n (β0 ), respectively to n (β0 ) and n (β 0 ) with probability approaching 1, where n (β0 ) equals n (β0 ) without the ˆ ∗n (βˆ ∗ )−1 converges in bootstrap probability to n (β 0 )−1 with probability 1, and the consistency of centring. Thus GMM J ∗ follows. 9 We also considered EL and obtained results that are qualitative very similar to those based on ET, and thus are not reported here. We note, however, that EL was numerically more unstable than ET.
C The Author(s). Journal compilation C Royal Economic Society 2009.
BGEL for moment conditions models
219
We set β 0 = [1, 0.3] , and specify two values for the autoregressive parameter ρ : ρ ∈ {0.4, 0.8}, which correspond to moderate and high persistence, respectively. For the variance specifications (4.1) we let σ 1 = 1 and consider σ = σ 1 /σ 2 ∈ {0.2, 0.5, 2, 5} to allow for both positive (σ < 1) and negative (σ > 1) changes in the variance, and τ ∈ {0.1, 0.5, 0.9} to allow for an early, mid and late break in the sample. The results are obtained using the S-Plus functions rnorm and arima.sim. The GMM estimator is computed using the S-Plus function ms with analytical first derivatives. The BGEL estimator is computed using a nested algorithm that uses a literal interpretation of the saddle point property of the estimator. The inner stage maximizes Pˆρb (β, λ) over λ for a fixed initial value of β. Let λ(β) be the maximizing value of λ. The outer stage minimizes Pˆρb (β, λ(β)) over β using the S-Plus function nlminb with analytical first derivatives and Hessian. As the initial value of β we use the same inefficient GMM estimate used to compute the efficient GMM estimator. The finite sample sizes are calculated using 0.05 asymptotic critical level for all 12 possible combinations of σ and τ for sample sizes n = 100 and 500 using 5000 Monte Carlo replications and 499 bootstrap replications for each Monte Carlo replication. Tables 1 and 2 report, respectively, the finite sample sizes of the six test statistics based on two different cases: both the regressors/instruments and the error are mildly persistent (ρ = 0.4) and possibly non-stationary (Case A); both the regressors/instruments and the errors are highly persistent (ρ = 0.8) but only the errors are possibly non-stationary (Case B). Tables 1 and 2 report also the stationary case (i.e. σ = 1), which is used as a benchmark for comparison. We first discuss the results for the non-stationary case due to an abrupt change in the variance (Table 1). Some interesting patterns seem to emerge. For Case A, we first note that all of the test statistics considered, including those based on the bootstrap, are affected by this type of nonstationarity. The size distortion depends on the location, magnitude and sign of the change. In particular, there is an ‘asymmetric size effect’ in the sense that all of the test statistics have larger size distortion when there is an early positive or late negative change to the variance. The same type of asymmetry was noted by Phillips and Xu (2006) for t-statistics in non-stationary stable autoregressive models, and is also present (albeit smaller in magnitude) in the J-statistic for linear instrumental variables models. 10 Second, all BGEL-based statistics have better finite size properties compared to the J-statistic. In particular, both distance statistics D ρ have good finite sample sizes and especially D ET have considerably better finite sample sizes. Third, the ρ ρ bootstrapped statistic J ∗ has typically better size properties than both LMπˆ and Sπˆ . On the other ρ ET hand, when compared to the distance statistics D (and in particular to D ) the location and sign of the change becomes crucial: for early positive (late negative) changes D ρ has an edge over J ∗ , while for the other cases the reverse is true. For Case B, first we note that there are some similarities with Case A in terms of relative comparisons and ranking of the test statistics considered. The main differences are that the effect of non-stationarity on the size is less evident, and that there is no asymmetric size effect. 
Indeed in this case the size distortion of all test statistics appear to be caused mainly by the high degree of persistence of the instruments and of the error. Second, for n = 100 with the exception of the early positive and late negative changes in the variance the size distortions of J ∗ are typically bigger than those obtained in Case A, whereas BGEL statistics (and in particular both D ρ ) seem to be less affected by the high persistency of the observations. On the other hand, for n = 500 the bootstrap seems to regain an edge over the distance statistics. 10
Results available upon request.
C The Author(s). Journal compilation C Royal Economic Society 2009.
220
F. Bravo Table 1. Finite sample size for the case of an abrupt change in the variance. Case A. σ J J∗ D ET SπET LMπET D EU SπEU ˆ ˆ ˆ
LMπEU ˆ
1
0.105
0.079
0.083
0.096
0.100
0.104
0.097
0.102
0.2
0.189
0.123
0.130
τ = 0.1 0.161
0.153
0.136
0.152
0.150
0.5 2
0.136 0.154
0.105 0.110
0.109 0.110
0.118 0.132
0.129 0.142
0.119 0.125
0.124 0.137
0.132 0.164
5
0.329
0.245
0.206
0.287
0.291
0.221
0.277
0.295
0.2 0.5
0.151 0.135
0.101 0.096
0.108 0.105
0.135 0.120
0.142 0.130
0.117 0.115
0.139 0.128
0.142 0.135
2 5
0.141 0.149
0.099 0.112
0.102 0.115
0.124 0.132
0.129 0.139
0.116 0.120
0.129 0.134
0.138 0.140
τ = 0.5 n = 100
0.2
0.341
0.219
0.196
τ = 0.9 0.248
0.277
0.201
0.259
0.274
0.5 2 5
0.162 0.154 0.163
0.112 0.116 0.121
0.107 0.110 0.120
0.145 0.139 0.141
0.158 0.142 0.153
0.121 0.125 0.121
0.144 0.133 0.142
0.153 0.142 0.156
1
0.097
0.070
0.077
0.089
0.093
0.096
0.090
0.094
0.2
0.175
0.113
0.120
τ = 0.1 0.146
0.134
0.124
0.135
0.144
0.5 2
0.123 0.144
0.091 0.095
0.094 0.100
0.107 0.118
0.116 0.127
0.104 0.108
0.110 0.121
0.116 0.144
5
0.296
0.196
0.189
0.283
0.249
0.192
0.256
0.226
0.2 0.5
0.136 0.122
0.093 0.086
0.094 0.093
0.124 0.110
0.129 0.118
0.108 0.103
0.124 0.114
0.126 0.120
2 5
0.128 0.134
0.089 0.104
0.090 0.110
0.113 0.120
0.116 0.125
0.104 0.113
0.115 0.119
0.124 0.124
0.2
0.308
0.195
0.176
τ = 0.9 0.223
0.250
0.182
0.235
0.246
0.5 2 5
0.146 0.139 0.147
0.104 0.098 0.100
0.095 0.100 0.104
0.130 0.125 0.127
0.142 0.128 0.138
0.100 0.102 0.109
0.131 0.117 0.124
0.136 0.126 0.130
τ = 0.5 n = 500
We now discuss the results for the non-stationary case due to a trending variance (Table 2). For Case A we note that this type of non-stationarity has a negative effect on the size of all the test statistics. This effect, however, is less pronounced than the corresponding one reported in Case A of Table 1. It is also interesting to note that the direction of the trend does not matter in terms of the magnitude of the size distortion. For Case B we note that the results are qualitatively C The Author(s). Journal compilation C Royal Economic Society 2009.
221
BGEL for moment conditions models
Table 1 (cont.). Finite sample size for the case of an abrupt change in the variance. Case B. σ J J∗ D ET SπET LMπET D EU SπEU LMπEU ˆ ˆ ˆ ˆ 1
0.156
0.096
0.104
0.123
0.130
0.112
0.121
0.134
0.2
0.196
0.127
0.122
τ = 0.1 0.164
0.178
0.133
0.140
0.168
0.5 2
0.171 0.170
0.125 0.127
0.112 0.115
0.152 0.148
0.160 0.148
0.132 0.127
0.132 0.165
0.135 0.160
5
0.192
0.139
0.128
0.156
0.170
0.130
0.171
0.177
0.2 0.5
0.170 0.154
0.115 0.110
0.116 0.112
0.156 0.143
0.159 0.139
0.122 0.125
0.155 0.144
0.161 0.141
2 5
0.167 0.193
0.116 0.126
0.110 0.139
0.139 0.166
0.143 0.174
0.118 0.120
0.132 0.159
0.153 0.165
τ = 0.5 n = 100
0.2
0.186
0.132
0.129
τ = 0.9 0.165
0.171
0.133
0.159
0.161
0.5 2 5
0.164 0.156 0.197
0.127 0.131 0.133
0.120 0.128 0.124
0.136 0.133 0.148
0.140 0.139 0.154
0.117 0.119 0.135
0.138 0.143 0.149
0.149 0.147 0.154
1
0.145
0.070
0.092
0.114
0.120
0.104
0.112
0.124
0.2
0.165
0.109
0.112
τ = 0.1 0.138
0.153
0.116
0.128
0.140
0.5 2
0.144 0.143
0.104 0.100
0.106 0.109
0.126 0.125
0.135 0.128
0.112 0.107
0.124 0.134
0.136 0.138
5
0.166
0.108
0.108
0.134
0.148
0.117
0.139
0.143
0.2 0.5
0.136 0.123
0.095 0.097
0.108 0.105
0.107 0.117
0.128 0.111
0.105 0.108
0.126 0.117
0.128 0.114
2 5
0.134 0.157
0.101 0.110
0.106 0.108
0.114 0.136
0.114 0.141
0.102 0.113
0.106 0.131
0.124 0.133
0.2
0.145
0.122
0.118
τ = 0.9 0.131
0.138
0.116
0.126
0.128
0.5 2 5
0.138 0.122 0.154
0.092 0.096 0.099
0.113 0.110 0.102
0.118 0.117 0.129
0.122 0.123 0.124
0.105 0.099 0.107
0.110 0.114 0.119
0.124 0.121 0.127
τ = 0.5 n = 500
very similar to those of the corresponding Case B of Table 1. In terms of size distortions the relative comparisons and ranking of the test statistics are similar to those of Table 1, with the only notable difference that in this case J ∗ has the smallest size distortion. Overall, the results of Tables 1 and 2 can be summarized as follows. First, non-stationarity affects negatively the finite sample size properties of test statistics for overidentifying restrictions, C The Author(s). Journal compilation C Royal Economic Society 2009.
222
F. Bravo
σ
J
Table 2. Finite sample size for the case of a trending variance. J∗ D ET SπET LMπET D EU ˆ ˆ
SπEU ˆ
LMπEU ˆ
Case A 0.2 0.5 1
0.179 0.124 0.105
0.110 0.096 0.079
0.122 0.102 0.083
0.154 0.119 0.096
0.161 0.120 0.100
0.119 0.108 0.104
0.162 0.116 0.097
0.166 0.120 0.102
2 5
0.132 0.187
0.099 0.104
0.109 0.117
0.117 0.164
0.124 0.171
0.103 0.125
0.120 0.176
0.123 0.145
100 0.2
0.196
0.121
0.134
Case B 0.173
0.182
0.140
0.180
0.183
0.5 1
0.185 0.156
0.113 0.086
0.120 0.099
0.165 0.123
0.169 0.130
0.123 0.112
0.167 0.121
0.173 0.134
2 5
0.154 0.200
0.114 0.126
0.112 0.123
0.144 0.169
0.154 0.179
0.118 0.131
0.159 0.174
0.167 0.182
0.2
0.162
0.098
0.109
Case A 0.136
0.144
0.108
0.144
0.149
0.5 1 2
0.111 0.097 0.120
0.0886 0.070 0.092
0.091 0.077 0.094
0.106 0.089 0.104
0.107 0.093 0.110
0.101 0.096 0.101
0.103 0.090 0.107
0.109 0.094 0.111
5
0.157
0.098
0.105
0.146
0.152
0.113
0.156
0.130
0.2 0.5 1
0.184 0.172 0.145
0.100 0.094 0.079
0.116 0.108 0.092
0.166 0.156 0.114
0.168 0.157 0.120
0.126 0.110 0.104
0.162 0.150 0.112
0.169 0.159 0.124
2 5
0.140 0.184
0.0910 0.0979
0.110 0.134
0.136 0.159
0.145 0.168
0.108 0.118
0.144 0.156
0.167 0.176
500 Case B
with the degree of overrejection depending on both the variance specification and the degree of persistence of the observations. Second, among the test statistics considered, those based on the BGEL distance D ρ are the least affected by the presence of non-stationarity because of their internal Studentization property. This is consistent with the theoretical prediction of the BGEL method given the implicit pivotalness property enjoyed by all the D ρ test statistics. Third, the bootstrap can improve the finite sample size of the J-statistic even when non-stationarity is present. However, the magnitude of the improvement is typically inferior to that observed under stationarity. Moreover, in certain empirically relevant situations, such as those where there is an abrupt large change in the variance BGEL distance, statistics can perform considerably better than those based on the bootstrap. We now consider the power properties of BGEL statistics. Figure 1 reports the finite sample ET power for J , J ∗ , D ET and LMπET ˆ . We do not report power results for Sπˆ , nor for any statistics C The Author(s). Journal compilation C Royal Economic Society 2009.
223
BGEL for moment conditions models (a) Stationarity
(b) Early large positive change in the variance
0.9
0.5
0.4
D(ET) J* J LM(ET)
0.15
0.30 Beta 2
D(ET) J* J LM(ET)
0.2
0.1
0.45
(c) Late small negative change in the variance
0.3 Beta 2
0.5
(d) Large upward trend in the variance
0.9 0.7
0.4 D(ET) J* J LM(ET)
0.1
0.4
0.3
0.7
Beta 2
D(ET) J* J LM(ET)
0.1
0.4
0.7
Beta 2
Figure 1. Finite sample power for J , J ∗ , D ET and LMπET ˆ .
based on the EU specification because they all display power properties similar to those of D ET and LMπET ˆ . The power of each test statistic is calculated for n = 100 under the null hypothesis H 0 : β 10 = 1, β 20 = 0.3 in E[wnt (ynt − exp(β10 + β20 xnt ) − (β2 − 0.3)δnt )] = 0, where δ nt = exp(β 2 x nt ) letting β 2 vary within the interval [−0.1, 0.7] using 1000 replications and Monte Carlo size corrected critical values. We consider four different cases: 1(a) stationarity (i.e. σ = 1), which is used as benchmark for comparison, 1(b) early large positive change in the variance (i.e. τ = 0.1, σ = 5), (c) late small negative change in the variance (i.e. τ = 0.9, σ = 0.5), (d) large upward trend in the variance (i.e. σ = 5). Cases (a)–(b) and (c)–(d) are investigated using the same specifications used for Cases A and B of Tables 1 and 2, respectively. The other combinations of the parameters σ and τ result in power curves with similar features to those displayed in the four cases considered. In particular, in the late large negative change in the variance case (τ = 0.9, σ = 0.2, Case A) the power curves C The Author(s). Journal compilation C Royal Economic Society 2009.
224
F. Bravo
are similar to those displayed in Figure 1(b), while in the large downward trend in the variance (σ = 0.2, Case B) the power curves are the mirror image of those displayed in Figure 1(d). All other combinations of σ and τ produce power curves similar to those displayed in Figure 1(c). We now discuss Figure 1. We begin with Figure 1(a), and first note that under stationarity all test statistics have good power, even for values quite close to the null hypothesis. Second, no test statistic seems to clearly dominate the others, albeit D ET has a slight edge, especially for the alternatives β 2 < 0.3, while LMπET ˆ has the smallest power especially for alternatives about β 2 < 0.2 and β 2 > 0.4. Third, the power of J and J ∗ is virtually identical. In Figure 1(b), we first note that all the power curves are much flatter compared to those of Figure 1(a). Thus an abrupt large change in the variance has a significant negative effect on the power of the statistics. Second, D ET is uniformly the most powerful statistic. Third, J has almost uniformly the lowest power, the exceptions being in the intervals 0.14 < β 2 < 0.16 and 0.48 < β 2 < 0.52 where J ∗ has the lowest power. Figure 1(c) first shows that the power curves are flatter compared to those displayed in Figure 1(a), but are considerably steeper compared to those displayed in Figure 1(b). Second, we note that no test statistic clearly dominates the others: for alternatives β 2 < 0.3D ET has the largest power, while for alternatives in the other direction LMπET ˆ has an edge for 0.3 < β 2 < 0.38 and 0.42 < β 2 < 0.48, D ET for the values in between the latter two intervals, and J for β 2 > 0.48. Third, the power of J ∗ is uniformly lower than that of J, and it is the lowest for −0.1 < β 2 < 0.48. These results are particularly interesting because, as previously mentioned, the power curves associated with the other combinations of σ and τ are very similar to those displayed in Figure 1(c). Thus Figure 1(c) represents the typical power curves of the four statistics when non-stationarity is present. Finally, in Figure 1(d) once again we first note that the power curves are flatter compared to those displayed in Figure 1(a). We also note that there is an important ‘asymmetric power effect’ in that the power curves are much flatter for the alternatives β 2 < 0.3. For these alternatives the power curves closely resemble those of Figure 1(b). On the other hand, for alternatives β 2 > 0.3 the power curves are more similar to those of Figure 1(c). Second, no test statistic dominates the others: for β2 < 0.3 LMπET ˆ is the most powerful statistic, while for β 2 > 0.3J has an edge for 0.3 < β 2 < 0.36, while D ET has the largest power for β 2 > 0.46. Overall Figure 1 suggests two main points: first non-stationarity typically has a negative effect on the finite sample power of test statistics for overidentifying restrictions, with the power losses depending on both the variance specification and, to a certain extent, the degree of persistence of the observations. Second, there is no test statistic that uniformly dominates the others: which test statistic to choose depends on the type of non-stationarity, although the distance statistic seems to display a certain level of robustness to different types of non-stationarity.
5. CONCLUSIONS This paper introduces the BGEL method for estimation and inference in non-linear moment conditions models with possibly non-stationary observations that are NED on an underlying mixing process. The results of the paper generalize a number of results available in the literature, and are of empirical relevance given a large body of empirical evidence documenting nonconstant unconditional variances for a number of economic and financial time series. The effect of non-stationarity on the finite sample properties of a number of test statistics for overidentifying restrictions under non-stationarity, including one based on the bootstrap, are investigated by means of simulations. The results of the latter suggest that, in general, nonstationarity affects negatively the finite sample properties of all of the statistics considered, C The Author(s). Journal compilation C Royal Economic Society 2009.
BGEL for moment conditions models
225
including those based on the bootstrap. One statistic, however, seems to be a little more robust against non-stationarity than the others: the distance statistic. This statistic has better finite sample size properties than Hansen’s (1982) J-statistic based on the efficient GMM estimator and any other BGEL-based statistics. It also has better finite sample size than the bootstrapped version of the J-statistic for certain types of non-stationarity. Moreover, it appears to be less sensitive to the degree of persistency of the observations, and it has good finite sample power properties across different types of non-stationarity. Overall, the results of this paper provide some indications that might be useful for applied researchers. For inference in non-linear dynamic moment conditions models where nonstationarity might be present BGEL distance statistics seem a valid alternative not only to GMMbased statistics but also to bootstrapped ones. Among the three most commonly used BGEL distance statistics, namely EL, ET and Euclidean likelihood, the ET seems to be preferable on the grounds of good finite sample as well as numerical stability properties. Finally, the bootstrap does not always provide the same type of accurate inference as that given under stationarity.
ACKNOWLEDGMENTS I am grateful to the Editor and two referees for useful comments and constructive criticisms that improved noticeably the original version. All remaining errors are my own responsibility.
REFERENCES Anatolyev, S. (2005). GMM, GEL, serial correlation and asymptotic bias. Econometrica 73, 983–1002. Andrews, D. W. K. (1991). Heteroskedasticity and autocorrelation consistent covariance matrix estimation. Econometrica 59, 817–58. Andrews, D. W. K. and C. J. McDermott (1995). Nonlinear econometric models with deterministically trending variables. Review of Economics Studies 62, 343–60. Bravo, F. (2005). Blockwise empirical entropy tests for time series regressions. Journal of Time Series Analysis 26, 185–210. Brown, B. W. and W. K. Newey (2002). Generalized method of moments, efficient bootstrapping, and improved inference. Journal of Business and Economic Statistics 20, 507–17. Efron, B. (1981). Nonparametric standard errors and confidence intervals (with discussion). Canadian Journal of Statistics 9, 139–72. Fitzenberger, B. (1997). The moving blocks bootstrap and robust inference for linear least squares and quantile regressions. Journal of Econometrics 82, 235–87. Flood, R. P. and N. P. Marron (2000). Self fulfilling risk predictions: an application to speculative attacks. Journal of International Economics 50, 245–68. Gallant, A. R. and H. White (1988). A Unified Theory of Estimation and Inference for Nonlinear Dynamic Models. Oxford: Blackwell. Goncalves, S. and H. White (2002). The bootstrap of the mean of heterogeneous dependent processes. Econometric Theory 18, 1367–84. Goncalves, S. and H. White (2004). Maximum likelihood and the bootstrap for nonlinear dynamic models. Journal of Econometrics 119, 199–219. Gregory, A. W., J. F. Lamanche and G. W. Smith (2002). Information-theoretic estimation of preference parameters: macroeconomic applications and simulation evidence. Journal of Econometrics 107, 213– 33. C The Author(s). Journal compilation C Royal Economic Society 2009.
226
F. Bravo
Grilli, V. and N. Roubini (1992). Liquidity and exchange rates. Journal of International Economics 33, 339–52. Guggenberger, P. and R. J. Smith (2008). Generalized empirical likelihood test in time series models with potential identification failure. Journal of Econometrics 142, 134–61. Hall, A. R. (2005). Generalized Method of Moments. Oxford: Oxford University Press. Hansen, L. P. (1982). Large sample properties of generalized method of moments estimators. Econometrica 50, 1029–54. Hansen, L. P., J. Heaton and A. Yaron (1996). Finite sample properties of some alternative GMM estimators. Journal of Business and Economic Statistics 14, 262–80. Holman, J. A. (1998). GMM estimation of a money in the utility function model: the implication of functional form. Journal of Money, Credit and Banking 30, 679–98. Imbens, G. W. (1997). One-step estimators for over-identified generalized method of moments models. Review of Economic Studies 64, 359–33. Imbens, G. W., R. H. Spady and P. Johnson (1998). Information theoretic approaches to inference in moment condition models. Econometrica 66, 333–37. Imrohoroglu, S. (1994). GMM estimates of currency substitution between the Canadian dollar and the U.S. dollar. Journal of Money, Credit and Banking 26, 792–807. Kitamura, Y. (1997a). Empirical likelihood and the bootstrap for time series regressions. Working Paper, University of Minnesota. Kitamura, Y. (1997b). Empirical likelihood methods with weakly dependent processes. Annals of Statistics 25, 2084–102. Kitamura, Y. and M. Stutzer (1997). An information theoretic alternative to generalized method of moments estimation. Econometrica 65, 861–74. Lee, B. S. (1989). A nonlinear expectations model of the term structure of interest rates with time varying premia. Journal of Money, Credit and Banking 21, 348–67. Loretan, M. and P. C. B. Phillips (1994). Testing covariance stationarity of heavy-tailed time-series. Journal of Empirical Finance 1, 211–48. Newey, W. K. (1985). Generalized method of moments specification testing. Journal of Econometrics 29, 229–56. Newey, W. K. and R. J. Smith (2004). Higher order properties of GMM and generalized empirical likelihood estimators. Econometrica 72, 219–56. Newey, W. K. and K. D. West (1987). A simple positive semi-definite heteroskedasticity and autocorrelation consistent covariance matrix. Econometrica 55, 703–08. Newey, W. K. and K. West (1994). Automatic lag selection in covariance matrix estimation. Review of Economic Studies 61, 631–53. Otsu, T. (2006). Generalized empirical likelihood inference for nonlinear and time series models under weak identification. Econometric Theory 22, 513–27. Owen, A. (1988). Empirical likelihood ratio confidence intervals for a single functional. Biometrika 36, 237–49. Pagan, A. R. and G. W. Schwert (1990). Testing for covariance stationarity in stock markets data. Economics Letters 33, 165–70. Phillips, P. C. B. and K. Xu (2006). Inference in autoregression under heteroskedasticity. Journal of Time Series Analysis 27, 289–308. Politis, D. N. and J. P. Romano (1992). A general resampling scheme for triangular arrays of α—mixing random variables with application to the problem of spectral density estimation. Annals of Statistics 20, 1985–2007.
C The Author(s). Journal compilation C Royal Economic Society 2009.
BGEL for moment conditions models
227
Politis, D. N. and J. P. Romano (1993). On the sample variance of linear statistics derived from mixing sequences. Stochastic Processes and their Applications 45, 155–67. Qin, J. and J. Lawless (1994). Empirical likelihood and general estimating equations. Annals of Statistics 22, 300–25. Sarantis, N. (2006). Testing the uncovered interest parity using traded volatility, time varying risk premium and heterogeneous expectations. Journal of International Money and Finance 25, 1168–86. Serfling, R. (1980). Approximation Theorems of Mathematical Statistics. New York: Wiley. Smith, R. J. (1997). Alternative semi-parametric likelihood approaches to generalised method of moments estimation. Economic Journal 107, 503–19. Smith, R. J. (2009). GEL criteria for moment condition models. Forthcoming in Econometric Theory. Stock, J. H. and J. H. Wright (2000). GMM with weak identification. Econometrica 68, 1055–96. Szroeter, J. (1983). Generalized Wald methods for testing nonlinear implicit and overidentifying restrictions. Econometrica 51, 335–53. Watson, M. W. (1999). Explaining the increased variability in long term interest rates for US long (short) term interest rates. Federal Reserve Bank of Richmond, Economics Quarterly 85, 71–96.
APPENDIX A: TECHNICAL LEMMAS The following lemmas can be proved using simple modifications of the results of Fitzenberger (1997), Goncalves and White (2002) and Smith (2009). L EMMA A.1. Let ∂ k := ∂ k · /∂βj1 . . . ∂βjk for k = 0, 1, . . . Assume that (1) B is compact, (2) (i) v t is a strong mixing sequence of size −α/(α − 2) for α > 2, (ii) ∂ k g nt (β) is L 2 -NED on v t of size −1/2 uniformly on (B, κ), (3) E supn,t supβ∈B ∂ k gnt (β)α < ∞, (4) ∂ k gnt (β) is Lipschitz-L 1 a.s. on B ∀n, t. ˆ is continuous on B uniformly in n, and for 1 ≤ l ≤ m and m = o(n) Then E[∂ k g(β)] ˆ ˆ sup ∂ k ψ(β) = op (1). − E[∂ k g(β)] β∈B
ˆ ψ (β) denote its blockwise sample version. ˆ ˆ ∂ k2 g(β)) and L EMMA A.2. Let n (β) = cov(n∂ k1 g(β), Assume that (1) B is compact, (2) (i) v t is a strong mixing sequence of size −2α/(α − 2) for α > 2, (ii) ∂ kj gnt (β) is L 2 -NED on v t of size −1 uniformly on (B, κ) (j = 1, 2), (3) E supn,t supβ∈B ∂ kj gnt (β)3α < ∞ (j = 1, 2). Then for m = o(n1/2 ) and 1 ≤ l ≤ m ˆ ψ (β) − n (β) = op (1) m for each β ∈ B. }. Assume that L EMMA A.3. Let n = n−1/2 supn,i supβ∈B ψi (β) and n = {λ : λ ≤ mn−1/2 ε−1/2 n E supn,t supβ∈B gnt (β)α < ∞ for any α > 2 hold. Then for m = o(n1/2 ) sup
|λ ψni (β)| = op (1) and
sup
ˆ n (β) w.p.a.1. n ⊆
n,i β∈B,λ∈n
L EMMA A.4. Assume that Assumption 3.1 holds. Then under the same assumptions of Lemma A.3 for k = 1, 2 sup
sup
|ρk (λ ψni (β)) + 1| = op (1).
n,i β∈B,λ∈n
C The Author(s). Journal compilation C Royal Economic Society 2009.
228
F. Bravo
ˆ := supλ∈ Pˆρb (β0 , λ). Assume that Assumptions 3.1, 3.2, 3.3(iii)(a), (v) hold. L EMMA A.5. Let Pˆρb (β0 , λ) n 1/2 Then for m = o(n ) ˆ − ρ(0) ≤ Op (m/n). Pˆρb (β0 , λ) L EMMA A.6. Under the same assumptions of Lemma A.5, ˆ β) ˆ = Op (n−1/2 ) ψ(
and
βˆ − β0 = op (1),
¯ and λ¯ ∈ n as defined in Lemma A.3. where βˆ := infβ∈B Pˆρb (β, λ) L EMMA A.7. Assume that β¯ − β0 = op (1), and Assumptions 3.1, 3.2, 3.3 hold. Then λˆ := ¯ λ) exists w.p.a.1 and λ ˆ = Op (m/n1/2 ). arg maxλ∈(β)¯ Pˆρb (β,
APPENDIX B: PROOFS OF THE THEOREMS n Throughout this appendix we use the following abbreviations: lim = limn→∞ , = qi=1 or t=1 (depending on the context) w.p.a.1, CLT, ULLN stand for with probability approaching 1, central a limit theorem, continuous mapping theorem, uniform law of large numbers, and finally ‘=’ denotes a asymptotically equivalent random vectors, i.e. X = Y ⇒ X = Y + op (1), when X and Y are O p (1). The proofs are based on the same type of arguments as those used by Gallant and White (1988), Newey and Smith (2004) and Smith (2009); therefore only the key steps are reported. ˆ λ)/∂β ˆ =0 Proof of Theorem 3.1: Lemmas A.6, A.7 and Assumption 3.4(i) imply that the FOCs ∂Pˆρb (β, ˆ λ)/∂λ ˆ and ∂Pˆρb (β, = 0 are satisfied w.p.a.1. Then by mean value expansion about [β 0 , 0 ] ˆ 0 )] + M¯ n (θ)n ¯ 1/2 [(βˆ − β0 ) , λˆ /m] , 0 = −n1/2 [0 , ψ(β ¯ where M¯ n (θ¯ ) := ∂ 2 Pˆ b (θ)/∂θ ∂θ and θ = [β , λ ] . Lemmas A.1 (for k = 1) and 5.7 combined with standard ˆ 0 ) imply that calculations, Assumption 3.4(ii)(e) and CLT applied to n1/2 ψ(β ˆ β) ˆ + Op (n−1/2 ) = Op (n−1/2 ). (βˆ − β0 ) ≤ G n (β0 )Gn (β0 )−1/2 ψ(
(B.1)
Thus by a further Taylor expansion about β 0 , Lemmas A.2 (for k j = 0, j = 1, 2) and A.4, (B.1), ¯ + Cauchy–Schwarz and triangle inequalities and m = o(n1/2 ) it is possible to show that m∂ 2 Pˆρb (θ)/∂λ∂λ n (β0 ) = op (1). Similarly, Lemmas A.1 (for k = 1, 2), A.2 (for k 1 = 0, k 2 = 1), A.4, CMT and ¯ ¯ = op (1) can be used to show that ∂ 2 Pˆρb (θ¯ )/∂λ∂β + Gn (β0 ) = op (1) and ∂ 2 Pˆρb (θ)/∂β∂β = op (1), λ and thus M¯ n (θ¯ ) − Mn (β0 ) = op (1). Let D n (β 0 ) denote a uniformly positive definite (l − k) × (l − k) diagonal matrix, and let U n (β 0 ) = [U 1n (β 0 ), U 2n (β 0 )] denote an orthonormal matrix that diagonalizes n (β 0 ) with U 1n (β 0 ) n (β 0 ) U 1n (β 0 ) = D n (β 0 ) and n (β 0 ) U 2n (β 0 ) = 0, where 0 is an l × k matrix of zeros. Then ϒn (β0 ) = diag[Dn (β0 )−1/2 , Ik ]Un (β0 ) is uniformly non-singular and by standard calculations (see e.g. Gallant and White, 1988, ch. 5): a diag[n (β0 )−1/2 , ϒn (β0 )]n1/2 [(βˆ − β0 ) , λˆ /m] = ˆ 0 ), [n (β0 )−1/2 Nn (β0 ) , ϒn (β0 )n (β0 )]n1/2 ψ(β
(B.2)
where n (β0 ) = n (β0 )−1 (I − Gn (β0 )n (β0 )Gn (β0 ) n (β0 )−1 ), Nn (β0 ) = n (β0 )−1 Gn (β0 )n (β0 ), n (β0 ) = [Gn (β0 ) n (β0 )−1 Gn (β0 )]−1 . C The Author(s). Journal compilation C Royal Economic Society 2009.
BGEL for moment conditions models
229
ˆ 0 ) has mean 0 and covariance diag[I l−k , 0] so that by CLT and By construction ϒn (β0 )n (β0 )n1/2 ψ(β standard calculations a ˆ 0) ∼ N ([0 , 0 ] 0, diag[Il−k , 0]). ϒn (β0 )n (β0 )n1/2 ψ(β a
ˆ 0 ) ∼ N (0, Ik ) and the result follows. Similarly n (β0 )−1/2 Nn (β0 ) n1/2 ψ(β
ˆ and Proof of Theorem 3.2: Lemmas A.4, A.5 and a mean value expansion for ρ1 (λˆ ψni (β)) ˆ −1 show that ( ρ1 (λˆ ψni (β))) ˆ sup |πˆ ni − 1/q| = sup |λˆ ψni (β)|/q + op (1) = op (1). n,i
(B.3)
n,i
a a ˆ nπˆ (·) = ˆ n (·). The same arguments ˆ n (·) and G ˆ nπˆ (·) = G Thus by (B.3) and triangle inequality ˆ − n (β0 ) = op (1). By Assumption 3.3(iii)(e) ˆ n (β) of Theorem 3.1 can be used to show that ˆ n (β) ˆ −1 − n (β0 )−1 = op (1). Similarly n (β 0 ) is O(1) and uniformly non-singular hence by CMT ˆ ˆ ˆ nπˆ (β) ˆ and Gn (β) − Gn (β0 ) = op (1), and the result follows by CMT. By the same arguments both ˆ are consistent. ˆ n (β)
Proof of Theorem 3.3: Lemma A.3 and a second-order Taylor expansion of D ρ about λ = 0 show that a a ˆ β) ˆ so that using the results of Theorem 3.2 for D ρ = S ρ . By Theorem 3.1 m−1 n1/2 λˆ = −n (β0 )−1 n1/2 ψ( a a ˆ and note that by ˆ it follows that LM ρ = S ρ . By Lemma A.1 (for k = 0) S ρ = ˆ n (β0 )−1 g( ˆ n (β) ˆ β), ˆ β) ng( Taylor expansion ˆ = n (β0 )1/2 n (βn )n (β0 )−1/2 g(β ˆ β) ˆ 0 ) + op (1), n1/2 g( where n (β 0 ) = I l − n (β 0 )−1/2 G n (β 0 ) n (β 0 )G n (β 0 ) n (β 0 )−1/2 . By the Skorohod representation theorem—see e.g. Serfling (1980)—there exist random variables Y n with the same distribution as ˆ ∼ ˆ 0 ) such that Y n = Y + o a.s (1) where Y ∼ N (0, I l ). Then n1/2 g( ˆ β) that of n (β0 )−1/2 n1/2 g(β n (β0 )1/2 n (β0 )Yn = n (β0 )1/2 n (β0 )Y + oa.s (1) and X n = n (β 0 ) Y ∼ N (0, n (β 0 )). Thus a ˆ n (β0 )−1 g( ˆ = Xn Xn + op (1) ∼ ˆ β) ˆ β) ng( χ 2 (l − k),
since n (β 0 ) is idempotent with rank l − k.
Proof of Theorem 3.4: Assumptions 3.3 and 3.4 imply that the results of Lemmas A.1–A.3 are valid } and l nt (θ ) replacing n and g nt (β), respectively. Then as in the proof with an = {μ : μ ≤ mn−1/2 ε−1/2 n ˆ = 0 and ∂Pˆρb (θˆ , μ)/∂μ ˆ = 0 are satisfied w.p.a.1, and by the same of Theorem 3.1 the FOCs ∂Pˆρb (θˆ , μ)/∂θ arguments of Theorems 3.1 and 3.3 a diag[n (θ0 )−1/2 , ϒn (θ0 )]n1/2 [(θˆ − θ0 ) , μˆ /m] ∼ N ([0 , 0 ] , diag[Il+s , 0]), a ˆ λ, ˆ ϕ) ˆ a (θ). a (θˆ ) n (θ0 )−1 ψ 2cn Pˆρb (θ, ˆ = nψ a ˆ 0 ) = g(θ ˆ = a (θ) ˆ 0 ). Noting that ψ Let S g denote an l × (l + s) selection matrix such that Sg l(θ (I − −1 ˆ Ln (β0 )n (β0 ) Ln (β0 ) )l(θ0 ), it follows that a
ˆ 0 ) [n (θ0 )−1 (I − Ln (θ0 )n (θ0 )−1 Ln (θ0 ) n (θ0 )−1 ) D ρ = nl(θ ˆ 0 ). −Sg n (θ0 )−1 (I − Gn (θ0 )n (θ0 )−1 Gn (θ0 ) n (θ0 )−1 ) Sg ]l(θ As in the proof of Theorem 3.3 by the Skorohod representation theorem there exist random variables ˆ 0 ) and n (θ0 )−1/2 Sg l(θ ˆ 0 ) such that Y j n (j = 1, 2) with the same distribution as that of n (β0 )−1/2 n1/2 l(θ Y j n = Y j + o a.s (1) where Y 1 ∼ N (0, I l+s ) and Y 2 ∼ N (0, I l ). Then for X j n = j n (θ 0 ) Y ∼ N (0, C The Author(s). Journal compilation C Royal Economic Society 2009.
230
F. Bravo
j n (β 0 )) where 1n (θ0 ) = n (θ0 )−1/2 (I − Ln (θ0 )n (θ0 )Ln (θ0 ) n (θ0 )−1/2 ), 2n (θ0 ) = n (θ0 )−1/2 (I − Gn (θ0 )n (θ0 )Gn (θ0 ) n (θ0 )−1/2 ), a D ρ = 2j =1 Xj n Xj n + op (1) ∼ χ 2 (s) since j n (θ 0 ) (j = 1, 2) are idempotent matrices with ranks l + s − q − k and l − q − k, respectively. Note that ˆ 0) ˆ = [n (θ0 ) − Sg n (θ0 )−1 (Sg − Gn (θ0 )n (θ0 )−1 Gn (θ0 ) n (θ0 )−1 Sg )]l(θ (n1/2 /m)(μ˜ − μ) a
a
and that n (θ 0 )−1 S g n (β 0 )S g n (θ 0 )−1 = n (θ 0 )−1 so that H ρ = D ρ . Moreover
ρ a
LM = (n/m )ϕ˜ 2
⎧ ⎨
Sϕ
⎩
n (θ0 )
Ln (θ0 )
Ln (θ0 )
0
−1 Sϕ
⎫−1 ⎬ ⎭
ϕ, ˜
˜ ϕ)/∂(θ ˆ λ, ˆ 0) one gets ˜ , λ , ϕ ) at (θ, whereas by a Taylor expansion of the FOCs 0 = ∂Pˆ b (θ˜ , λ,
a ˆ = n1/2 (μ˜ − μ) ˆ , ϕ˜ , (β˜ − β)
n (θ0 ) Ln (θ0 )
Ln (θ0 ) 0
−1 n1/2
ˆ ϕ ψnia (θ)/q, ˆ ρ1 (λˆ ψni (β))S
a ˆ n (θˆ ) [λˆ , 0 ] = 0. Thus LM ρ = since ρ1 (λˆ ψni (β))L S ρ . Finally by mean value expansion and standard calculations a ˆ θ) ˆ ˆ nia (θˆ )/q = ˆ + n (θ0 )Sg n (θ0 )−1 g( ˆ β), − l( ρ1 (λˆ ψni (β))ψ and since n (θ 0 )−1 Sg n (θ 0 ) n (θ 0 ) n (θ 0 )S g n (θ 0 )−1 = n (β 0 ), one gets a a ˆ 0 ) n (θ0 )l(θ ˆ 0 ) − g(β ˆ 0 ) n (β0 )g(β ˆ 0 )) = D ρ . S ρ = n(l(θ
Proof of Theorem 3.5: Note that 0 = { : q(θ) = 0, r(α) = 0} is compact and therefore Assumption 3.3(i) holds also for 0 . Lemmas A.6–A.7 and the continuity of q(·) and r(·) over imply that ˜ = op (1) and β˜ − β0 = op (1). Thus there exist two vectors of Lagrange multipliers—ϕ and η—such λ that ˜ ˜ ˜ ˜ ˜ ρ1 (λ˜ ψni (β))(∂ψ ni (β)/∂β ) λ/q − Qβ (θ) ϕ 0= −Qα (θ˜ ) ϕ˜ − R(α) ˜ η˜ w.p.a.1, where Q α (·) = ∂q(·)/∂α , R(·) = ∂r(·)/∂α . Then Lemmas A.4, A.6 and Assumption 3.5 imply that ϕ ˜ and η ˜ are both O p (m/n1/2 ). A mean value expansion expansion about [0 , θ 0 ] , some lengthy algebra and the same arguments of Theorems 3.3 and 3.4 show that a a a ˆ n (θ0 )(λ˜ − λ) ˆ = nϕ˜ Sn (θ0 )ϕ˜ = LM ρ . D ρ = (n/m2 )(λ˜ − λ) a
ˆ = −Sn (θ0 )ϕ˜ where S(θ 0 ) = ˜ β) A mean value expansion about θ 0 , (B.2) and further algebra yield n1/2 q(α, a Q β (θ 0 ) n (β 0 )−1 Q β (θ 0 ) so that S ρ = LM ρ . Note that n1/2 S(θ0 )1/2 ϕ˜ = (I − Sn (θ0 )−1/2 Qα (θ0 )Kn (θ0 )Qα (θ0 ) Sn (θ0 )−1/2 )ξˆ (β0 )/n1/2 , a
C The Author(s). Journal compilation C Royal Economic Society 2009.
BGEL for moment conditions models where
231
Kn (θ0 ) = Mn (θ0 )−1 I − R(α0 ) Nn (θ0 )−1 R(α0 )Mn (θ0 )−1 , M(θ0 ) = Qα (θ0 ) Sn (θ0 )−1 Qα (θ0 ) + R(α0 ) R(α0 ), Nn (θ0 ) = R(α0 )Mn (θ0 )R(α0 ) , ˆ 0 ). ξˆ (β0 ) = Sn (θ0 )−1/2 Qβ (θ0 )n (θ0 )−1 Gn (β0 ) n (β0 )−1 ψ(β
By the same Skorohod representation theorem argument used in the proofs of Theorems 3.3 and 3.4 and the fact that I − S n (θ 0 )−1/2 Q α (θ 0 )K n (θ 0 )Q α (θ 0 ) S n (θ 0 )−1/2 is idempotent with rank equal to s − (q − r), the result follows.
C The Author(s). Journal compilation C Royal Economic Society 2009.
The
Econometrics Journal Econometrics Journal (2009), volume 12, pp. 232–247. doi: 10.1111/j.1368-423X.2009.00289.x
On skewness and kurtosis of econometric estimators Y ONG B AO † AND A MAN U LLAH ‡ †
‡
Purdue University, 403 W. State Street, West Lafayette, IN 47907, USA E-mail:
[email protected]
University of California, 900 University Ave, Riverside, CA 92521, USA E-mail:
[email protected]
First version received: November 2007; final version accepted: December 2008
Summary We derive the approximate results for two standardized measures of deviation from normality, namely, the skewness and excess kurtosis coefficients, for a class of econometric estimators. The results are built on a stochastic expansion of the moment condition used to identify the econometric estimator. The approximate results can be used not only to study the finite sample behaviour of a particular estimator, but also to compare the finite sample properties of two asymptotically equivalent estimators. We apply the approximate results to the spatial autoregressive model and find that our results approximate the non-normal behaviours of the maximum likelihood estimator reasonably well. However, when the weights matrix becomes denser, the finite sample distribution of the maximum likelihood estimator departs more severely from normality and our results provide less accurate approximation. Keywords: Kurtosis, Skewness, Stochastic expansion.
1. INTRODUCTION Classical statistics and econometrics theory typically relies on asymptotic results for the purpose of estimation and inference. Thanks to various versions of central limit theorems, a typical econometric estimator can be shown to be asymptotically normal and based on this confidence intervals can be constructed and test statistics can be designed. However, in many situations, the asymptotic properties of estimators and test statistics can provide poor approximations of their behaviour in finite samples or even moderately large samples. This has been long recognized in the literature with the earliest work dating back to Fisher (1921), while the monograph of Ullah (2004) provides a comprehensive and up-to-date discussion of finite sample econometrics. Usually, the exact results are complicated to analyse and are available only under very restrictive assumptions on the data-generating process. In light of this, approximate techniques have enjoyed popularity, including the large-n, small-σ , Laplace and saddle-point approximations. We notice that while most of the existing literature focuses on some specific estimators for some specific models, Bao and Ullah (2007b) generalized Rilstone et al. (1996) to develop the approximate first two moments of a large class of estimators in time-series models. An unfinished task in Bao and Ullah (2007b) remains, however, regarding the higher moments of the estimators. C The Author(s). Journal compilation C Royal Economic Society 2009. Published by Blackwell Publishing Ltd, 9600 Garsington Road,
Oxford OX4 2DQ, UK and 350 Main Street, Malden, MA, 02148, USA.
On skewness and kurtosis of econometric estimators
233
In principle, we can follow the approach of Bao and Ullah (2007b) to expand to a higher order the inverse of the gradient of the moment function in the spirit of Nagar (1959) to derive the approximate third and fourth moments of the estimators, say, approximate the third moment up to order O(n−3 ) and the fourth moment up to order O(n−4 ), where n is the sample size, in the ‘second-order’ sense; see Ullah (2004). However, the ‘raw’ third and fourth moments of an estimator are the absolute measures of skewness and tail behaviour of its distribution, relative measures such as the skewness and excess kurtosis coefficients should be more useful in situations when we need to judge how the finite sample properties of this estimator can behave differently from the asymptotic properties. Moreover, even though in principle we can derive the third moment up to order O(n−3 ) and the fourth moment up to order O(n−4 ), the terms that are of order O(n−3 ) and O(n−4 ) may be of very small magnitude for a given n and thereby one may wonder how useful they are to be included in the analysis. The major purpose of this paper is to derive approximate results for two standardized measures of deviation from normality for the estimator, namely, the skewness and excess kurtosis coefficients. Given the knowledge of the non-normality coefficients, one can not only judge the finite sample behaviour of a particular estimator, but also compare the finite sample properties of two asymptotically equivalent estimators. As an application, we study the finite sample properties of the maximum likelihood estimator (MLE) in the spatial autoregressive model. We find that the departure from normality of the MLE can be very severe in small samples. In these cases, our approximate skewness and kurtosis results can sometimes provide poor approximation to the true tail behaviours of the MLE when the true parameter is approaching its boundary value and the weights matrix is dense. As the sample size increases, the performance of our approximation results improves, especially when the weights matrix is sparse. The plan of this paper is as follows. In Section 2, we derive our main results. Section 3 gives the application. Section 4 contains some concluding remarks. Appendix A contains the proof and Appendix B collects some technical details for the spatial model.
2. MAIN RESULTS We follow Bao and Ullah (2007b) to consider a class of estimators identified by the moment condition βˆn = arg{ψn (β) = 0} ,
(2.1)
where ψ n (β) = ψ n (Z; β) is a known k × 1 vector-valued function of the observable data Z = {Z i }ni=1 , and a parameter vector β, with true value β 0 , of k elements (of the same dimension as ψ n (β)) such that E[ψ n (β)] = 0 only happens at β = β 0 . The type of estimators identified by (2.1) are general enough to include the maximum likelihood, least squares, method of moments, generalized method of moments and other extremum estimators, as shown in Rilstone et al. (1996). Usually, the moment condition (2.1) can be interpreted as the orthogonality condition between regressors and error terms, or as the first-order condition of some optimization criterion. In what follows, we use trA to denote the trace, ||A||, to denote the usual norm (tr AA )1/2 , and ∇ s A(β) is the matrix of sth-order partial derivative of A(β) and is obtained recursively (specifically, if A(β) is a k × 1 vector function, the jth element of the lth row of ∇ s A(β) (a k × k s matrix) is the 1 × k vector aljs (β) = ∂a s−1 lj (β)/∂β ). Throughout, the following assumptions are made. C The Author(s). Journal compilation C Royal Economic Society 2009.
234
Y. Bao and A. Ullah
√ √ d ˆ ˆ A SSUMPTION 2.1. βˆn exists, √ and n(βn − β0 ) → N (0, D), where D = avar ( n(βn − β0 )) is the asymptotic variance of n(βˆn − β0 ). √ The first four moments of n(βˆn − β0 ) exist and are bounded, and for each A SSUMPTION 2.2. √ element of D −1/2 n(βˆn − β0 ), its rth cumulant is of order O(n(2−r)/2 ), r = 3, 4. A SSUMPTION 2.3. The sth-order derivatives of ψ n (β) exist for β in a neighbourhood of β 0 and for s up to 3, E(||∇ s ψ n (β 0 )||2 ) < ∞. A SSUMPTION 2.4. For β in some neighbourhood of β 0 , [∇ψ n (β)]−1 = O p (1). A SSUMPTION 2.5. ||∇ s ψ n (β) − ∇ s ψ n (β 0 )|| ≤ ||β − β 0 ||M n for β in some neighbourhood of β 0 , where E(|M n |) < C < ∞ for some positive constant C, for s up to 3. Assumptions 2.1–2.5 are fairly standard for a large class of estimators, though excluding non-stationary time-series models involving a unit root. One may lay out a set of primitive conditions to guarantee existence and consistency of βˆn . Assumption 2.2 requires that the first four moments of the estimator exist, which may be regarded as somewhat strong. In general, existence of moments may be difficult to verify. In that case, the results to be derived in this paper may still be informative, though it should then be noted the results are based on ‘formal’ expansions, perhaps not valid expansions. For example, it is well known that moment of the instrumental estimator for a just identified structural equation do not exist. However, it does have a well-defined exact distribution and limiting distribution and the ‘approximate moments’ can still be obtained. Assumption 2.2 essentially follows Sargan (1974), who showed that for the Nagar-type (Nagar, 1959) large-n approximate moments, which we shall use later to derive our main results, to be valid as asymptotic approximations, the corresponding moments of the exact distribution (of the standardized estimator) must exist and are of order O(1) as n → ∞; also see Srinivasan (1970) and Basmann (1974). For each element of the standardized βˆn , the third and fourth cumulants are nothing but the skewness and (excess) kurtosis coefficients of the corresponding element of βˆn . Assumptions 2.3–2.5 are similar to Rilstone et al. (1996) and Bao and Ullah (2007b) to guarantee that the moment condition is smooth enough so that ˆ as well as a Nagar-type expansion of the inverse of the gradient a stochastic expansion of ψn (β) of the moment function can be implemented around β 0 . Following the notational conventions in Rilstone et al. (1996), ψ n = ψ n (β 0 ) (we suppress the argument of a function when it is evaluated at β 0 ), H i = ∇ i ψ n , Q = [E(H 1 )]−1 , V i = H i − E(H i ), ⊗ represent the Kronecker product, a −s/2 represent terms of order OP (n−s/2 ), and put a−1/2 = −Qψn , 1 a−1 = −QV1 a−1/2 − QE(H2 )(a−1/2 ⊗ a−1/2 ), 2 1 1 a−3/2 = −QV1 a−1 − QV2 (a−1/2 ⊗ a−1/2 ) − QE(H2 )(a−1/2 ⊗ a−1 + a−1 ⊗ a−1/2 ) 2 2 1 − QE(H3 )(a−1/2 ⊗ a−1/2 ⊗ a−1/2 ). 6 As shown in Rilstone et al. (1996) and Bao and Ullah (2007b), one can write a stochastic expansion βˆn − β0 = a−1/2 + a−1 + a−3/2 + oP (n−3/2 ),
(2.2)
C The Author(s). Journal compilation C Royal Economic Society 2009.
235
On skewness and kurtosis of econometric estimators
where the order O(n−1/2 ) term a −1/2 represents the the asymptotic behaviour of βˆn and D = nE(a −1/2 a −1/2 ). Note that E(a −1/2 ) = 0 since E(ψ n ) = 0. Based on this and the expansion (2.2), one can immediately derive the second-order bias and mean squared error (MSE) of βˆn , as done in Rilstone et al. (1996) for models with identically and independently distributed (IID) data and Bao and Ullah (2007b) for models with non-IID data. For example, for a single parameter estimator βˆn,i , 1 ≤ i ≤ k, we have the following first two moments: 1 E(βˆn,i − β0,i ) = E(a−1,i ) + o(n−1 ), 2 2 E[(βˆn,i − β0,i )2 ] = E a−1/2,i + 2a−1/2,i a−1,i + a−1,i + 2a−1/2,i a−3/2,i + o(n−2 ). In principle, one could follow the same strategy to expand βˆn − β0 up to order O(n−5/2 ) and derive the second-order third and fourth moments, up to orders O(n−3 ) and O(n−4 ), respectively. However, as stated in the introduction, instead of the absolute measures of skewness and tail behaviour of βˆn , we are more interested in the relative measures such√as the skewness and excess kurtosis coefficients. To facilitate our derivation, we define Tn,i = n(βˆn,i − β0,i ). Obviously, the skewness and excess kurtosis coefficients of βˆn,i are the same as those of T n,i . Corresponding to (2.2), we write √ Tn,i = n(a−1/2,i + a−1,i + a−3/2,i ) + oP (n−1 ) = ξ0,i + ξ−1/2,i + ξ−1,i + oP (n−1 ),
(2.3)
√ d where ξ−s/2,i = na−(s+1)/2,i = OP (n−s/2 ) for s = 0, 1, 2. By Assumption 2.1, ξ0,i → N(0, Dii ), where D ii denotes the iith element of D. The following theorem gives the approximation skewness and kurtosis results. 2 T HEOREM 2.1. The skewness and excess kurtosis coefficients of βˆn,i can be approximated by γ1 (βˆn,i ) and γ2 (βˆn,i ), up to order O(n−1/2 ) and O(n−1 ), respectively, and they are given by 2 −3/2 3 2 2 + 2ξ0,i ξ−1/2,i ξ−1/2,i − 3E ξ0,i E ξ0,i + 3ξ0,i E(ξ−1/2,i ) , γ1 (βˆn,i ) = E ξ0,i −2 2 2 γ2 (βˆn,i ) = E ξ0,i + ξ−1/2,i + 2ξ0,i ξ−1/2,i + 2ξ0,i ξ−1,i − [E(ξ−1/2,i )]2 4 3 3 2 2 × E ξ0,i + 4ξ0,i ξ−1/2,i + 4ξ0,i ξ−1,i + 6ξ0,i ξ−1/2,i 3 3 2 − 4E ξ0,i E(ξ−1,i ) + 3ξ0,i ξ−1/2,i E(ξ−1/2,i ) − 4E ξ0,i 2 2 (2.4) + 6E ξ [E(ξ−1/2,i )] − 3. 0,i
The proof of Theorem 2.1 is given in Appendix A. Note that the skewness result generalizes several expressions that are given in McCullagh (1987) and in Linton (1997) in the context of 1 We do not discuss the more complicated issue of cross moments of βˆ n,i and βˆn,j , i = j , in this paper. This issue is however important if we are interested in the finite sample properties of some test statistics that involve the whole parameter vector. We leave this for our future study. 2 In an earlier version of this paper, the authors considered the standardized estimator D −1/2 √n(βˆ − β ). However, as n 0 −1/2 √ pointed out by one referee and the co-editor, since in general D is unknown, in practice Dˆ n n(βˆn − β0 ), where Dˆ n consistently estimates D and Dˆ n − D = OP (n−1/2 ), is also of interest. The presence of Dˆ n will introduce additional −1/2 √ terms into the expansion and we can show that for the ith feasible standardized estimator Dˆ n,ii n(βˆn,i − β0,i ) = √ −1/2 √ −1/2 ˆ D −D ξ0,i + ξ−1/2,i + ξ−1,i , where ξ0,i = nDii a−1/2,i , ξ−1/2,i = nDii [a−1,i − 12 ( n,iiDii ii )a−1/2,i ] and ξ−1,i = √ −1/2 Dˆ Dˆ −D −D nDii [a−3/2,i + 38 ( n,iiDii ii )2 a−1/2,i − 12 ( n,iiDii ii )a−1,i ]. With these newly defined ξ ’s, the theorem presented in −1/2 √ this paper is still valid for the skewness and kurtosis of the feasible standardized estimator Dˆ n(βˆn,i − β0,i ). n,ii
C The Author(s). Journal compilation C Royal Economic Society 2009.
236
Y. Bao and A. Ullah
maximum likelihood estimation. Given the skewness and kurtosis results above, one may follow the lines of Rothenberg (1984) to use the two standardized measures to construct an Edgeworthtype approximation to the distribution of a non-linear estimator. However, it is still an open question as to whether the Edgeworth distribution is a valid approximation to the true distribution of a general class of (non-linear) estimators βˆn under the general non-IID set-up. Note that the approximate results in (2.4) are in terms of expectations of terms involving ξ 0,i , ξ −1/2,i and ξ −1,i . In some cases, these expectations can be worked out explicitly (either analytically or numerically, as demonstrated in the next section). In cases when the expectations are difficult to derive, one may use sample averages to approximate the expectations. In either situation, ξ 0,i , ξ −1/2,i and ξ −1,i are functions of the unknown β 0,i . In practice, we may have to replace the unknown β 0,i with its consistent estimator βˆn,i .
3. SPATIAL AUTOREGRESSIVE MODEL Bao and Ullah (2007a) investigated the finite sample behaviour of the MLE of the autoregressive coefficient in a spatial autoregressive model by looking at the second-order bias and MSE. Now we make a more thorough investigation by checking the skewness and kurtosis results. Consider the following spatial lag model y = ρ0 Wy + ε,
(3.1)
where y is an n × 1 vector of observations on the dependent spatial variable, Wy is the corresponding spatially lagged dependent variable for weights matrix W, which is assumed to be known a priori, ε is an n × 1 vector of IID Gaussian error terms with zero mean and finite variance σ 20 , and ρ 0 is the spatial autoregressive parameter. Under the regularity assumptions in Lee (2004), the average sample likelihood function 1 ε ε 1 , L ρ0 , σ02 = ln |I − ρ0 W | − ln 2π σ02 − n 2 2nσ02
(3.2)
where I is the identity matrix, is well defined√and continuous. Lee (2004) proved that the MLE has the usual asymptotic properties, including n-consistency, normality and asymptotic efficiency. If ρ 0 is known, the MLE of σ 20 is given by σˆ n2 = (y − ρ0 Wy) (y − ρ0 Wy)/n = y Cy/n, where C = I − ρ 0 (W + W ) + ρ 20 W W . Usually, the estimation procedure is implemented by substituting σˆ n2 = y Cy/n into the likelihood function (3.2) and maximizing a concentrated likelihood function
n 2π 1 1 1 ln (1 − ρ0 ωi ) − ln L (ρ0 ) = y Cy − , n i=1 2 n 2
(3.3)
where ω i ’s are the eigenvalues of W. Denote A = I − ρ 0 W , M i = A−1 [∂ i (I − ρ 0 (W + W ) + ρ 20 W W )/∂ρ i ]A−1 , B i = i ∂ ln |A|/∂ρ i0 (in particular, B 1 = −tr(A−1 W ), B 2 = −tr[(A−1 W )2 ], B 3 = −2tr[(A−1 W )3 ] and B 4 = −6tr[(A−1 W )4 ]), b i = B i /n. Then we can write the score function ψ n , in terms C The Author(s). Journal compilation C Royal Economic Society 2009.
237
On skewness and kurtosis of econometric estimators
J
ρ0 2
6
10
Table 1. Sample and theoretical bias, MSE, skewness and kurtosis, n = 30. Bias B MSE M SK γ1 KR
γ2
−0.9 −0.4
0.020 0.018
0.018 0.019
0.004 0.025
0.003 0.026
1.404 0.473
3.977 0.251
2.602 0.184
0.428 0.335
−0.2 0
0.013 −0.002
0.011 0.000
0.031 0.031
0.033 0.035
0.244 −0.091
0.112 0.000
−0.167 −0.265
0.892 1.073
0.2 0.4 0.9
−0.004 −0.019 −0.019
−0.011 −0.019 −0.018
0.028 0.027 0.003
0.033 0.026 0.003
−0.341 −0.406 −2.082
−0.112 −0.251 −3.977
−0.052 −0.187 13.400
0.892 0.335 0.428
−0.9 −0.4 −0.2
0.102 −0.016 −0.063
0.002 −0.046 −0.062
0.070 0.111 0.111
0.118 0.114 0.105
1.122 0.032 −0.224
0.182 −0.140 −0.349
0.306 −0.676 −0.535
0.593 −2.126 −3.668
0 0.2 0.4
−0.076 −0.084 −0.085
−0.074 −0.080 −0.080
0.109 0.102 0.076
0.093 0.077 0.060
−0.493 −0.759 −1.022
−0.672 −1.232 −2.422
−0.102 0.508 1.343
−4.840 −5.029 −3.520
0.9
−0.050
−0.039
0.014
0.014
−2.448
10.354
9.685
−0.9 −0.4
0.144 −0.068
−0.056 −0.122
0.110 0.160
0.188 0.182
1.081 0.192
−0.140 −1.191
0.139 −1.024
−8.352 −12.644
−0.2 0 0.2
−0.097 −0.102 −0.146
−0.138 −0.147 −0.148
0.178 0.179 0.176
0.178 0.170 0.157
−0.185 −0.532 −0.782
−2.200 −4.327 −10.486
−0.916 −0.490 0.069
−10.195 −5.196 0.932
0.4 0.9
−0.142 −0.084
−0.139 −0.058
0.156 0.041
0.137 0.046
−1.217 −3.669
−60.228
1.173 20.303
6.648 4.724
Notes: For each J and ρ 0 , Bias is the average bias of the sample estimates over 1000 replications; B is the theoretical second-order bias; MSE is the mean squared error of the sample estimates over the 1000 replications; M is the theoretical second-order mean squared error; SK and KR are the sample skewness and excess kurtosis coefficients, respectively, of the estimates over the 1000 replications; γ 1 and γ 2 are the theoretical approximate skewness and excess kurtosis coefficients.
of our notation in Section 2, for the MLE ρˆn , as well as its higher-order derivatives as follows: ψn = b1 − (ε ε)−1 ε M1 ε/2, H1 = b2 − (ε ε)−1 ε M2 ε/2 + (ε ε)−2 (ε M1 ε)2 /2, H2 = b3 + 3(ε ε)−2 ε M1 εε M2 ε/2 − (ε ε)−3 (ε M1 ε)3 , H3 = b4 + 3(ε ε)−2 (ε M2 ε)2 /2 − 6(ε ε)−3 (ε M1 ε)2 ε M2 ε + 3(ε ε)−4 (ε M1 ε)4 . As it turns out, all the derivatives are in terms of products of ratios of quadratic forms in ε, and the skewness and kurtosis results (2.4) essentially boil down to expectations of them. Appendix B outlines the steps for numerical evaluation and we follow the steps to analyse the skewness and kurtosis behaviour of ρˆn . Following Kelejian and Prucha (1999), we consider three specifications of the weights matrix with different degree of sparseness, namely, the ‘one ahead and one behind’, ‘three ahead and three behind’ and ‘five ahead and five behind’ matrices, C The Author(s). Journal compilation C Royal Economic Society 2009.
238
J
Y. Bao and A. Ullah
ρ0 2
6
10
Table 2. Sample and theoretical bias, MSE, skewness and kurtosis, n = 100. Bias B MSE M SK γ1 KR
γ2
−0.9 −0.4
0.006 0.002
0.005 0.006
0.001 0.007
0.001 0.007
0.839 0.226
1.133 0.224
1.324 −0.073
1.015 0.072
−0.2 0
0.006 0.004
0.003 0.000
0.009 0.010
0.009 0.010
0.197 −0.038
0.106 0.000
0.196 −0.110
0.092 0.100
0.2 0.4 0.9
−0.001 −0.007 −0.005
−0.003 −0.006 −0.005
0.009 0.007 0.001
0.009 0.007 0.001
−0.197 −0.330 −0.796
−0.106 −0.224 −1.133
0.008 0.159 0.890
0.092 0.072 1.015
−0.9
0.045
0.000
0.027
0.037
0.987
0.082
0.198
−0.085
−0.4 −0.2
−0.012 −0.008
−0.014 −0.019
0.039 0.036
0.038 0.035
−0.245 −0.207
−0.168 −0.287
0.022 −0.105
−0.275 −0.292
0 0.2 0.4
−0.023 −0.022 −0.024
−0.022 −0.024 −0.024
0.030 0.026 0.018
0.031 0.025 0.018
−0.462 −0.448 −0.707
−0.428 −0.606 −0.859
0.367 0.351 0.789
−0.222 −0.017 0.416
0.9
−0.010
−0.012
0.002
0.002
−1.479
−6.529
4.115
7.066
−0.9 −0.4
0.063 −0.027
−0.017 −0.037
0.039 0.067
0.079 0.071
1.113 −0.183
−0.141 −0.541
0.581 −0.429
−0.580 −0.263
−0.2 0
−0.030 −0.037
−0.042 −0.045
0.060 0.055
0.064 0.055
−0.432 −0.549
−0.766 −1.062
0.019 0.290
0.165 0.824
0.2 0.4 0.9
−0.039 −0.046 −0.022
−0.045 −0.042 −0.018
0.041 0.034 0.004
0.044 0.032 0.004
−0.541 −0.933 −1.883
−1.483 −2.165 −148.590
0.038 1.505 7.988
1.761 3.082 10.651
Note: See Table 1.
denoted by WJ =2 , WJ =6 and WJ =10 , respectively. 3 We row-standardize the three matrices and set all the non-zero elements to be equal to each other. We normalize σ 20 = 1. Tables 1–3 give the theoretical second-order bias (B, up to order O(n−1 )), mean squared error (M, up to order O(n−2 )), the approximate skewness γ 1 (up to order O(n−1/2 )) and excess kurtosis γ 2 (up to order O(n−1 )) of the estimator ρˆn , as well as the ‘true’ values of them (denoted by Bias, MSE, SK, KR) across 1000 Monte Carlo replications, for n = 30, 100, 200, respectively. Asymptotically, ρˆn should be a normal variable. Bao and Ullah (2007a) documented some evidence indicating that in small samples, when ρ 0 is negatively large and the weights matrix is dense (corresponding to a larger J), the behaviour of ρˆn can be quite different from what the asymptotic theory predicts by checking the first two moments of ρˆn . This is supported again by checking the first two moments of ρˆn . For smaller J, the theoretical bias and MSE results approximate the true bias and MSE quite well. 3 A ‘one ahead and one behind’ matrix has the ith row with non-zero elements only in positions i − 1 and i + 1, i = 2, . . . , n − 1, and the first row has non-zero elements only in positions 2 and n, while for the last row the non-zeros occur only in positions 1 and n − 1. By this, we define the weights matrix in a circular way. The average number of neighbouring units J for the ‘one ahead and one behind’ matrix is hence 2. Similarly, we can define the ‘two ahead and two behind’, ‘three ahead and three behind’ matrices and so on.
C The Author(s). Journal compilation C Royal Economic Society 2009.
239
On skewness and kurtosis of econometric estimators
J
ρ0 2
6
10
Table 3. Sample and theoretical bias, MSE, skewness and kurtosis, n = 200. Bias B MSE M SK γ1 KR
γ2
−0.900 −0.400
0.002 0.000
0.003 0.003
0.000 0.004
0.000 0.004
0.530 0.255
0.684 0.176
0.412 0.028
0.579 0.038
−0.200 0.000
0.002 −0.001
0.002 0.000
0.005 0.005
0.005 0.005
0.202 0.013
0.085 0.000
0.068 −0.004
0.018 0.012
0.200 0.400 0.900
−0.002 −0.006 −0.003
−0.002 −0.003 −0.003
0.004 0.004 0.000
0.005 0.004 0.000
−0.108 −0.228 −0.547
−0.085 −0.176 −0.684
0.118 0.376 0.442
0.018 0.038 0.579
−0.900 −0.400
0.015 −0.008
0.000 −0.007
0.013 0.019
0.019 0.019
0.840 −0.057
0.054 −0.134
0.016 0.015
−0.056 −0.082
−0.200 0.000 0.200
−0.007 −0.009 −0.015
−0.009 −0.011 −0.012
0.018 0.014 0.012
0.018 0.015 0.012
−0.171 −0.196 −0.342
−0.216 −0.306 −0.409
−0.015 −0.180 0.039
−0.055 0.007 0.118
0.400 0.900
−0.015 −0.006
−0.012 −0.006
0.009 0.001
0.009 0.001
−0.581 −0.968
−0.539 −1.942
0.312 1.784
0.313 3.477
−0.900 −0.400 −0.200
0.034 −0.013 −0.019
−0.009 −0.019 −0.021
0.022 0.033 0.032
0.041 0.036 0.032
1.031 −0.182 −0.388
−0.104 −0.351 −0.469
0.413 0.079 0.347
−0.182 0.010 0.190
0.000 0.200
−0.028 −0.020
−0.022 −0.022
0.027 0.020
0.026 0.021
−0.509 −0.571
−0.606 −0.776
0.486 0.369
0.452 0.826
0.400 0.900
−0.025 −0.008
−0.021 −0.009
0.015 0.001
0.014 0.001
−0.593 −1.440
−1.008 −4.991
0.444 4.147
1.378 7.157
Note: See Table 1.
Of course, the behaviours of the first two moments alone do not necessarily indicate how severe the departure of the finite sample distribution of ρˆn from normality is. More conclusive observations can be possibly made by checking the standardized higher moments of ρˆn , namely, γ 1 and γ 2 . In general, to approximate higher moments accurately, bigger sample sizes are needed. Moreover, in small samples, E(ξ 20,i + 2ξ 0,i ξ −1/2,i ) may be negative, so γ 1 is not defined (this corresponds to the missing values for γ 1 in Table 1). Given this, it is not surprising that when n = 30, γ 1 and γ 2 provide in some cases very poor approximations to the true skewness and kurtosis, especially when J is large and/or when ρ 0 is relatively big. Overall, γ 1 seems to provide a better approximation to the skewness coefficient compared with γ 2 (as approximation to the excess kurtosis coefficient) for small J and ρ 0 . Obviously, for a sample of size as small as 30, the behaviour of ρˆn is quite different from a normal density by looking at SK and KR. When we move to a sample of bigger size 100, the performance of γ 1 and γ 2 improves significantly. This improvement can also be seen when n goes from 100 to 200. In either case, SK and KR still indicate that the distribution of ρˆn is far from being normal. In general, when J is small, both γ 1 and γ 2 provide good approximations to SK and KR. When each spatial unit has more neighbours, however, whereas γ 1 still provides reasonable approximation to the skewness of the distribution of ρˆn for moderate ρ 0 , γ 2 proves to approximate the excess kurtosis C The Author(s). Journal compilation C Royal Economic Society 2009.
240
Y. Bao and A. Ullah
very poorly, especially for negatively large ρ 0 . In fact, in these cases, the departure of ρˆn from normality is most severe, as indicated by SK and KR.
4. CONCLUDING REMARKS We have derived new results on the approximate skewness and excess kurtosis coefficients for a large class of econometric estimators. The knowledge of the two relative measures of departure from normality of econometric estimators may not only enable researchers to judge the finite sample behaviour of a particular estimator, but also to compare the finite sample properties of two asymptotically equivalent estimators. Researchers may be tempted to use the two relative measures to construct an Edgeworth-type approximation to the finite sample distribution of the estimator in question. The validity of such an approximation as a distribution function is still an open question to be addressed. In our application, we demonstrate that for the spatial autoregressive model, the departure of the MLE from normality can be quite severe and usually our approximate results capture the true tail behaviours of the MLE quite reasonably well. However, when the departure is most severe, our results do not seem to provide fair approximation in finite samples. As shown in Bao and Ullah (2007a), when a spatial unit is surrounded by many neighbours, the sample estimates of ρˆn are quite noisy and we may need really large sample size to achieve convergence. One of the key assumptions for deriving our √ large-n approximate results is n-convergence of the estimator. Therefore more cautions should be called upon to interpret the empirical results and make first-order inferences when we use a dense weights matrix.
ACKNOWLEDGMENTS The authors would like to thank the co-editor Oliver Linton, two anonymous referees, Prasad Bidarkota, Mohitosh Kejriwal, Gubhinder Kundhi, Jungmin Lee, Peter Thompson, Gautam Tripathi, seminar participants at U Conn, UT Dallas, Florida International, Maryland, Purdue, and conference participants at the 25th Annual Meeting of the Canadian Econometrics Study Group Meeting (Montreal) and the 18th Annual Meeting of the Midwest Econometrics Group (Lawrence) for very helpful comments.
REFERENCES Bao, Y. and A. Ullah (2007a). Finite sample moments of maximum likelihood estimator in spatial models. Journal of Econometrics 137, 396–413. Bao, Y. and A. Ullah (2007b). The second-order bias and mean squared error of estimators in time series models. Journal of Econometrics 140, 650–69. Bao, Y. and A. Ullah (2007c). Expectation of quadratic forms in normal and nonnormal variables with econometric applications. Working paper, Temple University. Basmann, R. L. (1974). Exact finite sample distribution for some econometric estimators and test statistics: a survey and appraisal. In M. D. Intriligator and D. A. Kendrick (Eds.), Frontiers of Quantitative Economics, Volume 2, 209–88. Amsterdam: North-Holland. Fisher, R. A. (1921). On the probable error of a coefficient of correlation deduced from a small sample. Metron 1, 1–32. Ghazal, G. A. (1994). Moments of the ratio of two dependent quadratic forms. Statistics and Probability Letters 30, 313–19. C The Author(s). Journal compilation C Royal Economic Society 2009.
On skewness and kurtosis of econometric estimators
241
Kelejian, H. H. and I. R. Prucha (1999). A generalized moments estimator for the autoregressive parameter in a spatial model. International Economic Review 40, 509–33. Lee, L. F. (2004). Asymptotic distributions of quasi-maximum likelihood estimators for spatial autoregressive models. Econometrica 72, 1899–925. Linton, O. (1997). Asymptotic expansion in the GARCH(1,1) model. Econometric Theory 13, 558–81. McCullagh, P. (1987). Tensor Methods in Statistics. London: Chapman & Hall. Nagar, A. L. (1959). The bias and moment matrix of the general k-class estimators of the parameters in simultaneous equations. Econometrica 27, 575–95. Rilstone, P., V. K. Srivastava and A. Ullah (1996). The second-order bias and mean squared error of nonlinear estimators. Journal of Econometrics 75, 369–95. Rothenberg, T. J. (1984). Approximating the distribution of econometric estimators and test statistics. In Z. Griliches and M. D. Intriligator (Eds.), Handbook of Econometrics, Volume 2, 881–935. Amsterdam: North-Holland. Sargan, J. D. (1974). The validity of Nagar’s expansion for the moments of econometric estimators. Econometrica 42, 169–76. Srinivasan, T. N. (1970). Approximation to finite sample moments of estimators where exact sampling distributions are unknown. Econometrica 38, 533–41. Ullah, A. (2004). Finite Sample Econometrics. New York: Oxford University Press.
APPENDIX A: PROOF √ ∗ Proof of Theorem 2.1: The standardized ith variable Tn,i = [Tn,i − E(Tn,i )]/ Var(Tn )ii has zero mean ∗ and unit variance, and we can easily show that the skewness of βˆn,i is equal to E(T n,i3 ), and its excess ∗3 ∗4 ∗ kurtosis is equal to E(T n,i ) − 3. By Assumption 2.2, E(T n,i ) = O(n−1/2 ) and E(T n,i4 ) − 3 = O(n−1 ). Since we do not know E(T n,i ) and Var(T n ) ii , we can approximate them to a certain order. Corresponding to the expansion (2.3), E(Tn,i ) = ui,−1/2 + o(n−1/2 ), Var(Tn )ii = Var(Tn,i ) = vi,−1/2 + o(n−1/2 ), where u i,−1/2 = E(ξ −1/2,i ) and v i,−1/2 = E(ξ 20,i + 2ξ 0,i ξ −1/2,i ) are the approximate mean and variance √ ∗∗ = (Tn,i − ui,−1/2 )/ vi,−1/2 . of T n,i , up to O(n−1/2 ). Define the approximate standardized statistic Tn,i ∗∗ Obviously, T n,i = O P (1). Moreover, Tn,i − E(Tn,i ) + E(Tn,i ) − ui,−1/2 ∗∗ = Tn,i vi,−1/2 − Var(Tn,i ) + Var(Tn,i ) vi,−1/2 − Var(Tn,i ) −1/2 Tn,i − E(Tn,i ) 1+ = Var(Tn,i ) Var(Tn,i ) vi,−1/2 − Var(Tn,i ) −1/2 E(Tn,i ) − ui,−1/2
1+ + Var(Tn,i ) Var(Tn,i ) v 1 − Var(T ) i,−1/2 n,i ∗ + ··· 1− = Tn,i 2 Var(Tn,i ) E(Tn,i ) − ui,−1/2 1 vi,−1/2 − Var(Tn,i )
+ 1− + ··· 2 Var(Tn,i ) Var(Tn,i ) ∗ + oP (n−1/2 ) = Tn,i
C The Author(s). Journal compilation C Royal Economic Society 2009.
(A.1)
242
Y. Bao and A. Ullah
since T ∗n,i = O P (1), v i,−1/2 − Var(T n,i ) = o(n−1/2 ), u i,−1/2 − E(T n,i ) = o(n−1/2 ), and Var(T n,i ) = O(1). ∗∗3 −1/2 −1/2 −1/2 ), Var(T ∗∗ ) and E(T ∗3 ), i.e. the Immediately, E(T ∗∗ n,i ) = E(T n,i ) + o(n n,i ) = 0 + o(n n,i ) = 1 + o(n ∗∗3 ∗ −1/2 ). Now we third cumulant of T n,i , or that of T n,i , can be approximated by E(T n,i ), up to order O(n expand E(T ∗∗3 n,i ) as follows: ∗∗3 −3/2 E Tn,i = vi,−1/2 E (Tn,i − ui,−1/2 )3 2 −3/2 3 − 3E Tn,i ui,−1/2 + 3E(Tn,i )u2i,−1/2 − u3i,−1/2 = vi,−1/2 E Tn,i 2 −3/2 3 2 (A.2) E(ξ−1/2,i ) + o(n−1/2 ). + 3ξ0,i ξ−1/2,i − 3E ξ0,i = vi,−1/2 E ξ0,i Next, to approximate the kurtosis coefficient of T n,i , alternatively, the fourth moment of T ∗4 n,i , up to order O(n−1 ), we approximate E(T n,i ) and Var(T n ) ii in the definition of T ∗n,i up to order O(n−1 ), E(Tn,i ) = ui,−1 + o(n−1 ), Var(Tn,i ) = vi,−1 + o(n−1 ), where u i,−1 = E(ξ −1/2,i + ξ −1,i ) and v i,−1 = E(ξ 20,i + ξ 2−1/2,i + 2ξ 0,i ξ −1/2,i + 2ξ 0,i ξ −1,i ) − [E(ξ −1/2,i )]2 are the approximate mean and variance of T n,i , up to O(n−1 ). Define the approximate standardized statistic √ ∗∗∗ = (Tn,i − ui,−1 )/ vi,−1 . Obviously, T ∗∗∗ Tn,i n,i = O P (1). Using a similar expansion as (A.1), we can show ∗∗∗ ∗ −1 −1 ∗∗∗ −1 ∗4 ∗∗∗4 T n,i = T n,i + o(n ). Therefore, T ∗∗∗ n,i = 0 + o(n ), Var(T n,i ) = 1 + o(n ) and E(T n,i ) = E(T n,i ) + ) − 3, up to order o(n−1 ), i.e. the fourth cumulant of T ∗n,i , or that of T n,i , can be approximated by E(T ∗∗∗4 n,i ) as follows: O(n−1 ). Now we expand E(T ∗∗∗4 n,i ∗∗∗4 −2 E Tn,i = vi,−1 E (Tn,i − ui,−1 )4 4 3 2 2 −2 = vi,−1 E Tn,i − 4E Tn,i ui,−1 + 6E Tn,i ui,−1 − 4E(Tn,i )u3i,−1 + u4i,−1 4 −2 3 3 2 2 = vi,−1 E ξ0,i + 4ξ0,i ξ−1/2,i + 4ξ0,i ξ−1,i + 6ξ0,i ξ−1/2,i 3 3 2 E(ξ−1,i ) + 3ξ0,i ξ−1/2,i E(ξ−1/2,i ) − 4E ξ0,i − 4E ξ0,i 2 2 −1 (A.3) + 6E ξ0,i [E(ξ−1/2,i )] + o(n ). The skewness and excess kurtosis follow immediately from (A.2) and (A.3).
APPENDIX B: SPATIAL MODEL Using (2.2) and (2.3), we collect the following terms, which are needed in calculating the approximate results in (2.4) (since we have a scalar parameter, in what follows we suppress the subscript i): E ξ02 = nQ2 E ψn2 , E ξ03 = −n3/2 Q3 E ψn3 , E ξ04 = n2 Q4 E ψn4 , √ 1 E(ξ−1/2 ) = n Q2 E(ψn H1 ) − Q3 E(H2 )E ψn2 , 2 1 E(ξ0 ξ−1/2 ) = n Q2 E ψn2 − Q3 E ψn2 H1 + Q4 E(H2 )E ψn3 , 2 2 1 E ξ0 ξ−1/2 = −n3/2 Q3 E ψn3 − Q4 E ψn3 H1 + Q5 E(H2 )E ψn4 , 2 3 5 1 6 2 4 4 5 4 E ξ0 ξ−1/2 = n Q E ψn − Q E ψn H1 + Q E(H2 )E ψn , 2 2 E ξ−1/2 = n Q2 E ψn2 − 2Q3 E ψn2 H1 + Q4 E(H2 )E ψn3 + E ψn2 H12 1 − Q5 E(H2 )E ψn3 H1 + Q6 [E(H2 )]2 E ψn4 , 4 C The Author(s). Journal compilation C Royal Economic Society 2009.
On skewness and kurtosis of econometric estimators
243
2 = −n3/2 Q3 E ψn3 − 2Q4 E ψn3 H1 + Q5 E(H2 )E ψn4 + E ψn3 H12 E ξ0 ξ−1/2 1 − Q6 E(H2 )E ψn4 H1 + Q7 [E(H2 )]2 E ψn5 , 4 2 2 E ξ0 ξ−1/2 = n2 Q4 E ψn4 − 2Q5 E ψn4 H1 + Q6 E(H2 )E ψn5 + E ψn4 H12 5 1 8 6 2 7 − Q E(H2 )E ψn H1 + Q [E(H2 )] E ψn , 4 1 √ E(ξ−1 ) = − n −2Q2 E(ψn H1 ) + Q3 E(H2 )E ψn2 + E ψn H12 + E ψn2 H2 2 1 3 1 3 2 5 E(H3 )E ψn + E(H2 )E ψn H1 + Q [E(H2 )]2 E ψn3 , − Q4 6 2 2 1 E(ξ0 ξ−1 ) = n Q2 E ψn2 − 2Q3 E ψn2 H1 + Q4 E(H2 )E ψn3 + E ψn2 H12 + E ψn3 H2 2 1 3 1 E(H3 )E ψn4 + E(H2 )E ψn3 H1 + Q6 [E(H2 )]2 E ψn4 , − Q5 6 2 2 2 E ξ0 ξ−1 = −n3/2 Q3 E ψn3 − 2Q4 E ψn3 H1 + Q5 E(H2 )E ψn4 + E ψn3 H12 3 1 1 1 + E ψn4 H2 − Q6 E(H3 )E ψn5 + E(H2 )E ψn4 H1 + Q7 [E(H2 )]2 E ψn5 , 2 6 2 2 3 1 E ξ0 ξ−1 = n2 Q4 E ψn4 − 2Q5 E ψn4 H1 + Q6 E(H2 )E ψn5 + E ψn4 H12 + E ψn5 H2 2 1 3 1 2 6 5 8 6 7 E(H3 )E ψn + E(H2 )E ψn H1 + Q [E(H2 )] E ψn . −Q 6 2 2 To work out the expectations as given above, let λij = E[(ε M1 ε)i (ε M2 ε)j /(ε ε)i+j ], then by substituting ψ n and H i , we can write all the expectations in terms of λ ij :
Q= E(H2 ) = E(H3 ) = E ψn2 = E ψn3 = E ψn4 = E ψn5 = E ψn6 = E(ψn H1 ) =
−1
1 1 , b2 − λ01 + λ20 2 2 3 b3 + λ11 − λ30 , 2 3 b4 + λ02 − 6λ21 + 3λ40 , 2 1 b12 − b1 λ10 + λ20 , 4 3 2 3 1 3 b1 − b1 λ10 + b1 λ20 − λ30 , 2 4 8 3 2 1 1 3 4 b1 − 2b1 λ10 + b1 λ20 − b1 λ30 + λ40 , 2 2 16 5 5 5 5 1 b15 − b14 λ10 + b13 λ20 − b12 λ30 + b1 λ40 − λ50 , 2 2 4 16 32 15 5 15 3 1 b16 − 3b15 λ10 + b14 λ20 − b13 λ30 + b12 λ40 − b1 λ50 + λ60 , 4 2 16 16 64 1 1 1 1 1 b1 b2 − b1 λ01 − b2 λ10 + λ11 + b1 λ20 − λ30 , 2 2 4 2 4
C The Author(s). Journal compilation C Royal Economic Society 2009.
244
Y. Bao and A. Ullah
1 1 1 1 1 1 1 E ψn2 H1 = b12 b2 − b12 λ01 − b1 b2 λ10 + b1 λ11 + b12 + b2 λ20 − λ21 − b1 λ30 + λ40 , 2 2 2 4 8 2 8
3 1 3 3 1 3 3 b3 + b1 b2 λ20 − b1 λ21 E ψn H1 = b13 b2 − b13 λ01 − b12 b2 λ10 + b12 λ11 + 2 2 4 2 1 4 8
3 2 1 1 3 1 b + b2 λ30 + λ31 + b1 λ40 − λ50 , − 4 1 8 16 8 16
4 1 4 1 4 3 2 3 3 3 4 b + b b2 λ20 − b12 λ21 E ψn H1 = b1 b2 − b1 λ01 − 2b1 b2 λ10 + b1 λ11 + 2 2 1 2 1 4
1 1 3 1 1 1 1 b12 + b2 λ40 − λ41 − b1 λ50 + λ60 , − b13 + b1 b2 λ30 + b1 λ31 + 2 4 4 16 32 4 32
5 1 5 5 1 5 5 5 4 4 5 3 3 b + b b2 λ20 − b1 λ21 E ψn H1 = b15 b2 − b1 λ01 − b1 b2 λ10 + b1 λ11 + 2 2 4 2 1 2 1 4
5 4 5 2 5 2 5 3 5 5 b + b b2 λ30 + b1 λ31 + b + b1 b2 λ40 − b1 λ41 − 4 1 4 1 8 4 1 16 32
5 2 1 1 5 1 − b + b2 λ50 + λ51 + b1 λ60 − λ70 , 8 1 32 64 32 64 1 1 1 1 1 E ψn H12 = b1 b22 − b1 b2 λ01 + b1 λ02 − b22 λ10 + b2 λ11 − λ12 + b1 b2 λ20 − b1 λ21 4 2 2 8 2 1 1 1 1 − b2 λ30 + λ31 + b1 λ40 − λ50 , 2 4 4 8
2 2 1 2 1 1 2 2 2 E ψn H1 = b1 b2 − b1 b2 λ01 + b1 λ02 − b1 b22 λ10 + b1 b2 λ11 − b1 λ12 + b12 b2 + b22 λ20 4 4 4
1 2 1 1 1 1 2 1 1 − b + b2 λ21 + λ22 − b1 b2 λ30 + b1 λ31 + b + b2 λ40 − λ41 2 1 4 16 2 4 1 4 8 1 1 − b1 λ50 + λ60 , 4 16 3 2 1 3 3 3 3 2 3 E ψn H1 = b1 b2 − b1 b2 λ01 + b13 λ02 − b12 b22 λ10 + b12 b2 λ11 − b12 λ12 4 2 2 8
3 1 3 3 3 3 2 1 2 3 + b1 b2 + b1 b2 λ20 − b1 + b1 b2 λ21 + b1 λ22 − b1 b2 + b22 λ30 4 2 4 16 2 8
3 2 1 1 1 3 3 3 b + b2 λ31 − λ32 + b + b1 b2 λ40 − b1 λ41 + 4 1 8 32 4 1 4 8
3 2 1 1 3 1 b + b2 λ50 + λ51 + b1 λ60 − λ70 , − 8 1 8 16 16 32 4 2 1 1 E ψn H1 = b14 b22 − b14 b2 λ01 + b14 λ02 − 2b13 b22 λ10 + 2b13 b2 λ11 − b13 λ12 4 2
3 2 2 1 4 3 2 3 2 1 4 + b1 b2 + b1 b2 λ20 − b1 + b1 b2 λ21 + b1 λ22 − 2b13 b2 + b1 b22 λ30 2 2 2 8 2
1 1 1 4 3 2 1 2 3 b + b b2 + b2 λ40 + b1 + b1 b2 λ31 − b1 λ32 + 2 8 4 1 2 1 16
3 2 1 1 1 3 1 1 b + b2 λ41 + λ42 − b + b1 b2 λ50 + b1 λ51 − 4 1 16 64 2 1 2 4
3 2 1 1 1 1 b + b2 λ60 − λ61 − b1 λ70 + λ80 , + 8 1 16 32 8 64
C The Author(s). Journal compilation C Royal Economic Society 2009.
245
On skewness and kurtosis of econometric estimators 3 1 3 3 1 E ψn2 H2 = b12 b3 − b1 b3 λ10 + b12 λ11 + b3 λ20 − b1 λ21 − b12 λ30 + λ31 + b1 λ40 − λ50 , 2 4 2 8 4
3 3 2 3 3 3 9 2 1 9 3 3 E ψn H2 = b1 b3 − b1 b3 λ10 + b1 λ11 + b1 b3 λ20 − b1 λ21 − b1 + b3 λ30 + b1 λ31 2 2 4 4 8 8 3 2 3 3 1 + b1 λ40 − λ41 − b1 λ50 + λ60 , 2 16 4 8
4 3 4 3 2 1 9 3 4 E ψn H2 = b1 b3 − 2b1 b3 λ10 + b1 λ11 + b1 b3 λ20 − 3b13 λ21 − b14 + b1 b3 λ30 + b12 λ31 2 2 2 4
1 3 3 3 1 1 + 2b13 + b3 λ40 − b1 λ41 − b12 λ50 + λ51 + b1 λ60 − λ70 , 16 4 2 32 2 16
5 5 3 5 15 5 E ψn H2 = b15 b3 − b14 b3 λ10 + b15 λ11 + b13 b3 λ20 − b14 λ21 − b15 + b12 b3 λ30 2 2 2 4 4
15 3 5 4 5 15 2 5 3 1 b1 λ31 + b1 + b1 b3 λ40 − b1 λ41 − b1 + b3 λ50 + 4 2 16 8 2 32 15 5 2 3 5 1 b1 λ51 + b1 λ60 − λ61 − b1 λ70 + λ80 . + 32 4 64 16 32
So we need to evaluate λ ij , moments of cross-products of ratios of quadratic forms in the normal vector ε ∼ N (0, σ 02 I ). Replacing ε with ε/σ 0 in the definition of λ ij does not change the expectations, so we rewrite λij = E[(ε M1 ε)i (ε M2 ε)j /(ε ε)i+j ], where ε ∼ N (0, I ). Since M 1 and M 2 are symmetric and (trivially) both are commutative with I (the matrix in the quadratic form for the denominator), we can use immediately the separation result from Bao and Ullah (2007c): E[(ε M1 ε)i (ε M2 ε)j ] (ε M1 ε)i (ε M2 ε)j . (B.1) = E i+j (ε ε) E[(ε ε)i+j ] i+j −1 We can easily verify E[(ε ε)i+j ] = n (i+j ) = k=0 (n + 2k). As for the numerator E[(ε M1 ε)i (ε M2 ε)j ], moments of products of quadratic forms, we can utilize the recursive algorithm in Ghazal (1994) and its generalization in Bao and Ullah (2007c): for ε ∼ N (0, I ) and symmetric matrices A i , q q ε Ai ε = E(ε A1 ε) · E ε Ai ε E i=1
q
+2
i=2
⎛
E ⎝ε Aj A1 ε · ε A2 ε · · · ε Aj −1 ε ·
j =2
q k=j +1
⎞ ε Ak ε⎠ .
(B.2)
Given the separation result (B.1) and the recursive algorithm (B.2), we collect in the following the exact expressions of λ ij , in terms of products of traces of matrices involving M 1 and M 2 : 4 n(1) λ01 = trM2 , n(1) λ10 = trM1 , n(2) λ20 = 2trM12 + (trM1 )2 , n(2) λ02 = 2trM22 + (trM2 )2 , n(3) λ30 = 8trM13 + 6trM1 · trM12 + (trM1 )3 ,
4 Note that in Bao and Ullah (2007a), a different approach, namely, the top-order invariant polynomial approach, was used to derive the second-order bias and MSE of ρˆn . C The Author(s). Journal compilation C Royal Economic Society 2009.
246
Y. Bao and A. Ullah 2 n(4) λ40 = 48trM14 + 32trM1 · trM13 + 12trM12 · (trM1 )2 + 12 trM12 + (trM1 )4 , n(5) λ50 = 384trM15 + 240trM1 · trM14 + 160trM12 · trM13 + 80 (trM1 )2 · trM13 2 + 60trM1 · trM12 + 20trM12 · (trM1 )3 + (trM1 )5 , n(6) λ60 = 3840trM16 + 2304trM1 · trM15 + 1440trM12 · trM14 + 960trM1 · trM12 · trM13 2 2 + 720 (trM1 )2 · trM14 + 640 trM13 + 180 (trM1 )2 · trM12 + 160 (trM1 )3 · trM13 3 + 120 trM12 + 30 (trM1 )4 · trM12 + (trM1 )6 , n(7) λ70 = 46080trM17 + 26880trM1 · trM16 + 16128trM12 · trM15 + 13440trM13 · trM14 2 + 10080trM1 · trM12 · trM14 + 8064(trM1 )2 · trM15 + 4480trM1 · trM13 2 2 + 3360 trM12 · trM13 + 1680(trM1 )3 · trM14 + 960 trM12 · trM13 + 840trM1 3 2 · trM12 + 420(trM1 )3 · trM12 + 280(trM1 )4 · trM13 + 42(trM1 )5 · trM12 + (trM1 )7 , n(8) λ80 = 645120trM18 + 368640trM1 · trM17 + 215040trM12 · trM16 + 172032trM13 · trM15 + 129024trM1 · trM12 · trM15 + 107520(trM1 )2 · trM16 + 107520trM1 · trM13 · trM14 2 2 + 80640 trM14 + 40320(trM1 )2 · trM12 · trM14 + 24480trM1 · trM12 · trM13 2 2 + 23520 trM12 · trM14 + 21504(trM1 )3 · trM15 + 19040trM12 · trM13 2 + 17920(trM1 )2 · trM13 + 8960(trM1 )3 · trM12 · trM13 + 3360(trM1 )4 · trM14 3 4 2 + 3360(trM1 )2 · trM12 + 1680 trM12 + 840(trM1 )4 · trM12 + 448(trM1 )5 · trM13 + 56(trM1 )6 · trM12 (trM1 )8 , n(2) λ11 = 2trM1 M2 + trM1 · trM2 , n(3) λ12 = 8trM1 M22 + 4trM1 M2 · trM2 + 2trM1 · trM22 + trM1 · (trM2 )2 , n(3) λ21 = 8trM12 M2 + 4trM1 M2 · trM1 + 2trM2 · trM12 + trM2 · (trM1 )2 , n(4) λ31 = 48trM13 M2 + 24trM12 M2 · trM1 + 12trM1 M2 · trM12 + 8trM13 · trM2 + 6trM1 · trM2 · trM12 + 6trM1 M2 · (trM1 )2 + (trM1 )3 · trM2 , n(4) λ22 = 32trM12 M22 + 16trM1 · trM1 M22 + 16trM2 · trM12 M2 + 16trM1 M2 M1 M2 + 8trM1 · trM2 · trM1 M2 + 8 (trM1 M2 )2 + 4trM12 · trM22 + 2trM12 · (trM2 )2 + 2trM22 · (trM1 )2 + (trM1 )2 · (trM2 )2 , n(5) λ41 = 384trM14 M2 + 192trM1 · trM13 M2 + 96trM12 M2 · trM12 + 64trM1 M2 · trM13 + 48trM14 · trM2 + 48trM12 M2 · (trM1 )2 + 48trM1 · trM12 · trM1 M2 2 + 32trM1 · trM13 · trM2 + 12 (trM1 )2 · trM12 · trM2 + 12 trM12 · trM2 + 8 (trM1 )3 · trM1 M2 + (trM1 )4 · trM2 , n(5) λ32 = 192trM13 M22 + 192trM1 M2 M12 M2 + 96trM1 · trM12 M22 + 96trM12 M2 · trM1 M2 + 96trM13 M2 · trM2 + 48trM1 · trM1 M2 M1 M2 + 48trM12 · trM1 M22 + 48trM1 · trM12 M2 · trM2 + 24(trM1 )2 · trM1 M22 + 24trM12 · trM1 M2 · trM2 + 24trM1 · (trM1 M2 )2 + 16trM13 · trM22 + 12(trM1 )2 · trM1 M2 · trM2 + 12trM1 · trM12 · trM22 + 8trM13 · (trM2 )2 + 6trM1 · trM12 · (trM2 )2 + 2(trM1 )3 · trM22 + (trM1 )3 (trM2 )2 , C The Author(s). Journal compilation C Royal Economic Society 2009.
On skewness and kurtosis of econometric estimators
247
n(6) λ42 = 1536trM14 M22 + 1536trM1 M2 M13 M2 + 768trM1 · trM13 M22 + 768trM12 M2 M12 M2 + 768trM13 M2 · trM1 M2 + 768trM14 M2 · trM2 + 768trM1 · trM1 M2 M12 M2 2 + 384 trM12 M2 + 384trM1 · trM12 M2 · trM1 M2 + 384trM1 · trM13 M2 · trM2 + 384trM12 · trM12 M22 + 256trM13 · trM1 M22 + 192(trM1 )2 · trM12 M22 + 192trM12 · trM1 M2 M1 M2 + 192trM1 · trM12 · trM1 M22 + 192trM12 · trM12 M2 · trM2 + 128trM13 · trM1 M2 · trM2 + 96trM12 · (trM1 M2 )2 + 96(trM1 )2 · trM1 M2 M1 M2 + 96(trM1 )2 · trM12 M2 · trM2 + 96trM1 · trM12 · trM1 M2 · trM2 + 96trM14 · trM22 + 64trM1 · trM13 · trM22 + 48(trM1 )2 · (trM1 M2 )2 + 48trM14 · (trM2 )2
2 + 32(trM1 )3 · trM1 M22 + 32trM1 · trM13 · (trM2 )2 + 24(trM1 )2 · trM12 · trM22 + 24 trM12 · trM22 2 + 16(trM1 )3 · trM1 M2 · trM2 + 12(trM1 )2 · trM12 · (trM2 )2 + 12 trM12 · (trM2 )2 + 2(trM1 )4 · trM22 + (trM1 )4 (trM2 )2 , n(6) λ51 = 3840trM15 M2 + 1920trM1 · trM14 M2 + 960trM12 · trM13 M2 + 640trM13 · trM12 M2 + 480(trM1 )2 · trM13 M2 + 480trM1 · trM12 · trM12 M2 + 480trM14 · trM1 M2 + 384trM15 · trM2 + 320trM1 · trM13 · trM1 M2 + 240trM1 · trM14 · trM2 2 + 160trM12 · trM13 · trM2 + 120(trM1 )2 · trM12 · trM1 M2 + 120 trM12 · trM1 M2 2 + 80(trM1 )3 · trM12 M2 + 80(trM1 )2 · trM13 · trM2 + 60trM1 · trM12 · trM2 + 20(trM1 )3 · trM12 · trM2 + 10(trM1 )4 · trM1 M2 + (trM1 )5 · trM2 , n(7) λ61 = 46080trM16 M2 + 23040trM1 · trM15 M2 + 11520trM12 · trM14 M2 + 7680trM13 · trM13 M2 + 5760(trM1 )2 · trM14 M2 + 5760trM1 · trM12 · trM13 M2 + 5760trM14 · trM12 M2 + 4608trM15 · trM1 M2 + 3840trM1 · trM13 · trM12 M2 + 3840trM16 · trM2 + 2880trM1 · trM14 · trM1 M2 + 2304trM1 · trM15 · trM2 + 1920trM12 · trM13 · trM1 M2 2 + 1440(trM1 )2 · trM12 · trM12 M2 + 1440 trM12 · trM12 M2 + 1440trM12 · trM14 · trM2 + 960(trM1 )2 · trM13 · trM1 M2 + 960trM1 · trM12 · trM13 · trM2 2 + 960(trM1 )3 · trM13 M2 + 720trM1 · trM12 · trM1 M2 + 720(trM1 )2 · trM14 · trM2 2 2 + 640 trM13 · trM2 + 240(trM1 )3 · trM12 · trM1 M2 + 180(trM1 )2 · trM12 · trM2 3 + 160(trM1 )3 · trM13 · trM2 + 120(trM1 )4 · trM12 M2 + 120 trM12 · trM2 + 30(trM1 )4 · trM12 · trM2 + 12(trM1 )5 · trM1 M2 + (trM1 )6 · trM2 . In summary, given ρ 0 and W, the steps to evaluate the finite sample skewness and excess kurtosis of ρˆn are as follows: (1) (2) (3) (4)
Calculate λ ij as given above. Plug λ ij into Q and into expectations of terms involving ψ n , H 1 , H 2 and H 3 . Plug the results from Step 2 into expectations of terms involving ξ 0 , ξ −1/2 and ξ −1 . Plug the results from Step 3 to (2.4) to calculate the approximate results.
C The Author(s). Journal compilation C Royal Economic Society 2009.
The
Econometrics Journal Econometrics Journal (2009), volume 12, pp. 248–271. doi: 10.1111/j.1368-423X.2009.00292.x
Adaptive pointwise estimation in time-inhomogeneous conditional heteroscedasticity models ‡ AND V. S POKOINY § ¨ P. Cˇ ´I Zˇ EK † , W. H ARDLE †
Department of Econometrics & OR, Tilburg University, P.O. Box 90153, 5000LE Tilburg, The Netherlands E-mail:
[email protected]
‡
Humboldt-Universit¨at zu Berlin and CASE, Spandauerstrasse 1, 10178 Berlin, Germany E-mail:
[email protected] §
Weierstrass-Institute, Humboldt-Universit¨at zu Berlin and CASE, Mohrenstrasse 39, 10117 Berlin, Germany E-mail:
[email protected] First version received: April 2008; final version accepted: April 2009
Summary This paper offers a new method for estimation and forecasting of the volatility of financial time series when the stationarity assumption is violated. Our general, local parametric approach particularly applies to general varying-coefficient parametric models, such as GARCH, whose coefficients may arbitrarily vary with time. Global parametric, smooth transition and change-point models are special cases. The method is based on an adaptive pointwise selection of the largest interval of homogeneity with a given right-end point by a local change-point analysis. We construct locally adaptive estimates that can perform this task and investigate them both from the theoretical point of view and by Monte Carlo simulations. In the particular case of GARCH estimation, the proposed method is applied to stock-index series and is shown to outperform the standard parametric GARCH model. Keywords: Adaptive pointwise estimation, Autoregressive models, Conditional heteroscedasticity models, Local time-homogeneity.
1. INTRODUCTION A growing amount of econometrical and statistical research is devoted to modelling financial time series and their volatility, which measures dispersion at a point in time (i.e. conditional variance). Although many economies and financial markets have been recently experiencing many shorter and longer periods of instability or uncertainty such as the Asian crisis (1997), the Russian crisis (1998), the start of the European currency (1999), the ‘dot-Com’ technologybubble crash (2000–02) or the terrorist attacks (September, 2001), the war in Iraq (2003) and the current global recession (2008), mostly used econometric models are based on the assumption of time homogeneity. This includes linear and non-linear autoregressive (AR) and movingaverage models and conditional heteroscedasticity (CH) models such as ARCH (Engel, 1982) C The Author(s). Journal compilation C Royal Economic Society 2009. Published by Blackwell Publishing Ltd, 9600 Garsington Road,
Oxford OX4 2DQ, UK and 350 Main Street, Malden, MA, 02148, USA.
Adaptive estimation in CH models
249
and GARCH (Bollerslev, 1986), stochastic volatility models (Taylor, 1986), as well as their combinations such as AR-GARCH. On the other hand, the market and institutional changes have long been assumed to cause structural breaks in financial time series, which was confirmed, e.g. in data on stock prices (Andreou and Ghysels, 2002, and Beltratti and Morana, 2004) and exchange rates (Herwatz and Reimers, 2001). Moreover, ignoring these breaks can adversely affect the modelling, estimation and forecasting of volatility as suggested e.g. by Diebold and Inoue (2001), Mikosch and Starica (2004), Pesaran and Timmermann (2004) and Hillebrand (2005). Such findings led to the development of the change-point analysis in the context of CH models; see e.g. Chen and Gupta (1997), Kokoszka and Leipus (2000) and Andreou and Ghysels (2006). An alternative approach lies in relaxing the assumption of time homogeneity and allowing some or all model parameters to vary over time (Chen and Tsay, 1993, Cai et al., 2000, and Fan and Zhang, 2008). Without structural assumptions about the transition of model parameters over time, time-varying coefficient models have to be estimated non-parametrically, e.g. under the identification condition that their parameters are smooth functions of time (Cai et al., 2000). In this paper, we follow a different strategy based on the assumption that a time series can be locally, i.e. over short periods of time, approximated by a parametric model. As suggested by Spokoiny (1998), such a local approximation can form a starting point in the search for the longest period of stability (homogeneity), i.e. for the longest time interval in which the series is described well by the parametric model. In the context of the local constant approximation, this strategy was employed for volatility modelling by H¨ardle et al. (2003), Mercurio and Spokoiny (2004) and Spokoiny (2009a). Our aim is to generalize this approach so that it can identify intervals of homogeneity for any parametric CH model regardless of its complexity. In contrast to the local constant approximation of the volatility of a process (Mercurio and Spokoiny, 2004), the main benefit of the proposed generalization consists in the possibility to apply the methodology to a much wider class of models and to forecast over a longer time horizon. The reason is that approximating the mean or volatility process by a constant is in many cases too restrictive or even inappropriate and it is fulfilled only for short time intervals, which precludes its use for longer-term forecasting. On the contrary, parametric models like GARCH mimic the majority of stylized facts about financial time series and can reasonably fit the data over rather long periods of time in many practical situations. Allowing for time dependence of model parameters offers then much more flexibility in modelling real-life time series, which can be both with or without structural breaks since global parametric models are included as a special case. Moreover, the proposed adaptive local parametric modelling unifies the change-point and varying-coefficient models. First, since finding the longest time-homogeneous interval for a parametric model at any point in time corresponds to detecting the most recent change-point in a time series, this approach resembles the change-point modelling as in Bai and Perron (1998) or Mikosch and Starica (1999, 2004), for instance, but it does not require prior information such as the number of changes. 
Additionally, the traditional structural-change tests require that the number of observations before each break point is large (and can grow to infinity) as these tests rely on asymptotic results. On the contrary, the proposed pointwise adaptive estimation does not rely on asymptotic results and does not thus place any requirements on the number of observations before, between or after any break point. Second, since the adaptively selected time-homogeneous interval used for estimation necessarily differs at each time point, the model coefficients can arbitrarily vary over time. In comparison to varying-coefficient models assuming
C The Author(s). Journal compilation C Royal Economic Society 2009.
250
ˇ ızˇek, W. H¨ardle and V. Spokoiny P. C´
smooth development of parameters over time (Cai et al., 2000), our approach however allows for structural breaks in the form of sudden jumps in parameter values. Although seemingly straightforward, extending Mercurio and Spokoiny’s (2004) procedure to the local parametric modelling is a non-trivial problem, which requires new tools and techniques. We concentrate here on the change-point estimation of financial time series, which are often modelled by data-demanding models such as GARCH. While the benefits of a flexible change-point analysis for time series spanning several years are well known, its feasibility (which stands in the focus of this work) is much more difficult to achieve. The reason is thus that, at each time point, the procedure starts from a small interval, where a local parametric approximation holds, and then iteratively extends this interval and tests it for time-homogeneity until a structural break is found or data exhausted. Hence, a model has to be initially estimated on very short time intervals (e.g. 10 observations). Using standard testing methods, such a procedure might be feasible for simple parametric models, but it is hardly possible for more complex parametric models such as GARCH that generally require rather large samples for reasonably good estimates. Therefore, we use an alternative and more robust approach to local change-point analysis that relies on a finite-sample theory of testing a growing sequence of historical time intervals on homogeneity against a change-point alternative. The proposed adaptive pointwise estimation procedure applies to a wide class of time-series models, including AR and CH models. Concentrating on the latter, we describe in details the adaptive procedure, derive its basic properties, and focusing on the feasibility of adaptive estimation for CH models, study the performance in comparison to the parametric (G)ARCH by means of simulations and real-data applications. The main conclusion is two-fold: on one hand, the adaptive pointwise estimation is feasible and beneficial also in the case of data-demanding models such as GARCH; on the other hand, the adaptive estimates based on various parametric models such as constant, ARCH or GARCH models are much closer to each other (while being better than the usual parametric estimates), which eliminates to some extent the need for using too complex models in adaptive estimation. The rest of the paper is organized as follows. In Section 2, the parametric estimation of CH models and its finite-sample properties are introduced. In Section 3, we define the adaptive pointwise estimation procedure and discuss the choice of its parameters. Theoretical properties of the method are discussed in Section 4. In the specific case of the ARCH(1) and GARCH(1,1) models, a simulation study illustrates the performance of the new methodology with respect to the standard parametric and change-point models in Section 5. Applications to real stock-index series data are presented in Section 6. The proofs are provided in the Appendix.
2. PARAMETRIC CONDITIONAL HETEROSCEDASTICITY MODELS

Consider a time series Y_t in discrete time, t ∈ N. The CH assumption means that Y_t = σ_t ε_t, where {ε_t}_{t∈N} is a white noise process and {σ_t}_{t∈N} is a predictable volatility (conditional variance) process. Modelling of the volatility process σ_t typically relies on some parametric CH specification such as the ARCH (Engle, 1982) and GARCH (Bollerslev, 1986) models:

$$\sigma_t^2 = \omega + \sum_{i=1}^{p} \alpha_i Y_{t-i}^2 + \sum_{j=1}^{q} \beta_j \sigma_{t-j}^2, \qquad (2.1)$$
where p ∈ N, q ∈ N and θ = (ω, α_1, ..., α_p, β_1, ..., β_q) is the parameter vector. An attractive feature of this model is that, even with very few coefficients, one can model most stylized facts of financial time series like volatility clustering or excessive kurtosis, for instance. A number of (G)ARCH extensions were proposed to make the model even more flexible; e.g. EGARCH (Nelson, 1991), QGARCH (Sentana, 1995) and TGARCH (Glosten et al., 1993) that account for asymmetries in a volatility process.

All such CH models can be put into a common class of generalized linear volatility models:

$$Y_t = \sigma_t \varepsilon_t = g(X_t)\varepsilon_t, \qquad (2.2)$$

$$X_t = \omega + \sum_{i=1}^{p} \alpha_i h(Y_{t-i}) + \sum_{j=1}^{q} \beta_j X_{t-j}, \qquad (2.3)$$
where g and h are known functions and X_t is a (partially) unobserved process (structural variable) that models the volatility coefficient σ_t^2 via the transformation g: σ_t^2 = g(X_t). For example, the GARCH model (2.1) is described by g(u) = u and h(r) = r^2.

Models (2.2)–(2.3) are time homogeneous in the sense that the process Y_t follows the same structural equation at each time point. In other words, the parameter θ and hence the structural dependence in Y_t is constant over time. Even though models like (2.2)–(2.3) can often fit data well over a longer period of time, the assumption of homogeneity is too restrictive in practical applications: to guarantee a sufficient amount of data for sufficiently precise estimation, these models are often applied over time spans of many years. On the contrary, the strategy pursued here requires only local time homogeneity, which means that at each time point t there is a (possibly rather short) interval [t − m, t], where the process Y_t is well described by models (2.2)–(2.3). This strategy aims then both at finding an interval of homogeneity (preferably as long as possible) and at the estimation of the corresponding parameter values θ, which then enable predicting Y_t and X_t.

Next, we discuss the parameter estimation for models (2.2)–(2.3) using observations Y_t from some time interval I = [t_0, t_1]. The conditional distribution of each observation Y_t given the past F_{t−1} is determined by the structural variable X_t, whose dynamics are described by the parameter vector θ: X_t = X_t(θ) for t ∈ I due to (2.3). We denote the underlying value of θ by θ_0. For estimating θ_0, we apply the quasi-maximum likelihood (quasi-MLE) approach using the estimating equations generated under the assumption of Gaussian errors ε_t. This guarantees efficiency under the normality of innovations and consistency under rather general moment conditions (Hansen and Lee, 1994, and Francq and Zakoian, 2007). The log-likelihood for models (2.2)–(2.3) on an interval I can be represented in the form

$$L_I(\theta) = \sum_{t \in I} \ell\{Y_t, g[X_t(\theta)]\}$$
with log-likelihood function ℓ(y, υ) = −0.5{log(υ) + y²/υ}. We define the quasi-MLE estimate θ̃_I of the parameter θ by maximizing the log-likelihood L_I(θ),

$$\tilde{\theta}_I = \mathop{\mathrm{argmax}}_{\theta \in \Theta} L_I(\theta) = \mathop{\mathrm{argmax}}_{\theta \in \Theta} \sum_{t \in I} \ell\{Y_t, g[X_t(\theta)]\}, \qquad (2.4)$$

and denote by L_I(θ̃_I) the corresponding maximum.
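To fix ideas, the quasi-MLE step (2.4) can be sketched in a few lines; the paper's own computations use the G@RCH package in Ox, so the following Python sketch is purely illustrative. The sample-variance initialization of the latent variance and the box constraints standing in for the compact set Θ are our assumptions, not part of the original procedure.

```python
import numpy as np
from scipy.optimize import minimize

def garch11_loglik(theta, y):
    """Gaussian quasi-log-likelihood L_I(theta) of (2.4) for GARCH(1,1) on the
    window y, with ell(y, v) = -0.5 * (log v + y^2 / v)."""
    omega, alpha, beta = theta
    y = np.asarray(y, dtype=float)
    sigma2 = np.empty(len(y))
    sigma2[0] = np.var(y)                      # start value for the latent variance (our choice)
    for t in range(1, len(y)):
        sigma2[t] = omega + alpha * y[t - 1] ** 2 + beta * sigma2[t - 1]
    return -0.5 * np.sum(np.log(sigma2) + y ** 2 / sigma2)

def fit_garch11(y):
    """Quasi-MLE (2.4): maximize L_I(theta) over a compact parameter set."""
    res = minimize(lambda th: -garch11_loglik(th, y),
                   x0=np.array([0.1 * np.var(y), 0.1, 0.8]),
                   bounds=[(1e-6, None), (0.0, 0.999), (0.0, 0.999)],
                   method="L-BFGS-B")
    return res.x, -res.fun                     # theta-tilde_I and the maximum L_I(theta-tilde_I)
```

The returned pair corresponds to θ̃_I and the fitted log-likelihood L_I(θ̃_I), both of which are used repeatedly below.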
To characterize the quality of estimating the parameter vector θ_0 = (ω, α_1, ..., α_p, β_1, ..., β_q) by θ̃_I, we now present an exact (non-asymptotic) exponential risk bound. This bound concerns the value of maximum L_I(θ̃_I) = max_{θ∈Θ} L_I(θ) rather than the point of maximum θ̃_I. More precisely, we consider the difference L_I(θ̃_I, θ_0) = L_I(θ̃_I) − L_I(θ_0). By definition, this value is non-negative and represents the deviation of the maximum of the log-likelihood process from its value at the 'true' point θ_0. Later, we comment on how the accuracy of estimation of the parameter θ_0 by θ̃_I relates to the value L_I(θ̃_I, θ_0). We will also see that the bound for L_I(θ̃_I, θ_0) yields the confidence set for the parameter θ_0, which will be used for the proposed change-point test.

Now, the non-asymptotic risk bound is specified in the following theorem, which formulates corollaries 4.2 and 4.3 of Spokoiny (2009b) for the case of the quasi-MLE estimation of a CH model (2.2)–(2.3) at θ = θ_0. The result can be viewed as an extension of the Wilks phenomenon that the distribution of L_I(θ̃_I, θ_0) for a linear Gaussian model is χ²_p/2, where p is the number of estimated parameters in the model.

THEOREM 2.1. Assume that the process Y_t follows models (2.2)–(2.3) with the parameter θ_0 ∈ Θ, where the set Θ is compact. The function g(·) is assumed to be continuously differentiable with the uniformly bounded first derivative and g(x) ≥ δ > 0 for all x. Further, let the process X_t(θ) be sub-ergodic in the sense that for any smooth function f(·) there exists f* such that for any time interval I

$$E_{\theta_0} \Big| \sum_{t \in I} \big\{ f(X_t(\theta)) - E_{\theta_0} f(X_t(\theta)) \big\} \Big|^2 \le f^* |I|, \qquad \theta \in \Theta.$$

Finally, let E[exp{κ(ε_t² − 1)} | F_{t−1}] ≤ c(κ) for some κ > 0, c(κ) > 0, and all t ∈ N. Then there are λ > 0 and e(λ, θ_0) > 0 such that for any interval I and z > 0

$$P_{\theta_0}\big( L_I(\tilde{\theta}_I, \theta_0) > z \big) \le \exp\{ e(\lambda, \theta_0) - \lambda z \}. \qquad (2.5)$$

Moreover, for any r > 0, there is a constant R_r(θ_0) such that

$$E_{\theta_0} \big| L_I(\tilde{\theta}_I, \theta_0) \big|^r \le R_r(\theta_0). \qquad (2.6)$$
REMARK 2.1. The condition g(x) ≥ δ > 0 guarantees that the variance process cannot reach zero. In the case of GARCH, it is sufficient to assume ω > 0, for instance.

One attractive feature of Theorem 2.1, formulated in the following corollary, is that it enables constructing the non-asymptotic confidence sets and testing the parametric hypothesis on the basis of the fitted log-likelihood L_I(θ̃_I, θ). This feature is especially important for our procedure presented in Section 3.

COROLLARY 2.1. Under the assumptions of Theorem 2.1, let the value z_α fulfil e(λ, θ_0) − λz_α < log α for some α < 1. Then the random set E_I(z_α) = {θ : L_I(θ̃_I, θ) ≤ z_α} is an α-confidence set for θ_0 in the sense that P_{θ_0}(θ_0 ∉ E_I(z_α)) ≤ α.

Theorem 2.1 also gives a non-asymptotic and fixed upper bound for the risk of estimation L_I(θ̃_I, θ_0) that applies to an arbitrary sample size |I|. To understand the relation of this result to the classical rate result, we can apply the standard arguments based on the quadratic expansion
of the log-likelihood L(θ̃, θ). Let ∇²L(θ) denote the Hessian matrix of the second derivatives of L(θ) with respect to the parameter θ. Then

$$L_I(\tilde{\theta}_I, \theta_0) = 0.5\, (\tilde{\theta}_I - \theta_0)^{\top} \nabla^2 L_I(\theta_I^{*}) (\tilde{\theta}_I - \theta_0), \qquad (2.7)$$

where θ*_I is a convex combination of θ_0 and θ̃_I. Under usual regularity assumptions and for sufficiently large |I|, the normalized matrix |I|^{−1}∇²L_I(θ) is close to some matrix V(θ), which depends only on the stationary distribution of Y_t and is continuous in θ. Then (2.5) approximately means that ‖V(θ_0)^{1/2}(θ̃_I − θ_0)‖² ≤ z/|I| with probability close to 1 for large z. Hence, the large deviation result of Theorem 2.1 yields the root-|I| consistency of the MLE estimate θ̃_I. See Spokoiny (2009b) for further details.
3. POINTWISE ADAPTIVE NON-PARAMETRIC ESTIMATION

An obvious feature of models (2.2)–(2.3) is that the parametric structure of the process is assumed constant over the whole sample and cannot thus incorporate changes and structural breaks at unknown times in the models. A natural generalization leads to models whose coefficients may change over time (Fan and Zhang, 2008). One can then assume that the structural process X_t satisfies the relation (2.3) at any time, but the vector of coefficients θ may vary with the time t, θ = θ(t). The estimation of the coefficients as general functions of time is possible only under some additional assumptions on these functions. Typical assumptions are (i) varying coefficients are smooth functions of time (Cai et al., 2000) and (ii) varying coefficients are piecewise constant functions (Bai and Perron, 1998, and Mikosch and Starica, 1999, 2004).

Our local parametric approach differs from the commonly used identification assumptions (i) and (ii). We assume that the observed data Y_t are described by a (partially) unobserved process X_t due to (2.2), and at each point T, there exists a historical interval I(T) = [t_0, T] in which the process X_t 'nearly' follows the parametric specification (2.3) (see Section 4 for details on what 'nearly' means). This local structural assumption enables us to apply well-developed parametric estimation for data {Y_t}_{t∈I(T)} to estimate the underlying parameter θ = θ(T) by θ̂ = θ̂(T). (The estimate θ̂ = θ̂(T) can then be used for estimating the value X̂_T of the process X_t at T from equation (2.3) and for further modelling such as forecasting Y_{T+1}.) Moreover, this assumption includes the above-mentioned 'smooth transition' and 'switching regime' assumptions (i) and (ii) as special cases: the parameters θ̂(T) vary over time as the interval I(T) changes with T and, at the same time, discontinuities and jumps in θ̂(T) as a function of time are possible.

To estimate θ̂(T), we have to find the historical interval of homogeneity I(T), i.e. the longest interval I with the right-end point T, where data do not contradict a specified parametric model with fixed parameter values. Starting at each time T with a very short interval I = [t_0, T], we search by successive extending and testing of interval I on homogeneity against a change-point alternative: if the hypothesis of homogeneity is not rejected for a given I, a larger interval is taken and tested again. Contrary to Bai and Perron (1998) and Mikosch and Starica (1999), who detect all change points in a given time series, our approach is local: it focuses on the local change-point analysis near point T of estimation and tries to find only one change closest to the reference point.

In the rest of this section, we first discuss the test statistics employed to test the time-homogeneity of an interval I against a change-point alternative in Section 3.1. Later, we rigorously describe the pointwise adaptive estimation procedure in Section 3.2. Its
implementation and the choice of parameters entering the adaptive procedure are described in Sections 3.2–3.4. Theoretical properties of the method are studied in Section 4.

3.1. Test of homogeneity against a change-point alternative

The pointwise adaptive estimation procedure crucially relies on the test of local time-homogeneity of an interval I = [t_0, T]. The null hypothesis for I means that the observations {Y_t}_{t∈I} follow the parametric models (2.2)–(2.3) with a fixed parameter θ_0, leading to the quasi-MLE estimate θ̃_I from (2.4) and the corresponding fitted log-likelihood L_I(θ̃_I). The change-point alternative for a given change-point location τ ∈ I can be described as follows: the process Y_t follows the parametric models (2.2)–(2.3) with a parameter θ_J for t ∈ J = [t_0, τ] and with a different parameter θ_{J^c} for t ∈ J^c = [τ + 1, T]; θ_J ≠ θ_{J^c}. The fitted log-likelihood under this alternative reads as L_J(θ̃_J) + L_{J^c}(θ̃_{J^c}). The test of homogeneity can be performed using the likelihood ratio (LR) test statistic T_{I,τ}:

$$T_{I,\tau} = \max_{\theta_J, \theta_{J^c} \in \Theta} \{ L_J(\theta_J) + L_{J^c}(\theta_{J^c}) \} - \max_{\theta \in \Theta} L_I(\theta) = L_J(\tilde{\theta}_J) + L_{J^c}(\tilde{\theta}_{J^c}) - L_I(\tilde{\theta}_I).$$

Since the change-point location τ is generally not known, we consider the supremum of the LR statistics T_{I,τ} over some subset τ ∈ T(I); cf. Andrews (1993):

$$T_{I, \mathcal{T}(I)} = \sup_{\tau \in \mathcal{T}(I)} T_{I,\tau}. \qquad (3.1)$$

A typical example of a set T(I) is T(I) = {τ : t_0 + m′ ≤ τ ≤ T − m″} for some fixed m′, m″ > 0.
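The statistic (3.1) then amounts to refitting the model on both sides of every candidate split; a minimal sketch reusing fit_garch11 from above (the inclusive index convention and passing the candidate set T(I) as the argument taus are our choices):

```python
def sup_lr_statistic(y, t0, T, taus, fit=fit_garch11):
    """Supremum LR statistic (3.1) over candidate change points tau in T(I),
    for the interval I = [t0, T] (inclusive integer indices)."""
    _, ll_I = fit(y[t0:T + 1])
    sup_stat = -np.inf
    for tau in taus:
        _, ll_J = fit(y[t0:tau + 1])           # J   = [t0, tau]
        _, ll_Jc = fit(y[tau + 1:T + 1])       # J^c = [tau + 1, T]
        sup_stat = max(sup_stat, ll_J + ll_Jc - ll_I)
    return sup_stat
```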
3.2. Adaptive search for the longest interval of homogeneity

This section presents the proposed adaptive pointwise estimation procedure. At each point T, we aim at estimating the unknown parameters θ(T) from historical data Y_t, t ≤ T; this procedure repeats for every current time point T as new data arrive. At the first step, the procedure selects on the base of historical data an interval Î(T) of homogeneity in which the data do not contradict the parametric models (2.2)–(2.3). Afterwards, the quasi-MLE estimation is applied using the selected historical interval Î(T) to obtain the estimate θ̂(T) = θ̃_{Î(T)}.

From now on, we consider an arbitrary, but fixed time point T. Suppose that a growing set I_0 ⊂ I_1 ⊂ ··· ⊂ I_K of historical interval-candidates I_k = [T − m_k + 1, T] with the right-end point T is fixed. The smallest interval I_0 is accepted automatically as homogeneous. Then the procedure successively checks every larger interval I_k on homogeneity using the test statistic T_{I_k, T(I_k)} from (3.1). The selected interval Î corresponds to the largest accepted interval I_k̂ with index k̂ such that

$$T_{I_k, \mathcal{T}(I_k)} \le z_k, \qquad k \le \hat{k}, \qquad (3.2)$$

and T_{I_{k̂+1}, T(I_{k̂+1})} > z_{k̂+1}, where the critical values z_k are discussed later in this section and specified in Section 3.3. This procedure then leads to the adaptive estimate θ̂ = θ̃_Î corresponding to the selected interval Î = I_k̂. The complete description of the procedure includes two steps. (A) Fixing the set-up and the parameters of the procedure. (B) Data-driven search for the longest interval of homogeneity.
(A) Set-up and parameters:
(A1) Select specific parametric models (2.2)–(2.3) [e.g. constant volatility, ARCH(1), GARCH(1,1)].
(A2) Select the set I = (I_0, ..., I_K) of interval-candidates, and for each I_k ∈ I, the set T(I_k) of possible change points τ ∈ I_k used in the LR test (3.1).
(A3) Select the critical values z_1, ..., z_K in (3.2) as described in Section 3.3.

(B) Adaptive search and estimation: Set k = 1, Î = I_0 and θ̂ = θ̃_{I_0}.
(B1) Test the hypothesis H_{0,k} of no change point within the interval I_k using the test statistic (3.1) and the critical values z_k obtained in (A3). If a change point is detected (H_{0,k} is rejected), go to (B3). Otherwise proceed with (B2).
(B2) Set θ̂ = θ̃_{I_k} and θ̂_{I_k} = θ̃_{I_k}. Further, set k := k + 1. If k ≤ K, repeat (B1); otherwise go to (B3).
(B3) Define Î = I_{k−1} = 'the last accepted interval' and θ̂ = θ̃_Î. Additionally, set θ̂_{I_k} = ··· = θ̂_{I_K} = θ̂ if k ≤ K.
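Collecting the pieces, step (B) can be sketched as follows; the geometric grid and the change-point sets follow the default proposals discussed below, except that the margins used for k = 1 are our simplification, and at least m_K observations are assumed to be available:

```python
def adaptive_estimate(y, T, z, m0=10, a=1.25, fit=fit_garch11):
    """Steps (B1)-(B3): search for the longest accepted interval I_k = [T - m_k + 1, T]
    on the geometric grid m_k = [m0 * a^k]; z[k-1] is the critical value z_k."""
    K = len(z)
    m = [int(m0 * a ** k) for k in range(K + 1)]   # interval lengths m_0, ..., m_K
    theta_hat, _ = fit(y[T - m[0] + 1:T + 1])      # I_0 is accepted automatically
    for k in range(1, K + 1):
        t0 = T - m[k] + 1
        if k >= 2:                                 # default proposal for T(I_k)
            taus = range(T - m[k - 1] + 1, T - m[k - 2] + 1)
        else:                                      # a simple trimmed choice for k = 1
            taus = range(t0 + 2, T - 1)
        if sup_lr_statistic(y, t0, T, taus, fit) > z[k - 1]:
            break                                  # H_{0,k} rejected: change point detected
        theta_hat, _ = fit(y[t0:T + 1])            # accept I_k and update theta-hat
    return theta_hat
```

As in (B3), the routine returns the quasi-MLE fitted on the last accepted interval.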
In step (A), one has to select three main ingredients of the procedure. First, the parametric model used locally to approximate the process Y_t has to be specified in (A1), e.g. the constant volatility or GARCH(1,1) in our context. Next, in step (A2), the set of intervals I = {I_k}_{k=0}^{K} is fixed, each interval with the right-end point T, length m_k = |I_k|, and the set T(I_k) of tested change points. Our default proposal is to use a geometric grid m_k = [m_0 a^k], a > 1, and to set I_k = [T − m_k + 1, T] and T(I_k) = [T − m_{k−1} + 1, T − m_{k−2}]. Although our experiments show that the procedure is rather insensitive to the choice of m_0 and a (e.g. we use m_0 = 10 and a = 1.25 in simulations), the length m_0 of interval I_0 should take into account the parametric model selected in (A1). The reason is that I_0 is always assumed to be time-homogeneous and m_0 thus has to reflect the flexibility of the parametric model; e.g. while m_0 = 20 might be reasonable for the GARCH(1,1) model, m_0 = 5 could be a reasonable choice for the locally constant approximation of a volatility process. Finally, in step (A3), one has to select the K critical values z_k in (3.2) for the LR test statistics T_{I_k, T(I_k)} from (3.1). The critical values z_k will generally depend on the parametric model describing the null hypothesis of time-homogeneity, the set I of intervals I_k and corresponding sets of considered change points T(I_k), k ≤ K, and additionally, on two constants r and ρ that are counterparts of the usual significance level. All these determinants of the critical values can be selected in step (A) and the critical values are thus obtained before the actual estimation takes place in step (B). Due to its importance, the method of constructing critical values {z_k}_{k=1}^{K} is discussed separately in Section 3.3.

The main step (B) performs the search for the longest time-homogeneous interval. Initially, I_0 is assumed to be homogeneous. If I_{k−1} is negatively tested on the presence of a change point, one continues with I_k by employing test (3.1) in step (B1), which checks for a potential change point in I_k. If no change point is found, then I_k is accepted as time-homogeneous in step (B2); otherwise the procedure terminates in step (B3). We sequentially repeat these tests until we find a change point or exhaust all intervals. The latest (longest) interval accepted as time-homogeneous is used for estimation in step (B3). Note that the estimate θ̂_{I_k} defined in (B2) and (B3) corresponds to the latest accepted interval Î_k after the first k steps, or equivalently, the interval selected out of I_1, ..., I_k. Moreover, the whole search and estimation step (B) can be repeated at different time points T without reiterating the initial step (A) as the critical values z_k depend only on the approximating parametric model and interval lengths m_k = |I_k|, not on the time point T (see Section 3.3).
3.3. Choice of critical values z_k

The presented method of choosing the interval of homogeneity Î can be viewed as a multiple testing procedure. The critical values for this procedure are selected using the general approach of testing theory: to provide a prescribed performance of the procedure under the null hypothesis, i.e. in the pure parametric situation. This means that the procedure is trained on data generated from the pure parametric time-homogeneous model from step (A1). The correct choice in this situation is the largest considered interval I_K, and a choice I_k̂ with k̂ < K can be interpreted as a 'false alarm'. We select the minimal critical values ensuring a small probability of such a false alarm. Our condition slightly differs though from the classical level condition because we focus on parameter estimation rather than on hypothesis testing.

In the pure parametric case, the 'ideal' estimate corresponds to the largest considered interval I_K. Due to Theorem 2.1, the quality of estimation of the parameter θ_0 by θ̃_{I_K} can be measured by the log-likelihood 'loss' L_{I_K}(θ̃_{I_K}, θ_0), which is stochastically bounded with exponential and polynomial moments: E_{θ_0}|L_{I_K}(θ̃_{I_K}, θ_0)|^r ≤ R_r(θ_0). If the adaptive procedure stops earlier at some intermediate step k < K, we select instead of θ̃_{I_K} another estimate θ̂ = θ̃_{I_k} with a larger variability. The loss associated with such a false alarm can be measured by the value L_{I_K}(θ̃_{I_K}, θ̂) = L_{I_K}(θ̃_{I_K}) − L_{I_K}(θ̂). The corresponding condition bounding the loss due to the adaptive estimation reads as

$$E_{\theta_0} \big| L_{I_K}(\tilde{\theta}_{I_K}, \hat{\theta}) \big|^r \le \rho R_r(\theta_0). \qquad (3.3)$$

This is in fact an implicit condition on the critical values {z_k}_{k=1}^{K}, which ensures that the loss associated with the false alarm is at most the ρ-fraction of the log-likelihood loss of the 'ideal' or 'oracle' estimate θ̃_{I_K} for the parametric situation. The constant r corresponds to the power of the loss in (3.3), while ρ is similar in meaning to the test level. In the limit case when r tends to zero, this condition (3.3) becomes the usual level condition: P_{θ_0}(I_K is rejected) = P_{θ_0}(θ̃_{I_K} ≠ θ̂) ≤ ρ. The choice of the metaparameters r and ρ is discussed in Section 3.4.

A condition similar to (3.3) is imposed at each step of the adaptive procedure. The estimate θ̂_{I_k} coming after the k steps of the procedure should satisfy

$$E_{\theta_0} \big| L_{I_k}(\tilde{\theta}_{I_k}, \hat{\theta}_{I_k}) \big|^r \le \rho_k R_r(\theta_0), \qquad k = 1, \ldots, K, \qquad (3.4)$$

where ρ_k = ρ k/K ≤ ρ. The following theorem presents some sufficient conditions on the critical values {z_k}_{k=1}^{K} ensuring (3.4); recall that m_k = |I_k| denotes the length of I_k.

THEOREM 3.1. Suppose that r > 0, ρ > 0. Under the assumptions of Theorem 2.1, there are constants a_0, a_1, a_2 such that the condition (3.4) is fulfilled with the choice

$$z_k = a_0 r \log(\rho^{-1}) + a_1 r \log(m_K / m_{k-1}) + a_2 \log(m_k), \qquad k = 1, \ldots, K.$$

Since K and {m_k}_{k=1}^{K} are fixed, the z_k's in Theorem 3.1 have the form z_k = C + D log(m_k) for k = 1, ..., K with some constants C and D. However, a practically relevant choice of these constants has to be made by Monte Carlo simulations. Note first that every particular choice of the coefficients C and D determines the whole set of the critical values {z_k}_{k=1}^{K} and thus the local change-point procedure. For the critical values given by fixed (C, D), one can run the procedure and observe its performance on the simulated data using the data-generating process (2.2)–(2.3); in particular, one can check whether the condition (3.4) is fulfilled. For any (sufficiently large) fixed value of C, one can thus find the minimal value D(C) < 0 of D that ensures (3.4).
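The calibration loop itself is conceptually simple Monte Carlo; the sketch below checks (3.4) only at k = K and treats the parametric risk bound R_r(θ_0) as a known input (in practice it would itself be estimated from the same simulations), both simplifications being ours:

```python
def satisfies_34(simulate_homog, C, D, K, m0=10, a=1.25, r=1.0, rho=1.0, R_r=1.0, n_mc=200):
    """Monte Carlo check of (3.4) (at k = K only) for candidate critical values
    z_k = C + D * log(m_k); simulate_homog(n) must draw n observations from the
    time-homogeneous parametric model chosen in step (A1)."""
    m = [int(m0 * a ** k) for k in range(K + 1)]
    z = [C + D * np.log(mk) for mk in m[1:]]
    losses = []
    for _ in range(n_mc):
        y = simulate_homog(m[-1])
        T = len(y) - 1
        theta_hat = adaptive_estimate(y, T, z, m0=m0, a=a)
        seg = y[T - m[-1] + 1:T + 1]
        _, ll_K = fit_garch11(seg)                     # L_{I_K}(theta-tilde_{I_K})
        losses.append(abs(ll_K - garch11_loglik(theta_hat, seg)) ** r)
    return np.mean(losses) <= rho * R_r                # the bound of (3.4) for k = K
```

For a fixed (sufficiently large) C, one would then decrease D until the check first fails and take the last admissible value as D(C).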
Every corresponding set of critical values of the form z_k = C + D(C) log(m_k) is admissible. The condition D(C) < 0 ensures that the critical values decrease with k. This reflects the fact that a false alarm at an early stage of the algorithm is more crucial because it leads to the choice of a highly variable estimate. The critical values z_k for small k should thus be rather conservative to provide the stability of the algorithm in the parametric situation. To determine C, the value z_1 can be fixed by considering the false alarm at the first step of the procedure, which leads to estimation using the smallest interval I_0 instead of the 'ideal' largest interval I_K. The related condition (used in Section 5.1) reads as

$$E_{\theta_0} \big[ \big| L_{I_K}(\tilde{\theta}_{I_K}, \tilde{\theta}_{I_0}) \big|^r \, 1\big( T_{I_1, \mathcal{T}(I_1)} > z_1 \big) \big] \le \rho R_r(\theta_0) / K. \qquad (3.5)$$

Alternatively, one could select a pair (C, D) that minimizes the resulting prediction error; see Section 3.4.

3.4. Selecting parameters r and ρ

The choice of critical values using inequality (3.4) additionally depends on two 'metaparameters' r and ρ. A simple strategy is to use conservative values for these parameters and the corresponding set of critical values (e.g. our default is r = 1 and ρ = 1). On the other hand, the two parameters are global in the sense that they are independent of T. Hence, one can also determine them in a data-driven way by minimizing some global forecasting error (Cheng et al., 2003). Different values of r and ρ may lead to different sets of critical values and hence to different estimates θ̂^{(r,ρ)}(T) and to different forecasts Ŷ^{(r,ρ)}_{T+h|T} of the future values Y_{T+h}, where h is the forecasting horizon. Now, a data-driven choice of r and ρ can be done by minimizing the following objective function:

$$(\hat{r}, \hat{\rho}) = \mathop{\mathrm{argmin}}_{r > 0,\, \rho > 0} PE_{\Lambda, H}(r, \rho) = \mathop{\mathrm{argmin}}_{r, \rho} \sum_{T} \sum_{h \in H} \Lambda\big( Y_{T+h}, \hat{Y}^{(r,\rho)}_{T+h|T} \big), \qquad (3.6)$$

where Λ is a loss function and H is the forecasting horizon set. For example, one can take Λ_r(υ, υ′) = |υ − υ′|^r for r ∈ [1/2, 2]. For daily data, the forecasting horizon could be one day, H = {1}, or two weeks, H = {1, ..., 10}.
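The resulting grid search over the metaparameters is then a one-liner; here global_pe is a user-supplied function, assumed to rerun the calibrated adaptive forecasts for a candidate pair and to return the accumulated loss of (3.6):

```python
def select_r_rho(global_pe, grid_r=(0.5, 1.0), grid_rho=(0.5, 1.0, 1.5)):
    """Data-driven choice (3.6): return the (r, rho) pair on the grid that
    minimizes the accumulated forecasting loss global_pe(r, rho)."""
    return min(((r, rho) for r in grid_r for rho in grid_rho),
               key=lambda pair: global_pe(*pair))
```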
4. THEORETICAL PROPERTIES

In this section, we collect basic results describing the quality of the proposed adaptive procedure. First, the definition of the procedure ensures the performance prescribed by (3.4) in the parametric situation. We however claimed that the adaptive pointwise estimation applies even if the process Y_t is only locally approximated by a parametric model. Therefore, we now define a locally 'nearly parametric' process, for which we derive an analogue of Theorem 2.1 (Section 4.1). Later, we prove certain 'oracle' properties of the proposed method (Section 4.2).

4.1. Small modelling bias condition

This section discusses the concept of a 'nearly parametric' case. To define it rigorously, we have to quantify the quality of approximating the true latent process X_t, which drives the observed data Y_t due to (2.2), by the parametric process X_t(θ) described by (2.3) for some θ ∈ Θ. Below
we assume that the innovations ε_t in the model (2.2) are independent and identically distributed and denote the distribution of √υ ε_t by P_υ, so that the conditional distribution of Y_t given F_{t−1} is P_{g(X_t)}. To measure the distance of a data-generating process from a parametric model, we introduce for every interval I_k ∈ I and every parameter θ ∈ Θ the random quantity

$$\Delta_{I_k}(\theta) = \sum_{t \in I_k} \mathcal{K}\{ g(X_t), g[X_t(\theta)] \},$$

where K(υ, υ′) denotes the Kullback–Leibler distance between P_υ and P_{υ′}. For CH models with Gaussian innovations ε_t, K(υ, υ′) = −0.5{log(υ/υ′) + 1 − υ/υ′}. In the parametric case with X_t = X_t(θ_0), we clearly have Δ_{I_k}(θ_0) = 0. To characterize the 'nearly parametric case', we introduce a small modelling bias (SMB) condition, which simply means that, for some θ ∈ Θ, Δ_{I_k}(θ) is bounded by a small constant with a high probability. Informally, this means that the 'true' model can be well approximated on the interval I_k by the parametric one with the parameter θ. The best parametric fit (2.3) to the underlying model (2.2) on I_k can be defined by minimizing the value E Δ_{I_k}(θ) over θ ∈ Θ, and θ̃_{I_k} can be viewed as its estimate.

The following theorem claims that the results on the accuracy of estimation given in Theorem 2.1 can be extended from the parametric case to the general non-parametric situation under the SMB condition. Let ℓ(θ̂, θ) be any loss function for an estimate θ̂.

THEOREM 4.1. Let, for some θ ∈ Θ and some Δ ≥ 0,

$$E\, \Delta_{I_k}(\theta) \le \Delta. \qquad (4.1)$$

Then it holds for an estimate θ̂ constructed from the observations {Y_t}_{t∈I_k} that

$$E \log\big\{ 1 + \ell(\hat{\theta}, \theta) / E_{\theta}\, \ell(\hat{\theta}, \theta) \big\} \le 1 + \Delta.$$

This general result applied to the quasi-MLE estimation with the loss function L_I(θ̃_I, θ) yields the following corollary.

COROLLARY 4.1. Let the SMB condition (4.1) hold for some interval I_k and θ ∈ Θ. Then

$$E \log\Big\{ 1 + \big| L_{I_k}(\tilde{\theta}_{I_k}, \theta) \big|^r / R_r(\theta) \Big\} \le 1 + \Delta,$$

where R_r(θ) is the parametric risk bound from (2.6).

This result shows that the estimation loss |L_I(θ̃_I, θ)|^r normalized by the parametric risk R_r(θ) is stochastically bounded by a constant proportional to e^Δ. If Δ is not large, this result extends the parametric risk bound (Theorem 2.1) to the non-parametric situation under the SMB condition. Another implication of Corollary 4.1 is that the confidence set built for the parametric model (Corollary 2.1) continues to hold, with a slightly smaller coverage probability, under SMB.
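In simulations, where both the true process X_t and its parametric approximation X_t(θ) are available, the modelling bias Δ_{I_k}(θ) can be evaluated directly; a small sketch for the Gaussian CH case (the identity link g(u) = u corresponds to GARCH):

```python
def kl_gauss(v, v_prime):
    """K(v, v') between N(0, v) and N(0, v'): 0.5 * (v/v' - 1 - log(v/v'))."""
    return 0.5 * (v / v_prime - 1.0 - np.log(v / v_prime))

def modelling_bias(x_true, x_param, g=lambda u: u):
    """Delta_{I_k}(theta): accumulated Kullback-Leibler distance between the
    true and parametric conditional laws over the interval; computable only
    when the latent processes are known, i.e. in simulations."""
    return sum(kl_gauss(g(xt), g(xp)) for xt, xp in zip(x_true, x_param))
```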
4.2. The 'oracle' choice and the 'oracle' result

Corollary 4.1 suggests that the 'optimal' or 'oracle' choice of the interval I_k from the set I_1, ..., I_K can be defined as the largest interval for which the SMB condition (4.1) still holds (for a given small Δ > 0). For such an interval, one can neglect deviations of the underlying process from a parametric model with a fixed parameter θ. Therefore, we say that the choice k* is the 'oracle' choice if there exists θ ∈ Θ such that

$$E\, \Delta_{I_{k^*}}(\theta) \le \Delta \qquad (4.2)$$

for a fixed Δ > 0 and that (4.2) does not hold for k > k*. Unfortunately, the underlying process X_t and, hence, the value Δ_{I_k} are unknown and the oracle choice cannot be implemented. The proposed adaptive procedure tries to mimic this oracle on the basis of available data using the sequential test of homogeneity. The final oracle result claims that the adaptive estimate provides the same (in order) accuracy as the oracle one.

By construction, the pointwise adaptive procedure described in Section 3 provides the prescribed performance if the underlying process follows the parametric model (2.2). Now, condition (3.4) combined with Theorem 4.1 implies similar performance in the first k* steps of the adaptive estimation procedure.

THEOREM 4.2. Let θ ∈ Θ and Δ > 0 be such that E Δ_{I_{k*}}(θ) ≤ Δ for some k* ≤ K. Also let max_{k≤k*} E_θ |L_{I_k}(θ̃_{I_k}, θ)|^r ≤ R_r(θ). Then

$$E \log\Big\{ 1 + \frac{\big| L_{I_{k^*}}(\tilde{\theta}_{I_{k^*}}, \theta) \big|^r}{R_r(\theta)} \Big\} \le 1 + \Delta \quad \text{and} \quad E \log\Big\{ 1 + \frac{\big| L_{I_{k^*}}(\tilde{\theta}_{I_{k^*}}, \hat{\theta}_{I_{k^*}}) \big|^r}{R_r(\theta)} \Big\} \le \rho + \Delta.$$

Similarly to the parametric case, under the SMB condition E Δ_{I_{k*}}(θ) ≤ Δ, any choice k̂ < k* can be viewed as a false alarm. Theorem 4.2 documents that the loss induced by such a false alarm at the first k* steps, measured by L_{I_{k*}}(θ̃_{I_{k*}}, θ̂_{I_{k*}}), is of the same magnitude as the loss L_{I_{k*}}(θ̃_{I_{k*}}, θ) of estimating the parameter θ from the SMB condition (4.2) by θ̃_{I_{k*}}. Thus, under (4.2) the adaptive estimation during steps k ≤ k* does not induce larger errors into estimation than the quasi-MLE estimation itself.

For further steps of the algorithm with k > k*, where (4.2) does not hold, the value Δ = E Δ_{I_k}(θ) can be large and the bound for the risk becomes meaningless due to the factor e^Δ. To establish the result about the quality of the final estimate, we thus have to show that the quality of estimation cannot be destroyed at the steps k > k*. The next 'oracle' result states the final quality of our adaptive estimate θ̂.

THEOREM 4.3. Let E Δ_{I_{k*}}(θ) ≤ Δ for some k* ≤ K. Then L_{I_{k*}}(θ̃_{I_{k*}}, θ̂) 1(k̂ ≥ k*) ≤ z_{k*}, yielding

$$E \log\Big\{ 1 + \frac{\big| L_{I_{k^*}}(\tilde{\theta}_{I_{k^*}}, \hat{\theta}) \big|^r}{R_r(\theta)} \Big\} \le \rho + \Delta + \log\Big\{ 1 + \frac{z_{k^*}^r}{R_r(\theta)} \Big\}.$$

Due to this result, the value L_{I_{k*}}(θ̃_{I_{k*}}, θ̂) is stochastically bounded. This can be interpreted as the oracle property of θ̂ because it means that the adaptive estimate θ̂ belongs with a high probability to the confidence set of the oracle estimate θ̃_{I_{k*}}.
5. SIMULATION STUDY

In the last two sections, we present a simulation study (Section 5) and real-data applications (Section 6) documenting the performance of the proposed adaptive estimation procedure. To verify the practical applicability of the method in a complex setting, we concentrate on the volatility estimation using parametric and adaptive pointwise estimation of constant volatility, ARCH(1) and GARCH(1,1) models (for the sake of brevity, referred to as the local constant,
local ARCH and local GARCH). The reason is that the estimation of GARCH models generally requires hundreds of observations for a reasonable quality of estimation, which puts the adaptive procedure, working with samples as small as 10 or 20 observations, to a hard test. Additionally, the critical values obtained as described in Section 3.3 depend on the underlying parameter values in the case of (G)ARCH.

Here we first study the finite-sample critical values for the test of homogeneity by means of Monte Carlo simulations and discuss practical implementation details (Section 5.1). Later, we demonstrate the performance of the proposed adaptive pointwise estimation procedure in simulated samples (Section 5.2). Note that, throughout this section, we identify the GARCH(1,1) models by triplets (ω, α, β): e.g. the (1, 0.1, 0.3)-model. Constant volatility and ARCH(1) are then indicated by α = β = 0 and β = 0, respectively. The GARCH estimation is done using the G@RCH 3.0 package (Laurent and Peters, 2006) and Ox 3.30 (Doornik, 2002). Finally, since the focus is on modelling the volatility σ_t² in (2.2), the performance measurement and comparison of all models at time t is done by the absolute prediction error (PE) of the volatility process over a prediction horizon H:

$$APE(t) = \sum_{h \in H} \big| \sigma_{t+h}^2 - \hat{\sigma}_{t+h|t}^2 \big| \Big/ |H|,$$

where σ̂²_{t+h|t} represents the volatility prediction by a particular model.
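For concreteness, the criterion can be coded directly; in the simulations σ²_{t+h} is the known true volatility, while in the applications of Section 6 it is replaced by the squared-return proxy:

```python
def ape(sig2_true, sig2_fcst, t, H):
    """APE(t) = sum_{h in H} |sigma^2_{t+h} - sigma-hat^2_{t+h|t}| / |H|;
    sig2_fcst maps the horizon h to the forecast made at time t."""
    return sum(abs(sig2_true[t + h] - sig2_fcst[h]) for h in H) / len(H)
```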
5.1. Finite-sample critical values for the test of homogeneity

A practical application of the pointwise adaptive procedure requires critical values for the test of local homogeneity of a time series. Since they are obtained under the null hypothesis that a chosen parametric model (locally) describes the data, see Section 3, we need to obtain the critical values for the constant volatility, ARCH(1) and GARCH(1,1) models. Furthermore, for given r and ρ, the average risk (3.4) between the adaptive and oracle estimates can be bounded for critical values that depend linearly on the logarithm of the interval length |I_k|: z(|I_k|) = z_k = C + D log(|I_k|) (see Theorem 3.1). As described in Section 3.3, we choose here the smallest C satisfying (3.5) and the corresponding minimum admissible value D = D(C) < 0 that guarantees the conditions (3.4).

We simulated the critical values for ARCH(1) and GARCH(1,1) models with different values of the underlying parameters; see Table 1 for the critical values corresponding to r = 1 and ρ = 1. Their simulation was performed sequentially on intervals with lengths ranging from |I_0| = m_0 = 10 to |I_K| = 570 observations using a geometric grid with multiplier a = 1.25; see Section 3.2. (The results are, however, not sensitive to the choice of a.) Unfortunately, the critical values depend on the parameters of the underlying (G)ARCH model (in contrast to the constant-volatility model). They generally seem to increase with the values of the ARCH and GARCH parameters, keeping the other one fixed; see Table 1. To deal with this dependence on the underlying model parameters, we propose to choose the largest (most conservative) critical values corresponding to any estimated parameter in the analysed data. For example, if the largest estimated parameters of GARCH(1,1) are α̂ = 0.3 and β̂ = 0.8, one should use z(10) = 26.4 and z(570) = 14.5, which are the largest critical values for models with α = 0.3, β ≤ 0.8 and with α ≤ 0.3, β = 0.8. (The proposed procedure is, however, not overly sensitive to this choice, as we shall see later.)

Finally, let us have a look at the influence of the tuning constants r and ρ in (3.4) on the critical values for several selected models (Table 2). The influence is significant, but can be classified in the following way. Whereas increasing ρ generally leads to an overall decrease of critical values (cf. Theorem 3.1), but primarily for the longer intervals, increasing r leads to an increase of critical values mainly for the shorter intervals; cf. (3.4).
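The conservative rule just described is easy to mechanize once the simulated critical values are stored, say, in a dictionary keyed by (α, β); the data structure is our assumption:

```python
def conservative_z(z_table, alpha_hat, beta_hat):
    """Conservative rule of Section 5.1: take, entrywise, the largest critical
    values over models with (alpha = alpha_hat, beta <= beta_hat) or
    (alpha <= alpha_hat, beta = beta_hat); z_table maps (alpha, beta) to a
    tuple of critical values such as (z(10), z(570))."""
    rows = [v for (a, b), v in z_table.items()
            if (a == alpha_hat and b <= beta_hat) or (b == beta_hat and a <= alpha_hat)]
    return tuple(max(col) for col in zip(*rows))
```

With the example above, the call conservative_z(table1, 0.3, 0.8) would return the pair (26.4, 14.5) quoted in the text.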
Table 1. Critical values z_k = z(|I_k|) of the supremum LR test, tabulated for the ARCH parameter α (rows) and the GARCH parameter β (columns), each ranging over 0.0, 0.1, ..., 0.9, with one pair of rows, z(10) and z(570), per value of α. Note: ω = 1, r = 1 and ρ = 1.
In simulations and real applications, we verified that a fixed choice such as r = 1 and ρ = 1 performs well. To optimize the performance of the adaptive methods, one can however determine the constants r and ρ in a data-dependent way as described in Section 3.3. We use here this strategy for a small grid of r ∈ {0.5, 1.0} and ρ ∈ {0.5, 1.0, 1.5} and find globally optimal r and ρ. We will document, though, that the differences in the average absolute PE (3.6) for various values of r and ρ are relatively small.

5.2. Simulation study

We aim (i) to examine how well the proposed estimation method is able to adapt to long stable (time-homogeneous) periods and to less stable periods with more frequent volatility changes, and (ii) to see which adaptively estimated model (local constant volatility, local ARCH or local GARCH) performs best in different regimes. To this end, we simulated 100 series from two change-point GARCH models with a low GARCH effect (ω, 0.2, 0.1) and a high GARCH effect (ω, 0.2, 0.7). Changes in the constant ω are spread over a time span of 1000 days; see Figure 1. There is a long stable period at the beginning (500 days ≈ 2 years) and end (250 days ≈ 1 year) of the time series, with several volatility changes between them.
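Such a data-generating process is straightforward to simulate; in the sketch below the particular ω profile is a hypothetical illustration rather than the exact path of Figure 1:

```python
def simulate_cp_garch(omega_path, alpha, beta, seed=0):
    """Simulate a change-point GARCH(1,1): omega_path[t] is the piecewise-constant
    level omega on day t, while alpha and beta stay fixed as in Section 5.2."""
    rng = np.random.default_rng(seed)
    n = len(omega_path)
    y = np.empty(n)
    sig2 = omega_path[0] / max(1.0 - alpha - beta, 1e-6)  # stationary variance as start value
    y[0] = np.sqrt(sig2) * rng.standard_normal()
    for t in range(1, n):
        sig2 = omega_path[t] + alpha * y[t - 1] ** 2 + beta * sig2
        y[t] = np.sqrt(sig2) * rng.standard_normal()
    return y

# hypothetical omega profile: stable start (500 days), two mid-sample jumps,
# stable end (250 days); the actual levels in Figure 1 may differ
omega = np.concatenate([np.full(500, 0.1), np.full(150, 0.4),
                        np.full(100, 0.2), np.full(250, 0.1)])
y_low = simulate_cp_garch(omega, alpha=0.2, beta=0.1)    # low GARCH-effect design
```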
Table 2. Critical values z(|I_k|) of the supremum LR test for various values of r and ρ.

                          Model (ω, α, β)
              (0.1, 0.0, 0.0)    (0.1, 0.2, 0.0)    (0.1, 0.1, 0.8)
  r     ρ     z(10)   z(570)     z(10)   z(570)     z(10)   z(570)
 1.0   0.5     16.3     7.3       17.4    11.2       18.7    17.1
 1.0   1.0     15.4     5.5       16.7     9.4       16.0    14.0
 1.0   1.5     14.9     4.5       15.9     8.3       15.2    13.4
 0.5   0.5     10.7     7.1       11.7    10.1       11.7    10.1
 0.5   1.0      8.9     5.5       10.3     8.5       10.3     8.5
 0.5   1.5      7.7     4.6        9.3     7.5        9.3     7.5
Figure 1. GARCH(1,1) parameters of low (left panel) and high (right panel) GARCH-effect simulations.
5.2.1. Low GARCH effect. Let us now discuss the simulation results for the low GARCH-effect model. First, we mention the effect of structural changes in the time series on the parameter estimation. Later, we compare the performance of all methods in terms of the absolute PE.

Estimating a parametric model from data containing a change point will necessarily lead to various biases in estimation. For example, Hillebrand (2005) demonstrates that a change in the volatility level ω within a sample drives the GARCH parameter β very close to 1. This is confirmed when we analyse the parameter estimates for the parametric and adaptive GARCH at each time point t ∈ [250, 1000] as depicted in Figure 2, where the mean (solid line), the 10% and 90% quantiles (dotted lines), and the true values (thick dotted line) of the model parameters are provided. The parametric estimates are consistent before the breaks starting at t = 500, but the GARCH parameter β becomes inconsistent and converges to 1 once the data contain breaks, t > 500. The locally adaptive estimates are similar to the parametric ones before the breaks and become rather imprecise after the first change point, but they are not too far from the true value on average and stay consistent (in the sense that the confidence interval covers the true values). The low precision of estimation can be attributed to the rather short intervals used for estimation (cf. Figure 2 for t < 500).

Next, we would like to compare the performance of the parametric and adaptive estimation methods by means of the absolute PE: first for the prediction horizon of one day, H = {1}, and later for prediction two weeks ahead, H = {1, ..., 10}.
Figure 2. Parameter values estimated by the parametric (top row) and locally adaptive (bottom row) GARCH methods.
To make the results easier to decipher, we present in what follows PEs averaged over the past month (21 days). The absolute-PE criterion was also used to determine the optimal values of the parameters r and ρ (jointly across all simulations and for all t = 250, ..., 1000). The results differ for different models: r = 0.5, ρ = 0.5 for the local constant, r = 0.5, ρ = 1.0 for the local ARCH, and r = 0.5, ρ = 1.5 for the local GARCH.

Let us now compare the adaptively estimated local constant, local ARCH and local GARCH models with the parametric GARCH, which is the best performing parametric model in this set-up. Forecasting one period ahead, the average PEs for all methods and the median lengths of the selected time-homogeneous intervals for the adaptive methods are presented in Figure 3 for t ∈ [250, 1000]. First of all, let us observe in the case of the simplest local constant model that even the (median) estimated interval of homogeneity at the end of the first homogeneous period, 1 ≤ t < 500, can actually be shorter than the true one. The reason is that the probability of some 5 or 10 subsequent observations used as I_0 having their sample variance very different from the underlying one increases with the length of the series.

Next, one can notice that all methods are sensitive to jumps in volatility, especially to the first one at t = 500: the parametric ones because they ignore a structural break, the adaptive ones because they use a small amount of data after a structural change. In general, the local GARCH performs rather similarly to the parametric GARCH for t < 650 because it uses all historical data. After the initial volatility jumps, the local GARCH, however, outperforms the parametric one, 650 < t < 775. Following the last jump at t = 750, where the volatility level returns closer to the initial one, the parametric GARCH is the best of all methods for some time, 775 < t < 850, until the adaptive estimation procedure detects the (last) break and, after it, 'collects' enough observations for estimation. Then the local GARCH and local ARCH become preferable to the parametric model again, 850 < t. Interestingly, the local ARCH approximation performs almost as well as both GARCH methods and even outperforms them shortly after structural breaks (except for the break at t = 750), 600 < t < 775 and 850 < t < 1000.
Figure 3. Left-hand panel: Low GARCH-effect simulations—absolute prediction errors one period ahead. Right-hand panel: The median lengths of the adaptively selected intervals.
Figure 4. Left-hand panel: Low GARCH-effect simulations—absolute prediction errors 10 periods ahead. Right-hand panel: High GARCH-effect simulations—absolute prediction errors one period ahead.
Finally, the local constant volatility is lagging behind the other two adaptive methods whenever there is a longer time period without a structural break, but keeps up with them in periods with frequent volatility changes, 500 < t < 650. All these observations can be documented also by the absolute PE averaged over the whole period 250 ≤ t ≤ 1000 (we refer to it as the global PE from now on): the smallest PE is achieved by the local ARCH (0.075), then by the local GARCH (0.079), and the worst result is from the local constant (0.094).

Additionally, all models are compared using the forecasting horizon of 10 days. Most of the results are the same (e.g. parameter estimates) or similar (e.g. absolute PE) to forecasting one period ahead due to the fact that all models rely on at most one past observation. The absolute PEs averaged over one month are summarized for t ∈ [250, 1000] in Figure 4, which reveals that the differences between the local constant volatility, local ARCH and local GARCH models are smaller in this case. As a result, it is interesting to note that: (i) the local constant model becomes a viable alternative to the other methods (it has in fact the smallest global PE, 0.107, of all adaptive methods) and (ii) the local ARCH model still outperforms the local GARCH
(global PEs are 0.108 and 0.116, respectively) even though the underlying model is GARCH (with a small value of β = 0.1, however).

5.2.2. High GARCH effect. Let us now discuss the high GARCH-effect model. One would expect both GARCH models to be much more dominant here, since the underlying GARCH parameter is higher and the changes in the volatility level ω are likely to be small compared to the overall volatility fluctuations. Note that the optimal values of the tuning constants r and ρ differ from the low GARCH-effect simulations: r = 0.5, ρ = 1.5 for the local constant; r = 0.5, ρ = 1.5 for the local ARCH; and r = 1.0, ρ = 0.5 for the local GARCH. Comparing the absolute PEs for the one-period-ahead forecast at each time point (Figure 4) indicates that the adaptive and parametric GARCH estimations perform approximately equally well. On the other hand, both the parametric and adaptively estimated ARCH and constant volatility models lag significantly behind. Unreported results confirm, similarly to the low GARCH-effect simulations, that the differences among the methods are much smaller once a longer prediction horizon of 10 days is used.
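Since all compared models rely on at most one past observation, the multi-step forecasts reduce to a simple recursion; a sketch for GARCH(1,1) based on the standard h-step variance-forecast formula (not spelled out in the paper):

```python
def garch11_forecast(theta, y_last, sig2_last, H):
    """h-step variance forecasts sigma^2_{t+h|t} for GARCH(1,1):
    sigma^2_{t+1|t} = omega + alpha * y_t^2 + beta * sigma^2_t is known at t,
    and E[sigma^2_{t+h} | F_t] = omega + (alpha + beta) * sigma^2_{t+h-1|t} for h >= 2."""
    omega, alpha, beta = theta
    f = {1: omega + alpha * y_last ** 2 + beta * sig2_last}
    for h in range(2, max(H) + 1):
        f[h] = omega + (alpha + beta) * f[h - 1]
    return {h: f[h] for h in H}
```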
6. APPLICATIONS

The proposed adaptive pointwise estimation method will now be applied to real time series consisting of the log-returns of the DAX and S&P 500 stock indices (Sections 6.1 and 6.2). We will again summarize the results concerning both the parametric and adaptive methods by the absolute PEs one day ahead averaged over one month. As a benchmark, we employ the parametric GARCH estimated using the last two years of data (500 observations). Since we, however, do not observe the underlying volatility process, it is approximated by squared returns. Despite being noisy, this approximation is unbiased and usually provides the correct ranking of methods (Andersen and Bollerslev, 1998).

6.1. DAX analysis

Let us now analyse the log-returns of the German stock index DAX from January 1990 till December 2002, depicted at the top of Figure 5. Several periods interesting for comparing the performance of the parametric and adaptive pointwise estimates are selected, since results for the whole period might be hard to decipher at once.

First, consider the estimation results for years 1991 to 1996. Contrary to later periods, there are structural breaks practically immediately detected by all adaptive methods (July 1991 and June 1992; cf. Stapf and Werner, 2003). For the local GARCH, this differs from the less pronounced structural changes discussed later, which are typically detected only with delays of several months. One additional break detected by all methods occurs in October 1994. Note that the parameters r and ρ were r = 0.5, ρ = 1.5 for the local constant, r = 1.0, ρ = 1.0 for the local ARCH, and r = 0.5, ρ = 1.5 for the local GARCH.

The results for the period 1991–96 are summarized in the left bottom panel of Figure 5, which depicts the PEs of each adaptive method relative to the PEs of the parametric GARCH. First, one can notice that the local constant and local ARCH approximations are preferable till July 1991, where we have less than 500 observations. After the detection of the structural change in June 1991, all adaptive methods are shortly worse than the parametric GARCH due to the limited amount of data used, but then outperform the parametric GARCH till the next structural break in the second half of 1992.
Figure 5. Top panel: The log-returns of DAX series. Bottom panels: The absolute prediction errors of the pointwise adaptive methods relative to the parametric GARCH errors for predictions one period ahead.
A similar behaviour can be observed after the break detected in October 1994, where the local constant and local ARCH models actually outperform both the parametric and adaptive GARCH. In the other parts of the data, the performance of all methods is approximately the same, and even though the adaptive GARCH is overall better than the parametric one, the most interesting fact is that the adaptively estimated local constant and local ARCH models perform equally well. In terms of the global PE, the local constant is best (0.829), followed by the local ARCH (0.844) and the local GARCH (0.869). This closely corresponds to our findings in the simulation study with the low GARCH effect in Section 5.2. Note that for other choices of r and ρ, the global PEs are at most 0.835 and 0.851 for the local constant and local ARCH, respectively. This indicates low sensitivity to the choice of these parameters.

Next, we discuss the estimation results for years 1999 to 2001 (r = 1.0 for all methods now). After the financial markets were hit by the Asian crisis in 1997 and the Russian crisis in 1998, the market headed to a more stable state in year 1999. The adaptive methods detected the structural breaks in the autumn of 1997 and 1998. The local GARCH detected them, however, with more than a one-year delay, only during 1999. The results in Figure 5 (right bottom panel) confirm that the benefits of the adaptive GARCH are practically negligible compared to the parametric GARCH in such a case. On the other hand, the local constant and ARCH methods perform slightly better than both GARCH methods during the first presented year (July 1999 to June 2000).
Figure 6. Left-hand panel: The log-returns of S&P 500. Right-hand panel: The absolute prediction errors of the pointwise adaptive methods relative to the parametric GARCH errors for predictions one period ahead.
From July 2000, the situation becomes just the opposite and the performance of the GARCH models is better (the parametric and adaptive GARCH estimates are practically the same in this period since the last detected structural change occurred approximately two years earlier). Together with the previous results, this opens the question of model selection among adaptive procedures, as different parametric approximations might be preferred in different time periods. Judging by the global PE, the local ARCH provides slightly better predictions on average than the local constant and local GARCH, despite the 'peak' of the PE ratio in the second half of year 2000 (see Figure 5). This, however, depends on the specific choice of the loss in (3.6). Finally, let us mention that the relatively similar behaviour of the local constant and local ARCH methods is probably due to the use of the ARCH(1) model, which is not sufficient to capture more complex time developments. Hence, ARCH(p) might be a more appropriate interim step between the local constant and GARCH models.

6.2. S&P 500

Now we turn our attention to more recent data on the S&P 500 stock index, considered from January 2000 to December 2004; see Figure 6. This period is marked by many substantial events affecting the financial markets, ranging from the September 11, 2001, terrorist attacks and the war in Iraq (2003) to the crash of the technology stock-market bubble (2000–02). For the sake of simplicity, a particular time period is again selected: year 2003, representing a more volatile period (the war in Iraq), and year 2004, being a less volatile period.

All adaptive methods detected rather quickly a structural break at the beginning of 2003, and additionally they detected a structural break in the second half of 2003, although the adaptive GARCH did so with a delay of more than eight months. The ratios of the monthly PEs of all adaptive methods to those of the parametric GARCH from January 2003 to December 2004 are summarized in Figure 6 (r = 0.5 and ρ = 1.5 for all methods).
In the beginning of year 2003, which together with 2002 corresponds to a more volatile period (see Figure 6), all adaptive methods perform as well as the parametric GARCH. In the middle of year 2003, the local constant and local ARCH models are able to detect another structural change (possibly less pronounced than the one at the beginning of 2003 because of its late detection by the adaptive GARCH). Around this period, the local ARCH shortly performs worse than the parametric GARCH. From the end of 2003 and in year 2004, all adaptive methods start to outperform the parametric GARCH, with the reduction of the PEs due to the adaptive estimation amounting to 20% on average. All adaptive pointwise estimates exhibit a short period of instability in the first months of 2004, where their performance temporarily worsens to the level of the parametric GARCH. This corresponds to 'uncertainty' of the adaptive methods about the length of the interval of homogeneity. After this short period, the performance of all adaptive methods is comparable, although the local constant performs overall best of all methods (closely followed by the local ARCH) judged by the global PE.

Similarly to the low GARCH-effect simulations and to the analysis of the DAX in Section 6.1, it seems that the benefit of pointwise adaptive estimation is most pronounced during periods of stability that follow an unstable period (i.e. year 2004) rather than during a presumably rapidly changing environment. The reason is that, despite the possible inconsistency of parametric methods under change points, the adaptive methods tend to have a rather large variance when the intervals of time homogeneity become very short.
7. CONCLUSION

We extend the idea of adaptive pointwise estimation to parametric CH models. In the specific case of ARCH and GARCH, which represent particularly difficult cases due to high data demands and the dependence of critical values on the underlying parameters, we demonstrate the use and feasibility of the proposed procedure: on the one hand, the adaptive procedure, which itself depends on a number of auxiliary parameters, is shown to be rather insensitive to their choice; on the other hand, it facilitates the global selection of these parameters by means of fit or forecasting criteria. The real-data applications highlight the flexibility of the proposed time-inhomogeneous models, since even simple varying-coefficient models such as constant volatility and ARCH(1) can outperform standard parametric methods such as GARCH(1,1). Finally, the relatively small differences among the adaptive estimates based on different parametric approximations indicate that, in the context of adaptive pointwise estimation, it is sufficient to concentrate on simpler and less data-intensive models such as ARCH(p), 0 ≤ p ≤ 3, to achieve good forecasts.
ACKNOWLEDGMENTS

This research was supported by the Deutsche Forschungsgemeinschaft through the SFB 649 'Economic Risk'.
REFERENCES

Andersen, T. G. and T. Bollerslev (1998). Answering the skeptics: yes, standard volatility models do provide accurate forecasts. International Economic Review 39, 885–905.
Andreou, E. and E. Ghysels (2002). Detecting multiple breaks in financial market volatility dynamics. Journal of Applied Econometrics 17, 579–600.
Andreou, E. and E. Ghysels (2006). Monitoring disruptions in financial markets. Journal of Econometrics 135, 77–124.
Andrews, D. W. K. (1993). Tests for parameter instability and structural change with unknown change point. Econometrica 61, 821–56.
Bai, J. and P. Perron (1998). Estimating and testing linear models with multiple structural changes. Econometrica 66, 47–78.
Beltratti, A. and C. Morana (2004). Structural change and long-range dependence in volatility of exchange rates: either, neither or both? Journal of Empirical Finance 11, 629–58.
Bollerslev, T. (1986). Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics 31, 307–27.
Cai, Z., J. Fan and Q. Yao (2000). Functional coefficient regression models for nonlinear time series. Journal of the American Statistical Association 95, 941–56.
Chen, J. and A. K. Gupta (1997). Testing and locating variance changepoints with application to stock prices. Journal of the American Statistical Association 92, 739–47.
Chen, R. and R. J. Tsay (1993). Functional-coefficient autoregressive models. Journal of the American Statistical Association 88, 298–308.
Cheng, M.-Y., J. Fan and V. Spokoiny (2003). Dynamic nonparametric filtering with application to volatility estimation. In M. G. Akritas and D. N. Politis (Eds.), Recent Advances and Trends in Nonparametric Statistics, 315–33. Amsterdam: Elsevier.
Diebold, F. X. and A. Inoue (2001). Long memory and regime switching. Journal of Econometrics 105, 131–59.
Doornik, J. A. (2002). Object-oriented programming in econometrics and statistics using Ox: a comparison with C++, Java and C#. In S. S. Nielsen (Ed.), Programming Languages and Systems in Computational Economics and Finance, 115–47. Dordrecht: Kluwer.
Engle, R. F. (1982). Autoregressive conditional heteroscedasticity with estimates of the variance of United Kingdom inflation. Econometrica 50, 987–1008.
Fan, J. and W. Zhang (2008). Statistical models with varying coefficient models. Statistics and Its Interface 1, 179–95.
Francq, C. and J.-M. Zakoian (2007). Quasi-maximum likelihood estimation in GARCH processes when some coefficients are equal to zero. Stochastic Processes and their Applications 117, 1265–84.
Glosten, L. R., R. Jagannathan and D. E. Runkle (1993). On the relation between the expected value and the volatility of the nominal excess return on stocks. Journal of Finance 48, 1779–801.
Hansen, B. and S.-W. Lee (1994). Asymptotic theory for the GARCH(1,1) quasi-maximum likelihood estimator. Econometric Theory 10, 29–53.
Härdle, W., H. Herwartz and V. Spokoiny (2003). Time inhomogeneous multiple volatility modelling. Journal of Financial Econometrics 1, 55–99.
Herwartz, H. and H. E. Reimers (2001). Empirical modeling of the DEM/USD and DEM/JPY foreign exchange rate: structural shifts in GARCH-models and their implications. Discussion Paper SFB 373 2001–83, Humboldt-Universität zu Berlin, Germany.
Hillebrand, E. (2005). Neglecting parameter changes in GARCH models. Journal of Econometrics 129, 121–38.
Kokoszka, P. and R. Leipus (2000). Change-point estimation in ARCH models. Bernoulli 6, 513–39.
Laurent, S. and J.-P. Peters (2006). G@RCH 4.2, Estimating and Forecasting ARCH Models. London: Timberlake Consultants Press.
Mercurio, D. and V. Spokoiny (2004). Statistical inference for time-inhomogeneous volatility models. Annals of Statistics 32, 577–602.
Mikosch, T. and C. Starica (1999). Change of structure in financial time series, long range dependence and the GARCH model. Working Paper, Department of Statistics, University of Pennsylvania. See http://citeseer.ist.psu.edu/mikosch99change.html.
Mikosch, T. and C. Starica (2004). Changes of structure in financial time series and the GARCH model. Revstat Statistical Journal 2, 41–73.
Nelson, D. B. (1991). Conditional heteroskedasticity in asset returns: a new approach. Econometrica 59, 347–70.
Pesaran, M. H. and A. Timmermann (2004). How costly is it to ignore breaks when forecasting the direction of a time series? International Journal of Forecasting 20, 411–25.
Sentana, E. (1995). Quadratic ARCH models. Review of Economic Studies 62, 639–61.
Spokoiny, V. (1998). Estimation of a function with discontinuities via local polynomial fit with an adaptive window choice. Annals of Statistics 26, 1356–78.
Spokoiny, V. (2009a). Multiscale local change-point detection with applications to value-at-risk. Annals of Statistics 37, 1405–36.
Spokoiny, V. (2009b). Parameter estimation in time series analysis. WIAS Preprint No. 1404, Weierstrass Institute for Applied Analysis and Stochastics, Berlin, Germany.
Stapf, J. and T. Werner (2003). How wacky is DAX? The changing structure of German stock market volatility. Discussion Paper 2003/18, Deutsche Bundesbank, Germany.
Taylor, S. J. (1986). Modeling Financial Time Series. Chichester: Wiley.
APPENDIX: PROOFS

Proof of Corollary 2.1: Given the choice of $z_\alpha$, it follows directly from (2.5).
Proof of Theorem 3.1: Consider the event $B_k = \{\hat{I} = I_{k-1}\}$ for some $k \le K$. This means in particular that $I_{k-1}$ is accepted while $I_k = [T - m_k + 1, T]$ is rejected; i.e. there are $I = [t', T] \subseteq I_k$ and $\tau \in \mathcal{T}(I_k)$ such that $T_{I_k,\tau} > z_k = z_{I_k,\mathcal{T}(I_k)}$. For every fixed $\tau \in \mathcal{T}(I_k)$ and $J = I_k \setminus [\tau+1, T]$, $J^c = [\tau+1, T]$, it holds by the definition of $T_{I_k,\tau}$ that
$$T_{I_k,\tau} \le L_J(\tilde\theta_J) + L_{J^c}(\tilde\theta_{J^c}) - L_{I_k}(\theta_0) = L_J(\tilde\theta_J, \theta_0) + L_{J^c}(\tilde\theta_{J^c}, \theta_0).$$
This implies by Theorem 2.1 that $P_{\theta_0}(T_{I_k,\tau} > 2z) \le 2\exp\{e(\lambda, \theta_0) - \lambda z\}$. Now,
$$P_{\theta_0}(B_k) \le \sum_{t=T-m_k+1}^{T-m_0} \sum_{\tau=t+1}^{T-m_0+1} 2\exp\{e(\lambda, \theta_0) - \lambda z_k/2\} \le 2\,\frac{m_k^2}{2}\exp\{e(\lambda, \theta_0) - \lambda z_k/2\}.$$
Next, by the Cauchy–Schwarz inequality,
$$E_{\theta_0}|L_{I_K}(\tilde\theta_{I_K}, \hat\theta)|^r = \sum_{k=1}^K E_{\theta_0}\big[|L_{I_K}(\tilde\theta_{I_K}, \tilde\theta_{k-1})|^r \mathbf{1}(B_k)\big] \le \sum_{k=1}^K \big\{E_{\theta_0}|L_{I_K}(\tilde\theta_{I_K}, \tilde\theta_{k-1})|^{2r}\big\}^{1/2} P_{\theta_0}(B_k)^{1/2}.$$
Under the conditions of Theorem 2.1, it follows similarly to (2.6) that
$$E_{\theta_0}|L_{I_K}(\tilde\theta_{I_K}, \tilde\theta_{k-1})|^{2r} \le (m_K/m_{k-1})^{2r} R^*_{2r}(\theta_0)$$
for some constant $R^*_{2r}(\theta_0)$ and $k = 1, \ldots, K$, and therefore
$$E_{\theta_0}|L_{I_K}(\tilde\theta_{I_K}, \hat\theta)|^r \le [R^*_{2r}(\theta_0)]^{1/2} \sum_{k=1}^K m_k (m_K/m_{k-1})^r \exp\{e(\lambda, \theta_0)/2 - \lambda z_k/4\},$$
and the result follows by simple algebra provided that $a_1\lambda/4 \ge 1$ and $a_2\lambda/4 > 2$.
LEMMA A.1. Let $P$ and $P_0$ be two measures such that the Kullback–Leibler divergence satisfies $E \log(dP/dP_0) \le \Delta < \infty$. Then for any random variable $\zeta$ with $E_0 \zeta < \infty$, it holds that
$$E \log(1 + \zeta) \le \Delta + E_0 \zeta.$$
Proof: By simple algebra one can check that for any fixed $y$ the maximum of the function $f(x) = xy - x\log x + x$ is attained at $x = e^y$ (since $f'(x) = y - \log x$ vanishes there), leading to the inequality $xy \le x\log x - x + e^y$. Using this inequality and the representation $E\log(1+\zeta) = E_0\{Z\log(1+\zeta)\}$ with $Z = dP/dP_0$, we obtain
$$E\log(1+\zeta) = E_0\{Z\log(1+\zeta)\} \le E_0(Z\log Z - Z) + E_0(1+\zeta) = E_0(Z\log Z) + E_0\zeta - E_0 Z + 1.$$
It remains to note that $E_0 Z = 1$ and $E_0(Z\log Z) = E\log Z \le \Delta$.
Proof of Theorem 4.1: Lemma A.1 applied with $\zeta = \Delta(\hat\theta, \theta)/E_\theta \Delta(\hat\theta, \theta)$ yields the result in view of
$$E_\theta(Z_{I,\theta} \log Z_{I,\theta}) = E \log Z_{I,\theta} = E \sum_{t\in I} \log \frac{p[Y_t, g(X_t)]}{p[Y_t, g(X_t(\theta))]} = E \sum_{t\in I} E\left\{ \log \frac{p[Y_t, g(X_t)]}{p[Y_t, g(X_t(\theta))]} \,\Big|\, \mathcal{F}_{t-1} \right\} = E_{I_k}(\theta).$$
Proof of Corollary 4.1: It is Theorem 4.1 formulated for $\Delta(\theta', \theta) = L_I(\theta', \theta)$.
Proof of Theorem 4.2: The first inequality follows from Corollary 4.1, the second one from condition (3.4) and the property $x \ge \log x$ for $x > 0$.

Proof of Theorem 4.3: Let $\hat k = k > k^*$. This means that $I_k$ is not rejected as homogeneous. Next, we show that for every $k > k^*$ the inequality $T_{I_k,\tau} \le T_{I_k,\mathcal{T}(I_k)} \le z_k$ with $\tau = T - m_{k^*} = T - |I_{k^*}|$ implies $L_{I_{k^*}}(\tilde\theta_{I_{k^*}}, \tilde\theta_{I_k}) \le z_{k^*}$. Indeed, with $J = I_k \setminus I_{k^*}$, we have by construction $z_k \le z_{k^*}$ for $k > k^*$ and
$$z_k \ge T_{I_k,\tau} = L_{I_{k^*}}(\tilde\theta_{I_{k^*}}, \tilde\theta_{I_k}) + L_J(\tilde\theta_J, \tilde\theta_{I_k}) \ge L_{I_{k^*}}(\tilde\theta_{I_{k^*}}, \tilde\theta_{I_k}).$$
It remains to note that
$$|L_{I_{k^*}}(\tilde\theta_{I_{k^*}}, \hat\theta)|^r \le |L_{I_{k^*}}(\tilde\theta_{I_{k^*}}, \hat\theta_{I_{k^*}})|^r \mathbf{1}(\hat k < k^*) + z_{k^*}^r \mathbf{1}(\hat k > k^*),$$
which obviously yields the assertion.
The Econometrics Journal (2009), volume 12, pp. 272–291. doi: 10.1111/j.1368-423X.2009.00290.x

Multi-tail generalized elliptical distributions for asset returns

SEBASTIAN KRING†, SVETLOZAR T. RACHEV†,‡,§, MARKUS HÖCHSTÖTTER†, FRANK J. FABOZZI¶ AND MICHELE LEONARDO BIANCHI††

†School of Economics and Business Engineering, University of Karlsruhe and KIT, Kollegium am Schloss, Bau II, 20.12, R210, Postfach 6980, D-76128, Karlsruhe, Germany
E-mails: [email protected], [email protected], [email protected]
‡Department of Statistics and Applied Probability, University of California, Santa Barbara, CA, 93106-3110, USA
§FinAnalytica Inc., 122 42nd St., New York, NY, 10168, USA
¶Yale School of Management, 135 Prospect Street, New Haven, CT, 06511, USA
E-mail: [email protected]
††Specialized Intermediaries Supervision Department, Bank of Italy, Via Nazionale, 91, 00184, Rome, Italy
E-mail: [email protected]

First version received: October 2007; final version accepted: January 2009
Summary In the study of asset returns, the preponderance of empirical evidence finds that return distributions are not normally distributed. Despite this evidence, non-normal multivariate modelling of asset returns does not appear to play an important role in asset management or risk management because of the complexity of estimating multivariate non-normal distributions from market return data. In this paper, we present a new subclass of generalized elliptical distributions for asset returns that is sufficiently user-friendly that it can be utilized by asset managers and risk managers for modelling multivariate non-normal distributions of asset returns. For the distribution we present, which we call the multi-tail generalized elliptical distribution, we (1) derive the densities using results of the theory of generalized elliptical distributions and (2) introduce a function, which we label the tail function, to describe their tail behaviour. We test the model on German stock returns and find that (1) the multi-tail model introduced in the paper significantly outperforms the classical elliptical model and (2) the hypothesis of homogeneous tail behaviour can be rejected. Keywords: α-Stable distributions, Generalized elliptical distributions, Likelihood ratio test, Risk management, t-Distributions, Varying-tail parameter.
1. INTRODUCTION

Since the seminal work of Mandelbrot (1963), there has been a great deal of empirical evidence supporting the existence of heavy-tailed models in finance (see Fama, 1965, Loretan and Phillips, 1994, McCulloch, 1996, and Rachev and Mittnik, 2000). Several models have been proposed to model multivariate heavy-tailed return data. Rachev and Mittnik (2000) suggest multivariate α-stable distributions to model multivariate asset returns because such distributions (1) are a natural extension of the normal distribution in terms of the generalized central limit theorem and (2) allow the modelling of the rich dependence structure of asset returns. Eberlein and Keller (1995) and Eberlein et al. (1998) suggest the generalized hyperbolic distribution for modelling asset returns, while Kotz and Nadarajah (2004) propose the multivariate t-distributions. A stylized fact that has been observed in equity prices is that extreme price declines are often joint extremes, in the sense that a large price decline for the stock of one company is accompanied by a simultaneous large price drop for the stock of other companies (see McNeil et al., 2005). This stylized fact can be captured by choosing distributions in multivariate models that allow for so-called tail dependence. Heavy-tailed elliptical distributions exhibit tail dependence and, in the case of elliptical distributions, this property has been extensively studied by Hult and Lindskog (2002) and Schmidt (2002). Elliptical distributions (e.g. t-distributions, symmetric generalized hyperbolic distributions and α-stable sub-Gaussian distributions) are radially symmetric (see Fang et al., 1990). Empirically, McNeil et al. (2005) report that the lower tail dependence is often much stronger than the upper tail dependence. This property, however, cannot be captured by elliptical distributions because of their radial symmetry. Skew-elliptical distributions (see Genton, 2004) are a generalization of elliptical distributions that might be capable of capturing this behaviour observed for asset returns in financial markets. Frahm (2004) introduces generalized elliptical distributions by assuming that the radial and spherical components of an elliptical distribution are not independent. Frahm's approach simplifies the modelling of stylized facts of multivariate financial time series.

Furthermore, and not surprisingly, the studies of Loretan and Phillips (1994) and Rachev and Mittnik (2000) find that the tail index varies significantly between assets. Despite this well-known fact, most existing research on heavy-tailed portfolios and factor models has assumed that the tail indices are the same in every direction. Additionally, several multivariate financial time series studies report that the lower tail dependence is stronger than the upper tail dependence. This empirical finding cannot be captured by elliptical or skew-elliptical distributions because these distributions fail to allow for varying tail indices. These well-documented findings reported for asset returns are not mere academic conclusions that hold little interest for practitioners. Rather, they have important implications for asset managers and risk managers. Not properly accounting for these stylized facts can produce inferior investment performance by asset managers and disastrous financial consequences for financial institutions that rely upon such models for risk management.
Discussions with practitioners as to why models that can deal with these stylized facts are not used in practice suggest that it is due to the complexity of estimating the non-normal multivariate models proposed in the literature. Accordingly, our purpose in this paper is to introduce a flexible model for dealing with these stylized facts but, at the same time, a model that is relatively easy for a practitioner to estimate. This is accomplished by introducing a subclass of generalized elliptical distributions, with a tail parameter depending on the direction. The particular form for the function of the tail parameter is motivated by the fact that we want to use information given by the principal components analysis applied to asset returns. The way in which we construct the C The Author(s). Journal compilation C Royal Economic Society 2009.
tail function provides (1) an economic justification for our model and (2) a feasible estimation algorithm. Furthermore, the stochastic representation allows a practitioner to easily simulate random variates from the distribution and to consider it in models for risk management. The main contribution of this paper is that tail dependence among stock returns is modelled through a tail function, whose value depends on the direction. Furthermore, from a practical perspective, we derive a flexible model that is simple to estimate and simulate and competitive with the generalized hyperbolic framework. This paper is organized as follows. In Section 2, we introduce the multi-tail generalized elliptical distributions, derive their basic properties and give examples. In Section 3, we develop a three-step estimation procedure to estimate the parameters of a multi-tail generalized elliptical distribution. In Section 4, we apply the multi-tail generalized elliptical distributions to return data, and Section 5 summarizes our conclusions.
2. MULTI-TAIL GENERALIZED ELLIPTICAL DISTRIBUTIONS

Let $\Sigma \in \mathbb{R}^{d\times d}$ be a positive definite matrix. We denote the 'Cholesky factor' of $\Sigma$ by $\Sigma^{1/2}$ and its inverse by $\Sigma^{-1/2}$. With the random vector $S \in \mathbb{R}^d$, we denote a uniformly distributed random vector on the unit hypersphere $S^{d-1} = \{x \in \mathbb{R}^d : \|x\| = 1\}$. For $x \in \mathbb{R}^d$, we call $s(x) = x/\|x\| \in S^{d-1}$ the spectral projection of $x$. Every elliptical random vector $X$ can be written in the form
$$X = \mu + RAS, \qquad (2.1)$$
where $R$ is a non-negative random variable independent of $S$, $A \in \mathbb{R}^{d\times d}$ a matrix and $\mu$ a location parameter. We denote a random variable $R$ with tail parameter $\alpha > 0$ by $R_\alpha$ and the density of a random vector $X$ by $f_X$. When applying an elliptical random vector
$$X = (X_1, \ldots, X_d)' = \mu + RAS \qquad (2.2)$$
to daily log-returns of a portfolio with $d$ assets, one can assume $\mu \approx 0$ (see RiskMetrics, 1996). Furthermore, the product $AS$ (due to equation (2.1)) determines the direction of the log-return $X$, while the random variable $R > 0$, being independent of $AS$, determines the absolute value of $X$ (i.e. $\|X\|$, where $\|\cdot\|$ is the Euclidean norm). In this situation, the assumption that $R$ and $AS$ are independent is questionable. This basic idea that $R$ and $S$ are not independent was presented by Frahm (2004), leading to the generalized elliptical distributions, and motivates the following scenario: assume that we have partial information that all components of $X$ are negative (hence, all components of $AS$ are negative since $R$ is positive). We may conclude that such an observation is caused by an important macroeconomic event or financial crisis (i.e. markets are in a stress situation). The markets tend to be extreme, and it is very likely that large losses in many components of $X$ may be observed, implying a large value for $R$ in our model. In such a market stress scenario, the correlations approach unity (see McNeil et al., 2005). However, instead of assuming a higher correlation between assets, if we assume lower tail parameters of $R$ in these directions (e.g. $AS = (-1, \ldots, -1)'$), large values of $R$, and hence outliers of $X$ in these directions, become much more likely. These lower tail parameters incorporate a higher tail dependence in these directions and cause simultaneous high losses.
[Figure 1. Bivariate scatterplots of daily log-returns: (a) BMW versus DaimlerChrysler; (b) Commerzbank versus Deutsche Bank.]
Let us now consider a two-dimensional example in order to investigate graphically a possible market behaviour. We choose daily log-returns from May 6, 2002, to March 31, 2006, for four selected companies included in the DAX30: BMW and DaimlerChrysler (DC), representing companies with high market capitalization in the automobile sector, and Deutsche Bank (DB) and Commerzbank (CB), representing companies with high market capitalization in the bank sector. The two bivariate scatterplots in Figure 1 (BMW versus DC, and DB versus CB) show that most of the outliers are in a cone around the first principal component. This example suggests that in the area around the first principal component, the tail parameter of $R$ should be smaller than in other directions, implying that the tail parameter of $R$ should vary and depend on the direction $s(AS)$. In Section 4, we will show how to capture this property in a statistical model, first in two dimensions and then, in a more realistic situation, by taking into account the returns of 29 stocks included in the DAX30.

2.1. Definition and basic properties

DEFINITION 2.1. (Multi-tail generalized elliptical distributions). Let $S \in \mathbb{R}^d$ be a uniformly distributed random vector on the unit hypersphere $S^{d-1}$, $I$ an interval of tail parameters and $(R_\alpha)_{\alpha\in I}$ a family of positive random variables with tail parameter $\alpha > 0$. The random vector $X = (X_1, X_2, \ldots, X_d)' \in \mathbb{R}^d$ has a multi-tail generalized elliptical distribution if $X$ satisfies
$$X \overset{d}{=} \mu + R_{\alpha(s(AS))}\, AS, \qquad (2.3)$$
where $A \in \mathbb{R}^{d\times d}$ is a regular matrix, $\mu$ a location parameter and $\alpha : S^{d-1} \to I$ a function. In particular, we call the function $\alpha : S^{d-1} \to I$ the tail function of a multi-tail generalized elliptical distribution. In Section 2.2, we discuss several examples of the tail function and its impact on the distribution.
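The stochastic representation (2.3) also shows how to simulate from the model: draw $S$ uniformly on the sphere, compute the direction $s(AS)$, and then draw the radial part with the direction-dependent tail parameter. The sketch below is ours, not the authors' code; for concreteness it uses a Pareto family with $P(R_a > r) = r^{-a}$, $r \ge 1$, as the radial family $(R_\alpha)$, and the function name sample_multitail is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_multitail(n, mu, A, alpha_fun):
    """Draw n variates X = mu + R_{alpha(s(AS))} A S, cf. equation (2.3).

    alpha_fun maps a unit direction u to a tail parameter alpha(u) > 0.
    A Pareto radial family with P(R_a > r) = r**(-a) stands in for
    (R_alpha); any positive family with tail parameter a could be used."""
    d = len(mu)
    X = np.empty((n, d))
    for i in range(n):
        S = rng.standard_normal(d)
        S /= np.linalg.norm(S)           # S uniform on the unit hypersphere
        AS = A @ S
        u = AS / np.linalg.norm(AS)      # spectral projection s(AS)
        a = alpha_fun(u)                 # direction-dependent tail parameter
        R = rng.uniform() ** (-1.0 / a)  # Pareto radial part with tail index a
        X[i] = mu + R * AS
    return X
```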
The multi-tail generalized elliptical distributions are a generalization of the elliptical distributions (see Fang et al., 1990) and a new subclass of the generalized elliptical distributions (see Frahm, 2004).

REMARK 2.1. It is a consequence of Definition 2.1 that we have
$$X - \mu \,\big|\, (s(X - \mu) = u) \;\overset{d}{=}\; R_{\alpha(u)}\, \|AS\|\, u \qquad (2.4)$$
and the tail parameter of the random vector $X - \mu \mid (s(X-\mu) = u)$ is $\alpha(u)$. In this sense, we say a multi-tail generalized elliptical random vector $X$ has a varying-tail parameter $\alpha(s(X-\mu))$ in terms of the direction $s(X-\mu)$.

THEOREM 2.1. Let $X = R_{\alpha(s(AS))} AS$ be a multi-tail generalized elliptical distribution, where $A \in \mathbb{R}^{d\times d}$ is a regular matrix, $I$ an interval of tail parameters, $(R_\alpha)_{\alpha\in I}$ a family of positive random variables, $(f_{R_\alpha})_{\alpha\in I}$ its family of densities, $S$ a uniformly distributed random vector on the hypersphere $S^{d-1}$ and $\alpha : S^{d-1} \to I$ a tail function. Then the density of $X$ is given by
$$f_X(x) = \det(\Sigma)^{-1/2}\, g_{\alpha(s(x-\mu))}\big((x-\mu)'\Sigma^{-1}(x-\mu)\big) \qquad (2.5)$$
and
$$g_{\alpha(u)}(r^2) = \frac{\Gamma(d/2)}{2\pi^{d/2}}\, r^{-d+1}\, f_{R_{\alpha(u)}}(r),$$
where $\Sigma = AA'$ is the dispersion matrix and $g_{\alpha(\cdot)}$ is called the density generator of the multi-tail generalized elliptical distribution.

This result is a consequence of corollary 30 in Frahm (2004). We note that the matrix $A$ is not part of the density term in equation (2.5). This is important for empirical work, since the matrix $A$ is difficult to observe and is not unique because of
$$S \overset{d}{=} PS, \quad\text{hence}\quad AS \overset{d}{=} (AP)S \quad\text{and}\quad \Sigma = AA' = APP'A', \qquad (2.6)$$
where $P \in \mathbb{R}^{d\times d}$ is an orthogonal matrix. Equation (2.6) holds because $S$ is a spherical random vector. For the definition of a spherical random vector, see McNeil et al. (2005). In the following, we discuss the connection between multi-tail generalized elliptical random vectors and elliptical random vectors.

DEFINITION 2.2. Let $(R_\alpha)_{\alpha\in I}$ be a family of positive random variables with tail parameter $\alpha$ and $Y_\alpha = \mu + R_\alpha AS$, $\alpha \in I$, a family of elliptical random vectors. We call the random vector $X$ a corresponding multi-tail generalized elliptical random vector to the family $(Y_\alpha)_{\alpha\in I}$ of elliptical random vectors if it is given by $X = \mu + R_{\alpha(s(AS))} AS$, where $\alpha : S^{d-1} \to I$ is a tail function.

The density of an elliptical random vector is given by the following theorem, which is a standard result of the theory of elliptical distributions (see Fang et al., 1990).

THEOREM 2.2. Let $X = \mu + R_\alpha AS$ be an elliptical random vector, where $\mu \in \mathbb{R}^d$ is a location vector, $A \in \mathbb{R}^{d\times d}$ a regular matrix, $R_\alpha$ a positive random variable with tail parameter $\alpha$ and the random vector $S \in \mathbb{R}^d$ uniformly distributed on $S^{d-1}$. Then $X$ possesses a density $f_X$ if and only if $R_\alpha$ has a density $f_{R_\alpha}$. The relationship between $f_X$ and $f_{R_\alpha}$ is as follows:
$$f_X(x) = \det(\Sigma)^{-1/2}\, g\big((x-\mu)'\Sigma^{-1}(x-\mu)\big) \quad\text{and}\quad g(r^2) = \frac{\Gamma(d/2)}{2\pi^{d/2}}\, r^{-d+1}\, f_{R_\alpha}(r),$$
where $\Sigma = AA'$ is the dispersion matrix.
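As a consistency check of the density generator relation just stated (our verification, not part of the paper), integrate the standardized density $g(\|z\|^2)$, $z = A^{-1}(x-\mu)$, in polar coordinates, using that the surface area of $S^{d-1}$ is $2\pi^{d/2}/\Gamma(d/2)$:
$$1 = \int_{\mathbb{R}^d} g(\|z\|^2)\,dz = \int_0^\infty \frac{2\pi^{d/2}}{\Gamma(d/2)}\, r^{d-1} g(r^2)\,dr, \qquad\text{so}\qquad f_{R_\alpha}(r) = \frac{2\pi^{d/2}}{\Gamma(d/2)}\, r^{d-1} g(r^2),$$
which inverts to $g(r^2) = \Gamma(d/2)\, r^{-d+1} f_{R_\alpha}(r)\big/(2\pi^{d/2})$, the relation stated in Theorems 2.1 and 2.2.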
Comparing equation (2.5) of Theorem 2.1 to the density formula of Theorem 2.2, one can see that the density of a multi-tail generalized elliptical random vector can be obtained by substituting the constant tail parameter $\alpha$ by the tail function $\alpha(\cdot)$ in the density of the corresponding family of elliptical random vectors. It is obvious that multi-tail generalized elliptical distributions are a generalization of elliptical distributions. If we choose the tail function to be constant, $R$ and $S$ are independent, and we obtain the classical case. Summing up, we obtain:

REMARK 2.2. Densities of a family of elliptical random vectors differ from the density of a corresponding multi-tail generalized elliptical random vector by the substitution of the constant $\alpha$ by a tail function $\alpha(\cdot)$.

2.2. Principal component tail functions

In the following, we introduce tail functions based on the principal components of the dispersion matrix $\Sigma = AA'$ of a multi-tail generalized elliptical random vector. Following the approach of Frahm (2004), we construct a tail index of a multivariate random variable which depends on the direction. We denote the eigenvector–eigenvalue pairs of the dispersion matrix $\Sigma = AA'$ of a multi-tail generalized elliptical random vector by $(v_1, \lambda_1), (v_2, \lambda_2), \ldots, (v_d, \lambda_d)$, where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d > 0$ and $\|v_1\| = \cdots = \|v_d\| = 1$. We will see in Section 3 that we can estimate $\Sigma$ without knowing the tail function. This is, of course, important in applications.

DEFINITION 2.3. Let $X \in \mathbb{R}^d$ be a multi-tail generalized elliptical random vector with dispersion matrix $\Sigma \in \mathbb{R}^{d\times d}$. We call the tail function $\alpha : S^{d-1} \to I$ of $X$ a principal component tail function (pc-tail function) if it satisfies
$$\alpha(s) = \sum_{i=1}^d \big[ w_i^+(\langle s, v_i\rangle)\,\alpha_i^+ + w_i^-(\langle s, v_i\rangle)\,\alpha_i^- \big],$$
where $s \in S^{d-1}$ and $w_i^+, w_i^- : [-1, 1] \to [0, 1]$, $i = 1, \ldots, d$, are weighting functions with
$$\sum_{i=1}^d \big[ w_i^+(\langle s, v_i\rangle) + w_i^-(\langle s, v_i\rangle) \big] = 1 \quad\text{and}\quad w_i^+(0) = w_i^-(0) = 0.$$

Note that for all pc-tail functions $\alpha$ we have $\alpha(s) \in I$, $s \in S^{d-1}$, since $I$ is an interval and $\alpha(s)$ can be interpreted as a convex combination of $\alpha_i^-, \alpha_i^+ \in I$, $i = 1, \ldots, d$. According to Remark 2.1, the tail parameter of $X$ in direction $s(X-\mu)$ is $\alpha(s(X-\mu))$. This means that the tail parameters for the directions $v_i$, $i = 1, \ldots, d$, are $\alpha(v_i) = w_i^+(1)\alpha_i^+ + w_i^-(1)\alpha_i^-$, and for the directions $-v_i$, $i = 1, \ldots, d$, $\alpha(-v_i) = w_i^+(-1)\alpha_i^+ + w_i^-(-1)\alpha_i^-$. For any other direction $s \in S^{d-1}$, we have the tail parameter
$$\alpha(s) = \sum_{i=1}^d \big[ w_i^+(\langle s, v_i\rangle)\,\alpha_i^+ + w_i^-(\langle s, v_i\rangle)\,\alpha_i^- \big].$$
The general idea behind the pc-tail function is that the tail parameter of a direction $s$ is a weighted sum of the tail parameters $\alpha_i^+$ and $\alpha_i^-$. The weights are determined by the scalar products $\langle s, v_i\rangle$ and the weighting functions $w_i^+$ and $w_i^-$, $i = 1, \ldots, d$. With Definition 2.3, we capture the phenomenon that in different areas—in particular, cones around the principal components—the distribution of the asset returns has different tail parameters, according to Remark 2.1 and the discussion at the beginning of this section. In the following, we give two examples of pc-tail functions. The first pc-tail function $\alpha_1 : S^{d-1} \to I$ is given by
$$\alpha_1(s) = \sum_{i=1}^d \langle s, v_i\rangle^2 \alpha_i,$$
with weighting functions $w_i^+(\langle s, v_i\rangle) = w_i^-(\langle s, v_i\rangle) = \frac{1}{2}\langle s, v_i\rangle^2$ for all $i = 1, \ldots, d$. This tail function assigns to every principal component a tail parameter $\alpha_i$, $i = 1, \ldots, d$. In particular, we have $\alpha(v_i) = \alpha(-v_i) = \alpha_i$ for $i = 1, \ldots, d$. In any other direction $s \in S^{d-1}$, $\alpha(s)$ is a convex combination of the tail parameters $\alpha_i$. In fact, $\alpha_1(\cdot)$ is a pc-tail function since we have, for all $s \in S^{d-1}$,
$$\sum_{i=1}^d \frac{1}{2}\langle s, v_i\rangle^2 + \sum_{i=1}^d \frac{1}{2}\langle s, v_i\rangle^2 = \sum_{i=1}^d \langle s, v_i\rangle^2 = \Big\langle s, \sum_{i=1}^d \langle s, v_i\rangle v_i \Big\rangle = \langle s, s\rangle = 1.$$
A refinement of the pc-tail function $\alpha_1(\cdot)$ is the pc-tail function $\alpha_2 : S^{d-1} \to I$ given by
$$\alpha_2(s) = \sum_{i=1}^d \langle s, v_i\rangle^2\, I_{(0,\infty)}(\langle s, v_i\rangle)\,\alpha_i^+ + \sum_{i=1}^d \langle s, v_i\rangle^2\, I_{(-\infty,0)}(\langle s, v_i\rangle)\,\alpha_i^-,$$
which allows for different tail parameters in each direction and thus for asymmetry. It is a tail function since we have
$$\sum_{i=1}^d \big[\langle s, v_i\rangle^2 I_{(0,\infty)}(\langle s, v_i\rangle) + \langle s, v_i\rangle^2 I_{(-\infty,0)}(\langle s, v_i\rangle)\big] = 1. \qquad (2.7)$$
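The first pc-tail function $\alpha_1$ is simple enough to sketch in code. The helper below is ours (the name pc_tail_function and interface are not from the paper); it builds $\alpha_1(s) = \sum_i \langle s, v_i\rangle^2 \alpha_i$ from the eigenvectors of the dispersion matrix:

```python
import numpy as np

def pc_tail_function(Sigma, alphas):
    """Build the pc-tail function alpha_1(s) = sum_i <s, v_i>^2 alpha_i
    from the eigenvectors v_i of the dispersion matrix Sigma."""
    lam, V = np.linalg.eigh(Sigma)
    order = np.argsort(lam)[::-1]   # sort eigenpairs so lambda_1 >= ... >= lambda_d
    V = V[:, order]
    alphas = np.asarray(alphas, dtype=float)

    def alpha(s):
        proj = V.T @ s              # <s, v_i>, i = 1, ..., d
        return float(proj ** 2 @ alphas)

    return alpha
```

For a unit vector $s$ the weights $\langle s, v_i\rangle^2$ sum to one (the $v_i$ are orthonormal), so the returned value stays in the interval spanned by the $\alpha_i$, as required of a pc-tail function.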
2.3. An example of a multi-tail generalized elliptical distribution

The density of a $t$-distributed random vector $Y$ is given by
$$f_Y(x) = \frac{\Gamma\big(\tfrac{1}{2}(\nu+d)\big)}{\Gamma\big(\tfrac{1}{2}\nu\big)\,(\pi\nu)^{d/2}\det(c\Sigma_0)^{1/2}} \left(1 + \frac{(x-\mu)'(c\Sigma_0)^{-1}(x-\mu)}{\nu}\right)^{-(\nu+d)/2},$$
where $\nu > 0$ is the tail parameter, $c > 0$ a scaling parameter and $\Sigma_0$ the normalized dispersion matrix. We discuss this issue more thoroughly and introduce normalization criteria for a dispersion matrix in Section 3. As in the classical elliptical case, the dispersion matrix of a multi-tail generalized elliptical distribution is only determined up to a scaling constant, hence we have to normalize it. We denote the normalized dispersion matrix by $\Sigma_0$.
[Figure 2. Contour lines of the densities of two asymmetric multi-tail t-distributions; the annotations in both panels mark where the tail function starts to dominate.]
The corresponding density of a multi-tail $t$-distribution is
$$f_X(x) = \frac{\Gamma\big(\tfrac{1}{2}(\nu(s(x-\mu))+d)\big)}{\Gamma\big(\tfrac{1}{2}\nu(s(x-\mu))\big)\,\big(\pi\,\nu(s(x-\mu))\big)^{d/2}\det(c\Sigma_0)^{1/2}} \left(1 + \frac{(x-\mu)'(c\Sigma_0)^{-1}(x-\mu)}{\nu(s(x-\mu))}\right)^{-(\nu(s(x-\mu))+d)/2}, \qquad (2.8)$$
where $\nu : S^{d-1} \to (0, \infty)$ is a tail function, $\Sigma_0$ the normalized dispersion matrix, $c > 0$ a scaling parameter and $\mu \in \mathbb{R}^d$ a location parameter (a similar extension of the multivariate $t$-distribution can be found in Frahm, 2004, p. 54).

Figure 2 depicts the density contour lines of two multi-tail $t$-distributions with dispersion matrix $\Sigma = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}$. In Figure 2(a), we have the pc-tail function
$$\nu(s) = \langle s, F_1\rangle^2 I_{(0,\infty)}(\langle s, F_1\rangle)\cdot 5 + \langle s, F_1\rangle^2 I_{(-\infty,0)}(\langle s, F_1\rangle)\cdot 2.5 + \langle s, F_2\rangle^2\cdot 5,$$
and in Figure 2(b),
$$\nu(s) = \langle s, F_1\rangle^2 I_{(0,\infty)}(\langle s, F_1\rangle)\cdot 5 + \langle s, F_1\rangle^2 I_{(-\infty,0)}(\langle s, F_1\rangle)\cdot 3 + \langle s, F_2\rangle^2\cdot 5.$$
We observe in Figure 2 that the shape of the contour lines around the mean $(0, 0)'$ is determined by the dispersion matrix $\Sigma$. But in the tails (i.e. far away from $(0, 0)'$) the influence of the tail function starts to dominate. In particular, outliers in a cone around $-F_1$ are much more likely than in any other direction. Summing up, the matrix $\Sigma$ determines the elliptical shape of the distribution around the mean, while the influence of the tail function increases in the tails of a multi-tail generalized elliptical distribution.
3. ESTIMATION OF MULTI-TAIL GENERALIZED ELLIPTICAL DISTRIBUTIONS

In this section, we present a three-step estimation procedure for the parameters of a multi-tail generalized elliptical random vector $X$. In the first step, we estimate the location vector $\mu \in \mathbb{R}^d$ with some robust method (see Frahm, 2004, for more details). In the second step, we estimate the dispersion matrix $\Sigma \in \mathbb{R}^{d\times d}$ up to a scaling constant $c > 0$ using the spectral estimator developed by Tyler (1987a,b) and Kent and Tyler (1988) and investigated by Frahm (2004); in the third step, we estimate the scaling constant $c$ and the tail function $\alpha(\cdot)$, applying again the maximum likelihood (ML) method. Since we have an analytic expression for the density of a multi-tail generalized elliptical distribution, we could, in principle, estimate all parameters in a single optimization step. However, this approach is not recommended, at least in higher dimensions, because it leads to an extremely complex optimization problem.

As in the classical elliptical case (see McNeil et al., 2005), a dispersion matrix of a multi-tail generalized elliptical random vector is only determined up to a scaling constant because of
$$X = \mu + cR_{\alpha(s(AS))}\,\frac{A}{c}\,S$$
for $c > 0$. Hence we have to normalize it. If second moments exist, one can normalize the dispersion matrix by the covariance matrix (see McNeil et al., 2005). In general, the following normalization schemes are always applicable, even when second moments do not exist:
$$\text{(i) } \Sigma_{11} = 1, \qquad \text{(ii) } \det(\Sigma) = 1, \qquad \text{(iii) } \operatorname{tr}(\Sigma) = 1. \qquad (3.1)$$
For the remainder of this section, we denote a normalized dispersion matrix by $\Sigma_0$. In the third step, we have to estimate the scale parameter $c$ and the tail function $\alpha(\cdot)$. Since we assume a pc-tail function, we have to evaluate the tail parameters $(\alpha_1, \ldots, \alpha_k) \in I^k$, $k \in \mathbb{N}$, of the pc-tail function. In the last step, we determine the parameters from the set $\Theta = \mathbb{R}_+ \times I^k$, where $I$ is the interval of tail parameters.

3.1. Estimation of the dispersion matrix

In order to estimate the dispersion matrix of a generalized elliptical distribution, we use the so-called spectral estimator based on the work of Tyler (1987a,b), Kent and Tyler (1988) and Frahm (2004). Furthermore, we assume the location parameter to be known. The next theorem shows how to apply the spectral estimator to multi-tail generalized elliptical distributions.

THEOREM 3.1. Let $X_1, \ldots, X_n$ be a sample of identically distributed multi-tail generalized elliptical data vectors with $n > d$. Then a fixed point $\hat\Sigma$ of the equation
$$\hat\Sigma = \frac{d}{n} \sum_{i=1}^n \frac{(X_i - \mu)(X_i - \mu)'}{(X_i - \mu)'\,\hat\Sigma^{-1}\,(X_i - \mu)}$$
exists, and it is unique up to a scale parameter. In particular, the sequence $(\tilde\Sigma^{(i)})_{i\in\mathbb{N}}$ defined by $\tilde\Sigma^{(0)} = I_d$ and
$$\tilde\Sigma^{(i+1)} = \frac{d}{n} \sum_{i=1}^n \frac{(X_i - \mu)(X_i - \mu)'}{(X_i - \mu)'\,(\tilde\Sigma^{(i)})^{-1}\,(X_i - \mu)} \qquad (3.2)$$
converges a.s. to the ML estimator $\hat\Sigma$.

In order to estimate a normalized version of the dispersion matrix of a multi-tail generalized elliptical random vector, we apply the iterative scheme in Theorem 3.1 $k$ times ($k$ sufficiently large) and normalize $\tilde\Sigma^{(k)}$ through one of the schemes given in equation (3.1).
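A minimal implementation sketch of the fixed-point iteration (3.2) follows; the function name spectral_estimator and the stopping rule are our choices, not the paper's, and normalization scheme (i) of (3.1) is applied at the end.

```python
import numpy as np

def spectral_estimator(X, mu, n_iter=100, tol=1e-10):
    """Fixed-point iteration (3.2) for the spectral (Tyler-type) estimator,
    normalized afterwards so that Sigma_{11} = 1 (scheme (i) in (3.1))."""
    n, d = X.shape
    Z = X - mu
    Sigma = np.eye(d)                                  # Sigma^(0) = I_d
    for _ in range(n_iter):
        Sinv = np.linalg.inv(Sigma)
        q = np.einsum('ij,jk,ik->i', Z, Sinv, Z)       # (X_i-mu)' Sigma^{-1} (X_i-mu)
        Sigma_new = (d / n) * (Z / q[:, None]).T @ Z   # d/n * sum of rank-1 terms
        if np.max(np.abs(Sigma_new - Sigma)) < tol:
            Sigma = Sigma_new
            break
        Sigma = Sigma_new
    return Sigma / Sigma[0, 0]                         # normalization Sigma_11 = 1
```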
3.2. Estimation of the parameter set

We assume that we have already estimated the location parameter $\mu$ and the normalized dispersion matrix $\Sigma_0$. We can write the multi-tail generalized elliptical random vector $X$ in the form
$$X = \mu + cR_{\alpha(s(\Sigma_0^{1/2} S))}\, \Sigma_0^{1/2} S, \qquad (3.3)$$
where $c > 0$ is a thus far unknown scale parameter. Note that we cannot estimate $c$ in the second step because, in that step, the dispersion matrix can only be determined up to this scale parameter. Since we assume $\alpha(\cdot)$ to be a pc-tail function (see Section 2.2), it is determined by the tail parameters $(\alpha_1, \ldots, \alpha_k) \in I^k$, $k \in \mathbb{N}$. Hence, we have to estimate $(c, \alpha_1, \alpha_2, \ldots, \alpha_k) \in \mathbb{R}_+ \times I^k = \Theta$. In the following, we present two equivalent methods to estimate the parameters $(c, \alpha_1, \ldots, \alpha_k) \in \Theta$: the radial variate ML-approach and the density generator ML-approach.

3.2.1. Radial variate ML-approach. For the radial variate ML-approach, we need the following proposition.

PROPOSITION 3.1. Let $X = \mu + R_{\alpha(s(AS))} AS \in \mathbb{R}^d$ be a multi-tail generalized elliptical random vector. Then we have
$$\sqrt{(X-\mu)'\,\Sigma^{-1}\,(X-\mu)}\;\Big|\;(s(X-\mu) = u) \;\overset{d}{=}\; R_{\alpha(u)},$$
or equivalently, $(X-\mu)'\,\Sigma^{-1}\,(X-\mu) \mid (s(X-\mu) = u) \overset{d}{=} R^2_{\alpha(u)}$, where $\Sigma = AA'$.
The proof for the elliptical case can be found in Cambanis et al. (1981); the extension to the generalized elliptical case is straightforward. Let $X_1, \ldots, X_n \in \mathbb{R}^d$ be a sample of identically distributed multi-tail generalized elliptical data vectors. We assume the location parameter $\mu$ and the normalized dispersion matrix $\Sigma_0$ to be known. We define the samples
$$R_i = \sqrt{(X_i - \mu)'\,\Sigma_0^{-1}\,(X_i - \mu)}, \quad i = 1, \ldots, n, \qquad\text{and}\qquad S_i = s(X_i - \mu), \quad i = 1, \ldots, n.$$
Then the data vectors $R_1|S_1, R_2|S_2, \ldots, R_n|S_n$ are independent since we assume $\mu$ and $\Sigma_0$ to be known. According to Proposition 3.1, the log-likelihood function
of this sample satisfies
$$\sum_{i=1}^n \log f_{cR(S_i)}(R_i) = \sum_{i=1}^n \log\Big[\frac{1}{c}\, f_{R_{\alpha(S_i)}}(R_i/c)\Big] = -n\log(c) + \sum_{i=1}^n \log\big(f_{R_{\alpha(S_i)}}(R_i/c)\big).$$
Finally, this leads to the optimization problem
$$\hat\theta = \operatorname*{argmax}_{\theta\in\Theta}\; \Big\{-n\log(c) + \sum_{i=1}^n \log f_{R_{\alpha(S_i)}}(R_i/c)\Big\}. \qquad (3.4)$$
Summing up, we obtain
$$\hat\theta(X_1, \ldots, X_n) = \operatorname*{argmax}_{\theta\in\Theta}\; \Big\{-n\log(c) + \sum_{i=1}^n \log f_{R_{\alpha(s(X_i-\mu))}}\Big(\sqrt{(X_i-\mu)'\,\Sigma_0^{-1}\,(X_i-\mu)}\,\big/\,c\Big)\Big\}.$$
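For the multi-tail $t$ model used later in this section, the radial density of a $d$-dimensional $t$-variate with $\nu$ degrees of freedom is $f_{R_\nu}(r) = \frac{2\Gamma((\nu+d)/2)}{\Gamma(\nu/2)\Gamma(d/2)\nu^{d/2}} r^{d-1}(1+r^2/\nu)^{-(\nu+d)/2}$, so (3.4) can be optimized numerically. The sketch below is ours, not the authors' code: the function names and the parametrization of the pc-tail function nu_fun(u, nu1, nu2) are our assumptions, and the scale is optimized on the log scale to keep it positive.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

def log_fR_t(r, nu, d):
    """Log-density of the radial part R of a d-dimensional t-variate with nu
    degrees of freedom (R^2 is the quadratic form of Proposition 3.1)."""
    return (np.log(2) + gammaln((nu + d) / 2) - gammaln(nu / 2)
            - gammaln(d / 2) - 0.5 * d * np.log(nu)
            + (d - 1) * np.log(r) - 0.5 * (nu + d) * np.log1p(r ** 2 / nu))

def fit_radial_ml(X, mu, Sigma0, nu_fun):
    """Radial variate ML (3.4) for a multi-tail t model; Sigma0 is the
    normalized dispersion matrix from step two of the procedure."""
    n, d = X.shape
    Z = X - mu
    q = np.einsum('ij,jk,ik->i', Z, np.linalg.inv(Sigma0), Z)
    R = np.sqrt(q)                                    # R_i of Proposition 3.1
    U = Z / np.linalg.norm(Z, axis=1, keepdims=True)  # S_i = s(X_i - mu)

    def negloglik(theta):
        logc, nu1, nu2 = theta
        c = np.exp(logc)                              # keep the scale positive
        nu = np.array([nu_fun(u, nu1, nu2) for u in U])
        return n * logc - np.sum(log_fR_t(R / c, nu, d))

    res = minimize(negloglik, x0=np.array([0.0, 4.0, 4.0]),
                   bounds=[(-10, 10), (0.5, 50), (0.5, 50)])
    return np.exp(res.x[0]), res.x[1], res.x[2]
```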
3.2.2. The density generator ML-approach. The density generator ML-approach uses directly the density $f_X$ of a multi-tail generalized elliptical random vector. Due to Theorem 2.1, we know that the density satisfies
$$f_X(x) = |\det(c^2\Sigma_0)|^{-1/2}\, g_{\alpha(s(x-\mu))}\big((x-\mu)'(c^2\Sigma_0)^{-1}(x-\mu)\big).$$
Since we assume $\Sigma_0$ and $\mu$ to be known, we obtain the log-likelihood function
$$\sum_{i=1}^n \log f_X(X_i) = -dn\log(c) - \frac{n}{2}\log(\det(\Sigma_0)) + \sum_{i=1}^n \log g_{\alpha(s(X_i-\mu))}\Big(\frac{(X_i-\mu)'\,\Sigma_0^{-1}\,(X_i-\mu)}{c^2}\Big).$$
Since the term $-\frac{n}{2}\log(\det(\Sigma_0))$ is constant with respect to $\Theta$, we can neglect it in the optimization. Thus, we obtain the following log-likelihood optimization problem:
$$\hat\theta = \operatorname*{argmax}_{\theta\in\Theta}\; \Big\{-dn\log(c) + \sum_{i=1}^n \log g_{\alpha(s(X_i-\mu))}\Big(\frac{(X_i-\mu)'\,\Sigma_0^{-1}\,(X_i-\mu)}{c^2}\Big)\Big\}. \qquad (3.5)$$
It can be seen that the density generator ML-approach and the radial variate ML-approach are equivalent.

3.2.3. Statistical analysis. For an empirical analysis of the spectral estimator, we simulate samples from a multi-tail $t$-distribution. In particular, we choose the location parameter $\mu$ to be $(0, 0)'$ and the dispersion matrix to be $\Sigma = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}$. We choose a tail function $\alpha(\cdot)$ of the following structure:
$$\alpha(s) = \langle s, F_1\rangle^2 \cdot 3 + \langle s, F_2\rangle^2 \cdot 6,$$
where $F_1 = s((1, 1)')$ and $F_2 = s((1, -1)')$ are the first two principal components of $\Sigma$. We assume that the scaling parameter $c = 2$, the first tail parameter $\nu_1 = 3$ and the second tail parameter $\nu_2 = 6$. Thus, the dispersion matrix up to a scaling constant is
$$\Sigma_0 = \begin{pmatrix} 1 & 0.5 \\ 0.5 & 1 \end{pmatrix}.$$
Since we know the density of the multi-tail $t$-distribution, we apply the density generator ML-approach to estimate these parameters from simulated data. In Tables 1–3, we report the quantiles of the empirical distributions and the relative errors of the estimators $\hat c^2$, $\hat\nu_1$ and $\hat\nu_2$, and in Figure 3 we show the boxplots of their empirical distributions.
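This simulation design can be wired together from the illustrative helpers sketched earlier (sample_multitail, pc_tail_function, spectral_estimator, fit_radial_ml); these remain our sketches, not the authors' code, and sample_multitail uses a Pareto rather than a $t$ radial part:

```python
import numpy as np

Sigma = np.array([[2.0, 1.0], [1.0, 2.0]])
A = np.linalg.cholesky(Sigma)
mu = np.zeros(2)
# Eigenvectors of Sigma are F1 = s((1,1)') and F2 = s((1,-1)'), so the
# pc-tail function below assigns tail parameter 3 to F1 and 6 to F2.
alpha = pc_tail_function(Sigma, alphas=[3.0, 6.0])

X = sample_multitail(3000, mu, A, alpha)   # simulated sample
Sigma0_hat = spectral_estimator(X, mu)     # step two: normalized dispersion
```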
Table 1. Estimation of $\hat c^2$.

Size    q_0.05   q_0.1    q_0.25   q_0.5    q_0.75   q_0.9    q_0.95   $e_{rel}^{0.5}$ (%)   $e_{rel}^{0.9}$ (%)
100     1.514    1.618    1.804    2.04     2.321    2.578    2.719    16                    36
250     1.680    1.746    1.869    2.008    2.171    2.301    2.422    8.5                   21
500     1.763    1.812    1.902    2.011    2.111    2.224    2.300    5.5                   15
1000    1.838    1.871    1.933    2.004    2.076    2.146    2.194    3.8                   10
3000    1.897    1.92     1.96     2.004    2.045    2.086    2.110    2.3                   5
6000    1.927    1.942    1.970    2.002    2.032    2.057    2.073    1.6                   3.6

Note: Quantiles of the estimator $\hat c^2$ for different sample sizes per estimate and the corresponding relative errors.
Table 2. Estimation of $\hat\nu_1$.

Size    q_0.05   q_0.1    q_0.25   q_0.5    q_0.75   q_0.9    q_0.95   $e_{rel}^{0.5}$ (%)   $e_{rel}^{0.9}$ (%)
100     1.679    1.871    2.326    2.978    4.046    5.733    7.263    34                    142
250     2.112    2.267    2.576    3.034    3.638    4.406    5.002    21                    67
500     2.321    2.48     2.714    3.004    3.378    3.827    4.165    13                    39
1000    2.482    2.588    2.789    3.01     3.279    3.56     3.725    9                     24
3000    2.681    2.761    2.878    3.015    3.169    3.299    3.394    6                     13
6000    2.772    2.822    2.896    2.999    3.093    3.195    3.267    3                     9

Note: Quantiles of the estimator $\hat\nu_1$ for different sample sizes per estimate and the corresponding relative errors.
Table 3. Estimation of $\hat\nu_2$.

Size    q_0.05   q_0.1    q_0.25   q_0.5    q_0.75   q_0.9    q_0.95   $e_{rel}^{0.5}$ (%)   $e_{rel}^{0.9}$ (%)
100     2.955    3.423    4.571    6.634    11.36    19.31    29.93    89                    399
250     3.669    4.060    4.935    6.148    7.954    11.04    13.46    36                    124
500     4.216    4.585    5.210    6.183    7.361    8.878    9.897    23                    64
1000    4.479    4.7617   5.376    6.049    6.867    7.802    8.492    14                    41
3000    5.086    5.279    5.610    6.069    6.514    6.971    7.222    9                     20
6000    5.356    5.482    5.735    6.062    6.363    6.672    6.868    6                     14

Note: Quantiles of the estimator $\hat\nu_2$ for different sample sizes per estimate and the corresponding relative errors.
[Figure 3. Boxplots of the estimates of $\hat c^2$, $\hat\nu_1$ and $\hat\nu_2$ for the different sample sizes per estimate.]
Each boxplot consists of 1000 estimates. The relative error is defined by
$$e_{rel}^{1-\alpha} = \max\left\{\left|\frac{q_{\alpha/2} - c}{c}\right|, \left|\frac{q_{1-\alpha/2} - c}{c}\right|\right\}, \qquad (3.6)$$
which means that the relative error of the estimator is smaller than $e_{rel}^{1-\alpha}$ with probability $1-\alpha$, measured by the empirical distribution of the estimator.
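As a quick numerical check of (3.6) (our arithmetic, not the paper's code; the variable names are ours), the reported $e_{rel}^{0.9}$ for sample size 1000 in Table 1 can be recovered from the tabulated quantiles:

```python
# Relative error (3.6) for the size-1000 row of Table 1: reference value 2
# for the scale and the tabulated quantiles q_0.05 and q_0.95.
true_val, q05, q95 = 2.0, 1.838, 2.194
e_rel_09 = max(abs(q05 - true_val), abs(q95 - true_val)) / true_val
print(f"{100 * e_rel_09:.1f}%")   # 9.7%, matching the ~10% reported in Table 1
```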
We observe that the medians are close to the corresponding true values. The empirical distributions of $\hat c^2$, $\hat\nu_1$ and $\hat\nu_2$ are skewed to the left for small sample sizes and become more symmetric for large sample sizes per estimate. The accuracy of the estimator $\hat c^2$ is higher than that of $\hat\nu_1$, and that of $\hat\nu_1$ is higher than that of $\hat\nu_2$ (compare the relative errors in Tables 1–3). It is not surprising that $\hat c^2$ performs better than both $\hat\nu_1$ and $\hat\nu_2$, since the parameters $\nu_1$ and $\nu_2$ determine the tail behaviour of the distribution and, naturally, we do not have many observations in the tails. In particular, the accuracy of $\hat\nu_1$ and, especially, of $\hat\nu_2$ is very poor for small sample sizes per estimate, i.e. sample sizes below 1000. For reliable estimates, we need a sample size of at least 3000, as the relative errors $e_{rel}^{0.9}$ in Tables 2 and 3 show. In Figure 3, all estimates of $c$ are depicted, highlighting again that the scale parameter can be estimated with high accuracy. In the case of $\nu_1$ and sample size 100, 23 estimates ranging from 10 to 127 are not shown, and for sample size 250, only one estimate (value 16.09) is not depicted. Finally, in the case of $\nu_2$, 49 estimates ranging from 30.2 to 2794 are not shown for sample size 100, and for sample size 250, three estimates (values 36.6, 33.7 and 38.47) are not depicted.
4. APPLICATIONS

To empirically investigate the multi-tail generalized elliptical distribution, we used the daily logarithmic return series for 29 German stocks included in the DAX. We excluded HypoRealEstate Bank because we did not have sufficient return data for this stock for the period covered. The period covered is from May 6, 2002, to March 31, 2006 (1000 daily observations for each stock), as in the example in Section 2. The main focus of our analysis is to assess empirically whether a multi-tail model is superior to a classical elliptical one in the analysis of asset-return behaviour. In the statistical analysis, we assume that the return series are stationary. Our intention is to estimate the unconditional multivariate distribution of the asset returns; thus, an ML estimation is performed. It is a well-known result that ML-estimators are consistent for the parameters of a stationary distribution, but statistical inference (which is based on standard likelihood theory) cannot be applied when the data are serially dependent. The model does not capture the serial dependence of the data since we consider stationary time series. For a discussion of conditional and unconditional financial time series modelling, see McNeil et al. (2005).

4.1. Two-dimensional analysis

Figure 1(a) depicts the two-dimensional scatterplot of BMW versus DC and Figure 1(b) the scatterplot of CB versus DB. In both figures, we can see that there are more outliers in the directions around the first principal component $F_1$, motivating a multi-tail model. A stylized fact of short-term log-returns of financial time series is that one can assume a negligible median. A common practice is to estimate the location parameter of a generalized elliptical random vector by means of the component-wise median rather than the mean (see Frahm, 2004). Thus, without loss of generality, we will assume the location parameter $\mu = 0$ in our empirical study. Applying the
spectral estimator to both samples, we obtain the normalized dispersion matrices ($\hat\sigma_{11} = 1$)
$$\hat\Sigma_0(X_1, \ldots, X_{1000}) = \begin{pmatrix} 1.000 & 0.762 \\ 0.762 & 1.204 \end{pmatrix}$$
for BMW versus DC, and
$$\hat\Sigma_0(Y_1, \ldots, Y_{1000}) = \begin{pmatrix} 1.000 & 0.568 \\ 0.568 & 0.745 \end{pmatrix}$$
for CB versus DB, representing the first step of the estimation procedure described in the previous section.
for CB versus DB, representing the first step of the estimation procedure described in the previous section. Note that due to the properties of the spectral estimator, these normalized dispersion matrices are valid for the elliptical, as well as for the multi-tail generalized elliptical model. In order to estimate the parameters c, ν 1 and ν 2 , we have to make a concrete distributional assumption. In our analysis, we choose the t- and multi-tail t-distribution. Note that the t-distribution has a constant tail function ν : S 1 → I = R + , ν(s) = ν0 , whereas for the multi-tail model, we specify the tail function satisfying ν : S 1 → R + , ν(s) = < s, F1 >2 ν1 + < s, F2 >2 ν2 , ˆ 0 . Besides estimating the where F 1 and F 2 are the first and second principal components of scale parameter c and the tail parameters ν 0 , ν 1 and ν 2 in both models, we apply the Akaike information criterion and likelihood ratio test to identify the superior model. The likelihood ratio test statistic satisfies supθ∈0 L(θ, X) . λ(X) = supθ∈ L(θ, X) Under the null hypothesis, it can be shown that −2 ln λ(X) ∼ χ 2q , where q is the difference between the free parameters in and 0 . Panel (a) of Table 4 shows the estimates for the scale parameter and tail parameters in both models. In both models, the scale parameters are the same while the tail parameters differ. This
Table 4. Likelihood estimates.

(a) ML-estimates of the scale parameter and tail parameters for the BMW–DC returns

         #par   $\hat c^2$           $\hat\nu_1$   $\hat\nu_2$   ln L       AIC           p-Value
t        6      $1.6\cdot10^{-4}$    3.7           3.7           5595.6     −11,176.2     –
multi-t  7      $1.6\cdot10^{-4}$    3.1           5.7           5598.3     −11,182.6     <2.5%

(b) ML-estimates of the scale parameter and tail parameters for the CB–DB returns

t        6      $2.41\cdot10^{-4}$   3.4           3.4           5331.5     −10,651       –
multi-t  7      $2.39\cdot10^{-4}$   2.6           6.1           5337.8     −10,662       0

(c) ML-estimates of the scale parameter and tail parameters for the DAX returns

t        465    $1.45\cdot10^{-4}$   4             4             84,711.5   −168,493      –
multi-t  466    $1.44\cdot10^{-4}$   2.7           4.7           84,716.3   −168,500.6    0.05%

Notes: The scale parameter and tail parameters for both models are estimated. The table shows the number of parameters (#par), the value of the log-likelihood at the maximum (ln L), the value of the Akaike information criterion (AIC) and the p-value of the likelihood ratio test against the elliptical model. The period investigated is May 6, 2002–March 31, 2006.
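As a numerical cross-check of the test just described (our arithmetic, not the authors' code), the p-value in panel (a) can be reproduced from the reported log-likelihoods:

```python
from scipy.stats import chi2

# Panel (a) of Table 4: ln L = 5595.6 (elliptical t, nu1 = nu2) versus
# 5598.3 (multi-tail t); q = 1 extra free parameter.
lr_stat = -2 * (5595.6 - 5598.3)   # -2 ln lambda(X) = 5.4
p_value = chi2.sf(lr_stat, df=1)   # approx. 0.020, i.e. < 2.5% as reported
print(lr_stat, p_value)
```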
This result is to be expected from our discussion in Section 3, because the scaling properties expressed by $\Sigma_0$ and $c$ and the tail behaviour captured by the tail parameters and the specified tail function are fairly independent for larger sample sizes. According to the Akaike information criterion, the multi-tail model is better because we observe a smaller value for that model. For the likelihood ratio test, we have $\Theta = \{(\nu_1, \nu_2) \in \mathbb{R}_+^2 : 0 < \nu_1 \le \nu_2\}$ and test the null hypothesis $H_0 : \theta \in \Theta_0 = \{(\nu_1, \nu_2) \in \mathbb{R}_+^2 : 0 < \nu_1 = \nu_2\}$ against the alternative $H_1 : \theta \in \Theta\setminus\Theta_0 = \{(\nu_1, \nu_2) \in \mathbb{R}_+^2 : 0 < \nu_1 < \nu_2\}$. According to Table 4, the p-value for this test is less than 2.5%, so it is reasonable to reject the elliptical model.

Panel (b) of Table 4 shows that we obtain basically the same results as in the previous case. The returns for CB and DB demand a multi-tail model even more strongly. The spread between the first and second tail parameters is larger than before. The difference between the log-likelihood values and the Akaike information criteria is also greater. Finally, the p-value of the likelihood ratio test is practically equal to zero. Again, the scaling parameters are close and the tail parameters differ, indicating that the ML estimator $\hat c^2$ for the scale parameter is fairly independent of the ML estimates of $\nu_1$ and $\nu_2$. In particular, the results reported in panels (a) and (b) of Table 4 coincide with the scatterplots in Figures 1(a) and (b), since in (b) we observe more pronounced outliers along the first principal component than in (a).

4.2. Multi-tail generalized elliptical model check for the DAX

The investigated return data $X_1, X_2, \ldots, X_{1000} \in \mathbb{R}^{29}$ are 29 of the 30 German stocks included in the DAX index. The period covered is May 6, 2002–March 31, 2006. We start our analysis by estimating the normalized dispersion matrix $\hat\Sigma_0(X_1, \ldots, X_{1000})$ using the spectral estimator. The results are depicted in Figure 4(d). Again, we assume $\mu = 0$ for daily log-returns according to RiskMetrics (1996). Figures 4(a) and (b) show the factor loadings of the first two principal components $F_1$ and $F_2$. Figure 4(c) depicts the eigenvalues of the normalized dispersion matrix obtained by the spectral estimator. We see that the eigenvalue of the first principal component is significantly larger than the others. The first vector of loadings is positively weighted for all stocks and can be thought of as describing a kind of index portfolio. Figure 4(d) shows the heat map of the normalized sample dispersion matrix ($\Sigma_{11} = 1$) estimated by the spectral estimator. Pale boxes correspond to low values (min = 0.202), which increase from bright to deep dark boxes for Infineon (IFX) (max = 3.2719). Figures 5(a)–(c) illustrate the pairwise scatterplots of the first three principal components. We can see in Figures 5(a) and (b) that the scatterplots are stretched along the first principal component. This scaling behaviour is caused by the large first eigenvalue of $F_1$. Moreover, it is important to note that we observe many outliers along $F_1$. This phenomenon may be attributed to smaller tail parameters in the directions around $F_1$ and $-F_1$. In Figure 5(c), both principal components have fundamentally the same scale and the outliers are not as pronounced as in the former plots, suggesting a similar tail behaviour. The visual analysis of Figure 5 motivates a multi-tail model for the logarithmic returns investigated. Since the first principal component differs from the others, we propose the tail function
$$\nu : S^{28} \to \mathbb{R}_+, \qquad s \mapsto \langle s, F_1\rangle^2 \nu_1 + \sum_{i=2}^{29} \langle s, F_i\rangle^2 \nu_2.$$
As we did in Section 4.1, we compare this multi-tail model with an elliptical one, which has a constant tail function ($\nu(s) = \nu_0$, $s \in S^{28}$).
Figure 4. Barplots summarizing the loading vectors of the first two principal components F 1 and F 2 , the eigenvalue decomposition and the heat map.
We conduct the same statistical analysis as in the previous section. In particular, we fit a $t$- and a multi-tail $t$-distribution to the data. The results are reported in Table 4. The elliptical model has 29 location parameters, $29 \cdot 15 = 435$ dispersion parameters and one tail parameter, while the multi-tail model has two tail parameters. Thus, we have 465 parameters in the elliptical and 466 in the multi-tail model. In the first step, we estimate the dispersion parameters with the spectral estimator up to a scaling constant. The scale parameter $\hat c^2$ and tail parameters $\hat\nu_1$ and $\hat\nu_2$ are estimated in a second step. Panel (c) of Table 4 shows that in both models the scale parameter $\hat c^2$ is almost the same, whereas the tail parameters differ significantly. The Akaike information criterion as well as the likelihood ratio test favour the multi-tail model. A p-value of less than 0.05% for this test indicates that we can reject the null hypothesis of an elliptical model at a very high confidence level.

4.3. Comparison to other models

We compare the multivariate generalized hyperbolic (MGH) distribution to the multi-tail generalized elliptical model. For more information about the generalized hyperbolic distribution, readers are referred to Barndorff-Nielsen (1978), Eberlein and Keller (1995), Eberlein et al.
[Figure 5. Pairwise, two-dimensional scatterplots of the first three principal components.]
Table 5. Likelihood estimates of the MGH model.

          #par   $\hat\chi$   $\hat\psi$   $\hat\lambda$   ln L       AIC
BMW–DC    9      0.6424       1.1946       −0.1            5602.84    −11,187.68
CB–DB     9      1.0352       0.378        −1              5336.93    −10,655.86
DAX       496    0.6738       2.2602       0.5             84,650     −168,308

Notes: Likelihood estimates for the parameters of an MGH distribution. The table shows the number of parameters (#par) and the log-likelihood at the maximum (ln L). AIC: Akaike information criterion.
(1998) and McNeil et al. (2005). The generalized hyperbolic density is given by
$$f(x) = c\; \frac{K_{\lambda-(d/2)}\Big(\sqrt{\big(\chi + (x-\mu)'\Sigma^{-1}(x-\mu)\big)\big(\psi + \gamma'\Sigma^{-1}\gamma\big)}\Big)\; e^{(x-\mu)'\Sigma^{-1}\gamma}}{\Big(\sqrt{\big(\chi + (x-\mu)'\Sigma^{-1}(x-\mu)\big)\big(\psi + \gamma'\Sigma^{-1}\gamma\big)}\Big)^{(d/2)-\lambda}},$$
where the normalizing constant is
$$c = \frac{(\sqrt{\chi\psi})^{-\lambda}\, \psi^{\lambda}\, (\psi + \gamma'\Sigma^{-1}\gamma)^{(d/2)-\lambda}}{(2\pi)^{d/2}\, \det(\Sigma)^{1/2}\, K_{\lambda}(\sqrt{\chi\psi})}$$
and $K_{\cdot}(\cdot)$ is a modified Bessel function of the third kind. The parameters satisfy $\chi > 0$, $\psi \ge 0$ if $\lambda < 0$; $\chi > 0$, $\psi > 0$ if $\lambda = 0$; and $\chi \ge 0$, $\psi > 0$ if $\lambda > 0$. We choose the MGH distribution since it covers many special cases, such as the generalized hyperbolic, generalized Laplace and skew $t$-distributions. In order to calculate the ML-estimates for the generalized hyperbolic distribution, we apply the stepwise expectation maximization (EM) algorithm described in McNeil et al. (2005).¹

Table 5 shows the estimates for $\chi$, $\psi$ and $\lambda$ in the MGH model. The parameter vectors $\gamma$ and $\mu$ are not depicted since they are close to zero ($<10^{-4}$). The EM-algorithm works extremely well for the two-dimensional data sets. Comparing the MGH model with the multi-tail generalized elliptical model, we observe that the Akaike information criterion prefers the MGH model for the BMW–DC returns and the multi-tail generalized elliptical model for the CB–DB returns. This result is not unexpected since the CB–DB returns are much more dispersed around the first principal component than the BMW–DC returns.
¹ The actual MATLAB code was provided by Saket Sathe. Contact address as of June 9, 2006, was [email protected].
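For completeness, the MGH log-density stated above can be evaluated as in the sketch below (our code, not the authors' or any library's implementation; scipy.special.kv is the modified Bessel function of the third kind):

```python
import numpy as np
from scipy.special import kv  # modified Bessel function of the third kind

def mgh_logpdf(x, lam, chi, psi, mu, Sigma, gamma):
    """Log-density of the multivariate generalized hyperbolic distribution,
    following the formula above (see also McNeil et al., 2005)."""
    d = len(mu)
    Sinv = np.linalg.inv(Sigma)
    Q = (x - mu) @ Sinv @ (x - mu)           # (x-mu)' Sigma^{-1} (x-mu)
    g = gamma @ Sinv @ gamma                 # gamma' Sigma^{-1} gamma
    arg = np.sqrt((chi + Q) * (psi + g))
    log_c = (-lam * 0.5 * np.log(chi * psi) + lam * np.log(psi)
             + (d / 2 - lam) * np.log(psi + g)
             - (d / 2) * np.log(2 * np.pi)
             - 0.5 * np.linalg.slogdet(Sigma)[1]
             - np.log(kv(lam, np.sqrt(chi * psi))))
    return (log_c + np.log(kv(lam - d / 2, arg))
            + (x - mu) @ Sinv @ gamma
            - (d / 2 - lam) * np.log(arg))
```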
Furthermore, the tail parameters $\nu_1$ and $\nu_2$ differ in a more pronounced way for the CB–DB returns ($\nu_1 = 2.6$ and $\nu_2 = 6.1$; see Table 4) than for the BMW–DC returns ($\nu_1 = 3.1$ and $\nu_2 = 5.7$; see Table 4), indicating that a multi-tail model is very well suited for the CB–DB data. In the case of the DAX returns, we had problems applying the EM-algorithm since it converges very slowly. Thus, some uncertainty remains concerning the true maximum, since we bounded the maximum number of iterations to ensure that a solution is obtained in finite time. For example, in the MGH model the value of the log-likelihood is 84,650, while in the $t$-model the log-likelihood is 84,711.5 (see Table 4) when the presented three-step estimation procedure is applied. In particular, we show that a multi-tail generalized elliptical model can be applied to high-dimensional data. This makes it even more attractive, especially for practitioners.
5. CONCLUSION In this paper, we introduce a new subclass of generalized elliptical distributions that is flexible enough to capture a varying-tail behaviour of the underlying multivariate return data. We motivate this new type of distribution from typical behaviour of financial markets. By introducing the notion of a tail function, we show how to capture varying tail behaviour and present examples for tail functions. A three-step estimation procedure is presented to fit the multi-tail generalized elliptical distributions to data. By applying the Akaike information criterion and likelihood ratio test, we find empirical evidence that a simple multi-tail generalized elliptical model outperforms common elliptical models. Moreover, for the sample of stocks investigated, the hypothesis of homogeneous tail behaviour was rejected.
ACKNOWLEDGMENTS The views expressed in this paper are those of the authors and should not be attributed to the institutions to which they belong. The authors would like to thank the anonymous referees and Jianqing Fan (one of the co-editors) for valuable comments, as well as Stoyan Stoyanov and Borjana Racheva-Iotova from FinAnalytica Inc. for providing ML-estimators encoded in MATLAB. For further information, see Stoyanov and Racheva-Iotova (2004). Svetlozar Rachev gratefully acknowledges research support by grants from the Division of Mathematical, Life and Physical Sciences, College of Letters and Science, University of California, Santa Barbara, the Deutschen Forschungsgemeinschaft and the Deutscher Akademischer Austauschdienst.
REFERENCES

Barndorff-Nielsen, O. E. (1978). Hyperbolic distributions and distributions on hyperbolae. Scandinavian Journal of Statistics 5, 151–57.
Cambanis, S., S. Huang and G. Simons (1981). On the theory of elliptically contoured distributions. Journal of Multivariate Analysis 11, 368–85.
Eberlein, E. and U. Keller (1995). Hyperbolic distributions in finance. Bernoulli 1, 281–99.
Eberlein, E., U. Keller and K. Prause (1998). New insights into smile, mispricing, and value at risk: the hyperbolic model. Journal of Business 71, 371–406.
Fama, E. (1965). The behavior of stock market prices. Journal of Business 38, 34–105.
Fang, K. T., S. Kotz and K. W. Ng (1990). Symmetric Multivariate and Related Distributions. London: Chapman and Hall.
Frahm, G. (2004). Generalized elliptical distributions: theory and applications. Unpublished Ph.D. thesis, University of Cologne.
Genton, M. G. (2004). Skew-Elliptical Distributions and Their Applications: A Journey Beyond Normality. Boca Raton: Chapman and Hall.
Hult, H. and F. Lindskog (2002). Multivariate extremes, aggregation and dependence in elliptical distributions. Advances in Applied Probability 34, 587–608.
Kent, J. T. and D. E. Tyler (1988). Maximum likelihood estimation for the wrapped Cauchy distribution. Journal of Applied Statistics 15, 247–54.
Kotz, S. and S. Nadarajah (2004). Multivariate t-Distributions and Their Applications. Cambridge: Cambridge University Press.
Loretan, M. and P. Phillips (1994). Testing the covariance stationarity of heavy-tailed time series. Journal of Empirical Finance 1, 211–48.
Mandelbrot, B. B. (1963). The variation of certain speculative prices. Journal of Business 36, 394–419.
McCulloch, J. H. (1996). Financial applications of stable distributions. In G. S. Maddala and C. R. Rao (Eds.), Handbook of Statistics, Volume 14, 393–425. Amsterdam: Elsevier.
McNeil, A. J., R. Frey and P. Embrechts (2005). Quantitative Risk Management. Princeton: Princeton University Press.
Rachev, S. T. and S. Han (2000). Portfolio management with stable distributions. Mathematical Methods of Operations Research 51, 341–52.
Rachev, S. T. and S. Mittnik (2000). Stable Paretian Models in Finance. New York: John Wiley & Sons.
RiskMetrics (1996). RiskMetrics Technical Document (4th ed.). New York: RiskMetrics.
Schmidt, R. (2002). Tail dependence for elliptically contoured distributions. Mathematical Methods of Operations Research 55, 301–27.
Stoyanov, S. V. and B. Racheva-Iotova (2004). Univariate stable laws in the fields of finance: approximations of density and distribution functions. Journal of Concrete and Applicable Mathematics 2, 37–57.
Tyler, D. E. (1987a). A distribution-free M-estimator of multivariate scatter. Annals of Statistics 15, 234–51.
Tyler, D. E. (1987b). Statistical analysis for the angular central Gaussian distribution on the sphere. Biometrika 74, 579–89.
The Econometrics Journal (2009), volume 12, pp. 292–309. doi: 10.1111/j.1368-423X.2009.00284.x
Multivariate stochastic volatility, leverage and news impact surfaces

MANABU ASAI† AND MICHAEL MCALEER‡,§

†Faculty of Economics, Soka University, 1-236 Tangi-cho, Hachioji, Tokyo 192-8577, Japan
E-mail: [email protected]
‡Department of Applied Economics, National Chung Hsing University, Taichung 402, Taiwan
§Economics Institute, Erasmus School of Economics, Erasmus University Rotterdam, Rotterdam 3000, The Netherlands
E-mail: [email protected]

First version received: August 2007; final version accepted: January 2009
Summary Alternative multivariate stochastic volatility (MSV) models with leverage have been proposed in the literature. However, the existing MSV with leverage models are unclear about the definition of leverage, specifically the timing of the relationship between the innovations in financial returns and the associated shocks to volatility, as well as their connection to partial correlations. This paper proposes a new MSV with leverage (MSVL) model in which leverage is defined clearly in terms of the innovations in both financial returns and volatility, such that the leverage effect associated with one financial return is not related to the leverage effect of another. News impact surfaces are developed for MSV models with leverage based on both log-volatility and volatility and are compared with the special case of news impact functions for their univariate counterparts. In order to capture heavy tails in each return distribution, we incorporate an additional factor for the volatility of each return. An empirical example based on bivariate data for Standard and Poor’s 500 Composite Index and the Nikkei 225 Index is presented to illustrate the usefulness of the new MSVL model and the associated news impact surfaces. Likelihood ratio (LR) tests are considered for model selection. The LR tests show that the two-factor MSVL model is supported, indicating that the restrictions considered in the paper are empirically adequate under heavy-tailed return distributions. Keywords: Dynamic latent variables, Leverage effect, News impact function, News impact surface, Stochastic volatility.
1. INTRODUCTION

In both the conditional volatility and stochastic volatility (SV) literature, asymmetric behaviour including the leverage effect has been widely observed (see McAleer, 2005, for a comparison of alternative univariate and multivariate, conditional and stochastic, volatility models). Such asymmetry typically focuses on the different effects of positive and negative shocks of equal magnitude on subsequent volatility. Regarding the leverage effect, Christie (1982) originally investigated the negative relation between the ex post volatility in the rate of returns on equity and the current value of the equity. The asymmetric property of the SV model is based on the direct
correlation between the innovations in both returns and volatility. For a theoretical development in the continuous-time framework, Hull and White (1987) generalized the Black–Scholes option pricing formula to analyse SV and the negative correlation between the innovation terms. In empirical research, extensions of a simple discrete-time model due to Taylor (1986) have been analysed by Wiggins (1987), Chesney and Scott (1989) and Harvey and Shephard (1996) in order to accommodate the direct correlation. Although the extension to examine the direct correlation between the innovations has been called the asymmetric SV model, we will refer to the asymmetric behaviour based on the direct correlation between the innovations as the 'SV model with leverage' to emphasize the underlying nature of the model. Asai and McAleer (2005) proposed alternative general univariate asymmetric SV models with leverage. In the framework of univariate models, Jacquier et al. (2004) developed the Bayesian Markov chain Monte Carlo (MCMC) estimation technique for estimating alternative SV models with leverage. Yu (2005) compared the model of Jacquier et al. (2004) with the traditional specification, by using the news impact function, and showed that the traditional specification should be used in order to describe leverage. Omori et al. (2007) proposed a more efficient MCMC method than that of Yu (2005).

The first multivariate SV (MSV) model was proposed by Harvey et al. (1994), who specified the model in terms of instantaneous correlations in the mean and volatility equations. This MSV model did not include leverage effects. Shephard (1996) proposed a one-factor MSV model. Based on these two approaches, several authors, including Danielsson (1998), Chan et al. (2006) and Asai and McAleer (2006), have proposed alternative MSV models with leverage. These MSV with leverage models, however, are unclear about the definition of leverage, specifically the timing of the relationship between the innovations in financial returns and the associated shocks to volatility, as well as their connection to partial correlations (see Asai et al., 2006, for a comparison of alternative MSV models in the literature).

In this paper, we make the following assumptions to clarify the definition and timing of leverage in a multivariate framework:

ASSUMPTION 1.1. For the ith variable, the innovation of the return is negatively correlated with the innovation of its volatility, which defines leverage.

ASSUMPTION 1.2. The innovations of the returns vector are mutually correlated, as are the innovations of the volatility vector.

ASSUMPTION 1.3. Conditional on the remaining variables, the innovation of the ith return is uncorrelated with the innovation of the jth volatility, if i ≠ j.

Under Assumptions 1.1–1.3, the leverage effect of the ith variable has no connection with the leverage effect of the jth variable. If the leverage effects are related, there will be leverage spillover effects.

Before proceeding, it is convenient to present the connection between partial correlations and the inverse of the correlation matrix. For any covariance matrix, Σ = {σ_ij}, of an m × 1 vector, X, we denote Σ^{-1} = {σ^{ij}}. Then the partial correlation between X_i and X_j is given by

ρ_{ij·rest} = −σ^{ij} / √(σ^{ii} σ^{jj}),

where 'rest' denotes {1, 2, . . . , m}\{i, j}.

FINANCIAL LEVERAGE. The necessary and sufficient condition that X_i is partially uncorrelated with X_j is given by σ^{ij} = 0. The financial leverage between X_i and X_j is given by σ^{ij} > 0.
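As a quick numerical companion to this definition, the following sketch (ours, assuming only numpy; not part of the authors' code) recovers the full matrix of partial correlations from the inverse of a covariance matrix:

```python
import numpy as np

def partial_correlations(sigma):
    """Partial correlations rho_{ij.rest} = -sigma^{ij} / sqrt(sigma^{ii} sigma^{jj}),
    where sigma^{ij} denotes the (i, j) element of the inverse of sigma."""
    prec = np.linalg.inv(sigma)
    d = np.sqrt(np.diag(prec))
    pcor = -prec / np.outer(d, d)
    np.fill_diagonal(pcor, 1.0)        # convention for the diagonal
    return pcor

# Toy 3x3 covariance matrix (hypothetical numbers, for illustration only)
sigma = np.array([[ 1.0, 0.3, -0.2],
                  [ 0.3, 1.0,  0.1],
                  [-0.2, 0.1,  1.0]])
print(partial_correlations(sigma))
```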
The remainder of the paper is organized as follows. Section 2 introduces a new MSV with leverage (MSVL) model in which leverage and its relation to partial correlations are defined precisely. Section 3 develops news impact surfaces for MSV models with leverage for log-volatility and volatility, and makes a comparison with the special case of news impact functions for their univariate counterparts. Section 4 discusses heavy-tailed return distributions and proposes the use of multiple factors for each return volatility. Section 5 presents some empirical results using bivariate data for Standard and Poor's 500 Composite Index and the Nikkei 225 Index to illustrate the usefulness of the new specification of leverage effects under heavy-tailed return distributions. Section 6 presents some concluding remarks.
2. THE MSV WITH LEVERAGE MODEL

Let y_t and h_t be m × 1 vectors of observations and unobserved components, respectively. We define exp(x) for an m-vector, x, as the element-by-element operator of exponentiation; that is, exp(x) = [exp(x_1), . . . , exp(x_m)]′. The MSV model of Harvey et al. (1994) is given by

y_t = D_t ε_t,  D_t = diag{exp(0.5 h_t)},  h_{t+1} = φ ◦ h_t + η_t,   (2.1)

where the operator ◦ denotes the Hadamard (or element-by-element) product and φ is an m × 1 parameter vector. The innovation vectors ε_t and η_t follow mutually independent multivariate normal distributions, N(0, S_1 P_ε S_1) and N(0, S_2 P_η S_2), respectively, where S_1 = diag{σ} = diag{σ_1, . . . , σ_m}, S_2 = diag{σ_η1, . . . , σ_ηm}, and P_ε and P_η are the correlation matrices of ε_t and η_t, respectively. In the model, the vector of volatilities is defined by S_1² D_t², and the vector of log-volatilities is given by h_t + ln σ², where ln(x) denotes the element-by-element logarithmic operator.

Taking account of the correlation between ε_t and η_t, we can define a more flexible MSV model. We assume that the stacked innovation vector follows a multivariate normal distribution, (ε_t′, η_t′)′ ∼ N(0, SPS), where

S = [ S_1  O ; O  S_2 ],  P = [ P_ε  P_εη ; P_ηε  P_η ],   (2.2)

with P_εη = P_ηε′. Asai and McAleer (2006) imposed the restriction that P_εη is diagonal, with the intention that the ith element of ε_t in the mean equation be conditionally uncorrelated with the jth element of η_t in the volatility equation, for i ≠ j. This attempt was unsuccessful for the reason given below. Instead, we incorporate Assumptions 1.1–1.3 to propose a new MSV with leverage (MSVL) model, as follows:

P^{-1} = [ P^{11}  Λ ; Λ  P^{22} ],   (2.3)

where P^{11} and P^{22} are m-dimensional symmetric and positive definite matrices, and Λ = diag{λ_1, . . . , λ_m}. Thus, the new MSVL model restricts Λ to be diagonal, implying that the ith element of ε_t in the mean equation is conditionally uncorrelated with the jth element of η_t in the volatility equation, for i ≠ j.
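To fix ideas, here is a minimal simulation sketch of the model in (2.1)–(2.2); the function name, the zero starting value for h_t and the interface are our own choices, and P is assumed to be a valid 2m × 2m correlation matrix:

```python
import numpy as np

def simulate_msv(T, phi, s1, s2, P, seed=0):
    """Simulate the MSV model (2.1) with stacked innovations (eps', eta')' ~ N(0, SPS).

    phi : (m,) Hadamard AR parameters of h_t
    s1, s2 : (m,) standard deviations of eps_t and eta_t (diagonals of S_1, S_2)
    P : (2m, 2m) correlation matrix of the stacked innovations
    """
    m = len(phi)
    S = np.diag(np.concatenate([s1, s2]))
    cov = S @ P @ S                               # Var[(eps', eta')'] = SPS
    rng = np.random.default_rng(seed)
    shocks = rng.multivariate_normal(np.zeros(2 * m), cov, size=T)
    y = np.empty((T, m))
    h = np.zeros(m)                               # start at the unconditional mean
    for t in range(T):
        eps, eta = shocks[t, :m], shocks[t, m:]
        y[t] = np.exp(0.5 * h) * eps              # y_t = D_t eps_t
        h = phi * h + eta                         # h_{t+1} = phi o h_t + eta_t
    return y
```

Note that the same shock pair (ε_t, η_t) is drawn jointly, so the leverage correlation acts between the return at time t and the volatility at time t + 1, matching the timing discussed below.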
We will discuss this property below. For convenience, we will refer to the unrestricted and restricted models as the 'general MSVL' model and the 'MSVL' model, respectively.

Starting from equation (2.2), we explain the connection between P and P^{-1} in greater detail. First, we have P^{11} = (P_ε − P_εη P_η^{-1} P_ηε)^{-1}, P^{22} = (P_η − P_ηε P_ε^{-1} P_εη)^{-1} and Λ = −P^{22} P_ηε P_ε^{-1}. Then we focus on the last equation, especially the role of P^{22}. As P_η is typically not a diagonal matrix in empirical analysis, P^{22} is also not diagonal. Furthermore, since P^{22} and P_ε^{-1} are not diagonal, a condition that P_εη is diagonal will not guarantee that Λ is also diagonal. Hence, imposing a restriction on P_εη as in Asai and McAleer (2006) does not guarantee that the ith element of ε_t and the jth element of η_t are uncorrelated, conditional on all the other elements of ε_t and η_t. Therefore, it is necessary to impose the restriction that Λ is diagonal in order to consider leverage, as defined below. It should be noted that, if Λ is diagonal, P_εη will not generally be diagonal, by a similar argument.

The diagonal structure of Λ is crucial to the interpretation and timing of leverage in the MSVL model. If Λ is not diagonal, there will be volatility spillovers and difficulty in interpreting the off-diagonal elements of Λ. In alternative MSV models with leverage, including the new MSVL model presented above, it is necessary to examine the relationship between the structures of P_εη and Λ. The interpretation of leverage for MSV models depends on Λ being diagonal, regardless of whether P_εη is diagonal.

Given this specification, the partial correlation between ε_it and η_it, which is the leverage of the ith return, is given by

L_i = −λ_i / √(ρ^{11}_ii ρ^{22}_ii),

while that of ε_it and η_jt (i ≠ j) is zero. Thus, for leverage to be negative, as required, the sign of λ_i is expected to be positive for all i. Furthermore, the partial correlation between ε_it and ε_jt is given by

ρ_{ε,ij·rest} = −ρ^{11}_ij / √(ρ^{11}_ii ρ^{11}_jj),

and the partial correlation between η_it and η_jt is given by

ρ_{η,ij·rest} = −ρ^{22}_ij / √(ρ^{22}_ii ρ^{22}_jj).

It should be noted that some of the elements of the partial correlation matrix, P^{-1}, are zero, specifically the off-diagonal terms of Λ, so that the new MSVL model differs from the existing MSV with leverage models of Danielsson (1998), Chan et al. (2006) and Asai and McAleer (2006).

Before clarifying the similarities and differences of alternative MSV with leverage models, we turn to the timing of the leverage effect, namely between the shocks to returns and the subsequent shocks to volatility. In the univariate case, Yu (2005) showed that the asymmetric SV model of Harvey and Shephard (1996) is the discrete-time approximation of the continuous-time SV model with leverage, while the asymmetric SV model of Jacquier et al. (2004) is not. Asai and McAleer (2006) discussed the timing of leverage for the multivariate case. It should be noted that the model of Jacquier et al. (2004) describes a kind of asymmetric effect, but it does not correspond to the leverage effect. The alternative MSV with leverage models and their distinguishing features are summarized in Table 1, as follows:
Table 1. Alternative MSV models with leverage.

Harvey et al. (1994): Corr(ε_t, η_s) = O for any t, s, including P_εη = O.
  (a) Basic model without leverage. (b) No news impact surface.

Danielsson (1998): P_εη = O, but Corr(ε_t, η_{t−1}) = diag{k_1, . . . , k_m}, with k_i < 0.
  (a) Incorrect timing of leverage effect between ε_t and η_{t−1}. (b) In empirical analysis, k_i = 0 is assumed for all i.

Chan et al. (2006): P_εη = O, but Corr(ε_t, η_{t−1}) = K = {k_ij}.
  (a) Basic model without leverage effect between ε_t and η_{t−1}. (b) No interpretation of leverage and spillover effects. (c) No restrictions on any elements in K.

Asai and McAleer (2006): Corr(ε_t, η_t) = P_εη = diag{k_1, . . . , k_m}, with k_i < 0.
  (a) Correct timing of leverage effect between ε_t and η_t. (b) Does not accommodate partial correlations.

This paper: Corr(ε_t, η_t) = P_εη, with P = [ P_ε  P_εη ; P_ηε  P_η ], P^{-1} = [ P^{11}  Λ ; Λ  P^{22} ] and Λ = diag{λ_1, . . . , λ_m}, with λ_i > 0.
  (a) Correct timing of leverage effect between ε_t and η_t. (b) Accommodates partial correlations. (c) Derivation of news impact surface.

Note: The correlation between ε_t and η_s is defined as Corr(ε_t, η_s) = S_1^{-1} E(ε_t η_s′) S_2^{-1} = [Corr(η_s, ε_t)]′.
(i) In Danielsson (1998), the timing of the leverage effect is incorrectly given as between ε_t and η_{t−1}, so that the matrices P_εη and P_ηε in equation (2.2) are implicitly restricted to be the null matrix. Although the matrix of incorrectly timed leverage effects in the specification of Danielsson (1998) is diagonal, this matrix was not estimated in the empirical example presented.
(ii) In Chan et al. (2006), the timing of the leverage effect is incorrectly given as between ε_t and η_{t−1}, so that the matrices P_εη and P_ηε in equation (2.2) are implicitly set equal to the null matrix. The matrix of incorrectly timed leverage effects in the specification of the Chan et al. (2006) model is not diagonal, and there are no restrictions on any elements of the matrix, so there are leverage spillover effects. However, no interpretation was given of such multivariate spillover effects, and the model was not estimated.
(iii) In Asai and McAleer (2006), the timing of the leverage effect is given correctly as between ε_t and η_t. Although the matrices P_εη and P_ηε in equation (2.2) are diagonal in the specification of the Asai and McAleer (2006) model, it does not accommodate partial correlations and the model was not estimated.
For purposes of considering correlation coefficients, including leverage effects, between innovation terms in alternative MSV models, it is necessary to start with the partial correlations, ρ_{ε,ij·rest}, ρ_{η,ij·rest} and L_i; otherwise, interpretation of the parameters to be estimated will not be possible. The following is a convenient way of constructing the correlation matrix, P, from the partial correlations, ρ_{ε,ij·rest}, ρ_{η,ij·rest} and L_i. For the case m = 2, if we consider the matrices given by

R = {r_ij} = [ 1, −ρ_{ε,12·rest}, −L_1, 0 ; −ρ_{ε,12·rest}, 1, 0, −L_2 ; −L_1, 0, 1, −ρ_{η,12·rest} ; 0, −L_2, −ρ_{η,12·rest}, 1 ]^{-1}

and R* = diag{r_11, . . . , r_44}, it follows that P = R*^{-1/2} R R*^{-1/2}.

Before ending this section, we return to the general MSVL model, which is given by equations (2.1) and (2.2) with the unrestricted matrix Λ = {λ_ij}. In this case, the partial correlation between ε_it and η_jt is defined by l_ji = −λ_ji / √(ρ^{11}_ii ρ^{22}_jj). We can use a likelihood ratio test for the proposed MSVL model against the more general MSVL model. If the null hypothesis of the new MSVL model is rejected, there will be volatility spillovers to the return series.
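A small numerical sketch of this construction for m = 2 (ours; the partial-correlation values are hypothetical):

```python
import numpy as np

def correlation_from_partials(p_eps, p_eta, L1, L2):
    """P = R*^{-1/2} R R*^{-1/2} for m = 2, with R the inverse of the
    matrix of negated partial correlations displayed above."""
    A = np.array([[ 1.0,   -p_eps, -L1,    0.0  ],
                  [-p_eps,  1.0,    0.0,  -L2   ],
                  [-L1,     0.0,    1.0,  -p_eta],
                  [ 0.0,   -L2,    -p_eta, 1.0  ]])
    R = np.linalg.inv(A)
    d = np.sqrt(np.diag(R))
    return R / np.outer(d, d)          # unit diagonal by construction

# Hypothetical values: rho_eps = 0.2, rho_eta = 0.4, L1 = -0.6, L2 = -0.3
P = correlation_from_partials(0.2, 0.4, -0.6, -0.3)
print(np.round(P, 3))
```

Inverting the resulting P and reading off the off-diagonal block confirms that Λ is diagonal with positive entries whenever the leverages L_i are negative.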
3. NEWS IMPACT SURFACE

Engle and Ng (1993) developed the news impact curve (NIC), which is a useful tool for measuring the effects of news on the conditional variances. They showed, graphically, the asymmetric reactions of the conditional variances to positive and negative shocks of equal magnitude for the GJR model of Glosten et al. (1992) and the EGARCH model of Nelson (1991). Recently, Caporin and McAleer (2009) developed news impact surfaces (NIS) for multivariate conditional volatility models, specifically the dynamic asymmetric multivariate GARCH, or DAMGARCH, model. In the framework of univariate SV models, Yu (2005) developed the news impact function (NIF) for evaluating the effects of news on the log-volatilities as an adaptation of the NIC. As there has not yet been any adaptation of the NIS for MSV models, in what follows two alternative NIS will be developed, in terms of volatilities and log-volatilities, for MSV with leverage models.

For convenience, we define the vectors of volatilities and log-volatilities as V_t = S_1² exp(h_t) and α_t = h_t + ln σ², respectively. Thus, the mean of the log-volatility is given by μ = ln σ². Given y_t, we have the following results for the equation of volatility:

h_{t+1} = φ ◦ h_t − S_2 (P^{22})^{-1} Λ S_1^{-1} D_t^{-1} y_t + η_t*,   η_t* ∼ N(0, S_2 (P^{22})^{-1} S_2),

where S_1 = diag{σ_1, . . . , σ_m} and S_2 = diag{σ_η1, . . . , σ_ηm}, as defined in Section 2. The properties of the conditional distribution of a multivariate normal distribution and of the inverse of a partitioned matrix have been used to derive the above equations. Based on these equations, a news impact surface (NIS) can be defined for MSV with leverage models as

E(α_{t+1} | y_t) = μ − S_2 (P^{22})^{-1} Λ S_1^{-1} D_c^{-1} y_t,   (3.1)
where D_c = diag{d_c1, . . . , d_cm}, with

d_ci = exp(0.125 σ_ηi² / (1 − φ_i²)).

For the special case where m = 1, namely a univariate SV model with leverage, the NIS reduces to the NIF of Yu (2005), although the expression for the NIF in the first equation on page 169 of Yu (2005) is incorrect. The correct expression for the NIF based on log-volatility is as follows:

E(α_{t+1} | y_t) = ln σ² − ρ σ_η σ^{-1} exp(−σ_η² / (8(1 − φ²))) y_t,   (3.2)

where ρ = −2λ/(1 + √(1 + 4λ²)) or λ = −ρ/(1 − ρ²).

In equation (3.1), S_1, S_2 and D_c are diagonal matrices, so that the news impact of the ith return on the jth log-volatility is determined by the (j, i)th element of (P^{22})^{-1}. In other words, not only does an asset return have a news impact on its own log-volatility, but the other asset will also have an effect via the partial correlation between the shocks of the log-volatilities, where the conditional correlation is defined by using P^{22}.

Figure 1 shows the graphs of the NIS for bivariate MSVL models. The parameter values for Figures 1(a-1) and (a-2) are given in Table 2 (see Jacquier et al., 1994, Shephard and Pitt, 1997, and Asai and McAleer, 2005, for further details regarding the parameter values), while the parameter values for Figures 1(b-1) and (b-2) are the same as in Table 2, except for ρ_{η,12·rest} = −0.4.

[Figure 1. Simulated NIS based on log-volatilities. Panels (a-1), (a-2), (b-1) and (b-2) plot the NIS against Y1 and Y2.]
Table 2. Parameter values for examples.
i    φ_i     σ_i    σ_ηi     L_i
1    0.98    1      0.166    −0.6
2    0.95    1      0.260    −0.3
ρ_{ε,12·rest} = 0.2,  ρ_{η,12·rest} = 0.4
All the graphs in Figure 1 are flat planes, as the NIS based on log-volatilities is a linear combination of y_1t and y_2t, which follows from the definition of the NIS. Figure 1(a-1) shows that, given that y_2t is constant, the NIS for log-volatility 1, that is, α_1t, increases as y_1t decreases. In the same manner, Figure 1(a-2) indicates that, given that y_1t is constant, the NIS for log-volatility 2 increases as y_2t decreases. The direction of the planes is controlled by the partial correlation between the shocks in the log-volatilities, that is, ρ_{η,12·rest}. As ρ_{η,12·rest} is positive in Figures 1(a-1) and (a-2), a negative shock in y_2t increases the NIS for log-volatility 1, while a positive shock in y_1t decreases the NIS for log-volatility 2. On the contrary, when ρ_{η,12·rest} is negative, which is the case for Figures 1(b-1) and (b-2), a negative shock in y_2t decreases the NIS for log-volatility 1, and a negative shock in y_1t also decreases the NIS for log-volatility 2. When ρ_{η,12·rest} is zero, the NIS for log-volatility 1 (log-volatility 2) is indifferent to shocks in y_2t (y_1t).

As the NIS developed above includes the NIF of Yu (2005) as a special case, it is informative to give the results for the univariate case. Figure 2(a) presents the NIF for log-volatility 1 when y_2t = 0, while Figure 2(b) gives the NIF for log-volatility 2 when y_1t = 0. At first sight, it might seem that positive and negative shocks of equal magnitude have effects of the same size on volatility, but this would not be correct. In order to evaluate the effects of shocks on volatility, we need a measure of the effect on volatility rather than on log-volatility.

Although the NIS developed above is new and useful and contains the NIF of Yu (2005) for the univariate model as a special case, there is a drawback regarding the scale effect. A large impact on log-volatility can correspond to a small impact on volatility, so that a comparison of the size of the effects based on log-volatilities is misleading. In order to avoid such an effect in the ARCH class of models, Engle and Ng (1993) originally considered an NIC based on conditional volatility rather than its log-volatility counterpart. Consequently, it is worth developing an alternative NIS, with an appropriate NIF as a special case, which is based on volatility as opposed to log-volatility. Defining exp(x) = [exp(x_1), . . . , exp(x_m)]′, we present an alternative NIS as follows:

V_t^NIS = S_1² exp(S_2² c + d_c* − S_2 (P^{22})^{-1} Λ S_1^{-1} D_c^{-1} y_t),   (3.3)

where c = (c_1, . . . , c_m)′, c_i is 0.5 times the (i, i)th element of (P^{22})^{-1}, d_c* = (d*_c1, . . . , d*_cm)′ and d*_ci = exp(0.5 φ_i² σ_ηi² / (1 − φ_i²)).

Figure 3 shows the graphs of the NIS for the alternative specification based on volatility. The parameter settings are the same as in Figure 1. All the graphs in Figure 3 are products of exponential curves, so that the scale effects in Figure 3 are consistent with the scale effects for the volatilities. It is clear that the NIS given in each panel of Figure 3 is not a flat plane but is curved, in keeping with the scale effects. Overall, the NIS based on volatilities would seem to be more meaningful than their counterparts based on log-volatilities.
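As an illustration (ours, not the authors' code), the NIS formulas (3.1) and (3.3) can be evaluated at the Table 2 parameter values, with the blocks of P^{-1} obtained numerically from the construction at the end of Section 2:

```python
import numpy as np

# Table 2 values: phi, sigma, sigma_eta, leverages and partial correlations
phi, sig, sigeta = np.array([0.98, 0.95]), np.array([1.0, 1.0]), np.array([0.166, 0.260])
L, p_eps, p_eta = np.array([-0.6, -0.3]), 0.2, 0.4

A = np.array([[1, -p_eps, -L[0], 0], [-p_eps, 1, 0, -L[1]],
              [-L[0], 0, 1, -p_eta], [0, -L[1], -p_eta, 1]], dtype=float)
R = np.linalg.inv(A)
d = np.sqrt(np.diag(R))
P = R / np.outer(d, d)                          # P = R*^{-1/2} R R*^{-1/2}
Pinv = np.linalg.inv(P)
P22inv = np.linalg.inv(Pinv[2:, 2:])            # (P^22)^{-1}
Lam = Pinv[:2, 2:]                              # diagonal Lambda block

dc = np.exp(0.125 * sigeta**2 / (1 - phi**2))   # d_ci of (3.1)
c = 0.5 * np.diag(P22inv)                       # c_i of (3.3)
dstar = np.exp(0.5 * phi**2 * sigeta**2 / (1 - phi**2))  # d*_ci of (3.3)

def nis_logvol(y):
    """E(alpha_{t+1} | y_t), equation (3.1)."""
    return np.log(sig**2) - np.diag(sigeta) @ P22inv @ Lam @ (y / (sig * dc))

def nis_vol(y):
    """V_t^NIS, equation (3.3), as printed."""
    inner = sigeta**2 * c + dstar - np.diag(sigeta) @ P22inv @ Lam @ (y / (sig * dc))
    return sig**2 * np.exp(inner)

print(nis_logvol(np.array([-2.0, 0.0])), nis_vol(np.array([-2.0, 0.0])))
```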
[Figure 2. Simulated news impact functions. Panels (a) and (b): NIF for log-volatilities 1 and 2; panels (c) and (d): NIF for volatilities 1 and 2, each plotted against the return.]
For the special case where m = 1, namely a univariate SV model with leverage, the NIF for volatilities is given by

V_t^NIF = σ² exp( σ_η²(1 − ρ²)/2 + φ²σ_η²/(2(1 − φ²)) − (ρσ_η/σ) exp(−σ_η²/(8(1 − φ²))) y_t ).   (3.4)

Figures 2(c) and (d) are the NIF based on volatility for the respective univariate cases. Compared with Figures 2(c) and (d), the NIF based on log-volatility in Figures 2(a) and (b), which are based on the development in Yu (2005), would seem to be misleading with regard to the scale effects.

For the case m ≥ 3, it is useful to consider 0.5m(m + 1) sets of three-dimensional graphs. For each graph regarding the ith and jth variables, we set the lth (l ≠ i, j) element of y_t to zero and calculate the NIS for the ith and jth variables.

At this stage, we reconsider the similarities and differences between the NIF based on volatility and the NIC of Engle and Ng (1993). First, the y-axis is the expected value of SV given the return for the former, while it is the conditional volatility given the past information for the latter. Thus, we have a similar interpretation for the news impact on volatility. Second, the shapes of the NIF in Figures 2(c) and (d) are different from the J-curves given by the GJR and EGARCH models. This is due to the definitions of leverage and asymmetry. As leverage is the negative correlation between the return and future volatility, leverage is an asymmetric effect. Asai and McAleer (2005) suggested a more general asymmetric SV model, which is given by

h_{t+1} = φh_t + γ_1 y_t + γ_2(|y_t| − E|y_t|) + η_t,

with unknown parameters γ_1 and γ_2.
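Returning to the univariate news impact functions, a direct transcription of (3.2) and (3.4) as printed (our sketch; ρ denotes the correlation between the return and volatility innovations) makes the scale effect easy to inspect numerically:

```python
import numpy as np

def nif_logvol(y, sigma, sigma_eta, phi, rho):
    """Equation (3.2): NIF on the log-volatility scale, as printed."""
    decay = np.exp(-sigma_eta**2 / (8 * (1 - phi**2)))
    return np.log(sigma**2) - rho * (sigma_eta / sigma) * decay * y

def nif_vol(y, sigma, sigma_eta, phi, rho):
    """Equation (3.4): NIF on the volatility scale, as printed."""
    decay = np.exp(-sigma_eta**2 / (8 * (1 - phi**2)))
    const = 0.5 * sigma_eta**2 * (1 - rho**2) \
            + 0.5 * phi**2 * sigma_eta**2 / (1 - phi**2)
    return sigma**2 * np.exp(const - rho * (sigma_eta / sigma) * decay * y)

y = np.linspace(-5, 5, 11)                   # grid of return shocks
print(nif_logvol(y, 1.0, 0.166, 0.98, -0.55))
print(nif_vol(y, 1.0, 0.166, 0.98, -0.55))
```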
[Figure 3. Simulated NIS based on volatilities. Panels (a-1), (a-2), (b-1) and (b-2) plot the NIS against Y1 and Y2.]
The model can accommodate asymmetric effects flexibly, but it is beyond the scope of the current paper. It should be noted that, in finance theory, option pricing models based on continuous-time SV models assume leverage effects if asymmetric effects are to be considered (see, for example, Hull and White, 1987, and Bollerslev and Zhou, 2006).
4. HEAVY-TAILED RETURN DISTRIBUTIONS

It is well known that financial return series have fatter tails than the Gaussian distribution. Although the class of SV and GARCH models enables the observed series, y_t, to have heavy-tailed distributions, empirical analysis has shown that assuming a Gaussian conditional distribution is insufficient to describe the tail behaviour of real data (see, for example, Kim et al., 1998, and Liesenfeld and Jung, 2000). There are three major approaches to cope with this problem. The first is to assume fatter-tailed conditional distributions than the Gaussian distribution (see, for example, Liesenfeld and Jung, 2000, and Chib et al., 2002). The second is to incorporate a jump process, as in Chib et al. (2002). The third is to employ multifactor models, as proposed in Chernov et al. (2003). One of the contributions of Chernov et al. (2003) is to attain a heavy-tailed return distribution by introducing multiple factors, without assuming heavy-tailed conditional distributions. With respect to Standard and Poor's 500 Composite Index (S&P) returns, the empirical findings of Chib et al. (2002) indicate that assuming the t-distribution is superior to using the
jump process, while Asai (2008) showed that the first and third methods are competitive. For the MSV model, Harvey et al. (1994) considered the multivariate t-distribution and estimated the MSV-t via the quasi-maximum likelihood method. However, it is not easy to derive the exact distribution of ln ε_t² when ε_t follows the multivariate t-distribution. Hence, we will use the third approach in the present paper.

Chernov et al. (2003) suggested a multifactor SV model in the framework of a continuous-time diffusion process. For leverage effects, they introduced a negative correlation between the innovations of the return and each factor. They also assumed independence of the factors, and incorporated conditional correlations implicitly. Asai (2008) considered the discrete-time approximation of their model and compared it with the SV-t model. We consider a multivariate extension of Chernov et al. (2003) and Asai (2008) in the following. Instead of assuming the vector of unobserved components, h_t, to follow a vector autoregressive model, we assume that h_t is the sum of two m-vectors, as follows:

h_t = ν_t + f_t,  ν_{t+1} = φ ◦ ν_t + η_t,  f_{t+1} = φ^f ◦ f_t + η_t^f,   (4.1)

where φ and φ^f are m-vectors of parameters, and the innovation terms, η_t and η_t^f, have Gaussian distributions. For leverage effects, we assume that the stacked innovation vector (ε_t′, η_t′, (η_t^f)′)′ follows a multivariate normal distribution, N(0, SPS), where

S = [ S_1  O  O ; O  S_2  O ; O  O  S_3 ],  P^{-1} = [ P^{11}  Λ  Λ^f ; Λ  P^{22}  O ; Λ^f  O  P^{33} ],   (4.2)

with S_3 = diag{σ^f_η,1, . . . , σ^f_η,m}, Λ = diag{λ_1, . . . , λ_m} and Λ^f = diag{λ^f_1, . . . , λ^f_m}. The specification states that the ith element of ε_t is conditionally correlated with the ith element of η_t (η_t^f), and the ith element of ε_t is conditionally uncorrelated with the jth element of η_t (η_t^f) for i ≠ j. Furthermore, η_t is conditionally uncorrelated with η_t^f. Hence, the model specifies the leverage effect precisely, and excludes volatility spillover effects. We refer to this model as the 'Two-Factor MSVL' (2FMSVL) model.

It should be noted that the multifactor model discussed here is different from the mean-factor model considered by Pitt and Shephard (1999) and Chib et al. (2006), in the sense that the multifactor model accommodates the additional factor in the volatility equation.

We may conduct several kinds of likelihood ratio (LR) tests. The first is the test for the MSVL model against the 2FMSVL model. The rejection of the 2FMSVL model implies that the additional factors are redundant, and that the tail behaviour of the return can be explained by the MSVL model. The second test is for the 2FMSVL model against the general 2FMSVL model. If the 2FMSVL model is rejected, there are volatility spillovers. The NIS and NIF discussed in the previous section can incorporate the 2FMSVL model by considering the expectation of the volatility V_t = S_1² exp(h_t) = S_1² exp(ν_t + f_t).

It is straightforward to extend the above two-factor MSVL model to a multifactor model. As Chernov et al. (2003) show that the two-factor model is sufficient for stock return data, we will use the two-factor MSVL model. For the empirical example in the next section, we also consider the 'general 2FMSVL' model, in which we assume no constraints on P^{-1} in (4.2).
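A minimal simulation sketch of the 2FMSVL model in (4.1)–(4.2) (ours; the initial factor values and interface are assumptions, and P must be a valid 3m × 3m correlation matrix with the block structure of (4.2)):

```python
import numpy as np

def simulate_2fmsvl(T, phi, phi_f, s, P, seed=0):
    """Simulate the 2FMSVL model (4.1) with h_t = nu_t + f_t.

    s : (3m,) standard deviations of (eps_t, eta_t, eta_t^f)
    P : (3m, 3m) correlation matrix with the block structure of (4.2)
    """
    m = len(phi)
    cov = np.diag(s) @ P @ np.diag(s)
    rng = np.random.default_rng(seed)
    z = rng.multivariate_normal(np.zeros(3 * m), cov, size=T)
    nu, f = np.zeros(m), np.zeros(m)
    y = np.empty((T, m))
    for t in range(T):
        eps, eta, eta_f = z[t, :m], z[t, m:2 * m], z[t, 2 * m:]
        y[t] = np.exp(0.5 * (nu + f)) * eps       # h_t = nu_t + f_t
        nu = phi * nu + eta                       # first (persistent) factor
        f = phi_f * f + eta_f                     # second factor
    return y
```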
5. EMPIRICAL EXAMPLE

This section presents the MCL estimates of the new MSVL model using bivariate data for S&P and the Nikkei 225 Index (Nikkei). The sample period for both series is 1/2/1986 to 10/4/2000, giving T = 3605 observations. Returns R_it are defined as 100 × {ln P_it − ln P_{i,t−1}}, where P_it is the closing price on day t for index i. We have used filtered data, y_it = R_it − E(R_it | I_{t−1}), based on the threshold AR(1) model.

For the estimation of the MSVL models, we use the Monte Carlo likelihood (MCL) approach proposed by Durbin and Koopman (1997). Sandmann and Koopman (1998) applied the MCL method to univariate SV models, while Asai and McAleer (2006) used it for MSV models. These two papers rely on the logarithmic transformation of squared returns. For the case of the MSV model, we consider the density of x_t = ln y_t². The merit of the transformation is that x_t has the state space form with x_t = h_t + ln ε_t² and h_{t+1} = φ ◦ h_t + η_t. The measurement equation of the state space form has a non-Gaussian density because of ln ε_t². In the MCL method, the likelihood function can be approximated arbitrarily well by decomposing it into a Gaussian part, which is constructed by the Kalman filter, and a remainder function, for which the expectation is evaluated through simulation.

Table 3 shows the MCL estimates for the MSVL model. The estimates for (φ_i, σ_i, σ_ηi) (i = 1, 2) are typical of MSV models. All the estimated partial correlations are significant, except for that between y_1t and y_2t. It is notable that both estimates of leverage, L_i, which are based on the diagonal matrix Λ in equation (2.2), are negative and highly significant, while the partial correlation between η_1t and η_2t is positive and highly significant. Thus, there are significant leverage effects for each series, but there are no leverage spillover effects, as these restrictions are imposed on the model in estimation. The volatilities of the two returns are positively and significantly correlated, but the returns themselves are uncorrelated.

Table 4 presents the estimates of the general MSVL model, indicating that the results are close to those in Table 3. Regarding the conditional correlation between ε_it and η_jt (i ≠ j), which corresponds to the off-diagonal elements of Λ, l_12 and l_21 are insignificant, showing that the MSVL model of Table 3 is appropriate.
Table 3. MCL estimates of the MSVL model for S&P 500 and Nikkei 225.
Parameters        Estimates     Standard errors
φ_1               0.9668        0.0071
φ_2               0.9675        0.0052
σ_1               0.8051        0.0397
σ_2               1.0811        0.0587
σ_η1              0.1993        0.0208
σ_η2              0.2248        0.0168
ρ_{ε,12·rest}     0.0348        0.1523
ρ_{η,12·rest}     0.4433        0.0806
L_1               −0.1820       0.0459
L_2               −0.3091       0.0423
Log-likelihood    −15,950.9
AIC               31,921.8
BIC               31,983.7
Table 4. MCL estimates of the general MSVL model for S&P 500 and Nikkei 225.
Parameters        Estimates     Standard errors
φ_1               0.9695        0.0067
φ_2               0.9685        0.0050
σ_1               0.8072        0.0430
σ_2               1.0844        0.0594
σ_η1              0.1978        0.0207
σ_η2              0.2235        0.0169
ρ_{ε,12·rest}     −0.00043      0.2534
ρ_{η,12·rest}     0.5213        0.0941
l_11              −0.1925       0.0656
l_21              0.0227        0.0797
l_12              0.1106        0.0709
l_22              −0.3553       0.0535
Log-likelihood    −15,949.5
AIC               31,922.9
BIC               31,997.2
An LR test for the proposed MSVL model against the more general MSVL model also did not reject the null hypothesis of the MSVL model.

Figure 4 presents the NIS and NIF based on volatilities (rather than log-volatilities) for the bivariate MSVL model. Figures 4(a) and (b) are the graphs of the NIS, which are products of exponential curves. Negative news has a larger impact on volatility than does positive news of similar magnitude. In this case, the news impact of y_1t (y_2t) on volatility 1 (volatility 2) is greater in scale than that of y_2t (y_1t). The graphs in Figures 4(c) and (d) are the NIF based on volatilities. The curve given in Figure 4(d) is steeper than that in Figure 4(c), primarily because of the differences in the two leverage effects.

Table 5 gives the MCL results for the 2FMSVL model. For convenience, we call ν_t and f_t in (4.1) the vectors of the 'first factor' and 'second factor', respectively. The first factor corresponds to h_t in the MSVL model. Compared with Table 3, the estimates of φ_i increase, while those of σ_i and σ_ηi decrease, upon introducing the second factor. For the second factor, the estimates of φ_1^f and φ_2^f are positive and significant, but the persistence of the second factor is much lower than that of the first. These results are also observed in the empirical analyses of Chernov et al. (2003) and Asai (2008), though they analysed univariate SV models. The estimates of σ^f_η1 and σ^f_η2 are 0.522 and 0.345, respectively. With respect to the unconditional variances of the first and second factors, σ_ηi²/(1 − φ_i²) is larger than (σ^f_ηi)²/(1 − (φ_i^f)²) for each i, indicating that the variation of the second factor is minor compared with that of the first factor. The coefficients for the leverage effects for the first factor are negative and significant, while those for the second factor are negative but insignificant. As for the MSVL model, the returns are uncorrelated, but the error terms of the first factor are positively and significantly correlated. Interestingly, the conditional correlation between the innovations of the second factor is significant and close to one. As we introduced the second factor to capture heavy tails of the return distributions, this result implies that the second-factor innovations are almost identical across the two series.

[Figure 4. Empirical NIS and NIF based on volatilities: S&P 500 and Nikkei 225. Panels: (a) NIS for Volatility 1; (b) NIS for Volatility 2; (c) NIF for Volatility 1; (d) NIF for Volatility 2.]
Table 6 presents the p-values of the LR tests, starting with the test of the MSVL model against the 2FMSVL model, which shows the rejection of the MSVL model. This result shows that each return distribution has tails so heavy that one factor is not sufficient to capture the tail behaviour. We also estimated the general 2FMSVL model, which allows volatility spillovers, but we have omitted the MCL results to save space. The LR test for the 2FMSVL model against the general 2FMSVL model, which is given in Table 6, indicates that the 2FMSVL model is supported, and that the data favour the restrictions empirically. Consequently, the restrictions that describe the leverage effect precisely are supported empirically under heavy-tailed return distributions.

Following Koopman and Uspensky (2002), we conducted diagnostic tests, reported in Table 7, for the assumptions of normality and no autocorrelation for the standardized residuals of the MSVL and 2FMSVL models. With respect to the MSVL model, the Jarque–Bera tests rejected the null hypothesis of normality for S&P returns, while they did not reject normality for the Nikkei 225 series. More precisely, for S&P returns the standardized residuals are left-skewed and leptokurtic. The Ljung–Box test with 20 lags does not reject the null hypothesis of no autocorrelation for either series. Regarding the 2FMSVL model, introducing the second factor made the kurtosis smaller than for the MSVL model, implying that the second factor captures the tail behaviour. Unfortunately, the values of the kurtosis for the 2FMSVL model are significantly smaller than three. The other tests do not reject the null hypothesis, thereby supporting the assumptions underlying the conditional distribution of ε_t.
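The LR tests reported in Table 6 are standard chi-squared comparisons of nested fits. As a sketch (ours, assuming scipy), the MSVL versus 2FMSVL test can be reproduced from the log-likelihoods reported in Tables 3 and 5:

```python
from scipy.stats import chi2

def lr_test(loglik_null, loglik_alt, df):
    """Likelihood ratio test: LR = 2(l_alt - l_null) ~ chi2(df) under the null."""
    stat = 2.0 * (loglik_alt - loglik_null)
    return stat, chi2.sf(stat, df)

# MSVL (Table 3) vs. 2FMSVL (Table 5), with 7 restrictions
stat, p = lr_test(-15950.9, -15884.4, df=7)
print(f"LR = {stat:.1f}, p-value = {p:.3f}")   # decisive rejection of the MSVL null
```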
Table 5. MCL estimates of the 2FMSVL model for S&P 500 and Nikkei 225.
Parameters          Estimates     Standard errors
φ_1                 0.9916        0.0027
φ_2                 0.9807        0.0041
σ_1                 0.7946        0.1147
σ_2                 1.0692        0.0734
σ_η1                0.0848        0.0116
σ_η2                0.1419        0.0190
ρ_{ε,12·rest}       0.0473        0.1069
ρ_{η,12·rest}       0.2560        0.0795
L_1                 −0.2755       0.0784
L_2                 −0.4534       0.0559
φ_1^f               0.2163        0.0575
φ_2^f               0.6769        0.0717
σ^f_η1              0.5219        0.0385
σ^f_η2              0.3447        0.0378
ρ^f_{η,12·rest}     0.8638        0.1392
L_1^f               −0.0366       0.0259
L_2^f               −0.0809       0.0600
Log-likelihood      −15,884.4
AIC                 31,802.7
BIC                 31,908.0

Table 6. Likelihood ratio tests.
Null model           Alternative model     d.f.    p-value
MSVL                 General MSVL          2       0.246
MSVL                 2FMSVL                7       0.000
2FMSVL               General 2FMSVL        8       0.377
MSVL                 Restricted 2FMSVL     3       0.000
Restricted 2FMSVL    2FMSVL                4       0.000
Before including the second factor, the standardized residuals of Nikkei returns follow the normal distribution, thereby satisfying the distributional assumption. Hence, the second factor might be inadequate regarding the assumption of a normal distribution. From this viewpoint, we may restrict some parameters of the 2FMSVL model, namely φ_2^f = σ^f_η2 = ρ^f_{η,12·rest} = L_2^f = 0. For convenience, we call this model the 'restricted 2FMSVL' model. Table 6 shows that the LR statistic rejected the null hypothesis of the MSVL model against the restricted 2FMSVL model. Furthermore, the LR test rejected the restrictions in the 2FMSVL model, thereby indicating the significance of the second factor for Nikkei returns. Table 7 shows that the diagnostic tests indicate that the standardized residuals for Nikkei satisfy the distributional assumption, while those for S&P do not. Although the kurtosis for Nikkei becomes less than 3, introducing the second factor still has merit for modelling the bivariate process.
Table 7. Diagnostic tests for standardized residuals.
Model                Data      Skewness            Kurtosis           LB(20)
MSVL                 S&P       −0.1380∗ (0.001)    3.2648∗ (0.001)    24.400 (0.225)
                     Nikkei    −0.0297 (0.467)     2.9969 (0.970)     27.237 (0.128)
2FMSVL               S&P       −0.0799 (0.050)     2.6742∗ (0.000)    26.473 (0.151)
                     Nikkei    −0.0277 (0.497)     2.7460∗ (0.002)    29.075 (0.086)
Restricted 2FMSVL    S&P       −0.0885 (0.030)     2.6481∗ (0.000)    14.319∗ (0.006)
                     Nikkei    −0.0425 (0.298)     3.0168 (0.837)     27.632 (0.118)
Note: ∗ denotes significance at the 5% level. p-values are given in parentheses.
6. CONCLUSION

Alternative multivariate stochastic volatility (MSV) models with leverage have been proposed in the literature. However, the existing MSV with leverage models have been unclear about the definition of leverage, specifically the timing of the relationship between the innovations in financial returns and the associated shocks to volatility, as well as their connection to partial correlations. This paper proposed a new MSV with leverage (MSVL) model, in which leverage was defined clearly in terms of the innovations in both financial returns and volatility, such that the leverage effect associated with one financial return is not related to the leverage effect of another. News impact surfaces were developed for MSV models with leverage based on both log-volatility and volatility, and were compared with the special case of news impact functions for their univariate counterparts. In order to capture the heavy tails of return distributions, additional factors were introduced. An empirical example based on bivariate data for Standard and Poor's 500 Composite Index and the Nikkei 225 Index was presented to illustrate the usefulness of the new MSVL model and the associated news impact surfaces. Likelihood ratio tests selected the two-factor MSVL model, thereby supporting the proposed specification of leverage under heavy-tailed return distributions.
ACKNOWLEDGMENTS

The authors wish to thank the Editor and two referees for insightful comments and suggestions, and Yoshi Baba, Massimiliano Caporin and Christian Gourieroux for helpful discussions. The first author acknowledges the financial support of the Japan Society for the Promotion of Science and the Australian Academy of Science. The second author is most grateful for the financial support of the Australian Research Council and the Japanese Ministry of Education, Culture, Sports, Science and Technology.
REFERENCES

Asai, M. (2008). Autoregressive stochastic volatility models with heavy-tailed distributions: a comparison with multifactor volatility models. Journal of Empirical Finance 15, 332–41.
Asai, M. and M. McAleer (2005). Dynamic asymmetric leverage in stochastic volatility models. Econometric Reviews 24, 317–32.
Asai, M. and M. McAleer (2006). Asymmetric multivariate stochastic volatility. Econometric Reviews 25, 453–73.
Asai, M., M. McAleer and J. Yu (2006). Multivariate stochastic volatility: a review. Econometric Reviews 25, 145–75.
Bollerslev, T. and H. Zhou (2006). Volatility puzzles: a simple framework for gauging return-volatility regressions. Journal of Econometrics 131, 123–50.
Caporin, M. and M. McAleer (2009). Threshold, news impact surfaces and dynamic asymmetric multivariate GARCH. Working Paper, Complutense University of Madrid.
Chan, D., R. Kohn and C. Kirby (2006). Multivariate stochastic volatility models with correlated errors. Econometric Reviews 25, 245–74.
Chernov, M., A. Gallant, E. Ghysels and G. Tauchen (2003). Alternative models for stock price dynamics. Journal of Econometrics 116, 225–57.
Chesney, M. and L. Scott (1989). Pricing European currency options: a comparison of the modified Black–Scholes model and a random variance model. Journal of Financial and Quantitative Analysis 24, 267–84.
Chib, S., F. Nardari and N. Shephard (2002). Markov chain Monte Carlo methods for generalized stochastic volatility models. Journal of Econometrics 108, 281–316.
Chib, S., F. Nardari and N. Shephard (2006). Analysis of high dimensional multivariate stochastic volatility models. Journal of Econometrics 134, 341–71.
Christie, A. (1982). The stochastic behavior of common stock variances: value, leverage and interest rate effects. Journal of Financial Economics 10, 407–32.
Danielsson, J. (1998). Multivariate stochastic volatility models: estimation and a comparison with VGARCH models. Journal of Empirical Finance 5, 155–73.
Durbin, J. and S. J. Koopman (1997). Monte Carlo maximum likelihood estimation of non-Gaussian state space models. Biometrika 84, 669–84.
Engle, R. and V. Ng (1993). Measuring and testing the impact of news on volatility. Journal of Finance 48, 1749–78.
Glosten, L., R. Jagannathan and D. Runkle (1992). On the relation between the expected value and volatility of the nominal excess return on stocks. Journal of Finance 46, 1779–801.
Harvey, A. C., E. Ruiz and N. Shephard (1994). Multivariate stochastic variance models. Review of Economic Studies 61, 247–64.
Harvey, A. C. and N. Shephard (1996). Estimation of an asymmetric stochastic volatility model for asset returns. Journal of Business and Economic Statistics 14, 429–34.
Hull, J. and A. White (1987). The pricing of options on assets with stochastic volatility. Journal of Finance 42, 281–300.
Jacquier, E., N. G. Polson and P. E. Rossi (1994). Bayesian analysis of stochastic volatility models (with discussion). Journal of Business and Economic Statistics 12, 371–89.
Jacquier, E., N. G. Polson and P. E. Rossi (2004). Bayesian analysis of stochastic volatility with fat-tails and correlated errors. Journal of Econometrics 122, 185–212.
Kim, S., N. Shephard and S. Chib (1998). Stochastic volatility: likelihood inference and comparison with ARCH models. Review of Economic Studies 65, 361–93.
Koopman, S. and E. Uspensky (2002). The stochastic volatility in mean model: empirical evidence from international stock markets. Journal of Applied Econometrics 17, 667–89.
Liesenfeld, R. and R. Jung (2000). Stochastic volatility models: conditional normality versus heavy-tailed distributions. Journal of Applied Econometrics 15, 137–60.
McAleer, M. (2005). Automated inference and learning in modelling financial volatility. Econometric Theory 21, 232–61.
Nelson, D. (1991). Conditional heteroskedasticity in asset pricing: a new approach. Econometrica 59, 347–70.
Omori, Y., S. Chib, N. Shephard and J. Nakajima (2007). Stochastic volatility with leverage: fast and efficient likelihood inference. Journal of Econometrics 140, 425–49.
Pitt, M. K. and N. Shephard (1999). Time varying covariances: a factor stochastic volatility approach. In J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith (Eds.), Bayesian Statistics, Volume 6, 547–70. Oxford: Oxford University Press.
Sandmann, G. and S. Koopman (1998). Estimation of stochastic volatility models via Monte Carlo maximum likelihood. Journal of Econometrics 87, 271–301.
Shephard, N. (1996). Statistical aspects of ARCH and stochastic volatility. In D. R. Cox, D. V. Hinkley and O. E. Barndorff-Nielsen (Eds.), Time Series Models in Econometrics, Finance and Other Fields, 1–67. London: Chapman & Hall.
Shephard, N. and M. K. Pitt (1997). Likelihood analysis of non-Gaussian measurement time series. Biometrika 84, 653–67.
Taylor, S. J. (1986). Modelling Financial Time Series. Chichester: John Wiley.
Wiggins, J. (1987). Option values under stochastic volatility: theory and empirical estimates. Journal of Financial Economics 19, 351–72.
Yu, J. (2005). On leverage in a stochastic volatility model. Journal of Econometrics 127, 165–78.
The Econometrics Journal (2009), volume 12, pp. 310–323. doi: 10.1111/j.1368-423X.2009.00281.x
Looking for skewness in financial time series

MATTEO GRIGOLETTO† AND FRANCESCO LISI†

†Department of Statistical Sciences, Via Cesare Battisti 241, 35121 Padova, Italy
E-mails: [email protected], [email protected]

First version received: April 2007; final version accepted: December 2008
Summary In this paper, we study marginal and conditional skewness in financial returns for nine time series of major international stock indices. For this purpose, we develop a new variant of the GARCH model with dynamic skewness and kurtosis. Our empirical results indicate that there is no evidence of marginal asymmetry in the nine time series under consideration. We do however find significant time-varying conditional skewness. The economic significance of conditional skewness is analysed in terms of Value-at-Risk measures and Market Risk Capital Requirements set by the Basel Accord. Keywords: Conditional skewness, Financial returns, GARCH models, Time-varying skewness, Skewness.
1. INTRODUCTION

The huge amount of work on financial time series has led to a general consensus in the scientific community about certain empirical statistical features known as stylised facts (i.e. positive correlation among squared or absolute returns, conditional heteroscedasticity, clustering effects, leptokurtosis of return distributions), which have been thoroughly investigated. On the contrary, skewness in marginal and conditional return distributions has been quite neglected, and relatively little work has been done to detect it. As a consequence, the occurrence of skewness, both unconditional and conditional, is still disputable and the empirical findings are not univocal. While some authors found, assumed or declared significant asymmetries in return distributions (e.g. Chen et al., 2001, Cont, 2001, Engle and Patton, 2001, Hueng and McDonald, 2005, and Silvennoinen et al., 2005), others (e.g. Kim and White, 2004, Peiró, 2004, Premaratne and Bera, 2005, and Lisi, 2007) are more doubtful about the pervasive presence of skewness in returns.

The existence (or lack) of both unconditional and conditional symmetry is important in a number of situations relevant to both economic and statistical contexts. From a financial perspective, skewness is crucial since it may itself be considered a measure of risk. For example, Kim and White (2004) stressed that, if investors prefer right-skewed portfolios then, for equal variance, one should expect a "skew premium" to reward investors willing to invest in left-skewed portfolios. With respect to optimal portfolio allocation, Chunhachinda et al. (1997) showed that allocation can change considerably if moments higher than the second are considered in selection. Along the same lines, Jondeau and Rockinger (2004) measured the advantages of using
a strategy based on higher-order moments. With respect to option pricing problems, Corrado and Su (1997) attributed the anomaly known as ‘volatility skew’ in option pricing to the skewness and kurtosis of the return distribution. In the context of hedge funds, some authors showed that the funds exhibit option-like features in their returns and have significant left-tail risk (Fung and Hsieh, 2001, Mitchell and Pulvino, 2001). The role played by skewness in risk management is also described by Rosenberg and Schuermann (2006). In general, it is reasonable to expect that, when skewness is present, accounting for it may lead to more precise estimates of risk measures such as Value-at-Risk (VaR). Several economic theories have been offered as an explanation of the mechanism generating the asymmetry, including leverage effects (Black, 1976, Christie, 1982), the volatility feedback mechanism (Campbell and Hentschel, 1992), stochastic bubbles models (Blanchard and Watson, 1982) and investor heterogeneity (Hong and Stein, 2003). On the other hand, the interest in possible asymmetries is motivated also by statistical reasons. For example, often estimation procedures assume conditional symmetry and thus a proper evaluation of this assumption may be advisable. In particular, Newey and Steigerwald (1997) showed that consistent estimation of the GARCH parameters can be obtained by QMLE if both the true and the assumed innovation densities are symmetric around zero and unimodal. When conditional symmetry does not hold, an additional parameter is necessary to identify the location of the innovation distribution. The assumption of conditional symmetry is also commonly used in adaptive estimation, and modelling of the conditional distribution is crucial in any dynamic analysis, such as dynamic optimal portfolio allocation or VaR estimation. Within this context, we first analyse the statistical significance of unconditional and conditional skewness, in order to assess whether asymmetry is a widespread characteristic of financial returns. In our analysis we consider nine time series of stock index returns for which marginal symmetry is investigated with the test proposed by Bai and Ng (2005). Then, for the same series, conditional skewness is studied using the Bai and Ng (2001, 2005) tests and a non-Gaussian GARCH-type model. In both steps, skewness is assumed to be constant. The possibility of conditional time-varying skewness is introduced in a third step, through a generalisation of the previous GARCH-type representation, that allows to dynamically model conditional variance, skewness and kurtosis. Although several models with dynamic conditional skewness and/or kurtosis have been studied in the literature (Hansen, 1994, Harvey and Siddique, 1999, Brooks et al., 2005, and Yan, 2005), here a new model is adopted for our aims. A second goal of this paper is to analyse the economic significance and the financial impact of a correct modelling of skewness. With this purpose VaR and connected capital requirements, as defined by the second Basel Accord (Basel Committee on Banking Supervision, 1995, 1996), were considered for the stock index returns. While in the available literature it is common practice to evaluate model performance by comparing nominal and observed VaR, much less frequent is the study of the connection between Value-at-Risk and the capital requirements introduced by the Basel Accord. The paper is organised as follows. 
Section 2 introduces a model which allows to study both constant and time-varying conditional skewness and kurtosis. Empirical evidence and the statistical and economic significance of skewness are investigated in Section 3. Concluding remarks are presented in Section 4.
2. FINDING EVIDENCE OF SKEWNESS

The first step of our study consists of testing for unconditional skewness by means of the standardised third moment S = μ_3/μ_2^{3/2}, where μ_j is the jth central moment. In this context, it should first be noted that the standard asymptotic test, based on the relationship √n Ŝ →d N(0, 6), does not work correctly, either for dependent Gaussian or for independent non-Gaussian data (Bai and Ng, 2005, Premaratne and Bera, 2005, and Lisi, 2007). In particular, for leptokurtic distributions this test strongly overestimates the asymmetry. To overcome this problem, Bai and Ng (2005) proposed a test for unconditional skewness, based on the distribution of Ŝ, that works properly also for dependent and non-Gaussian data. An asymptotically distribution-free test for conditional symmetry, which uses the empirical distribution function of the estimated residuals, was introduced by Bai and Ng (2001). It should also be noted that a test for conditional skewness based on S can be obtained by applying the Bai and Ng (2005) test to the standardised residuals.

Conditional skewness can also be assessed using models able to describe asymmetric behaviour. In this work we analyse the presence of conditional skewness using a GARCH-type model with innovations having a Pearson's Type IV (henceforth Pearson IV) distribution. This model represents a generalisation of the standard GARCH model because it can account for asymmetry and kurtosis in the conditional distribution. Conditional skewness and kurtosis can be time-varying, thus allowing to study possible dynamics in higher-order moments. In the following, the acronym GARCHDSK (GARCH with dynamic skewness and kurtosis) will be used to denote this model.

Time-varying skewness and kurtosis were first introduced by Hansen (1994), who extended the ARCH framework by adopting a conditional generalised Student's t distribution, and modelling its parameters as functions of the lagged errors. Approaches in which dynamics are imposed on shape parameters, thus inducing time-varying skewness and kurtosis, have also been adopted by, among others, Jondeau and Rockinger (2003) and Yan (2005). In other cases, higher-order moments are modelled directly. For example, Harvey and Siddique (1999) introduce a GARCH-type expression for the conditional skewness, while Brooks et al. (2005) use a similar representation for the kurtosis. León et al. (2005) employ a GARCH specification for both conditional skewness and kurtosis.

In the spirit of Hansen (1994), here dynamics on skewness and kurtosis are introduced by modelling shape parameters, rather than skewness and kurtosis directly. As remarked by Yan (2005), this approach is less computationally intensive and allows skewness and kurtosis to explode, while the shape parameters remain stationary. This is particularly useful when modelling extremal events. The approach suggested by Yan (2005) involves first estimating the dynamics of volatility. Then, conditionally on the results obtained, a model for skewness and kurtosis can be developed using the standardised residuals. However, this two-step procedure implies that the variability of the parameters ruling the dynamics of skewness and kurtosis is underestimated, being computed conditionally on the estimated volatility. For this reason, in the approach suggested here the parameters governing the dynamics of volatility, skewness and kurtosis are estimated together, in a single step.
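Returning to the naive unconditional test described at the start of this section, it can be written in a few lines (our sketch; the Bai and Ng (2005) correction, which replaces the variance 6 with a consistent estimate, is not reproduced here):

```python
import numpy as np

def naive_skewness_test(x):
    """Sample skewness S and the naive statistic sqrt(n/6)*S.

    Under i.i.d. normality sqrt(n)*S -> N(0, 6); for dependent or
    leptokurtic data this statistic over-rejects, which is what the
    Bai and Ng (2005) correction addresses.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    c = x - x.mean()
    S = np.mean(c**3) / np.mean(c**2)**1.5
    return S, np.sqrt(n / 6.0) * S      # compare with N(0, 1) quantiles
```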
Among the conditional distributions proposed in the literature on skewness of financial returns we find the log-generalised gamma (Brännäs and Nordman, 2003a,b), the normal inverse Gaussian (Jensen and Lunde, 2001) and the z distributions (Lanne and Saikkonen, 2007). In the present paper we follow Premaratne and Bera (2001) and Brännäs and Nordman
(2003b) in the use of a Pearson IV distribution. This distribution is flexible, in the sense that it implies a wide range of feasible skewness–kurtosis couples. For example, the range associated with the Gram–Charlier density studied in Jondeau and Rockinger (2001) and adopted by León et al. (2005) is rather limited (Yan, 2005). The Pearson IV is also found to approximate the generalised Student's t distribution on a large area of the skewness–kurtosis plane, but is computationally less demanding (see Premaratne and Bera, 2001, and the computational techniques discussed in Heinrich, 2004).

The GARCH-type model that we will use to assess skewness has the following structure: $y_t = \mu_t + \varepsilon_t$ ($t = 1, \ldots, n$), where $\mu_t = E(y_t \mid I_{t-1})$ and $\varepsilon_t$ is such that $\varepsilon_t \mid I_{t-1} \sim \text{Pearson IV}(\lambda_t, a_t, \nu_t, r_t)$. Hence, the conditional density is defined by

$$f(\varepsilon_t \mid I_{t-1}) = C_t \left[ 1 + \left( \frac{\varepsilon_t - \lambda_t}{a_t} \right)^2 \right]^{-(r_t+2)/2} \exp\left[ -\nu_t \arctan\left( \frac{\varepsilon_t - \lambda_t}{a_t} \right) \right]. \tag{2.1}$$
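As a minimal sketch of how (2.1) can be evaluated in practice, the following computes the Pearson IV log-density using the closed-form normalising constant reported in Heinrich (2004), the reference the paper cites for computational techniques; the function name and the sanity check are ours.

```python
# Sketch: Pearson IV log-density (2.1). The normalising constant follows
# Heinrich (2004): with m = (r + 2)/2, the tail exponent matches (2.1) and
# C = 2^(2m-2) |Gamma(m + i*nu/2)|^2 / (pi * a * Gamma(2m - 1)).
import numpy as np
from scipy.special import loggamma, gammaln

def pearson4_logpdf(x, lam, a, nu, r):
    m = 0.5 * (r + 2.0)
    z = (x - lam) / a
    logC = ((2.0 * m - 2.0) * np.log(2.0)
            + 2.0 * np.real(loggamma(m + 0.5j * nu))   # complex log-gamma
            - np.log(np.pi) - np.log(a) - gammaln(2.0 * m - 1.0))
    return logC - m * np.log1p(z**2) - nu * np.arctan(z)

# Sanity check: with nu = 0, r = g - 1, a = sqrt(g), the Pearson IV reduces
# to a Student's t with g degrees of freedom.
from scipy.stats import t
g = 7.0
x = np.linspace(-4, 4, 9)
print(np.allclose(pearson4_logpdf(x, 0.0, np.sqrt(g), 0.0, g - 1.0),
                  t.logpdf(x, df=g)))
```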
Jointly, the parameters $\lambda_t$, $a_t$, $\nu_t$ and $r_t$ control the conditional mean, variance, skewness and kurtosis. The quantity $C_t$ is a normalising constant depending on $a_t$, $\nu_t$ and $r_t$. The distribution is symmetric for $\nu_t = 0$, positively skewed for $\nu_t < 0$ and negatively skewed for $\nu_t > 0$. For fixed $\nu_t$, increasing $r_t$ decreases the kurtosis. The Pearson IV distribution is essentially a skewed version of the Student's t and, for $\nu_t = 0$, $r_t = g_t - 1$ and $a_t = \sqrt{g_t}$, reduces to a Student's t with $g_t$ degrees of freedom. The normal distribution is a limit case where $\nu_t = 0$ and $r_t \to \infty$. Setting $\lambda_t = a_t \nu_t / r_t$ in order to have a zero-mean error term, for the conditional distribution of $\varepsilon_t$ we have

$$E(\varepsilon_t \mid I_{t-1}) = 0, \qquad \sigma_t^2 = \mathrm{Var}(\varepsilon_t \mid I_{t-1}) = \frac{a_t^2 \left( r_t^2 + \nu_t^2 \right)}{r_t^2 (r_t - 1)},$$

$$S_t = S(\varepsilon_t \mid I_{t-1}) = \frac{-4\nu_t}{r_t - 2} \sqrt{\frac{r_t - 1}{r_t^2 + \nu_t^2}}, \qquad K_t = K(\varepsilon_t \mid I_{t-1}) = \frac{3(r_t - 1)\left[ (r_t + 6)\left( r_t^2 + \nu_t^2 \right) - 8 r_t^2 \right]}{(r_t - 2)(r_t - 3)\left( r_t^2 + \nu_t^2 \right)},$$

where $S_t$ and $K_t$ are the conditional skewness and kurtosis coefficients, given by the standardised third and fourth moments. In this framework, the conditional variance $\sigma_t^2$ depends jointly on $a_t$, $\nu_t$ and $r_t$, whereas conditional skewness and kurtosis depend only on $\nu_t$ and $r_t$. In particular, if $\nu_t = 0$ then $S_t = 0$, and this is why $\nu_t$ can be interpreted as the 'skewness parameter'. When $\nu_t = \nu$ and $r_t = r$ for all $t$, conditional skewness and kurtosis are constant.
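A small sketch, assuming only the moment formulas displayed above, that maps $(a_t, \nu_t, r_t)$ into the implied conditional variance, skewness and kurtosis; the Student's t check uses the reduction $\nu_t = 0$, $r_t = g - 1$, $a_t = \sqrt{g}$ noted earlier.

```python
# Sketch: conditional moments implied by (a_t, nu_t, r_t); valid for r > 3.
import numpy as np

def pearson4_moments(a, nu, r):
    var = a**2 * (r**2 + nu**2) / (r**2 * (r - 1.0))
    skew = (-4.0 * nu / (r - 2.0)) * np.sqrt((r - 1.0) / (r**2 + nu**2))
    kurt = (3.0 * (r - 1.0) * ((r + 6.0) * (r**2 + nu**2) - 8.0 * r**2)
            / ((r - 2.0) * (r - 3.0) * (r**2 + nu**2)))
    return var, skew, kurt

# With nu = 0, r = g - 1 and a = sqrt(g) this matches the Student's t:
# variance g/(g - 2) and kurtosis 3(g - 2)/(g - 4).
g = 10.0
print(pearson4_moments(np.sqrt(g), 0.0, g - 1.0))   # (1.25, 0.0, 4.0)
```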
Let us now turn to the description of the dynamics of $\sigma_t^2$, $S_t$ and $K_t$. Our proposal is to define them through the evolution of the parameters $a_t$, $\nu_t$ and $r_t$. This evolution is induced by the following autoregressive GARCH-type structure:

$$a_t^2 = \omega_a + \alpha_a \bar a_{t-1}^2 + \beta_a a_{t-1}^2, \qquad \nu_t = \omega_\nu + \alpha_\nu \bar\nu_{t-1} + \beta_\nu \nu_{t-1}, \qquad r_t = \omega_r + \alpha_r \bar r_{t-1} + \beta_r r_{t-1},$$

with $\bar a_t$, $\bar\nu_t$ and $\bar r_t$ being moment-based estimators of $a_t$, $\nu_t$ and $r_t$ (see Stuart and Ord, 1994, and Heinrich, 2004) defined by

$$\bar a_t = \frac{\bar\sigma_t}{4} \sqrt{16(\bar r_t - 1) - \bar S_t^2 (\bar r_t - 2)^2}, \tag{2.2}$$

$$\bar\nu_t = -\frac{\bar r_t (\bar r_t - 2)\, \bar S_t}{\sqrt{16(\bar r_t - 1) - \bar S_t^2 (\bar r_t - 2)^2}}, \tag{2.3}$$

$$\bar r_t = \frac{6\left( \bar K_t - \bar S_t^2 - 1 \right)}{2\bar K_t - 3\bar S_t^2 - 6}, \tag{2.4}$$

where $\bar\sigma_t^2 = \hat\mu_{2t}$, $\bar S_t = \hat\mu_{3t}/(\hat\mu_{2t})^{3/2}$ and $\bar K_t = \hat\mu_{4t}/(\hat\mu_{2t})^2$, with $\hat\mu_{jt} = \sum_{i=1}^m (\hat\varepsilon_{t-i+1} - \bar{\hat\varepsilon}_t)^j / m$. Hence, the estimates defined in (2.2), (2.3) and (2.4) are 'local', in the sense that only the $m$ most recent values of the series are used in the computation of $\bar\sigma_t^2$, $\bar S_t$ and $\bar K_t$. In the following, the choice of $m$ will be based on goodness-of-fit criteria. Since $a_t$, $\nu_t$ and $r_t$ depend only on past information, conditional variance, skewness and kurtosis at time $t$ can be computed at time $t - 1$.

The introduction of the constraints $\alpha_\nu = \alpha_r = \beta_\nu = \beta_r = 0$ allows one to estimate models with constant skewness and kurtosis. Note, however, that for a dynamic behaviour of both conditional skewness and kurtosis it is sufficient that at least one of these parameters differs from zero. Modelling $a_t$, $\nu_t$ and $r_t$, rather than the variance, skewness and kurtosis directly, turns out to be easier because the latter quantities need to satisfy non-linear constraints which are difficult to impose at each point in time, while the constraints concerning $a_t$, $\nu_t$ and $r_t$ can be implemented straightforwardly.

The issue of what constraints are necessary and sufficient to ensure the stationarity of the model requires further study. However, by simulation we found that the following conditions, besides guaranteeing the positivity of the variance and kurtosis parameters, are sufficient for non-explosive behaviour: $\omega_a > 0$, $\omega_r > 3$, $\alpha_i, \beta_i \geq 0$, $\alpha_i + \beta_i < 1$, for $i = a, \nu, r$.¹ In particular, the constraint $\omega_r > 3$ is needed to ensure existence of the kurtosis.

Estimates of the parameters $\omega_i$, $\alpha_i$ and $\beta_i$ ($i = a, \nu, r$) are obtained by maximising the log-likelihood function

$$\sum_{t=1}^n \left\{ \log C_t - \frac{r_t + 2}{2} \log\left[ 1 + \left( \frac{\hat\varepsilon_t - \lambda_t}{a_t} \right)^2 \right] - \nu_t \arctan\left( \frac{\hat\varepsilon_t - \lambda_t}{a_t} \right) \right\}, \tag{2.5}$$

where $\hat\varepsilon_t = y_t - \hat\mu_t$. The estimate $\hat\mu_t$ is computed in a first step of the procedure by fitting an ARMA model, which in the present context usually represents a very weak correlation structure. Since the parameters $a_t$, $\nu_t$ and $r_t$ are functions of $\omega_i$, $\alpha_i$ and $\beta_i$ ($i = a, \nu, r$), expression (2.5) can be maximised with respect to the latter. In principle, maximum likelihood could also be used to estimate the parameter $m$ in the definition of $\bar a_t$, $\bar\nu_t$ and $\bar r_t$; however, this would imply a large computational burden. Hence, the choice of $m$ will be based on goodness-of-fit considerations (see the next section).

¹ We achieved this result by simulating series from GARCHDSK models with parameters belonging to a grid of values and by checking for non-explosive behaviour and positive variance and kurtosis after 100,000 iterations. While these simulations show that the conditions are sufficient, we cannot state that they are necessary.
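The following sketch implements the 'local' moment-based estimators (2.2)–(2.4) over a rolling window of $m$ observations and the GARCH-type recursions for $(a_t^2, \nu_t, r_t)$ displayed above. The clipping safeguards and start-up values are our own additions for illustration; the paper enforces admissibility through the parameter constraints just discussed.

```python
# Sketch: rolling moment targets (2.2)-(2.4) and the GARCHDSK recursions.
import numpy as np

def rolling_moment_targets(eps, m):
    """(a_bar, nu_bar, r_bar) at each t from the m most recent residuals;
    entries before t = m - 1 are nan."""
    n = len(eps)
    a_bar, nu_bar, r_bar = (np.full(n, np.nan) for _ in range(3))
    for t in range(m - 1, n):
        e = eps[t - m + 1:t + 1]
        e = e - e.mean()
        mu2, mu3, mu4 = np.mean(e**2), np.mean(e**3), np.mean(e**4)
        S, K = mu3 / mu2**1.5, mu4 / mu2**2
        r = 6.0 * (K - S**2 - 1.0) / (2.0 * K - 3.0 * S**2 - 6.0)   # (2.4)
        r = np.clip(r, 3.05, 50.0)          # our safeguard: finite kurtosis
        disc = max(16.0 * (r - 1.0) - S**2 * (r - 2.0)**2, 1e-8)
        nu_bar[t] = -r * (r - 2.0) * S / np.sqrt(disc)              # (2.3)
        a_bar[t] = np.sqrt(mu2) * np.sqrt(disc) / 4.0               # (2.2)
        r_bar[t] = r
    return a_bar, nu_bar, r_bar

def garchdsk_paths(eps, m, wa, aa, ba, wv, av, bv, wr, ar, br):
    """a_t^2 = wa + aa*abar_{t-1}^2 + ba*a_{t-1}^2, and similarly nu_t, r_t."""
    a_bar, nu_bar, r_bar = rolling_moment_targets(eps, m)
    n = len(eps)
    a2 = np.full(n, eps.var())
    nu = np.zeros(n)
    r = np.full(n, 8.0)                     # arbitrary start-up values
    for t in range(m, n):
        a2[t] = wa + aa * a_bar[t - 1]**2 + ba * a2[t - 1]
        nu[t] = wv + av * nu_bar[t - 1] + bv * nu[t - 1]
        r[t] = wr + ar * r_bar[t - 1] + br * r[t - 1]
    return a2, nu, r
```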
3. SKEWNESS IN INTERNATIONAL STOCK INDEXES

3.1. Empirical evidence and statistical significance of skewness

We now look for empirical evidence of asymmetry by applying the previous methods to daily returns, adjusted for splits and dividends, of nine international stock indexes, namely the CAC40, DAX, FTSE100, MIB30, Dow Jones, S&P500, Nasdaq, Nikkei225 and SMI. The sample periods and sample sizes of the series are listed in Table 1.

Most of the series present some abnormal values that can be classified as outliers. Since we are interested in the systematic skewness in the data, we removed them to avoid dependence of the results on possible outliers. Identification was based on graphical examination of both the return time series and the standardised residuals of a GARCH(1, 1) model. The number of identified outliers was very small: no more than three for each series. Outliers were replaced with the mean of the data.

Sample skewness and kurtosis coefficients are given in Table 1: all indexes but one have negative skewness, and all show severe excess kurtosis. Only the Nikkei225 index has positive, but very small, skewness. These results are consistent with other findings in the literature (e.g. Cont, 2001, Belaire-Franch and Peiró, 2003, Kim and White, 2004, and Peiró, 2004).

As a starting point, we looked for unconditional skewness by applying the Bai and Ng (2005) test and, as a benchmark, the standard asymptotic test. The p-values for the hypothesis of symmetry, reported in Table 1, show that the Bai and Ng test accepts the null hypothesis, at the 5% level of significance, in eight cases out of nine, with a p-value of 0.0464 for the FTSE100. On the contrary, the use of the standard asymptotic test would have led to strong rejections of symmetry in all cases except the Nikkei225. These analyses indicate that no clear evidence of unconditional asymmetry was found in these time series.

As a second step, in order to test for constant conditional skewness, the previous two tests and the Bai and Ng (2001) test were applied to the standardised residuals of a suitable ARMA–Threshold GARCH model. Table 2 shows the sample skewness and kurtosis coefficients for the standardised residuals, and the results of the tests.
Table 1. Unconditional symmetry tests for index returns.

Series      Period                   n      Ŝ        K̂       BN05    AS
CAC40       01/03/1990–13/12/2005    3977   −0.103   5.798   0.393   0.008
DAX         28/11/1990–13/12/2005    3792   −0.117   6.204   0.326   0.003
FTSE100     02/04/1984–13/12/2005    5483   −0.264   6.362   0.046   0.000
MIB30       03/01/2000–13/12/2005    1547   −0.189   6.608   0.410   0.002
SMI         12/11/1990–13/12/2005    3797   −0.200   6.861   0.177   0.000
Dow Jones   02/01/1990–13/12/2005    4023   −0.223   7.548   0.227   0.000
Nasdaq      02/01/1990–13/12/2005    4023   −0.174   7.638   0.278   0.000
S&P500      02/01/1990–13/12/2005    4023   −0.103   6.767   0.504   0.008
Nikkei225   04/01/1990–13/12/2005    3926   0.038    5.098   0.691   0.332

Notes: Ŝ and K̂ are the empirical skewness and kurtosis coefficients for the observed series; columns BN05 and AS give the p-values for the Bai and Ng (2005) and the standard asymptotic tests.
Table 2. Conditional symmetry tests for index returns.

Series      Ŝ        K̂       BN05    BN01    AS      Sc       Kc     Stv      Ktv
CAC40       −0.366   5.336   0.083   1.042   0.000   −0.212   3.78   −0.288   3.71
DAX         −0.124   3.974   0.130   1.874   0.002   −0.219   3.99   −0.264   3.99
FTSE100     −0.208   3.941   0.008   2.608   0.000   −0.253   3.62   −0.269   3.60
MIB30       −0.419   4.237   0.003   2.430   0.000   −0.390   4.52   −0.354   3.40
SMI         −0.280   3.915   0.000   2.777   0.000   −0.365   4.17   −0.404   4.02
Dow Jones   −0.348   4.731   0.002   1.156   0.000   −0.239   4.59   −0.321   4.50
Nasdaq      −0.412   4.316   0.000   4.665   0.000   −0.407   4.02   −0.408   4.02
S&P500      −0.345   4.759   0.002   0.719   0.000   −0.237   4.54   −0.329   4.23
Nikkei225   −0.053   4.578   0.616   0.956   0.179   −0.089   4.56   0.000    4.55

Notes: Standardised residuals of ARMA-GARCH models have been used. Ŝ and K̂ are the empirical skewness and kurtosis coefficients for the standardised residuals; columns BN05 and AS give the p-values for the Bai and Ng (2005) and the standard asymptotic tests; column BN01 presents the value of the test statistic for the Bai and Ng (2001) test, to be compared with the critical values 2.21 and 2.78 at the 5% and 1% levels of significance. Sc, Kc, Stv and Ktv are the conditional skewness and kurtosis implied by the GARCHSK (Sc and Kc) and GARCHDSK (Stv and Ktv) models (in the latter case the average conditional skewness and kurtosis are shown).
We note that, although the conditional kurtosis coefficients are always smaller than the marginal ones, the same does not hold for skewness. As in the marginal case, the Nikkei225 index has skewness very close to zero.

As regards conditional skewness, the results on statistical significance are quite different from those for the marginal distributions and show more evidence of asymmetry. In particular, at the 5% significance level, the null hypothesis of symmetry is rejected in six cases out of nine by the BN05 test (in another case skewness is significant at the 9% level) and in four cases by the BN01 test. The Nikkei225 index is the only one for which all tests, even the asymptotic one, clearly agree on accepting symmetry. On the whole, the three tests lead to the same conclusions in five of the nine cases.

The presence of significant conditional skewness was further investigated by estimating GARCHDSK models, as defined in Section 2, assuming constant conditional skewness and kurtosis. This amounts to imposing $\alpha_\nu = \beta_\nu = \alpha_r = \beta_r = 0$, thus obtaining a subset of models that we denote by GARCHSK. The Kolmogorov–Smirnov goodness-of-fit test described below led us to the choice of $m = 10$ in the definition of $\bar a_t$, $\bar\nu_t$ and $\bar r_t$. The maximum likelihood parameter estimates and their t-statistics are given in Table 3. The t-statistics indicate that conditional skewness is statistically significant for all series except the Nikkei225. Table 2 lists the conditional skewness and kurtosis implied by the estimated models and shows that all indexes are negatively skewed, with the Nikkei225 having the smallest absolute coefficient.

The model introduced in Section 2 also allows us to investigate the presence of dynamic, rather than constant, conditional skewness and kurtosis. Table 4 lists the estimated parameters and shows that for seven indexes the parameter $\alpha_\nu$ is significant, implying that both skewness and kurtosis are time-varying. For all these models, the Ljung–Box test at lag 15 on the standardised squared residuals accepts the hypothesis of no residual correlation.

In order to check goodness of fit, we applied the Kolmogorov–Smirnov test to assess the uniformity of the values $\hat F(\hat\varepsilon_t \mid I_{t-1})$, $t = 1, \ldots, n$, where $F(\cdot \mid I_{t-1})$ denotes the c.d.f. corresponding to the density defined in (2.1), and $\hat F(\cdot \mid I_{t-1})$ is obtained by substituting the ML parameter estimates into the c.d.f. definition. Table 5 lists the test p-values for each series.
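A sketch of this goodness-of-fit check: the fitted conditional c.d.f. is evaluated at each residual and the resulting probability-integral-transform values are tested for uniformity. It assumes the `pearson4_logpdf` function from the earlier sketch and obtains the c.d.f. by numerical integration, which is slow but transparent; Heinrich (2004) gives closed-form alternatives.

```python
# Sketch: probability-integral-transform (PIT) goodness-of-fit check.
import numpy as np
from scipy.integrate import quad
from scipy.stats import kstest

def pearson4_cdf(x, lam, a, nu, r):
    val, _ = quad(lambda u: np.exp(pearson4_logpdf(u, lam, a, nu, r)),
                  -np.inf, x)
    return min(max(val, 0.0), 1.0)

def pit_ks_pvalue(resid, lam, a, nu, r):
    """All arguments are arrays over t = 1, ..., n; returns the KS p-value
    for uniformity of F_hat(eps_hat_t | I_{t-1})."""
    u = np.array([pearson4_cdf(e, l, ai, vi, ri)
                  for e, l, ai, vi, ri in zip(resid, lam, a, nu, r)])
    return kstest(u, 'uniform').pvalue
```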
Table 3. ML estimates and t-statistics for the GARCHSK model parameters.

            CAC40                DAX                  FTSE100
Parameter   Estimate   t-stat.   Estimate   t-stat.   Estimate   t-stat.
ωa          0.214      2.92      0.191      3.37      0.141      3.03
αa          0.075      7.60      0.101      9.66      0.074      8.39
βa          0.914      83.12     0.888      83.01     0.915      88.85
ων          1.801      2.65      1.386      2.89      3.138      3.44
ωr          11.460     5.43      9.602      6.05      14.390     5.73

            MIB30                SMI                  Dow Jones
Parameter   Estimate   t-stat.   Estimate   t-stat.   Estimate   t-stat.
ωa          0.112      2.49      0.163      3.32      0.052      3.78
αa          0.088      6.73      0.097      8.02      0.051      9.52
βa          0.899      68.67     0.886      64.70     0.938      170.60
ων          1.782      2.44      2.212      3.76      0.867      2.89
ωr          7.829      4.05      9.365      6.23      7.042      7.74

            Nasdaq               S&P500               Nikkei225
Parameter   Estimate   t-stat.   Estimate   t-stat.   Estimate   t-stat.
ωa          0.147      3.73      0.083      3.47      0.195      3.44
αa          0.085      12.84     0.098      9.31      0.081      8.58
βa          0.905      135.10    0.891      80.28     0.906      87.12
ων          3.389      3.88      0.891      2.82      0.307      1.21
ωr          11.090     5.87      7.168      7.05      6.859      7.56
The p-values for the other models, described in the next section, are also shown. Since the model parameters are estimated, the assumption of independence underlying the Kolmogorov–Smirnov test is violated. Therefore, the p-values should not be interpreted as an absolute measure of goodness of fit, but rather as a relative one. The results suggest that the GARCHDSK models compare favourably with the other models considered.

In summary, the results on the nine time series indicate that there is no strong evidence of unconditional skewness. Conditional skewness, on the other hand, appears to be more widespread. In particular, there are clear indications of dynamic skewness. Of course, these conclusions depend on the adoption of the Pearson IV conditional distribution and on the particular model used to impose the dynamic structure.

3.2. Measures of risk and economic significance

In the previous subsection we analysed skewness mainly from a statistical viewpoint. Here we study the economic and financial importance of skewness by analysing its role in risk modelling. With this purpose, the time-varying Value-at-Risk (VaR_t) was computed with GARCHDSK and some alternative models.

Market VaR measures how much the market value of a portfolio, of value P, is likely to decrease over a certain time period, under normal market conditions.
Table 4. ML estimates and t-statistics for the GARCHDSK model parameters.

            CAC40                DAX                  FTSE100
Parameter   Estimate   t-stat.   Estimate   t-stat.   Estimate   t-stat.
ωa          0.242      2.83      0.166      3.11      0.155      3.28
αa          0.062      7.97      0.087      9.22      0.063      9.21
βa          0.925      103.90    0.903      93.80     0.923      113.50
ων          4.069      2.52      2.151      3.04      4.656      3.85
αν          0.218      4.54      0.239      4.22      0.210      5.13
ωr          14.630     4.06      10.580     5.29      16.860     5.40

            MIB30                SMI                  Dow Jones
Parameter   Estimate   t-stat.   Estimate   t-stat.   Estimate   t-stat.
ωa          0.410      1.61      0.176      3.08      0.058      3.41
αa          0.072      6.85      0.085      7.73      0.054      8.80
βa          0.913      83.50     0.897      74.70     0.935      138.40
ων          23.590     1.21      3.918      3.22      1.658      3.00
αν          0.153      2.14      0.224      4.33      0.280      4.10
ωr          35.870     1.63      11.770     4.73      8.064      6.05

            Nasdaq               S&P500               Nikkei225
Parameter   Estimate   t-stat.   Estimate   t-stat.   Estimate   t-stat.
ωa          0.147      3.73      0.079      2.85      0.195      3.44
αa          0.084      12.85     0.076      9.17      0.081      8.58
βa          0.905      135.07    0.913      106.65    0.906      87.12
ων          3.389      3.88      2.296      1.92      0.306      1.21
αν          –          –         0.275      4.47      –          –
ωr          11.090     5.87      9.431      3.72      6.859      7.56
Given a holding period $h$ and a confidence level $1 - \alpha$, the VaR is a bound such that the loss over the holding period is less than this bound with probability $1 - \alpha$. Assuming that the portfolio value at time $t$ is $P_t$ and that the profits and losses over $h$ periods are represented by the log-returns $r_{t,h} = \log(P_t / P_{t-h})$, with distribution $F_h$, we have $\mathrm{VaR}_{h,\alpha} = -P_{t-h}\, F_h^{-1}(\alpha)$. When the distribution of $r_{t,h}$ is not constant over time, the VaR is also time-varying. In the following, we will use the typical holding period of one day and the 99% confidence level, and will assume $P = 1$ in a portfolio given by the index.

The time-varying VaR_t is computed using the GARCHDSK models estimated in the previous subsection, the Riskmetrics approach with the usual smoothing parameter $\lambda = 0.94$ (see Alexander, 2001), a Gaussian GARCH(1, 1) and a GARCH(1, 1) with Student's t innovations. The means of the estimated VaRs are given in Table 6. Table 6 also shows the observed in-sample significance level $\hat\alpha$ when the nominal level is 0.01. To compare nominal and observed levels, the two-sided Kupiec test of the null $H_0: \alpha = 0.01$ was conducted (Kupiec, 1995). Then, to evaluate independence and, jointly, independence and coverage of the VaR violations, the EACD test by Christoffersen and Pelletier (2004) and the test by Christoffersen (1998) were also applied.
Table 5. p-values for the Kolmogorov–Smirnov goodness-of-fit test.

Series      Riskmetrics   GARCH-N   GARCH-t   GARCHDSK
CAC40       0.427         0.044     0.450     0.090
DAX         0.183         0.003     0.339     0.182
FTSE100     0.033         0.049     0.022     0.104
MIB30       0.002         <0.001    0.006     0.415
SMI         0.056         0.002     0.040     0.117
Dow Jones   0.028         <0.001    0.364     0.243
Nasdaq      <0.001        <0.001    <0.001    0.295
S&P500      0.011         <0.001    0.297     0.090
Nikkei225   0.174         0.005     0.768     0.658
Table 6. Mean value-at-risk and in-sample observed levels.

Series      VaR^R   VaR^N   VaR^t   VaR^DSK   α̂^R        α̂^N        α̂^t        α̂^DSK
CAC40       2.90    2.94    3.14    3.37      0.008∗◦†   0.014†     0.011∗◦†   0.008∗†
DAX         2.96    3.02    3.24    3.43      0.008∗◦†   0.013∗◦†   0.010∗◦†   0.009∗◦†
FTSE100     2.16    2.19    2.33    2.46      0.007◦†    0.013†     0.010∗◦†   0.008∗◦†
MIB30       2.72    2.77    3.04    3.18      0.012∗◦†   0.020†     0.013∗◦†   0.010∗◦†
SMI         2.38    2.41    2.58    2.86      0.009∗◦†   0.016†     0.011∗◦†   0.008∗†
Dow Jones   2.15    2.20    2.39    2.55      0.008∗†    0.013∗◦†   0.009∗◦†   0.008∗◦†
Nasdaq      3.08    3.09    3.28    3.62      0.010∗     0.014◦     0.012∗◦†   0.008∗◦
S&P500      2.17    2.21    2.39    2.59      0.008∗†    0.014†     0.009∗◦†   0.008∗◦†
Nikkei225   3.26    3.32    3.60    3.63      0.009∗†    0.015†     0.009∗◦†   0.009∗◦†

Notes: The nominal level considered is 0.01. The asterisk (∗), circle (◦) and dagger (†) indicate that the null hypothesis is accepted (at the 5% level) using, respectively, the Kupiec, the Christoffersen and the Christoffersen–Pelletier test. The models considered are: Riskmetrics, Normal GARCH(1, 1), Student's t GARCH(1, 1) and GARCHDSK.
Asterisks, circles and daggers in Table 6 indicate that the null hypotheses of the Kupiec (1995) test, the Christoffersen (1998) test and the EACD test by Christoffersen and Pelletier (2004), respectively, are accepted at the 5% significance level.

The Kupiec test shows that the coverages are generally correct. A notable exception is the Gaussian GARCH model, which yields observed levels often significantly greater than the nominal one. For this reason we will not comment further on the results for this model. When independence of the VaR violations is also tested (Christoffersen, 1998), we see that the null hypothesis is rejected in 4, 6, 0 and 2 cases for the Riskmetrics, GARCH-N, GARCH-t and GARCHDSK models, respectively. For the models and time series considered, the null hypothesis is almost always accepted when the EACD test by Christoffersen and Pelletier (2004) is performed.

On the whole, the Riskmetrics model leads to the smallest mean values of VaR, followed by the GARCH-t and GARCHDSK models. However, it is interesting to note that $\mathrm{VaR}_t^{DSK}$ is not always (i.e. for every $t$) greater than $\mathrm{VaR}_t^{R}$: for example, $\mathrm{VaR}_t^{DSK} < \mathrm{VaR}_t^{R}$ approximately 17% of the time for the MIB30, 10% of the time for the CAC40, 13% of the time for the DAX and 15% of the time for the FTSE100.
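A minimal sketch of the Kupiec (1995) unconditional coverage test used in Table 6; the function name and the simulated example are ours.

```python
# Sketch: Kupiec (1995) likelihood-ratio test of H0: violation prob = alpha.
import numpy as np
from scipy.stats import chi2

def kupiec_test(violations, alpha=0.01):
    """violations: boolean array, True when the loss exceeds VaR_t."""
    n = len(violations)
    x = int(np.sum(violations))
    pi_hat = x / n
    ll0 = (n - x) * np.log(1.0 - alpha) + x * np.log(alpha)
    ll1 = ((n - x) * np.log(1.0 - pi_hat) + x * np.log(pi_hat)
           if 0 < x < n else 0.0)          # unrestricted log-likelihood
    lr = -2.0 * (ll0 - ll1)
    return lr, chi2.sf(lr, df=1)           # LR ~ chi-square(1) under H0

# Example: roughly 1.1% violations over 4000 days -> H0 not rejected.
rng = np.random.default_rng(1)
print(kupiec_test(rng.random(4000) < 0.011, alpha=0.01))
```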
Table 7. Basel Accord penalty factor K.

No. of violations   ≤4     5      6      7      8      9      ≥10
Penalty factor K    3.00   3.40   3.50   3.65   3.75   3.85   4.00

Note: The number of violations refers to the last 250 business days.
Table 8. Number of VaR violations in the last 250 days, considering each time t.

Index       Model         ≤4     5     6     7     ≥8
CAC40       Riskmetrics   3540   170   7     0     0
            GARCH-N       2581   555   368   172   41
            GARCH-t       3299   374   44    0     0
            GARCHDSK      3717   0     0     0     0
DAX         Riskmetrics   3484   48    0     0     0
            GARCH-N       2897   372   96    157   10
            GARCH-t       3208   173   141   10    0
            GARCHDSK      3435   97    0     0     0
FTSE100     Riskmetrics   4976   156   91    0     0
            GARCH-N       3908   333   362   312   308
            GARCH-t       4397   460   165   110   91
            GARCHDSK      4884   238   98    3     0
MIB30       Riskmetrics   1027   138   102   20    0
            GARCH-N       549    370   183   164   21
            GARCH-t       1027   138   102   20    0
            GARCHDSK      1287   0     0     0     0
SMI         Riskmetrics   3129   299   109   0     0
            GARCH-N       1999   575   523   343   97
            GARCH-t       2985   441   111   0     0
            GARCHDSK      3537   0     0     0     0
Dow Jones   Riskmetrics   3657   106   0     0     0
            GARCH-N       2774   503   262   145   79
            GARCH-t       3514   211   38    0     0
            GARCHDSK      3684   79    0     0     0
Nasdaq      Riskmetrics   2989   503   174   97    0
            GARCH-N       2195   723   442   331   72
            GARCH-t       2794   552   315   70    32
            GARCHDSK      3587   68    108   0     0
S&P500      Riskmetrics   3374   215   174   0     0
            GARCH-N       2484   470   451   320   38
            GARCH-t       3118   322   285   38    0
            GARCHDSK      3471   283   9     0     0
Nikkei225   Riskmetrics   3609   57    0     0     0
            GARCH-N       2780   407   355   124   0
            GARCH-t       3519   90    57    0     0
            GARCHDSK      3609   57    0     0     0
Value-at-Risk is also connected to the Market Risk Capital Requirements (MRCR) adopted in 1995 by the Basel Committee on Banking Supervision (1995, 1996). The Basel Accord sets minimum capital requirements which must be met by banks to face market risks. It carefully examines how banks' VaR measures can be converted into capital requirements that appropriately reflect the prudential concerns of supervisors. Finally, it defines the MRCR as a function of past VaRs and of their violations. In particular, the Accord establishes that MRCR_t is expressed as

$$\mathrm{MRCR}_t = \max\left\{ \mathrm{VaR}_{t-1};\ K \frac{1}{60} \sum_{i=1}^{60} \mathrm{VaR}_{t-i} \right\},$$

with $K$ a penalty factor depending on the number of VaR violations in the previous 250 business days, as described in Table 7.
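A short sketch combining the MRCR formula with the penalty factors of Table 7; the helper names are ours.

```python
# Sketch: Basel market risk capital requirement from the formula above.
import numpy as np

def penalty_factor(n_violations):
    """Penalty K as a function of VaR violations over the last 250 days."""
    if n_violations <= 4:
        return 3.00
    step = {5: 3.40, 6: 3.50, 7: 3.65, 8: 3.75, 9: 3.85}
    return step.get(n_violations, 4.00)     # 10 or more -> 4.00

def mrcr(var_series, violations, t):
    """MRCR_t = max{VaR_{t-1}; K * (1/60) * sum_{i=1..60} VaR_{t-i}}."""
    k = penalty_factor(int(np.sum(violations[t - 250:t])))
    return max(var_series[t - 1], k * np.mean(var_series[t - 60:t]))
```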
In this framework, another way to evaluate a model (when the implied coverage levels are correct) is to analyse the series of the numbers of VaR violations in the last 250 days, since these are directly connected to the MRCR. Table 8 lists the number of VaR violations in the last 250 days, considering each time $t$. As expected, the Gaussian GARCH model leads to the greatest number of violations. Despite the Gaussian assumption, the Riskmetrics approach gives results that outperform those of the GARCH-t model. However, the best results are clearly those of the GARCHDSK model, which in most cases gives the smallest numbers of violations greater than four. In particular, from this point of view the GARCHDSK model always performs better than the GARCH-t model. This is true both for series having dynamic skewness and kurtosis and for indexes with constant conditional skewness and kurtosis, such as the Nasdaq index. When there are no findings of skewness, e.g. in the Nikkei225 case, the Riskmetrics and GARCHDSK models give the same small number of days on which the violations exceed four.
4. CONCLUSIONS

This paper has focused on the issue of empirical evidence of asymmetry in time series of financial returns. Nine series of daily stock index returns have been analysed, in order to assess whether skewness can be considered a stylised fact for real data. We studied both unconditional and conditional skewness, by means of tests and models. In particular, we proposed a new GARCH-type model, the GARCHDSK model, which takes into account both skewness and kurtosis. A characteristic feature of this approach is that skewness and kurtosis are allowed to evolve dynamically. This is done by assuming Pearson's Type IV errors and defining suitable dynamics for the distribution parameters. The dynamic structure depends on moment-based estimators.

Our results indicate that for the series considered there is no strong evidence of unconditional asymmetry. Different conclusions are drawn with respect to conditional skewness, which was found to be significantly present in eight of the nine stock index return series analysed. In particular, in seven of the eight cases we found significant time-varying skewness and kurtosis. These findings are consistent with those of studies by, among others, Jensen and Lunde (2001), Brooks et al. (2005), León et al. (2005), Cappuccio et al. (2006) and Lanne and Saikkonen (2007), who also found significant (constant or time-varying) conditional skewness. However, the results are not unequivocal: it should be noted, for example, that in an application to daily returns of the NYSE composite index, Brännäs and Nordman (2003b) found contrasting results concerning the need to model time-varying conditional skewness. We also remark that Wilhelmsson (2006), who adopted a model different from that used here for representing the dynamic behaviour of higher moments, found that allowing skewness and kurtosis to be time-varying does not significantly improve the out-of-sample forecasting performance.

In order to investigate the economic importance of modelling skewness, we compared different models with respect to the VaR and the MRCR defined by the Basel Accord. These analyses confirm that skewness is important not only from a statistical point of view, but also from a financial perspective, particularly in risk management.
ACKNOWLEDGMENTS

The authors would like to thank the editor and an anonymous referee for insightful comments and suggestions, which helped improve an earlier version of this work.
REFERENCES

Alexander, C. (2001). Market Models. New York: Wiley.
Bai, J. and S. Ng (2001). A consistent test for conditional symmetry in time series models. Journal of Econometrics 103, 225–58.
Bai, J. and S. Ng (2005). Tests for skewness, kurtosis and normality for time series data. Journal of Business and Economic Statistics 23, 49–60.
Basel Committee on Banking Supervision (1995). An internal model-based approach to market risk capital requirements. Bank for International Settlements, Basel, Switzerland.
Basel Committee on Banking Supervision (1996). Supervisory framework for the use of 'backtesting' in conjunction with the internal models approach to market risk capital requirements. Bank for International Settlements, Basel, Switzerland.
Belaire-Franch, J. and A. Peiró (2003). Conditional and unconditional asymmetry in U.S. macroeconomic time series. Studies in Nonlinear Dynamics and Econometrics 7, Issue 1.
Black, F. (1976). Studies of stock price volatility changes. Proceedings of the American Statistical Association, Business and Economic Statistics Section, 177–81.
Blanchard, O. J. and M. W. Watson (1982). Bubbles, rational expectations and financial markets. In P. Watchel (Ed.), Crises in Economic and Financial Structure, 295–315. Lexington, MA: Lexington Books.
Brännäs, K. and N. Nordman (2003a). An alternative conditional asymmetry specification for stock returns. Applied Financial Economics 13, 537–41.
Brännäs, K. and N. Nordman (2003b). Conditional skewness modelling for stock returns. Applied Economics Letters 10, 725–28.
Brooks, C., S. P. Burke, S. Heravi and G. Persand (2005). Autoregressive conditional kurtosis. Journal of Financial Econometrics 3, 399–421.
Campbell, J. Y. and L. Hentschel (1992). No news is good news: an asymmetric model of changing volatility in stock returns. Journal of Financial Economics 31, 281–318.
Cappuccio, N., D. Lubian and D. Raggi (2006). Investigating asymmetry in US stock market indexes: evidence from a stochastic volatility model. Applied Financial Economics 16, 479–90.
Chen, J., H. Hong and J. C. Stein (2001). Forecasting crashes: trading volume, past returns and conditional skewness in stock prices. Journal of Financial Economics 61, 345–81.
Christie, A. A. (1982). The stochastic behavior of common stock variances: value, leverage and interest rate effects. Journal of Financial Economics 10, 407–32.
Christoffersen, P. F. (1998). Evaluating interval forecasts. International Economic Review 39, 841–62.
Christoffersen, P. and D. Pelletier (2004). Backtesting Value-at-Risk: a duration-based approach. Journal of Financial Econometrics 2, 84–108.
Chunhachinda, P., K. Dandapani, S. Hamid and A. J. Prakash (1997). Portfolio selection and skewness: evidence from international stock markets. Journal of Banking and Finance 21, 143–67.
Cont, R. (2001). Empirical properties of asset returns: stylized facts and statistical issues. Quantitative Finance 1, 223–36.
Corrado, C. J. and T. Su (1997). Implied volatility skews and stock returns skewness and kurtosis implied by stock option prices. European Journal of Finance 3, 73–85.
Engle, R. F. and A. Patton (2001). What good is a volatility model? Quantitative Finance 1, 237–45.
Fung, W. and D. Hsieh (2001). The risk in hedge fund strategies: theory and evidence from trend followers. Review of Financial Studies 14, 313–41.
Hansen, B. E. (1994). Autoregressive conditional density estimation. International Economic Review 35, 705–30.
Harvey, C. R. and A. Siddique (1999). Autoregressive conditional skewness. Journal of Financial and Quantitative Analysis 34, 465–87.
Heinrich, J. (2004). A guide to the Pearson type IV distribution. Working Paper, University of Pennsylvania.
Hong, H. and J. C. Stein (2003). Differences of opinion, short-sales constraints, and market crashes. Review of Financial Studies 16, 487–525.
Hueng, C. J. and J. B. McDonald (2005). Forecasting asymmetries in aggregate stock market returns: evidence from conditional skewness. Journal of Empirical Finance 12, 666–85.
Jensen, M. B. and A. Lunde (2001). The NIG-S&ARCH model: a fat-tailed, stochastic and autoregressive conditional heteroskedastic volatility model. Econometrics Journal 4, 319–42.
Jondeau, E. and M. Rockinger (2001). Gram–Charlier densities. Journal of Economic Dynamics and Control 25, 1457–83.
Jondeau, E. and M. Rockinger (2003). Conditional volatility, skewness, and kurtosis: existence, persistence and comovements. Journal of Economic Dynamics and Control 27, 1699–737.
Jondeau, E. and M. Rockinger (2004). Optimal portfolio allocation under higher moments. Note d'étude et recherche no. 108, Banque de France.
Kim, T. H. and A. White (2004). On more robust estimation of skewness and kurtosis: simulation and application to the S&P500 index. Finance Research Letters 1, 56–70.
Kupiec, P. H. (1995). Techniques for verifying the accuracy of risk measurement models. Journal of Derivatives 3, 73–84.
Lanne, M. and P. Saikkonen (2007). Modeling conditional skewness in stock returns. European Journal of Finance 13, 691–704.
León, A., G. Rubio and G. Serna (2005). Autoregressive conditional volatility, skewness and kurtosis. Quarterly Review of Economics and Finance 45, 599–618.
Lisi, F. (2007). Testing asymmetry in financial time series. Quantitative Finance 7, 687–96.
Mitchell, M. and T. Pulvino (2001). Characteristics of risk and return in risk arbitrage. Journal of Finance 6, 2135–75.
Newey, W. K. and D. G. Steigerwald (1997). Asymptotic bias for quasi-maximum likelihood estimators in conditionally heteroskedastic models. Econometrica 65, 587–99.
Peiró, D. A. (2004). Asymmetries and tails in stock index returns: are their distributions really asymmetric? Quantitative Finance 4, 37–44.
Premaratne, G. and A. K. Bera (2001). Modeling asymmetry and excess kurtosis in stock return data. Working paper 01-0118, College of Business, University of Illinois at Urbana-Champaign.
Premaratne, G. and A. K. Bera (2005). A test for symmetry with leptokurtic financial data. Journal of Financial Econometrics 3, 169–87.
Rosenberg, J. V. and T. Schuermann (2006). A general approach to integrated risk management with skewed, fat-tailed risks. Journal of Financial Economics 79, 569–614.
Silvennoinen, A., T. Terasvirta and C. He (2005). Unconditional skewness from asymmetry in the conditional mean and variance. Working Paper, Department of Economic Statistics, Stockholm School of Economics, Stockholm.
Stuart, A. and K. Ord (1994). Kendall's Advanced Theory of Statistics. New York: Oxford University Press.
Wilhelmsson, A. (2006). GARCH forecasting performance under different distribution assumptions. Journal of Forecasting 25, 561–78.
Yan, J. (2005). Asymmetry, fat-tail, and autoregressive conditional density in financial return data with systems of frequency curves. Working paper 355, Department of Statistics and Actuarial Science, University of Iowa.
The Econometrics Journal (2009), volume 12, pp. 324–339. doi: 10.1111/j.1368-423X.2009.00283.x

Bayesian estimation of a random effects heteroscedastic probit model

Yuanyuan Gu†, Denzil G. Fiebig†, Edward Cripps‡ and Robert Kohn†

† School of Economics, University of New South Wales, Sydney, NSW 2052, Australia
E-mail: [email protected]

‡ School of Mathematics and Statistics, University of Western Australia, Crawley, WA 6009, Australia
E-mail: [email protected]

First version received: May 2008; final version accepted: January 2009
Summary Bayesian analysis is given of a random effects binary probit model that allows for heteroscedasticity. Real and simulated examples illustrate the approach and show that ignoring heteroscedasticity when it exists may lead to biased estimates and poor prediction. The computation is carried out by an efficient Markov chain Monte Carlo sampling scheme that generates the parameters in blocks. We use the Bayes factor, cross-validation of the predictive density, the deviance information criterion and Receiver Operating Characteristic (ROC) curves for model comparison. Keywords: Bayes factor, Cross-validation, Deviance information criterion, Marginal effects, Marginal likelihood, Markov chain Monte Carlo, ROC curve.
1. INTRODUCTION

This paper presents a Bayesian framework for estimation and inference in a heteroscedastic probit model with random effects. The random effects probit (REP) model proposed by Heckman and Willis (1975), which is popular in empirical research, extends the standard probit model by adding a random intercept which takes into account unobserved heterogeneity due to groups or panels. Its application falls into two categories of panel data. The first is data where each panel relates to structural group effects, in which case the individuals from a specific group, such as a family or regional location, share a common component in the specification of the conditional mean; Borjas and Sueyoshi (1994) called this the probit model with structural group effects and argued that it is slightly different from the random effects probit model of Heckman and Willis (1975) because its panel size may be very large. The second is data where each panel consists of repeated measures on an individual over time. In both cases, the 'common component' of each panel is specified in the mean function as a normal random intercept with zero mean and constant variance. This random intercept is usually referred to as the random effects error term, and the other error component in the model is called the remainder error term.
Although it is well known that fitting a homoscedastic probit regression model when the data-generating process is heteroscedastic may result in biased and inconsistent parameter estimates (Yatchew and Griliches, 1985), applied researchers rarely check the assumption of homoscedasticity when fitting the REP model, possibly because of the difficulty of estimating a random effects probit model with heteroscedasticity. As there are two random terms in the random effects model, the question arises of whether to include the heteroscedasticity in the random effects, in the errors, or in both. In linear regression with random effects, the literature considers all three possibilities. For example, Li and Stengos (1994) derived an adaptive estimator when heteroscedasticity is modelled in the remainder error term. Roy (2002) adopted a similar strategy when only the random effects are assumed heteroscedastic, and Randolph (1988) gave a more general heteroscedastic model where both errors are heteroscedastic (see Baltagi, 2005, for a review of this literature).

For the heteroscedastic random effects probit (REHET) model proposed in this paper, the heteroscedasticity is modelled in the remainder error term, similarly to Li and Stengos (1994). In the random effects linear regression context, Baltagi et al. (2005) checked the sensitivity of the estimators proposed by Li and Stengos (1994) and Roy (2002) under misspecification of the form of heteroscedasticity and concluded that the former may be preferred to the latter when the researcher does not know the source of the heteroscedasticity. Although this strategy risks misspecifying the model, we consider it an initial step in this research area. In the REHET model we use the multiplicative form of heteroscedasticity proposed by Harvey (1976), which is arguably the most popular heteroscedastic model in both linear and probit regression (e.g. Greene, 2003).

To estimate the REP or REHET models using maximum likelihood it is necessary to use numerical integration methods such as quadrature (Butler and Moffitt, 1982) or simulated maximum likelihood (Keane, 1993). Our article uses a Bayesian approach to estimate the REP and REHET models. Instead of integrating over the normal distribution of the random effects, the Bayesian analysis samples from posterior distributions. An efficient Markov chain Monte Carlo sampling scheme is provided in this paper to carry out the computation. The REP and REHET models are fitted to a real data set and compared using the Bayes factor, cross-validation of the predictive density and the deviance information criterion (DIC) introduced by Spiegelhalter et al. (2002). We also evaluate the in-sample predictive performance of each model using the Receiver Operating Characteristic (ROC) curve; see, for example, Hosmer and Lemeshow (2000, p. 160). Finally, we carry out a simulation study to determine the robustness of the REHET model when heteroscedasticity is present and when it is not.
2. THE MODEL

2.1. Random effects probit

We start with the latent variable specification of the standard random effects probit model, in which the random effects result from group-specific error terms. That is,

$$y_{it}^* = x_{it}\beta + \mu_i + \nu_{it},$$

for groups $i = 1, \ldots, N$ and individuals $t = 1, \ldots, T_i$ in group $i$; $x_{it}$ is a $1 \times k$ vector of explanatory variables and $\beta$ is a $k \times 1$ vector of regression coefficients.
The random effects $\mu_i$ are assumed to be independent of the covariates $x_{it}$ and of the remainder error term $\nu_{it}$. The latent variable $y_{it}^*$ is related to the binary outcome $y_{it}$ by

$$y_{it} = \begin{cases} 1 & \text{if } y_{it}^* > 0, \\ 0 & \text{otherwise.} \end{cases}$$

We assume that both $\mu_i$ and $\nu_{it}$ are independent, identically distributed and normal, i.e. $\mu_i \sim N(0, \sigma_\mu^2)$ and $\nu_{it} \sim N(0, \sigma_\nu^2)$. For the standard random effects probit, we set $\sigma_\nu^2 = 1$ without loss of generality.

2.2. Random effects heteroscedastic probit

We now extend the REP model to include heteroscedasticity in the remainder error term $\nu_{it}$:

$$\nu_{it} \sim N(0, \sigma_{it}^2), \qquad \sigma_{it}^2 = \exp(\delta_0)\exp(w_{it}\delta) = \sigma^2 \exp(w_{it}\delta),$$

where $w_{it}$ is a vector of covariates which explain the heterogeneity and $\delta$ is an $m \times 1$ vector of coefficients. The $w_{it}$ will often be the same as $x_{it}$. To ensure the model is identified we set $\sigma^2 = 1$, or equivalently $\delta_0 = 0$, and omit an intercept from $w_{it}$.

To set up a convenient notation for the Bayesian analysis in the next section we write the REHET model as

$$Y^* = Z\alpha + \nu, \tag{2.1}$$

where $Y^* = (y_{11}^*, \ldots, y_{1T_1}^*, \ldots, y_{N1}^*, \ldots, y_{NT_N}^*)'$ is a $(\sum_{i=1}^N T_i) \times 1$ vector of latent variable responses. Let $1_{T_i}$ denote a $T_i \times 1$ vector of ones and let $X_i = (x_{i1}', \ldots, x_{iT_i}')'$. Then

$$Z = \begin{bmatrix} 1_{T_1} & 0 & 0 & X_1 \\ 0 & \ddots & 0 & \vdots \\ 0 & 0 & 1_{T_N} & X_N \end{bmatrix},$$

which is a $(\sum_{i=1}^N T_i) \times (N + k)$ matrix. For convenience, write $Z = [J\ X]$, where $X = (X_1', \ldots, X_N')'$ and $J$ is defined conformally. Let $\alpha = (U', \beta')'$, where $U = (\mu_1, \mu_2, \ldots, \mu_N)'$, so that $\alpha$ is an $(N + k) \times 1$ vector. The term $\nu$ is a $(\sum_{i=1}^N T_i) \times 1$ error vector with mean zero and a $(\sum_{i=1}^N T_i) \times (\sum_{i=1}^N T_i)$ diagonal covariance matrix $\Sigma$ with diagonal

$$\sigma^2 = \left(\sigma_{11}^2, \ldots, \sigma_{1T_1}^2, \ldots, \sigma_{N1}^2, \ldots, \sigma_{NT_N}^2\right)',$$

where $\log(\sigma^2) = W\delta$, $W = (W_1', \ldots, W_N')'$ and $W_i = (w_{i1}', \ldots, w_{iT_i}')'$.
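As an illustration of this data-generating process, the following sketch simulates a balanced REHET panel; all dimensions, parameter values and function names are ours and purely illustrative.

```python
# Sketch: simulating one draw from the REHET model (2.1), balanced panel
# (T_i = T for all i), with w_it taken equal to x_it.
import numpy as np

def simulate_rehet(N, T, beta, delta, sigma_mu, rng):
    k = len(beta)
    x = rng.normal(size=(N, T, k))
    mu = rng.normal(scale=sigma_mu, size=N)       # random effects
    log_s2 = x @ delta                            # log sigma_it^2 = w_it delta
    ystar = (x @ beta + mu[:, None]
             + rng.normal(size=(N, T)) * np.exp(0.5 * log_s2))
    return x, (ystar > 0).astype(int)

rng = np.random.default_rng(2)
x, y = simulate_rehet(N=79, T=32, beta=np.array([0.5, -0.3]),
                      delta=np.array([0.2, -1.0]), sigma_mu=0.65, rng=rng)
```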
3. BAYESIAN ANALYSIS

3.1. Prior specification

Given the model specified in Section 2.2, we now discuss the priors for the parameters. Following Chan et al. (2006), let

$$D(\delta) = \mathrm{diag}\left(\exp(w_{11}\delta/2), \ldots, \exp(w_{1T_1}\delta/2), \ldots, \exp(w_{N1}\delta/2), \ldots, \exp(w_{NT_N}\delta/2)\right) \tag{3.1}$$

and define $\tilde Y^* = D(\delta)^{-1} Y^*$ and $\tilde Z = D(\delta)^{-1} Z$, such that $\tilde Z = [\tilde J\ \tilde X]$, where $\tilde J = D(\delta)^{-1} J$ and $\tilde X = D(\delta)^{-1} X$. Then (2.1) can be written as

$$\tilde Y^* = \tilde Z \alpha + \zeta, \qquad \zeta \sim N(0, I).$$

For a given value of $\delta$, the prior for $\beta$ is

$$\beta \mid \delta \sim N\left(0, c_\beta (\tilde X' \tilde X)^{-1}\right). \tag{3.2}$$

The scale hyperparameter $c_\beta$ is taken as a suitably large positive integer, which in our work we take as the total number of observations $\sum_{i=1}^N T_i$. In the Gaussian case the prior (3.2) is known as Zellner's g-prior (Zellner, 1986). However, it is also sensible to use this prior in the binary random effects probit case because it takes into account the scale of the covariates $X$ and $W$ while allowing some flexibility by estimating the hyperparameter $c_\beta$. For a given value of $\sigma_\mu^2$, the prior for $U$ is $U \mid \sigma_\mu^2 \sim N(0, \sigma_\mu^2 I_N)$. The prior for $\alpha$, given $\sigma_\mu^2$ and $\delta$, is $\alpha \mid \sigma_\mu^2, \delta \sim N(0, V_\alpha)$, where

$$V_\alpha = \begin{bmatrix} \sigma_\mu^2 I_N & 0 \\ 0 & c_\beta (\tilde X' \tilde X)^{-1} \end{bmatrix},$$

which is an $(N + k) \times (N + k)$ matrix.

We now consider the parameters in the variance function. The prior for $\delta$ is $\delta \sim N(0, c_\delta I_m)$, where $c_\delta$ is the scale hyperparameter. Finally, we choose the same inverse gamma prior IG(a, b) for both of the scale parameters $c_\delta$ and $\sigma_\mu^2$, with $a = 1 + 10^{-10}$ and $b = 1 + 10^{-5}$, which is proper but highly non-informative.
3.2. Sampling scheme

This section describes an MCMC scheme to generate samples from the posterior distribution of the parameters and latent variables in the REHET model; a schematic implementation is sketched after this list. The modifications necessary for the REP model are straightforward. The sampling scheme uses the following steps.

(1) Initialize $\alpha$, $\delta$, $\sigma_\mu^2$ and $c_\delta$.

(2) We apply the approach of Albert and Chib (1993) to generate the latent variables $Y^*$ from truncated normal distributions. For $i = 1, \ldots, N$ and $t = 1, \ldots, T_i$, $y_{it}^*$ is generated from its full conditional density, $N(\mu_i + x_{it}\beta, \sigma_{it}^2)$, with support $[0, \infty)$ if $y_{it} = 1$ and with support $(-\infty, 0)$ if $y_{it} = 0$, where $\sigma_{it}^2 = \exp(w_{it}\delta)$.

(3) Generate $\alpha$ as a block. $U$ and $\beta$ are generated jointly through $\alpha$ from its full conditional density,

$$N\left((\tilde Z'\tilde Z + V_\alpha^{-1})^{-1} \tilde Z' \tilde Y^*,\ (\tilde Z'\tilde Z + V_\alpha^{-1})^{-1}\right),$$

using the Cholesky decomposition to factor $\Omega = \tilde Z'\tilde Z + V_\alpha^{-1}$. We write

$$\Omega = \begin{bmatrix} A & B \\ B' & C \end{bmatrix}$$

and note that the $N \times N$ matrix $A$ is diagonal. This means that the factorization can be performed efficiently, which is important when $N$ is large.

(4) Generate a proposal $\delta^p$ from a Student-t approximation to $p(\delta \mid Y^*, \alpha, c_\delta, \sigma_\mu^2)$. The proposed value $\delta^p$ is then accepted or rejected according to the Metropolis–Hastings rule. We use the Metropolis–Hastings method to generate $\delta$ because the full conditional density of $\delta$ is not standard. Our approach is based on a one-step Fisher scoring approximation suggested by Gamerman (1997), as adapted for variance function estimation by Chan et al. (2006). It avoids the cumbersome process of iterating to the mode at each iteration, but usually at the cost of slightly higher autocorrelation among the iterates. That is, using the one-step Fisher scoring method we may need more iterations to achieve convergence. However, in our case just one step may get most of the way to the mode and provide a good proposal at smaller computational expense. For the Papsmear data analysed below, the acceptance rate was about 90% for the Metropolis–Hastings step. We now explain the steps in the algorithm.

(a) For $i = 1, \ldots, N$ and $t = 1, \ldots, T_i$, calculate $e_{it} = (y_{it}^* - \mu_i - x_{it}\beta)^2$.

(b) Denote by $\sigma_{it}^{2c}$ and $\delta^c$ the current values of $\sigma_{it}^2$ and $\delta$. Calculate

$$\eta_{it}^c = w_{it}\delta^c + \frac{e_{it} - \sigma_{it}^{2c}}{\sigma_{it}^{2c}}$$

and write $\eta^c = (\eta_{11}^c, \ldots, \eta_{1T_1}^c, \ldots, \eta_{N1}^c, \ldots, \eta_{NT_N}^c)'$.

(c) Calculate the mean and variance of a proposal density which is multivariate t with four degrees of freedom. That is, $\hat\delta^p = \left(\frac{2}{c_\delta} I + W'W\right)^{-1} W'\eta^c$ and $\Sigma_p = \left(\frac{1}{c_\delta} I + \frac{1}{2} W'W\right)^{-1}$. The proposed value of $\delta$ is generated from a multivariate t-density with four degrees of freedom, $T_4(\hat\delta^p, \Sigma_p)$; denote this proposal density by $Q(\delta^c \to \delta^p)$. Write $\hat\delta^c$ and $\Sigma_c$ for the proposal mean and covariance matrix when taking a step in the reverse direction, and $Q(\delta^p \to \delta^c)$ for the corresponding multivariate t-density with four degrees of freedom, $T_4(\hat\delta^c, \Sigma_c)$. We accept the proposal $\delta^p$ with probability

$$\min\left\{1,\ \frac{p\left(Y^* \mid \beta, U, \delta^p, \sigma_\mu^2, c_\delta\right) \exp\left(-\frac{1}{2c_\delta}\delta^{p\prime}\delta^p\right) Q(\delta^p \to \delta^c)}{p\left(Y^* \mid \beta, U, \delta^c, \sigma_\mu^2, c_\delta\right) \exp\left(-\frac{1}{2c_\delta}\delta^{c\prime}\delta^c\right) Q(\delta^c \to \delta^p)}\right\}.$$

(5) Generate $\sigma_\mu^2$ from its full conditional distribution, which is the inverse gamma distribution $\mathrm{IG}\left(\frac{N + 2a}{2}, \frac{\sum_{i=1}^N \mu_i^2 + 2b}{2}\right)$.

(6) Generate $c_\delta$ from its full conditional distribution, which is the inverse gamma distribution $\mathrm{IG}\left(\frac{m}{2} + a, \frac{\delta'\delta}{2} + b\right)$.
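The sketch below implements steps (2), (3) and (5) of the scheme for a stacked panel, drawing $\alpha$ from its Gaussian full conditional via a Cholesky factor; the Metropolis–Hastings update of $\delta$ (step 4) is omitted for brevity, and all names and signatures are our own.

```python
# Sketch: one sweep of steps (2), (3) and (5). y and sig2_it are length-n
# arrays over all (i, t); Z is the n x (N + k) design of Section 2.2;
# Va_inv is the inverse of the prior covariance V_alpha; a, b are the
# inverse gamma prior parameters. Illustrative only.
import numpy as np
from scipy.stats import truncnorm, invgamma

def one_sweep(y, Z, N, alpha, sig2_it, Va_inv, a, b, rng):
    # (2) Albert-Chib: y*_it | rest ~ N(Z alpha, sig2_it), truncated to
    # [0, inf) when y_it = 1 and to (-inf, 0) when y_it = 0.
    mean = Z @ alpha
    sd = np.sqrt(sig2_it)
    lo = np.where(y == 1, -mean / sd, -np.inf)
    hi = np.where(y == 1, np.inf, -mean / sd)
    ystar = truncnorm.rvs(lo, hi, loc=mean, scale=sd, random_state=rng)
    # (3) alpha = (U', beta')' in one Gaussian block, using the
    # standardised quantities Z~ and Y~* of Section 3.1.
    Zt = Z / sd[:, None]
    prec = Zt.T @ Zt + Va_inv              # Omega = Z~'Z~ + V_alpha^{-1}
    L = np.linalg.cholesky(prec)
    m = np.linalg.solve(prec, Zt.T @ (ystar / sd))
    alpha = m + np.linalg.solve(L.T, rng.standard_normal(len(m)))
    # (5) sigma_mu^2 | U ~ IG((N + 2a)/2, (sum_i mu_i^2 + 2b)/2)
    U = alpha[:N]
    sig2_mu = invgamma.rvs((N + 2 * a) / 2, scale=(U @ U + 2 * b) / 2,
                           random_state=rng)
    return ystar, alpha, sig2_mu
```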
3.3. Posterior inference

The sampling scheme is run for a suitably long period, known as the burn-in period, to ensure the chain has converged to its target distribution. The values generated in the burn-in period are discarded and the post burn-in values are retained for inference. For a more thorough review of MCMC simulation see Casella and George (1992) and Chib and Greenberg (1995). In order to confidently use the post burn-in iterates for inference, it is necessary to check that the sampling scheme has converged. We judge convergence visually by running the sampling scheme from three different initial positions and plotting various functionals of the iterates on the same graph. Successful convergence is indicated by the overlap of the functionals from the three chains. For the example in this paper we used a burn-in period of 10,000 iterations and we then used the next 10,000 iterates for inference.

In Bayesian inference, the parameters are usually estimated by their posterior means. A desirable outcome of any model is an understanding of how small changes in the covariates affect the mean response. In a homoscedastic linear regression model such marginal effects are given by the regression coefficients, but this is not the case in the heteroscedastic probit regression model. There has been considerable discussion in the literature on how to compute such marginal effects; see, for example, Greene (2003, p. 668). In general, it is possible to evaluate the expressions (derivatives for continuous variables and differences between two probabilities for dummy variables) at the sample means of the data, or to evaluate the marginal effects at every observation and use the sample mean of the individual marginal effects. We will use the second method, i.e. averaging the individual marginal effects, which is favoured in current practice. We define the marginal effect of a continuous covariate $h$ for the $i$th group and the $t$th individual in that group as

$$E(me_{hit} \mid X, W) = \int \frac{\partial p(y_{it} = 1 \mid \mu_i, \beta, \delta, x_{it}, w_{it})}{\partial h_{it}}\, p(U, \beta, \delta \mid X, W)\, dU\, d\beta\, d\delta, \tag{3.3}$$

where

$$\frac{\partial p(y_{it} = 1 \mid \mu_i, \beta, \delta, x_{it}, w_{it})}{\partial h_{it}} = \phi\left(\frac{\mu_i + x_{it}\beta}{\exp(w_{it}\delta/2)}\right) \frac{\beta_h - (\mu_i + x_{it}\beta)\delta_h/2}{\exp(w_{it}\delta/2)},$$

with the term $(\mu_i + x_{it}\beta)\delta_h/2$ absent if $h$ appears only in the mean function. The marginal effect of covariate $h$, averaged over all $i$ and $t$, is defined as

$$E(me_h \mid X, W) = \frac{1}{T} \sum_{i=1}^N \sum_{t=1}^{T_i} E(me_{hit} \mid X, W),$$

where $T = \sum_{i=1}^N T_i$. $E(me_{hit} \mid X, W)$ is estimated using the iterates of the MCMC chain as

$$\widehat{me}_{hit} = \frac{1}{S} \sum_{j=1}^S \frac{\partial p\left(y_{it} = 1 \mid \mu_i^{[j]}, \beta^{[j]}, \delta^{[j]}, x_{it}, w_{it}\right)}{\partial h_{it}},$$
where $S$ is the number of posterior draws after the burn-in period, and $E(me_h \mid X, W)$ is estimated as

$$\widehat{me}_h = \frac{1}{T} \sum_{i=1}^N \sum_{t=1}^{T_i} \widehat{me}_{hit}.$$
When $h$ is a dummy variable we calculate $E(me_{hit} \mid X, W)$ as in (3.3), but with the derivative replaced by the difference in probabilities

$$p(y_{it} = 1 \mid \mu_i, \beta, \delta, x_{it}, w_{it}, h_{it} = 1) - p(y_{it} = 1 \mid \mu_i, \beta, \delta, x_{it}, w_{it}, h_{it} = 0).$$

To end this section, we show how to calculate predicted probabilities using the iterates from the MCMC simulation. We note that

$$E(p_{it} \mid x_{it}, w_{it}) = \int p(y_{it} = 1 \mid \mu_i, \beta, \delta, x_{it}, w_{it})\, p(\mu_i, \beta, \delta \mid x_{it}, w_{it})\, d\mu_i\, d\beta\, d\delta,$$

which is estimated by

$$\hat p_{it} = \frac{1}{S} \sum_{j=1}^S p\left(y_{it} = 1 \mid \mu_i^{[j]}, \beta^{[j]}, \delta^{[j]}, x_{it}, w_{it}\right).$$
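A sketch of how the averaged marginal effects and predicted probabilities can be computed from the retained draws; the array shapes, einsum layout and function names are our own illustrative choices for a balanced panel.

```python
# Sketch: posterior-mean marginal effect of a continuous covariate h, and
# predicted probabilities p_hat_it, averaged over S retained draws.
import numpy as np
from scipy.stats import norm

def marginal_effect_h(mu_d, beta_d, delta_d, x, w, h, in_variance=True):
    """mu_d: (S, N); beta_d, delta_d: (S, k); x, w: (N, T, k); h: index."""
    lin = np.einsum('ntk,sk->snt', x, beta_d) + mu_d[:, :, None]
    s = np.exp(0.5 * np.einsum('ntk,sk->snt', w, delta_d))
    grad = beta_d[:, h, None, None]
    if in_variance:                        # h also enters the variance
        grad = grad - lin * delta_d[:, h, None, None] / 2.0
    me = norm.pdf(lin / s) * grad / s      # per-draw, per-observation effect
    return me.mean()                       # average over draws, i and t

def predicted_prob(mu_d, beta_d, delta_d, x, w):
    lin = np.einsum('ntk,sk->snt', x, beta_d) + mu_d[:, :, None]
    s = np.exp(0.5 * np.einsum('ntk,sk->snt', w, delta_d))
    return norm.cdf(lin / s).mean(axis=0)  # p_hat_it
```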
4. EXAMPLE

In this section a real data set is analysed using both the REP and REHET models and the estimates from the two models are compared. The fit of the two models is also compared using the Bayes factor, K-fold cross-validation of the predictive density, the DIC and ROC curves. The results of a simulation study are also presented to assess the performance of the REP and REHET models when heteroscedasticity exists and when it does not.

4.1. Papsmear data

This data set is from a stated preference study of Australian women who choose whether or not to have a Pap test; see Fiebig and Hall (2005). There are 79 women in the sample and each respondent is presented with 32 scenarios. Thus, in terms of the panel structure described above, $N = 79$ and $T_i = 32$, which means that there is a relatively large number of repeated observations for each individual. The data set also contains five covariates available to be included in the mean and variance functions. These covariates are described in Table 1.

Table 2 presents the results of estimating the REP and REHET models.
Table 1. Papsmear data. Definition of variables.

Variable   Definition
knowgp     = 1 if the GP is known to the patient; = 0 otherwise
sexgp      = 1 if the GP is male; = 0 if the GP is female
testdue    = 1 if the patient is due or overdue for a Pap test; = 0 otherwise
drrec      = 1 if the GP recommends that the patient has a Pap test; = 0 otherwise
papcost    cost of the test in $A
Table 2. Papsmear data: Estimates of the coefficients in the mean and variance functions and the mean marginal effects (me).

                    REP                                    REHET
           Est (Std.Er)        me (Std.Er)         Est (Std.Er)        me (Std.Er)
Mean
Constant   −0.2888 (0.1822)    –                   −0.0758 (0.0992)    –
knowgp     0.3039 (0.0649)     0.0661 (0.0139)     0.1533 (0.0408)     0.0699 (0.0130)
sexgp      −0.6608 (0.0672)    −0.1466 (0.0142)    −0.3594 (0.0682)    −0.1535 (0.0136)
testdue    1.1891 (0.0763)     0.2764 (0.0166)     0.5397 (0.1028)     0.2620 (0.0178)
drrec      0.4967 (0.0736)     0.1105 (0.0163)     0.1869 (0.0535)     0.0933 (0.0156)
papcost    −0.0095 (0.0029)    −0.0021 (0.0006)    −0.0042 (0.0014)    −0.0020 (0.0006)
Variance
knowgp     –                   –                   0.1855 (0.2041)     –
sexgp      –                   –                   −1.1495 (0.2352)    –
testdue    –                   –                   −1.2914 (0.2715)    –
drrec      –                   –                   −0.3124 (0.2216)    –
papcost    –                   –                   0.0007 (0.0086)     –
σμ         1.3403 (0.1359)     –                   0.6499 (0.1249)     –

Note: Standard errors in parentheses.
For the REHET model, the covariates selected into the variance function are the same as those in the mean function, because we do not have prior information on the source of the heteroscedasticity. The table shows that the parameter estimates differ substantially in magnitude between the REHET and REP models. However, owing to the scale identification problem, a direct comparison of the parameters should be made after rescaling the REP model's error variance (which is set to 1). For instance, if we take the estimate of $\sigma_\mu$ from the REHET model as the basis, the REP estimates are rescaled by multiplying by 0.6499/1.3403. The rescaled parameter estimates (with their standard errors) are constant: −0.1400 (0.0883), knowgp: 0.1474 (0.0315), sexgp: −0.3204 (0.0326), testdue: 0.5766 (0.0370), drrec: 0.2408 (0.0357) and papcost: −0.0046 (0.0014). Except for the constant, which is not statistically significant in either model, the estimates obtained from the REP model are similar to the REHET estimates.

For the mean function, both the REHET and REP models report all variables to be statistically significant. For the variance function, REHET identifies sexgp and testdue as statistically significant. The estimated average marginal effects, measuring the change in the predicted probability due to a change in each independent variable, are almost the same for both models. However, the interpretation is somewhat different, as part of the estimated marginal effects of the REHET model is due to the influence of the two significant effects in the estimated variance function. There is significantly less variability in women's screening choices when their GP is male and when the test is due. Both of these results are sensible. On average, women are less likely to test when their GP is male, but for some women the gender of their GP does not have a big impact, leading to increased variability in their choices. While women are more likely to test when the test is due, there is considerable debate about what the recommended screening interval should be, which is likely to translate into more variability in testing choices.

We also compute the percentage change in the marginal effects for each individual in going from the REP to the REHET model, because the mean marginal effect may mask such individual differences.
[Figure 1 near here. Vertical axis: marginal effects percentage change; one boxplot for each of knowgp, sexgp, testdue, drrec and papcost.]

Figure 1. Boxplots of the percentage change in going from REP to REHET.
The percentage change is defined as

$$\frac{\left(\text{REHET } me_{hit} - \text{REP } me_{hit}\right) \times 100}{\text{REP } me_{hit}},$$

with $i = 1, \ldots, N$ and $t = 1, \ldots, T_i$. Figure 1 presents the boxplots of the percentage changes in the marginal effects for all five independent variables. The range of the percentage changes is relatively wide, which means that moving to REHET increases the variability across the individual marginal effects. This is true not only for sexgp and testdue but also for the other three variables; the reason is that the significant heteroscedasticity affects the marginal effects of all the independent variables. It is also worth mentioning that the distributions of the changes are often asymmetric around zero. For example, the changes for drrec are mostly on the negative side, which is the reason why its estimated mean marginal effect in the REHET model is somewhat smaller than its value in the REP model. Finally, it is worth noting that although the REP model reports significant covariates for the mean function, these estimates ought to be treated with caution, since REHET suggests that heteroscedasticity does exist and ignoring it may result in biased and inconsistent estimates.

4.2. Bayes factor

A standard Bayesian approach to comparing two models $M_1$ and $M_2$ is to compute their Bayes factor, which is the ratio of their marginal likelihoods,

$$BF = \frac{m(Y \mid M_1)}{m(Y \mid M_2)},$$

where $m(Y \mid M_i) = \int f(Y \mid M_i, \theta_i)\, \pi_i(\theta_i \mid M_i)\, d\theta_i$. There are a number of ways to compute the marginal likelihoods, and we use the methods of Chib (1995) and Chib and Jeliazkov (2001),
which are based on the identity
$$m(Y) = \frac{f(Y \mid \theta)\,\pi(\theta)}{\pi(\theta \mid Y)},$$
or equivalently $\ln m(Y) = \ln f(Y \mid \theta) + \ln \pi(\theta) - \ln \pi(\theta \mid Y)$, which holds for any θ. In our work, θ consists of all the parameters in the model, excluding the random effects. Although the above identity for the marginal likelihood holds for any θ, the best results are obtained by selecting a θ = θ* which has a high posterior density; see Chib (1995). It is straightforward to compute the prior at θ*. To compute the likelihood f(Y | θ*) we integrate out the random effects,
$$f(Y \mid \theta^*) = \int f(Y \mid \theta^*, U)\,f(U \mid \theta^*)\,dU = \prod_{i=1}^{N} \int \prod_{t=1}^{T_i} f(y_{it} \mid \theta^*, \mu_i)\,f(\mu_i \mid \theta^*)\,d\mu_i,$$
with the integral approximated by generating a random sample of μ_i from f(μ_i | θ*).

We now discuss how to estimate the posterior density π(θ* | Y). This is relatively easy for the REP model as its full conditional densities are all closed-form; see Chib (1995). The logarithm of the posterior ordinate for the REP model is $\ln \pi(\hat\beta, \hat\sigma_\mu^2 \mid Y)$, where $(\hat\beta, \hat\sigma_\mu^2)$ is the posterior mean. Note that
$$\pi(\hat\beta, \hat\sigma_\mu^2 \mid Y) = \pi(\hat\sigma_\mu^2 \mid Y)\,\pi(\hat\beta \mid Y, \hat\sigma_\mu^2),$$
and $\pi(\hat\sigma_\mu^2 \mid Y)$ can be estimated by Monte Carlo integration,
$$\hat\pi(\hat\sigma_\mu^2 \mid Y) = \frac{1}{G} \sum_{g=1}^{G} \pi(\hat\sigma_\mu^2 \mid U^{[g]}),$$
where G denotes the number of draws from the posterior and is taken as 5000 in our implementation. $\pi(\hat\beta \mid Y, \hat\sigma_\mu^2)$ can be estimated by Monte Carlo integration from a reduced MCMC run with $\sigma_\mu^2$ fixed at $\hat\sigma_\mu^2$. The posterior ordinate for the REHET model is
$$\pi(\hat c_\delta, \hat\delta, \hat\sigma_\mu^2, \hat\beta \mid Y) = \pi(\hat c_\delta \mid Y)\,\pi(\hat\delta \mid Y, \hat c_\delta)\,\pi(\hat\sigma_\mu^2 \mid Y, \hat c_\delta, \hat\delta)\,\pi(\hat\beta \mid Y, \hat c_\delta, \hat\delta, \hat\sigma_\mu^2).$$
The term π(δ̂ | Y, ĉ_δ) is estimated by the method of Chib and Jeliazkov (2001), as it cannot be estimated directly by Monte Carlo integration because δ is generated by a Metropolis–Hastings step. The terms π(ĉ_δ | Y), π(σ̂_μ² | Y, ĉ_δ, δ̂) and π(β̂ | Y, ĉ_δ, δ̂, σ̂_μ²) are estimated using standard Markov chain Monte Carlo simulation.

To estimate the standard error of the log marginal likelihood estimates we repeated the estimation 50 times using different random number seeds. For the REP model, the mean of the 50 estimates was −1122.04 with a standard deviation of 1.51, so the standard error of the mean is 0.21; for the REHET model, the average of the 50 estimates was −1096.81 with a standard deviation of 2.19, so the standard error of the mean is 0.31. As the mean of the logarithm of the Bayes factor of the REHET to REP models is 25.23, it is clear from the standard errors that the REHET model is preferred based on the Bayes factor.
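To make the likelihood part of the Chib (1995) identity concrete, the following is a minimal Python sketch for a generic random effects probit; the function name and the simple simulation-based integration over μ_i are illustrative, not the authors' code, and the prior and posterior ordinates are assumed to be supplied separately.

```python
import numpy as np
from scipy.stats import norm

def loglik_integrated(y, X, beta, sigma_mu, n_draws=5000, seed=0):
    """ln f(Y | theta*) for a random effects probit, integrating the
    random effects out by simple Monte Carlo over mu_i ~ N(0, sigma_mu^2).
    y, X: lists over groups i of (T_i,) responses and (T_i, k) covariates."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for y_i, X_i in zip(y, X):                 # groups i = 1, ..., N
        mu = rng.normal(0.0, sigma_mu, size=n_draws)
        s = 2.0 * np.asarray(y_i) - 1.0        # maps {0,1} to {-1,+1}
        eta = X_i @ beta                       # (T_i,) linear predictor
        # (n_draws, T_i) probit probabilities for each draw of mu_i
        p = norm.cdf(s[None, :] * (eta[None, :] + mu[:, None]))
        total += np.log(p.prod(axis=1).mean())
    return total

# Chib (1995): ln m(Y) = ln f(Y | theta*) + ln pi(theta*) - ln pi(theta* | Y).
# The prior and posterior ordinates are model-specific and are assumed to be
# computed from the prior density and the MCMC output, respectively.
```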
4.3. K-fold cross-validation of the predictive density

Lempers (1971) suggested setting aside part of the data as a training sample from which to compute an informative prior. Model comparison is then carried out using Bayes factors conditional on the training sample. This Bayes factor is usually called a partial Bayes factor. There are several variations of this approach, including pseudo-Bayes factors (Geisser and Eddy, 1979), intrinsic Bayes factors (Berger and Pericchi, 1996) and fractional Bayes factors (O'Hagan, 1995). The Bayes factor approach used in this section belongs to this family.

We randomly split the data Y into K roughly equal parts {Y_r; r = 1, 2, ..., K} by partitioning on the women. The conditional predictive densities are {f(Y_r | Y_{−r}); r = 1, 2, ..., K}, with
$$f(Y_r \mid Y_{-r}) = \int f(Y_r \mid \theta, Y_{-r})\,f(\theta \mid Y_{-r})\,d\theta, \qquad (4.1)$$
where Y_{−r} denotes all elements of Y except Y_r. The attraction of using the conditional marginal likelihood (4.1) is that it eliminates dependence on the prior and is straightforward to compute using Monte Carlo integration. We use the partial Bayes factors (PBFs)
$$\mathrm{PBF}_r = \frac{f(Y_r \mid Y_{-r}, \mathrm{REHET})}{f(Y_r \mid Y_{-r}, \mathrm{REP})}, \qquad r = 1, 2, \ldots, K,$$
to compare REP and REHET. If PBF_r > 1 the test sample Y_r supports the REHET model; otherwise, it supports the REP model. A common choice for K is 10. Table 3 reports the results and shows that 8 out of 10 partial Bayes factors exceed 1, suggesting support for the REHET model. Moreover, the arithmetic and geometric means of the 10 partial Bayes factors are 2063.37 and 38.84, respectively. Both are clearly greater than 1, indicating that the REHET model outperforms the REP model.

4.4. Deviance information criterion

We also compare the REP and REHET models using the DIC proposed by Spiegelhalter et al. (2002). The DIC is a hierarchical modelling generalization of the Akaike information criterion (AIC) and is based on the posterior distribution of the deviance statistic
$$D(\theta) = -2 \ln f(Y \mid \theta) + 2 \ln f(Y),$$
where θ now consists of all the unknown parameters and random effects in the model and f(Y) is some standardizing function of the data only. The DIC is defined as
$$\mathrm{DIC} = \bar{D} + p_D,$$
where $\bar{D} = E_{\theta \mid Y}(D(\theta))$ measures the quality of the model's fit to the data. The effective number of parameters $p_D$ is estimated by $\bar{D} - D(\bar\theta)$, where $\bar\theta$ is the posterior mean of θ. DIC uses $p_D$ to penalize models with more parameters. In comparing models using DIC, the best model is the one with the smallest DIC. The DIC is computed for each model using the MCMC iterates.
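As a rough illustration, the DIC can be computed from MCMC output as in the following Python sketch, assuming f(Y) ≡ 1 (so the deviance is D(θ) = −2 ln f(Y | θ)) and a user-supplied function log_lik returning ln f(Y | θ) for one draw; both are assumptions for illustration.

```python
import numpy as np

def dic(draws, log_lik):
    """draws: (G, p) array of posterior draws of theta (parameters plus
    random effects); log_lik(theta) returns ln f(Y | theta).
    Returns (DIC, p_D) with D(theta) = -2 ln f(Y | theta)."""
    deviances = np.array([-2.0 * log_lik(th) for th in draws])
    d_bar = deviances.mean()                        # posterior mean deviance
    d_at_mean = -2.0 * log_lik(draws.mean(axis=0))  # D(theta_bar)
    p_d = d_bar - d_at_mean                         # effective no. of parameters
    return d_bar + p_d, p_d
```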
Table 3. Model comparison using 10-fold cross-validation of the predictive densities for the Papsmear data.

r       1         2      3     4      5      6      7     8     9       10
PBF_r   12,094.4  41.02  2.48  239.9  29.04  19.57  0.33  0.14  8084.5  122.38
For the Papsmear data, the p_D of the REHET model is 82.7, which is larger than the p_D of the REP model, which is 77.7. However, the DIC value for the REHET model is 1967.9, appreciably smaller than the DIC value for the REP model of 2047.9, so that REHET is preferred over REP based on DIC.

4.5. ROC curve

Another approach for comparing two binary response models is their ability to classify correctly both of the binary outcomes (0 and 1) across a range of covariate values. An important visual display of this ability is the ROC curve; see, for example, Hosmer and Lemeshow (2000, p. 160). We now briefly describe what a ROC curve is. Let θ̂ be the estimated parameters in the model and (x, w) a set of covariates. For a given cutoff c, 0 ≤ c ≤ 1, let ŷ_c = 1 if Pr(y = 1 | x, w, θ̂) > c and let ŷ_c = 0 otherwise. A ROC curve is a plot of Pr(ŷ_c = 1 | y = 1) against Pr(ŷ_c = 1 | y = 0) for different values of c and the covariates. In a ROC plot, the bigger the value of Pr(ŷ_c = 1 | y = 1) for a given Pr(ŷ_c = 1 | y = 0), the better the performance of the classifier. Figure 2 plots the ROC curves for the REP and REHET models applied to the Papsmear data and suggests that the REHET model outperforms the REP model based on in-sample classification performance.
[Figure 2 here: ROC curves for the REP and REHET models; Pr(ŷ_c = 1 | y = 1) plotted against Pr(ŷ_c = 1 | y = 0).]
Figure 2. ROC curve for Papsmear data.
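A minimal sketch of the ROC construction just described, assuming fitted probabilities p_hat = Pr(y = 1 | x, w, θ̂) are already available (the variable names are illustrative):

```python
import numpy as np

def roc_points(y, p_hat, n_cut=101):
    """True and false positive rates over a grid of cutoffs c."""
    cuts = np.linspace(0.0, 1.0, n_cut)
    tpr, fpr = [], []
    for c in cuts:
        y_c = (p_hat > c).astype(int)
        tpr.append(np.mean(y_c[y == 1] == 1))  # Pr(y_c = 1 | y = 1)
        fpr.append(np.mean(y_c[y == 0] == 1))  # Pr(y_c = 1 | y = 0)
    return np.array(fpr), np.array(tpr)
```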
4.6. Simulation from the fitted model

We use the estimates for the REP model fitted to the Papsmear data (see Table 2) to carry out the first simulation study. Denote the estimates of the regression coefficients and the standard deviation of the random effects used to simulate the data by β̂ (a 6 × 1 vector), δ̂ (a 5 × 1 vector) and σ̂_μ. One hundred replications of data are generated by
$$y_{it}^* = x_{it}\hat\beta + \mu_i + \nu_{it}, \qquad i = 1, \ldots, N, \quad t = 1, \ldots, T_i,$$
where x_it is a (1 × k) vector of covariates, μ_i is the random effect for group i generated from N(0, σ̂_μ²), and ν_it is a normal random deviate with mean zero and standard deviation σ_it, which is set to 1. We construct the binary response y_it by setting
$$y_{it} = \begin{cases} 1 & \text{if } y_{it}^* > 0, \\ 0 & \text{otherwise.} \end{cases}$$
The generated data are then used to estimate both the REP and REHET models. Table 4 presents the averages of the parameter estimates based on the 100 replications as well as the standard errors. The table also presents the averages of the marginal effects and the corresponding standard errors. The results show that the REP and REHET models perform similarly because the coefficient estimates of the variance function in the REHET model shrink to nearly zero. The results suggest that if the model is homoscedastic then little is lost by fitting a heteroscedastic model.
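A minimal sketch of the data-generating process just described; the coefficient values are the 'Truth' column of Table 4, while N, the group sizes T_i and the covariates are illustrative placeholders rather than the Papsmear data:

```python
import numpy as np

rng = np.random.default_rng(0)
# REP estimates used as the data-generating values (Table 4, 'Truth')
beta_hat = np.array([-0.2888, 0.3039, -0.6608, 1.1891, 0.4967, -0.0095])
sigma_mu = 1.3403
N = 1000                                  # illustrative; the study uses the Papsmear panel

y_sim = []
for i in range(N):
    T_i = rng.integers(1, 6)              # unbalanced group sizes, illustrative
    X_i = np.column_stack([np.ones(T_i), rng.normal(size=(T_i, 5))])  # placeholder covariates
    mu_i = rng.normal(0.0, sigma_mu)              # random effect for group i
    nu = rng.normal(0.0, 1.0, size=T_i)           # sigma_it = 1 under the REP design
    y_star = X_i @ beta_hat + mu_i + nu           # latent variable
    y_sim.append((y_star > 0).astype(int))        # observed binary response
```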
Table 4. Simulation 1 result: estimates of the coefficients in the mean and variance functions and the mean marginal effects (me).

                       REP                                 REHET
         Truth    Est (Std.Er)       me (Std.Er)       Est (Std.Er)       me (Std.Er)
Mean
β0      −0.2888  −0.3070 (0.1740)   –                 −0.3436 (0.1912)   –
β1       0.3039   0.3029 (0.0719)   0.0616 (0.0151)    0.3199 (0.0967)   0.0616 (0.0151)
β2      −0.6608  −0.6613 (0.0672)  −0.1362 (0.0157)   −0.6775 (0.0994)  −0.1355 (0.0157)
β3       1.1891   1.1890 (0.0755)   0.2594 (0.0184)    1.2326 (0.1755)   0.2584 (0.0183)
β4       0.4967   0.4986 (0.0757)   0.1042 (0.0174)    0.5205 (0.1209)   0.1038 (0.0175)
β5      −0.0095  −0.0096 (0.0031)  −0.0019 (0.0006)   −0.0097 (0.0035)  −0.0019 (0.0006)
Variance
δ1       –        –                 –                  0.0026 (0.1606)   –
δ2       –        –                 –                  0.0324 (0.1559)   –
δ3       –        –                 –                 −0.0140 (0.1671)   –
δ4       –        –                 –                  0.0192 (0.1780)   –
δ5       –        –                 –                  0.0008 (0.0085)   –
σμ       1.3403   1.3673 (0.1234)   –                  1.4223 (0.2196)   –

Note: Standard errors in parentheses.
Table 5. Simulation 2 result: estimates of the coefficients in the mean and variance functions and the mean marginal effects (me).

                       REP                                 REHET
         Truth    Est (Std.Er)       me (Std.Er)       Est (Std.Er)       me (Std.Er)
Mean
β0      −0.0758  −0.2908 (0.1550)   –                 −0.0870 (0.0961)   –
β1       0.1533   0.3372 (0.0682)   0.0668 (0.0139)    0.1714 (0.0455)   0.0622 (0.0120)
β2      −0.3594  −0.6714 (0.0809)  −0.1371 (0.0188)   −0.3743 (0.0786)  −0.1378 (0.0186)
β3       0.5397   1.1729 (0.0833)   0.2566 (0.0233)    0.5747 (0.1092)   0.2573 (0.0218)
β4       0.1869   0.4374 (0.0786)   0.0883 (0.0169)    0.1954 (0.0542)   0.0855 (0.0164)
β5      −0.0042  −0.0088 (0.0032)  −0.0017 (0.0006)   −0.0043 (0.0015)  −0.0017 (0.0005)
Variance
δ1       0.1855   –                 –                  0.1790 (0.1648)   –
δ2      −1.1495   –                 –                 −1.1256 (0.2040)   –
δ3      −1.2914   –                 –                 −1.2215 (0.2384)   –
δ4      −0.3124   –                 –                 −0.3205 (0.2008)   –
δ5       0.0007   –                 –                  0.0009 (0.0084)   –
σμ       0.6499   1.3379 (0.1138)   –                  0.6877 (0.1211)   –

Note: Standard errors in parentheses.
In the second simulation study, the data are generated using the estimates from the REHET model fitted to the Papsmear data (see Table 2). The data-generating process is the same as the one in the first simulation study except that $\sigma_{it} = \exp(w_{it}\hat\delta/2)$. Table 5 presents the means and standard errors of the parameter estimates based on 100 replications. It also presents the averages of the marginal effects and the corresponding standard errors. The results suggest that if heteroscedasticity is present the REHET model produces very accurate estimates of both the mean and variance functions. As before, we cannot compare the parameters of the two models unless the REP model's error variance is rescaled. If we choose the estimate of σ_μ from the REHET model as the basis, the rescaled REP parameter estimates (with their standard errors) are −0.1495 (0.0797), 0.1733 (0.0351), −0.3451 (0.0416), 0.6029 (0.0428), 0.2248 (0.0404) and −0.0045 (0.0016). Although they are not as accurate as the estimates of the REHET model, they are now much closer to the true values. As expected, the REP model provides overall marginal effects estimates similar to those of the REHET model.
5. DISCUSSION

A Bayesian analysis of a random effects probit model is presented which allows for heteroscedasticity that is modelled as an exponential function of a linear combination of the covariates. For the Papsmear data, we compared the heteroscedastic model to a homoscedastic random effects probit model and found that the heteroscedastic model outperforms the homoscedastic model on all four of the model comparison criteria that we used. We also applied the heteroscedastic model to other data sets, some of which had a large number of random effects and an unbalanced panel structure. In particular, an earlier version of this paper analysed an exercise
data example which had 11,938 groups, with the group size varying from 1 to 6. Our MCMC sampling scheme performed well for these data sets because it generates the random effects and regression coefficients as a block and uses one-step Fisher scoring to build the Metropolis–Hastings proposal for the variance parameters.
ACKNOWLEDGMENTS

The research of Denzil G. Fiebig was partially supported by National Health and Medical Research Council Program Grant No. 254202. The research of Robert Kohn was partially supported by ARC research grant DP0667069. The research of Yuanyuan Gu was partially supported by both of these grants. We would like to thank two anonymous referees whose comments improved the presentation of the paper.
REFERENCES

Albert, J. and S. Chib (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association 88, 669–79.
Baltagi, B. H. (2005). Econometric Analysis of Panel Data. Chichester: John Wiley.
Baltagi, B. H., G. Bresson and A. Pirotte (2005). Adaptive estimation of heteroskedastic error component models. Econometric Reviews 24, 39–58.
Berger, J. O. and L. Pericchi (1996). The intrinsic Bayes factor for model selection and prediction. Journal of the American Statistical Association 91, 109–22.
Borjas, G. J. and G. T. Sueyoshi (1994). A two-stage estimator for probit models with structural group effects. Journal of Econometrics 64, 165–82.
Butler, J. S. and R. Moffitt (1982). A computationally efficient quadrature procedure for the one-factor multinomial probit model. Econometrica 50, 761–64.
Casella, G. and E. George (1992). Explaining the Gibbs sampler. The American Statistician 46, 167–74.
Chan, D., R. Kohn, D. Nott and C. Kirby (2006). Locally adaptive semiparametric estimation of the mean and variance functions in regression models. Journal of Computational and Graphical Statistics 22, 915–36.
Chib, S. (1995). Marginal likelihood from the Gibbs output. Journal of the American Statistical Association 90, 1313–21.
Chib, S. and E. Greenberg (1995). Understanding the Metropolis–Hastings algorithm. The American Statistician 49, 327–35.
Chib, S. and I. Jeliazkov (2001). Marginal likelihood from the Metropolis–Hastings output. Journal of the American Statistical Association 96, 270–81.
Fiebig, D. G. and J. Hall (2005). Discrete choice experiments in the analysis of health policy. In Productivity Commission Conference, November 2004: Quantitative Tools for Microeconomic Policy Analysis, 119–36. Melbourne: Media and Publications, Productivity Commission.
Gamerman, D. (1997). Sampling from the posterior distribution in generalized linear mixed models. Statistics and Computing 7, 57–68.
Geisser, S. and W. Eddy (1979). A predictive approach to model selection. Journal of the American Statistical Association 74, 153–60.
Greene, W. H. (2003). Econometric Analysis. Upper Saddle River, NJ: Prentice Hall.
Harvey, A. (1976). Estimating regression models with multiplicative heteroscedasticity. Econometrica 44, 461–65.
Heckman, J. J. and R. J. Willis (1975). Estimation of a stochastic model of reproduction: an econometric approach. In N. Terleckyj (Ed.), Household Production and Consumption, NBER Studies in Income and Wealth, Volume 40, 99–145. New York: Columbia University Press.
Hosmer, D. W. and S. Lemeshow (2000). Applied Logistic Regression. Hoboken, NJ: John Wiley.
Keane, M. P. (1993). Simulation estimation for panel data with limited dependent variables. In G. S. Maddala, C. R. Rao and H. D. Vinod (Eds.), Handbook of Statistics, Volume 11, 545–72. Amsterdam: Elsevier Science and Technology.
Lempers, F. B. (1971). Posterior Probabilities of Alternative Linear Models. Rotterdam: University Press.
Li, Q. and T. Stengos (1994). Adaptive estimation in the panel data error component model with heteroskedasticity of unknown form. International Economic Review 35, 981–1000.
O'Hagan, A. (1995). Fractional Bayes factor for model comparison. Journal of the Royal Statistical Society, Series B 57, 99–138.
Randolph, W. C. (1988). A transformation for heteroscedastic error components regression models. Economics Letters 27, 349–54.
Roy, N. (2002). Is adaptive estimation useful for panel models with heteroscedasticity in the individual specific error component? Some Monte Carlo evidence. Econometric Reviews 21, 189–203.
Spiegelhalter, D. J., N. G. Best, B. P. Carlin and A. Van Der Linde (2002). Bayesian measures of model complexity and fit (with discussion). Journal of the Royal Statistical Society, Series B 64, 583–639.
Yatchew, A. and Z. Griliches (1985). Specification error in probit models. Review of Economics and Statistics 18, 134–39.
Zellner, A. (1986). On assessing prior distributions and Bayesian regression analysis with g-prior distributions. In P. K. Goel and A. Zellner (Eds.), Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti, 233–43. Amsterdam: North-Holland.
The Econometrics Journal (2009), volume 12, pp. 340–366. doi: 10.1111/j.1368-423X.2009.00287.x
Panel unit root tests in the presence of cross-sectional dependence: finite sample performance and an application

S. DE SILVA†, K. HADRI‡ AND A. R. TREMAYNE§,¶

†Faculty of Economics and Business, University of Sydney, NSW 2006, Australia
E-mail: [email protected]
‡Queen's University Management School, Queen's University Belfast, 25 University Square, Belfast BT7 1NN, UK
E-mail: [email protected]
§Department of Economics, University of York, Heslington, York YO10 5DD, UK
¶Department of Economics, University of Melbourne, VIC 3010, Australia
E-mail: [email protected]
First version received: November 2006; final version accepted: March 2009
Summary This paper examines the finite sample properties of three testing regimes for the null hypothesis of a panel unit root against stationary alternatives in the presence of cross-sectional correlation. The regimes of Bai and Ng (2004), Moon and Perron (2004) and Pesaran (2007) are assessed in the presence of multiple factors and also in other non-standard situations. The behaviour of some information criteria used to determine the number of factors in a panel is examined, and new information criteria with improved properties in small-N panels are proposed. An application to the efficient markets hypothesis is also provided. The null hypothesis of a panel random walk is not rejected by any of the tests, supporting the efficient markets hypothesis in the financial services sector of the Australian Stock Exchange.

Keywords: Cross-section dependence, Efficient markets hypothesis, Factor models, Finite sample properties, Panel data, Unit root tests.
1. INTRODUCTION

Panel unit root tests and panel stationarity tests are applied widely to help assess the validity of important theories such as purchasing power parity and the efficient markets hypothesis. The main motivation for using panel tests instead of univariate tests is the increase in power obtained by exploiting the cross-sectional dimension. Banerjee (1999), Baltagi and Kao (2000), Baltagi (2001) and, more recently, Breitung and Pesaran (2008) provide comprehensive surveys of the subject. The limit theory for this class of panel data test has been developed in a seminal paper by Phillips and Moon (1999).

Applications of earlier tests in this field, such as those of Hadri (2000), Choi (2001), Levin et al. (2002), Im et al. (2003) and Hadri and Larsson (2005), assume the independence of the individual cross-sections. However, this assumption is restrictive in practice. O'Connell (1998)
examines the effect of cross-sectional correlation on the test of Levin et al. (2002), first published as a working paper by Levin and Lin (1992), and finds that the size of the test in the presence of cross-sectional correlation is considerably distorted. Strauss and Yigit (2003) discuss the effect of cross-sectional correlation on the test of Im et al. (1997) and come to similar conclusions.

Parametric alternatives to these popular tests have been proposed by O'Connell (1998) and Breuer et al. (2001); however, both procedures have their drawbacks. O'Connell (1998) uses an estimated cross-sectional covariance matrix to transform the data into a cross-sectionally independent form. This generalized least squares (GLS) method requires the estimation of a large number of covariance parameters, and restrictions on the covariance matrix when the number of cross-sectional units, N, exceeds the number of time points, T, for which data are available. The Breuer et al. (2001) procedure tests the unit root hypothesis on the individual cross-sections in a seemingly unrelated regressions framework, yielding N tests. Non-linear instrumental variable methods have been proposed by Chang (2002) and Chang and Song (2002); however, the efficacy of this testing method has been called into question by Im and Pesaran (2003). Bootstrapping methods, first developed for this context by Maddala and Wu (1999) to allow for cross-sectional correlation of a general form in panel unit root tests, have also been advanced by Chang (2004) and Smith et al. (2004).

Several alternative approaches have been proposed that allow for cross-sectional correlation by modelling it as an approximate linear factor process. The advantage of this method of accounting for cross-sectional correlation is that, for small numbers of factors, it reduces the dimensionality of the covariance matrix and the number of parameters to be estimated, problems faced by other parametric methods. Further, the method allows each factor to have a unique (and possibly no) effect on each cross-section. Approaches to panel unit root testing allowing for an approximate linear factor process include: the panel analysis of non-stationarity in idiosyncratic and common components (PANIC) method proposed by Bai and Ng (2004); an orthogonalization method due to Moon and Perron (2004); and a proxy method proposed by Pesaran (2007). An alternative test, from the perspective of a stationarity test incorporating approximate linear factors, is that due to Harris et al. (2005).

One purpose of this paper is to examine the finite sample properties of these tests under a number of different data-generation processes (DGPs). In particular, the effect of multiple factors, non-normal processes and stochastic unit roots is examined. In order to provide a benchmark, the tests are compared to the panel unit root test derived by Im et al. (2003), which does not allow for cross-sectional correlation of any form but has nonetheless been widely applied in the literature; we shall use the acronym IPS for this statistic in what follows. Although some of the properties of these tests have also been examined by Gengenbach et al. (2004), our paper provides quite a comprehensive analysis of the size and power properties of the tests under a broad range of DGPs.

In order to apply the panel unit root tests examined here, the number of common factors present in a panel must be determined.
We find that the information criteria of Bai and Ng (2002) often overestimate the number of factors when N is small, making their application inadvisable under these circumstances. The univariate model selection criterion of Hannan and Quinn (1979) has a parameter penalty adjustment that increases at the slowest rate possible among the class of consistent criteria. We propose new panel criteria based on that of Hannan and Quinn (1979) and some of these alternatives appear to perform better than those of Bai and Ng (2002), particularly when N is small.
We also assess the evidence in favour of the efficient markets hypothesis in the financial services sector of the Australian Stock Exchange (ASX). Applying such tests to a panel of financial services firms listed on the ASX and overseen by the Australian Prudential Regulation Authority (APRA), rather than applying a univariate unit root test to a market index, provides evidence illustrating APRA’s ability to promote informational efficiency in the financial services sector while offering the potential power advantage of panel unit root tests. The outline of the rest of the paper is as follows. New information criteria based on the univariate criterion of Hannan and Quinn (1979) are introduced in Section 2 and their finite sample properties explored. Section 3 reviews data-generating processes (DGPs) and panel unit root tests. Section 4 reports abbreviated results from a large set of Monte Carlo experiments examining the finite sample size properties of the tests; this is extended in Section 5 to cover the size-adjusted power of tests. Section 6 examines an application of the tests to 18 financial services firms listed on the ASX and Section 7 concludes.
2. INFORMATION CRITERIA

In practice, the number of common factors in a panel is never known. The tests of both Bai and Ng (2004) and Moon and Perron (2004) require that the number of common factors be determined in some manner. The tests of Pesaran (2007) assume the existence of a single common factor, and later results will indicate that the validity of this assumption is pivotal to the finite sample properties of the test. Determining the number of factors reliably is, therefore, crucial when applying these tests. The information criteria of Bai and Ng (2002) provide one means of estimating the number of factors. This section provides a brief overview of the information criteria of Bai and Ng (2002) and also suggests several new criteria based on the univariate criterion of Hannan and Quinn (1979).

In order to determine the number of common factors in an approximate linear factor process with L true factors,
$$y_{it} = \gamma_i' f_t + \varepsilon_{it},$$
where $f_t$ and $\gamma_i$ are $L \times 1$ vectors of factors and factor loadings, respectively, $\varepsilon_{it}$ is the idiosyncratic error and $i = 1, \ldots, N$ and $t = 1, \ldots, T$, Bai and Ng suggest 12 criteria and these are detailed in Table 1. In this table, the quantities $V(\hat\Gamma^k, k)$, $\hat\sigma^2$ and $C_{NT}$ are defined as
$$V(\hat\Gamma^k, k) = \min_{\Gamma^k \in \mathcal{G}} \sum_{i=1}^{N} \sum_{t=1}^{T} \left( y_{it} - \gamma_i^{k\prime} \hat f_t^k \right)^2; \qquad (2.1)$$
$$\hat\sigma^2 = V(\hat\Gamma^{L_{\max}}, L_{\max}),$$
and $C_{NT} = \min[\sqrt{N}, \sqrt{T}]$, where $L_{\max}$ is the maximum number of factors allowed, $\Gamma^k$ is the $N \times k$ matrix $\Gamma^k = (\gamma_1^k, \gamma_2^k, \ldots, \gamma_N^k)'$, and $\gamma_i^k$ and $f_t^k$ are both $k \times 1$ vectors. The set $\mathcal{G}$ is defined as $\mathcal{G} = \{\Gamma^1, \Gamma^2, \ldots, \Gamma^{L_{\max}}\}$. The BIC_3 criterion given in Table 1 is not that proposed by Bai and Ng, but a criterion suggested by Moon and Perron (2004), who find it has desirable properties when N is small. The quantity IC(k) is calculated for $k = 0, 1, \ldots, L_{\max}$ factors, where IC(k) is the information criterion of choice. The estimate of the number of common factors is $\hat L = \arg\min_{0 \le k \le L_{\max}} IC(k)$.
Table 1. The information criteria of Bai and Ng.

Criterion    IC(k)
PC_1(k)      $V(\hat\Gamma^k,k) + \hat\sigma^2 k \frac{N+T}{NT} \ln\!\left(\frac{NT}{N+T}\right)$
PC_2(k)      $V(\hat\Gamma^k,k) + \hat\sigma^2 k \frac{N+T}{NT} \ln(C_{NT}^2)$
PC_3(k)      $V(\hat\Gamma^k,k) + \hat\sigma^2 k \ln(C_{NT}^2)/C_{NT}^2$
IC_1(k)      $\ln(V(\hat\Gamma^k,k)) + k \frac{N+T}{NT} \ln\!\left(\frac{NT}{N+T}\right)$
IC_2(k)      $\ln(V(\hat\Gamma^k,k)) + k \frac{N+T}{NT} \ln(C_{NT}^2)$
IC_3(k)      $\ln(V(\hat\Gamma^k,k)) + k \ln(C_{NT}^2)/C_{NT}^2$
AIC_1(k)     $V(\hat\Gamma^k,k) + \hat\sigma^2 k (2/T)$
AIC_2(k)     $V(\hat\Gamma^k,k) + \hat\sigma^2 k (2/N)$
AIC_3(k)     $V(\hat\Gamma^k,k) + \hat\sigma^2\, 2k \frac{N+T-k}{NT}$
BIC_1(k)     $V(\hat\Gamma^k,k) + \hat\sigma^2 k \frac{\ln(T)}{T}$
BIC_2(k)     $V(\hat\Gamma^k,k) + \hat\sigma^2 k \frac{\ln(N)}{N}$
BIC_3(k)     $V(\hat\Gamma^k,k) + \hat\sigma^2 k \frac{N+T}{NT} \ln(NT)$

Table 2. The HQ(k) class of criteria.

Criterion    HQ(k)
HQ_1(k)      $\ln(V(\hat\Gamma^k,k)) + \frac{2kc}{T}\ln(\ln(T))$
HQ_2(k)      $\ln(V(\hat\Gamma^k,k)) + \frac{2kc}{N}\ln(\ln(N))$
HQ_3(k)      $\ln(V(\hat\Gamma^k,k)) + 2kc\,\frac{N+T}{NT}\ln(\ln(NT))$
HQ_4(k)      $\ln(V(\hat\Gamma^k,k)) + 2kc\,\frac{N+T}{NT}\ln\!\left(\ln\!\left(\frac{NT}{N+T}\right)\right)$
HQ_5(k)      $\ln(V(\hat\Gamma^k,k)) + \frac{2kc}{C_{NT}^2}\ln(\ln(C_{NT}^2))$
HQ_6(k)      $\ln(V(\hat\Gamma^k,k)) + \frac{2kc}{C_{NT}^2-k}\ln(\ln(C_{NT}^2-k))$
HQ_7(k)      $\ln(V(\hat\Gamma^k,k)) + 2kc\,\frac{N+T-k}{NT}\ln\!\left(\ln\!\left(\frac{NT}{N+T-k}\right)\right)$
We propose several panel adaptations of the univariate Hannan and Quinn (1979) criterion as alternatives to the Bai and Ng information criteria for determining the number of common factors. The univariate Hannan–Quinn criterion has a parameter penalty adjustment that increases with the sample size at the slowest rate possible among the class of consistent criteria. As a comparison, the univariate Bayesian information criterion (BIC) increases at a faster rate, while the univariate Akaike Information Criterion (AIC) has a parameter penalty adjustment that fails to increase with sample size so as to make the criterion consistent. Although the AIC will choose a correct model with probability one as the sample size increases, it may choose a more profligate model than either the Hannan–Quinn or BIC criteria. It is this property of the univariate Hannan–Quinn criterion that makes panel versions of it a possible alternative to the Bai and Ng (2002) information criteria, which are themselves adaptations of the univariate BIC or AIC criteria. Seven panel adaptations of the univariate Hannan–Quinn criterion are given in Table 2. The fixed parameter c is required by Hannan and Quinn to be c > 1. All calculations reported here use c = 2, which, after some experimentation, proved preferable with our sample sizes. C The Author(s). Journal compilation C Royal Economic Society 2009.
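To illustrate the selection step, the following is a minimal Python sketch for a balanced panel stored as a (T, N) array. It computes V(k) exactly as in (2.1) via principal components (the 1/NT scaling used by some authors does not affect the argmin) and evaluates the IC_1 and HQ_2 penalties; in the unit root setting discussed next, Y would be the differenced data, per the PANIC-style adaptation described below.

```python
import numpy as np

def select_num_factors(Y, L_max, c=2.0):
    """Y: (T, N) panel. Returns (L_hat under IC_1, L_hat under HQ_2)."""
    T, N = Y.shape
    # Principal components via the SVD: top right-singular vectors span loadings
    _, _, Vt = np.linalg.svd(Y, full_matrices=False)
    ic1, hq2 = [], []
    for k in range(L_max + 1):
        if k == 0:
            V_k = np.sum(Y ** 2)        # no factors: all variation is residual
        else:
            G = Vt[:k].T                # (N, k) orthonormal loadings estimate
            F = Y @ G                   # (T, k) estimated factors
            V_k = np.sum((Y - F @ G.T) ** 2)
        ic1.append(np.log(V_k) + k * (N + T) / (N * T) * np.log(N * T / (N + T)))
        hq2.append(np.log(V_k) + 2 * k * c / N * np.log(np.log(N)))
    return int(np.argmin(ic1)), int(np.argmin(hq2))
```

The default c = 2 matches the value used in the paper's calculations.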
344
S. de Silva, K. Hadri and A. R. Tremayne
As this paper is concerned with applying these criteria under the null hypothesis of a panel unit root, special attention is paid to their behaviour in this case. All the criteria in Tables 1 and 2 are calculated using an adapted version of the PANIC regime suggested by Bai and Ng (2004) for use with the information criteria of Bai and Ng (2002) in the presence of non-stationarity. Rather than estimating the factors and loadings by the method of principal components on the data in levels, the factors and loadings are estimated from the data in differences, resulting in consistent estimators under the null hypothesis of a unit root. For further details, see Bai and Ng (2004).

Initially, the information criteria are examined under a DGP with no factors, $y_{it} = \varepsilon_{it}$ with $\varepsilon_{it} = \varepsilon_{i,t-1} + v_{it}$ and $v_{it} \sim \mathrm{NID}(0, 1)$; the chosen DGP is just a panel of N simple random walks. Values of N = 10, 20, 80 and T = 25, 50, 100, 200, 400 are examined. These values have been chosen to reflect the finite sample properties of the criteria (and, later, of the tests) in panels with a small number of cross-sections, such as panels of OECD countries, and also to reflect the properties of the procedures when the number of cross-sections approaches or exceeds the number of time series observations. The lengths of the time series are chosen to reflect finite sample properties of the tests both in reasonably long time series and when the time series length is short. It is hoped that this choice of sample sizes is general enough to assist the applied researcher in a wide variety of applications. Here and in what follows, 5000 replications are performed in each experiment.

The left-hand panel of Table 3a reports the average number of factors estimated by selected information criteria under this cross-sectionally independent DGP under the heading L = 0. The Bai and Ng IC_1 criterion is reported, along with the BIC_3 criterion modified by Moon and Perron (2004).
Table 3a. Average number of factors estimated by the information criteria.

              L = 0                                     L = 1
N   T    BIC_3  IC_1   HQ_2   HQ_4   HQ_6       BIC_3  IC_1   HQ_2   HQ_4   HQ_6
10  25   7.742  8.000  7.782  3.821  8.000      7.738  8.000  7.964  6.438  8.000
20  25   2.149  0.060  0.037  0.000  0.005      2.461  1.381  1.217  1.000  1.004
80  25   0.000  0.000  8.000  0.000  0.000      1.000  1.000  8.000  1.000  1.000
10  50   7.150  8.000  0.331  0.006  8.000      7.146  8.000  3.120  1.166  8.000
20  50   0.005  0.000  0.000  0.000  0.000      1.001  1.000  1.000  1.000  1.000
80  50   0.000  0.000  0.002  0.000  0.000      1.000  1.000  1.004  1.000  1.000
10  100  5.794  8.000  0.000  0.000  8.000      5.865  8.000  1.000  1.000  8.000
20  100  0.000  0.000  0.000  0.000  0.000      1.000  1.000  1.000  1.000  1.000
80  100  0.000  0.000  0.000  0.000  0.000      1.000  1.000  1.000  1.000  1.000
10  200  3.301  7.706  0.000  0.000  8.000      3.565  8.000  1.000  1.000  8.000
20  200  0.000  0.000  0.000  0.000  0.000      1.000  1.000  1.000  1.000  1.000
80  200  0.000  0.000  0.000  0.000  0.000      1.000  1.000  1.000  1.000  1.000
10  400  0.271  2.843  0.000  0.000  8.000      1.087  8.000  1.000  1.000  8.000
20  400  0.000  0.000  0.000  0.000  0.000      1.000  1.000  1.000  1.000  1.000
80  400  0.000  0.000  0.000  0.000  0.000      1.000  1.000  1.000  1.000  1.000
Table 3b. Average number of factors estimated by the information criteria.

              L = 2
N   T    BIC_3  IC_1   HQ_2   HQ_4   HQ_6
10  25   7.744  8.000  7.999  7.536  8.000
20  25   2.852  2.925  2.750  1.898  1.980
80  25   1.998  2.000  8.000  1.991  1.997
10  50   7.192  8.000  6.459  3.151  8.000
20  50   1.999  1.999  1.992  1.951  1.979
80  50   2.000  2.000  2.010  2.000  2.000
10  100  5.973  8.000  1.941  1.843  8.000
20  100  1.997  2.000  1.993  1.981  1.983
80  100  2.000  2.000  2.000  2.000  2.000
10  200  3.784  8.000  1.882  1.868  8.000
20  200  1.994  2.000  1.995  1.990  1.984
80  200  2.000  2.000  2.000  2.000  2.000
10  400  2.024  8.000  1.867  1.860  8.000
20  400  1.984  2.000  1.993  1.990  1.983
80  400  2.000  2.000  2.000  2.000  2.000
Monte Carlo simulations in de Silva (2008) indicate that most of the PC and IC criteria of Bai and Ng (2002) have similar properties and, for this reason, only the IC_1 criterion is reported. The AIC_{1,2,3} and BIC_{1,2} criteria are also shown in de Silva (2008) to be unable to determine accurately the number of common factors in a panel at the sample sizes considered; these criteria are omitted from further discussion. Of the Hannan–Quinn criteria, the HQ_3 criterion is unable to determine the number of common factors at the sample sizes considered, while the HQ_1 criterion displays very poor finite sample properties; these two criteria are not reported further. The HQ_5 criterion has properties identical to those of the HQ_2 criterion at many of the sample sizes examined, and the HQ_7 criterion has properties similar to those of the HQ_6 criterion; the HQ_5 and HQ_7 criteria are, therefore, also omitted from discussion for the sake of brevity.

Some criteria overestimate the number of common factors when N is very small (N = 10); these include the IC_1 and HQ_6 criteria. The HQ_6 criterion, for example, chooses the maximum number of factors allowed when N = 10, no matter the size of T. The IC_1 and BIC_3 criteria also overestimate the number of common factors when N = 10, but the degree of overestimation improves considerably as T gets large. When both N and T are very small, the HQ_2 and HQ_4 criteria also overestimate the number of factors, but their accuracy for small values of T is considerably better than that of the other criteria. As N and T become larger, all the criteria examined in Table 3a estimate the number of common factors very accurately (whether L = 0 or 1).

We next turn to an approximate linear factor process. The finite sample properties of the information criteria are compared using a DGP which will also be used to examine the properties of the panel unit root tests outlined in Section 3. For this purpose a DGP very similar to that of Moon and Perron (2004) is used, as it has also been employed in part by Bai and Ng (2002).
The dynamic factor DGP takes the form
$$y_{it} = \alpha_i + y_{it}^0, \qquad (2.2)$$
with $y_{it}^0 = \rho_i y_{i,t-1}^0 + x_{it}$, where $\rho_i = 1$ for all $i = 1, \ldots, N$ under the null hypothesis. The following factor structure is specified for the term $x_{it}$:
$$x_{it} = \sum_{j=1}^{L} \gamma_{ij} f_{jt} + \sqrt{L}\,\varepsilon_{it}. \qquad (2.3)$$
The idiosyncratic shocks, $\varepsilon_{it}$, and the factors, $f_{jt}$, are generated independently as NID(0, 1) random variables for $j = 1, \ldots, L$, where L is the true number of factors. The factor loadings are $\gamma_{ij} \sim \mathrm{NID}(\mu_\gamma, 1)$, with $\mu_\gamma = 1$, for $i = 1, \ldots, N$ and $j = 1, \ldots, L$. The initial values of each series are set to zero, $y_{i0}^0 = 0$; see Abadir (1993) and Abadir and Hadri (2000) for the importance of initial values in autoregressive models. The N intercepts, $\alpha_i$, are generated as standard normal random variables, as in Moon and Perron (2004).

The right-hand panel of Table 3a reports the average number of factors chosen by the criteria in the presence of one factor under the heading L = 1, and Table 3b reports the average number chosen in the presence of two factors. In the presence of a single factor and when both N and T are very small (N = 10, T ≤ 50), none of the information criteria considered have good finite sample properties and so cannot be recommended for use under these circumstances. The only exception is the HQ_4 criterion, which has reasonable properties if T = 50. Interestingly, if T < N, the HQ_2 criterion does not monotonically improve in accuracy as N increases for constant T. When N is much larger than T, the criterion overestimates the number of factors. If N remains small, then all the criteria, with the exception of HQ_2 and HQ_4, have a tendency to drastically overestimate the number of common factors even if T is quite large. All the criteria improve in accuracy as both N and T increase. When N = 10, the HQ_2 and HQ_4 criteria are able to accurately determine the number of factors in the panel, provided T is large enough.

In the presence of two factors and when N = 10, there is little difference in the behaviour of the BIC_3, IC_1 and HQ_6 criteria compared to when L = 1. In addition, as T increases with N constant, the BIC_3 and HQ_6 criteria do not monotonically improve in accuracy, but underestimate the number of factors a little when N = 20. When T ≥ 50, the HQ_2 and HQ_4 criteria estimate the number of factors very accurately (except when N = 10 and T = 50).

The finite sample properties of the information criteria under scrutiny have, hitherto, both in this paper and elsewhere, largely presumed a normal (Gaussian) DGP. In practice, both economic and financial panels may have distinctly non-normal features; the observed distributions of such data often display notable skewness and/or kurtosis. The construction of the univariate AIC criterion and, by extension, the panel criteria adapted from it, is based on the likelihood of the estimated model, assuming Gaussianity. If the process generating the data is in fact non-normal, then the properties of the information criteria may be adversely affected. We examine the finite sample properties of the information criteria when both factors and errors are generated as i.i.d. t_5 variates, reflecting heavier tails in the error and factor processes than the normal distribution, with L = 1. Table 4 reports the results of this experiment.
Table 4. The average number of common factors chosen by information criteria.

N   T    BIC_3  IC_1   HQ_2   HQ_4   HQ_6
10  25   7.760  8.000  7.989  7.157  8.000
20  25   2.966  2.568  2.241  1.022  1.125
80  25   1.008  1.024  8.000  1.005  1.009
10  50   7.253  8.000  4.775  1.789  8.000
20  50   1.103  1.068  1.026  1.010  1.014
80  50   1.002  1.013  1.228  1.003  1.012
10  100  6.105  8.000  1.029  1.012  8.000
20  100  1.011  1.019  1.007  1.004  1.004
80  100  1.003  1.009  1.015  1.004  1.014
10  200  4.093  8.000  1.001  1.001  8.000
20  200  1.002  1.006  1.002  1.001  1.001
80  200  1.001  1.003  1.003  1.001  1.003
10  400  1.563  8.000  1.000  1.000  8.000
20  400  1.001  1.002  1.001  1.001  1.001
80  400  1.000  1.003  1.002  1.001  1.002

Note: L = 1, ε_it ∼ t_5, f_t ∼ t_5, γ_i ∼ N(1, 1).
All criteria show a marginal increase in the number of common factors estimated compared to Table 3a; this improves as the sample size increases. We also examined the properties of the tests and information criteria when both errors and factors are generated independently as centred and standardized χ²_4 variates. This distribution not only has considerably different tails from the normal, but also pronounced skewness. Our results indicate little change from the previous results obtained when factors and errors are generated as Gaussian variates. These and other results not discussed in detail are available from the authors on request.

Overall, the new criteria suggested in this paper, in particular HQ_2 and HQ_4, offer improved finite sample properties over the currently available criteria, especially when N is small. Unfortunately, if both N and T are very small, none of the criteria have finite sample properties which make them suitable for use in practice.
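As a concrete summary of the experimental design in this section, the following sketch generates one replication of the MPDGP of (2.2)-(2.3) under the null and applies the factor-selection function from the earlier sketch to the differenced data, per the PANIC-style procedure described above; select_num_factors is assumed to be in scope.

```python
import numpy as np

def simulate_mpdgp(N, T, L, mu_gamma=1.0, seed=0):
    """One draw from the MPDGP (2.2)-(2.3) with rho_i = 1 (the null)."""
    rng = np.random.default_rng(seed)
    gamma = rng.normal(mu_gamma, 1.0, size=(N, L))   # loadings ~ NID(mu_gamma, 1)
    f = rng.normal(size=(T, L))                      # factors ~ NID(0, 1)
    eps = rng.normal(size=(T, N))                    # idiosyncratic shocks
    x = f @ gamma.T + np.sqrt(L) * eps               # (T, N) innovations, eq. (2.3)
    y0 = np.cumsum(x, axis=0)                        # random walks with y0_{i0} = 0
    alpha = rng.normal(size=N)                       # standard normal intercepts
    return alpha + y0                                # y_it = alpha_i + y0_it

Y = simulate_mpdgp(N=20, T=100, L=1)
dY = np.diff(Y, axis=0)              # difference to handle non-stationarity
L_ic1, L_hq2 = select_num_factors(dY, L_max=8)
```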
3. PANEL UNIT ROOT TESTS

This section outlines three approaches to testing for a panel unit root that model cross-sectional correlation as an approximate linear factor process. Each regime uses factors in a different way to account for the cross-sectional correlation. The finite sample properties of these tests will be discussed in detail in Sections 4 and 5.

3.1. The panel test of Bai and Ng (2004)

Bai and Ng (2004, p. 1128) argue that attempting to test the stationarity of an observed variable that is a combination of I(0) and I(1) unobserved variables is a difficult and often misleading process. The authors assume an approximate L-factor dynamic linear factor model where, for $i = 1, \ldots, N$ and $t = 1, \ldots, T$,
$$y_{it} = \alpha_i + \gamma_i' f_t + e_{it}, \qquad (3.1)$$
with $e_{it} = \lambda_i e_{i,t-1} + \varepsilon_{it}$ and $f_t = \beta f_{t-1} + u_t$ (see Bai and Ng, 2004, sec. 4.1). The authors propose separating the observed process $y_{it}$ into its unobserved component parts: the idiosyncratic errors $e_{it}$ and the common factors $f_t$. Estimates of the common factors, $f_t$, and the associated L factor loadings, $\gamma_i$, are obtained by the method of principal components performed on the differenced data, $\Delta y_{it}$. If there is a trend in the DGP, the method is slightly different but essentially the same; for full details, see Bai and Ng (2004). The idiosyncratic error of each cross-section is tested separately (for the null hypothesis of a unit root, $H_0: \lambda_i = 1$) using augmented Dickey–Fuller (ADF) tests. When L = 1, the factor is also tested for a unit root ($H_0: \beta = 1$) using an ADF test procedure. In the presence of multiple factors, Bai and Ng use modified versions of the tests of Stock and Watson (1988); the factor procedure itself is not explicitly examined here.

To determine a suitable panel unit root test, Bai and Ng suggest pooling the p-values of the individual univariate test statistics and forming a variant of the Fisher test,
$$P_{\hat e} = \frac{-2 \sum_{i=1}^{N} \log(p_{\hat e_i}) - 2N}{\sqrt{4N}}, \qquad (3.2)$$
where $p_{\hat e_i}$ is the p-value of the ADF test performed on cross-section i. We obtain the p-values of the univariate test statistics using the approximate asymptotic (T → ∞) distribution functions of ADF test statistics estimated by MacKinnon (1994). The resulting Fisher test has a standard normal limiting distribution as N → ∞ under the joint null hypothesis $H_0: \lambda_i = 1$ for all $i = 1, \ldots, N$. The null hypothesis is rejected for suitably large, positive values of the test statistic, $P_{\hat e}$. Large, negative values of the ADF test statistics produce p-values, $p_{\hat e_i}$, of zero when estimated by the method proposed by MacKinnon (1994). When applying the Fisher test, these are arbitrarily set to $p_{\hat e_i} = \epsilon$, where ε is a small, positive number; we use ε = 0.000001.

An alternative test statistic, examined by Westerlund and Larsson (2007), is also based upon the concept of separation of components. Unlike the Fisher test based on the p-values of individual ADF tests, it uses standardized versions thereof. The test statistic is given by
$$Z_{\mathrm{PANIC}} = \frac{\sqrt{N}\left(\frac{1}{N}\sum_{i=1}^{N} DF_{\hat e}(i) - E(B)\right)}{\sqrt{V(B)}},$$
where $DF_{\hat e}(i)$ is the Dickey–Fuller test statistic calculated on the estimated idiosyncratic errors ($\hat e_{it}$) in cross-section i, E(B) is the expected value of the Dickey and Fuller (1979) distribution and V(B) is the variance of that distribution. In this paper, we use the values of E(B) and V(B) derived by Nabeya (1999).

Westerlund and Larsson note a finite sample bias in the $Z_{\mathrm{PANIC}}$ statistic and propose a bias-corrected test statistic to remedy the problem. The bias-corrected statistic is
$$Z^{+}_{\mathrm{PANIC}} = Z_{\mathrm{PANIC}} + \frac{a_N}{\sqrt{N}}\left(1 + \frac{1.67}{\sqrt{V(B)}}\right), \quad \text{where} \quad a_N = \frac{1}{N}\sum_{i=1}^{N} \hat\gamma_i' \left(\frac{1}{N}\sum_{i=1}^{N} \hat\gamma_i \hat\gamma_i'\right)^{-1} \hat\gamma_i.$$
Both the $Z_{\mathrm{PANIC}}$ and $Z^{+}_{\mathrm{PANIC}}$ test statistics are distributed as standard normal variates as N, T → ∞ and reject the null hypothesis for large, negative values.
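A minimal sketch of the pooled Fisher statistic (3.2), assuming ADF p-values for the estimated idiosyncratic errors are already available; statsmodels' adfuller is used here as one possible choice, and conveniently its p-values are MacKinnon's approximations:

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

def fisher_panic(e_hat, eps=1e-6):
    """e_hat: (T, N) estimated idiosyncratic errors.
    Returns the pooled Fisher statistic P_e of (3.2)."""
    N = e_hat.shape[1]
    # regression="n": no deterministic terms in the ADF regression
    pvals = np.array([adfuller(e_hat[:, i], regression="n")[1]
                      for i in range(N)])
    pvals = np.maximum(pvals, eps)       # guard against p-values of zero
    return (-2.0 * np.log(pvals).sum() - 2.0 * N) / np.sqrt(4.0 * N)
```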
3.2. The tests of Moon and Perron (2004)

Moon and Perron use the dynamic panel model with L factors as given in (2.2), which is repeated here for convenience:
$$y_{it} = \alpha_i + y_{it}^0,$$
where $y_{it}^0 = \rho_i y_{i,t-1}^0 + x_{it}$ and $x_{it} = \gamma_i' f_t + e_{it}$. Here $\gamma_i$ and $f_t$ are both $L \times 1$ vectors. The estimates of the L factors, a $T \times L$ matrix $\hat f = (\hat f_1, \hat f_2, \ldots, \hat f_T)'$, and the factor loadings, an $N \times L$ matrix $\hat\Gamma_L = (\hat\gamma_1, \hat\gamma_2, \ldots, \hat\gamma_N)'$, are obtained by the method of principal components performed on the residuals
$$\hat x = y - \hat\rho_{\mathrm{pool}}\, y_{-1}, \quad \text{where} \quad \hat\rho_{\mathrm{pool}} = \frac{\mathrm{tr}(y_{-1}' y)}{\mathrm{tr}(y_{-1}' y_{-1})}, \qquad (3.3)$$
$\mathrm{tr}(A)$ means trace(A), and
$$x = (x_1, x_2, \ldots, x_N), \qquad x_i = (x_{i1}, x_{i2}, \ldots, x_{iT})',$$
$$y = (y_1, y_2, \ldots, y_N), \qquad y_i = (y_{i1}, y_{i2}, \ldots, y_{iT})',$$
$$y_{-1} = (y_{-1,1}, y_{-1,2}, \ldots, y_{-1,N}), \qquad y_{-1,i} = (y_{i0}, y_{i1}, \ldots, y_{i,T-1})'.$$
Moon and Perron state that the matrix $Q_{\hat\Gamma_L} = I - \hat\Gamma_L(\hat\Gamma_L'\hat\Gamma_L)^{-1}\hat\Gamma_L'$, orthogonal to the space of the estimated factor loadings, is a consistent estimator of $Q_{\Gamma_L} = I - \Gamma_L(\Gamma_L'\Gamma_L)^{-1}\Gamma_L'$, such that
$$\left\|Q_{\hat\Gamma_L} - Q_{\Gamma_L}\right\| = O_p\!\left(\max\left[\frac{1}{\sqrt{N}}, \frac{1}{\sqrt{T}}\right]\right),$$
where $\|A\|$ denotes the Euclidean norm, $\|A\| = (\mathrm{tr}(A'A))^{1/2}$. The residuals $\hat e = \hat x Q_{\hat\Gamma_L}$ are, therefore, cross-sectionally independent asymptotically, allowing the application of a unit root test across cross-sections. Phillips and Sul (2003) independently propose a similar method of orthogonalization, which is not considered in this paper.

In order to standardize their tests, Moon and Perron require several long-run variance estimators. They define $\hat\omega_e^2 = \frac{1}{N}\sum_{i=1}^{N}\hat\omega_{e,i}^2$ as the two-sided long-run panel variance, where
$$\hat\omega_{e,i}^2 = \sum_{j=-T+1}^{T-1} w\!\left(\frac{j}{h_\omega}\right)\hat\Lambda_i(j), \qquad \hat\Lambda_i(j) = \frac{1}{T}\sum_t \hat e_{it}\,\hat e_{i,t+j},$$
while $\hat\phi_e^4 = \frac{1}{N}\sum_{i=1}^{N}\hat\omega_{e,i}^4$. Similarly, a one-sided long-run variance is defined as $\hat\lambda_e^N = \frac{1}{N}\sum_{i=1}^{N}\hat\lambda_{e,i}$, where
$$\hat\lambda_{e,i} = \sum_{j=1}^{T-1} w\!\left(\frac{j}{h_\lambda}\right)\hat\Lambda_i(j).$$
In the above, w(·) is a kernel function and $h_\lambda$ and $h_\omega$ are bandwidth parameters satisfying assumptions 10-13 of Moon and Perron (2004, p. 91). The kernel function w(·) used here is the quadratic spectral kernel without prewhitening.

Using this asymptotically cross-sectionally independent panel, Moon and Perron derive two panel unit root test statistics, $t_a^*$ and $t_b^*$, for the null hypothesis of a unit root, $H_0: \rho_i = 1$ for all $i = 1, \ldots, N$, against the alternative hypothesis, $H_A: \rho_i < 1$ for some or all $i = 1, \ldots, N$.
The statistics given below have standard normal limiting distributions under the null hypothesis as N, T → ∞ jointly, where
$$\hat\rho^{*}_{\mathrm{pool}} = \frac{\mathrm{tr}(y_{-1}' Q_{\hat\Gamma_L}\, y) - NT\,\hat\lambda_e^N}{\mathrm{tr}(y_{-1}' Q_{\hat\Gamma_L}\, y_{-1})}.$$
The first statistic is
$$t_a^{*} = \frac{\sqrt{N}\,T\,(\hat\rho^{*}_{\mathrm{pool}} - 1)}{\sqrt{2\hat\phi_e^4/\hat\omega_e^4}}$$
and the second is
$$t_b^{*} = \sqrt{N}\,T\,(\hat\rho^{*}_{\mathrm{pool}} - 1)\sqrt{\frac{1}{NT^2}\,\mathrm{tr}(y_{-1}' Q_{\hat\Gamma_L}\, y_{-1})}\;\frac{\hat\omega_e}{\hat\phi_e^2}.$$
The null hypothesis of a unit root (in all panel units) is rejected for suitably large, negative values of the test statistics.
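A minimal sketch of the orthogonalization step only (the kernel-based long-run variance estimation needed for t*_a and t*_b is omitted); the loadings are estimated by principal components via the SVD, assuming the residuals x̂ from (3.3) have already been formed:

```python
import numpy as np

def mp_defactor(x_hat, L):
    """Project the residuals x_hat (T, N) off the space of the estimated
    factor loadings, as in Moon and Perron (2004)."""
    _, _, Vt = np.linalg.svd(x_hat, full_matrices=False)
    G = Vt[:L].T                                   # (N, L) loadings estimate
    Q = np.eye(x_hat.shape[1]) - G @ np.linalg.inv(G.T @ G) @ G.T
    return x_hat @ Q                               # asymptotically cross-sectionally independent
```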
3.3. The test of Pesaran (2007)

Pesaran (2007) considers the factor model
$$y_{it} = (1 - \rho_i)\alpha_i + \rho_i y_{i,t-1} + \gamma_i f_t + \varepsilon_{it} \qquad (3.4)$$
and assumes the existence of a single factor (L = 1), requiring $\gamma_i$ and $f_t$ to be scalar. Defining $\bar\gamma = \frac{1}{N}\sum_{i=1}^{N}\gamma_i$ and assuming $\bar\gamma \neq 0$ both in the finite sample and as N → ∞, Pesaran proxies the common factor $f_t$ by the cross-sectional mean $\bar y_t = \frac{1}{N}\sum_{i=1}^{N} y_{it}$ and its lagged value $\bar y_{t-1} = \frac{1}{N}\sum_{i=1}^{N} y_{i,t-1}$, where $y_{i0}$ is presumed to have a fixed value.

The t-statistics from individual cross-sections for the null hypothesis $\rho_i = 1$ are obtained from the cross-sectionally augmented Dickey–Fuller regression (see Pesaran, 2007, equation (6)):
$$\Delta y_{it} = a_i + b_i y_{i,t-1} + c_i \bar y_{t-1} + d_i \Delta\bar y_t + e_{it}. \qquad (3.5)$$
The individual t-statistics are defined as
$$t_i(N, T) = \frac{\Delta y_i' M_w y_{i,-1}}{\hat\sigma_i \left(y_{i,-1}' M_w y_{i,-1}\right)^{1/2}}$$
for $i = 1, \ldots, N$, where
$$\Delta y_i = (\Delta y_{i1}, \Delta y_{i2}, \ldots, \Delta y_{iT})', \qquad y_{i,-1} = (y_{i0}, y_{i1}, \ldots, y_{i,T-1})',$$
$$\Delta\bar y = (\Delta\bar y_1, \Delta\bar y_2, \ldots, \Delta\bar y_T)', \qquad \bar y_{-1} = (\bar y_0, \bar y_1, \ldots, \bar y_{T-1})', \qquad \iota = (1, 1, \ldots, 1)',$$
$$W = (\iota, \Delta\bar y, \bar y_{-1}), \qquad M_w = I_T - W(W'W)^{-1}W',$$
$$G_i = (y_{i,-1}, W), \qquad M_{i,w} = I_T - G_i(G_i'G_i)^{-1}G_i',$$
and $\hat\sigma_i^2 = \frac{\Delta y_i' M_{i,w}\, \Delta y_i}{T - 4}$. To create a panel test, the t-statistics are averaged to form a cross-sectionally augmented version of the test statistic developed by Im et al. (2003), the CIPS statistic,
$$\mathrm{CIPS}(N, T) = N^{-1}\sum_{i=1}^{N} t_i(N, T).$$
The presence of cross-sectional correlation makes determining the limiting distribution of the CIPS statistic as N, T → ∞ difficult. A truncated version of the CIPS test statistic, CIPS*, is shown to converge to a limiting distribution, as all required moments exist by construction. This distribution is simulated and tabulated by Pesaran (2007). The null hypothesis of a unit
root is rejected for suitably large, negative values of the test statistics. Pesaran extends his test to the case where the individual-specific error terms are also serially correlated; see Pesaran (2007, sec. 5) for details. In Section 5.3 he gives a modified version of the CIPS test having the same limit distribution as the one given above. In some of the Monte Carlo results reported below, this version of the statistic is used and is denoted CIPS_a (resp. CIPS*_a), where applicable.
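A minimal sketch of the CIPS construction under the assumptions above (a single factor and no additional serial correlation), fitting each cross-sectionally augmented regression (3.5) by ordinary least squares:

```python
import numpy as np

def cips(Y):
    """Y: (T+1, N) panel including the initial observations y_{i0}.
    Returns the CIPS statistic: the average of the CADF t-ratios on b_i."""
    Tp1, N = Y.shape
    T = Tp1 - 1
    ybar = Y.mean(axis=1)                           # cross-sectional means
    dY, dybar = np.diff(Y, axis=0), np.diff(ybar)   # first differences
    t_stats = []
    for i in range(N):
        # regressors: intercept, y_{i,t-1}, ybar_{t-1}, d(ybar_t)
        X = np.column_stack([np.ones(T), Y[:-1, i], ybar[:-1], dybar])
        coef, *_ = np.linalg.lstsq(X, dY[:, i], rcond=None)
        resid = dY[:, i] - X @ coef
        s2 = resid @ resid / (T - 4)                # matches sigma_i^2 above
        XtX_inv = np.linalg.inv(X.T @ X)
        t_stats.append(coef[1] / np.sqrt(s2 * XtX_inv[1, 1]))
    return np.mean(t_stats)
```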
4. MONTE CARLO SIMULATION: SIZE OF PANEL UNIT ROOT TESTS

4.1. Empirical size of the tests in a cross-sectionally independent DGP

This sub-section discusses the size properties of the panel unit root tests in the presence of cross-sectional independence. This is of relevance because it is important to ascertain whether tests designed to be used in the presence of cross-sectional dependence behave reliably when such correlation is absent in truth; in the next sub-section, we turn to the size of the tests when one or more factor processes are indeed present. As suggested in the Introduction, the tests discussed in Section 3 are compared to the IPS test of Im et al. (2003), which does not allow for cross-sectional correlation, though it is widely applied in the literature. Previous studies, such as Strauss and Yigit (2003), have shown that this test has poor finite sample properties in the presence of cross-sectional correlation. The test is based on the average of the t-statistics from each cross-section unit in the test regression
$$\Delta y_{it} = a_i + b_i y_{i,t-1} + \varepsilon_{it}, \qquad (4.1)$$
where $\varepsilon_{it}$ is assumed to be cross-sectionally independent. The average of the t-statistics is standardized using moments calculated by simulation and provided by Im et al. (2003). The test rejects the null hypothesis of a panel unit root for large, negative values of the test statistic and has a standard normal limiting distribution. The CIPS tests discussed in Section 3 are an extension of this test.

All results here and in what follows are reported at a nominal 5% significance level. In all cases, there is almost no difference in performance between the CIPS and CIPS* tests, and so only results for the former are reported in the upcoming tables. This suggests that the truncation Pesaran (2007) employs to ensure the existence of the appropriate moments may be innocuous. As a starting point, however, we compare the size of the tests under the extremely simple DGP $y_{it} = \varepsilon_{it}$ with $\varepsilon_{it} = \varepsilon_{i,t-1} + v_{it}$ and $v_{it} \sim \mathrm{NID}(0, 1)$; thus the DGP is just that initially used in Section 2. Table 5 reports the empirical size of the various tests under this simple cross-sectionally independent DGP. For those tests that require the estimation of a factor process, we assume L = 1 in calculating the test statistics.
352
S. de Silva, K. Hadri and A. R. Tremayne Table 5. Empirical size of the tests under cross-sectional independence. IPS t ∗a t ∗b Fisher Z PANIC Z+ PANIC
N
T
CIPS
10
25
0.039
0.133
0.089
0.073
0.056
0.017
0.053
20 80 10
25 25 50
0.036 0.018 0.042
0.104 0.082 0.134
0.078 0.068 0.081
0.065 0.053 0.072
0.043 0.034 0.053
0.017 0.024 0.016
0.045 0.060 0.044
20 80 10
50 50 100
0.038 0.036 0.045
0.109 0.083 0.136
0.076 0.068 0.086
0.067 0.053 0.070
0.049 0.039 0.052
0.021 0.026 0.014
0.052 0.051 0.054
20 80
100 100
0.041 0.041
0.112 0.080
0.075 0.063
0.063 0.053
0.050 0.045
0.021 0.028
0.051 0.054
10 20 80
200 200 200
0.048 0.048 0.041
0.140 0.106 0.084
0.081 0.070 0.066
0.072 0.057 0.057
0.060 0.049 0.051
0.014 0.021 0.033
0.049 0.055 0.066
10 20
400 400
0.042 0.041
0.142 0.103
0.083 0.069
0.072 0.061
0.054 0.047
0.015 0.022
0.055 0.046
80
400
0.046
0.082
0.063
0.062
0.057
0.038
0.067
Table 5 shows that the IPS test has size reasonably close to the nominal level, although for small values of T there is some under-rejection of the null hypothesis. The CIPS test has a size closer to the nominal level overall and is generally reliable. However, the $t_a^*$ and $t_b^*$ tests systematically overreject the null. In the former case this is quite dramatic, with a rejection frequency almost three times the nominal value on occasion. The Fisher test overrejects the null hypothesis marginally, but not to the same degree as the $t_a^*$ and $t_b^*$ tests. The $Z_{\mathrm{PANIC}}$ test statistic shows slight under-rejection of the null hypothesis when N is large, but otherwise has a size close to the nominal 5%. The bias-corrected version of the test, $Z^{+}_{\mathrm{PANIC}}$, does not show this behaviour, but under-rejects the null hypothesis consistently. Overall, the Fisher, $Z_{\mathrm{PANIC}}$ and CIPS tests all have quite reliable size, even though an erroneous assumption that L = 1 is used in their construction.

The results in Table 5 indicate that some of the panel tests have a less reliable size than the IPS test when the panel is cross-sectionally independent in truth. In practice, therefore, it may be prudent to recommend application of a pre-test for the existence of cross-sectional correlation (such as that due to Pesaran, 2004) before considering implementation of certain panel tests that allow for its presence.

4.2. Empirical size of the tests in the presence of cross-sectional dependence

Moon and Perron's (2004) DGP used in Section 2 will be used initially, with some minor alterations. However, in order to obtain a comprehensive overview of the finite sample properties of the tests, other DGPs will also be used. One of the major assumptions of the CIPS test(s) is that $\bar\gamma \neq 0$, both in finite samples and in the limit as N → ∞. The DGP of Moon and Perron (2004) (MPDGP) specifies that $\gamma_{ij} \sim \mathrm{NID}(\mu_\gamma, 1)$ for $i = 1, \ldots, N$ and $j = 1, \ldots, L$, and the parameter $\mu_\gamma$ controls the probability distribution of $\bar\gamma$ for fixed N. Setting $\mu_\gamma = 0$ abrogates Pesaran's assumption, while setting $\mu_\gamma = 1$ is consistent with the assumptions underlying the CIPS test. We examine both values in order to determine the importance of this assumption for the finite sample properties of the CIPS test.
Table 6. Empirical size of the tests with one known factor; MPDGP.

μ_γ = 0:
N   T    IPS    t*_a   t*_b   Fisher  Z_PANIC  Z+_PANIC  CIPS
10  25   0.097  0.115  0.077  0.071   0.048    0.015     0.086
20  25   0.118  0.085  0.060  0.063   0.049    0.021     0.096
80  25   0.218  0.074  0.068  0.062   0.042    0.030     0.135
10  50   0.077  0.116  0.076  0.071   0.048    0.016     0.081
20  50   0.113  0.085  0.061  0.063   0.051    0.022     0.096
80  50   0.224  0.082  0.077  0.060   0.045    0.029     0.144
10  100  0.092  0.135  0.082  0.066   0.055    0.015     0.073
20  100  0.136  0.102  0.070  0.056   0.049    0.022     0.095
80  100  0.274  0.065  0.053  0.051   0.038    0.025     0.133
10  200  0.094  0.135  0.079  0.073   0.060    0.018     0.078
20  200  0.138  0.109  0.071  0.064   0.052    0.021     0.092
80  200  0.287  0.074  0.061  0.050   0.049    0.032     0.150
10  400  0.091  0.138  0.084  0.069   0.059    0.017     0.075
20  400  0.136  0.100  0.066  0.061   0.051    0.023     0.092
80  400  0.288  0.070  0.056  0.051   0.044    0.030     0.163

μ_γ = 1:
N   T    IPS    t*_a   t*_b   Fisher  Z_PANIC  Z+_PANIC  CIPS
10  25   0.127  0.115  0.072  0.070   0.052    0.011     0.052
20  25   0.169  0.093  0.068  0.065   0.044    0.018     0.059
80  25   0.311  0.128  0.118  0.063   0.039    0.026     0.056
10  50   0.125  0.129  0.080  0.066   0.050    0.015     0.053
20  50   0.188  0.104  0.071  0.066   0.048    0.021     0.057
80  50   0.339  0.100  0.088  0.055   0.041    0.028     0.053
10  100  0.135  0.136  0.083  0.062   0.050    0.013     0.064
20  100  0.201  0.104  0.072  0.063   0.050    0.022     0.051
80  100  0.346  0.083  0.069  0.058   0.047    0.030     0.063
10  200  0.136  0.143  0.087  0.060   0.050    0.013     0.046
20  200  0.203  0.108  0.073  0.066   0.056    0.025     0.057
80  200  0.365  0.081  0.062  0.056   0.045    0.028     0.059
10  400  0.130  0.138  0.085  0.065   0.051    0.015     0.057
20  400  0.209  0.107  0.069  0.058   0.045    0.018     0.052
80  400  0.369  0.076  0.062  0.058   0.045    0.027     0.063
Panel unit root tests in the presence of cross-sectional dependence
353
Table 7. Empirical size of the tests with two known factors; MPDGP.
N    T    IPS    t*a    t*b    Fisher  Z_PANIC  Z+_PANIC  CIPS
10   25   0.107  0.161  0.104  0.078   0.063    0.004     0.071
20   25   0.160  0.114  0.083  0.064   0.046    0.008     0.085
80   25   0.290  0.133  0.120  0.062   0.049    0.025     0.122
10   50   0.106  0.159  0.097  0.072   0.058    0.006     0.068
20   50   0.172  0.114  0.079  0.061   0.048    0.010     0.088
80   50   0.314  0.092  0.078  0.057   0.043    0.020     0.130
10   100  0.127  0.154  0.098  0.074   0.064    0.006     0.069
20   100  0.181  0.104  0.066  0.068   0.054    0.008     0.087
80   100  0.343  0.080  0.063  0.060   0.051    0.021     0.125
10   200  0.126  0.161  0.100  0.077   0.069    0.005     0.071
20   200  0.177  0.117  0.077  0.064   0.054    0.008     0.087
80   200  0.345  0.075  0.061  0.053   0.045    0.021     0.140
10   400  0.121  0.148  0.093  0.069   0.058    0.004     0.071
20   400  0.192  0.116  0.081  0.056   0.048    0.009     0.085
80   400  0.336  0.074  0.057  0.055   0.049    0.023     0.126
Note: μ_γ = 1.
The Z_PANIC test also has a robust size, but the bias-adjusted Z+_PANIC test is too conservative. The size of tests where the selected value of L may be greater than one is next calculated under MPDGP using the HQ2 and BIC3 criteria to determine the number of common factors (L = 1 in truth); the results are not reported in tabular form to save space. Perhaps unsurprisingly, the size of the tests depends critically on the accuracy of the criteria. As outlined in Section 2, when N and T are very small, the criteria severely overestimate the number of factors and this may be responsible for the upward size distortion observed. When N and T are increased, the criteria determine the number of factors more accurately and the size of the tests more closely resembles that given in Table 6. Table 7 reports the size of the tests in the presence of two factors and when μ_γ = 1. The IPS test remains grossly oversized and there is little change to the properties of the t*a, t*b, Fisher and Z_PANIC tests, compared to Table 6, except a slight increase in size, especially when N is small. The presence of a second factor causes the Z+_PANIC test to under-reject in a more pronounced way. Unfortunately, the CIPS test is sensitive to the assumption that there is a single common factor and, as when L = 1 and μ_γ = 0, the size of the tests increases substantially as N increases. Although the results in Table 6 depict an encouraging view of the size properties of the CIPS tests when L = 1 and μ_γ = 1, they depend pivotally on the form of the DGP used in the Monte Carlo experiment (e.g. deteriorating when L = 2 in Table 7). Changing the DGP slightly to emulate that in Bai and Ng (2004) (BNDGP), the data are now generated with a single factor and scalar γ_i as

y_it = γ_i f_t + e_it,    (4.2)
with γ_i ∼ NID(1, 1), while

f_t = 0.8 f_t−1 + v_t,    v_t ∼ NID(0, 1),

and

e_it = e_i,t−1 + ε_it,    ε_it ∼ NID(0, 1).

Table 8a. Empirical size of the tests; BNDGP.
N    T    IPS    t*a    t*b    Fisher  Z_PANIC  Z+_PANIC  CIPS_a
10   25   0.199  0.178  0.110  0.070   0.043    0.013     0.023
20   25   0.307  0.190  0.133  0.064   0.031    0.016     0.013
80   25   0.543  0.439  0.405  0.051   0.014    0.009     0.003
10   50   0.367  0.177  0.117  0.064   0.044    0.014     0.021
20   50   0.542  0.160  0.113  0.055   0.042    0.018     0.006
80   50   0.848  0.305  0.271  0.047   0.036    0.023     0.000
10   100  0.594  0.202  0.142  0.065   0.051    0.014     0.018
20   100  0.819  0.147  0.103  0.061   0.048    0.022     0.005
80   100  0.993  0.201  0.170  0.052   0.032    0.021     0.000
10   200  0.759  0.191  0.122  0.068   0.053    0.015     0.029
20   200  0.938  0.151  0.106  0.063   0.044    0.018     0.005
80   200  1.000  0.141  0.112  0.050   0.040    0.026     0.000
10   400  0.841  0.165  0.103  0.073   0.057    0.019     0.040
20   400  0.974  0.142  0.096  0.059   0.047    0.019     0.015
80   400  1.000  0.106  0.084  0.052   0.046    0.031     0.000
Note: L = 1 known.
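To make the simulation design concrete, the following is a minimal sketch of how a BNDGP panel of this kind might be generated; the function name, array layout and seed handling are our own choices rather than the authors' code.

```python
import numpy as np

def simulate_bndgp(N=20, T=100, rho_f=0.8, seed=0):
    """Sketch of the BNDGP in (4.2): y_it = gamma_i f_t + e_it with an
    AR(1) factor and a unit root in every idiosyncratic error."""
    rng = np.random.default_rng(seed)
    gamma = rng.normal(1.0, 1.0, size=N)        # gamma_i ~ NID(1, 1)
    f = np.zeros(T)
    v = rng.normal(size=T)                      # v_t ~ NID(0, 1)
    for t in range(1, T):
        f[t] = rho_f * f[t - 1] + v[t]          # f_t = 0.8 f_{t-1} + v_t
    e = np.cumsum(rng.normal(size=(N, T)), axis=1)  # e_it = e_i,t-1 + eps_it
    return gamma[:, None] * f[None, :] + e      # N x T panel of y_it
```

Setting rho_f = 0.4 reproduces the alternative, less persistent factor process considered below.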
This DGP has a unit root in every cross-section and a highly autocorrelated factor process. The estimated size of the tests under this regime is reported in Table 8a. Unlike the tests of Bai and Ng, and Moon and Perron, the CIPS test statistic must be adjusted to allow explicitly for serial correlation in the DGP, and so we employ CIPS_a. Table 8a indicates that the CIPS_a test has very poor size, in virtually always under-rejecting the null hypothesis. Under BNDGP, the IPS, t*a and t*b tests display pronounced upward size distortion. Of course, IPS was not designed with the intention of using it in conjunction with dynamic factor DGPs, and the two t-tests of Moon and Perron were designed with a somewhat different DGP in mind. Thus, in practice, some caution should be exercised in their use, dependent upon what type of DGP is envisaged. The Fisher test has an empirical size reasonably close to the nominal 5% value. The Z_PANIC test under-rejects the null hypothesis slightly, and this is more pronounced in its bias-adjusted counterpart. Changing only the factor process so that it is generated as f_t = 0.4 f_t−1 + v_t with v_t ∼ NID(0, 1), there is little material difference in the behaviour of the tests. Finally, the DGP used by Pesaran (2007) (PDGP) is used to examine the size of the tests. Following that paper, it takes the form

y_it = (1 − φ_i)μ_i + φ_i y_i,t−1 + u_it,    u_it = γ_i f_t + ε_it,

where μ_i ∼ i.i.d. N(0, 1); f_t ∼ i.i.d. N(0, 1); γ_i ∼ i.i.d. U[0, 0.2]; ε_it ∼ i.i.d. N(0, σ_i²); and σ_i² ∼ i.i.d. U[0.5, 1.5]. Under the null hypothesis of a common panel unit root we have φ_i = 1, i = 1, . . . , N; the results are reported in Table 8b.
Table 8b. Empirical size of the tests; PDGP.
N    T    IPS    t*a    t*b    Fisher  Z_PANIC  Z+_PANIC  CIPS
10   25   0.038  0.014  0.086  0.078   0.050    0.017     0.052
20   25   0.030  0.006  0.068  0.070   0.045    0.020     0.052
80   25   0.019  0.001  0.065  0.064   0.047    0.035     0.053
10   50   0.040  0.038  0.085  0.068   0.053    0.015     0.046
20   50   0.042  0.019  0.072  0.058   0.046    0.020     0.050
80   50   0.031  0.007  0.067  0.060   0.043    0.029     0.054
10   100  0.045  0.061  0.075  0.069   0.052    0.013     0.050
20   100  0.039  0.043  0.070  0.061   0.052    0.021     0.053
80   100  0.041  0.021  0.057  0.056   0.045    0.031     0.062
10   200  0.041  0.095  0.082  0.073   0.055    0.018     0.050
20   200  0.049  0.063  0.068  0.062   0.046    0.021     0.053
80   200  0.044  0.041  0.060  0.054   0.047    0.030     0.060
10   400  0.045  0.110  0.077  0.064   0.057    0.018     0.053
20   400  0.045  0.092  0.077  0.057   0.051    0.021     0.049
80   400  0.047  0.054  0.056  0.060   0.048    0.032     0.063
Note: L = 1 known.
The version of PDGP used here corresponds to what Pesaran (2007) calls low-level cross-section dependence. The IPS test does not exhibit the significant upward size distortion that previous tables have shown and other authors, such as Strauss and Yigit (2003), have noted. There is some under-rejection of the null hypothesis, particularly when T < N. In addition, when T is small, the t*a test is somewhat conservative, although its empirical size is too big when N is small and T is large. The t*b and Fisher tests exhibit a slight tendency to overreject the null hypothesis, while the CIPS and Z_PANIC tests have excellent size. Again, the Z+_PANIC test is too conservative. The experiment was repeated, generating γ_i ∼ i.i.d. U[0, 1] so as to increase the level of cross-sectional correlation in the DGP; see the next paragraph for details of how this may be assessed. The results remain broadly similar, though the IPS test does now overreject the null hypothesis, conforming with our earlier evidence and that of Strauss and Yigit (2003). Detailed results are not presented to save space. In order to compare the different DGPs, some measure of the degree of cross-sectional dependence in each one must be used. In the case of a single factor, Im and Pesaran (2003) suggest the correlation coefficient between x_it and x_jt in the process x_it = γ_i f_t + ε_it. For given γ_i and γ_j the cross-section correlation coefficient is

ψ_ij = γ_i γ_j / {(1 + γ_i²)(1 + γ_j²)}^{1/2},    i ≠ j,
if both σ_ε² and σ_f² are unity. The way in which γ_i is generated, therefore, has a profound effect on the degree of cross-sectional correlation in each DGP. To compare the degree of cross-sectional correlation in each DGP, the average off-diagonal cross-sectional correlation, ψ̄, was simulated over 5000 replications under different regimes for generating γ_i. In the case of MPDGP, simulating ψ̄ with N = 80, we find that, if γ_i ∼ NID(μ_γ, 1), ψ̄ = 0.004 when μ_γ = 0, and ψ̄ = 0.198 when μ_γ = 1. For PDGP, if γ_i ∼ i.i.d. U[0, 0.2], then ψ̄ = 0.010, and if γ_i ∼ i.i.d. U[0, 1], then ψ̄ = 0.192. Clearly, both the mean and the variance of the factor loadings have a profound effect on the cross-sectional correlation properties of the DGPs.

4.3. Effects of serial correlation and trend

We first examine the effects of serial correlation in the idiosyncratic error process on the behaviour of tests in conjunction with MPDGP. Bai and Ng (2004) suggest that if the nature of serial correlation is not known (as will generally be the case in practice), then the number of lags to be included in the univariate augmented Dickey–Fuller test regressions from which the PANIC panel test statistics are calculated may be chosen automatically. The automatic lag-length selection criterion they employ is

p = int{4 (min{N, T}/100)^{0.25}},

and this is the number of lags used in the test behaviour reported in Table 9. The Z_PANIC and Z+_PANIC tests are included in the table to help determine whether this method of allowing for serial correlation confers any advantage over the other methods employed by the tests discussed in this paper. As with other serially correlated error processes, CIPS_a is used. In Table 9, the idiosyncratic error process is generated as ε_it = φ_i ε_i,t−1 + u_it, where u_it ∼ NID(0, 1), with φ_i = 0.5 for all i = 1, . . . , N.

Table 9. Empirical size of the tests; MPDGP. AR(1) positive serial correlation.
N    T    IPS    t*a    t*b    Fisher  Z_PANIC  Z+_PANIC  CIPS_a
10   25   0.012  0.009  0.005  0.077   0.050    0.016     0.046
20   25   0.010  0.001  0.002  0.067   0.049    0.018     0.039
80   25   0.009  0.000  0.000  0.062   0.045    0.029     0.033
10   50   0.010  0.013  0.013  0.067   0.053    0.014     0.046
20   50   0.007  0.002  0.002  0.063   0.048    0.022     0.042
80   50   0.005  0.000  0.000  0.062   0.043    0.031     0.036
10   100  0.009  0.033  0.037  0.066   0.052    0.016     0.054
20   100  0.005  0.003  0.003  0.063   0.051    0.021     0.048
80   100  0.006  0.000  0.000  0.058   0.049    0.031     0.046
10   200  0.007  0.033  0.039  0.070   0.058    0.016     0.045
20   200  0.006  0.011  0.014  0.066   0.059    0.024     0.045
80   200  0.004  0.001  0.001  0.055   0.046    0.032     0.059
10   400  0.009  0.038  0.043  0.068   0.050    0.014     0.060
20   400  0.003  0.032  0.043  0.061   0.047    0.021     0.054
80   400  0.005  0.000  0.001  0.057   0.050    0.032     0.054
Note: ε_it = 0.5 ε_i,t−1 + u_it; L = 1 known.
Table 10. Empirical size of the tests; MPDGP. MA(1) positive serial correlation.

Panel A: ε_it = u_it − 0.5 u_i,t−1
N    T    t*a    t*b    Fisher  Z_PANIC  Z+_PANIC  CIPS_a
10   25   0.699  0.543  0.192   0.141    0.058     0.411
20   25   0.760  0.642  0.260   0.197    0.116     0.561
80   25   0.807  0.769  0.524   0.449    0.380     0.759
10   50   0.659  0.409  0.228   0.188    0.080     0.569
20   50   0.850  0.707  0.329   0.275    0.170     0.750
80   50   0.876  0.803  0.334   0.275    0.218     0.936
10   100  0.512  0.198  0.274   0.235    0.106     0.672
20   100  0.855  0.666  0.393   0.341    0.223     0.838
80   100  0.936  0.847  0.370   0.333    0.268     0.984
10   200  0.488  0.160  0.322   0.268    0.127     0.717
20   200  0.606  0.287  0.434   0.384    0.255     0.879
80   200  0.955  0.853  0.414   0.380    0.306     0.994
10   400  0.469  0.130  0.312   0.266    0.127     0.735
20   400  0.394  0.093  0.453   0.406    0.274     0.896
80   400  0.961  0.848  0.430   0.402    0.333     0.997

Panel B: ε_it = u_it − 0.2 u_i,t−1
N    T    t*a    t*b    Fisher  Z_PANIC  Z+_PANIC  CIPS_a
10   25   0.236  0.149  0.079   0.052    0.019     0.075
20   25   0.243  0.177  0.082   0.049    0.020     0.077
80   25   0.393  0.365  0.082   0.032    0.021     0.096
10   50   0.236  0.129  0.071   0.053    0.017     0.085
20   50   0.233  0.155  0.078   0.052    0.023     0.099
80   50   0.320  0.263  0.082   0.045    0.030     0.122
10   100  0.196  0.094  0.073   0.057    0.017     0.089
20   100  0.225  0.133  0.072   0.057    0.024     0.100
80   100  0.267  0.197  0.069   0.052    0.035     0.143
10   200  0.186  0.089  0.081   0.061    0.018     0.095
20   200  0.175  0.091  0.073   0.055    0.025     0.113
80   200  0.218  0.156  0.064   0.053    0.034     0.156
10   400  0.190  0.086  0.080   0.064    0.021     0.091
20   400  0.131  0.061  0.075   0.065    0.027     0.105
80   400  0.202  0.134  0.058   0.046    0.030     0.159
Note: L = 1 known.
The IPS, t*a and t*b tests seriously under-reject the null hypothesis (excepting t*b if T is very large and N is small) and the Z+_PANIC test is too conservative. The CIPS_a test has excellent size properties, particularly as T increases. The Fisher test, too, has reasonable empirical size, though there is some tendency to overreject the null, a trait not shared to the same degree by Z_PANIC. In the presence of negative serial correlation, φ_i = −0.5 (results not reported in tabular form to save space), the size problems encountered by the t*a and t*b tests were of a qualitatively different nature. There is now considerable evidence of overrejection of the null hypothesis, thus emphasizing the unreliability of these tests in the presence of serial correlation in the ε_it. The Fisher and CIPS_a tests, by contrast, do not suffer from appreciable size distortion in this case either, attesting to their reliability. At least since the work of Schwert (1989), it has been known that positive moving average components can affect the finite sample behaviour of univariate unit root tests, and it seems worthwhile to investigate the matter in the panel case, albeit only as a pilot exercise. Table 10 provides the size of various tests in the presence of moving average idiosyncratic errors with ε_it = u_it − 0.5 u_i,t−1, where u_it ∼ NID(0, 1). Due to its poor performance in the presence of AR(1) serial correlation, the IPS test is not considered in this experiment. The table illustrates that the t*a, t*b and CIPS_a tests exhibit completely unacceptably inflated size, while the Fisher, Z_PANIC and Z+_PANIC tests also have sizes considerably in excess of the nominal 5%, ranging from around 10% up to 50%. The exception is the Z+_PANIC test, which has a size close to the nominal when N and T are very small.
These results clearly indicate that the difficulties identified by Schwert in the univariate case can be exacerbated in the case of panel data. The size of the tests was also examined with a more moderate value of the moving average coefficient, so that ε_it = u_it − 0.2 u_i,t−1. The size distortion presented by the t*a and t*b tests is still present, although moderated to a considerable degree. Only the Fisher and Z_PANIC tests attain a size close to the nominal 5% in most of the sample sizes considered. The Z+_PANIC test is, again, too conservative. When T is small, the CIPS_a test has a size only slightly above the nominal 5%, but as T increases, the size of the test inflates to up to three times the nominal value. This suggests that caution is required when applying these tests to panels with serial correlation where a moving average component is suspected, and that more work is required in this area; such considerations are beyond the remit of this paper and remain for future work. The effect of a trend in the DGP is also briefly considered. The tests of Moon and Perron do not allow for trends (though they did use a trending DGP of the type introduced below in some of their experiments), so t*a and t*b are not considered. The Z_PANIC test and its bias-adjusted counterpart are not considered either, as the relevant moments are not available. The DGP used here was that of Moon and Perron given in (2.2), altered so that

y_it = α_i0 + α_i1 t + y_it^0,    (4.3)

with y_it^0 = ρ_i y_i,t−1^0 + x_it, where x_it follows the process defined in (2.3) when there is a single factor and f_t ∼ NID(0, 1), ε_it ∼ NID(0, 1) and γ_i ∼ NID(1, 1). Here, α_i0, α_i1 ∼ NID(0, 1). (The p-values for applying the Fisher test in this experiment were kindly supplied by S. Ng and J. Bai, to whom we extend our sincere thanks.) In the presence of trends, both the CIPS and Fisher tests have good size properties, although there is limited evidence that the Fisher test overrejects the null hypothesis, especially when N and T are very small. Overall, the empirical size of the Fisher and CIPS tests is very similar to that in Table 6 and so the results are not given in detail. The IPS test suffers from considerable upward size distortion and is not to be recommended in this situation.
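A corresponding sketch of the trended DGP (4.3), again with hypothetical function and parameter names, might look as follows; under the null, ρ_i = 1 for all i.

```python
import numpy as np

def simulate_trended_dgp(N=20, T=100, rho=1.0, seed=0):
    """Sketch of (4.3): y_it = a_i0 + a_i1 * t + y0_it with
    y0_it = rho * y0_i,t-1 + x_it and single-factor errors x_it."""
    rng = np.random.default_rng(seed)
    a0 = rng.normal(size=N)                      # alpha_i0 ~ NID(0, 1)
    a1 = rng.normal(size=N)                      # alpha_i1 ~ NID(0, 1)
    gamma = rng.normal(1.0, 1.0, size=N)         # gamma_i ~ NID(1, 1)
    x = gamma[:, None] * rng.normal(size=T)[None, :] + rng.normal(size=(N, T))
    y0 = np.zeros((N, T))
    for t in range(1, T):
        y0[:, t] = rho * y0[:, t - 1] + x[:, t]
    return a0[:, None] + a1[:, None] * np.arange(T)[None, :] + y0
```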
5. POWER PROPERTIES OF PANEL UNIT ROOT TESTS

This section discusses the size-adjusted power of the panel unit root tests. The size-adjusted power of the tests is calculated using empirical critical values to reject, or not reject, the null hypothesis. These critical values are calculated under the null hypothesis for each test using 5000 replications. Of course, in applications researchers will likely work off nominal critical values, and the poor empirical size properties of some tests (e.g. IPS and t*a, in the context of the difficulties outlined in Section 4) would render them unsuitable for practical use. After size adjustment, they are nevertheless included for completeness in the experiments reported in Table 11 below. The alternative hypothesis and DGP used initially are the same as those in Moon and Perron (2004), viz. ρ_i ∼ U[0.98, 1.00] and the DGP is (2.2). This alternative reflects stationary values that are very close to the unit circle; other departures from unity, including random values in the explosive region, will be considered presently. Table 11 reports the size-adjusted power of the panel unit root tests when μ_γ = 1, for L = 1 and 2. Table 11 includes the power properties of the Z_PANIC test, but Z+_PANIC is not included separately as it produces the same results as the former test at the sample sizes considered.
Table 11. Size-adjusted power of the tests.

L = 1:
N    T    IPS    t*a    t*b    Fisher  Z_PANIC  CIPS
10   25   0.062  0.100  0.102  0.073   0.081    0.056
20   25   0.054  0.101  0.133  0.087   0.117    0.059
80   25   0.068  0.268  0.285  0.139   0.227    0.052
10   50   0.000  0.173  0.112  0.112   0.142    0.066
20   50   0.004  0.205  0.160  0.137   0.240    0.061
80   50   0.088  0.565  0.606  0.351   0.600    0.055
10   100  0.000  0.341  0.259  0.181   0.286    0.069
20   100  0.004  0.454  0.422  0.318   0.481    0.070
80   100  0.099  0.868  0.885  0.784   0.928    0.072
10   200  0.000  0.593  0.508  0.444   0.542    0.110
20   200  0.004  0.751  0.724  0.690   0.801    0.115
80   200  0.182  0.941  0.945  0.967   0.976    0.142
10   400  0.001  0.802  0.746  0.770   0.840    0.249
20   400  0.023  0.890  0.875  0.946   0.958    0.401
80   400  0.435  0.964  0.967  0.999   0.995    0.642

L = 2:
N    T    IPS    t*a    t*b    Fisher  Z_PANIC  CIPS
10   25   0.070  0.077  0.084  0.070   0.072    0.052
20   25   0.060  0.067  0.094  0.087   0.097    0.061
80   25   0.063  0.162  0.274  0.134   0.165    0.054
10   50   0.001  0.139  0.121  0.104   0.115    0.061
20   50   0.004  0.147  0.167  0.139   0.189    0.065
80   50   0.076  0.438  0.582  0.321   0.545    0.061
10   100  0.000  0.284  0.247  0.185   0.242    0.064
20   100  0.005  0.375  0.397  0.294   0.427    0.075
80   100  0.117  0.836  0.884  0.757   0.917    0.066
10   200  0.001  0.536  0.482  0.360   0.493    0.107
20   200  0.007  0.701  0.708  0.624   0.771    0.108
80   200  0.201  0.948  0.958  0.978   0.984    0.121
10   400  0.002  0.775  0.736  0.734   0.794    0.235
20   400  0.031  0.875  0.874  0.933   0.949    0.315
80   400  0.486  0.976  0.979  0.998   0.997    0.395
Note: L known.
When adjusted for size, both the IPS and CIPS tests have very poor power at the sample sizes considered. The CIPS test, however, increases in power rapidly when T = 400, the largest value considered. The t*a, t*b, Fisher and Z_PANIC tests have higher power than the IPS and CIPS tests. The t*a and t*b tests have higher power than the Fisher test when N and T are small but, as these values increase, the Fisher test has greater power. The Z_PANIC test has power generally higher than that of the Fisher test, but this difference attenuates as N and T increase. Altering the number of factors does not have a notable detrimental effect on the power of the tests, and the power properties of most of the tests are similar whether L = 1 or 2, although for L = 2 the power of the t*a and t*b tests is slightly lower than when L = 1. The power of the tests is also calculated in the presence of a trend using the trended DGP given at (4.3). The power of the tests is similar to that presented in Table 11, although slightly lower in the deterministic trending case. The Moon and Perron, Z_PANIC and Z+_PANIC tests were not included in this experiment for the reasons pointed out in Section 4.3. Although not reported in detail, the size and power of the tests were also computed using errors generated from the t_5 and χ²_4 distributions, to assess the robustness of the tests to the Gaussian assumption. Neither the empirical sizes nor the rejection frequencies of the tests are materially affected by generating processes with errors from fat-tailed and leptokurtic distributions such as these, though the power of the tests is reduced slightly from that reported in Table 11. So far, the tests have been examined under an alternative hypothesis requiring the DGP to be strictly stationary. To examine a different type of departure from the null of a common unit root across panel members, we consider a stochastic unit root model of the kind used in a univariate context by McCabe and Tremayne (1995), with ρ_i ∼ N(1, ω²). In order for sensible and re-estimable factors to be generated under this regime, local departures from unity are required.
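The stochastic unit root alternative can be sketched in the same style; the draw of ρ_i ∼ N(1, ω²) is the only change from the unit root null, and the names below are again ours.

```python
import numpy as np

def simulate_stur_panel(N=20, T=100, omega=0.00625, seed=0):
    """Sketch of the stochastic-unit-root alternative: unit-specific
    rho_i ~ N(1, omega^2), so roughly half the units are mildly explosive."""
    rng = np.random.default_rng(seed)
    rho = rng.normal(1.0, omega, size=N)         # rho_i ~ N(1, omega^2)
    gamma = rng.normal(1.0, 1.0, size=N)         # one common factor
    x = gamma[:, None] * rng.normal(size=T)[None, :] + rng.normal(size=(N, T))
    y = np.zeros((N, T))
    for t in range(1, T):
        y[:, t] = rho * y[:, t - 1] + x[:, t]
    return y
```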
Table 12. Size-adjusted power of the tests, stochastic unit root (L = 1).
N    T    IPS    t*a    t*b    Fisher  Z_PANIC  CIPS
10   25   0.045  0.051  0.054  0.047   0.052    0.041
20   25   0.044  0.036  0.045  0.049   0.054    0.047
80   25   0.042  0.045  0.045  0.050   0.038    0.043
10   50   0.000  0.045  0.045  0.046   0.048    0.043
20   50   0.003  0.022  0.017  0.046   0.035    0.037
80   50   0.045  0.012  0.016  0.037   0.018    0.029
10   100  0.000  0.031  0.021  0.035   0.025    0.036
20   100  0.001  0.014  0.012  0.028   0.017    0.023
80   100  0.027  0.001  0.002  0.020   0.005    0.008
10   200  0.000  0.018  0.011  0.026   0.019    0.017
20   200  0.000  0.002  0.001  0.019   0.006    0.007
80   200  0.000  0.000  0.000  0.008   0.001    0.000
10   400  0.000  0.008  0.006  0.023   0.018    0.004
20   400  0.000  0.000  0.000  0.035   0.014    0.000
80   400  0.000  0.000  0.000  0.107   0.045    0.000
After some experimentation, we found that a value of ω = 0.00625 yields re-estimable factors for all the values of T (and N) used, but increasing ω appreciably results in difficulties in estimating the factors. Table 12 reports the power of the tests under the stochastic unit root regime. The tests have considerably lower power than previously, and this power actually deteriorates as N and T increase, particularly in the case of the CIPS and t*b tests. Although Table 11 indicates that the Fisher and t*b tests have a good ability to reject the null hypothesis when it is false, Table 12 shows that the power of the tests depends critically on departures from the null hypothesis being only in the stationary direction. The table shows that, under the stochastic unit root hypothesis, the tests effectively have no power and generally have rejection frequencies less than the nominal size. We surmise that this is due to the presence of explosive panel members. Conceivably, the problem could be lessened by developing the arguments of Abadir and Distaso (2007) to take advantage of the one-sided nature of the departures from the null hypothesis that are of concern.
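The size-adjustment device used throughout this section can be summarised in a few lines; the sketch below assumes user-supplied callables for the test statistic and the two data-generating processes, and is not the authors' implementation.

```python
import numpy as np

def size_adjusted_power(stat_fn, draw_null, draw_alt,
                        reps=5000, level=0.05, lower_tail=True, seed=0):
    """Reject using the empirical critical value simulated under the null
    (the Section 5 procedure), then report the rejection rate under the
    alternative."""
    rng = np.random.default_rng(seed)
    null_stats = np.array([stat_fn(draw_null(rng)) for _ in range(reps)])
    crit = np.quantile(null_stats, level if lower_tail else 1.0 - level)
    alt_stats = np.array([stat_fn(draw_alt(rng)) for _ in range(reps)])
    reject = alt_stats < crit if lower_tail else alt_stats > crit
    return reject.mean()
```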
6. AN APPLICATION TO THE EFFICIENT MARKETS HYPOTHESIS

Since Lo and MacKinlay (1988) restarted the debate on the informational efficiency of financial markets, a great deal of effort has been invested in determining whether stock prices follow a random walk. In a regulatory environment, the random walk hypothesis has a further implication beyond informational efficiency alone. Regulatory bodies may, by their policies, be encouraging, or inhibiting, informational efficiency in the market. Determining whether stock prices in any industry follow a random walk may assist in assessing how well such a regulatory body promotes informational efficiency.
The efficient markets hypothesis is generally tested as a univariate test of a random walk using a market index as an estimate of the behaviour of the market as a whole; see, for example, Smith and Ryoo (2003), Tabak (2003) and Lima and Tabak (2004). This method has several disadvantages, not the least of which is the poor power of univariate unit root tests. A further problem, leading to possibly even lower power, is that using such market indices may result in attempts to test variables that are actually combinations of I(0) and I(1) variables, an exercise observed by Bai and Ng (2004) to be sometimes misleading. The advantages of using panel data to test for a unit root are numerous. Important amongst these are, first, that against the heterogeneous alternative hypothesis allowing ρ_i to vary across cross-sections, panel unit root tests may have higher power than univariate unit root tests have against the univariate alternative hypothesis. Secondly, the approximate linear factor model may account for market-wide behaviour and correlation, which a market index can only approximate. We assess the evidence for the efficient markets hypothesis in the context of the Australian prudential industry. Applications of the efficient markets hypothesis to date have largely focused on American markets, although there have been some studies of Australian markets. Patro and Wu (2004) reject the univariate null hypothesis of a random walk in the Morgan Stanley Capital International index for Australia, while Lee (1992) also rejects the random walk null hypothesis for the All Ordinaries index. The data used relate to a small section of the ASX: specifically, a panel of 16 retail banks, full-line insurers and investment banks listed on the ASX. The log stock prices of these firms, observed weekly over the period August 11, 2000, to April 15, 2005 (T = 245), are tested for a panel unit root. Weekly data mitigate the problems associated with daily data by compensating for any biases associated with infrequent trading, while still allowing for a reasonably long time series (Lee, 1992). The data were obtained from the Factiva database and are available on request. The firms included in the panel are Adelaide Bank Limited, AMP Limited, ANZ Banking Group, AXA Asia Pacific Holdings Limited, Bank of Queensland Limited, Bendigo Bank Limited, Calliden Group, Commonwealth Bank, Futuris Corporation (owner of Elders Bank), Insurance Australia Group Limited, Macquarie Bank Limited, National Australia Bank, QBE Insurance, St George Bank, Suncorp-Metway Limited and Westpac Banking. All of the firms included in the panel are in some way monitored by the APRA. Support for the panel unit root null hypothesis provides support for the efficient markets hypothesis in this sector of the market and also yields evidence of APRA's ability to promote informational efficiency in the financial services industry. Inferences from this panel may not generalize to the entire ASX, but they provide some indication of the efficiency of a single sector of the ASX and of the effect of regulation in this environment. The number of factors in the panel requires estimation. The number of factors selected using the BIC3 criterion is three, while all other criteria suggested by Bai and Ng (2002) unrealistically choose the maximum number of factors allowed, eight. The HQ2, HQ4 and HQ5 criteria estimate three factors in the panel; the HQ6 criterion estimates two factors, while HQ7 unrealistically estimates eight.
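For readers who wish to replicate the factor-count step, the next sketch estimates the number of factors by principal components combined with an information criterion. The penalty shown is the IC_p1 form of Bai and Ng (2002) rather than the exact BIC3 or HQ variants used above, so it is illustrative only.

```python
import numpy as np

def estimate_num_factors(X, kmax=8):
    """Choose the number of factors for an N x T panel X by principal
    components plus an information criterion in the spirit of Bai and
    Ng (2002); the IC_p1 penalty is used here as an illustration."""
    N, T = X.shape
    X = X - X.mean(axis=1, keepdims=True)        # demean each series
    u, s, vt = np.linalg.svd(X, full_matrices=False)
    best_k, best_ic = 0, np.inf
    for k in range(1, kmax + 1):
        fit = (u[:, :k] * s[:k]) @ vt[:k]        # rank-k approximation
        v_k = np.mean((X - fit) ** 2)            # average squared residual
        penalty = k * ((N + T) / (N * T)) * np.log(N * T / (N + T))
        ic = np.log(v_k) + penalty
        if ic < best_ic:
            best_k, best_ic = k, ic
    return best_k
```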
As indicated in Section 2, HQ1 and HQ3 are not used. In light of this, the test statistics are reported three times: first, working on the premise that the assumption of the CIPS tests is correct and there is a single common factor; secondly, assuming that there are two factors, in line with the HQ6 criterion's estimate; and finally, assuming the existence of three common factors, as suggested by the BIC3, HQ2, HQ4 and HQ5 criteria. The IPS statistic does not allow for any factors and the CIPS statistic allows only a single factor, so both these test statistics are reported once.
Table 13. Test statistics and critical values of the panel unit root tests.
Statistic   No factors  One factor  Two factors  Three factors  10% critical value
IPS         4.740       –           –            –              −1.28
t*a         –           0.189       0.152        0.289          −1.28
t*b         –           1.279       0.614        1.301          −1.28
Fisher      –           −0.247      −0.995       −0.367         1.28
Z_PANIC     –           0.400       0.585        0.296          −1.28
Z+_PANIC    –           0.826       1.573        1.437          −1.28
CIPS        –           −1.784      –            –              −2.15
The IPS, t*a, t*b and CIPS tests reject the null hypothesis for large negative values of the test statistics, while the Fisher test rejects the null hypothesis for large positive values of its statistic. The critical values of the tests at the 10% level and the values of the test statistics themselves are given in Table 13. The testing outcomes are robust to the number of factors assumed to be present in the panel, and the absence of significant values in Table 13 indicates that all tests fail to reject the null hypothesis of a panel unit root at all conventional significance levels. Perhaps unsurprisingly, univariate testing produces similar results. If all cross-sections are treated independently and tested individually using the Dickey–Fuller test, only one cross-section rejects the unit root null hypothesis. (That cross-section unit is the National Australia Bank, at the 5% level.) Overall, these results suggest that there is little evidence to support the hypothesis that at least one of the 16 cross-sections in the panel is stationary, implying that this sector of the financial services industry listed on the ASX is informationally efficient. This contrasts with the univariate evidence of Lee (1992) and Patro and Wu (2004), who reject the null hypothesis of a univariate random walk for the All Ordinaries and Morgan Stanley Capital International indices. The inference that may be drawn from this panel is that a random walk cannot be rejected in this sector of the ASX and, consequently, the efficient markets hypothesis cannot be rejected. This, in turn, leads to the conclusion that the policies of the Australian prudential regulatory environment are not hampering market efficiency in this sector of the ASX.
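The unit-by-unit check reported in the previous paragraph can be reproduced along the following lines, here sketched with the adfuller routine from the statsmodels package rather than the authors' own code.

```python
from statsmodels.tsa.stattools import adfuller

def count_univariate_rejections(panel, level=0.05):
    """Unit-by-unit augmented Dickey-Fuller tests; `panel` is an N x T
    array of log prices (a sketch, not the authors' implementation)."""
    rejections = 0
    for series in panel:
        stat, pvalue, *_ = adfuller(series, regression="c")
        if pvalue < level:
            rejections += 1
    return rejections
```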
7. CONCLUSION

The model selection criteria proposed by Bai and Ng (2002) have a tendency to overestimate the number of factors when N is small, which suggests the need for more effective criteria. We propose new criteria based on that of Hannan and Quinn (1979) and find that they are an effective substitute because they are less sensitive to the value of N as long as T is suitably large. Specifically, T > 50 is suitable for using the HQ4 criterion if N is small. As N gets larger, the difference between the criteria attenuates. Neither the information criteria nor the panel unit root tests examined are materially affected by the departures from Gaussianity in factor generation or idiosyncratic errors that we examine. The general conclusions reached regarding panel unit root tests are as follows. The IPS test has reasonable size properties in the absence of cross-sectional correlation, but poor size when this correlation is strong. As might be expected, the CIPS tests are highly sensitive to the assumptions that γ̄ ≠ 0 and that there is only a single factor in the model.
If these assumptions hold, the CIPS test has excellent size control, but considerably less power than the other tests examined at the sample sizes considered. When these assumptions are not satisfied, the tests cannot be recommended because of their poor empirical performance. The Moon and Perron tests exhibit upward size distortion when N is small; this is especially true of the t*a test, which cannot be recommended for this reason. The t*b test generally has somewhat better size and power properties than its stablemate. The size properties of the IPS, CIPS, t*a and t*b tests are shown to depend on the DGP used. As such, these tests need to be treated with caution in practice. Overall, the Fisher test proposed by Bai and Ng (2004) has good size and power over a wide variety of DGPs. The Z_PANIC test also has excellent size in general, though its bias-corrected counterpart consistently under-rejects the null hypothesis. However, a caveat is that serial correlation in idiosyncratic errors, particularly of the moving average type, can have serious detrimental effects on the performance of all the tests considered. The power of all the tests depends on departures from unity being in the stationary direction; non-stationary values of the autoregressive parameters result in very low power. The experiments reported in this paper indicate that the separation of components approach first suggested by Bai and Ng (2004) results in test statistics with more robust size and higher power under a wider range of DGPs than any other approach considered. In examining the evidence regarding the informational efficiency of the financial services sector of the ASX, we find that there is no evidence to reject the null hypothesis of a panel random walk. This inference is robust to the number of factors assumed to be in the data. The panel unit root tests are unable to refute the efficient markets hypothesis, suggesting that the policies of the APRA are not inhibiting informational efficiency in this sector.
ACKNOWLEDGMENTS

The authors would like to thank two anonymous referees and an Associate Editor for their insightful comments. All errors remain our own.
REFERENCES

Abadir, K. (1993). OLS bias in nonstationary autoregression. Econometric Theory 9, 81–93.
Abadir, K. and W. Distaso (2007). Testing joint hypotheses when one of the alternatives is one-sided. Journal of Econometrics 140, 695–718.
Abadir, K. and K. Hadri (2000). Is more information a good thing? Bias nonmonotonicity in stochastic difference equations. Bulletin of Economic Research 52, 91–100.
Bai, J. and S. Ng (2002). Determining the number of factors in approximate factor models. Econometrica 70, 191–221.
Bai, J. and S. Ng (2004). A PANIC attack on unit roots and cointegration. Econometrica 72, 1127–77.
Baltagi, B. (2001). Econometric Analysis of Panel Data. Chichester: Wiley.
Baltagi, B. and C. Kao (2000). Nonstationary panels, cointegration in panels and dynamic panels: a survey. Advances in Econometrics 15, 7–51.
Banerjee, A. (1999). Panel data unit roots and cointegration: an overview. Oxford Bulletin of Economics and Statistics 61, 607–29.
Breitung, J. and M. Pesaran (2008). Unit roots and cointegration in panels. In L. Matyas and P. Sevestre (Eds.), The Econometrics of Panel Data, 278–352. New Jersey: Springer-Verlag.
Breuer, J., R. McNown and M. Wallace (2001). Misleading inferences from panel unit-root tests with an illustration from purchasing power parity. Review of International Economics 9, 482–93.
Chang, Y. (2002). Nonlinear IV unit root tests in panels with cross-sectional dependency. Journal of Econometrics 110, 261–92.
Chang, Y. (2004). Bootstrap unit root tests in panels with cross-sectional dependency. Journal of Econometrics 120, 263–93.
Chang, Y. and W. Song (2002). Panel unit root tests in the presence of cross-sectional dependency and heterogeneity. Working paper, Rice University.
Choi, I. (2001). Unit root tests for panel data. Journal of International Money and Finance 20, 249–72.
de Silva, S. (2008). On the specificity and performance of panel unit root tests. Unpublished Ph.D. thesis, University of Sydney.
Dickey, D. and W. Fuller (1979). Distribution of the estimators for autoregressive time series with a unit root. Journal of the American Statistical Association 74, 427–31.
Gengenbach, C., F. Palm and J. Urbain (2004). Panel unit root tests in the presence of cross-sectional dependencies: comparison and implications for modelling. METEOR Research Memorandum 040, Maastricht University.
Hadri, K. (2000). Testing for stationarity in heterogeneous panel data. Econometrics Journal 3, 148–61.
Hadri, K. and R. Larsson (2005). Testing for stationarity in heterogeneous panel data where the time dimension is finite. Econometrics Journal 8, 55–69.
Hannan, E. and B. Quinn (1979). The determination of the order of an autoregression. Journal of the Royal Statistical Society, Series B 41, 190–95.
Harris, D., S. Lebourne and B. McCabe (2005). Panel stationarity tests for purchasing power parity with cross-sectional dependence. Journal of Business and Economic Statistics 23, 395–410.
Im, K. and M. Pesaran (2003). On the panel unit root tests using nonlinear instrumental variables. Working Papers in Economics 0347, University of Cambridge.
Im, K., M. Pesaran and Y. Shin (1997). Testing for unit roots in heterogeneous panels. Working Papers in Economics 9526, University of Cambridge.
Im, K., M. Pesaran and Y. Shin (2003). Testing for unit roots in heterogeneous panels. Journal of Econometrics 115, 53–74.
Lee, U. (1992). Do stock prices follow random walk? Some international evidence. International Review of Economics and Finance 1, 315–27.
Levin, A. and C. Lin (1992). Unit root tests in panel data: asymptotic and finite sample properties. Working Paper 92-23, Department of Economics, University of California at San Diego.
Levin, A., C. Lin and C. J. Chu (2002). Unit root tests in panel data: asymptotic and finite-sample properties. Journal of Econometrics 108, 1–24.
Lima, E. and B. Tabak (2004). Tests of the random walk hypothesis for equity markets: evidence from China, Hong Kong and Singapore. Applied Economics Letters 11, 255–58.
Lo, A. and A. MacKinlay (1988). Stock market prices do not follow random walks: evidence from a simple specification test. Review of Financial Studies 1, 41–66.
MacKinnon, J. (1994). Approximate asymptotic distribution functions for unit-root and cointegration tests. Journal of Business and Economic Statistics 12, 167–76.
Maddala, G. S. and S. Wu (1999). A comparative study of unit root tests with panel data and a new simple test. Oxford Bulletin of Economics and Statistics 61, 631–52.
McCabe, B. and A. Tremayne (1995). Testing a time series for difference stationarity. Annals of Statistics 23, 1015–28.
Moon, H. and B. Perron (2004). Testing for a unit root in panels with dynamic factors. Journal of Econometrics 122, 81–126.
Nabeya, S. (1999). Asymptotic moments of some unit root test statistics in the null case. Econometric Theory 15, 139–49.
O'Connell, P. (1998). The overvaluation of purchasing power parity. Journal of International Economics 44, 1–19.
Patro, D. and Y. Wu (2004). Predictability of short-horizon returns in international equity markets. Journal of Empirical Finance 11, 553–84.
Pesaran, M. (2004). General diagnostic tests for cross-section dependence in panels. Working Papers in Economics 0435, University of Cambridge.
Pesaran, M. (2007). A simple panel unit root test in the presence of cross section dependence. Journal of Applied Econometrics 22, 265–312.
Phillips, P. and H. Moon (1999). Linear regression limit theory for nonstationary panel data. Econometrica 67, 1057–111.
Phillips, P. and D. Sul (2003). Dynamic panel estimation and homogeneity testing under cross section dependence. Econometrics Journal 6, 217–59.
Schwert, G. (1989). Tests for unit roots: a Monte Carlo investigation. Journal of Business and Economic Statistics 7, 147–60.
Smith, G. and H. Ryoo (2003). Variance ratio tests of the random walk hypothesis for European emerging stock markets. European Journal of Finance 9, 290–300.
Smith, L., S. Leybourne, T. Kim and P. Newbold (2004). More powerful panel data unit root tests with an application to mean reversion in real exchange rates. Journal of Applied Econometrics 19, 147–70.
Stock, J. and M. Watson (1988). Testing for common trends. Journal of the American Statistical Association 83, 1097–107.
Strauss, J. and T. Yigit (2003). Shortfalls of panel unit root testing. Economics Letters 81, 309–13.
Tabak, B. (2003). The random walk hypothesis and the behaviour of foreign capital portfolio flows: the Brazilian stock market case. Applied Financial Economics 13, 369–78.
Westerlund, J. and R. Larsson (2007). A note on the pooling of individual PANIC unit root tests. Working paper, Lund University.
The Econometrics Journal (2009), volume 12, pp. 367–381. doi: 10.1111/j.1368-423X.2009.00282.x

The empirical process of autoregressive residuals

ERIC ENGLER† AND BENT NIELSEN‡

†D. E. Shaw & Co., 39th Floor, Tower 45, 120 W 45th St, New York, NY 10016, USA
E-mail: [email protected]
‡Nuffield College, Oxford OX1 1NF, UK
E-mail: [email protected]

First version received: June 2008; final version accepted: November 2008

Summary. Asymptotic theory is developed for the residual empirical process of autoregressive distributed lag models with an intercept and possibly other deterministic terms. The asymptotic distribution is shown not to depend on the location of the characteristic roots. This contrasts with situations without an intercept, where unit roots give rise to non-standard distributions. This is important in applications, as the question of the innovation distribution can be addressed without knowledge of the characteristic roots.

Keywords: Autoregression, Empirical process, Kolmogorov–Smirnov test, Probability–probability plots, Quantile–quantile plots, Residuals.

1. INTRODUCTION

The asymptotic theory of the empirical process of autoregressive residuals is revisited. The focus is on autoregressions including intercepts and possibly other deterministic terms, as in most applications. The asymptotic distribution is shown not to depend on the location of the characteristic roots. This contrasts with situations without an intercept, where unit roots give rise to non-standard distributions. This is important in applications, as the question of the distribution of the innovations can then be addressed without having to locate the characteristic roots. The simple first-order autoregression without intercept has been studied extensively. Boldin (1981) and Koul and Leventhal (1989) studied the stationary and the explosive case, respectively, and found the same Gaussian limiting distribution as in regressions with fixed regressors but no intercept. Ling (1998) and Lee and Wei (1999) studied the case with a unit root and a known scale and found a non-Gaussian limiting distribution. Thus, for the simple first-order autoregression the questions of the distribution of the innovations and of the location of the characteristic root cannot be separated. For autoregressions with intercept, Pierce (1985) and Koul and Leventhal (1989) studied the stationary and the explosive case, respectively, and found the same Gaussian limiting distribution as in regressions with fixed regressors including an intercept. The unit root case has not been studied. In this paper it is shown that the limiting distribution is the same, so that the limiting distribution is invariant to the location of the characteristic roots. The key to the result is that the non-Gaussian component in the case of no intercept stems from the sum of the residuals.
When an intercept is included in the model, as in most applications, this sum is zero and the non-Gaussian component does not arise. A similar invariance holds for the likelihood-based tests for unit roots and for the order of autoregressions, but not for the usual correlograms; see Nielsen (2001, 2006a,b). The proof draws on the results of Lee and Wei (1999) and Koul (2002) for residual empirical processes for stochastic regressions. Some work is needed though, since the former covers some non-stationary cases with known location and scale, while the latter gives techniques for dealing with location and scale in the non-trending case. Combining this work with the asymptotic theory of autoregressive estimators developed in Nielsen (2005), a wide range of frequently applied autoregressive models can be considered. The simple autoregression has received some further attention in the stationary case without intercept and with known scale. In that situation the limiting process is a Brownian bridge. The partial empirical process has been proved to converge to a Kiefer process by Bai (1994) and by Na et al. (2005) for a case with measurement errors. Since the empirical process converges to a Brownian bridge, the Kolmogorov–Smirnov test is distribution free. The empirical process can then be used not only for testing for a given reference distribution of the innovations but also to estimate the innovation distribution. When the scale is unknown the Kolmogorov–Smirnov statistic is not distribution free, but a distribution-free statistic can be found using the Khmaladze transformation involving the innovation density. If the focus is on estimation the density must also be estimated, which is challenging unless the sample size is large. Bai (2003) points out that the Khmaladze transformation applies for autoregressions, but focuses on the testing issue for non-linear time series where the transformation simplifies complicated limit distributions. The considered class of autoregressions is quite general. Autoregressive distributed lag models are allowed where the joint distribution of the included variables is vector autoregressive, with the only restrictions related to any explosive characteristic roots. The deterministic components can be polynomial and seasonal. The main result states the asymptotic properties of the residual empirical process. Focusing on the Gaussian case, new response surfaces are given for the distributions of Kolmogorov–Smirnov and Anderson–Darling-type test statistics. These in turn imply test bands for probability–probability and quantile–quantile plots.
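As a concrete illustration of the objects studied below, the following sketch fits an AR(k) with intercept by least squares, forms the scaled residuals, and evaluates a Kolmogorov–Smirnov-type statistic against a standard normal reference. All names are our own, and the construction follows the definitions given in Sections 2 and 3 only in outline.

```python
import numpy as np
from scipy.stats import norm

def residual_empirical_process(y, k=1):
    """Least-squares AR(k) with intercept, scaled residuals, and a
    Kolmogorov-Smirnov-type statistic for a standard normal reference."""
    n = len(y)
    T = n - k
    lags = np.column_stack([y[k - j - 1:n - j - 1] for j in range(k)])
    X = np.column_stack([np.ones(T), lags])      # intercept plus lags
    coef, *_ = np.linalg.lstsq(X, y[k:], rcond=None)
    resid = y[k:] - X @ coef
    eps_hat = resid / resid.std()                # scaled residuals
    u = norm.cdf(np.sort(eps_hat))               # F(eps_hat) on the grid
    ecdf = np.arange(1, T + 1) / T
    process = np.sqrt(T) * (ecdf - u)            # the empirical process
    return np.abs(process).max()                 # sup-type statistic
```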
2. THE EMPIRICAL PROCESS

A general autoregressive model is set up and is followed by the asymptotic theory.

2.1. The general autoregressive model

Let X_t be a p-dimensional time series partitioned in terms of a univariate time series Y_t and a time series Z_t of dimension p − 1 ≥ 0. The general univariate model is given by

Y_t = ρ Z_t + Σ_{j=1}^k α_j Y_{t−j} + Σ_{j=1}^k β_j Z_{t−j} + ν D_{t−1} + σ ε_t    (t = 1, . . . , T),    (2.1)

conditional on X_0, . . . , X_{1−k}, with independent innovations ε_t with distribution function F. The term D_{t−1} is a deterministic term, which is discussed in further detail below. When ρ is restricted to zero, so Z_t is absent on the right-hand side, this is the marginal equation of a vector autoregression, and when ρ is unrestricted this is an autoregressive distributed lags model.
If the Z process is absent this reduces to an autoregression. When k = 0 and D_{t−1} = 1 the model reduces to a classical regression model with an intercept. In addition, when ρ is restricted to zero, this becomes a location-scale model for Y_t. Least-squares estimation of (2.1) gives the scaled residuals

ε̂_t = σ̂^{−1} ( Y_t − ρ̂ Z_t − Σ_{j=1}^k α̂_j Y_{t−j} − Σ_{j=1}^k β̂_j Z_{t−j} − ν̂ D_{t−1} ),

if, for instance, ρ is unrestricted. The empirical distribution function of the residuals is

F̂(x) = T^{−1} Σ_{t=1}^T 1_{(ε̂_t ≤ x)}.

For a continuous reference distribution, the associated empirical process is

F̂(u) = T^{1/2} [ F̂{F^{−1}(u)} − u ],    (2.2)
which has argument u = F(x) on the unit interval. The intercept plays a crucial role. The issue is that the asymptotic expansion of F̂ involves the term T^{−1/2} Σ_{t=1}^T ε̂_t. When no intercept is included this vanishes as long as the characteristic roots are different from one, whereas manipulations as in Chan and Wei (1988) show that it has a Dickey–Fuller-type distribution in the presence of unit roots. When an intercept is included then Σ_{t=1}^T ε̂_t = 0 by construction. In order to discuss the distribution of the empirical process, the joint distribution of the time series X_t = (Y_t, Z_t)′ has to be specified. If this is assumed to satisfy a vector autoregression, the results for general vector autoregressions given by Nielsen (2005) can be used. That paper is a generalisation of the work by Lai and Wei (1985), who did not consider deterministic terms. Thus, suppose the time series X_t and the deterministic series D_t satisfy the autoregressive equations

X_t = Σ_{j=1}^k A_j X_{t−j} + μ D_{t−1} + ξ_t    (t = 1, . . . , T),    (2.3)

D_t = D D_{t−1},    (2.4)
conditional on X_0, . . . , X_{1−k}, where the vector innovations ξ_t are partitioned as ξ_t = (ξ_{y,t}, ξ_{z,t})′. The vector innovations are assumed to have mean zero and a positive definite variance matrix partitioned as

Var(ξ_t) = Ω = [ Ω_yy  Ω_yz
                 Ω_zy  Ω_zz ].    (2.5)

The parameter space of the vector autoregression is then given by A_1, . . . , A_k, μ, Ω freely varying in R^{kp² + p dim(D) + p²}, such that Ω is positive definite. The only restriction on the parameter space relates to the explosive roots of the companion matrix

B = [ (A_1, . . . , A_{k−1})  A_k
      I_{p(k−1)}              0 ].
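In code, the companion matrix and its roots are immediate; a small sketch (with an illustrative univariate example) follows.

```python
import numpy as np

def companion_matrix(A_list):
    """Stack the VAR coefficient matrices A_1, ..., A_k into the
    companion matrix B appearing in Assumption 2.1 below."""
    k = len(A_list)
    p = A_list[0].shape[0]
    top = np.hstack(A_list)                      # (A_1, ..., A_k), p x pk
    bottom = np.hstack([np.eye(p * (k - 1)), np.zeros((p * (k - 1), p))])
    return np.vstack([top, bottom])

# Explosive roots are the eigenvalues of B outside the unit circle;
# this univariate AR(2) example has one explosive root at 1.1.
B = companion_matrix([np.array([[1.1]]), np.array([[0.0]])])
roots = np.linalg.eigvals(B)
```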
ASSUMPTION 2.1. All explosive roots of B have geometric multiplicity of unity. That is, for all λ ∈ C with |λ| > 1, rank(B − λI_{pk}) ≥ pk − 1.

Assumption 2.1 is always satisfied for univariate autoregressions, where p = 1, and for vector autoregressions with at most one explosive root. For multivariate autoregressions, Anderson (1959) and Duflo et al. (1991) pointed out that this assumption is needed for consistency of the least-squares estimators, as it ensures positive definiteness of the normalised information matrix associated with the explosive roots discussed in Lai and Wei (1985, Theorem 4) and Nielsen (2005, Corollary 7.2). The univariate model (2.1) arises from the vector autoregression in two ways. When ρ is restricted to zero then (2.1) is simply the first equation in (2.3), so σε_t = ξ_{y,t} with mean zero and variance σ² = Ω_yy. When ρ = Ω_yz Ω_zz^{−1} and ξ is normally distributed, then (2.1) states the conditional model for Y_t given Z_t and the past, with σε_t = ξ_{y,t} − ρξ_{z,t} with mean zero and variance σ² = Ω_yy − Ω_yz Ω_zz^{−1} Ω_zy. The formulation for the deterministic term D_t allows a joint autoregressive companion representation of (X_t, D_t), and is inspired by Johansen (2000). The matrix D has characteristic roots on the complex unit circle, so D_t is a vector of terms such as a constant, a linear trend, or periodic functions like seasonal dummies. For example,

D = [ 1  0
      1  −1 ]    with    D_0 = (1, 1)′

will generate a constant and a dummy for a bi-annual frequency.
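Iterating the recursion D_t = D D_{t−1} makes this concrete; the following short sketch uses the matrix D as reconstructed above.

```python
import numpy as np

# Iterate D_t = D @ D_{t-1} with D_0 = (1, 1)'.
D = np.array([[1.0, 0.0],
              [1.0, -1.0]])
d = np.array([1.0, 1.0])
terms = []
for t in range(6):
    terms.append(d.copy())
    d = D @ d
# The first coordinate stays at 1 (an intercept); the second alternates
# 1, 0, 1, 0, ..., a dummy at a bi-annual frequency.
```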
The deterministic term D_t is assumed to have linearly independent coordinates, which is formalised as follows.

ASSUMPTION 2.2. |eigen(D)| = 1 and rank(D_1, . . . , D_{dim D}) = dim D.

The next assumption ensures that an intercept is included in the regression. Note that no restrictions are needed for the coefficient μ in (2.3), which could therefore be zero.

ASSUMPTION 2.3. Assume D has at least one eigenvalue of unity.

2.2. Asymptotic theory for the empirical process

To formulate the main result some assumptions on the innovations ε_t and ξ_t are needed. Essentially, ε_t needs to be independent and identically distributed, while it suffices that ξ_t is a martingale difference sequence with respect to an increasing sequence of σ-fields (F_t). A set of three assumptions is needed for the innovations ε_t of the regression equation (2.1). The first assumption ensures the convergence of the uniform empirical process.

ASSUMPTION 2.4. Suppose the innovations ε_t of the regression equation (2.1) are independent and identically distributed with distribution function F, so that Eε_t = 0 and Var ε_t = 1.

A second assumption is a martingale assumption, which is needed when ρ is unrestricted so that the regression equation (2.1) is an autoregressive distributed lags equation.

ASSUMPTION 2.5. If ρ is unrestricted, let G_{t−1} be the sigma field over F_{t−1} and Z_t. If ρ is restricted to zero, let G_{t−1} = F_{t−1}. Suppose the innovations ε_t are independent of G_{t−1}.
The third assumption concerns the appearance of the distribution function F.

ASSUMPTION 2.6. Suppose the distribution function F has density f which is positive and differentiable everywhere, and satisfies

sup_{x∈R} |x f(x)| < ∞,    sup_{x∈R} |(1 + x²) f′(x)| < ∞.

Assumption 2.6 is satisfied by many distributions, notably the standard normal distribution. Denoting the standard normal density by ϕ, it holds that ϕ′(x) = −xϕ(x). Since ϕ has exponentially declining tails, the boundedness follows. Less restrictive assumptions are needed for the innovations ξ_t of the underlying vector autoregression (2.3). These are assumed to satisfy a martingale difference sequence assumption, so as to exploit the consistency results of Nielsen (2005).

ASSUMPTION 2.7. Suppose (ξ_t, F_t) is a martingale difference sequence, so E(ξ_t | F_{t−1}) = 0 a.s., that the initial values X_0, . . . , X_{1−k} are F_0-measurable, and that

sup_t E{(ξ_t′ ξ_t)^{λ/2} | F_{t−1}} < ∞  a.s.  for some λ > 4,    (2.6)

E(ξ_t ξ_t′ | F_{t−1}) = Ω  a.s.,  where Ω is positive definite.    (2.7)
Assumption 2.7 excludes the possibility that the vector autoregressive innovations could be autoregressive conditional heteroscedastic (ARCH). If ARCH were present, this should possibly be modelled directly anyway. Assumption 2.7 is used to cover the explosive case and is therefore needed in order to explore the invariance with respect to the autoregressive parameters. The limiting distribution of the empirical process is expressed in terms of stochastic integrals with respect to a Brownian bridge U; see Shorack and Wellner (1986, pp. 91–95 and 197–200). Stochastic integrals ∫_0^1 h_j(u) dU(u) are well defined for square integrable functions h_1, h_2. They are normally distributed with expectation zero, and covariance

Cov( ∫_0^1 h_1(u) dU(u), ∫_0^1 h_2(u) dU(u) ) = ∫_0^1 h_1(u) h_2(u) du − ( ∫_0^1 h_1(u) du )( ∫_0^1 h_2(u) du ).    (2.8)

The Brownian bridge itself can be written as U(v) = ∫_0^1 1_{(u≤v)} dU(u). The main result can now be formulated. It shows that as long as an intercept is included in the autoregression (see Assumption 2.3), possibly with a zero coefficient, and as long as any explosive root has geometric multiplicity one (see Assumption 2.1), the same Gaussian limit distribution applies for all values of the characteristic roots. Thus, it is possible to make inference about the distribution of the autoregressive innovations without knowing the location of the characteristic roots.

THEOREM 2.1. Suppose model (2.3) and Assumptions 2.1–2.7 are satisfied. Then F̂(·) ⇒ X_F(·) on D[0, 1], where X_F is the Gaussian process defined by
372
E. Engler and B. Nielsen
on D[0, 1], where XF is the Gaussian process defined by ⎡ ⎤ ⎡ ⎤ 1(s≤u) 1 1 ⎢ −1 ⎥ ⎢ ⎥ f{F−1 (u)} XF (u) = ⎣ ⎣ F (s) ⎦ dU (s) . ⎦ 0 1 −1 F (u)f{F−1 (u)} {F−1 (s)}2 2
(2.9)
The process XF defined in (2.9) is the sum of three components. The first component is ˆ say. The second simply the Brownian bridge U(u) arising from the uniform empirical process, U ˆ and arise due to the estimation of location and scale, and third components relate to Fˆ − U respectively. The process XF is standard, in the sense of applying to cross-sectional regression problems with unknown location and scale; see Shorack and Wellner (1986, p. 197f) and Koul (2002, p. 58). For a known reference distribution the process XF can be simulated in various ways. One approach is to use the covariance structure of XF . For any choice of reference distribution F this covariance structure can be found using (2.8). For the Gaussian case it is found in (2.10) below. For a particular grid over the interval [0, 1], the covariance matrix is computed and a multivariate normal distribution is simulated. Another approach is to compute the stochastic integrals in (2.9) directly. First, a Brownian motion B is computed for a grid over [0, 1] by taking partial sums of generated independent standard normal variables. This is transformed into a Brownian bridge by U(u) = B(u) − uB(1). Next, the integral (2.9) is formed. This approach is a little more convoluted, but numerically faster when dealing with a fine grid over [0, 1]. The case where the reference distribution F is standard Gaussian, denoted , is of special interest. Then, the covariance structure of the limiting Gaussian process X is Cov{X (u), X (v)}
1 −1 −1 (u) (v) + 1 , = u(1 − v) − ϕ{ (u)}ϕ{ (v)} 2 −1
−1
for 0 ≤ u ≤ v ≤ 1, due to (2.8), or equivalently, for x, y ∈ R so x ≤ y,
(2.10)
xy +1 . Cov[X { (x)}, X { (y)}] = (x){1 − (y)} − ϕ(x)ϕ(y) 2
3. TESTS BASED ON THE EMPIRICAL PROCESS Theorem 2.1 generates a wide range of ways to test the reference distribution in autoregressive analysis. Some new response surfaces are presented for Kolmogorov–Smirnov and Anderson– Darling tests for Gaussianity. Subsequently, it is discussed how to implement test bands in probability–probability and quantile–quantile plots. In applications it is often useful to work with the Gaussian reference distribution in which case least-squares estimation amounts to maximum likelihood estimation. Asymptotically, Gaussianity is not needed when subsequently drawing inferences on the location of the roots. Testing the Gaussian assumption can, however, help investigators in building better econometric models. An example is the analysis of Hendry and Nielsen (2007, chapter 13) of the daily time series of prices and quantities from the Fulton Fish Market data of Graddy (1995). By looking at tests for normality and at outliers three days are identified in which the market functions C The Author(s). Journal compilation C Royal Economic Society 2009.
373
Empirical process
abnormally, not because of supply shocks, but because of semi-holidays. Thereby an improved empirical analysis can be carried out. 3.1. Test statistics based on the empirical process In many applications it is convenient to summarise the empirical process in a single statistic. Kolmogorov–Smirnov-type statistics are formed as d ˆ Dˆ F = sup |Fˆ {F(x)}| = sup |F(u)| → sup |XF (u)| = DF , x∈R
0≤u≤1
(3.1)
0≤u≤1
d ˆ Dˆ F+ = sup Fˆ {F(x)} = sup F(u) → sup XF (u) = DF+ , x∈R
0≤u≤1
(3.2)
0≤u≤1
where the limiting distributions arise by applying the Continuous Mapping Theorem to the asymptotic distribution reported in Theorem 2.1. These statistics put most weight on values of u close to 0.5. The limiting distributions of DF , DF+ depend on the distribution of F. This contrasts to the situation where the empirical distribution of identically independent variables is compared ˆ is of relevance and a time to a known distribution. In that situation only the uniform term U transformation argument can be employed. When testing for normality the distributions D and D + are of relevance. These were previously tabulated by Stephens (1974, table 1A, case 3). Stephens’ experiment was repeated using a fine grid with 104 points and 106 repetitions giving the numbers reported in Table 1. For practical purposes a convenient approximation to the p-values can be found using a Gamma distribution with matching mean and variance as done for instance for a Dickey–Fuller F-type distribution in Nielsen (1997). Due to the extreme value nature of the Kolmogorov–Smirnov statistic, the Gamma approximation under-fits the extreme upper tail slightly. For the reported quantiles the relative error of the Gamma quantiles compared to the simulated quantiles was at most 3.0%, and at most 1.8% when excluding the two most extreme quantiles. It was also attempted to fit a Weibull distribution, as the asymptotic distribution of the DF+ statistic based on the uniform empirical process is Weibull; see Billingsley (1968, p. 85). The Weibull distribution can be fitted using mean and variance of the log-transformed statistic; see Johnson et al. (1994, section 21.4). For the empirical process of residuals the fit is, however, much worse than the Gamma fit, with relative errors of up to 17%. While the Kolmogorov–Smirnov statistic puts most weight on deviations in the middle of the distribution, Anderson and Darling (1952) considered the possibility of constructing ˆ Kolmogorov–Smirnov-type statistics for the standardised empirical process, Zˆ F (u) = F(u)/ F (u)}. By the Continuous Mapping Theorem then Zˆ F (u) → ZF (u) in distribution on D[b, c] std{X for [b, c] ⊂ (0, 1), where ZF = XF (u)/std{XF (u)}. Anticipating the tightness problem, that ˆ ˇ → ∞ in probability as shown by Cibisov (1966), they therefore sup0
D D+
E
Var
0.631 0.547
0.0219 0.0231
Table 1. Simulated distribution of Dˆ and Dˆ + . E log Var log 50% 80% 90% −0.487 −0.640
0.0526 0.0731
0.612 0.524
C The Author(s). Journal compilation C Royal Economic Society 2009.
0.748 0.666
0.830 0.753
95%
97.5%
99%
0.903 0.831
0.971 0.903
1.053 0.990
374
E. Engler and B. Nielsen Table 2. Response surface in a for expectation and variance of Kˆ ,a and Rˆ ,a . 1 a a2 a −1 a −2 ( 21 − a)−1
K ,a R ,a
−0.04406
E Var E
1.969 0.365 1.672
1.405 −0.291 −1.583
0.141 4.904
Var
0.135
0.053
−1.293
0.0008920
−0.02773
( 21 − a)−2
0.00028 −0.00001 0.03722
−0.0000677
0.02743
−0.0000151
symmetric interval this gives Kˆ F,a =
d
ˆ F (u)| → |Z
sup 1 1 2 −a
|ZF (u)| = KF,a .
sup 1 1 2 −a
The empirical quantile process converges in a similar way so that ˆ F (x) = R
√ XF (x) = RF (x) T [Fˆ −1 {F(x)} − x] ⇒ f(x)
(3.3)
on D[b, c] for [b, c] ⊂ R. A Kolmogorov–Smirnov-type statistic can be constructed for the empirical quantile process Rˆ F,a =
d
ˆ F (u)| → |R
sup 1 1 2 −a
|RF (u)| = RF,a ,
sup 1 1 2 −a
whereas the Anderson–Darling-type statistic for the empirical quantile process satisfies R ˆ F (u) d ˆ F,a = |ZF (u)| = KF,a , Q sup sup → ˆ 1 1 −a
2
2
2
and has the same limiting distribution as the Kolmogorov–Smirnov statistic. When testing for normality the distributions K ,a and R ,a are of relevance. These were simulated for a range of a values, a fine grid over u with 104 points and 106 repetitions. Response surfaces in a for the expectation and variance of K ,a and R ,a are reported in Table 2 using 22 values of a chosen as (0.05:0.40, 0.05), (0.41:0.49, 0.01), (0.491:0.495, 0.001). In all cases, the R2 of the fits exceed 0.9995. It is not advisable to extrapolate the response surface for values of a outside 0.01 < a < 0.495. For a given value of a the expectation and variance are computed and the distribution approximated using a Gamma distribution as above. It is evident that the response surfaces diverge for a → 0.5. Different choices of a will emphasize different departures ˇ from normality. Due to the Cibisov result the literature does not give any guidance towards the choice of a as a function of the sample size. 3.2. P–P and Q–Q plots Pointwise and simultaneous test bands can be constructed for probability–probability and quantile–quantile plots, also called P–P and Q–Q plots, using the results presented here. Pointwise bands could be used as a diagnostic tool when the nature of the departure from the reference distribution is unknown, whereas the simultaneous bands are used to detect more specific types of departures. C The Author(s). Journal compilation C Royal Economic Society 2009.
Empirical process
375
ˆ Probability–probability plots are plots of uˆ = F(x) against u = F(x) on a [0, 1]2 -square. Test bands of the type u ± n−1/2 c α σ u can be constructed. Three bands are noted. Pointwise test bands can be established from Theorem 2.1, where c α is the 1 − α/2 quantile of the standard normal distribution and σ 2u is the variance Var{XF (u)} at the point u. For the standard normal case, F = , the variance is reported in (2.10). For other reference distributions it can be computed using (2.8). Kolmogorov–Smirnov bands can be constructed by σ u = 1 and choosing c α from limiting distributions of the Kolmogorov–Smirnov statistics. Such bands emphasise departures in the middle of the distribution. Two-sided bands are found by choosing c α as the 1 − α quantile of DF , whereas one-sided bands are found from the 1 − α/2 quantile of DF+ . Anderson–Darling bands are constructed by first choosing a value of a (perhaps 0.3, 0.4 or 0.45). Then σ u is chosen as for the pointwise bands, whereas c α is chosen as the 1 − α quantile of K ,a . Quantile–quantile plots are plots of xˆ = Fˆ −1 (u) against x = F−1 (u) on an R2 -square. Test bands of the type u ± n−1/2 c α σ u can be constructed. Two bands are noted. Pointwise test bands are established directly from Theorem 2.1 and (3.3). Here c α is the 1 − α/2 quantile of the standard normal distribution, and σ u equals std[XF {F(x)}]/f(x) at the point x. For the standard normal case, F = , the variance is reported in (2.10). Kolmogorov–Smirnov bands can be constructed by σ u = 1 and choosing c α from the 1 − α quantile of RF .
ACKNOWLEDGMENTS The numerical results were generated using Ox; see Doornik (1999). The second author received financial support from ESRC grant RES-000-27-0179.
REFERENCES Anderson, T. W. (1959). On asymptotic distributions of estimates of parameters of stochastic difference equations. Annals of Mathematical Statistics 30, 676–87. Anderson, T. W. and D. A. Darling (1952). Asymptotic theory of certain ‘goodness of fit’ criteria based on stochastic processes. Annals of Mathematical Statistics 23, 193–212. Bai, J. (1994). Weak convergence of the sequential empirical processes of residuals in ARMA models. Annals of Statistics 22, 2051–61. Bai, J. (2003). Testing parametric conditional distributions of dynamic models. Review of Economics and Statistics 85, 531–49. Billingsley, P. (1968). Convergence of Probability Measures. New York: Wiley. Boldin, M. V. (1981). Estimation of the distribution of noise in an autoregressive scheme. Theory of Probability and its Applications 27, 866–71. Brown, B. M. and G. K. Eagleson (1971). Martingale convergence to infinitely divisible laws with finite variance. Transactions of the American Mathematical Society 162, 449–53. Chan, N. H. (1989). Asymptotic inference for unstable autoregressive time series with drifts. Journal of Statistical Planning and Inference 23, 301–12. Chan, N. H. and C. Z. Wei (1988). Limiting distributions of least squares estimates of unstable autoregressive processes. Annals of Statistics 16, 367–401.
C The Author(s). Journal compilation C Royal Economic Society 2009.
376
E. Engler and B. Nielsen
ˇ Cibisov, D. M. (1966). Some theorems on the limiting behavior of the empirical distribution function. Selected Translations in Mathematical Statistics and Probability 6, 147–56. Doornik, J. A. (1999). Object-Oriented Matrix Programming Using Ox (3rd ed.). London: Timberlake Consultants Press. Duflo, M., R. Senoussi and R. Touati (1991). Propri´et´es asymptotiques presque sˆure de l’estimateur des moindres carr´es d’un mod`ele autor´egressif vectoriel. Annales de l’Institut Henri Poincar´e – Probabilit´es et Statistiques 27, 1–25. Graddy, K. (1995). Testing for imperfect competition at the Fulton Fish Market. RAND Journal of Economics 26, 75–92. Hendry, D. F. and B. Nielsen (2007). Econometric Modeling. Princeton, NJ: Princeton University Press. Johansen, S. (2000). A Bartlett correction factor for tests on the cointegrating relations. Econometric Theory 16, 740–78. Johnson, N. L., S. Kotz and N. Balakrishnan (1994). Continuous Univariate Distributions. New York: Wiley. Koul, H. L. (2002). Weighted Empirical Processes in Dynamic Nonlinear Models (2nd ed.). New York: Springer. Koul, H. L. and S. Leventhal (1989). Weak convergence of the residual empirical process in explosive autoregression. Annals of Statistics 17, 1784–94. Lai, T. L. and C. Z. Wei (1985). Asymptotic properties of multivariate weighted sums with applications to stochastic regression in linear dynamic systems. In P. R. Krishnaiah (Ed.), Multivariate Analysis, Volume VI, 375–93. Amsterdam: Elsevier. Lee, S. and C. Z. Wei (1999). On residual empirical processes of stochastic regression models with applications to time series. Annals of Statistics 27, 237–61. Ling, S. (1998). Weak convergence of the sequential empirical processes of residuals in nonstationary autoregressive models. Annals of Statistics 26, 741–54. Loynes, R. M. (1980). The empirical distribution function of residuals from generalized regression. Annals of Statistics 8, 285–98. Na, S., S. Lee and H. Park (2005). Sequential empirical process in autoregressive models with measurement errors. Journal of Statistical Planning and Inference 136, 4204–16. Nielsen, B. (1997). Bartlett correction of the unit root test in autoregressive models. Biometrika 84, 500–04. Nielsen, B. (2001). The asymptotic distribution of unit root tests of unstable autoregressive processes. Econometrica 69, 211–19. Nielsen, B. (2005). Strong consistency results for least squares estimators in general vector autoregressions with deterministic terms. Econometric Theory 21, 534–61. Nielsen, B. (2006a). Correlograms for non-stationary autoregressions. Journal of the Royal Statistical Society B68, 707–20. Nielsen, B. (2006b). Order determination in general vector autoregressions. In H.-C. Ho, C.-K. Ing and T. L. Lai (Eds.), Time Series and Related Topics: In Memory of Ching-Zong Wei, 93–112. IMS Lecture Notes and Monograph Series, Volume 52. Beachwood, OH: Institute of Mathematical Statistics. Pierce, D. A. (1985). Testing normality in autoregressive models. Biometrika 72, 293–97. Rao, J. S. and J. Sethuraman (1975). Weak convergence of empirical distribution functions of random variables subject to perturbations and scale factors. Annals of Statistics 3, 299–313. Shorack, G. R. and J. A. Wellner (1986). Empirical Processes with Applications to Statistics. New York: Wiley. Stephens, M. A. (1974). EDF statistics for goodness of fit and some comparisons. Journal of the American Statistical Association 69, 730–37. C The Author(s). Journal compilation C Royal Economic Society 2009.
377
Empirical process
APPENDIX: PROOFS Theorem 2.1 is now proved. For the asymptotic analysis, it is convenient to decompose the empirical process. To facilitate this, the set (ˆεt ≤ x) is rewritten in three steps: First, both sides of the inequality are scaled by σˆ /σ to bring the residuals to the population scale; secondly, εt − σˆ εˆ t /σ is added to both sides; and thirdly, x is added and subtracted on the right-hand side. This gives (ˆεt ≤ x) = (εt ≤ x + zˆ t ),
(A.1)
where ˆ + bˆt , zˆ t = ax
aˆ =
where
σˆ − 1, σ
σˆ bˆt = εt − εˆ t . σ
(A.2)
Writing the vector autoregression (2.3) in companion form X t = θS t−1 + ξ t , where θ = (A 1 , . . . , A k , μ) and S t−1 = (X t−1 , . . . , Xt−k , D t−1 ) , and using that σ ε t = (1, − ρ)ξ and ξˆt − ξt = −(θˆ − θ)St−1 it follows that ˆ ξˆt = (0, ρˆ − ρ)ξt + (1, −ρ)( ˆ θˆ − θ)St−1 . σ bˆt = (1, −ρ)ξt − (1, −ρ)
(A.3)
The empirical distribution of the residuals can then be decomposed as ˆ ˆ ˆ ˆ F(u) = U(u) + V(u) + W(u),
(A.4)
where T 1 ˆ 1{εt ≤F−1 (u)} − u , U(u) = √ T t=1 T 1 ˆ V(u) = f{F−1 (u)} √ νt , T t=1
where
νt = εt + 12 F−1 (u) εt2 − 1 ,
T 1 ˆ 1{ˆεt ≤F−1 (u)} − 1{εt ≤F−1 (u)} − f{F−1 (u)}νt . W(u) = √ T t=1
ˆ and V ˆ are the leading terms while W, ˆ vanishes. For the analysis of W ˆ it is convenient The components U ˆ ˆ ˆ ˆ ˆ to decompose W = W1 + W2 + W3 + W4 , letting F(x) = u, T ˆ 1 (u) = f{F−1 (u)} √1 W zˆ t − εt − 12 F−1 (u) εt2 − 1 , T t=1 T ˆ 2 (u) = √1 ˆ , W 1(εt ≤x+ax) ˆ − 1(εt ≤x) − f(x)ax T t=1 T ˆ 3 (u) = √1 ˆ + bˆt ) − F(x + ax) ˆ − f(x)bˆt }, W {F(x + ax T t=1 T ˆ 4 (u) = √1 ˆ + bˆt ) + F(x + ax) ˆ W 1(εt ≤x+ax+ . ˆ − F(x + ax ˆ bˆt ) − 1(εt ≤x+ax) T t=1
ˆ 1 is linear in the estimation error for the parameters, so this is where the intercept plays a role. The term W ˆ 3 and W ˆ 4 show the influence of ˆ The term W2 shows the influence of the estimation error for scale, while W the estimation of the expectation parameters. These terms are shown to vanish in Theorems A.1–A.4. It is convenient to establish a result concerning the least-squares estimators. L EMMA A.1. Suppose model (2.3) and Assumptions 2.1, 2.2 and 2.7 are satisfied. Then C The Author(s). Journal compilation C Royal Economic Society 2009.
378 (i) (ii) (iii) (iv)
E. Engler and B. Nielsen T 1/2 (σˆ 2 − σ 2 ) = T −1/2 σ 2 Tt=1 (εt2 − 1) + oP (1) = OP (1). 1/2 T (ρˆ − ρ) = OP (1). −1/2 ). aˆ = O P (T T −1/2 ˆ t2 = (1 + x 2 )oP (1). T t=1 z
Proof: Define Mξ ξ =
T
ξt ξt ,
t=1
Mξ S =
T
ξt St−1 ,
MSS =
t=1
T
St−1 St−1 .
(A.5)
t=1
ˆ = T −1 Mξ ξ + o(T −1/2 ) a.s. due to Nielsen (2005, theorem 2.6) under Assumptions 2.1, (i, ii) It holds 2.2 and 2.7, noting that λ > 4 in (2.6). The Central Limit Theorem for martingale differences √ by Brown ˆ − ) is and Eagleson (1971) is applicable when (2.6) and (2.7) are satisfied, and implies that T ( asymptotically normal. Using the functional δ-method it is seen that this property is shared by σˆ and ρ. ˆ (iii) Apply (i) and the Taylor expansion σˆ −1= σ
σˆ 2 − σ 2 σˆ 2 − σ 2 1+ − 1 = + OP σ2 2σ 2
σˆ 2 − σ 2 σ2
2 .
ˆ + bˆt and the expression for bˆt in (A.3). Then (iv) Recall that zˆ t = ax ˆ 2 ≤ 2(σ ax) ˆ 2 + 2(σ bˆt )2 (σ b) ˆ 2 + 4(0, ρˆ − ρ)ξt ξt (0, ρˆ − ρ) ≤ 2(σ ax) + 4(1, −ρ)( ˆ θˆ − θ)St−1 St−1 (θˆ − θ) (1, −ρ) ˆ by the inequality (x + y)2 ≤ 2(x 2 + y 2 ). Noting that θˆ − θ = Mξ S M−1 SS this implies T
ˆ 2 + 4(0, ρˆ − ρ)Mξ ξ (0, ρˆ − ρ) (σ zˆ t )2 ≤ 2T (σ ax)
t=1
+ 4(1, −ρ)M ˆ ξ S M−1 ˆ . SS MSξ (1, −ρ) 1/2 It holds Mξ S M−1 ) and Mξ ξ = OP (1) due to Nielsen (2005, theorem 2.6) under AssumSS MSξ = oP (T ptions 2.1, 2.2 and 2.7. Combine this with (i) and (ii).
ˆ 1 vanishes. Here the inclusion of the intercept, as stated in It is now argued that the component W Assumption 2.3, is crucial. T HEOREM A.1. Suppose model (2.3) and Assumptions 2.1–2.3, 2.6 and 2.7 are satisfied. Then ˆ 1 (u)| = oP (1). sup0≤u≤1 |W Proof: By (A.2) then S = Tt=1 (ˆzt − νt ) is rewritten as S=
σˆ − 1 F−1 (u) + εt − εˆ t − εt − 12 F−1 (u) εt2 − 1 . σ σ
T σˆ t=1
Due to the inclusion of an intercept in Assumption 2.3 then
T t=1
εˆ t = 0. Since the ε t -terms cancel then
T 1 2 σˆ −1 − ε −1 . S = F (u) T σ 2 t=1 t −1
C The Author(s). Journal compilation C Royal Economic Society 2009.
Empirical process
379
Due to Lemma A.1 then S = F−1 (u)oP (T 1/2 ). Thus by Assumption 2.6 that supx∈R |xf(x)| < ∞ then ˆ 1 (u) = f{F−1 (u)}T −1/2 S vanishes. W ˆ 2 vanishes using a result of Koul (2002). It is now argued that the component W T HEOREM A.2. Suppose model (2.3) and Assumptions 2.2, 2.6 and 2.7 are satisfied. Then ˆ 2 (u)| = oP (1). sup0≤u≤1 |W Proof: From Lemma A.1(iii) it follows that aˆ is OP (T −1/2 ). As pointed out by Rao and Sethuraman (1975) ˆ < b with probability close to one. Thus it suffices to and Loynes (1980) a bound b > 0 can be found so |a| show T 1 1{εt ≤x(1+T −1/2 s)} − 1(εt ≤x) − f(x)T −1/2 sx = oP (1). sup √ x∈R T t=1 |s|
This result follows from Koul (2002, corollary 2.3.2, p. 59). A set of assumptions have to be checked. First, note that, in the notation of Koul (2002), n = T , dni = T −1/2 , cni = 0, Xni = εt , H (x) = Fni = F(x), fni (x) = f(x). Since d ni is uniform in t, the conditions to d ni in N1 and N2 of Koul (2002, p. 16) are satisfied. Since c ni = 0 then the conditions to c ni in (2.3.6) and (2.3.7) of Koul (2002, p. 52) are satisfied. The conditions F1, F2, F3 to f of Koul (2002, p. 59) are satisfied by Assumption 2.6 as follows: F1 requires uniform continuity of f, which is satisfied since f is differentiable and supx∈R |f (x)| < ∞; F2 requires f to be positive; F3 requires supx∈R |xf(x)| < ∞. T HEOREM A.3. Suppose model (2.3) and Assumptions 2.1, 2.2, 2.6 and 2.7 are satisfied. Then ˆ 3 (u)| = oP (1). sup0≤u≤1 |W ˆ 3 (u) is written as an integral Proof: At first W T T 1 x+ˆzt ˆ 3 (u) = √1 {F(x + zˆ t ) − F(x) − f(x)ˆzt } = √ {f(y) − f(x)} dy. W T t=1 T t=1 x
By the triangle inequality T x+ˆzt ˆ 3 (u)| ≤ √1 |f(y) − f(x)| dy. |W T t=1 x
The integrand can be bounded by its maximum, so T ˆ 3 (u)| ≤ √1 |ˆzt | max |f(x + h) − f(x)|. |W |h|≤|ˆzt | T t=1
The Mean Value Theorem then implies a further bound, T ˆ 3 (u)| ≤ √2 |ˆzt |2 max |f (x + h)|. |W |h|≤|ˆzt | T t=1
Taking the maximum over the entire real axis, and using Lemma A.1(iv), which requires Assumptions 2.1, 2.2 and 2.7, gives ˆ 3 (u)| ≤ oP (1) sup |(1 + x 2 )f (x)|, |W x∈R
which vanishes due to Assumption 2.6.
ˆ 4 vanishes. Two ideas of Lee and Wei (1999) are used. First, to deal with the It is now proved that W issue that for explosive component WT W T and Tt=1 Wt Wt , the largest and the smallest components are C The Author(s). Journal compilation C Royal Economic Society 2009.
380
E. Engler and B. Nielsen
treated separately. Lee and Wei do not actually consider explosive and non-explosive components jointly, but the joint evaluation turns out to not to pose any problems. Secondly, theorem 2.2 of Lee and Wei gives an asymptotic uniform linearity property for triangular arrays, which can be used here. T HEOREM A.4. Suppose model (2.3) and Assumptions 2.1, 2.2, 2.4–2.7 are satisfied. Then ˆ 4 (u)| = oP (1). sup0≤u≤1 |W ˆ bˆt defined in (A.2), let xˆ = x(1 + a). ˆ Proof: Some notation is needed. Let u = F(x) and, recalling a, Define ˆ + bˆt ) + F(x). ˆ ˆ = 1(εt ≤x+ w(t, x) ˆ − F(x ˆ bˆt ) − 1(εt ≤x) ˆ4=W ˆ 4,1 + W ˆ 4,2 where Decompose W T −g(T ) ˆ 4,1 (u) = √1 ˆ W w(t, x), T t=1
1 ˆ 4,2 (x) ˆ = √ W T
T
ˆ w(t, x),
t=T −g(T )+1
√ for some function g(T ) chosen so g(T )/ T → 0 and g(T )/log T → ∞. ˆ 4,2 . It is immediately seen that |w(t, u)| ≤ 2, so Analysis of W T
ˆ 4,2 (u)| ≤ √1 sup |W T 0≤u≤1
2g(T ) → 0. 2= √ T t=T −g(T )+1
ˆ 4,1 . First, note that taking supremum over x ∈ R and over xˆ ∈ R gives the same supremum Analysis of W so
−g(T ) −g(T ) 1 T 1 T ˆ 4,1 (u)| = sup √ ˆ = sup √ w(t, x) w(t, x) . sup |W x∈R T x∈R T 0≤u≤1 t=1
t=1
Secondly, this can be written as T 1 sup √ {1(εt ≤x−αT zT t ) − 1(εt ≤x) − F(x − αT zT t ) + F(x)} , x∈R T t=1
where, due to (A.3), and for some normalisation matrix N T to be described below,
(0, Ip−1 )ξt 1/2 −1/2 ˆ αT = −{(ρˆ − ρ), (1, −ρ)( ˆ θ − θ )}NT , zT t = NT 1{t≤T −g(T )} . St−1 By construction the triangular array z T t is Gt−1 -measurable; see Assumption 2.5. Thirdly, the desired result now follows if the conditions of Lee and Wei (1999, corollary 2.1) can be established. First, ε t are independent and identically distributed with distribution function F according to Assumption 2.4, and independent of Gt−1 by Assumption 2.5. The vectors α T and z T t have a dimension not depending on T where z T t is Gt−1 -measurable. Since the ε t ’s have marginal distribution function F with uniformly bounded second derivatives by Assumption 2.6 then Lee and Wei’s condition (2.11) is trivially satisfied. Thus, it is left to show that α T and Tt=1 zT t zT t are OP (1). To show Tt=1 zT t zT t = OP (1) note that a matrix M exists so Rt−1 Rt R 0 eR,t = = , MSt = 0 W Wt Wt−1 eW ,t where the absolute values of the eigenvalues of R, W are at most one and greater than one, respectively; see Nielsen (2005, section 3). The deterministic components are therefore included in the R t process. Thus, C The Author(s). Journal compilation C Royal Economic Society 2009.
Empirical process
381
zT t = {ξ t (0, Ip−1 ) , R t−1 , W t−1 }1 . Accordingly, let N T = diag(N ξ , N R , N W ) be block diagonal T −g(T ) {t≤T −g(T )} T −g(T ) T −g(T ) Rt−1 Rt−1 and NW = t=1 Wt−1 Wt−1 . It with Nξ = (0, Ip−1 ) t=1 ξt ξt (0, Ip−1 ) , NR = t=1 suffices to note that z T t has asymptotically uncorrelated blocks, a.s.; see Nielsen (2005, Theorems 2.4, 9.1 and 9.2). ρˆ = O(1) and that θˆ − θ = Mξ S M−1 the notation in To show αT = OP (1) note that SS recalling T , Mξ W = Tt=1 ξt Wt−1 and (A.5). In the same way let Mξ R = t=1 ξt Rt−1 , MRR = Tt=1 Rt−1 Rt−1 T MW W = t=1 Wt−1 Wt−1 . Due to the asymptotic uncorrelatedness of R t and W t it suffices to show (a) −1/2 1/2 1/2 −1 (ρˆ − ρ)Nξ = OP (1), (b) Mξ R M−1 RR NR = OP (1) and (c) Mξ W MW W NW = OP (1). First, (a) follows from Lemma A.1(ii) and Nielsen (2005, Theorems 6.1). Secondly, (b) follows by the kind of arguments −1/2 employed by Chan and Wei (1988) and Chan (1989). Thirdly, (c) follows since Mξ W MW W = o(T 1/4 ) −1/2 1/2 −g(T ) a.s. and MW W NW = O{W } a.s. by Nielsen (2005, Theorem 2.4, Corollary 7.2) and by noting that T 1/4 W−g(T ) = o(1) when g(T )/log T → ∞.
ˆ +V ˆ +W ˆ 1+W ˆ 2+W ˆ 3+W ˆ 4 . Theorems A.1–A.4 show that Proof of Theorem 2.1: Decompose Fˆ = U ˆ 2, W ˆ 3, W ˆ 4 vanish. The terms U, ˆ V ˆ only involve the errors ε t which are independent and identically ˆ 1, W W distributed by Assumption 2.4. Thus the convergence of these terms follows as in Shorack and Wellner (1986, pp. 197–200).
C The Author(s). Journal compilation C Royal Economic Society 2009.
The
Econometrics Journal Econometrics Journal (2009), volume 12, pp. 382–395. doi: 10.1111/j.1368-423X.2009.00291.x
A note on non-parametric estimation with predicted variables S TEFAN S PERLICH † †
¨ Georg-August Universit¨at G¨ottingen, Institut f¨ur Statistik und Okonometrie, Platz der G¨ottinger Sieben 5, 37073 G¨ottingen, Germany E-mail:
[email protected]
First version received: December 2007; final version accepted: January 2009
Summary This article gives the asymptotic properties of non-parametric kernel-based density and regression estimators when one of the variables is predicted. Such variables, also known as ‘constructed variables’ or ‘generated predictors’, occur quite frequently in econometric and applied economic analysis. The impact of using predicted rather than observed values on the properties of estimators has been extensively studied in the fully parametric context. The results derived here are applicable to the general situation in which the predictor is estimated using a consistent non-parametric method with standard convergence rates. Therefore, the presented results are, generally speaking, the asymptotics for semi-nonparametric two-step (or plug-in) estimation problems. The case of parametric estimation based on non-parametric predictors is also covered. Keywords: Constructed variables, Generated regressors, Non-parametric estimation, Nonparametric instruments, Non-parametric plug-in methods, Predicted variables.
1. INTRODUCTION In econometrics, estimation problems with generated regressors are rather common. Examples are estimation problems with endogeneity, simultaneous equation systems, link function testing with unknown indices, selection bias problems or unobserved variables like missing or data matching problems. Estimation with constructed variables can also be used for dimension reduction; think e.g. of weak separable or multiple index models where the indices of a nonparametric link function are unknown functions themselves. A list of particular examples can be found in Pagan (1984) or Oxley and McAleer (1993), where the problem is studied in the parametric context. Other examples are given in Rilstone (1996) and Stengos and Yan (2001), who considered semi-parametric nested regression functions. Lewbel and Linton (2007) dealt with non-parametrically generated regressors when considering homothetically separable functions. Newey et al. (1999) and Rodr´ıguez-P´oo et al. (2005) analysed semi-parametric simultaneous equation systems, and Das et al. (2003) considered a specific case of nonparametric estimation in sample selection models, with possibly endogenous covariates. Newey (1990) derived efficient instrumental variables estimators with non-parametric plug-in methods. Horowitz (2001) considered non- and semi-parametric weakly separable models, where nonparametrically estimated components appear in a joint link function. Li and Wooldrige (2002) estimated partial linear models with generated regressors. As mentioned, another typical problem C The Author(s). Journal compilation C Royal Economic Society 2009. Published by Blackwell Publishing Ltd, 9600 Garsington Road,
Oxford OX4 2DQ, UK and 350 Main Street, Malden, MA, 02148, USA.
Non-parametric estimation with predicted variables
383
where predicted regressors appear is the so-called data mapping; see e.g. Elbers et al. (2003), who tried to estimate poverty and inequality from census data but first had to predict income and consumption. For further examples see also Pagan and Ullah (1999) or Li and Racine (2007). Note, however, that all these works give only solutions specific to their problem. We will consider the general regression problem Yi = m(Zi , Xi ) + εi σε (Zi , Xi ),
i = 1, . . . , n,
(1.1)
where Yi ∈ and the Zi ∈ q are observable regressors, but the Xi ∈ have to be predicted. The errors ε i are assumed to fulfil E[ε i | Zi , Xi ] = 0 and Var[ε i | Zi , Xi ] = 1. It is supposed that the Xi can be predicted non- , semi- or even parametrically. To be as general as possible, we do not treat the estimation of the regressors xi , instead derive the asymptotics for any consistent estimate xˆi of xi with given bias and variance. Sometimes the distribution of unobserved but predicted variables is of interest; e.g. Van Keilegom and Veraverbeke (2002) derived the asymptotic properties of density estimators for residuals in censored regression models. In economic theory, it is often of particular interest to check for twin peaks, multiple equilibria etc. or to study the convergence in the world income distribution; see e.g. Bianchi (1997) for an early semi-parametric work on this. More recently, Sala-i-Martin (2006) searched for falling poverty, studying the convergence of the world income distribution. Considering the development over time, he had the problem of plenty of missing data, which he therefore predicted. Both studies used non-parametric kernel density estimates based on predictors. Other studies look at non-parametric densities of semi-parametrically estimated return to scales; see Profit and Sperlich (2004), who explored the development of the Czech labour market. Finally, in agricultural economics, efficiency and/or productivity of farms is predicted non-parametrically by DEA, and afterwards studied using non-parametric methods (density estimation and regression). Therefore, we have included the derivation of the asymptotic properties of a non-parametric density estimator of fX (x) when X is predicted. This paper focuses on kernel density and regression estimation. Nevertheless, our results serve as a general guideline when using any other non-parametric methods. We assume the estimation problem to be well defined, i.e. we do not treat the problem of identification, which always depends on the specific model under consideration; see Hoderlein and Mammen (2007) as a most recent reference. We also comment on parametric estimation when non-parametric predictors are used.
2. MAIN RESULTS 2.1. Density estimation Suppose we have continuous unobservables Xi ∈ , with density fX . To predict the realizations xi , we assume that one can construct xˆi , i = 1, . . . , n, e.g. with some instruments W ∈ δ , either from the same data set or a different one (of size N). As said, we deliberately avoid specifying the predictor but need assumptions on its bias and variance converging to zero at specified rates. As the further step will use kernel estimators, we have decided to give these rates in terms of N and a smoothing parameter g with g → 0 for N → ∞. A SSUMPTION P1. Let xˆ be a predictor of x on the support of fX (·). Its deterministic error, ∂ b(x), are both of order O(g 2 ) uniformly. The the bias b(x), and its partial derivative, b (x) := ∂x variance function σ 2u (x) of the stochastic error is uniformly of order O Ng1 δ . C The Author(s). Journal compilation C Royal Economic Society 2009.
384
S. Sperlich
So we can write xˆ = x + b(x) + ux σu (x) with E[ux | X = x] = 0, V [ux | X = x] = 1. Note that we are not assuming an additive error of the original prediction problem. What we assume is the use of a predictor with additive bias and stochastic error, an assumption that holds for almost all estimators that exist in the non- or semi-parametric literature. Depending on the context, we will index u either with x or just with i when referring to xi . For its covariances we assume: A SSUMPTION P2. Both b(·) and σ u (·) are Lipschitz continuous. We have E[ui σu (xi )uj σu (xj )] = O n1 uniformly for i = j = 1, . . . , n. We write u (v, w) := E[u v σ u (v)u w σ u (w)], which is Lipschitz in v and in w. These conditions, most of which concern rates of convergence, are easy to check, though Assumption P2 looks artificial. They are not restrictive; in fact they hold for any standard nonparametric estimator for xi that satisfies the usual regularity conditions. Usually, the bias and variance of a regression function are written as functions of the regressors, say b(w) and σ 2u (w). In an abuse of notation, we write b(x), σ 2u (x) to emphasize that we focus here on a particular point x. Strictly speaking, this notation is only precise for the case in which w −→ x is monotone. We consider the traditional kernel density estimator but replacing the observation xi with its pre-estimate xˆi , i = 1, . . . , n, respectively, Xi by Xˆ i : 1 f˜X (x) = Kh (Xˆ i − x), n i=1 n
(2.1)
with Kh (u) = h1 K(u/h), K(·) being some kernel and h the bandwidth. To give the asymptotic properties of (2.1), we assume the following: A SSUMPTION D1. The density function fX (·) has a bounded and continuous second derivative. A SSUMPTION D2. For the smoothing parameters h and g, we assume that both go to zero, but that (nh) and (Ngδ ) go to infinity as n, N → ∞ with δ from Assumption P1. Furthermore, g 2 h−1 and (hNgδ )−1 go to zero. A SSUMPTION D3. The kernel function K(·) is non-negative, bounded, compactly supported and fulfils K(u)du = 1, uK(u)du = 0. Additionally, it has a continuous second derivative with K (u)du = K (u)du = 0. For the results that follow, the conditions on the kernel can be substituted by others. The condition on the second derivative is due to the problem of having constructed variables. The smoothness conditions on the density fX are common in kernel density estimation, whereas g 2 h−1 , (hNgδ )−1 → 0 are assumed to simplify the asymptotics; they are sufficient but not necessary. Then the statistical properties of f˜X (x) are: T HEOREM 2.1. Under Assumptions P1, P2 and D1–D3, x being an interior point of the support of X and f˜X (x) defined in (2.1), (i) (ii) (iii)
E[f˜X (x)] = fX (x) + Bf (x) + o(h2+ g 2 ) where Bf (x) = h2 fX (x)μ2 (K) + {b(x)fX (x) + b (x)fX (x)}μ1 (K ) with μs (L) = v s L(v)dv. 1 1
K 22 fX (x) + nh1 3 σu2 (x)fX (x) K 22 + o nh + Ngδ1nh3 . V [f˜X (x)] = nh √ With t := min{nh, nh3 Ng δ }, t{f˜X (x) − fX (x) − Bf (x)} converges to a centred Gaussian distribution with its variance given implicitly in (ii). 2
C The Author(s). Journal compilation C Royal Economic Society 2009.
Non-parametric estimation with predicted variables
385
The theorem has basically three messages. When the variable of the kernel density estimate is non-parametrically constructed, the bias is augmented by an additive factor that is proportional to b(x). The variance is increased by an additive factor proportional to nh1 3 σu2 (x), i.e. it has the rate of the variance of the predictor times (nh3 )−1 . As a consequence, there are many combinations of h and g, for which the asymptotic distribution of the non-parametric density estimator is not affected by the fact that the xi were predicted. This theorem also reveals what happens when the variables are constructed parametrically. In those cases, we change Assumption P1 by A SSUMPTION P3. Let xˆ be a parametric predictor of x on the support of fX with a deterministic with standard deviation σ u (x). The terms error b(x), its derivative b (x) and a stochastic error b, b and σ u are all uniformly bounded by O √1n . C OROLLARY 2.1. Under Assumptions P2, P3 and D1–D3, and x being an interior point of the 2.1. This means that, support of X, the asymptotic bias and variance of f˜X (x) are 1as in Theorem + h2 , the asymptotic distribution due to the rate of b(x), b (x) and σ 2u (x) being faster than O nh of f˜X (x) is the same as that of the estimator based on the xi , had these been observed. A frequently studied problem is that of including non-parametric nuisance parameters in (semi-) parametric estimation. As we are not aware of a general statement in the literature on finite parameter estimation with non-parametrically constructed variables, we address this point briefly. We consider maximum likelihood estimation of a parameter when non-parametric predictors xˆi are used. The pdf of interest is fX (x; θ 0 ), and we set ϕ(x; θ ) = ln fX (x; θ ). To estimate θ 0 ∈ d out of an open set ⊆ d , we need some (standard) assumptions: A SSUMPTION M1. The parameter set contains an open neighbourhood of θ 0 in which for a.e. x, ϕ is twice differentiable in θ , ∂ϕ/∂x exists and is Lipschitz continuous w.r.t. x. Furthermore, the third-order partial derivatives w.r.t. θ are continuous in θ and bounded by a function M(x) with Eθ0 [M(X)] < ∞. Eθ0 [ ∂θ∂ j ϕ(X; θ )] = 0, E[ ∂θ∂ j ϕX2 (X; θ0 )] < ∞ for all j = 1, . . . , d. ∂2 T = −E ∂θ∂θ Furthermore, the Fisher information i(θ ) := E ∂θ∂ ϕ(X; θ ) ∂θ∂ ϕ(X; θ ) T ϕ(X; θ ) is positive definite. A SSUMPTION M2.
A SSUMPTION M3. The support of fX does not vary with θ ∈ . C OROLLARY 2.2. Under Assumptions P1, P2 and M1–M3 and the existence of the maximum likelihood estimator θ˜ of θ , the expectation of θ˜ is
2 1 ∂ ϕ(X; θ )b(X) + o g 2 + √ θ +E ∂x∂θ n and the variance is
2 ∂ ϕ(v; θ ) ∂ 2 ϕ(w; θ ) u (v, w) fX (v)fX (w) dvdw i −1 (θ ) i −1 (θ ) + i −1 (θ ) ∂x∂θ ∂x∂θ
1 1 . +o + nNg δ n
In general, Corollary 2.2 says that for a parameter g chosen such that both g 2 and (nN)−1/2 g −δ/2 are√of rate n−1/2 (it is sufficient that N −1 g −δ goes to zero), the estimator θ˜ has convergence rate n, i.e. undersmoothing is necessary when predicting the xi to get an efficient C The Author(s). Journal compilation C Royal Economic Society 2009.
386
S. Sperlich
estimator of θ . The need for (nN)−1 g −δ → 0 comes from the diagonal terms of the covariance function u ; see Assumptions P1 and P2. A simple example is the estimation of the mean μ of ¯ˆ = μ + E[b(X)] X using the average of non-parametric predictors xˆi , i = 1, . . . , n. We get E[X] 2 σ ¯ X ˆ = + u (v, w)fX (v)fX (w)dvdw, where the last term is of order O(n−1 ), cf. and Var[X] n Assumption P2. The extension to conditional densities is straightforward. Therefore, even though the result is given in the section on density estimation, it carries over to many interesting (parametric) maximum likelihood problems. This includes any regression where the conditional distribution is assumed to be known up to a finite dimensional parameter. 2.2. Regression Turning to the regression problem (1.1), we first need some more assumptions on the stochastic ˆ concerning the correlation with ε and some higher moments: part u(x) of x, A SSUMPTION P4. E[uri σur (xi )εjs σεs (xj )] = O N r1grδ uniformly ∀ i, j , s = 1, 2, r = 1, 2; and γ γ γ E[ux σu (x)] = O(σu (x)) = o Ng1 δ for γ > 2. A SSUMPTION P5. For the correlation of the errors, we have E[εi εj usi σus (xi )urj σur (xj )] = O (Nghδ )r+s uniformly ∀ i =j and r = 1, 2, s = 0, 1, 2. Again, it can be easily verified that this holds even for the case when xˆ is estimated from the same sample, i.e. with error terms strongly correlated with the ε i (or even identical like in Pendakur and Sperlich, 2009, which predicts real expenditures to estimate Hicksian demand functions). In that particular case, the third or fourth cumulant in Assumption P5 is even equal to zero. So, Assumption P5 seems to be somewhat artificial. However, depending on the predictor of X, rather complex relationships between the errors uj and ε i might occur. To define a multivariate kernel regression estimator, we need some more notation. Define matrix H ∈ (q+1)×(q+1) as the bandwidths matrix with (0, . . . , 0, h) in the last row and (0, . . . , 0, h)T in the last column. We define the multivariate kernel K : q+1 → and assume (for convenience) that K = L × K, with K(·) being one-dimensional and L(·) a q-dimensional kernel. Let Zˆ i = (Zi , Xˆ i )T and consider the estimator
n ˜ x) = i=1 m(z, n
K(H {Zˆ i − (z, x)T })Yi . K(H {Zˆ i − (z, x)T })
(2.2)
i=1
Assumptions D1–D3 have to be exchanged by the following ones: A SSUMPTION R1. derivatives.
The mean function m(·) has bounded and continuous second partial
A SSUMPTION R2. The variance function, σ 2ε (·), is bounded and Lipschitz continuous. A SSUMPTION R3. The density f (·) of (Z, X) is uniformly bounded away from zero and infinity and is Lipschitz continuous. A SSUMPTION R4. The and multivariate kernel K(·) is bounded, compactly supported T = 1 and wK(w)dw = 0. Additionally, ww K(w)dw = fulfils (with w ∈ q+1 ) K(w)dw μ2 (K)I(q+1) , where μ2 (K) = wj2 K(w)dw is independent of j = 1, . . . , q + 1. C The Author(s). Journal compilation C Royal Economic Society 2009.
Non-parametric estimation with predicted variables
387
A SSUMPTION R5. For the bandwidths we assume Assumption D2 and also that each entry of matrix H and the term {n det(H )}−1 tends to zero as n → ∞. The ratio of the largest to the smallest eigenvalue of H is bounded by a constant for all n. For discussion of the conditions, especially on H, we refer to Ruppert and Wand (1994). The smoothness assumptions on the density can be modified for discrete variables or fixed designs. Denote the gradient of a function φ, if it exists, by ∇ φ and its Hessian matrix by Hφ . Then we can state: T HEOREM 2.2. Under Assumptions R1–R5, D3, P1, P2, P4 and P5, (z, x) being an interior ˜ defined as in (2.2), one has that point of the support of (Z, X) and m(·) (i)
(ii) (iii)
˜ x)] = m(z, x) + Bm (z, x) + o(h2 + g 2 ) with Bm (z, x) = E[m(z,
∇m (z, x)T H H T ∇f (z, x) 1 ∂ T + tr{H Hm (z, x)H } + b(x)μ1 (K ) m(z, x). μ2 (K) f (z, x) 2 ∂x 2 1 σ 2 (z,x) 1 ˜ x)] = n det(H V [m(z,
K 22 fε(z,x) + σu2 (x)μ21 (K ) ∂m(z,x) + o n det(H + Ng1 δ . ) ∂x ) √ ˜ x) − m(z, x) − Bm (z, x)) converges to a centred With t := min{n det(H ), Ng δ }, t(m(z, Gaussian distribution with its variance given implicitly in (ii).
We find similar messages to those of Theorem 2.1. The bias is augmented by an additive factor proportional to b(x), and the variance is increased by an additive factor proportional to σu2 (x). Thus e.g. if δ = 1 and g is of the same rate as h, the asymptotic rate is not affected. Again, the bias of the predictor only affects the bias of the final estimator, and the variance of the predictor only affects the variance. In contrast to density estimation, the extension to higher dimensions of X is of interest. ˆ x, b(x), U ∈ p , p ≥ 1, with notation x = (x 1 , x 2 , . . . , x p )T Consider xˆ = x + b(x) + U (x), x, and U(x) being a mean zero heteroscedastic error with Var[U (x)] = u (x). For the estimation, let K be a p-dimensional kernel. Then the bias becomes
∇m (z, x)T H H T ∇f (z, x) 1 T + tr H Hm (z, x)H Bm (z, x) = μ2 (K) f (z, x) 2 p ∂ m(z, x), + b(x)μ1 (K ) j ∂x j =1
and the asymptotic variance becomes σ 2 (z, x) 1 ∂m(z, x) u ∂m(z, x)
K 22 ε + μ21 (K ) . (x) n det(H ) f (z, x) ∂x T ∂x Like in the case of density estimation, it is easy to derive the asymptotics of non-parametric regression estimators with parametric predictors xˆi , i = 1, . . . , n. As a consequence of Theorem 2.2, we obtain the following assumptions: A SSUMPTION P6. E[uri σur (xi )εjs σεs (xj )] = O n1r uniformly ∀ i, j , s = 1, 2, r = 1, 2; and γ γ γ E[ux σu (x)] = σu (x) = o n1 for γ > 2. A SSUMPTION P7. For the third cumulant of the errors, we have E[εi εj ui σu (xi )] = o n1 uniformly ∀ i, j = 1, . . . , n. C The Author(s). Journal compilation C Royal Economic Society 2009.
388
S. Sperlich
C OROLLARY 2.3. Under Assumptions R1–R5, D3, P2, P3, P6 and P7 and (z, x) being an ˜ x) are the interior point of the support of (Z, X), the asymptotic bias and variance of m(z, same as in Theorem 2.2. Due to the faster rate of b(x) and σ 2u (x), the asymptotic distribution of ˜ x) is the same as that of the estimator based on the xi , had these been observed. m(z, Consider the popular example of single or multiple index models E[Y | Z = z] = m(z) = G(zT β) = G(x), x = zT β with G(·) being non-parametric. For those models, it is possible to estimate β and predict x with the parametric rate, yielding an important dimension reduction in the original problem of estimating m(·). Other examples are (generalized) partial linear, transformation or varying coefficient models; see Andrews (1995) for an overview of convergence rates of typical semi-parametric models. When looking at parametric regression estimation with non-parametric predictors, we get similar results for maximum likelihood estimation as in Corollary 2.2, with ϕ now being the logarithm of the conditional density of the dependent variable Y. Here, often more sophisticated methods are necessary to get efficient estimators for the parameters.
3. CONCLUSIONS AND EXTENSIONS We have derived the asymptotic properties for non-parametric kernel density and regression estimation with constructed variables. We did this under general assumptions for the predictors; they can be parametrically, semi-parametrically or non-parametrically generated. Both in density and regression estimation, the bias and variance are composed additively by the classical bias, namely variance, plus a new term dominated by the bias, namely variance of the predictor. In density estimation, a proper undersmoothing in the prediction of the constructed variable can even make the additional terms disappear. These results have direct consequences for all sequential (two-, three- or more-step) non- or semi-parametric estimators. For ease of presentation, we have limited our attention to the estimation of the density and mean function. It might be of interest to extend these results to the estimation of its derivatives. For this, one needs additional continuous higher-order derivatives of the kernel K(·) and of the density fX (·), namely the mean m(·). Nevertheless, it could be done easily by using local polynomials; see Fan and Gijbels (1996). Also the extensions of our results to (generalized) partial linear models, (generalized) additive models, etc. are straightforward. We did not consider higher-dimensional constructed variables for two reasons: they are of little importance in practice, and this extension affects only the notation. The asymptotics and convergence rates will depend strongly on the way in which the predictors enter the model, e.g. whether or not they enter via separable functions. Further, we disregarded the possibility of applying higher smoothness assumptions on the prediction problem for the xi . In that case, one can use bias-reducing predictors. However, in practice, these typically show poor performance unless N is extremely large. Finally, we have restricted our study to the case of i.i.d. observations. The extension to mixing processes requires different approximation techniques and assumptions, but, typically, one obtains the same asymptotic expressions; see e.g. Bosq (1996).
ACKNOWLEDGMENTS This research was financially supported by the Deutsche Forschungsgemeinschaft FOR916. I gratefully acknowledge helpful discussions with Walter Zucchini, Miguel Delgado, Arthur Lewbel and Oliver Linton, and the discussion of two anonymous referees. C The Author(s). Journal compilation C Royal Economic Society 2009.
Non-parametric estimation with predicted variables
389
REFERENCES Andrews, D. W. K. (1995). Nonparametric kernel estimation for semiparametric models. Econometric Theory 11, 560–96. Bianchi, M. (1997). Testing for convergence: evidence from non-parametric multimodality tests. Journal of Applied Econometrics 12, 393–409. Bosq, D. (1996). Nonparametric Statistics for Stochastic Processes: Estimation and Prediction. Lecture Notes Series in Statistics, Volume 110, New York: Springer Verlag. Das, M., W. K. Newey and F. Vella (2003). Nonparametric estimation of sample selection models. Review of Economic Studies 70, 33–58. Elbers, C., J. O. Lanjouw and P. Lanjouw (2003). Micro-level estimation of poverty and inequality. Econometrica 71, 355–64. Fan, J. and I. Gijbels (1996). Local Polynomial Modelling and its Applications. London: Chapman and Hall. Hoderlein, S. and E. Mammen (2007). Identification of marginal effects in nonseparable models without monotonicity. Econometrica 5, 1513–18. Horowitz, J. L. (2001). Nonparametric estimation of a generalized additive model with an unknown link function. Econometrica 69, 499–513. Lewbel, A. and O. Linton (2007). Nonparametric matching and efficient estimators of homothetically separable functions. Econometrica 75, 1209–27. Li, Q. and J. S. Racine (2007). Nonparametric Econometrics. Princeton: Princeton University Press. Li, Q. and J. M. Wooldrige (2002). Semiparametric estimation of partial linear models for dependent data with generated regressors. Econometric Theory 18, 625–45. Newey, W. K. (1990). Efficient instrumental variables estimation of nonlinear models. Econometrica 58, 809–37. Newey, W. K., J. Powell and F. Vella (1999). Nonparametric estimation of triangular simultaneous equation models. Econometrica 67, 565–603. Oxley, L. and M. McAleer (1993). Econometric issues in macroeconomic models with generated regressors. Journal of Economic Surveys 7, 1–40. Pagan, A. R. (1984). Econometric issues in the analysis of regressions with generated regressors. International Economic Review 25, 221–47. Pagan, A. R. and A. Ullah (1999). Nonparametric Econometrics. New York: Cambridge University Press. Pendakur, K. and S. Sperlich (2009). Semiparametric estimation of consumer demand systems in real expenditure. Forthcoming in Journal of Applied Econometrics. Profit, S. and S. Sperlich (2004). Non-uniformity of job-matching in a transition economy—a nonparametric analysis for Czech Republic. Applied Economics 36, 695–714. Rilstone, P. (1996). Nonparametric estimation of models with generated regressors. International Economic Review 37, 299–313. Rodr´ıguez-P´oo, J. M., S. Sperlich and A. I. Fern´andez (2005). Semiparametric three step estimation methods for simultaneous equation systems. Journal of Applied Econometrics 20, 699–721. Ruppert, D. and M. P. Wand (1994). Multivariate locally weighted least squares regression. Annals of Statistics 22, 1346–70. Sala-i-Martin, X. (2006). The world distribution of income: falling poverty and . . . convergence, period. Quarterly Journal of Economics 121, 351–97. Stengos, T. and B. Yan (2001). Double kernel nonparametric estimation in semiparametric econometric models. Nonparametric Statistics 13, 883–906. Van Keilegom, I. and N. Veraverbeke (2002). Density and hazard estimation in censored regression models. Bernoulli 8, 607–25. C The Author(s). Journal compilation C Royal Economic Society 2009.
390
S. Sperlich
APPENDIX: PROOFS For the sake of simplifying notation, we will write f instead of fX , and for up to higher-order terms. This unusual notation is used as the higher order can come from h, g, n, N , Assumptions P1, P3, respectively, P4 or combinations.
Proof of Theorem 2.1: Recall the definition f˜(x) = n1 ni=1 Kh (Xˆ i − x), with Xˆ i = Xi + b(Xi ) + ui σu (Xi ). For κ ∈ (0, 1), the mean value theorem gives n h2 1 E[Kh (Xi − x) + Kh (Xˆ i − x) − Kh (Xi − x)] f (x) + f (x)μ2 (K) E[f˜(x)] = n i=1 2
n {b(Xi ) + ui σu (Xi )}2 b(Xi ) + ui σu (Xi ) Xi − x 1 + E K + n i=1 h2 h 2h3
Xi − x + κb(Xi ) + κui σu (Xi ) , × K h
where the expectation is taken over theXi and ui . The expectation of b(Xi )+uh2i σu (Xi ) K Xih−x can be approximated by
b(x) + vhb (x) K (v)[f (x) + {vh − κb(x)}f (x)]dv{1 + op (1)} μ1 (K ){b(x)f (x) + b (x)f (x)}. h
{b(Xi )+ui σu (Xi )}2 K Xi −x+κb(Xhi )+κui σu (Xi ) 2h3 2 σ u (x)f (x)μ 0 (K )/(2h2 ) for the squared
When approximating the expectation of
up with b2 (x)f (x)μ 0 (K )/(2h2 ) and mixed terms. The assertion (i) follows now from Assumptions D2 and D3. For calculating the variance, we write (with a κ ∈ (0, 1))
, after substitution, we end 1 for the h2 Ng δ
and b(x)O
n 11 Xi − x K n i=1 h h
b(Xi ) + ui σu (Xi ) Xi − x K + h2 h 2
2 {b(Xi ) + ui σu (Xi )} Xi − x + κb(Xi ) + κui σu (Xi ) K + 2h3 h
E[f˜2 (x)] = E
= T1 + 2T2 + T3 , where
⎡
⎤
n n 1 X 1 X i −x j −x ⎦, K T1 = E ⎣ 2 K n j =1 i=1 h2 h h ⎡
n n 1 1 Xj − x b(Xi ) + ui σu (Xi ) Xi − x K K T2 = E ⎣ 2 n j =1 i=1 h h h2 h
{b(Xi ) + ui σu (Xi )}2 Xi − x + κb(Xi ) + κui σu (Xi ) , K + 2h3 h C The Author(s). Journal compilation C Royal Economic Society 2009.
Non-parametric estimation with predicted variables
391
⎡
n n 1 b(Xi ) + ui σu (Xi ) Xi − x K n2 i=1 j =1 h2 h
{b(Xi ) + ui σu (Xi )}2 Xi − x + κb(Xi ) + κui σu (Xi ) K + 2h3 h
b(Xj ) + uj σu (Xj ) Xj − x K × h2 h
{b(Xj ) + uj σu (Xj )}2 Xj − x + κb(Xj ) + κuj σu (Xj ) . + K 2h3 h
T3 = E ⎣
It is well known that T 1 is equal to f 2 (x) + 2f (x)
h2 h4 2 1 f (x)μ2 (K) + f (x)μ22 (K) +
K 22 f (x), 2 4 nh
up to higher-order terms. Then, with Assumptions P1, P2 and D2, D3, we get for T 3
n n ui uj σu (Xi )σu (Xj ) + b(Xi )b(Xj ) Xi − x Xj − x 1 E K K n2 i=1 j =1 h4 h h
1 1 2 = σu2 (v)h−4 K ({x − v}h−1 )f (v)dv + O n n + h−4 b(v)b(w)K ({x − v}h−1 )K ({x − w}h−1 )f (v)f (w)dvdw
1 2
K 22 f (x)σu2 (x) + b2 (x)f (x)μ21 (K ) nh3 + 2b(x)b (x)f (x)f (x)μ21 (K ) + b 2 (x)f 2 (x)μ21 (K ).
Similarly we get T2 μ1 (K )f (x){f (x)b(x) + b (x)f (x)} 1 + h2 {b(x)f (x) + b (x)f (x)}μ1 (K ) μ2 (K)f (x). 2 Now, subtracting E 2 [f˜(x)] from E[f˜2 (x)] we can conclude (ii). Using Assumption P2, assertion (iii) can be derived from the central limit theorem. Proof of Theorem 2.2: As our kernel estimator differs from the classical one only in the use of constructed variables in one regressor, it is sufficient to concentrate on a one-dimensional kernel regression estimator of the form
n K(h−1 {Xˆ i − x})Yi ˜ . m(x) = i=1 n −1 ˆ i=1 K(h {Xi − x}) Including Z ∈ q is straightforward (following the proof in Ruppert and Wand, 1994, for the case without generated regressors), although the notation becomes complicated. With Theorem 2.1, we have n 1 −1 ˆ ˜ Kh (Xi − x){Yi − m(x)} {1 + op (1)}. m(x) − m(x) = f (x) n i=1 C The Author(s). Journal compilation C Royal Economic Society 2009.
392
S. Sperlich
As in our proof of Theorem 2.1, this is, with κ ∈ (0, 1),
n b(Xi ) + ui σu (Xi ) Xi − x 1 {m(Xi ) − m(x) + εi σε (Xi )} Kh (Xi − x) + K n i=1 h2 h
{b(Xi ) + ui σu (Xi )}2 Xi − x + κ{b(Xi ) + ui σu (Xi )} f −1 (x). + K 2h3 h
The last term of the Taylor series is of higher order, thanks to Assumptions D2, D3 and E[u | X] = 0. Taking the expectation gives for the first Taylor term, h2 μ2 (K){m (x)f (x)f −1 (x) + 12 m (x)}, whereas the second can be approximated with Assumptions P4 and P5 by
b(v) v − x {m(v) − m(x)}f (v)dv K h2 h b(x + wh) K (w){m(x + hw) − m(x)}{f (x) + hwf (x)}dw f −1 (x) h b(x)m (x)μ1 (K ),
f −1 (x)
which gives (i). To calculate the variance, we consider f −2 (x) E[T 4 − E[T 4 ]]2 , where for κ ∈ (0, 1), T4 =
n 1 b(Xi ) + ui σu (Xi ) Xi − x {m(Xi ) − m(x) + εi σε (Xi )} Kh (Xi − x) + K n i=1 h2 h
2 {b(Xi ) + ui σu (Xi )} Xi − x + κ{b(Xi ) + ui σu (Xi )} . + K 2h3 h
First, note that f −2 (x)E[T 4 − E[T 4 ]]2 is equal to
n 1 b(Xi ) + ui σu (Xi ) Xi − x εi σε (Xi ) Kh (Xi − x) + K n i=1 h2 h
{b(Xi ) + ui σu (Xi )}2 Xi − x + κ{b(Xi ) + ui σu (Xi )} K + 2h3 h
n ui σu (Xi ) Xi − x 1 {m(Xi ) − m(x)} K + n i=1 h2 h
f −2 (x) · E
2
b(Xi )ui σu (Xi ) Xi − x + κ{b(Xi ) + ui σu (Xi )} K + h3 h = f −2 (x)E[T5 + 2T6 + T7 ] , with T5 =
n n 1 b(Xi ) + ui σu (Xi ) Xi − x ε σ (X )ε σ (X ) K (X − x) + K i ε i j ε j h i n2 i=1 j =1 h2 h {b(Xi ) + ui σu (Xi )}2 K + 2h3
Xi − x + κ{b(Xi ) + ui σu (Xi )} h
C The Author(s). Journal compilation C Royal Economic Society 2009.
Non-parametric estimation with predicted variables
393
{b(Xj ) + uj σu (Xj )}2 b(Xj ) + uj σu (Xj ) Xj − x + × Kh (Xj − x) + K h2 h 2h3
Xj − x + κ{b(Xj ) + uj σu (Xj )} × K , h
n n 1 b(Xi ) + ui σu (Xi ) Xi − x εi σε (Xi ){m(Xj ) − m(x)} Kh (Xi − x) + K T6 = 2 n i=1 j =1 h2 h
{b(Xi ) + ui σu (Xi )}2 Xi − x + κ{b(Xi ) + ui σu (Xi )} K 2h3 h
b(Xj )uj σu (Xj ) Xj − x + κ{b(Xj ) + uj σu (Xj )} uj σu (Xj ) Xj − x + , × K K h2 h h3 h +
T7 =
n n 1 {m(Xi ) − m(x)}{m(Xj ) − m(x)} 2 n i=1 j =1
b(Xi )ui σu (Xi ) Xi − x + κ{b(Xi ) + ui σu (Xi )} Xi − x + K h h3 h
b(Xj )uj σu (Xj ) Xj − x + κ{b(Xj ) + uj σu (Xj )} uj σu (Xj ) Xj − x + . K K × h2 h h3 h
×
ui σu (Xi ) K h2
We define the following abbreviations: Ki := Kh (Xi − x), εx := εx σε (x), εi := εXi ;
Xi − x 1 , υx := ui σu (x), υi := υXi ; Ki := K h h
Xi − x + κ{b(Xi ) + ui σu (Xi )} 1 , bi := b(Xi ). Ki := K h h It is well known that
n 1 1 1 2 2 2 2 f (x)σε (x) K 2 + o . E Ki εi = n2 i=1 nh nh Furthermore, with Assumption P4, for E 1 ≥ max i,j E[ε i ε j υ j ], we get n n 1 E Ki Kj εi εj (bj + υj ) 2 hn i=1 j =1
=
n n n n n 1 1 1 2 + + E K K ε ε b E K K ε υ E Ki Kj εi εj υj i i j j i i j i i hn2 i=1 j =1 hn2 i=1 hn2 i=1 j =i
≤O
=O
g2 nh g2 nh
+O
+O
1 1 nh Ng δ 1 1 nh Ng δ
+ {1 + o(1)}E1
K(z)K (y)h−1 f (x + zh)f (x + yh)dzdy
+ {1 + o(1)}E1 μ1 (K )f (x)f (x).
C The Author(s). Journal compilation C Royal Economic Society 2009.
(A.1)
394
S. Sperlich
Now, with Assumption P5, we have E1 = o same decomposition as above we get
1 Ng δ
. Next, for E 2 := max i,j E[ε i ε j υ 2j ], and applying the
n n 1 E Ki Kj εi εj (bj + υj )2 2h2 n2 i=1 j =1
=O
g4 nh2 g4 nh2
1 nh
+O
1 g2 + 2 2δ hN g hNg δ
1 g2 +O + ≤O hN 2 g 2δ hNg δ With Assumption P5 we have E2 = O N 2hg2δ . Similarly, it can be shown that 1 nh
n 1 g2 + O E1 2 + 2 2 E Ki Kj εi εj υj2 h 2h n i=1 j =i
g2 + O E1 2 h
+ O(E2 h−2 ).
4 n n 1 g E K K ε ε (b + υ )(b + υ ) ≤ O i j j i j i j i h2 n2 i=1 j =1 nh2
+O
1 nh
1 g2 + hN 2 g 2δ hNg δ
+O
h2 N 2 g 2δ
+O
6 n n 1 g 2 E K K ε ε (b + υ ) (b + υ ) ≤ O i j j i j i j i h3 n2 i=1 j =1 nh3
1 g4 g2 + + h2 N 3 g 3δ h2 Ng δ h2 N 2 g 2δ
4 −1
2 g h g +O . +O 2 2δ N g Ng δ
+O
1 nh
+O
hg 2 Ng δ
h N 3 g 3δ
(A.2)
,
Finally, applying the same decomposition as above we get n n 1 E Ki Kj εi εj (bj + υj )2 (bi + υi )2 4 2 4h n i=1 j =1
O
+O
g8 nh4
g 4 h−2 N 2 g 2δ
+O
As maxi,j E[εi εj υi2 υj2 ] = O
1 nh
+O
h N 4 g 4δ
1 g2 g2 + 3 2 2δ + 3 3 3δ h3 N 4 g 4δ hN g hN g
g 2 h−1 N 3 g 3δ
+ {1 + o(1)}
+O
g 6 h−3 Ng δ
n 1 E Ki Kj εi εj υi2 υj2 . 4 2 4h n i=1 j =i
, see Assumption P5, the last term is o
Combining this with Assumptions D2 and D3 gives that f Furthermore, with similar calculations, one gets for E[T 6 ], =
−2
1 Ng δ
(A.3)
(A.4)
.
(x)E[T5 ] =
1 f −1 (x)σεˆ2 (x) K 22 . nh
2 n n 1 1 1 g + O + O E K K ε υ {m(X ) − m(x)} 1 + O i i j j j hn2 i=1 j =1 h hNg δ h2 N 2 g 2δ
h
m (x) μ1 (K )σu (x)σε (x)E[uε] = O f (x)
h Ng δ
.
C The Author(s). Journal compilation C Royal Economic Society 2009.
Non-parametric estimation with predicted variables
395
Similarly, E[T7 ] =
n n 1 E Ki Kj υi υj {m(Xi ) − m(x)}{m(Xj ) − m(x)} (1 + o(1)) 2 2 h n i=1 j =1
= m 2 (X)f 2 (x)σu2 (x)μ21 (K ) (1 + o(1)) = O
1 Ng δ
.
Putting all together gives assertion (ii). Again, bearing in mind that we consider i.i.d. observations and Assumption P2, (iii) can be derived from the central limit theorem.
C The Author(s). Journal compilation C Royal Economic Society 2009.