The Econometrics Journal (2011), volume 14, pp. 131–155. doi: 10.1111/j.1368-423X.2010.00333.x
An I(2) cointegration model with piecewise linear trends

Takamitsu Kurita†, Heino Bohn Nielsen‡ and Anders Rahbek§,¶

†Faculty of Economics, Fukuoka University, Bunkei Center Building, 8-19-1 Nanakuma, Johnan-ku, Fukuoka 814-0180, Japan. E-mail: [email protected]
‡Department of Economics, University of Copenhagen, Øster Farimagsgade 5, Building 26, DK-1353 Copenhagen K, Denmark. E-mail: [email protected]
§Department of Economics, University of Copenhagen, Øster Farimagsgade 5, Building 26, DK-1353 Copenhagen K, Denmark.
¶CREATES, School of Economics and Management, Aarhus University, Building 1322, Bartholins Allé 10, DK-8000 Aarhus C, Denmark. E-mail: [email protected]

First version received: November 2009; final version accepted: August 2010
Summary  This paper presents likelihood analysis of the I(2) cointegrated vector autoregression which allows for piecewise linear deterministic terms. The limiting behaviour of the maximum likelihood estimators is derived and is used to further derive the limiting distribution of the likelihood ratio statistic for the cointegration ranks, extending Nielsen and Rahbek (2007). The asymptotic theory provided also extends the results in Johansen et al. (2010), where asymptotic inference is discussed in detail for one of the cointegration parameters. An empirical analysis of US consumption, income and wealth, 1965–2008, is performed, emphasizing the importance of a change in nominal price trends after 1980.

Keywords: Cointegration, I(2), Likelihood analysis, Piecewise linear trends, Rank test, US consumption.
© 2011 The Author(s). The Econometrics Journal © 2011 Royal Economic Society. Published by Blackwell Publishing Ltd, 9600 Garsington Road, Oxford OX4 2DQ, UK and 350 Main Street, Malden, MA 02148, USA.

1. INTRODUCTION

This paper presents asymptotic likelihood analysis of the I(2) cointegrated vector autoregression (VAR) with piecewise linear trends, i.e. a model where the slopes of the deterministic trends and the equilibrium means are allowed to change at q known breakpoints. Our focus is inference on the cointegration ranks, or indices, and testing hypotheses on cointegrating parameters based on likelihood ratio (LR) statistics.

Empirically, an I(2) model with piecewise linear trends is highly relevant. Many OECD countries have experienced pronounced shifts in inflation rates since the 1960s, leading to changes in the trend slopes of nominal variables. These changes, if suitably smooth, may be
well captured by stochastic I(2) trends, while, if abrupt, they may alternatively be described by a deterministic change in the trend. The proposed model thus embeds both, and allows one to test, rather than assume, which is more relevant for a given empirical analysis. More generally, there is a large strand of literature, originating with Perron (1989, 1990), illustrating that it is important to have a relevant deterministic specification for the data before the presence of unit roots is tested; neglecting to model deterministic shifts will bias conventional tests towards the finding of unit roots. This approach is reflected in the included application, where the model is applied to quarterly observations of nominal variables for US consumption, income and wealth, 1965–2008. We find a significant difference in the trend slope before and after 1981, a break that can be attributed to a shift in policy focus following the stagflation period and the recession in 1981. Furthermore, we find clear evidence of I(2) trends in the nominal variables, also when we allow for the alternative of a deterministic change in the trend. Moreover, homogeneity between the nominal variables is not rejected; thus money illusion is excluded in the long run, and a nominal-to-real transformation from I(2) to I(1) is possible; see also Kongsted (2005). Homogeneity, and hence the validity of the theoretically relevant I(2)-to-I(1) transformation, is, on the other hand, strongly rejected in a (misspecified) I(2) analysis that does not allow for changing linear trends. Other empirical research modelling nominal variables by I(2) includes e.g. Juselius (1998, 1999), Diamandis et al. (2000), Banerjee et al. (2001), Fliess and MacDonald (2001), Nielsen (2002), Bacchiocchi and Fanelli (2005) and Nielsen and Bowdler (2006).
Apart from the formulation of the I(2) model, the theoretical contribution of the paper is the asymptotic theory: Theorem 3.2 gives the asymptotic distributions of the maximum likelihood estimators (MLEs) of the parameters. These are applied in our derivation of the limiting distribution of the LR rank test statistic in Theorem 3.3, and of LR statistics on the cointegration parameters. We thereby extend both the analysis in Nielsen and Rahbek (2007), where cointegration rank testing is considered for I(2) VAR models with no changes in trend and level, and the analysis in Johansen et al. (2000), where piecewise linear trends are studied for I(1) models. Furthermore, the paper complements results in Johansen et al. (2010), as it provides a full asymptotic theory for estimators and test statistics.

The organization of this paper is as follows. Section 2 introduces the relevant representations of the VAR model for I(2) processes in the presence of changing linear trends. Section 3 provides asymptotic results and, finally, Section 4 presents the empirical illustration. All proofs are located in the Appendix.

Throughout, use is made of the following notation: for any p × r matrix α of rank r, r < p, let α⊥ denote a p × (p − r) matrix whose columns form a basis of the orthogonal complement of span(α). Set ᾱ = α(α′α)⁻¹, such that αᾱ′ is the orthogonal projection matrix onto span(α). The symbols ⇒ and →p are used to indicate weak convergence and convergence in probability, respectively. Finally, we use [x] to denote the integer part of x, x ∈ R, and 1(A) the indicator function, which equals one if A is true and zero otherwise.
2. THE MODEL

2.1. The I(2) model with no deterministic terms

To introduce the notation, consider initially the unrestricted VAR model with k ≥ 2 lags, parametrized conveniently for I(2) analysis of the p-dimensional X_t,
Δ²X_t = ΠX_{t−1} − ΓΔX_{t−1} + ΨΔ²X̃_{t−1} + ε_t,   t = 1, 2, . . . , T,   (2.1)

where Π and Γ are (p × p)-dimensional matrices, ΨΔ²X̃_{t−1} = Σ_{i=1}^{k−2} Ψ_i Δ²X_{t−i}, with Ψ_i (p × p) matrices. Initial values X₀, ΔX₀ and Δ²X̃₀ are conditioned upon, and ε_t is a p-dimensional i.i.d. N_p(0, Ω) sequence, Ω > 0. While the assumption of normality of ε_t defines the likelihood function to be maximized below, we note that it is well known from the analysis of I(1) cointegration models that asymptotic properties of estimators and test statistics hold under the less strict assumption of ε_t being a martingale difference sequence; see e.g. Cavaliere et al. (2010). The I(2) model, H(r, s), is then defined by two reduced rank restrictions,

Π = αβ′   and   α⊥′Γβ⊥ = ξη′,   (2.2)

with α and β (p × r)-dimensional matrices, and ξ and η (p − r) × s matrices, with r ≤ p and s ≤ p − r. Next, to interpret the parameters and the dynamics of X_t, we need the following assumption, which is maintained throughout the paper:

ASSUMPTION 2.1. Assume that the characteristic polynomial, A(z) = I_p(1 − z)² − Πz + Γ(1 − z)z − Σ_{i=1}^{k−2} Ψ_i(1 − z)²zⁱ, has 2(p − r) − s roots at z = 1 and the remaining roots outside the unit circle, |z| > 1.

For estimation purposes we follow Johansen (1997) and Nielsen and Rahbek (2007) and write the model in (2.1), subject to the two reduced rank restrictions in (2.2), in terms of freely varying parameters as

Δ²X_t = α(ρ′τ′X_{t−1} + ψ′ΔX_{t−1}) + α̃⊥κ′τ′ΔX_{t−1} + ΨΔ²X̃_{t−1} + ε_t,   (2.3)

where ρ is ((r + s) × r)-dimensional, τ is (p × (r + s)), ψ is (p × r), and κ is ((r + s) × (p − r)). Finally, α̃⊥ = Ωα⊥(α⊥′Ωα⊥)⁻¹ is (p × (p − r))-dimensional. Under Assumption 2.1, Δ²X_t, ρ′τ′X_t + ψ′ΔX_t = β′X_t + ψ′ΔX_t and τ′ΔX_t all have a stationary representation, and X_t is therefore a cointegrated I(2) process. As mentioned, the parameters θ = (α, ρ, τ, ψ, κ, Ψ) and Ω in (2.3) are all freely varying. Estimates are obtained by a switching algorithm maximizing the Gaussian likelihood function given by

L_T(θ, Ω) = −(1/2) Σ_{t=1}^T ( log|Ω| + ε_t(θ)′Ω⁻¹ε_t(θ) ),

with ε_t(θ) = Δ²X_t − α(ρ′τ′X_{t−1} + ψ′ΔX_{t−1}) − α̃⊥κ′τ′ΔX_{t−1} − ΨΔ²X̃_{t−1}. Specifically, for fixed τ, the parameters α⊥ and α can be obtained by solving an eigenvalue problem, and the remaining parameters can be found from ordinary linear regression. For fixed values of these parameters, τ can be estimated by generalized least squares; see Johansen (1997) for more details.

Note that the original parameters in (2.1), imposing the reduced rank restrictions in (2.2), can be derived from the parameters in (2.3) as follows. First, write α⊥ = (α⊥1, α⊥2) and β⊥ = (β⊥1, β⊥2), where α⊥1 = ᾱ⊥ξ, β⊥1 = β̄⊥η, α⊥2 = (α, α⊥1)⊥ and β⊥2 = (β, β⊥1)⊥.
Then it holds that τ = (β, β⊥1), β = τρ, ψ′ = −α̃′Γ, with α̃ = Ω⁻¹α(α′Ω⁻¹α)⁻¹, and κ′ = −α⊥′Γ(β̄, β̄⊥1) = −(α⊥′Γβ̄, ξ), using the skew-projection identity

αα̃′ + α̃⊥α⊥′ = I_p.   (2.4)
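The skew-projection identity (2.4) is easy to verify numerically. The following sketch (our own construction, using the definitions α̃ = Ω⁻¹α(α′Ω⁻¹α)⁻¹ and α̃⊥ = Ωα⊥(α⊥′Ωα⊥)⁻¹, with α⊥ built from a QR decomposition) checks it for randomly drawn α and Ω:

```python
import numpy as np

rng = np.random.default_rng(1)
p, r = 4, 2
alpha = rng.standard_normal((p, r))
A = rng.standard_normal((p, p))
Omega = A @ A.T + p * np.eye(p)                 # a positive definite Omega

# alpha_perp: columns of the full Q beyond the first r span null(alpha')
Q, _ = np.linalg.qr(alpha, mode="complete")
alpha_perp = Q[:, r:]

Oinv = np.linalg.inv(Omega)
alpha_tilde = Oinv @ alpha @ np.linalg.inv(alpha.T @ Oinv @ alpha)
alpha_tilde_perp = Omega @ alpha_perp @ np.linalg.inv(alpha_perp.T @ Omega @ alpha_perp)

# skew-projection identity (2.4): alpha alpha_tilde' + alpha_tilde_perp alpha_perp' = I_p
I_check = alpha @ alpha_tilde.T + alpha_tilde_perp @ alpha_perp.T
assert np.allclose(I_check, np.eye(p))
```

The two terms are complementary oblique projections: the first leaves span(α) invariant and annihilates span(Ωα⊥), and vice versa for the second, so their sum is the identity.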
2.2. Deterministic terms

Our focus will be on the inclusion of piecewise linear trends in the I(2) model. Specifically, we allow for a linear deterministic trend and q changes in the trend slopes and equilibrium levels. The deterministic terms enter the model so as to allow piecewise linear trends in all directions of the process, including the cointegrating relationships. An immediate implication of this specification is that the rank determination can be based on asymptotically similar distributions; see Section 3.3.3 below for details, as well as Nielsen and Rahbek (2000) for a general discussion of similarity.

Let therefore D_t = (D_{0t}, D_{1t}, . . . , D_{qt})′ denote a (q + 1)-dimensional deterministic linearly trending variable, and set d_t = ΔD_t. We set D_{0t} = t; that is, the first component of D_t is throughout a linear trend, while D_{it} for i = 1, . . . , q allow q linearly independent changing linear trends. A changing trend slope at T_i, with 1 < T_i < T, is then represented by defining D_{it} = (t − (T_i − 1))1(t ≥ T_i). For the asymptotic analysis we make the following assumption.

ASSUMPTION 2.2. For the (q + 1)-dimensional deterministic linear trend term D_t = (D_{0t}, . . . , D_{qt})′, with D_{0t} = t and D_{it} = (t − (T_i − 1))1(t ≥ T_i), 1 < T_i < T, T_i ≠ T_j for i ≠ j, assume that T_i = [Tu_i] for some u_i ∈ ]0, 1[, i = 1, 2, . . . , q.

It is an immediate implication of Assumption 2.2 that T⁻¹D_{[Tu]} ⇒ D(u) on the space of (q + 1)-dimensional cadlag functions on [0, 1], or, for each component,

T⁻¹D_{i[Tu]} ⇒ D_i(u)   as T → ∞,   i = 0, 1, . . . , q,

where D_i(u) = (u − u_i)1(u ≥ u_i), with u₀ = 0 and u_i ∈ ]0, 1[ satisfying [Tu_i] = T_i, i = 1, 2, . . . , q. Note that while T_i denotes the time point of a change in the discrete time interval [1, T], u_i denotes the corresponding (limiting) fraction in the continuous time interval [0, 1]. Furthermore, as T → ∞,

T⁻³ Σ_{t=1}^T D_t D_t′ ⇒ ∫₀¹ D(u)D(u)′ du,

which is a positive definite (q + 1) × (q + 1) matrix. As d_{it} = ΔD_{it} = 1(t ≥ T_i), then with d_t = ΔD_t we have d_{[Tu]} → d(u) by Assumption 2.2, with d_i(u) = 1(u ≥ u_i).

2.2.1. Constant linear trend. The case of D_t = D_{0t} = t, which allows for a linear trend in all linear combinations of the I(2) process X_t, is analysed in Rahbek et al. (1999) and Nielsen and Rahbek (2007), and it is briefly reviewed here before introducing the changing linear trends; see also Paruolo (2000) for other relevant specifications. Let D_t = t, d_t = 1 and define X_t* = (X_t′, D_t)′. Then the I(2) model with a linear trend is conveniently given by

Δ²X_t = α(ρ′τ*′X*_{t−1} + ψ*′ΔX*_{t−1}) + α̃⊥κ′τ*′ΔX*_{t−1} + ΨΔ²X̃_{t−1} + ε_t,   (2.5)
where τ* = (τ′, τ_D′)′ [(p + 1) × (r + s)], while ψ* = (ψ′, ψ_d′)′ [(p + 1) × r], and the remaining parameters are as in (2.3). Under Assumption 2.1, it was shown in Rahbek et al. (1999, Theorem 2.1) that X_t in (2.5) is indeed an I(2) process with the representation

X_t = C₂ Σ_{s=1}^t Σ_{i=1}^s ε_i + C₁ Σ_{i=1}^t ε_i + γ_D D_t + γ_d d_t + C₀(L)ε_t,   (2.6)

where

C₂ = β⊥2(α⊥2′Θβ⊥2)⁻¹α⊥2′,   β′C₁ = ᾱ′ΓC₂,   β⊥1′C₁ = ᾱ⊥1′(I − ΘC₂).

Here Θ = Γβ̄ᾱ′Γ + I_p − Σ_{i=1}^{k−2}Ψ_i, and C₀(L)ε_t = Σ_{i=0}^∞ C₀ᵢε_{t−i} is a stationary mean-zero I(0) process with exponentially decaying coefficients.¹ The coefficients γ_D and γ_d for the trend and the level, respectively, depend on τ_D and ψ_d as well as on the initial values of the process. It follows from (2.6) that τ*′X_t* = τ′X_t + τ_D′D_t is I(1), whereas the (p − r − s) linear combinations β⊥2′X_t are I(2). In other words, Δτ*′X_t* and β⊥2′Δ²X_t are mean-zero stationary, or I(0), processes, in addition to the r mean-zero stationary linear combinations given by

s_t* = ρ′τ*′X_t* + ψ*′ΔX_t*.

2.2.2. Changing linear trend. By the analysis in Rahbek et al. (1999), one may view the resulting model in (2.5) as derived from the I(2) model with no deterministic terms in (2.3) by replacing X_t with X_t* = (X_t′, D_t)′ = (X_t′, t)′. This results in the model in (2.5), where X_t, as stated, is replaced by X_t* = (X_t′, D_t)′, and ΔX_t by ΔX_t* = (ΔX_t′, d_t)′. Note that, as Δ²D_t = Δd_t = 0, the zero in Δ²X*_{t−j} = (Δ²X_{t−j}′, 0)′ can be omitted, as in (2.5).

Consider next extending D_t to include the additional q changing linear trends from Section 2.2. Initially, observe that in this case, with a changing linear trend such as D_{1t} = (t − (T_1 − 1))1(t ≥ T_1), then Δ²D_{1t} = Δd_{1t} = 1(t = T_1) = δ_{1t}, say. That is, the second-order difference of the changing trend is an impulse dummy, which needs to be included. Likewise, Δ²d_{1t} = Δδ_{1t} ≠ 0. Note that including (δ_{1t}, Δδ_{1t})′ as an unrestricted regressor is equivalent to including (δ_{1t}, δ_{1t−1})′, and below we include such impulse dummies and not their differences. Introduce for that purpose δ_t, which is an m-dimensional variable of impulse dummies, δ_t = (δ_{1t}, . . . , δ_{mt})′, where

δ_{it} = 1(t = T_i)   for some T_i, 1 < T_i < T,   i = 1, 2, . . . , m.

If no additional impulse dummies are included, then m = qk, such that in general m ≥ qk; see below. Then, similar to the constant trend case, we extend the I(2) model to allow for changing linear trends by including X_t* = (X_t′, D_t′)′, ΔX_t* = (ΔX_t′, d_t′)′ and also the impulse dummies in δ_t in the model, denoted H_D(r, s):

Δ²X_t = α(ρ′τ*′X*_{t−1} + ψ*′ΔX*_{t−1}) + α̃⊥κ′τ*′ΔX*_{t−1} + ΨΔ²X̃_{t−1} + Φ_δδ_t + ε_t,   (2.7)

where τ* = (τ′, τ_D′)′ is [(p + q + 1) × (r + s)] and ψ* = (ψ′, ψ_d′)′ is [(p + q + 1) × r]. The remaining parameters are as in (2.3), except for the additional parameter Φ_δ (p × m). Note that the inclusion of the impulse dummies in δ_t as unrestricted regressors implies that ε̂_{T_i} = 0, where ε̂_t are the estimated residuals in (2.7).
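The last observation — that an unrestricted impulse dummy forces the corresponding residual to zero — is a generic least-squares fact, easily checked in a toy regression (entirely our own illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
T, Ti = 40, 25
y = rng.standard_normal(T)
tt = np.arange(1, T + 1)
# regressors: constant, linear trend, and an unrestricted impulse dummy at Ti
X = np.column_stack([np.ones(T), tt.astype(float),
                     (tt == Ti).astype(float)])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta
assert abs(resid[Ti - 1]) < 1e-10        # residual at t = Ti is forced to zero
```

The dummy coefficient simply absorbs the observation at t = Ti, which is why including the k induced dummies is equivalent to conditioning on the corresponding observations.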
¹ That is, ‖C₀ᵢ‖ < νⁱ with 0 ≤ ν < 1.
In empirical models, the m impulse dummies in δ_t are in general included for two different reasons. First of all, δ_t includes the impulse dummies resulting from differencing of the q changing linear trends (and levels). Specifically, with the example of D_{1t}, this leads to the inclusion of δ_{1t−j}, j = 0, . . . , k − 1, which are k impulse dummies for t = T_1 + j. With q changing linear trends, a total of qk impulse dummies should thus be included in δ_t. As noted above, the k corresponding estimated residuals ε̂_{T_1}, . . . , ε̂_{T_1+(k−1)} all equal zero, and the inclusion of these k dummies is therefore equivalent to conditioning on X_{T_1+(k−1)}, ΔX_{T_1+(k−1)} and Δ²X̃_{T_1+(k−1)} in estimation. In addition to these qk induced impulse dummies, we allow for further impulse dummies δ_{it}, and hence m ≥ qk. The additional impulse dummies, entered as unrestricted regressors, are sometimes referred to as innovation dummies and are common in empirical analyses, since they often lead to a better empirical fit of the model within sample, as in our application; see Section 4. We demonstrate below that they play no role in the asymptotic analysis, and the precise specification of δ_t is not important asymptotically. The same applies to transitory impulse dummies, defined as δ_{it} = 1(t = T_i) − 1(t = T_i + 1).

To summarize, we use the following notation for the different dummies in the statistical model:

D_t = (t, D_{1t}, . . . , D_{qt})′   and   d_t = ΔD_t = (1, d_{1t}, . . . , d_{qt})′,   (2.8)

where

d_{it} = 1(t ≥ T_i) = { 0 for t < T_i,  1 for t ≥ T_i }   and   D_{it} = Σ_{s=1}^t d_{is},

so that the piecewise linear trend and the level shift both begin with the value one at t = T_i. Finally, δ_t contains the qk induced dummies of the form (Δ²D_t′, Δ²D_{t−1}′, . . . , Δ²D_{t−k+1}′)′ together with additional innovation dummies. Prior to estimation, any redundant dummy is removed.

It follows directly by Rahbek et al. (1999, proof of Theorem 2.1) that the representation of X_t is identical to (2.6), with the only exception that now ε_t is replaced by ε_t^δ = ε_t + Φ_δδ_t. We get that, under Assumption 2.1, X_t in (2.7) has the representation

X_t = C₂ Σ_{s=1}^t Σ_{i=1}^s ε_i^δ + C₁ Σ_{i=1}^t ε_i^δ + γ_D D_t + γ_d d_t + C₀(L)ε_t^δ.   (2.9)

This was also used in Johansen et al. (2010, proof of Lemma 1), where a generic infinite sum of impulse dummies is introduced to facilitate the interpretation. Define here such a generic infinite sum,

λ_t = C_δ(L)δ_t,   (2.10)

with C_δ(z) = Σ_{i=0}^∞ C_{iδ}zⁱ, C_{iδ} exponentially decreasing, and δ_t impulse dummies. The idea is that λ_t vanishes asymptotically, as noted above, and is in this sense unimportant for the representation of X_t. For example, C₀(L)ε_t^δ contains such a vanishing term, C₀(L)Φ_δδ_t (= λ_t). Thus, from (2.9), it holds that X_t is an I(2) process with broken linear trends and levels, and that Δ²X_t − E(Δ²X_t) is I(0) with E(Δ²X_t) = λ_t, a generic infinite sum of impulse dummies. Likewise, τ*′ΔX_t* = τ′ΔX_t + τ_D′d_t is I(0) except for E(τ*′ΔX_t*) = λ_t. Finally, with β* = τ*ρ, the r linear combinations given by β*′X_t* + ψ*′ΔX_t* are I(0) except for E(β*′X_t* + ψ*′ΔX_t*) = λ_t. Thus, in this sense, the interpretation remains identical to the linear trend case, except for
the additional asymptotically vanishing infinite sums of impulse dummies. In the empirical application below, we illustrate the role of the impulse dummies and the interpretation of the deterministic terms.
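The dummies in (2.8) and the induced impulse dummies are straightforward to construct in practice. The following Python sketch (variable names are our own) builds d_{1t}, D_{1t} and the k induced dummies for a single break, and verifies that the second difference of the broken trend is the impulse 1(t = T₁):

```python
import numpy as np

T, T1, k = 20, 8, 3                       # sample size, breakpoint, lag length
t = np.arange(1, T + 1)

# level-shift and broken-trend dummies as in (2.8): both equal one at t = T1
d1 = (t >= T1).astype(float)              # d_{1t} = 1(t >= T1)
D1 = (t - (T1 - 1)) * d1                  # D_{1t} = (t - (T1 - 1)) 1(t >= T1)
assert np.array_equal(np.cumsum(d1), D1)  # D_{1t} is the partial sum of d_{1t}

# second difference of the broken trend (taking D_{1t} = 0 for t <= 0)
D1_pad = np.concatenate([[0.0, 0.0], D1])
d2D1 = D1_pad[2:] - 2 * D1_pad[1:-1] + D1_pad[:-2]
delta1 = (t == T1).astype(float)          # impulse dummy 1(t = T1)
assert np.array_equal(d2D1, delta1)       # Delta^2 D_{1t} = 1(t = T1)

# the k induced impulse dummies delta_{1,t-j}, j = 0, ..., k - 1
induced = np.column_stack([(t == T1 + j).astype(float) for j in range(k)])
```

With q breaks, stacking the analogous columns for each break gives the qk induced dummies described above.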
3. LIKELIHOOD INFERENCE

3.1. Estimation

Under H_D(r, s), ML estimators in (2.7) are obtained by the usual switching algorithm described above for the I(2) model with no deterministic terms. Note that the loading on the impulse dummies, Φ_δ, is estimated from single observations only, and hence is bounded but inconsistent; see Theorem 3.1.

3.2. The rank test statistic

For determination of the cointegration ranks, r and s, we consider the LR statistic for H_D(r, s) against the unrestricted alternative H_D(p) = H_D(p, 0), defined by

Q^{LR}_{(r,s)} = −T log |Ω̂Ω̌⁻¹|,

where Ω̌ and Ω̂ denote the covariance matrices estimated under H_D(r, s) and H_D(p), respectively.

3.3. Asymptotics

When reporting results for the asymptotics of the parameter estimators, emphasis will be on the parameters τ* = (τ′, τ_D′)′, ψ* = (ψ′, ψ_d′)′ and ρ. The parameters α, κ, Ψ and Ω have the same asymptotic behaviour as in the model with no deterministic terms analysed in Johansen (1997). As shown, the remaining parameter Φ_δ plays no role for the asymptotic analysis, and we also note in this respect that Φ̂_δ is not consistent. We start by providing the necessary results for the parameters in the I(2) model which are of theoretical interest, as their asymptotic distributions are used to derive the limiting behaviour of the LR statistics for rank and linear hypotheses, respectively.

3.3.1. Theoretical parameters. In the following, θ̂ denotes the ML estimator of a parameter θ, while θ₀ denotes the true value. Furthermore, the parameters β, τ and α⊥ under H_D(r, s) are normalized on β̄₀, τ̄₀ and ᾱ̃⊥₀, respectively, such that β̄₀′β = I_r, τ̄₀′τ = I_{r+s} and ᾱ̃⊥₀′α⊥ = I_{p−r}. These are theoretically convenient normalizations which ensure identification of all parameters in the model. Note in particular that ρ = τ̄₀′β, which is (r + s) × r. Define next the parameters, in terms of (2.7),

B₀ = (ψ − ψ₀)′β̄⊥20,   B₁ = (β − β₀)′β̄⊥10,   B₂ = (β − β₀)′β̄⊥20 = ρ′(τ − τ₀)′β̄⊥20,
B_D = ρ′(τ_D − τ_{D0})′,   B_d = (ψ_d − ψ_{d0})′ − (ψ − ψ₀)′τ̄₀τ_{D0}′,   (3.1)
C₀ = ρ⊥′(τ − τ₀)′β̄⊥20,   C_D = ρ⊥′(τ_D − τ_{D0})′.
Note that B₀, B₁, B₂ and C₀ are identical to the definitions in Johansen (1997), while B_D, B_d and C_D are new parameters corresponding to the deterministic terms. We first turn to consistency of the just-defined parameters.

THEOREM 3.1. For the model H_D(r, s) under Assumptions 2.1 and 2.2, the ML estimators exist with probability tending to one and, using the definitions in (3.1),

(T^{1/2}B̂₀, T^{1/2}B̂₁, T^{3/2}B̂₂, TB̂_D, B̂_d) →p 0   and   (T^{1/2}Ĉ₀, Ĉ_D) →p 0   (3.2)

as T → ∞. Moreover, T^{1/2}(ρ̂ − ρ₀) →p 0, and α̂, κ̂, Ψ̂ and Ω̂ are consistent. Finally, (Φ̂_δ − Φ_{δ0}) = O_P(1).
In particular, the MLEs Φ̂_δ of the coefficients loading the impulse dummies are not consistent but are bounded in probability. Theorem 3.1 also establishes rates of convergence for the remaining parameters, and the next theorem gives the asymptotic distributions of these estimators. To report these, some definitions are needed first. Define, for X(u), Y(u) and Z(u) of dimension p_x, p_y and p_z on the unit interval u ∈ [0, 1],

X(u) | Y = X(u) − (∫₀¹ X(s)Y(s)′ ds)(∫₀¹ Y(s)Y(s)′ ds)⁻¹ Y(u),
S(X, Y) = (∫₀¹ dX Y′)(∫₀¹ Y(u)Y(u)′ du)⁻¹(∫₀¹ Y dX′),   (3.3)
R(Y, Z) = (∫₀¹ Y(u)Y(u)′ du)⁻¹ ∫₀¹ Y dZ′.

Next define the process H(u) by

H(u)′ = (H₀(u)′, H₁(u)′, H₂(u)′) = (V(u)′C₂₀′β⊥20, V(u)′C₁₀′β⊥10, ∫₀ᵘ V(s)′ ds C₂₀′β⊥20),   (3.4)

with V(u) a Brownian motion on u ∈ [0, 1] with covariance Ω₀. Furthermore, define

V₁(u) = (α₀′Ω₀⁻¹α₀)⁻¹α₀′Ω₀⁻¹V(u),   (3.5)
V₂(u) = (φ₀′Ω₀⁻¹φ₀)⁻¹φ₀′Ω₀⁻¹V(u),   (3.6)

where φ₀′ = ρ̄⊥0′κ₀α̃⊥0′.

THEOREM 3.2. For the model H_D(r, s) under Assumptions 2.1 and 2.2, as T → ∞,

(TB̂₀′, TB̂₁′, T²B̂₂′, T^{3/2}B̂_D′, T^{1/2}B̂_d′)′ ⇒ B∞ = (B₀∞′, B₁∞′, B₂∞′, B_D∞′, B_d∞′)′ = R(H*, V₁),
(TĈ₀′, T^{1/2}Ĉ_D′)′ ⇒ C∞ = (C₀∞′, C_D∞′)′ = R(H₀*, V₂),

where the equalities on the right-hand side define B∞ and C∞, respectively. Here H*(u) = (H(u)′, D(u)′, d(u)′)′ and H₀*(u) = (H₀(u)′, d(u)′)′, where H(u) is defined in (3.4), V₁(u) and V₂(u) are defined in (3.5)–(3.6), and D(u) and d(u) are defined in Section 2.2. Finally,
T(ρ̂ − ρ₀) ⇒ τ̄₀′β̄⊥10B₁∞, while the remaining parameters are asymptotically Gaussian. In particular, T^{1/2}(θ̂_{z0} − θ_{0z0}) →D N_{p×(2r+s+p)}(0, Ω₀ ⊗ Σ₀₀⁻¹), where θ_{z0}, defined in (A.8) in the Appendix, is the coefficient of the (asymptotically) stationary relations Z₀t in (A.7) of the model, and Σ₀₀ = Var(Z₀t).

3.3.2. Asymptotics for hypotheses on individual parameters. The results in Theorem 3.2 are directly applicable for deriving the limiting distributions of the parameters in θ, such as τ̂* and ψ̂*. That is, applying the definitions in (3.1), analogous to Johansen (1997, Theorems 3, 4 and 5), the limiting distributions can be derived with the parameters normalized on known constants rather than, as here, on the true parameters. We exemplify below by establishing that τ̂* is mixed Gaussian, considering the limiting distribution of τ̂* = (τ̂′, τ̂_D′)′ when τ is normalized by a known p × (r + s)-dimensional matrix a, say. That is,

τ̂_a* = τ̂*(a′τ̂)⁻¹ = (τ̂_a′, τ̂_{Da}′)′.   (3.7)

With a such that a′τ₀ = I_{r+s}, we have the following corollary.

COROLLARY 3.1. For τ̂_a* defined in (3.7), it follows that

(T(τ̂_a − τ₀)′β̄⊥20, T^{1/2}(τ̂_{Da} − τ_{D0})′)′ ⇒ C∞ρ̄⊥0′,

which is mixed Gaussian. An immediate implication of the mixed Gaussianity is that likelihood ratio tests for linear hypotheses, as applied in Section 4, of the form τ* = Hφ, with H a known ((p + q + 1) × h)-dimensional matrix, r + s ≤ h ≤ p + q + 1, and φ (h × (r + s))-dimensional, are asymptotically χ²-distributed; see Johansen (2006). A thorough discussion of hypothesis testing on the I(2) cointegration parameters τ, as well as β = τρ, is given in Boswijk (2000) and Johansen (2006) for the I(2) model with no deterministics. It is applied in Johansen et al. (2010) for a general discussion of χ²-based inference on β* = τ*ρ. Note in this respect that Johansen et al. (2010) consider in particular the distribution of β̂* under general and empirically relevant identifying restrictions. These results may also be derived from our Theorems 3.1 and 3.2.

With respect to ψ̂, a difficult open question within the I(2) literature for more than a decade has been whether or not the estimator of ψ′β̄⊥2, loading the non-stationary β⊥2′ΔX_t in the cointegrating relations, is mixed Gaussian; see Paruolo (2000, Theorem 4.2). As there, it follows immediately from Theorem 3.2 that ψ̂′β̄⊥20 is asymptotically mixed Gaussian, which we exploit in our application below, where β⊥2 is indeed known and fixed. New results in Boswijk (2010) show that the estimator of ψ′β̄⊥2 is mixed Gaussian also for β̄⊥2 estimated. While the results there are neither derived for our new specification with piecewise linear trends nor used in our application, we conjecture from the results in Theorem 3.2 that the Boswijk (2010) results also apply here.

3.3.3. Rank test asymptotics. From the results in Theorem 3.2, and using Nielsen and Rahbek (2007), we get the asymptotic distribution of the rank test statistic.
Table 1. Moments and quantiles of the asymptotic distribution in Theorem 3.3.

                                          Quantiles
p − r   s     Mean      Variance     0.50      0.80      0.85      0.90      0.95
  5     0    209.660    294.066     208.91    223.84    227.43    232.07    239.02
  5     1    178.362    259.383     177.65    191.72    195.14    199.49    206.01
  5     2    151.031    221.336     150.37    163.27    166.40    170.57    176.71
  5     3    127.583    190.067     126.93    138.92    141.81    145.58    151.36
  5     4    108.089    163.135     107.49    118.60    121.28    124.70    130.07
  5     5     92.253    141.580      91.67    101.90    104.44    107.80    112.90
  4     0    144.721    204.845     143.98    156.61    159.63    163.56    169.45
  4     1    119.242    175.203     118.59    130.07    132.98    136.65    142.12
  4     2     97.787    147.598      97.16    107.85    110.38    113.69    118.77
  4     3     79.946    123.639      79.35     89.00     91.42     94.51     99.28
  4     4     66.078    102.288      65.43     74.24     76.53     79.42     83.79
  3     0     91.270    134.829      90.67    100.74    103.30    106.57    111.38
  3     1     71.688    108.736      71.03     80.18     82.47     85.39     89.85
  3     2     55.865     86.523      55.31     63.45     65.49     68.15     72.13
  3     3     43.869     69.429      43.21     50.63     52.52     54.93     58.67
  2     0     49.588     73.772      48.98     56.50     58.40     60.94     64.80
  2     1     35.549     56.247      34.90     41.55     43.23     45.50     48.83
  2     2     25.347     41.418      24.71     30.48     31.96     33.90     36.89
  1     0     19.109     29.493      18.48     23.42     24.68     26.37     28.96
  1     1     10.825     18.675      10.16     14.16     15.20     16.60     18.84

Note: The asymptotic distribution in Theorem 3.3, Q∞_r + Q∞_{(r,s)}, simulated with q = 1 break at u₁ = 0.377. Based on random walks with 2000 steps and 50,000 replications.
THEOREM 3.3. Under Assumptions 2.1 and 2.2, as T → ∞,

Q^{LR}_{(r,s)} ⇒ Q∞_r + Q∞_{(r,s)},   (3.8)

where Q∞_r = tr{S(W, G^r)} and Q∞_{(r,s)} = tr{S(W₂, G^{(r,s)})}. Here W = (W₁′, W₂′)′ is a (p − r)-dimensional standard Brownian motion, where W₁(u) is s-dimensional and W₂(u) is (p − r − s)-dimensional. Furthermore, G^r(u) = ((W₁(u)′, ∫₀ᵘ W₂(s)′ ds, D(u)′)′ | G^{(r,s)}) and G^{(r,s)}(u) = (W₂(u)′, d(u)′)′, with u ∈ [0, 1].

Note that while the limiting distribution depends on the number of stochastic trends and the break locations, there are no nuisance parameters in the limiting distribution, and thus the test is asymptotically similar; see Nielsen and Rahbek (2000). To apply the limiting distribution in empirical applications, asymptotic quantiles have to be simulated for each specification of p − r, p − r − s and q, as well as the timing of the changing trend slopes, (u₁, u₂, . . . , u_q). We illustrate this for our application in Table 1.
Figure 1. US data for the empirical analysis, 1964–2008; the time series in graphs (A) and (C) have been shifted to have comparable means. Panels: (A) nominal variables, logs (c: consumption; p: prices, +15); (B) growth rates (Δc, Δp); (C) real variables; (D) bond yield (R: bond rate).
4. EMPIRICAL ILLUSTRATION

To illustrate the theoretical results, we conduct an empirical analysis of US quarterly consumption data, 1964–2008. We consider the p = 5 dimensional vector

X_t = (c_t, y_t, w_t, p_t, R_t)′,   (4.1)

where c is nominal private consumption, y is nominal disposable income after tax, w is nominal wealth including financial wealth and housing equity, while p represents the price level measured as the consumption deflator. These variables are all transformed by natural logs. To capture interest rate effects on savings, we include the annual bond yield, R, divided by 4 to be comparable to the quarterly inflation rate, Δp_t. See Appendix B for details on the data. Similar data sets for real rather than nominal variables have been analysed in, inter alia, Lettau and Ludvigson (2001) and Palumbo et al. (2006).

The time series are presented in Figure 1. In panel (A), the developments of the nominal variables, c, y and w, are quite parallel, although the wealth variable, w, fluctuates more. Recent discussions have referred to this as signs of 'bubbles' in asset and house prices. The price index, p, has increased less over time, but seems to share a similar smooth stochastic trend. It appears that the average trend slope changes just after 1980, both in the price deflator and in the nominal measures, and in the empirical analysis we allow for a deterministic change in the trend slope in
1981:2. The first differences of consumption and prices are given in panel (B), indicating I(2)-type behaviour of the levels and a systematic shift in average growth after 1980. Growth rates in y and w have similar patterns but are much more volatile. Panel (C) shows the real magnitudes, with all variables deflated by the consumption deflator p. We observe that the shift in growth rates is less pronounced in the real variables, indicating that the broken trend may be a purely nominal phenomenon.

We note that the behaviour of the nominal time series in panel (A) could be consistent with only stochastic I(2) trends in the data and no deterministic shifts. However, in the empirical analysis it is of interest to distinguish the I(1) and I(2) components of the data, and statistical testing is used to decide whether the behaviour is most adequately modelled by smooth stochastic trends or by more abrupt deterministic shifts. Moreover, it is of particular interest to consider the possibility that the change in trend slope is also present in the cointegrating relationships, and a piecewise linear trend is therefore included, as discussed above, to allow a consistent and asymptotically similar test procedure.

Economically, the shift in trend slope is interpreted as reflecting a change in policy focus following the stagflation period in the late 1970s. The US entered a severe recession in July 1981, partly initiated by a contractionary monetary policy to dampen inflation; cf. the nominal growth rates in panel (B) and the bond yield in panel (D). After the recovery of the US economy through 1982, the inflation rate stayed at more moderate values than in the previous decade. In our application here, the breakpoint T₁, equivalent to the date 1981:2, is chosen such that on the one hand the econometric model is well specified in terms of (residual-based) misspecification analysis, while at the same time the break date has an economic interpretation, as discussed above.
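The choice of breakpoint can also be cast as an estimation problem, maximizing the likelihood over candidate break dates. As a toy illustration — a univariate AR(1) around a broken linear trend, entirely our own construction and much simpler than the paper's I(2) model — the following Python sketch selects T̂₁ by minimizing the residual variance (equivalently, maximizing the Gaussian profile likelihood):

```python
import numpy as np

rng = np.random.default_rng(3)
T, T1_true = 200, 120
t = np.arange(1, T + 1)
trend_break = (t - (T1_true - 1)) * (t >= T1_true)

# toy DGP: AR(1) fluctuations around a piecewise linear trend with a
# slope change at T1_true
mean = 0.05 * t + 0.4 * trend_break
y = np.zeros(T)
y[0] = mean[0]
for i in range(1, T):
    y[i] = mean[i] + 0.5 * (y[i - 1] - mean[i - 1]) + rng.standard_normal()

def neg_profile_loglik(T1):
    # log residual variance from OLS of y_t on a constant, y_{t-1}, a linear
    # trend, a broken trend with break at the candidate T1, and an impulse;
    # minimizing it is equivalent to maximizing the Gaussian likelihood
    db = ((t - (T1 - 1)) * (t >= T1)).astype(float)
    X = np.column_stack([np.ones(T), np.r_[0.0, y[:-1]], t.astype(float),
                         db, (t == T1).astype(float)])
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return np.log(np.mean(resid[1:] ** 2))

T1_hat = min(range(30, T - 30), key=neg_profile_loglik)
```

With a pronounced slope change, the profile likelihood is sharply peaked and the grid search recovers the true break date closely.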
An alternative statistical approach is to include the breakpoint as an unknown parameter to be estimated; see Perron (2006) for the multivariate case of stationary and unit root variables. The breakpoint in this case would be identified by maximizing the likelihood function over possible breakpoints,

T̂1 = arg max_{T1} LT(T1, θ, Ω),
where the likelihood function depends on T1 through the construction of the dummies. This procedure is shown to consistently estimate the breakpoint for I(0) and I(1) variables; see e.g. Bai et al. (1998), Lütkepohl et al. (2004) and Qu and Perron (2007). To test whether an identified break is statistically significant in this case, a sup-type test is applied; see Perron (2006). The extension of this approach to the I(2) setting is left for future research.

4.1. The VAR model

Our empirical analysis is based on an unrestricted VAR with k = 3 lags, which nests the I(2) cointegration model and hence allows for formal statistical testing of I(2), as done below. The effective sample contains the T = 175 observations from 1965:1 to 2008:3, hence keeping the observations for 1964:2, 1964:3 and 1964:4 fixed as initial values. The model incorporates, in addition to the standard constant and linear trend term, a change in levels and trend slopes in 1981:2, and hence three induced impulse dummies as well. The unrestricted model seems to account for the main features of the data, and the hypotheses of no autocorrelation of order one and two are not rejected, with χ2(25) and χ2(50) statistics of ξAR(1) = 36 and ξAR(2) = 63, respectively. There are several outlying residuals in the model, however, associated with special events and large shocks in the sample period, and the
Table 2. Test for cointegration ranks, QLR(r,s).

p − r − s:      5        4        3        2        1        0
r = 0        442.8    342.4    260.8    202.4    167.7    154.6
             [0.00]   [0.00]   [0.00]   [0.00]   [0.00]   [0.00]
r = 1                 248.5    180.1    129.3    104.0     94.3
                      [0.00]   [0.00]   [0.01]   [0.02]   [0.01]
r = 2                          124.8     85.4     67.2     58.8
                               [0.01]   [0.10]   [0.12]   [0.05]
r = 3                                    54.3     42.5     31.5
                                        [0.27]   [0.17]   [0.17]
r = 4                                             22.1     12.4
                                                 [0.27]   [0.31]

Note: Likelihood ratio test statistics for HD(r, s) | HD(p). The numbers in brackets are asymptotic tail probabilities, derived as the proportion of replications in the relevant simulated distribution in Table 1 that are larger than the test statistic obtained for the data.
Jarque–Bera test for the null hypothesis of Gaussian residuals is rejected, with a χ2(10) statistic of ξJB = 178. We will refer to this as the baseline model in the following. To account for a number of the large shocks in the sample period, and to restore normality of the residuals, we also consider a version of the model that includes nine additional impulse dummies in δt, defined to take the value one in 1972:4, 1974:1, 1975:2, 1980:2, 1982:4, 1984:2, 1993:1, 1999:4 and 2008:2, respectively. For this augmented model, the above hypotheses of no autocorrelation and Gaussianity are not rejected (ξAR(1) = 34, ξAR(2) = 58 and ξJB = 17), and the null hypothesis of no first-order ARCH effects is not rejected, with a χ2(225) test statistic of ξARCH(1) = 222. Recall that the additional unrestricted impulse dummies do not change the asymptotic distributions of estimators and test statistics, and, as we illustrate below, they only marginally change the finite sample results; in fact, all main conclusions of the empirical analysis are unchanged.²

In terms of the notation in (2.8) we model the variable Xt in (4.1) and use

Dt = (t, D1t)′ and dt = ΔDt = (1, d1t)′,

where the broken trend D1t begins with the value D1t = 1 in 1982:1 and is zero before that, while the step dummy d1t equals one from 1982:1 onwards. Finally, δt contains the three induced impulse dummies and, in the augmented model, the nine innovational impulse dummies in addition.

4.2. Determination of cointegration ranks

To make inference on the cointegration ranks (r, s) we first simulate the asymptotic distribution in (3.8) for the current q = 1 and u1 = 0.377. This is done by replacing the Brownian motion W(u) with a random walk with 2000 steps, replacing D(u) with a discrete time trend function, and replacing d(u) with the corresponding discrete step function. The simulation here is based on 50,000 replications, and moments and quantiles are reported in Table 1. Table 2 reports the LR statistics for the cointegration ranks for the baseline model, together with the asymptotic tail probabilities derived from the distribution in Table 1. Note that we do not assume that variables are I(2), or I(1), but test this in the VAR model, as the cointegration

² If we do not include the piecewise linear trend, and consider instead a model with a constant trend and the nine innovational dummies of the augmented model above, the corresponding likelihood function is misspecified. Tests for no autocorrelation reject with ξAR(1) = 46, ξAR(2) = 68, and the test for no ARCH effects rejects with ξARCH(1) = 275.
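The discretization scheme just described can be sketched in a few lines. The snippet below is illustrative only: it simulates a generic trace-type limit functional of a Brownian motion corrected for a trend, a broken trend and the corresponding step functions, not the exact functional S(·) in (3.8), and all function and variable names are ours.

```python
import numpy as np

def simulate_limit_stat(n_steps=2000, dim=2, u1=0.377, rng=None):
    """One draw from a trace-type limit functional: W(u) is replaced by a
    scaled random walk, D(u) by a discrete trend and broken trend, and
    d(u) by the corresponding step functions (illustrative sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    du = 1.0 / n_steps
    dW = rng.standard_normal((n_steps, dim)) * np.sqrt(du)
    W = np.cumsum(dW, axis=0)                        # random-walk approximation of W(u)
    u = np.arange(1, n_steps + 1) / n_steps
    D = np.column_stack([u, np.maximum(u - u1, 0.0)])                # trend, broken trend
    d = np.column_stack([np.ones(n_steps), (u > u1).astype(float)])  # d(u) = "slope" of D(u)
    X = np.column_stack([D, d])
    G = W - X @ np.linalg.lstsq(X, W, rcond=None)[0]  # W corrected for the deterministics
    A = G[:-1]
    S11 = A.T @ A * du                                # discretized integral of G G'
    S10 = A.T @ dW[1:]                                # discretized integral of G dW'
    return np.trace(np.linalg.solve(S11, S10 @ S10.T))

rng = np.random.default_rng(42)
draws = np.array([simulate_limit_stat(n_steps=400, rng=rng) for _ in range(1000)])
p_value = np.mean(draws > 22.1)   # tail probability for a hypothetical statistic of 22.1
```

A full implementation would use 50,000 replications of the functional in (3.8); the tail probability is then, exactly as in the note to Table 2, the proportion of simulated draws exceeding the observed statistic.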
ranks determined correspond to the number of double unit roots (p − r − s) and single unit roots (s), respectively. The hypotheses HD(r, s) are tested sequentially against HD(p) based on the partial nesting structure. All models with r = 0 and 1 are safely rejected. In the row for r = 2, the reductions to the models HD(2, 1) and HD(2, 2) have tail probabilities around 10%, and we note that in the augmented model with nine additional impulse dummies the tail probabilities for the LR statistics for the two candidate models are 9% and 14%, respectively. The two potentially preferred models are nested, HD(2, 1) ⊂ HD(2, 2), and can be compared using an LR test. It follows directly from the result in Theorem 3.3 that the likelihood ratio statistic for HD(2, 1) | HD(2, 2), calculated from the estimated covariances as

QLR(2,1)|(2,2) = T log( |Ω̂(2,1)| / |Ω̂(2,2)| ),

has the limiting distribution of the maximum eigenvalue of S(W2, G(r,s)), see Nielsen (2007), which is easily simulated. For the baseline model the statistic is 18.2, corresponding to a tail probability of 7%, while the augmented model produces a test statistic of 20.5 and a tail probability of 3%, showing that the reduction to the model HD(2, 1) is marginal. Furthermore, the model HD(2, 2), with p − r − s = 1 I(2) trend, is most easily reconciled with economic theory, and together with the statistical evidence we take this model as the preferred one in the following. Note that there are strong indications of an I(2) trend in the data, even after allowing for a shift in the deterministic trend slope. For the baseline model with the hypothesis HD(2, 2) imposed, the characteristic polynomial has four unit roots and the remaining 11 roots are all strictly outside the unit circle; see Assumption 2.1.

4.3. Testing parameter restrictions

Based on the preferred model HD(2, 2), we first investigate whether the change in the linear trend implied by D1t is needed or, equivalently, we test the restriction of a common deterministic trend slope in all cointegrating relationships, τ′Xt, in the two sub-samples. We formulate this as

H0: τ∗ = (I6, 0(6×1))′ ϕ,

with ϕ an unrestricted 6 × 4 matrix, imposing a zero last row in τ∗. The LR statistic for H0 | HD(2, 2) equals 17.8, corresponding to a tail probability of approximately 1.4 × 10−3 in the asymptotic χ2(4) distribution. This emphasizes the relevance of the piecewise linear trend. In the augmented model with nine additional impulse dummies the corresponding statistic is 29.6, strongly confirming this conclusion. Another important hypothesis is that the common I(2) trend loads into the nominal variables with equal coefficients, so that the real variables, ct − pt, yt − pt and wt − pt, are first-order non-stationary, I(1); cf. also the graphical appearance in Figure 1, graph (C). Economically, this hypothesis implies that money illusion is excluded in the long run, and the hypothesis would allow a nominal-to-real transformation from the I(2) vector Xt to a vector of I(1) variables, e.g. Yt = (ct − pt, yt − pt, wt − pt, Rt, Δpt)′; see Kongsted (2005). Given homogeneity, the subsequent I(1) cointegration analysis of Yt can be conducted without loss of information, and the cointegrating relationships between levels and
differences are embedded as usual cointegrating relationships in the I(1) cointegration model; see Kongsted and Nielsen (2004). Often this hypothesis is imposed a priori, see e.g. the analyses of real consumption variables in Lettau and Ludvigson (2001) and Palumbo et al. (2006), but here we want to test the hypothesis of homogeneity explicitly. Equal loadings on the single stochastic I(2) trend correspond to β⊥2 being proportional to B = (1, 1, 1, 1, 0)′ and hence known; see (2.9). For the baseline case and for the augmented model, the estimated unrestricted counterparts are given by

β̂⊥2 = (1, 0.918, 1.436, 1.163, 0.018)′  (baseline),
β̂⊥2 = (1, 0.914, 1.263, 1.130, 0.029)′  (augmented),
respectively. The results suggest that the unrestricted estimates under HD(2, 2) are quite close to homogeneous, except for slightly larger coefficients on w and p. Noting that β⊥2 = τ⊥, the homogeneity restriction can be formally tested as

H1: τ∗ = [ B⊥      0(5×2) ] ϕ,
          [ 0(2×4)  I2     ]

where B⊥ is 5 × 4 and ϕ is 6 × 4 with unrestricted parameters. The LR statistics for H1 are 5.3 and 3.4 in the two specifications, corresponding to tail probabilities of 26% and 49% in the asymptotic χ2(4) distribution. This gives no evidence against homogeneity between the stochastic trends. For comparison, the homogeneity restriction in H1 has also been tested in the model with no piecewise linear trend (which is therefore misspecified; see footnote 2). Without allowing for the changing trend slopes, the LR statistic is 25.2, and this would lead to a firm rejection of homogeneity of the stochastic trends and, hence, a rejection of the economically relevant nominal-to-real transformation.

4.4. Interpretation of the model and the deterministic terms

To interpret the model and the deterministic terms in more detail, we conclude the analysis by reporting the main estimated parameters of the model. We focus on the augmented model with homogeneity imposed. Note that under homogeneity, β⊥2 is known up to a normalization, and we present results for β⊥2 = ¼ · (1, 1, 1, 1, 0)′, such that β′⊥2ΔXt = (Δct + Δyt + Δwt + Δpt)/4 is the average inflation rate. The relationships that cointegrate from I(2) to I(1) are represented by τ∗, and τ∗′X∗t is I(1). For the interpretation we first normalize τ to highlight the importance of real magnitudes, and we note that under the imposed homogeneity restriction in H1 above, it holds that c − p, y − p, w − p and R, corrected for the piecewise linear trends, are I(1) variables. Next, the stationary combinations between levels and differences can be written as

ŝt = ρ̂′τ̂′Xt + ψ̂′β̄⊥2β′⊥2ΔXt,    (4.2)
with cointegrating coefficients given by (ρ̂′, ψ̂′β̄⊥2)′. We normalize ρ and ψ such that (ρ̂′, ψ̂′β̄⊥2)′ can be given an economic interpretation. The first stationary relationship is normalized to be a consumption function, with a unit coefficient on real consumption, c − p, and a zero coefficient on the nominal interest rate, R. The second relation is identified as a relation for the real interest rate, Rt − β′⊥2ΔXt. With the ordering of the variables given by Xt = (c, y, w, p,
R)′, we get the following estimates, with asymptotic standard errors in parentheses and the rows of τ̂∗ ordered as (c, y, w, p, R, t, D1t):

         [  1        0        0        0      ]
         [  0        1        0        0      ]
         [  0        0        1        0      ]
 τ̂∗ =   [ −1       −1       −1        0      ]
         [  0        0        0        1      ]
         [ −0.818   −0.713   −0.376   −0.034  ]
         [ (0.095)  (0.118)  (0.152)  (0.013) ]
         [  0.110    0.091   −0.513    0.061  ]
         [ (0.122)  (0.306)  (0.414)          ]

and, stacking ρ̂ (first four rows) and ψ̂′β̄⊥2 (last row),

                    [  1         −0.335  ]
                    [            (0.026) ]
                    [ −0.635      0.375  ]
 ( ρ̂′, ψ̂′β̄⊥2 )′ = [ (0.080)    (0.024) ]
                    [ −0.096     −0.034  ]
                    [ (0.056)    (0.012) ]
                    [  0          1      ]
                    [  0.980     −1      ]
                    [ (0.305)    (0.017) ]
where we have scaled the deterministic terms by 10−2 to avoid small coefficients. We note that the change in the trend slope (last row of τ̂∗) is statistically significant for the nominal interest rate, a feature that was also visible in Figure 1 (panel D). The asymptotic distributions of τ̂∗ and ρ̂ are taken from Corollary 3.1 and Theorem 3.2. The first stationary linear combination can be written in terms of the first column of the estimated β∗, i.e. β̂∗1 = τ̂∗ρ̂1, as

ct − pt = 0.635·(yt − pt) + 0.096·(wt − pt) + 0.329·t − 0.101·D1t − 0.980·β′⊥2ΔXt + u1t,
          (0.080)          (0.056)           (0.054)   (0.064)    (0.305)
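The mapping β̂∗1 = τ̂∗ρ̂1 can be checked numerically from the reported point estimates; a sketch (all array names are ours, and the numbers are simply transcribed from the displays above):

```python
import numpy as np

# columns of the estimated tau*, rows ordered (c, y, w, p, R, t, D1t)
tau_star = np.array([
    [ 1.0,    0.0,    0.0,    0.0  ],
    [ 0.0,    1.0,    0.0,    0.0  ],
    [ 0.0,    0.0,    1.0,    0.0  ],
    [-1.0,   -1.0,   -1.0,    0.0  ],
    [ 0.0,    0.0,    0.0,    1.0  ],
    [-0.818, -0.713, -0.376, -0.034],
    [ 0.110,  0.091, -0.513,  0.061],
])
rho1 = np.array([1.0, -0.635, -0.096, 0.0])   # first column of the estimated rho
beta1 = tau_star @ rho1                        # first column of beta* = tau* rho
# the trend and broken-trend entries reproduce the 0.329*t and -0.101*D1t
# terms of the consumption function (signs flip when moving to the RHS)
print(round(beta1[5], 3), round(beta1[6], 3))  # -0.329 and 0.101
```

The same multiplication with the second column of ρ̂ reproduces the trend terms of the real interest rate relation below.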
with u1t being the stationary deviation from equilibrium. This is a quite standard consumption function for real consumption in terms of real income and real wealth. The price homogeneity, as tested and imposed above, rules out long-run money illusion. As expected, the coefficients on real income and real wealth are positive, but they do not sum to one, indicating a less than proportional increase in consumption on a balanced growth path. This may be surprising, but it is captured in sample by an estimated positive linear deterministic trend reflecting autonomous growth of approximately 1.3% per year. The trade-off between a positive deterministic trend and income and wealth coefficients summing to less than one is common and reflects multicollinearity between the linear deterministic trend and the stochastic I(1) trends in the real magnitudes. There is a negative effect of the inflation term, β′⊥2ΔXt, in the consumption function, reflecting a deviation from price homogeneity in the short run. The modelled break in the deterministic trend function entails slightly slower autonomous growth after 1981, but it is not strongly significant in the first relation. The second stationary relation in (4.2), identified as a relation for the real interest rate, includes c − p and y − p with almost opposite coefficients, corresponding to a consumption–income ratio, while real wealth has a small coefficient. This suggests a positive association between the real interest rate and the business cycle in consumption. Imposing equal coefficients on real consumption and income gives the approximate relation:

Rt − β′⊥2ΔXt = 0.335·(ct − yt) + 0.034·(wt − pt) + 0.015·t − 0.075·D1t + u2t,
               (0.026)          (0.012)           (0.008)   (0.010)
where u2t is stationary. In this case the deterministic trend function gives a small positive trend before 1981 and a quite strong negative trend after 1982. We conjecture that this is related to the period of disinflation after the stagflation period of the late 1970s.
The error correction of deviations from equilibrium is given by α, which is estimated as

       [ −0.232   −0.162  ]
       [ (0.094)  (0.156) ]
       [ −0.303   −0.559  ]
       [ (0.095)  (0.209) ]
 α̂ =  [ −0.167    0.715  ]
       [ (0.233)  (0.583) ]
       [ −0.092   −0.007  ]
       [ (0.041)  (0.089) ]
       [ −0.016   −0.155  ]
       [ (0.016)  (0.026) ]
We note that deviations from the consumption function are corrected mainly by changes in consumption and income. Deviations from the real interest rate relation are corrected by the nominal interest rate and income; the latter is negative, showing the contractionary effect of a high real interest rate. To illustrate the role of the deterministic components in the model, i.e. the effects of the trend with a changing slope in 1981:2, the k = 3 impulse dummies induced by the changing trend slope, and the nine additional innovation dummies included to account for outliers, we calculate the terms in (2.9) involving the deterministic variables, Dt, dt, δt, and the initial values, X0, ΔX0 and Δ²X0, i.e.

C2 Σ_{s=1}^{t} Σ_{i=1}^{s} Φδ δi + C1 Σ_{i=1}^{t} Φδ δi + γD Dt + γd dt + C0(L)Φδ δt,    (4.3)
where γD and γd also contain the effects of the initial values of the process. The innovation dummies in δt enter the dynamics in the same way as the innovations (εt), and accumulate (once and twice) to produce level shifts and changing trend slopes in the data. We note that the changed trend slopes implied by the unrestricted impulse dummies in δt, i.e. C2 Σ_{s=1}^{t} Σ_{i=1}^{s} Φδ δi, are identified by single observations only and are not consistently estimable; see also Theorem 3.1. In the model they have the same effect as the model innovations, εt, and can be interpreted as the effects of extraordinarily large shocks to the dynamic system. Empirically, it may not be particularly important whether the large shocks are modelled with dummy variables or not. Asymptotically the effects of the unrestricted impulse dummies are negligible, and in the empirical analysis above the baseline and augmented models yield largely similar results. Figure 2 shows the r = 2 stable combinations in (4.2) together with their deterministic components. For ŝ1t we note the marked linear trend in equilibrium as discussed above. The break in 1981:2 allows for a downward shift in the equilibrium level and a marginally smaller trend slope. For the real interest relation in ŝ2t, the changing trend slope is clearly important. Regarding the impulse dummies, we note that the k = 3 induced dummies play the role of conditioning on the observations for 1981:2, 1981:3 and 1981:4, and the effect is comparable to fixing the initial values, 1964:2, 1964:3 and 1964:4. In addition, Figure 2 highlights the observations modelled by innovation impulse dummies.³ From the accumulation in (4.3), with β′C2 = 0 and β′C1 = 0,

³ The plot shows ŝt = (ŝ1t, ŝ2t)′ = β̂′Xt + ψ̂′β̄⊥2β′⊥2ΔXt, based on the estimated augmented model with homogeneity imposed, and their deterministic components. The deterministic parts are the terms in (2.9) depending on Dt, dt, δt and the initial values; see (4.3).
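The once- and twice-accumulated impulse dummies in (4.3) generate a level shift and a broken linear trend, respectively; a minimal numerical illustration of this accumulation (dates and names are ours):

```python
import numpy as np

T, T1 = 20, 10
delta = np.zeros(T)
delta[T1] = 1.0                        # impulse dummy: 1 at t = T1, 0 elsewhere
level_shift = np.cumsum(delta)         # single accumulation: step function (level shift)
broken_trend = np.cumsum(level_shift)  # double accumulation: broken linear trend
print(level_shift[-1], broken_trend[-1])   # 1.0 and 10.0
```

This is exactly why an unrestricted innovation dummy can mimic a changing trend slope in the data, while its effect on the broken trend is identified by a single observation only.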
Figure 2. Stationary cointegrating relationships. Panel (A) shows ŝ1t and panel (B) shows ŝ2t; each panel plots the actual observations, the deterministic components, the conditioning observations and the innovation dummies.
the impulse dummies give at most level shifts in β̂′Xt, but the accumulated effects cancel in the cointegrating relations, producing only exponentially decreasing effects (of the form λ^t) in ŝit, i = 1, 2. For empirical applications a choice must therefore be made between allowing an innovation dummy, which produces changing trend slopes in the data that do not change the equilibrium relationships, and allowing a changing trend slope also in the equilibrium relationships. Economically, this amounts to choosing between large shocks that follow the usual dynamics of the normal innovations and genuine regime shifts. In the application above, this choice was based on a priori reasoning and the graphical appearance of the data, and in the maintained model we tested whether the changing trend slope is needed in the cointegrating relationships. This approach could of course be extended to each potential outlying observation. For each innovational dummy in the model we could test the relevance of the break in the cointegrating relationships by extending Dt to allow the relevant broken trend in all linear combinations of the data and testing whether the changing trend slope cancels in the cointegrating relationships. This would allow asymptotically similar inference on the cointegration ranks and consistent testing for the presence of changing trend slopes in the cointegrating relationships.
ACKNOWLEDGMENTS The authors wish to thank three anonymous referees as well as the editor and D. F. Hendry, S. Johansen and B. Nielsen for very helpful comments. Kurita is grateful for support from grant JSPS KAKENHI (19830111). The calculations are done using Ox 6.0; see Doornik (2007). Ox code for simulating the asymptotic distribution of the I(2) rank test can be downloaded from www.econ.ku.dk/okohn.
REFERENCES

Bacchiocchi, E. and L. Fanelli (2005). Testing the purchasing power parity through I(2) cointegration techniques. Journal of Applied Econometrics 20, 749–70.

Bai, J., R. L. Lumsdaine and J. H. Stock (1998). Testing for and dating common breaks in multivariate time series. Review of Economic Studies 65, 395–432.
Banerjee, A., L. Cockerell and B. Russell (2001). An I(2) analysis of inflation and the markup. Journal of Applied Econometrics 16, 221–40.

Boswijk, H. P. (2000). Mixed normality and ancillarity in I(2) systems. Econometric Theory 16, 878–904.

Boswijk, H. P. (2010). Mixed normal inference on multicointegration. Forthcoming in Econometric Theory.

Cavaliere, G., A. Rahbek and R. Taylor (2010). Co-integration rank testing under conditional heteroskedasticity. Forthcoming in Econometric Theory.

Diamandis, P., D. Georgoutsos and G. Kouretas (2000). The monetary model in the presence of I(2) components: long-run relationships, short-run dynamics and forecasting of the Greek drachma. Journal of International Money and Finance 19, 917–41.

Doornik, J. A. (2007). Object-Oriented Matrix Programming Using Ox (3rd ed.). London: Timberlake Consultants Press.

Fliess, N. and R. MacDonald (2001). The instability of the money demand function: an I(2) interpretation. Oxford Bulletin of Economics and Statistics 63, 475–95.

Johansen, S. (1997). Likelihood analysis of the I(2) model. Scandinavian Journal of Statistics 24, 433–62.

Johansen, S. (2006). Statistical analysis of hypotheses on the cointegrating relations in the I(2) model. Journal of Econometrics 132, 81–115.

Johansen, S., K. Juselius, R. Frydman and M. Goldberg (2010). Testing hypotheses in an I(2) model with piecewise linear trends. An analysis of the persistent long swings in the Dmk/$ rate. Journal of Econometrics 158, 117–29.

Johansen, S., R. Mosconi and B. Nielsen (2000). Cointegration analysis in the presence of structural breaks in the deterministic trend. Econometrics Journal 3, 216–49.

Juselius, K. (1998). A structured VAR for Denmark under changing monetary regimes. Journal of Business and Economic Statistics 16, 400–12.

Juselius, K. (1999). Price convergence in the medium and long run: an I(2) analysis of six price indices. In R. F. Engle and H. White (Eds.), Cointegration, Causality, and Forecasting: A Festschrift in Honour of Clive W. J. Granger, 301–25. Oxford: Oxford University Press.

Kongsted, H. C. (2005). Testing the nominal-to-real transformation. Journal of Econometrics 124, 205–25.

Kongsted, H. C. and H. B. Nielsen (2004). Analyzing I(2) systems by transformed vector autoregressions. Oxford Bulletin of Economics and Statistics 66, 379–97.

Lettau, M. and S. Ludvigson (2001). Consumption, aggregate wealth, and expected stock returns. Journal of Finance 56, 815–49.

Lütkepohl, H., P. Saikkonen and C. Trenkler (2004). Testing for the cointegration rank of a VAR process with level shift at unknown time. Econometrica 72, 647–62.

Magnus, J. R. and H. Neudecker (1999). Matrix Differential Calculus with Applications in Statistics and Econometrics. Chichester: John Wiley.

Nielsen, B. and A. Rahbek (2000). Similarity issues in cointegration analysis. Oxford Bulletin of Economics and Statistics 62, 5–22.

Nielsen, H. B. (2002). An I(2) cointegration analysis of price and quantity formation in Danish manufactured exports. Oxford Bulletin of Economics and Statistics 64, 449–72.

Nielsen, H. B. (2007). A maximum-eigenvalue test for the cointegration ranks in I(2) VAR models. Economics Letters 94, 445–51.

Nielsen, H. B. and C. Bowdler (2006). Inflation adjustment in the open economy: an I(2) analysis of UK prices. Empirical Economics 31, 569–86.

Nielsen, H. B. and A. Rahbek (2007). The likelihood ratio test for cointegration ranks in the I(2) model. Econometric Theory 23, 613–35.

Palumbo, M., J. Rudd and K. Whelan (2006). On the relationships between real consumption, income, and wealth. Journal of Business and Economic Statistics 25, 1–11.
Paruolo, P. (2000). Asymptotic efficiency of the two stage estimator in I(2) systems. Econometric Theory 16, 524–50.

Perron, P. (1989). The Great Crash, the oil price shock, and the unit root hypothesis. Econometrica 57, 1361–401.

Perron, P. (1990). Testing for a unit root in a time series with a changing mean. Journal of Business and Economic Statistics 8, 153–62.

Perron, P. (2006). Dealing with structural breaks. In K. Patterson and T. Mills (Eds.), Palgrave Handbook of Econometrics, Volume 1: Econometric Theory, 278–352. London: Palgrave Macmillan.

Qu, Z. and P. Perron (2007). Estimating and testing structural changes in multivariate regressions. Econometrica 75, 459–502.

Rahbek, A., H. C. Kongsted and C. Jørgensen (1999). Trend-stationarity in the I(2) cointegration model. Journal of Econometrics 90, 265–89.
APPENDIX A: ASYMPTOTICS

A.1. Proof of Theorem 3.1

The I(2) model in (2.7) is a regression model with non-linear parameters. To analyse this, it is, as in Johansen (1997, Theorem A1), useful to initially analyse a linear regression model with regressors as in (2.7). With Vt p-dimensional, write the linear regression model as

Vt = θ0 Z0tλ + θ1 Z1tλ + θ2 Z2tλ + θD ZDt + θd Zdt + θδ δt + vt(θ),    (A.1)

where for t = 1, 2, . . . , T, vt(θ) is Np(0, Ω) distributed, conditional on the regressors Zitλ and past Vt and Zitλ. The pzi-dimensional regressors Zitλ are — apart from an asymptotically vanishing term λt, defined in (2.10) in terms of the impulse dummies δt — mean-zero I(i) processes for i = 0, 1, 2. Specifically, with ηt independent of vt and i.i.d.(0, Ωη) distributed, Zitλ = Zit + λt, where Δ^i Zit = C^i(L)ηt = Σ_{j=0}^{∞} C^i_j ηt−j with ηt pη-dimensional and the coefficients C^i_j exponentially decreasing. Furthermore, ZDt = Dt−1, which is pD-dimensional and satisfies Assumption 2.2, and Zdt = dt−1, with dt = ΔDt. Finally, δt is a pδ-dimensional impulse dummy regressor with entries δit = 1(t = Ti), 1 < Ti < T, and Ti = [T ui] with ui ∈ ]0, 1[.

LEMMA 4.1. Set θZ = (θ0, θ1, θ2, θD, θd) and θ = (θZ, θδ) ∈ Θ ⊂ Rn, where Θ is closed and Ω > 0 varies freely. Then for the MLE θ̂ it holds that

NT^{−1} (θ̂Z − θ0Z)′ →p 0,    (A.2)

as T → ∞, with NT = blockdiag(Ipz0, T^{−1/2} Ipz1, T^{−3/2} Ipz2, T^{−1} IpD, Ipd). Furthermore, (θ̂δ − θ0δ) = OP(1).

Proof: Define Ztλ = (Z0tλ′, Z1tλ′, Z2tλ′, ZDt′, Zdt′)′, Zt = (Z0t′, Z1t′, Z2t′, ZDt′, Zdt′)′ and set vt = vt(θ0). Moreover, use the notation that for any px- and py-dimensional time series Xt and Yt, respectively,

Myx = (1/T) Σ_{t=1}^{T} Yt Xt′.    (A.3)
Next, note that

θ̂Z − θ0Z = Mvzλ·δ M^{−1}_{zλzλ·δ} = ( Mvzλ − Mvδ M^{−1}_{δδ} Mδzλ ) ( Mzλzλ − Mzλδ M^{−1}_{δδ} Mδzλ )^{−1}.
By definition of the pδ-dimensional impulse dummy δt, and the generic λt defined in (2.10), standard limit arguments immediately give NT Mzλzλ·δ NT = NT Mzz NT + oP(1). That is, the OLS correction for δt is asymptotically negligible, and moreover, Zitλ = Zit + λt behaves asymptotically as Zit for i = 0, 1 and 2. Hence,

NT Mzλzλ·δ NT = NT Mzz NT + oP(1) ⇒ blockdiag( Σ00, ∫01 F(u)F(u)′ du ),    (A.4)

where Σ00 = Var(Z0t) and F(u) = (F1(u)′, F2(u)′, D(u)′, d(u)′)′. Here F1(u) = C^1(1)Wη(u) and F2(u) = C^2(1) ∫0u Wη(s) ds, with Wη a Brownian motion with variance Ωη. Similarly,

T^{1/2} NT Mzλv·δ = T^{1/2} NT Mzv + oP(1) ⇒ ( Npz0×p(0, Σ00 ⊗ Ω0), ∫01 F dW′ ),    (A.5)
where W(u) is a p-dimensional Brownian motion with variance Ω0. Collecting terms, (A.2) holds. Note that it is essential for the results that the asymptotically stationary regressor Z0tλ = Z0t + λt has mean zero apart from the generic λt defined in (2.10), which is asymptotically vanishing. If not, e.g. the blockdiagonality in (A.4), which corresponds to the limiting information, would not apply. Finally, with each entry δit in the pδ-dimensional δt of the form δit = 1(t = Ti), it follows that θ̂δ = Mv(θ̂)δ M^{−1}_{δδ} = ( Mvδ − (θ̂Z − θ0Z) Mzλδ ) M^{−1}_{δδ}, or

θ̂δ − θ0δ = ( vT1 − (θ̂Z − θ0Z)ZλT1 , . . . , vTpδ − (θ̂Z − θ0Z)ZλTpδ ) = ( vT1, . . . , vTpδ ) + oP(1),

from which θ̂δ = OP(1) and the inconsistency holds.
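The inconsistency argument — the impulse-dummy coefficient picks up the innovation at the dummy date — rests on the classical fact that an impulse dummy for a single observation is equivalent to deleting that observation, with the dummy coefficient equal to the out-of-sample prediction error at that date. A small simulated sketch (all names and data are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
T, T1 = 100, 40
x = np.column_stack([np.ones(T), rng.standard_normal(T)])
y = x @ np.array([0.5, 1.0]) + rng.standard_normal(T)

d = np.zeros(T)
d[T1] = 1.0                                       # impulse dummy 1(t = T1)
Xd = np.column_stack([x, d])
coef = np.linalg.lstsq(Xd, y, rcond=None)[0]      # regression including the dummy

keep = np.arange(T) != T1                          # drop observation T1 instead
b_loo = np.linalg.lstsq(x[keep], y[keep], rcond=None)[0]
# dummy coefficient = leave-one-out prediction error at T1
assert np.isclose(coef[-1], y[T1] - x[T1] @ b_loo)
```

Because the coefficient equals a single prediction error, it remains OP(1) and is not consistently estimable, exactly as stated for θ̂δ in Lemma 4.1.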
Proof of Theorem 3.1: Rewrite the I(2) model HD(r, s) in (2.7) as in (A.1),

Δ²Xt = θ0 Z0tλ + θ1 Z1tλ + θ2 Z2tλ + θD ZDt + θd Zdt + θδ δt + εt(θ),    (A.6)

where ZDt = Dt−1, Zdt = dt−1, Z2tλ = β′⊥20 Xt−1, θδ δt = Φδ δt and

Z0tλ = ( (β′0 Xt−1 + ψ′0 ΔXt−1 + ρ′0 τ′D0 Dt−1 + ψ′d0 dt−1)′, (τ′0 ΔXt−1 + τ′D0 dt−1)′, (β′⊥20 Δ²Xt−1)′ )′,
Z1tλ = β′⊥10 ( ΔXt−1 + τ̄0 τ′D0 ΔDt−1 ).    (A.7)
Recall that ρ = τ̄′0 β, such that ρ = ρ0 + τ̄′0 β⊥10 B1 = ρ(B1), implying ρ⊥ = ρ⊥(B1) as well. Using the definitions in (3.1), the parameters θ0, θ1 and θ2 are given by

θ0 = ( α, α(ψ − ψ0)′τ̄0 + α̃⊥κ′, Γ ),
θ1 = ( αB0 + α̃⊥κ′( ρ̄⊥(B1)C0 + ρ̄(B1)B2 ), αB1 ),    (A.8)
θ2 = αB2,

while the parameters for the deterministic regressors are given by θδ = Φδ,

θD = αBD and θd = αBd + α̃⊥κ′( ρ̄⊥(B1)CD + ρ̄(B1)BD ).    (A.9)

Applying our Lemma 4.1, the proof is identical to the proof of Theorem 2 in Johansen (1997), apart from the θD and θd parameters in (A.9), and hence BD, Bd and CD in (3.1). As θ̂D = α̂B̂D, with α̂ consistent, we can conclude by Lemma 4.1 that T B̂D →p 0. Next, with θd defined in (A.9), multiply by α̂′⊥ to see
that ĈD →p 0, as ρ̂⊥, ρ̂, κ̂, Ω̂, α̂⊥, B̂1 and B̂D are consistent. Likewise, multiplying by ᾱ̂′Ω̂^{−1} gives B̂d →p 0.
A.2. Proof of Theorem 3.2

The proof proceeds basically as in the proof of Lemma 1 in Johansen (1997), apart from the additional deterministic terms here. Thus, in terms of the parametrization in (A.6), note initially that the parameters α, θ02, Γ, Φδ, Ω, B0, B1, B2, BD, Bd, C0 and CD are all freely varying, where

θ02 = α(ψ − ψ0)′τ̄0 + α̃⊥κ′.    (A.10)

Clearly α, Γ, Φδ and Ω are trivial to obtain from these; as noted above ρ = ρ(B1), while β = β(B1, B2), ψ = ψ(θ02, B0, α, Ω), κ = κ(θ02, α, Ω) and τ = τ(C0, B2); see also Johansen (1997, eqn. (48)). For the remaining new parameters τD and ψd, note first that τD can be found from ρ = ρ(B1), BD and CD as

(τD − τD0)′ = ρ̄(B1)BD + ρ̄⊥(B1)CD.    (A.11)

Next,

(ψd − ψd0)′ = Bd + (ψ − ψ0)′ τ̄0 τ′D0.    (A.12)
With θZ = (θ0, θ1, θ2, θD, θd), θδ = Φδ and θ = (θZ, θδ), the log-likelihood function is given by

LT(θ, Ω) = −(1/2) { T log |Ω| + tr( Ω^{−1} Σ_{t=1}^{T} εt(θ) εt(θ)′ ) },    (A.13)
where, with εt = εt(θ0),

εt(θ) = Δ²Xt − θ0'Z0tλ − θ1'Z1tλ − θ2'Z2tλ − θD'ZDt − θd'Zdt − θδ'δt
      = εt − (θ0 − θ00)'Z0tλ − (θ1 − θ01)'Z1tλ − (θ2 − θ02)'Z2tλ − (θD − θ0D)'ZDt − (θd − θ0d)'Zdt − (θδ − θ0δ)'δt.

The limiting distribution of θ̂Z is found by considering an asymptotic expansion of the score evaluated at θ̂. Introduce therefore the notation dLT(θ̂, Ω̂; dA) for the differential of the log-likelihood function in (A.13) in the direction dA, where A is a matrix (or vector) valued parameter in θ, and the differential is evaluated at (θ, Ω) = (θ̂, Ω̂).⁵ Set B = (B0, B1, B2, BD, Bd), BT = (T^{1/2}B0, T B1, T²B2, T^{3/2}BD, T^{1/2}Bd) and define accordingly ZBt = (Z1tλ', Z2tλ', ZDt', Zdt')'. Moreover, corresponding to the order of magnitudes of the processes in ZBt, set NTB = diag(T^{−1/2}Ip−r, T^{−3/2}Ip−r−s, T^{−1}Iq+1, Iq+1). Then by definition dLT(θ̂, Ω̂; dB) = 0, and with B̂T inserted for B, one finds that

tr{ ( α0'Ω0⁻¹MεzB NTB − α0'Ω0⁻¹α0 B̂T' NTB MzBzB NTB ) dB } = oP(1).   (A.14)

This is the equivalent of Johansen (1997, eqn. (55)) and holds by applying limiting arguments in terms of Z0tλ, Z1tλ and Z2tλ which, apart from asymptotically vanishing λt terms, are I(0), I(1) and I(2), respectively; see the proof of Lemma 4.1. A further difference is the inclusion of the (θδ − θ0δ)'δt term in the residual εt(θ). In (A.14), we have in particular used that

α0'Ω0⁻¹(θ̂δ − θ0δ)' T^{1/2} MδzB NTB = oP(1),

which holds since T^{1/2}MδzB NTB = oP(1), as δt contains only impulse dummies, and as (θ̂δ − θ0δ) = OP(1); see Theorem 4.1.
⁵ See Magnus and Neudecker (1999) for the theory of matrix differential calculus.
Similar to Johansen (1997) one may also note that differentials of θ 1 and θ d in the direction dB1 do not matter asymptotically as they are multiplied by either of C 0 , B 2 , CD or BD and hence by Theorem 3.1 converge in probability to zero. Likewise, the differentials of θ 1 in the direction dB2 and of θ d in the direction dBD do not matter asymptotically. Moreover, the definitions in (A.8) and (A.9) have been used, in addition to the consistency results of Theorem 3.1, to see that,
α0'Ω̂⁻¹ T θ̂1 = α0'Ω̂⁻¹α̂ (T^{1/2}B̂0, T B̂1) + oP(1),
α0'Ω̂⁻¹ T² θ̂2 = α0'Ω̂⁻¹α̂ T² B̂2 + oP(1),
α0'Ω̂⁻¹ T^{3/2} θ̂D = α0'Ω̂⁻¹α̂ T^{3/2} B̂D + oP(1),
α0'Ω̂⁻¹ T^{1/2} θ̂d = α0'Ω̂⁻¹α̂ (T^{1/2}B̂d) + oP(1).
Next, by (A.14) and (A.4), in the limit as T → ∞, with B∞ denoting the limiting distribution of B̂T,

α0'Ω0⁻¹ ∫₀¹ dV H*' = α0'Ω0⁻¹α0 B∞' ∫₀¹ H*(u)H*(u)' du,   (A.15)
from which the first result in Theorem 3.2 follows. Note that H*(u) = (H(u)', D(u)', d(u)')' is defined in Theorem 3.2 in terms of H(u) in (3.4), the limit of the deterministic terms and the p-dimensional Brownian motion V(u) with covariance Ω0. For the asymptotics of Ĉ0 and ĈD, set, similarly to above, C = (C0, CD), CT = (T C0, T^{1/2}CD) and define ZCt = (Z1t'(Ip−r−s, 0)', Zdt')'; that is, ZCt = (Xt−1'β20, Zdt')'. Moreover, set NTC = blockdiag(T^{−1/2}Ip−r−s, Iq+1) corresponding to the order of magnitude of ZCt. By definition, dLT(θ̂, Ω̂; dC) = 0, and, similarly to (A.14), with ĈT inserted for C,

tr{ ( φ0'Ω0⁻¹MεzC NTC − φ0'Ω0⁻¹φ0 ĈT' NTC MzCzC NTC ) dC } = oP(1),   (A.16)
where φ = α̃⊥κ'ρ̄⊥' = α⊥(α⊥'α⊥)⁻¹κ'ρ̄⊥'; cf. (3.6). This is the equivalent of Johansen (1997, p. 461) and holds as above by standard limiting arguments, the fact that δt is asymptotically negligible, and the definitions in (A.8) and (A.9), in addition to the consistency results of Theorem 3.1. In particular, it has been used that φ'Ω⁻¹α = 0 such that,
φ0'Ω̂⁻¹ T θ̂1 = φ0'Ω̂⁻¹φ̂ T Ĉ0 (Ip−r−s, 0(p−r−s)×s) + oP(1),
φ0'Ω̂⁻¹ T^{1/2} θ̂d = φ0'Ω̂⁻¹φ̂ (T^{1/2}ĈD) + oP(1).

Next, by (A.16) and (A.4), in the limit as T → ∞, with C∞ denoting the limiting distribution of ĈT,

φ0'Ω0⁻¹ ∫₀¹ dV H0*' = φ0'Ω0⁻¹φ0 C∞' ∫₀¹ H0*(u)H0*(u)' du,   (A.17)
from which the second result in Theorem 3.2 follows using the definition of φ. Note that H0*(u) = (H0(u)', Hd(u)')' is defined in Theorem 3.2 in terms of H0(u) in (3.4). As in Johansen (1997, pp. 461–62) the asymptotic distribution of ρ̂ follows from the identity

ρ̂ = τ̄0'β̂ = ρ0 + τ̄0'β⊥10 B̂1,   (A.18)

while T^{1/2}(θ̂0 − θ00) ⇒ N_{p×(2r+s+p)}(0, Ω0 ⊗ Σ00⁻¹), where Σ00 = V(Z0t).
A.3. Proof of Corollary 3.1

The proof consists of two parts. In the first we apply Theorem 3.2 to find the asymptotic distribution of (τ̂* − τ0*), which is then used in the second part, where a Taylor expansion of (τ̂a* − τ0*) is applied.
Part 1: Theorem 3.2 implies that the asymptotic distribution of (τ̂* − τ0*) is given by

( T β̄⊥20'(τ̂ − τ0) ; T^{1/2}(τ̂D − τD0)' ) ρ̄⊥0 = ( T Ĉ0 ; T^{1/2}ĈD ) ρ̄⊥0 + oP(1) ⇒ C∞' ρ̄⊥0 = R(H0*, V2)' ρ̄⊥0.   (A.19)
To see this, note that β̄⊥20'(τ̂ − τ0) = β̄⊥20'(τ̂ − τ0)(ρ0ρ̄0' + ρ⊥0ρ̄⊥0'). Using the identities in Johansen (1997, pp. 461 and 459), together with the derived consistency of ρ̂ in Theorem 3.1 here, one finds

β̄⊥20'(τ̂ − τ0)ρ0 = B̂2 − Ĉ0B̂1 + oP(T⁻²)  and  β̄⊥20'(τ̂ − τ0)ρ⊥0 = Ĉ0 + oP(T⁻¹),

where it has been used that ρ̂⊥ − ρ⊥0 = −ρ0(ρ̂0'ρ0)⁻¹(ρ̂ − ρ0)'ρ⊥0 and ρ̄⊥0'τ̄0'β⊥10 = Is. Likewise,

(τ̂D − τD0)'ρ0 = B̂D − ĈDB̂1 + oP(T^{−3/2})  and  (τ̂D − τD0)'ρ⊥0 = ĈD + oP(T^{−1/2}),
and collecting terms (A.19) holds.

Part 2: As in the proof of Theorem 4.2 in Rahbek et al. (1999) and Lemma 3 in Johansen et al. (2010), use the expansion around τa0* = τ0*(a'τ0)⁻¹ = τ0*,

τ̂a* − τ0* = (Ip+1+q − τ0*(a, 0)')(τ̂* − τ0*) + OP(|τ̂* − τ0*|²)
           = (a, 0)⊥( (τ⊥0*)'(a, 0)⊥ )⁻¹ (τ⊥0*)'(τ̂* − τ0*) + OP(|τ̂* − τ0*|²).

Observe that

(τ⊥0*)'(τ̂* − τ0*) = ( β̄⊥20'(τ̂ − τ0) ; τ̂D − τD0 )

and

(a, 0)⊥( (τ⊥0*)'(a, 0)⊥ )⁻¹ = [ a⊥(β⊥20'a⊥)⁻¹ , 0 ; τD0'τ̄0 a⊥(β⊥20'a⊥)⁻¹ , Iq+1 ],

and the result holds as claimed using (A.19).
A.4. Proof of Theorem 3.3

The results follow by mimicking the proof of Theorem 2 in Nielsen and Rahbek (2007) (NR henceforth), using the results in Theorem 3.2. Specifically, replacing in NR the indices 'l' (for linear) by 'D' and 'c' (for constant) by 'd', the arguments are completely identical except for the role of the impulse dummy δt as additional regressor. NR falls in two parts. First, asymptotics for the test of an auxiliary null Haux against HD(r, s), and, next, asymptotics for an auxiliary null against HD(p).

On Haux against HD(r, s): Replace the residuals ε̂t, ε̂t⁰ and ε̌t in NR by

ε̂t = Δ²Xt − θ̂0'Z0tλ − θ̂1'Z1tλ − θ̂2'Z2tλ − θ̂D'ZDt − θ̂d'Zdt − θ̂δ'δt,
ε̂t⁰ = Δ²Xt − θ̂0'Z0tλ − θ̂δ'δt,
ε̌t = Δ²Xt − θ̌0'Z0tλ − θ̌δ'δt,

where θ̂ denotes the estimator under HD(r, s), while θ̌ denotes the estimator under the auxiliary null, given by Δ²Xt = θ0'Z0tλ + θδ'δt + εt(θ); that is, ψ, ψd, τ, τD and β fixed at their true values. All arguments remain the same as in NR, except in the study of the covariance estimated under HD(r, s),

Ω̂ = Mε̂ε̂ = Mε̂⁰ε̂⁰ + XT − YT − YT',
with XT as in NR, while here

YT = ( MεZB − (θ̂0 − θ00)'MZ0ZB − (θ̂δ − θ0δ)'MδZB ) θ̂ZB.

Here ZB refers to ZBt = (Z1tλ', Z2tλ', ZDt', Zdt')' (see the proof of Theorem 3.2, where also the corresponding normalization matrix NTB is defined, which in NR corresponds to DT⁻¹) and θ̂ZB = (θ̂1, θ̂2, θ̂D, θ̂d)'. For the extra term in YT, (θ̂δ − θ0δ)'MδZB, it is needed that

T(θ̂δ − θ0δ)'MδZB θ̂ZB = oP(1).

But this holds as (i) (θ̂δ − θ0δ) = OP(1) by Theorem 3.1, (ii) √T NTB θ̂ZB = OP(1) from Theorem 3.2 and (iii) √T MδZB(NTB)⁻¹ = oP(1) by definition of δt.
On Haux against HD(p): Similar to NR, the model HD(p) is given by

Δ²Xt = Π Xt−1 + ΠD Dt−1 − Γ ΔXt−1 − Γd dt−1 + Ψ Δ²Xt−1 + Φ δt + εt(θ),

and as shown in the proof of Lemma 4.1 above, the additional regressor δt plays no role asymptotically, and therefore the arguments in NR remain identical.
APPENDIX B: DATA IN EMPIRICAL APPLICATION

The data are from the Flow of Funds Accounts (FFA) published by the Federal Reserve Board of Governors and the National Income and Product Account (NIPA) from the US Department of Commerce. Consumption is measured as the personal expenditures of households and non-profit organizations on non-durable goods and services from NIPA. The price level is measured as the corresponding implicit deflator. Income is measured as the disposable income of households and non-profit organizations from NIPA, calculated as personal income minus current taxes. Wealth is taken from the FFA and is calculated as household tangible and financial assets minus liabilities. Finally, the bond rate is the Federal funds 10-year bond rate from the US Department of Commerce.
The Econometrics Journal (2011), volume 14, pp. 156–185. doi: 10.1111/j.1368-423X.2010.00329.x

Cointegration and sampling frequency

MARCUS J. CHAMBERS†

†Department of Economics, University of Essex, Wivenhoe Park, Colchester, Essex CO4 3SQ, UK. E-mail: [email protected]

First version received: February 2010; final version accepted: July 2010
Summary This paper analyses the effects of sampling frequency on the properties of ordinary least squares (OLS) and fully modified least squares (FM-OLS) regression estimators of cointegrating parameters. Large sample asymptotic properties are derived under three scenarios concerning the span of data and sampling frequency, each scenario depending on whether span or frequency (or both) tends to infinity. In cases where span tends to infinity the OLS estimators are consistent but their limiting distributions suffer from second-order bias effects arising from serial correlation and endogeneity; the OLS estimators are not even consistent when the span is fixed and sampling frequency increases. In contrast, the FM-OLS estimators are shown to have limiting mixed normal distributions when span tends to infinity and associated Wald statistics have limiting chi-square distributions. The finite sample performance of the estimators and test statistics is explored in a simulation study in which the superiority of the FM-OLS estimator in terms of bias and mean square error is demonstrated and the Wald statistics are found to generally have good size and power properties. Directions in which the model can be extended, and the effects of such extensions, are also discussed. Keywords: Cointegration, Fully modified estimation, Sampling frequency.
1. INTRODUCTION

Most econometric and statistical theory involving time series is concerned with issues of estimation and inference in models using data observed at a fixed frequency. In economics, and particularly finance, the variables of interest are often available at different frequencies and covering different time spans, and the use of different data frequencies presumably has some effect on the properties of the estimation and inference procedures employed. The aim of this paper is to make a contribution to the understanding of the effects of sampling frequency in the context of the estimation of the parameters in a model of cointegration. The effects of sampling frequency on estimators and test statistics have been analysed in a variety of settings, and Bergstrom (1984) provides a summary of early work based on stationary continuous time systems. Most recent research has been univariate in nature but has relaxed the stationarity requirement. Phillips (1987a,b) derived continuous record asymptotics for the ordinary least squares (OLS) estimator in a first-order autoregression with a unit root, while Perron (1991) examined the consistency of tests of the random walk hypothesis showing that it
is the increasing span of the data, rather than the frequency, that is important. Perron’s results for stock variables have recently been extended by Chambers (2004) to the case of flow variables. The underlying model used here is a continuous time version of the prototypical triangular cointegration model of Phillips (1991a). The model is specified deliberately in this simple form so as to clearly focus on the effects of sampling frequency. It is driven by the increment in a vector Brownian motion process which enables the focus to be on the estimation of the matrix of cointegration parameters rather than having to account for additional complications such as more general forms of system dynamics and volatility, for example, although it is shown how the model can be extended in these and other directions later in the paper. As a result, the effects of sampling frequency on estimators of the cointegration parameters are clearly established. Furthermore, the continuous time specification ensures that the discrete time model satisfied by the observed data is independent of the sampling frequency, a feature that is not always true in temporal aggregation of discrete time models. The continuous time approach also allows naturally for the distinct treatment of stock and flow variables, which is one additional complicating feature that is considered, and both types of variable are incorporated within the system simultaneously. Such distinctions are important in multivariate models where temporal aggregation is considered, enabling a correct treatment of the different ways in which variables are observed. Examples of such distinctions in economics include the co-dependence of prices (stock variables) and consumption expenditures (flows), while in finance models jointly including asset prices and dividend flows are commonplace. 
A further advantage of the triangular error correction model (ECM) format in continuous time is that the corresponding model satisfied by the discrete time data is also of triangular ECM form, albeit with a moving average disturbance vector that arises due to the temporal aggregation of flow variables. The discrete time model therefore remains linear in the cointegrating matrix which simplifies estimation. The triangular representation of a cointegrated system has found a number of empirical applications, although it appears to have been less popular than the vector error correction model approach of Johansen (1991). Some examples of empirical applications include Ng (1995) and Attfield (1997) who both deal with issues concerned with the estimation and testing of consumer demand systems with integrated time series data; Mark and Sul (2003) whose application is to long-run money demand using a panel data set of countries; and Moon and Perron (2004) whose interest is in testing for purchasing power parity (PPP). Although much macroeconomic data remains available at best only at coarse sampling intervals (e.g. quarterly or monthly), the results presented in this paper enable comparisons of the properties of estimators obtained at different frequencies to be carried out. In the future it is possible that such data will be available at finer time intervals. Indeed this is already the case with interest rate and exchange rate data that are used in PPP testing, while the availability of scanner-type data may lead to much higher frequency data being available in consumer demand studies. In such circumstances, the results pertaining to an increasing frequency of data will become pertinent. The paper is organized as follows. Section 2 defines the continuous time model and presents, in Lemma 2.1, the form of the discrete time model satisfied by the observed data. 
The order of magnitude (in terms of sampling frequency) of the variance and autocovariance matrices of the disturbance vector is also derived and is an essential factor in determining the correct convergence rates in the asymptotic analysis of the estimators. Section 3 considers sub-system estimation of the cointegrating matrix by ordinary least squares (OLS). The large sample asymptotic properties of the estimator are derived allowing for three different combinations of behaviour of data span and sampling frequency, in all of which the number of observations tends to infinity. The three combinations are: fixed sampling frequency and data span tends to infinity,
fixed span of data and sampling frequency tends to infinity and a combination of both span and sampling frequency tending to infinity. Note that the increasing sampling frequency is captured by allowing the sampling interval (the time between observations) to tend to zero. It is found that an increasing span of data is required for the OLS estimator to be consistent, although the limiting distributions contain second-order bias effects that arise due to the serial correlation and endogeneity of regressors. When the span of data is fixed, the estimator is inconsistent and its distribution depends on the initial value of the regressors. This finding of requiring an increasing span for consistency is in accordance with Perron (1991) and Chambers (2004) in the context of unit root tests as well as the large literature involving infill asymptotics in financial econometrics as surveyed by Bandi and Phillips (2009). In none of the three types of asymptotics considered in Section 3 is the OLS estimator optimal in the sense of Phillips (1991a). Section 4 therefore builds on these results to consider optimal estimation in the two cases where the OLS estimator is at least consistent, namely when span tends to infinity, the case of fixed span not being pursued further. Although a number of possibilities for optimal estimation are available, the analysis investigates the fully modified OLS (FM-OLS) estimator of Phillips and Hansen (1990). Consistent estimation of the required covariance matrices is established as a precursor to deriving the limiting distribution of the FM-OLS estimator in the two cases, each of which is shown to be optimal (mixed normal). The result in the case of fixed frequency is used to analyse the relative efficiency of estimators obtained from data observed at two different, but fixed, frequencies. The limiting distributions of associated Wald statistics are also derived and are shown to be chi-square thereby enabling standard inference to be conducted. 
The finite sample performance of the OLS and FM-OLS estimators is examined in a simulation experiment in Section 5 in which both span and sampling frequency are allowed to vary. The size and (size-adjusted) power properties of the Wald statistic are also investigated. The superiority of the FM-OLS estimator over the OLS estimator in terms of bias and mean square error is demonstrated and the Wald statistics are found to generally have good size and power properties. Section 6 discusses a number of directions in which the model can be extended, while some concluding comments are provided in Section 7. Proofs of all lemmas and theorems are given in Appendix A, whereas additional results that are used in the proofs are provided in Appendix B.
2. MODEL SPECIFICATION AND DISCRETE TIME REPRESENTATION

The basic continuous time triangular ECM for the m × 1 vector y is motivated by the prototypical discrete time triangular ECM with white noise innovations considered by Phillips (1991a). It is defined by the stochastic differential equation system

dy(τ) = −J A y(τ) dτ + dw(τ),   τ > 0,   (2.1)

where τ denotes the continuous time parameter and w(τ) is a Brownian motion process with variance matrix τΣ. Defining y = [y1', y2']', where y1 is m1 × 1, y2 is m2 × 1, and m1 + m2 = m, the ECM representation is consistent with an underlying cointegrating relationship between the sub-vectors y1 and y2 such that y1 − By2 is stationary, where B denotes the m1 × m2 matrix of cointegrating parameters. The matrix B enters (2.1) via the matrix A = [Im1, −B], while
J = [Im1, 0]'. The first m1 equations of (2.1) are therefore

dy1(τ) = −[y1(τ) − By2(τ)] dτ + dw1(τ),   τ > 0,

while the last m2 equations in (2.1) depict the common stochastic trends

dy2(τ) = dw2(τ),   τ > 0,

where w has been partitioned conformably with y. The unique mean square solution to (2.1), initialized at τ = 0, is given by

y(τ) = ∫₀^τ e^{−(τ−r)JA} dw(r) + e^{−τJA} y(0),   τ > 0,   (2.2)

where, for any square matrix C, e^C = Σ_{j=0}^∞ C^j/j! defines the matrix exponential. The vectors y1 and y2 are assumed to be composed of both stock and flow variables. Without loss of generality the variables in each vector will be arranged with the stocks first followed by the flows, and the matrix B will be partitioned accordingly, so that

y1(τ) = ( y1^S(τ) ; y1^F(τ) ),   y2(τ) = ( y2^S(τ) ; y2^F(τ) ),   B = [ BSS , BSF ; BFS , BFF ].

The vectors y1^S and y1^F are of dimensions m1^S × 1 and m1^F × 1 respectively, with m1^S + m1^F = m1, whereas the sub-vectors of y2 are of similarly defined dimensions with m2^S + m2^F = m2. It is also convenient to partition the covariance matrix Σ conformably with y1 and y2 as well as with respect to the stocks and flows, yielding

Σ = [ Σ11 , Σ12 ; Σ21 , Σ22 ] = [ Σ11^SS , Σ11^SF , Σ12^SS , Σ12^SF ;
                                  Σ11^FS , Σ11^FF , Σ12^FS , Σ12^FF ;
                                  Σ21^SS , Σ21^SF , Σ22^SS , Σ22^SF ;
                                  Σ21^FS , Σ21^FF , Σ22^FS , Σ22^FF ],
where each Σij is mi × mj (i, j = 1, 2) and each Σij^{kl} is mi^k × mj^l (i, j = 1, 2; k, l = S, F). The sampling interval, i.e. the period between observations, will be denoted by h, so that the sampling frequency is given by h⁻¹. Observations on the stock variables are made at points in time separated by a period of length h, while observations on flow variables are of the form of integrals of the underlying rate of flow over each successive interval of length h. Introducing the variable t to index observations and letting T denote the sample size, the observations are of the form

y1,th = ( y1,th^S ; y1,th^F ) = ( y1^S(th) ; (1/h)∫₀^h y1^F(th − s) ds ),   t = 1, ..., T,

y2,th = ( y2,th^S ; y2,th^F ) = ( y2^S(th) ; (1/h)∫₀^h y2^F(th − s) ds ),   t = 1, ..., T.
Denoting the span of the data by N, it follows that T = N/h. The observations are therefore made at the points th (t = 1, ..., T), which divides (continuous) time (indexed by τ) into T intervals each of length h. Note that the flow variables are normalized by the factor 1/h, which ensures that all terms in the discrete time ECM in Lemma 2.1 below are stationary. The discrete time model itself is derived from two sets of equations. First, the solution to (2.2) yields the difference equation

y(th) = e^{−hJA} y(th − h) + ε(th),   t = 1, ..., T,   (2.3)

ε(th) = ∫_{th−h}^{th} e^{−(th−r)JA} dw(r),   t = 1, ..., T.   (2.4)
It can be shown, noting that AJ = Im1, that e^{−hJA} = Im − φh JA, where φh = 1 − e^{−h}, and so (2.3) can be written as the discrete time triangular ECM

Δh y(th) = −φh J A y(th − h) + ε(th),   t = 1, ..., T,   (2.5)

where Δh = 1 − Lh and Lh denotes the lag operator such that Lh y(th) = y(th − h). If the sample consisted entirely of stock variables then (2.5) would represent the exact discrete model. However, it is also necessary to take into account the sampling of flow variables; integrating (2.5) over the interval (th − h, th] and defining Yth = h⁻¹∫_{th−h}^{th} y(r) dr yields

Δh Yth = −φh J A Yth−h + ηth,   t = 2, ..., T,   (2.6)

ηth = h⁻¹∫_{th−h}^{th} ε(r) dr,   t = 2, ..., T,   (2.7)
(2.8)
Equations (2.5) and (2.4) in the case of stock variables, and (2.6), (2.7) and (2.8) in the case of flow variables, are used to derive the exact discrete time representation for the mixed variable vector yth sampled at intervals of length h; the result is presented in Lemma 2.1, in which it is convenient to write εth = ε(th) when sampled at discrete time points t = 1, ..., T.

LEMMA 2.1. Let y(τ) be generated by (2.1) and let yth = [y1,th', y2,th']' (t = 1, ..., T) denote the vector of observations on y1 and y2. Then yth satisfies the triangular ECM given by

yh − y(0) = −φh J A y(0) + ξh,   (2.9)

Δh yth = −φh J A yth−h + ξth,   t = 2, ..., T,   (2.10)
where the disturbance vectors are given by

ξh = ( ε1,h^S ; η1,h^F ; ε2,h^S ; η2,h^F ),   ξth = ( ε1,th^S + φh BSF δ2,th−h^F ; η1,th^F − φh BFS δ2,th−h^S ; ε2,th^S ; η2,th^F ),   t = 2, ..., T,   (2.11)

and δth = h⁻¹∫_{th−h}^{th} [h − (th − r)] dw(r). Furthermore, the autocovariances of ξth satisfy

Λh,11 ≡ E(ξh ξh') = hΛ00 + O(h²),
Λh,0 ≡ E(ξth ξth') = hΛ0 + O(h²),   t = 2, ..., T,
Λh,1 ≡ E(ξth ξth−h') = hΛ1 + O(h²),   t = 2, ..., T,
Λh,j ≡ E(ξth ξth−jh') = 0,   |j| > 1,

where

Λ00 = [ Σ11^SS , (1/2)Σ11^SF , Σ12^SS , (1/2)Σ12^SF ;
        (1/2)Σ11^FS , (1/3)Σ11^FF , (1/2)Σ12^FS , (1/3)Σ12^FF ;
        Σ21^SS , (1/2)Σ21^SF , Σ22^SS , (1/2)Σ22^SF ;
        (1/2)Σ21^FS , (1/3)Σ21^FF , (1/2)Σ22^FS , (1/3)Σ22^FF ],

Λ0 =  [ Σ11^SS , (1/2)Σ11^SF , Σ12^SS , (1/2)Σ12^SF ;
        (1/2)Σ11^FS , (2/3)Σ11^FF , (1/2)Σ12^FS , (2/3)Σ12^FF ;
        Σ21^SS , (1/2)Σ21^SF , Σ22^SS , (1/2)Σ22^SF ;
        (1/2)Σ21^FS , (2/3)Σ21^FF , (1/2)Σ22^FS , (2/3)Σ22^FF ],

Λ1 =  [ 0 , 0 , 0 , 0 ;
        (1/2)Σ11^FS , (1/6)Σ11^FF , (1/2)Σ12^FS , (1/6)Σ12^FF ;
        0 , 0 , 0 , 0 ;
        (1/2)Σ21^FS , (1/6)Σ21^FF , (1/2)Σ22^FS , (1/6)Σ22^FF ].
The exact discrete time representation mirrors the ECM form of the underlying continuous time system, although the presence of the double integral in ηth implies that the disturbance vector ξ th is a first-order moving average process. The property that the autocovariances are O(h) will C 2011 The Author(s). The Econometrics Journal C 2011 Royal Economic Society.
play an important role in the asymptotics when the sampling interval is allowed to converge towards zero. Nevertheless, a distinct advantage of the discrete time ECM representation is that it retains its linearity in the unknown cointegrating matrix B. Note, also, that Ω = Λ0 + Λ1 + Λ1' plays the role of the long-run covariance matrix for the Brownian motion process that characterizes the asymptotics when h is allowed to tend to zero in the analysis that follows.
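The flow-variable weights in Λ0 and Λ1 can be checked by simulation: for a scalar pure-flow random walk (dy = dw, σ² = 1), Lemma 2.1 implies that the differenced h-averages have variance close to (2/3)h and lag-1 autocovariance close to (1/6)h. A minimal Monte Carlo sketch, where the interval length, sample size and discretization step are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Differenced h-averages of a Brownian motion have variance (2/3)h and
# lag-1 autocovariance (1/6)h, the (F,F) entries of h*Lambda_0 and h*Lambda_1.
h, T, sub = 0.5, 40000, 50            # interval, observations, fine steps per interval
dt = h / sub
w = np.cumsum(np.sqrt(dt) * rng.standard_normal(T * sub))  # Brownian path
Y = w.reshape(T, sub).mean(axis=1)    # approximates h^{-1} * integral over each interval
dY = np.diff(Y)
print(np.mean(dY**2) / h)             # approx 2/3
print(np.mean(dY[1:] * dY[:-1]) / h)  # approx 1/6
```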
3. SUB-SYSTEM LEAST SQUARES REGRESSION

It is convenient to begin by first analysing a sub-optimal sub-system least squares estimator, the properties of which can then be used to help identify ways in which optimal (or efficient) estimation can be achieved. It is also convenient to make the assumption that y(0) is fixed and observed. While this may be the case in practice for stock variables it is not typically true for flows but serves to simplify the analysis. Three main scenarios will be considered with regard to the sampling scheme, reflecting different joint behaviour of span N and frequency h⁻¹. The first is where h is fixed but N ↑ ∞. This represents the usual situation in which sample size T (= N/h) ↑ ∞ but emphasizes the dependence on a given sampling frequency, not necessarily equal to unity. Such results can be useful for comparing the properties of estimators obtained from two fixed, but different, sampling frequencies. The second scenario is where h ↓ 0 and N ↑ ∞ jointly, so that the data are tending towards a continuous record limit at the same time as the span increases. Such a scenario is of particular relevance to applications in finance, although, in the future, it is possible that economic data may be observable at much higher frequencies than at present. The third case keeps N fixed but allows h ↓ 0 so that a continuous record is the result in the limit but one which covers a fixed span. Note that in all cases sample size T ↑ ∞. The first m1 equations of (2.9) and (2.10) may be written

ya,th = B φh y2,th−h + ξ1,th,   t = 1, ..., T,   (3.1)

where ya,th = Δh y1,th + φh y1,th−h = Δh y1,th + Op(h). The analysis is aided by considering a triangular array of random variables {{ynt}_{t=1}^{Tn}}_{n=1}^{∞} and by allowing the span and data frequency to be indexed by n, giving Nn and hn. In this setup, sample size Tn = Nn/hn always tends to infinity with n, while Nn ↑ ∞ or Nn = N and hn ↓ 0 or hn = h. The equation used for estimation, (3.1), then becomes

ya,nt = B φn y2,nt−1 + ξ1,nt,   t = 1, ..., Tn,   (3.2)
where ya,nt = ya,thn , φn = φhn , y2,nt = y2,thn and ξ1,nt = ξ1,thn . The linearity of (3.2) in the matrix B makes this an appealing equation for estimation. Furthermore, cointegrating parameters in continuous time systems are not subject to the usual aliasing identification problems that arise in stationary systems. If the continuous time processes are cointegrated with cointegrating matrix B, then so are the sampled discrete time processes, regardless of the sampling frequency; see, for example, Phillips (1991b), Comte (1999) and Kessler and Rahbek (2004). The least squares estimator of the matrix B in (3.2) is given by
B̂ = φn⁻¹ Ya'Y2 (Y2'Y2)⁻¹ = B + φn⁻¹ ξ1'Y2 (Y2'Y2)⁻¹,   (3.3)
where Ya (Tn × m1), Y2 (Tn × m2) and ξ1 (Tn × m1) are defined by

Ya = ( ya,n1 , ya,n2 , ... , ya,nTn )',   Y2 = ( y2,n0 , y2,n1 , ... , y2,nTn−1 )',   ξ1 = ( ξ1,n1 , ξ1,n2 , ... , ξ1,nTn )'.
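As a concrete illustration of the estimator in (3.3), the following sketch simulates a small all-stock system, for which (2.5) is the exact discrete model (so the flow-induced MA term is absent), and computes B̂. The dimensions, parameter values and the Gaussian O(h)-approximation to the exact disturbance covariance are illustrative assumptions, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(42)

m1, m2 = 1, 2
B = np.array([[1.0, 0.5]])                        # "true" cointegrating matrix (assumed)
A = np.hstack([np.eye(m1), -B])                   # A = [I_{m1}, -B]
J = np.vstack([np.eye(m1), np.zeros((m2, m1))])   # J = [I_{m1}, 0]'

h, N = 0.1, 2000.0                                # sampling interval and span
T = int(N / h)                                    # sample size T = N/h
phi_h = 1.0 - np.exp(-h)

# Exact discrete model (2.5): y(th) = (I - phi_h J A) y(th-h) + eps(th),
# with Var(eps) = h*Sigma + O(h^2); Sigma = I is used as an approximation.
m = m1 + m2
F = np.eye(m) - phi_h * (J @ A)
y = np.zeros((T + 1, m))
for t in range(1, T + 1):
    y[t] = F @ y[t - 1] + np.sqrt(h) * rng.standard_normal(m)

y1, y2 = y[:, :m1], y[:, m1:]
Ya = (y1[1:] - y1[:-1]) + phi_h * y1[:-1]         # y_{a,th} = Delta_h y1 + phi_h y1(lag)
Y2 = y2[:-1]                                      # lagged stochastic trends
B_hat = (Ya.T @ Y2) @ np.linalg.inv(Y2.T @ Y2) / phi_h   # equation (3.3)
print(B_hat)                                      # close to B for large span N
```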
In describing the asymptotic properties of B̂ use will be made of the two m × 1 Brownian motion processes on r ∈ (0, 1] given by W(r) and Wh(r), having covariance matrices rΩ and rΩh respectively, where Ωh = h⁻¹(Λh,0 + Λh,1 + Λh,1') and Λh,0, Λh,1 are defined in Lemma 2.1. Both W and Wh will be partitioned in the same way as y into sub-vectors of dimensions m1 × 1 and m2 × 1, so that W = (W1', W2')' and Wh = (Wh1', Wh2')'. The covariance matrices Ω and Ωh will be partitioned accordingly. It is also notationally convenient to define the following functionals in which the arguments of W(r) and Wh(r) are suppressed for convenience:

F(N, W2, y2(0)) = N ∫₀¹ ( N^{1/2}W2 + y2(0) )( N^{1/2}W2 + y2(0) )' dr,

G(N, W, Λ1,12, y2(0)) = N^{1/2} ∫₀¹ ( N^{1/2}W2 + y2(0) ) dW1' + N Λ1,12.
The following convergence properties hold under the three scenarios considered, the symbol ⇒ denoting weak convergence.

LEMMA 3.1. (a) If hn = h and Nn ↑ ∞ as n ↑ ∞, then

(1/Nn²) Y2'Y2 ⇒ (1/h) ∫₀¹ Wh2 Wh2' dr,   (1/Nn) Y2'ξ1 ⇒ (1/h) ( ∫₀¹ Wh2 dWh1' + Λh1,12 ).

(b) If hn ↓ 0 and Nn ↑ ∞ as n ↑ ∞, then

(hn/Nn²) Y2'Y2 ⇒ ∫₀¹ W2 W2' dr,   (1/Nn) Y2'ξ1 ⇒ ∫₀¹ W2 dW1' + Λ1,12.

(c) If hn ↓ 0 and Nn = N as n ↑ ∞, then

hn Y2'Y2 ⇒ F(N, W2, y2(0)),   Y2'ξ1 ⇒ G(N, W, Λ1,12, y2(0)).
The first part of Lemma 3.1 generalizes the usual result where h = 1 and sample size (equal to span in this case) tends to infinity, the generalization here allowing for an arbitrary sampling interval. The rates of convergence of the sample moments are, of course, the same as when h = 1. Part (b) then allows the sampling interval to tend to zero while the span tends to infinity. This affects the rate of convergence of Y2 Y2 but not of Y2 ξ1 . Finally, part (c) keeps the span fixed while the sampling frequency increases, the result being that the initial conditions, y 2 (0), do not vanish in this limit. Application of Lemma 3.1 yields the limiting distribution of B.
THEOREM 3.1. (a) If hn = h and Nn ↑ ∞ as n ↑ ∞, then

Nn(B̂ − B) ⇒ (h/φh) ( (1/h) ∫₀¹ dWh1 Wh2' + (1/h) Λh1,12 ) ( ∫₀¹ Wh2 Wh2' dr )⁻¹.

(b) If hn ↓ 0 and Nn ↑ ∞ as n ↑ ∞, then

Nn(B̂ − B) ⇒ ( ∫₀¹ dW1 W2' + Λ1,12 ) ( ∫₀¹ W2 W2' dr )⁻¹.
(c) If hn ↓ 0 and Nn = N as n ↑ ∞, then

(B̂ − B) ⇒ G(N, W, Λ1,12, y2(0)) F(N, W2, y2(0))⁻¹.

The first part of Theorem 3.1 generalizes the usual result to allow for an arbitrary sampling frequency. The least squares estimator is consistent but contains second-order bias effects due to the presence of the serial correlation and endogeneity of the regressors. The same is true in part (b) where the sampling frequency increases without bound. A comparison of (a) and (b) reveals that the ratio h/φh can be regarded as a partial measure of the inefficiency from discrete time sampling. Because Ωh = Ω + O(h) and lim_{h↓0} h/φh = 1, we find that the result in (a) collapses to that in (b) when sampling frequency increases (although the results were not explicitly derived that way). If h is taken to represent one year with 1/φ1 = 1.5820, then quarterly sampling yields (1/4)/φ1/4 = 1.1302, monthly sampling (1/12)/φ1/12 = 1.0422 and daily sampling (1/365)/φ1/365 = 1.0014. In part (c), where the span is fixed, the fact that sample size tends to infinity (via h tending to zero) is not sufficient to ensure consistency, and in fact the estimator B̂ is not consistent in this case. The distribution of (B̂ − B) depends on y2(0) whose effects are only eliminated when span tends to infinity, as in parts (a) and (b). When y2(0) = 0 the distribution in (c) becomes

(B̂ − B) ⇒ (1/N) ( ∫₀¹ dW1 W2' + Λ1,12 ) ( ∫₀¹ W2 W2' dr )⁻¹,

which is simply 1/N times the distribution in part (b). However, in part (b), the increase in Nn enables the effect of initial conditions to be eliminated when y2(0) ≠ 0. As the sub-system least squares estimator is not optimal in any of the cases considered here, we now use the form of these limiting distributions to analyse an optimal estimator of B in cases (a) and (b), acknowledging that optimal estimation, or even consistent estimation, is not possible in case (c) and, hence, is not pursued further.
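The sampling-frequency ratios h/φh quoted above, with φh = 1 − e^{−h}, are easily reproduced:

```python
import numpy as np

# Partial inefficiency measure h/phi_h from Theorem 3.1, where
# phi_h = 1 - exp(-h) and h is the sampling interval measured in years.
def inefficiency(h):
    return h / (1.0 - np.exp(-h))

for label, h in [("annual", 1.0), ("quarterly", 1 / 4),
                 ("monthly", 1 / 12), ("daily", 1 / 365)]:
    print(f"{label:>9}: h/phi_h = {inefficiency(h):.4f}")
```

With these inputs the printed ratios are 1.5820, 1.1302, 1.0422 and 1.0014, matching the figures in the text.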
Cointegration and sampling frequency

4. OPTIMAL ESTIMATION: FULLY MODIFIED LEAST SQUARES

A number of estimation methods have been developed to overcome the second-order biases in the distribution of the least squares estimator that arise because of endogeneity of regressors and serial correlation of disturbances. Among them are the spectral regression estimators of Phillips (1991a,c), the maximum likelihood estimator of Johansen (1991), the dynamic OLS estimator of Stock and Watson (1993), and the FM-OLS and instrumental variables estimators of Phillips and Hansen (1990). In principle all of these methods could be applied to the model of interest here, and all should have the same optimal limiting distribution. However, because the properties of the OLS estimator have already been derived and it is known that the disturbance vector in the model is MA(1), the most straightforward approach is to consider the FM-OLS estimator.

In order to construct the FM-OLS estimator it is necessary to be able to consistently estimate the matrices Ω_h and Λ_{h1}, which is possible in the two cases where N_n ↑ ∞, using the OLS residual vectors

ξ̂_{1,nt} = y_{a,nt} − B̂ φ_n y_{2,nt−1},    t = 1, …, T_n.
Clearly, the inconsistency of B̂ when the span is fixed means that this case is ruled out. Let

ξ̂_nt = [ξ̂_{1,nt}′, Δ_h y_{2,nt}′]′,    t = 1, …, T_n,

noting that the stochastic trend increments Δ_h y_{2,nt} = ξ_{2,nt} are observable. The following lemma establishes the consistency of autocovariance estimators based on ξ̂_nt that can be used to estimate Ω_h and Λ_{h1}.

LEMMA 4.1.
(a) If h_n = h and N_n ↑ ∞ as n ↑ ∞, then

Λ̂_{h0} = (1/N_n) Σ_{t=1}^{T_n} ξ̂_nt ξ̂_nt′ →p (1/h) Λ_{h0},    Λ̂_{h1} = (1/N_n) Σ_{t=1}^{T_n} ξ̂_nt ξ̂_{nt−1}′ →p (1/h) Λ_{h1}.

(b) If h_n ↓ 0 and N_n ↑ ∞ as n ↑ ∞, then

Λ̂_0 = (1/N_n) Σ_{t=1}^{T_n} ξ̂_nt ξ̂_nt′ →p Λ_0,    Λ̂_1 = (1/N_n) Σ_{t=1}^{T_n} ξ̂_nt ξ̂_{nt−1}′ →p Λ_1.
Lemma 4.1 shows that the sample variance and autocovariance matrix estimators based on ξ̂_nt converge in probability to the matrices of interest, thereby motivating the following estimators of Ω_h and Ω:

Ω̂_h = Λ̂_{h0} + Λ̂_{h1} + Λ̂_{h1}′,    (4.1)

Ω̂ = Λ̂_0 + Λ̂_1 + Λ̂_1′.    (4.2)

It follows from Lemma 4.1 that, as n ↑ ∞,

Ω̂_h →p (1/h) Ω_h,    Ω̂ →p Ω.
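In code, the estimators (4.1) and (4.2) are simple moment sums of the stacked residual vector. A schematic sketch for case (b), with hypothetical names (`xi` holds the rows ξ̂_nt′, and the sums are normalized by the span, as in Lemma 4.1):

```python
import numpy as np

def longrun_covariance(xi, span):
    """Estimate Omega = Lambda0 + Lambda1 + Lambda1' from stacked residuals.

    xi   : (T, m) array whose t-th row is the residual vector at period t
    span : the sample span N_n used to normalize the sums, as in Lemma 4.1
    """
    lam0 = xi.T @ xi / span                 # contemporaneous moment sum
    lam1 = xi[1:].T @ xi[:-1] / span        # first-order autocovariance sum
    return lam0 + lam1 + lam1.T             # MA(1) disturbances: one lag suffices

# usage sketch: for white-noise residuals with span = T, Omega-hat is close to I
rng = np.random.default_rng(0)
T = 5000
omega = longrun_covariance(rng.standard_normal((T, 2)), span=float(T))
```

Only one lag appears because the discrete time disturbance vector is MA(1); a more general model would require additional lagged autocovariances.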
In order to define the FM-OLS estimators, in case (a) let

ỹ_{a,nt} = y_{a,nt} − Ω̂_{h,12} Ω̂_{h,22}⁻¹ Δ_h y_{2,nt},
Λ̃_{h1,12} = Λ̂_{h1,12} − Ω̂_{h,12} Ω̂_{h,22}⁻¹ Λ̂_{h1,22},
ξ̃_{1,nt} = ξ_{1,nt} − Ω̂_{h,12} Ω̂_{h,22}⁻¹ Δ_h y_{2,nt},
M. J. Chambers
and in case (b) define

ỹ_{a,nt} = y_{a,nt} − Ω̂₁₂ Ω̂₂₂⁻¹ Δ_h y_{2,nt},
Λ̃_{1,12} = Λ̂_{1,12} − Ω̂₁₂ Ω̂₂₂⁻¹ Λ̂_{1,22},
ξ̃_{1,nt} = ξ_{1,nt} − Ω̂₁₂ Ω̂₂₂⁻¹ Δ_h y_{2,nt}.

Further, defining Ỹ_a to be the T_n × m₁ matrix with typical row ỹ_{a,nt}′ and ξ̃₁ to be the T_n × m₁ matrix with typical row ξ̃_{1,nt}′, the FM-OLS estimator of B in case (a) is

B̃ = (1/φ_n) (Ỹ_a′Y₂ − N_n Λ̃_{h1,12}) (Y₂′Y₂)⁻¹ = B + (1/φ_n) (ξ̃₁′Y₂ − N_n Λ̃_{h1,12}) (Y₂′Y₂)⁻¹,    (4.3)

while the FM-OLS estimator of B in case (b) is

B̃ = (1/φ_n) (Ỹ_a′Y₂ − N_n Λ̃_{1,12}) (Y₂′Y₂)⁻¹ = B + (1/φ_n) (ξ̃₁′Y₂ − N_n Λ̃_{1,12}) (Y₂′Y₂)⁻¹.    (4.4)
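The two corrections embedded in (4.3) and (4.4) — the endogeneity correction producing ỹ_a and the serial-correlation correction through Λ̃ — can be sketched as follows. This is a schematic implementation under assumed inputs (function and variable names are hypothetical, and `omega` and `lam1` would come from (4.1) and (4.2)), not the paper's code:

```python
import numpy as np

def fm_ols(ya, y2, omega, lam1, span, phi_n=1.0):
    """Fully modified least squares sketch for B in y_a = B y_2 + error.

    ya, y2 : (T, m1), (T, m2) data arrays
    omega  : (m, m) long-run variance of (xi_1, dy_2), with m = m1 + m2
    lam1   : (m, m) first-order autocovariance matrix
    span   : sample span N_n; phi_n is the adjustment factor (1 by default)
    """
    m1 = ya.shape[1]
    o12, o22 = omega[:m1, m1:], omega[m1:, m1:]
    dy2 = np.diff(y2, axis=0, prepend=y2[:1])           # increments of the trends
    ya_tilde = ya - dy2 @ np.linalg.solve(o22, o12.T)   # endogeneity correction
    lam_tilde = lam1[:m1, m1:] - o12 @ np.linalg.solve(o22, lam1[m1:, m1:])
    return (ya_tilde.T @ y2 - span * lam_tilde) @ np.linalg.inv(y2.T @ y2) / phi_n

# usage sketch: with no endogeneity (omega block-diagonal) and lam1 = 0,
# fm_ols reduces to OLS and recovers B exactly on noiseless data
rng = np.random.default_rng(1)
y2 = np.cumsum(rng.standard_normal((200, 2)), axis=0)
b_true = np.array([[1.0, 1.0], [1.0, 1.0]])
bhat = fm_ols(y2 @ b_true.T, y2, np.eye(4), np.zeros((4, 4)), span=200.0)
```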
The limiting distributions of these estimators are presented in Theorem 4.1.

THEOREM 4.1. (a) If h_n = h and N_n ↑ ∞ as n ↑ ∞, then

N_n(B̃ − B) ⇒ (h/φ_h) [∫₀¹ dW_{h,1.2} W_{h2}′] [∫₀¹ W_{h2} W_{h2}′]⁻¹,

where W_{h,1.2} = W_{h1} − Ω_{h,12} Ω_{h,22}⁻¹ W_{h2}.

(b) If h_n ↓ 0 and N_n ↑ ∞ as n ↑ ∞, then

N_n(B̃ − B) ⇒ [∫₀¹ dW_{1.2} W₂′] [∫₀¹ W₂ W₂′]⁻¹,

where W_{1.2} = W₁ − Ω₁₂ Ω₂₂⁻¹ W₂.

The limiting distributions of the FM-OLS estimators in Theorem 4.1 are mixed normal and, hence, the estimators are optimal in the sense of Phillips (1991a). Defining the random matrix G_h = (∫₀¹ W_{h2} W_{h2}′)⁻¹ with probability measure P_h(G_h) enables the distribution in part (a) of Theorem 4.1 to be written

N_n vec(B̃ − B) ⇒ ∫_{G_h>0} N(0, G_h ⊗ (h²/φ_h²) Ω_{h,11.2}) dP_h(G_h),    (4.5)

where Ω_{h,11.2} = Ω_{h,11} − Ω_{h,12} Ω_{h,22}⁻¹ Ω_{h,21} denotes the covariance matrix of W_{h,1.2} and the integration is taken over all positive definite matrices G_h. Similarly, defining G = (∫₀¹ W₂ W₂′)⁻¹ with probability measure P(G) enables part (b) of Theorem 4.1 to be written

N_n vec(B̃ − B) ⇒ ∫_{G>0} N(0, G ⊗ Ω_{11.2}) dP(G),

where Ω_{11.2} = Ω₁₁ − Ω₁₂ Ω₂₂⁻¹ Ω₂₁ denotes the covariance matrix of W_{1.2}. In both cases, conditional on the realization {y_{2,nt}}, the limiting distributions are normal.
The fixed-frequency asymptotics in Theorem 4.1(a) enable a formal comparison of the relative efficiency of estimators obtained from two different, but fixed, sampling frequencies. For this purpose it is convenient to write the distribution in (4.5) as

N_n vec(B̃ − B) ⇒ ∫_{γ>0} N(0, γ Ω_{h,22}⁻¹ ⊗ (h²/φ_h²) Ω_{h,11.2}) dP_γ(γ),    (4.6)

where γ = e₂′(∫₀¹ W_I W_I′)⁻¹ e₂, e₂ is any unit m₂ × 1 vector and W_I is a Wiener process with variance matrix I_{m₂}. Consider two fixed sampling intervals, h₁ < h₂, and let

Δ(h₂, h₁) = γ Ω_{h₂,22}⁻¹ ⊗ (h₂²/φ_{h₂}²) Ω_{h₂,11.2} − γ Ω_{h₁,22}⁻¹ ⊗ (h₁²/φ_{h₁}²) Ω_{h₁,11.2},

which is a measure of the difference in efficiency at frequency h₂ compared to h₁. Defining Ξ_i = γ Ω_{h_i,22}⁻¹ and Ψ_i = (h_i²/φ_{h_i}²) Ω_{h_i,11.2} for i = 1, 2, Δ(h₂, h₁) can be written

Δ(h₂, h₁) = (Ξ₂ ⊗ (Ψ₂ − Ψ₁)) + ((Ξ₂ − Ξ₁) ⊗ Ψ₁).

Given that Ξ₂ and Ψ₁ are positive definite, it remains to determine the definiteness of the differences Ψ₂ − Ψ₁ and Ξ₂ − Ξ₁. Using Lemma 2.1 and the definition of Ψ_i yields

Ψ₂ − Ψ₁ = (h₂²/φ_{h₂}² − h₁²/φ_{h₁}²) Ω_{11.2} + (h₂²/φ_{h₂}²) O(h₂) − (h₁²/φ_{h₁}²) O(h₁),

which is positive definite to O(h₂) owing to the fact that h_i/φ_{h_i} = O(1) (i = 1, 2) and (h₂/φ_{h₂})² − (h₁/φ_{h₁})² > 0. Furthermore,

Ξ₂ − Ξ₁ = γ (Ω_{h₂,22}⁻¹ − Ω_{h₁,22}⁻¹) = γ Ω_{h₁,22}⁻¹ (Ω_{h₁,22} − Ω_{h₂,22}) Ω_{h₂,22}⁻¹,

and, because both Ω_{h₁,22}⁻¹ and Ω_{h₂,22}⁻¹ are positive definite and γ > 0, interest centres on

Ω_{h₁,22} − Ω_{h₂,22} = Ω₂₂ + O(h₁) − (Ω₂₂ + O(h₂)) = O(h₂),

which is a null matrix to O(h₂). Hence Δ(h₂, h₁) is positive definite to O(h₂), with the implication that higher frequency sampling (h₁) yields estimators that are more efficient within the framework of this model.

An advantage of the mixed normal limiting distributions for B̃, demonstrated by Phillips and Hansen (1990), is that it is possible to conduct asymptotic chi-square tests of hypotheses concerning B of the form

H₀: R vec(B) = r    against    H₁: R vec(B) ≠ r,
where R is q × m₁m₂, r is q × 1, and vec(B) denotes the m₁m₂ × 1 vector obtained by stacking the columns of B vertically on top of each other. These q restrictions can be tested in case (a) using the Wald statistic

W_a = N_n² (R vec(B̃) − r)′ [R (N_n² (Y₂′Y₂)⁻¹ ⊗ (h²/φ_h²) Ω̂_{h,11.2}) R′]⁻¹ (R vec(B̃) − r),    (4.7)
while in case (b) the relevant statistic is

W_b = N_n² (R vec(B̃) − r)′ [R ((N_n²/h_n) (Y₂′Y₂)⁻¹ ⊗ Ω̂_{11.2}) R′]⁻¹ (R vec(B̃) − r).    (4.8)

The limiting distributions of these statistics are given in Theorem 4.2.

THEOREM 4.2. (a) If h_n = h and N_n ↑ ∞ as n ↑ ∞, then W_a ⇒ χ²_q under H₀. (b) If h_n ↓ 0 and N_n ↑ ∞ as n ↑ ∞, then W_b ⇒ χ²_q under H₀.

A further advantage of fully modified estimation is, therefore, that standard inference can be conducted with the Wald statistics based on B̃, provided that the sample span is increasing. The finite sample performance of the fully modified estimators, and the associated Wald statistics, is explored in the next section.
5. SIMULATION EVIDENCE

The simulations are designed to assess how well the predictions of Theorems 3.1 and 4.1 describe the finite sample performance of the OLS and FM-OLS estimators as both sampling frequency and sample span are allowed to vary. The finite sample properties of a Wald statistic are also examined based on the critical values provided by the asymptotic chi-square distribution of Theorem 4.2.

The model underlying the simulations is composed of a vector y containing both stock and flow variables of the form

y(t) = [y₁^S(t), y₁^F(t), y₂^S(t), y₂^F(t)]′,

and the 2 × 2 cointegrating matrix B and 4 × 4 covariance matrix Ω of the Brownian motion process are given by

B = ( B_SS  B_SF ; B_FS  B_FF ) = ( 1  1 ; 1  1 ),    Ω = ( Ω₁₁  Ω₁₂ ; Ω₂₁  Ω₂₂ ) = ( I₂  ρI₂ ; ρI₂  I₂ ),

where I₂ denotes a 2 × 2 identity matrix and −1 < ρ < 1 allows for correlation between the Brownian motions driving y₁ᵢ(t) and y₂ᵢ(t) (i = S, F). The following values for sample span N, sampling frequency h, and correlation parameter ρ are employed:

N = {25, 50, 100},    h = {1, 1/4, 1/12},    ρ = {−0.9, 0, 0.9}.

The sample sizes, T = N/h, therefore range from 25 (N = 25, h = 1) through to 1200 (N = 100, h = 1/12). The initial values y₂(0) are taken to be proportional to the variances of the Brownian motions driving y₂(t), each of which is equal to one, so that

y₂(0) = (0, 0)′, (1, 1)′ or (10, 10)′,

and the system is initially in equilibrium with y₁(0) = By₂(0). A total of 10,000 replications are carried out for each combination of ρ and y₂(0), with 12,000 observations on the observable
4 × 1 vector y_t being generated for h = 1/12 according to the exact discrete time model in Lemma 2.1. Appropriate aggregation then leads to the data series for h = 1 and h = 1/4. A Wald statistic is also computed to test the null hypothesis

H₀: B_SS = 1, B_SF = 1, B_FS = 1 and B_FF = 1,

its finite sample size properties being based on the 5% critical value from the χ²₄ distribution, and its (size-adjusted) power properties being based on computing the value of the Wald statistic using data generated with each coefficient of B equal to 1.1 (the same set of underlying random variables being used to generate the new sequence of y_t values).

The summary results of the simulations are contained in Tables 1–3. In order to save space, Tables 1 and 2 present, respectively, the estimates of bias and mean square error (MSE) of the composite estimator

b̂ = (B̂_SS + B̂_SF + B̂_FS + B̂_FF)/4,

the true value of which is one. In these tables the bias and MSE values have been multiplied by 10³. The sampling schemes depicted in Theorems 3.1 and 4.1 appear in the tables in the following ways. For fixed h, moving down the three entries in the relevant column (increasing N) corresponds to case (a), while for fixed N, moving along the three entries in the row corresponds to case (c), with case (b) being depicted by downwards diagonal movement so that N increases and h decreases.

Table 1 clearly illustrates the advantages of the FM-OLS estimator over the OLS estimator in terms of its uniformly smaller bias. When ρ = −0.9 the bias of the OLS estimator is seen to increase, for given N, when sampling frequency (and hence sample size) increases, this presumably being a manifestation of the inconsistency of the estimator in this scenario. The effect of increasing y₂(0) on each estimator for given ρ is to reduce the bias, while increasing ρ from negative to positive tends to increase the bias. The MSEs in Table 2 show similar patterns, although those for the FM-OLS estimator are not uniformly smaller than those for the OLS estimator, being larger in six of the nine cases although only at the very smallest sample size (when N = 25 and h = 1).
The properties of the Wald test reported in Table 3 show large size distortions when h = 1, but the size is close to the nominal 5% level when h = 1/12. The values of ρ and y₂(0) do not have a large impact on the size of the test. The power of the test is also affected when h = 1 and for smaller spans of data, but otherwise the test typically has high power, though there is a noticeable loss of power for smaller values of y₂(0) when ρ = 0 compared to when ρ ≠ 0. These simulations provide support for the analytical results obtained in Theorems 3.1–4.2 and show that FM-OLS estimators and related Wald tests perform well in cointegrated models when sampling frequency and span are allowed to vary.
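The aggregation step used in the simulations — generating the data once at the finest frequency h = 1/12 and then building the h = 1/4 and h = 1 series — treats stock variables by point-in-time sampling and flow variables by accumulation over the interval. A sketch of that bookkeeping (hypothetical helper, not the paper's code):

```python
import numpy as np

def aggregate(x, k, kind):
    """Aggregate a fine-frequency series x to every k-th period.

    kind='stock': skip-sample (take the level at the end of each interval)
    kind='flow' : sum the k fine-period flows within each interval
    """
    T = (len(x) // k) * k                       # drop any incomplete final block
    blocks = np.asarray(x[:T]).reshape(-1, k)
    return blocks[:, -1] if kind == "stock" else blocks.sum(axis=1)

monthly = np.arange(1, 25, dtype=float)         # 24 fine periods of data
annual_stock = aggregate(monthly, 12, "stock")  # end-of-year levels
annual_flow = aggregate(monthly, 12, "flow")    # annual totals
```

The same set of fine-frequency shocks thereby underlies every (N, h) combination, as the text requires.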
6. FURTHER EXTENSIONS

As mentioned in Section 1, the results in this paper have been derived using a simple model in order to focus on the raw effects of sampling frequency on the estimation of cointegrating vectors. In practice there are a number of extensions to the model that are likely to be important.
Table 1. Simulation results: mean bias of b̂ (×10³).

                            OLS                              FM-OLS
ρ      y₂(0)     N      h=1      h=1/4    h=1/12       h=1      h=1/4    h=1/12
−0.9   (0,0)     25   −13.51     37.04     47.17      −6.59      6.95      2.93
                 50    −7.45     18.45     23.48      −2.06      1.70      0.68
                100    −3.80      9.21     11.71      −0.58      0.41      0.15
       (1,1)     25   −12.59     33.26     42.88      −6.38      5.89      2.77
                 50    −7.15     17.63     22.48      −2.18      1.74      0.81
                100    −3.77      9.21     11.67      −0.64      0.48      0.19
       (10,10)   25    −0.90      2.74      3.59      −0.11      0.46      0.23
                 50    −0.96      2.52      3.28      −0.12      0.18      0.07
                100    −0.91      2.44      3.12      −0.08      0.08      0.02
 0.0   (0,0)     25   −67.80    −18.14     −7.68     −36.94     −3.60     −0.98
                 50   −36.03     −9.45     −4.04     −11.98     −1.40     −0.60
                100   −18.33     −4.62     −1.92      −3.44     −0.39     −0.17
       (1,1)     25   −58.94    −15.69     −6.53     −30.28     −2.90     −0.56
                 50   −33.07     −8.61     −3.74     −10.50     −1.34     −0.61
                100   −17.49     −4.42     −1.88      −3.11     −0.38     −0.21
       (10,10)   25    −4.34     −1.26     −0.54      −1.87     −0.22     −0.09
                 50    −4.54     −1.20     −0.46      −1.07     −0.11     −0.03
                100    −4.26     −1.04     −0.37      −0.49     −0.02      0.02
 0.9   (0,0)     25  −118.39    −71.43    −61.51     −66.39    −12.86     −4.10
                 50   −60.78    −35.11    −30.02     −19.12     −3.00     −0.82
                100   −30.85    −17.55    −14.94      −5.30     −0.84     −0.30
       (1,1)     25  −106.79    −64.85    −55.81     −55.70    −10.42     −3.16
                 50   −58.80    −34.12    −29.07     −17.70     −2.81     −0.75
                100   −30.63    −17.37    −14.71      −5.05     −0.71     −0.15
       (10,10)   25    −7.60     −4.86     −4.23      −2.89     −0.59     −0.16
                 50    −8.16     −4.94     −4.24      −1.94     −0.32     −0.09
                100    −7.93     −4.61     −3.93      −1.03     −0.16     −0.05
It is shown below how the model and results can be generalized to allow for some of these extensions.

6.1. Deterministic components

An important extension of the model is to allow for deterministic terms, such as a vector of constants, (polynomial) time trends, and dummy variables. If z(τ) denotes a p × 1 vector of such deterministic terms then the model (2.1) can be augmented as follows:
Table 2. Simulation results: mean square error of b̂ (×10³).

                            OLS                              FM-OLS
ρ      y₂(0)     N      h=1      h=1/4    h=1/12       h=1      h=1/4    h=1/12
−0.9   (0,0)     25     3.92      4.86      7.02       5.99      1.31      0.98
                 50     1.00      1.20      1.74       0.77      0.27      0.23
                100     0.24      0.30      0.44       0.16      0.06      0.05
       (1,1)     25     3.28      4.08      6.02       3.43      1.10      0.84
                 50     0.90      1.11      1.60       0.68      0.24      0.20
                100     0.23      0.29      0.42       0.16      0.06      0.05
       (10,10)   25     0.14      0.13      0.17       0.59      0.04      0.04
                 50     0.08      0.07      0.09       0.05      0.02      0.02
                100     0.04      0.04      0.06       0.02      0.01      0.01
 0.0   (0,0)     25    17.41      4.90      4.08      40.86      4.62      4.14
                 50     4.67      1.23      1.02       2.08      1.08      1.01
                100     1.21      0.31      0.26       0.39      0.26      0.25
       (1,1)     25    14.00      3.99      3.40      15.80      3.78      3.46
                 50     4.15      1.09      0.91       1.80      0.95      0.90
                100     1.14      0.29      0.24       0.37      0.24      0.23
       (10,10)   25     0.41      0.17      0.16       0.38      0.17      0.16
                 50     0.24      0.09      0.08       0.12      0.08      0.08
                100     0.14      0.04      0.04       0.05      0.04      0.04
 0.9   (0,0)     25    38.84     14.11     10.63      22.16      1.43      0.69
                 50    10.28      3.45      2.57       1.98      0.21      0.15
                100     2.71      0.89      0.66       0.21      0.04      0.04
       (1,1)     25    31.57     11.78      8.88      30.80      1.13      0.59
                 50     9.52      3.25      2.40       1.64      0.19      0.14
                100     2.60      0.85      0.62       0.18      0.04      0.04
       (10,10)   25     0.74      0.27      0.21       1.15      0.03      0.02
                 50     0.48      0.17      0.13       0.07      0.01      0.01
                100     0.31      0.11      0.08       0.02      0.01      0.01
dy(τ) = [Πz(τ) − JAy(τ)] dτ + dw(τ),    τ > 0,    (6.1)

where Π is an m × p matrix of coefficients. The equations corresponding to the sub-vectors y₁ and y₂ now become

dy₁(τ) = −[y₁(τ) − Π₁z(τ) − By₂(τ)] dτ + dw₁(τ),    τ > 0,
dy₂(τ) = Π₂z(τ) dτ + dw₂(τ),    τ > 0,
Table 3. Simulation results: size and size-adjusted power of Wald test.

                           Size                      Size-adjusted power
ρ      y₂(0)     N      h=1      h=1/4    h=1/12       h=1      h=1/4    h=1/12
−0.9   (0,0)     25    42.60      9.79      5.65      32.33     87.63     92.12
                 50    24.52      5.98      4.75      84.93     99.70     99.81
                100    12.75      4.93      4.71      99.45    100.00    100.00
       (1,1)     25    42.00      9.83      5.90      37.80     90.86     94.20
                 50    25.05      6.20      4.96      86.32     99.78     99.81
                100    13.36      5.37      4.92      99.56    100.00    100.00
       (10,10)   25    40.92      7.77      4.91      92.87    100.00     99.96
                 50    23.05      5.37      4.90      98.85    100.00     99.99
                100    11.33      4.76      4.87      99.73    100.00    100.00
 0.0   (0,0)     25    44.21     13.76      7.04      17.25     50.12     56.02
                 50    24.89      9.27      6.27      67.35     85.99     88.11
                100    12.87      6.63      5.46      96.59     99.22     99.48
       (1,1)     25    44.11     13.07      7.47      20.43     55.97     60.66
                 50    24.04      8.91      6.28      70.35     88.60     90.21
                100    13.33      6.72      5.66      96.82     99.27     99.43
       (10,10)   25    42.44     13.81      7.76      92.08    100.00    100.00
                 50    24.12      9.10      6.61      99.79    100.00    100.00
                100    12.85      6.93      6.05     100.00    100.00    100.00
 0.9   (0,0)     25    38.77      8.08      5.43      35.03     86.47     93.96
                 50    16.38      5.41      4.92      84.05     99.33     99.86
                100     6.23      4.38      4.62      99.31    100.00    100.00
       (1,1)     25    36.37      7.42      5.06      39.78     89.36     95.84
                 50    15.76      5.19      5.04      85.61     99.55     99.97
                100     6.04      4.67      4.92      99.35    100.00    100.00
       (10,10)   25    32.56      7.08      4.93      96.15    100.00    100.00
                 50    13.83      4.96      4.54      99.83    100.00    100.00
                100     5.58      4.63      5.06     100.00    100.00    100.00
where Π = [Π₁′, Π₂′]′, Π₁ and Π₂ being of dimensions m₁ × p and m₂ × p, respectively. The solution to the system is given by

y(τ) = ∫₀^τ e^{−(τ−r)JA} Π z(r) dr + ∫₀^τ e^{−(τ−r)JA} dw(r) + e^{−τJA} y(0),    τ > 0,    (6.2)

which also affects the systems of equations used to derive the exact discrete model in the mixed-sample case. For example, the difference equation (2.3) becomes

Δ_h y(th) = −φ_h JA y(th − h) + ∫_{th−h}^{th} e^{−(th−r)JA} Π z(r) dr + ε(th),    t = 1, …, T,
with a corresponding term appearing in (2.6) obtained by a further integration of the above equation. In view of the vector z(τ) comprising deterministic terms it is possible to evaluate explicitly the integrals that appear in such equations. For example, if z(τ) = [1, τ]′ and Π = [θ₁, θ₂], with θ₁ and θ₂ each being m × 1, then it can be shown that

∫_{th−h}^{th} e^{−(th−r)JA} Π z(r) dr = ∫_{th−h}^{th} [I − φ_{th−r} JA] (θ₁ + θ₂r) dr = c₁ + c₂th,    (6.3)

where c₁ = [I − JA](hθ₁ − (h²/2)θ₂) + JA[φ_hθ₁ + (he^{−h} − φ_h)θ₂] and the discrete time trend vector is c₂ = [h(I − JA) + φ_h JA]θ₂. Hence the continuous time linear trend is transformed into a discrete time linear trend, the complicated mapping between [θ₁, θ₂] and [c₁, c₂] being a consequence of the process of temporal aggregation. A further consequence of the presence of such trends is that the Brownian motions that describe the limiting distribution of the FM-OLS estimator of B need to be replaced by their detrended versions, as is common in the estimation of cointegrated systems.
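The mapping from (θ₁, θ₂) to (c₁, c₂) can be verified numerically: writing e^{−sJA} = I − φ_s JA (which holds when (JA)² = JA, as appears to underlie Lemma 2.1), the integral in (6.3) can be evaluated by quadrature and compared with c₁ + c₂·th. A sketch with hypothetical parameter values:

```python
import numpy as np

def phi(s):
    return 1.0 - np.exp(-s)

def trend_integral(JA, th1, th2, h, th, n=20001):
    """Trapezoid quadrature of int_{th-h}^{th} [I - phi(th-r) JA](th1 + th2*r) dr."""
    m = JA.shape[0]
    r = np.linspace(th - h, th, n)
    vals = np.array([(np.eye(m) - phi(th - ri) * JA) @ (th1 + th2 * ri) for ri in r])
    return ((vals[1:] + vals[:-1]) / 2).sum(axis=0) * (h / (n - 1))

def discrete_trend_coeffs(JA, th1, th2, h):
    """c1 and c2 of (6.3), as reconstructed in the text above."""
    I = np.eye(JA.shape[0])
    c1 = ((I - JA) @ (h * th1 - 0.5 * h**2 * th2)
          + JA @ (phi(h) * th1 + (h * np.exp(-h) - phi(h)) * th2))
    c2 = (h * (I - JA) + phi(h) * JA) @ th2
    return c1, c2
```

With an idempotent JA the quadrature and the closed-form coefficients agree to numerical precision.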
6.2. Non-triangular cointegrated system

The triangular form of continuous time cointegrated system that underlies the analysis of the effects of sampling frequency has the distinct advantage that the discrete time representation is also of triangular form, thereby maintaining the linearity in terms of the matrix B and the separation of the common stochastic trends represented by the equations determining y₂. In empirical work cointegrated systems are often specified along the lines of the cointegrated VAR of Johansen (1991), which involves a matrix of speed-of-adjustment parameters describing the responsiveness of the system variables to long-run disequilibrium. The cointegrated VAR also typically serves as a vehicle for the testing of cointegration rank, which has been assumed known in the analysis of the triangular system. Consider, then, a first-order continuous time VAR model with linear trend, specified as

dy(τ) = [a + bτ + AB′y(τ)] dτ + dw(τ),    τ > 0,    (6.4)

where A (m × m₁) is the matrix of adjustment coefficients, B (in this specification also m × m₁) is the matrix of cointegrating parameters, and a and b are m × 1 vectors. In circumstances in which the cointegrating relationships themselves contain a linear trend component, the vectors a and b can be restricted appropriately; details of how this can be achieved in discrete time cointegrated systems can be found in Pesaran et al. (2000) and carry over straightforwardly to the continuous time case.

A system of the form (6.4) was considered by Chambers (2009) in which A′ = [C′, 0] and B′ = [I_{m₁}, −Θ]. The matrix B therefore embodies the m₁² restrictions required for identification of the cointegrating parameters in Θ in the same way as the triangular system studied here, although the matrix A allows for C ≠ I, which enables feedback from the long-run disequilibrium onto y₁ but not y₂. As can be seen from Theorem 1 in Chambers (2009), the allowance of C ≠ I does complicate the formulae determining the discrete time representation as compared to Lemma 2.1 here. In particular, the discrete time representation involves the product B′y on the lagged vector y, which invalidates the regression-based procedures considered here although, of course, maximum likelihood provides an alternative method for estimating the model parameters, and formulae for constructing the Gaussian likelihood (along with a discussion of computational issues) are provided in Chambers (2009).
6.3. Higher-order dynamics

Higher-order dynamics can be introduced into the system (2.1) by replacing the uncorrelated increment process dw(τ) with a stationary process whose dynamics are governed by a higher-order stochastic differential equation system, yielding

dy(τ) = −JAy(τ) dτ + u(τ) dτ,    τ > 0,    (6.5)

where u(τ) satisfies

C(D)u(τ) dτ = dw(τ),    −∞ < τ < ∞,    (6.6)

w(τ) is as previously defined, C(z) = z^p I_m + Σ_{j=0}^{p−1} C_j z^j, the C_j are m × m matrices of parameters, and D is the mean square differential operator. The dynamics for y(τ) are then determined by the cointegrated stochastic differential equation system of order p + 1 defined by C(D)(DI_m + JA)y(τ) dτ = dw(τ). The corresponding discrete time representation preserves the cointegration between the variables and can be written in the form of a VARMA system. Formulae for the discrete time representation and corresponding Gaussian likelihood function can be found in Chambers (1999), while the asymptotic properties of a rival frequency domain approach are derived in Chambers and McCrorie (2007). Alternatively, if the focus is purely on the matrix B rather than (joint) estimation of the parameters governing the dynamics (the parameters of C(z)), the FM-OLS approach based on (3.2) can still be used provided that the modification for serial correlation (i.e. the estimation of the long-run variance matrix Ω) includes sufficient lags to capture the dynamics adequately.

6.4. Volatility

Of particular importance in the modelling of financial time series, and also of relevance to some macroeconomic time series, is the issue of (stochastic) volatility. Much progress has been made in the financial econometrics literature using continuous time models incorporating volatility; see, for example, Andersen et al. (2009) for a recent review of such work. Little attention, though, has been paid to issues of cointegration in this area, the focus mainly being on volatility measurement. In a different vein, other recent work has incorporated volatility into discrete time cointegrated VARs; see, for example, Boswijk and Zu (2007) and Cavaliere et al. (2010).
Although a thorough analysis of the effects of sampling frequency on the estimation of cointegrating parameters in a model containing volatility is beyond the scope of this paper, some comments as to how this may be achieved are provided below. Consider the model

dy(τ) = −JAy(τ) dτ + Σ(τ) dw(τ),    τ > 0,    (6.7)
where Σ(τ) is an m × m (almost surely) positive definite matrix for all τ > 0 and w(τ) is Brownian motion with variance matrix I_m. Assume that y contains only stock variables so that the observed vector is y_th = y(th) (t = 1, …, T). The impact of the presence of the volatility is that the discrete time disturbances will also display volatility and, from (2.3), (2.4) and (2.5), the discrete time representation will be

Δ_h y_th = −φ_h JA y_{th−h} + ε_th,    t = 1, …, T,    (6.8)
where

ε_th = ∫_{th−h}^{th} e^{−(th−r)JA} Σ(r) dw(r),    t = 1, …, T.    (6.9)

With stock-sampling the vector ε_th is serially uncorrelated and its variance matrix is

Ω_th = E(ε_th ε_th′) = ∫_{th−h}^{th} e^{−(th−r)JA} Σ(r)Σ(r)′ e^{−(th−r)A′J′} dr = ∫₀^h e^{−sJA} Ω(th − s) e^{−sA′J′} ds,

where Ω(r) = Σ(r)Σ(r)′. In terms of the asymptotics developed earlier, both sample size (T) and frequency (h) are indexed by n, so that interest concerns the triangular model Δ y_nt = −φ_n JA y_{nt−1} + ε_nt. Let ε_nt = Σ_nt u_nt, where Σ_nt Σ_nt′ = Ω_nt = Ω_{th_n} and u_nt is a serially uncorrelated random vector with mean vector zero and variance matrix I_m, and define the random step functions

U_n(r) = T_n^{−1/2} Σ_{t=1}^{[T_n r]} u_nt,    r ∈ [0, 1],

Σ_n(r) = T_n^{1/2} Σ_{n,[T_n r]+1} for r ∈ [0, 1),    Σ_n(1) = T_n^{1/2} Σ_{n,T_n}.

Boswijk and Zu (2007) assume that (their equivalent of) (U_n(r), Σ_n(r)) ⇒ (W(r), Σ(r)) as n → ∞, where W(r) is an m × 1 standard Brownian motion process and Σ(r) is an m × m matrix process on [0, 1], independent of W(r), with elements having continuous sample paths and E∫₀¹ Σ_{ij}(r)² dr < ∞ for all i, j = 1, …, m. A consequence of this is that

T_n^{−1/2} Σ_{t=1}^{[T_n r]} ε_nt ⇒ ∫₀^r Σ(s) dW(s) ≡ V(r),    r ∈ [0, 1],
and so the limiting properties of sample moments are expressed in terms of the random process V(r) rather than W(r). The rates of convergence of the estimators remain the same, but the limiting distributions are expressed in terms of V(r).
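The limit process V(r) = ∫₀^r Σ(s) dW(s) can be approximated on a grid by an Euler sum, which is how limiting distributions of this kind are typically simulated. A sketch in the scalar case, with a deterministic volatility path Σ(s) = 1 + s chosen purely for illustration:

```python
import numpy as np

def simulate_V(sigma, n_steps, rng):
    """Euler approximation of V(r) = int_0^r sigma(s) dW(s) on [0, 1]."""
    dt = 1.0 / n_steps
    s = np.arange(n_steps) * dt                      # left endpoints of the grid
    dW = rng.standard_normal(n_steps) * np.sqrt(dt)  # Brownian increments
    return np.concatenate(([0.0], np.cumsum(sigma(s) * dW)))

# Monte Carlo check: var V(1) = int_0^1 (1 + s)^2 ds = 7/3 when sigma(s) = 1 + s
rng = np.random.default_rng(42)
paths = np.array([simulate_V(lambda s: 1.0 + s, 1000, rng)[-1]
                  for _ in range(4000)])
```

The variance of the terminal value accumulates the squared volatility path, which is the feature that distinguishes V(r) from a standard Brownian motion.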
7. CONCLUDING COMMENTS

This paper has investigated the large-sample asymptotic properties of OLS and FM-OLS regression estimators of cointegrating parameters under three scenarios concerning the span of data and sampling frequency, each scenario depending on whether span or frequency (or both) tends to infinity. In cases where the span tends to infinity the OLS estimators are consistent, but their limiting distributions suffer from second-order bias effects that arise due to serial correlation and endogeneity of regressors. When the span is fixed the OLS estimators are not even consistent, and the distribution depends explicitly on the initial conditions. In contrast, the FM-OLS estimators are shown to have limiting mixed normal distributions when the span tends to infinity and are therefore members of the class of optimal estimators defined by Phillips (1991a). As a result, Wald statistics associated with the FM-OLS estimators have limiting chi-square distributions. The finite sample performance of the estimators and test statistics is explored in a simulation study in which the superiority of the FM-OLS estimator in terms of bias and mean square error is demonstrated, and the Wald statistics are found generally to have good size and power properties.
176
M. J. Chambers
The model underlying the results in this paper is a continuous time version of the prototypical model used in Phillips (1991a), its simplicity enabling the analysis to focus on the effects of sampling frequency on the estimators of the cointegrating parameters without the need to be concerned with accounting for additional complications, such as system dynamics and volatility, that would arise in more applications-oriented models. Despite the simplicity one can expect the same rates of convergence as for the FM-OLS estimators to apply to optimal estimators in more complicated models that incorporate extensions of the type considered in Section 6, although the limiting distributions may be defined in terms of detrended Brownian motions or random processes that incorporate volatility, depending on the extension considered. Other extensions, such as the incorporation of jump processes, may result in a component that does not satisfy the functional central limit theorem and so were not considered here. A further interesting use of the results presented here would be the derivation of the theoretical properties of tests for cointegration when sampling frequency varies. Such research is ongoing and will be helpful in explaining the simulation findings of Hooker (1993), Lahiri and Mamingi (1995), Otero and Smith (2000) and Haug (2002).
ACKNOWLEDGMENTS

I am grateful to three anonymous referees for their helpful comments. The original work on which this paper is based was supported in part by the ESRC under grant number R000222961 and by the Leverhulme Trust in the form of a Philip Leverhulme Prize.
REFERENCES

Andersen, T. G., T. Bollerslev and F. X. Diebold (2009). Parametric and nonparametric volatility measurement. In Y. Ait-Sahalia and L. P. Hansen (Eds.), Handbook of Financial Econometrics, Volume 1, 67–138. Amsterdam: North-Holland.
Attfield, C. L. F. (1997). Estimating a cointegrating demand system. European Economic Review 41, 61–73.
Bandi, F. and P. C. B. Phillips (2009). Nonstationary continuous-time processes. In Y. Ait-Sahalia and L. P. Hansen (Eds.), Handbook of Financial Econometrics, Volume 1, 139–202. Amsterdam: North-Holland.
Bergstrom, A. R. (1984). Continuous time stochastic models and issues of aggregation over time. In Z. Griliches and M. D. Intriligator (Eds.), Handbook of Econometrics, Volume 2, 1145–212. Amsterdam: North-Holland.
Boswijk, H. P. and Y. Zu (2007). Testing for cointegration with nonstationary volatility. Working paper, University of Amsterdam.
Cavaliere, G., A. Rahbek and A. M. R. Taylor (2010). Testing for co-integration in vector autoregressions with non-stationary volatility. Journal of Econometrics 158, 7–24.
Chambers, M. J. (1999). Discrete time representation of stationary and non-stationary continuous time systems. Journal of Economic Dynamics and Control 23, 619–39.
Chambers, M. J. (2004). Testing for unit roots with flow data and varying sampling frequency. Journal of Econometrics 119, 1–18 (Corrigendum: Journal of Econometrics (2008) 144, 524–25).
Chambers, M. J. (2009). Discrete time representations of cointegrated continuous time models with mixed sample data. Econometric Theory 25, 1030–49.
Chambers, M. J. and J. R. McCrorie (2007). Frequency domain estimation of temporally aggregated Gaussian cointegrated systems. Journal of Econometrics 136, 1–29.
Comte, F. (1999). Discrete and continuous time cointegration. Journal of Econometrics 88, 207–26.
Haug, A. A. (2002). Temporal aggregation and the power of cointegration tests: a Monte Carlo study. Oxford Bulletin of Economics and Statistics 64, 399–412.
Hooker, M. A. (1993). Testing for cointegration: power versus frequency of observation. Economics Letters 41, 359–62.
Johansen, S. (1991). Estimation and hypothesis testing of cointegration vectors in Gaussian vector autoregressive models. Econometrica 59, 1551–80.
Kessler, M. and A. Rahbek (2004). Identification and inference for cointegrated and ergodic diffusions. Statistical Inference for Stochastic Processes 7, 137–51.
Lahiri, K. and N. Mamingi (1995). Testing for cointegration: power versus frequency of observation—another view. Economics Letters 49, 121–24.
Mark, N. C. and D. Sul (2003). Cointegration vector estimation by panel DOLS and long-run money demand. Oxford Bulletin of Economics and Statistics 65, 655–80.
Moon, H. R. and B. Perron (2004). Efficient estimation of the SUR cointegration regression model and testing for purchasing power parity. Econometric Reviews 23, 293–323.
Ng, S. (1995). Testing for homogeneity in demand systems when the regressors are nonstationary. Journal of Applied Econometrics 10, 147–63.
Otero, J. and J. Smith (2000). Testing for cointegration: power versus frequency of observation—further Monte Carlo results. Economics Letters 67, 5–9.
Perron, P. (1991). Test consistency with varying sampling frequency. Econometric Theory 7, 341–68.
Pesaran, M. H., Y. Shin and R. J. Smith (2000). Structural analysis of vector error correction models with exogenous I(1) variables. Journal of Econometrics 97, 293–343.
Phillips, P. C. B. (1987a). Time series regression with a unit root. Econometrica 55, 277–301.
Phillips, P. C. B. (1987b). Towards a unified asymptotic theory for autoregression. Biometrika 74, 535–47.
Phillips, P. C. B. (1991a). Optimal inference in cointegrated systems. Econometrica 59, 283–306.
Phillips, P. C. B. (1991b). Error correction and long run equilibrium in continuous time. Econometrica 59, 967–80.
Phillips, P. C. B. (1991c). Spectral regression for cointegrated time series. In W. A. Barnett, J. Powell and G. Tauchen (Eds.), Nonparametric and Semiparametric Methods in Economics and Statistics, 413–35. Cambridge: Cambridge University Press.
Phillips, P. C. B. and B. E. Hansen (1990). Statistical inference in instrumental variables regression with I(1) processes. Review of Economic Studies 57, 99–125.
Stock, J. H. and M. W. Watson (1993). A simple estimator of cointegrating vectors in higher order integrated systems. Econometrica 61, 783–820.
APPENDIX A

Proof of Lemma 2.1: Picking out the stock and flow equations from (2.5) and (2.6) yields

$$\Delta_h y^S_{1,th} = -\phi_h\big(y^S_{1,th-h} - B_{SS}\,y^S_{2,th-h} - B_{SF}\,y^F_2(th-h)\big) + \varepsilon^S_{1,th}, \tag{A.1}$$
$$\Delta_h y^F_{1,th} = -\phi_h\big(y^F_{1,th-h} - B_{FS}\,Y^S_{2,th-h} - B_{FF}\,y^F_{2,th-h}\big) + \eta^F_{1,th}, \tag{A.2}$$
$$\Delta_h y^S_{2,th} = \varepsilon^S_{2,th}, \tag{A.3}$$
$$\Delta_h y^F_{2,th} = \eta^F_{2,th}. \tag{A.4}$$
The unobservable components, $y^F_2(th-h)$ in (A.1) and $Y^S_{2,th-h}$ in (A.2), can be eliminated by setting $u(\tau)d\tau = dw(\tau)$ in Lemma B.1, which results in

$$y^F_2(th-h) = y^F_{2,th-h} + \delta^F_{2,th-h}, \qquad Y^S_{2,th-h} = y^S_{2,th-h} - \delta^S_{2,th-h},$$

where $\delta_{th}$ is defined following (2.11). Hence

$$\Delta_h y^S_{1,th} = -\phi_h\big(y^S_{1,th-h} - B_{SS}\,y^S_{2,th-h} - B_{SF}\,y^F_{2,th-h}\big) + \xi^S_{1,th}, \tag{A.5}$$
$$\xi^S_{1,th} = \varepsilon^S_{1,th} + \phi_h B_{SF}\,\delta^F_{2,th-h}, \tag{A.6}$$
$$\Delta_h y^F_{1,th} = -\phi_h\big(y^F_{1,th-h} - B_{FS}\,y^S_{2,th-h} - B_{FF}\,y^F_{2,th-h}\big) + \xi^F_{1,th}, \tag{A.7}$$
$$\xi^F_{1,th} = \eta^F_{1,th} - \phi_h B_{FS}\,\delta^S_{2,th-h}. \tag{A.8}$$

The discrete time representation (2.10) is obtained by combining (A.5), (A.7), (A.3) and (A.4), while the associated disturbance vector (2.11) follows from these representations. The equations for t = 1 are obtained straightforwardly from (2.5) and (2.8). For the autocovariances of $\xi_{th}$, it is convenient to define the selection matrices

$$S_1 = \begin{pmatrix} J^S_1 & 0 \\ 0 & J^S_2 \end{pmatrix}, \qquad S_2 = \begin{pmatrix} J^F_1 & 0 \\ 0 & J^F_2 \end{pmatrix}, \qquad S_3 = \begin{pmatrix} 0 & B^* \\ 0 & 0 \end{pmatrix},$$

where the sub-matrices are defined by

$$J^S_i = \begin{pmatrix} I_{m^S_i} & 0 \\ 0 & 0 \end{pmatrix}, \qquad J^F_i = \begin{pmatrix} 0 & 0 \\ 0 & I_{m^F_i} \end{pmatrix}, \qquad B^* = \begin{pmatrix} 0 & B_{SF} \\ -B_{FS} & 0 \end{pmatrix}.$$

Then the discrete time disturbance vectors can be written

$$\xi_h = S_1\varepsilon_h + S_2\eta_h, \qquad \xi_{th} = S_1\varepsilon_{th} + S_2\eta_{th} + \phi_h S_3\,\delta_{th-h}, \qquad t = 2,\dots,T.$$

We then find, from the above representation and Lemma B.3, that

$$\Sigma^\xi_{h,11} = S_1\Sigma^\varepsilon_{h,0}S_1' + S_1\Sigma^{\varepsilon\eta}_{h,0}S_2' + S_2\Sigma^{\eta\varepsilon}_{h,0}S_1' + S_2\Sigma^\eta_{h,11}S_2'
= hS_1\Sigma S_1' + \frac{h}{2}\big(S_1\Sigma S_2' + S_2\Sigma S_1'\big) + \frac{h}{3}S_2\Sigma S_2' + O(h^2),$$

$$\Sigma^\xi_{h,0} = S_1\Sigma^\varepsilon_{h,0}S_1' + S_1\Sigma^{\varepsilon\eta}_{h,0}S_2' + S_2\Sigma^{\eta\varepsilon}_{h,0}S_1' + S_2\Sigma^\eta_{h,0}S_2' + \phi_h S_2\Sigma^{\eta\delta}_{h,1}S_3' + \phi_h S_3\Sigma^{\delta\eta}_{h,-1}S_2' + \phi_h^2 S_3\Sigma^\delta_{h,0}S_3'
= hS_1\Sigma S_1' + \frac{h}{2}\big(S_1\Sigma S_2' + S_2\Sigma S_1'\big) + \frac{2h}{3}S_2\Sigma S_2' + O(h^2),$$

$$\Sigma^\xi_{h,1} = S_2\Sigma^{\eta\varepsilon}_{h,1}S_1' + S_2\Sigma^\eta_{h,1}S_2' + \phi_h S_3\Sigma^{\delta\varepsilon}_{h,0}S_1' + \phi_h S_3\Sigma^{\delta\eta}_{h,0}S_2'
= \frac{h}{2}S_2\Sigma S_1' + \frac{h}{6}S_2\Sigma S_2' + O(h^2).$$

The matrices $\Sigma_{00}$, $\Sigma_0$ and $\Sigma_1$ then follow by examining the terms $S_i\Sigma S_j'$ $(i, j = 1, 2)$.
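The elimination step above is pure matrix algebra and can be spot-checked numerically. The block below verifies that substituting $y^F_2(th-h) = y^F_{2,th-h} + \delta^F_{2,th-h}$ into the error-correction form of (A.1) reproduces (A.5) with $\xi^S_{1,th}$ as in (A.6). The dimensions and values are arbitrary; this checks only the reconstructed algebra, not the model itself.

```python
import numpy as np

rng = np.random.default_rng(0)
phi_h = 0.3                                  # any value of phi_h works here
B_SS, B_SF = rng.normal(size=(2, 2)), rng.normal(size=(2, 2))
y1S, y2S, y2F, dF, eps = (rng.normal(size=2) for _ in range(5))

# (A.1) written with the unobservable level y2^F(th-h) = y2F + dF (Lemma B.1)
lhs = -phi_h * (y1S - B_SS @ y2S - B_SF @ (y2F + dF)) + eps
# (A.5)-(A.6): the same quantity after eliminating the unobservable component
xi = eps + phi_h * B_SF @ dF
rhs = -phi_h * (y1S - B_SS @ y2S - B_SF @ y2F) + xi

assert np.allclose(lhs, rhs)
```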
Proof of Lemma 3.1: Note, first, that $y_{2,nt} = S_{2,nt} + y_2(0)$, where $S_{nt} = (S_{1,nt}', S_{2,nt}')' = \sum_{j=1}^t \xi_{nj}$. Then

$$Y_2'Y_2 = \sum_{t=1}^{T_n} y_{2,nt-1}y_{2,nt-1}' = \sum_t S_{2,nt-1}S_{2,nt-1}' + \sum_t S_{2,nt-1}\,y_2(0)' + y_2(0)\sum_t S_{2,nt-1}' + T_n\,y_2(0)y_2(0)',$$

$$Y_2'\xi_1 = \sum_{t=1}^{T_n} y_{2,nt-1}\xi_{1,nt}' = \sum_t S_{2,nt-1}\xi_{1,nt}' + y_2(0)\,S_{1,nT_n}'.$$

The results then follow from applying Lemma B.4 to the above expressions.
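The decomposition of $Y_2'Y_2$ is just the expansion of $\sum_t (S_{2,nt-1} + y_2(0))(S_{2,nt-1} + y_2(0))'$; a quick numerical check of this algebra:

```python
import numpy as np

rng = np.random.default_rng(1)
T, m = 50, 2
xi = rng.normal(size=(T, m))
S = np.cumsum(xi, axis=0)                    # S_t = sum_{j<=t} xi_j
y2_0 = rng.normal(size=m)                    # initial value y_2(0)

Slag = np.vstack([np.zeros((1, m)), S[:-1]]) # S_{t-1}, with S_0 = 0
ylag = Slag + y2_0                           # y_{2,t-1} = S_{t-1} + y_2(0)

lhs = ylag.T @ ylag                          # sum_t y_{2,t-1} y_{2,t-1}'
sumS = Slag.sum(axis=0)
rhs = (Slag.T @ Slag + np.outer(sumS, y2_0)
       + np.outer(y2_0, sumS) + T * np.outer(y2_0, y2_0))
assert np.allclose(lhs, rhs)
```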
Proof of Theorem 3.1: Follows straightforwardly from (3.3) and Lemma 3.1.
Proof of Lemma 4.1: In both cases note that $\hat\xi_{1,nt} = \xi_{1,nt} - (\hat B - B)\phi_n y_{2,nt-1}$ and hence $\hat\xi_{nt} = \xi_{nt} - J(\hat B - B)\phi_n y_{2,nt-1}$. It follows that

$$\frac{1}{N_n}\sum_{t=1}^{T_n} \hat\xi_{nt}\hat\xi_{nt}' = \frac{1}{N_n}\sum_{t=1}^{T_n} \xi_{nt}\xi_{nt}' - J(\hat B - B)\phi_n \frac{1}{N_n}\sum_{t=1}^{T_n} y_{2,nt-1}\xi_{nt}' - \frac{1}{N_n}\sum_{t=1}^{T_n} \xi_{nt}y_{2,nt-1}'\,\phi_n(\hat B - B)'J' + J(\hat B - B)\phi_n^2 \frac{1}{N_n}\sum_{t=1}^{T_n} y_{2,nt-1}y_{2,nt-1}'\,(\hat B - B)'J'
= \frac{1}{N_n}\sum_{t=1}^{T_n} \xi_{nt}\xi_{nt}' + o_p(1),$$

making use of the convergence rates in Theorem 3.1 and Lemma 3.1. The stated limits then follow, and the analysis of the estimators of the autocovariances follows analogously.

Proof of Theorem 4.1: (a) Using Lemmas 3.1 and 4.1 it follows that

$$\frac{1}{N_n} Y_2'\hat\xi_1^+ = \frac{1}{N_n} Y_2'\hat\xi_1 - \frac{1}{N_n} Y_2'\hat\xi_2\,\hat\Sigma_{h,22}^{-1}\hat\Sigma_{h,12}
\Rightarrow \left(\frac{1}{h}\int_0^1 W_{h2}\,dW_{h1}' + \Lambda_{h1,12}\right) - \left(\frac{1}{h}\int_0^1 W_{h2}\,dW_{h2}' + \Lambda_{h1,22}\right)\Sigma_{h,22}^{-1}\Sigma_{h,12},$$

which implies that

$$\frac{1}{N_n}\hat\xi_1^{+\prime} Y_2 - \hat\Lambda_{h1,12}^+
\Rightarrow \int_0^1 dW_{h,1.2}\,W_{h2}' + \frac{1}{h}\left(\Lambda_{h1,12} - \Sigma_{h,12}\Sigma_{h,22}^{-1}\Lambda_{h1,22}\right) - \frac{1}{h}\left(\Lambda_{h1,12} - \Sigma_{h,12}\Sigma_{h,22}^{-1}\Lambda_{h1,22}\right)
= \int_0^1 dW_{h,1.2}\,W_{h2}'.$$

The result then follows when combined with the convergence of $N_n^{-2}Y_2'Y_2$ in Lemma 3.1. (b) This follows as in part (a) but with $W$ replacing $W_h$.
Proof of Theorem 4.2: The proof of both parts is straightforward; see, for example, Theorem 5.1 of Phillips and Hansen (1990) who establish the limiting chi-square distribution for the Wald statistic based on the FM-OLS estimator.
APPENDIX B

LEMMA B.1. Let $y(\tau)$ satisfy $dy(\tau) = u(\tau)\,d\tau$ $(\tau > 0)$, where $u(\tau)$ is a stationary integrable continuous time random process. Then

$$y(th) - \frac{1}{h}\int_{th-h}^{th} y(r)\,dr = \frac{1}{h}\int_{th-h}^{th}\big[h - (th - r)\big]u(r)\,dr.$$

Proof: First note that $y(th) = y(0) + \int_0^{th} u(r)\,dr$, so that

$$\int_{th-h}^{th} y(r)\,dr = hy(0) + \int_{th-h}^{th}\!\int_0^r u(s)\,ds\,dr,$$

and hence

$$y(th) - \frac{1}{h}\int_{th-h}^{th} y(r)\,dr = \int_0^{th} u(r)\,dr - \frac{1}{h}\int_{th-h}^{th}\!\int_0^r u(s)\,ds\,dr. \tag{B.1}$$

The double integral can be evaluated as follows:

$$\int_{th-h}^{th}\!\int_0^r u(s)\,ds\,dr = \int_{th-h}^{th}\!\int_0^{th-h} u(s)\,ds\,dr + \int_{th-h}^{th}\!\int_{th-h}^r u(s)\,ds\,dr
= \int_0^{th-h}\!\left(\int_{th-h}^{th} dr\right)u(s)\,ds + \int_{th-h}^{th}\!\left(\int_s^{th} dr\right)u(s)\,ds
= h\int_0^{th-h} u(s)\,ds + \int_{th-h}^{th} (th - s)u(s)\,ds. \tag{B.2}$$

Using (B.2) in (B.1) yields

$$y(th) - \frac{1}{h}\int_{th-h}^{th} y(r)\,dr = \int_0^{th} u(s)\,ds - \int_0^{th-h} u(s)\,ds - \frac{1}{h}\int_{th-h}^{th} (th - s)u(s)\,ds
= \int_{th-h}^{th} u(s)\,ds - \frac{1}{h}\int_{th-h}^{th} (th - s)u(s)\,ds
= \frac{1}{h}\int_{th-h}^{th}\big[h - (th - r)\big]u(r)\,dr.$$
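Lemma B.1 is a deterministic integration-by-parts identity, so it can be checked numerically for any smooth path; the sketch below takes u(r) = cos(r), y(r) = sin(r) with y(0) = 0 and compares the two sides by midpoint quadrature.

```python
import math

# deterministic check of Lemma B.1 with u(r) = cos(r), y(r) = sin(r), y(0) = 0
th, h = 1.0, 0.3
n = 100_000
dr = h / n
mid = [th - h + (i + 0.5) * dr for i in range(n)]

lhs = math.sin(th) - sum(math.sin(r) for r in mid) * dr / h
rhs = sum((h - (th - r)) * math.cos(r) for r in mid) * dr / h
assert abs(lhs - rhs) < 1e-9
```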
LEMMA B.2. The random vectors $\varepsilon_{th}$ and $\eta_{th}$, defined in (2.4) and (2.7) respectively, have the following representations:

$$\varepsilon_{th} = \int_{th-h}^{th} \big[I - \phi(th - r)JA\big]\,dw(r),$$
$$\eta_h = \frac{1}{h}\int_0^h K_0(h - r)\,dw(r),$$
$$\eta_{th} = \frac{1}{h}\int_{th-h}^{th} K_0(th - r)\,dw(r) + \frac{1}{h}\int_{th-2h}^{th-h} K_1(th - h - r)\,dw(r), \qquad t = 2,\dots,T,$$

where $K_0(r) = rI + k_0(r)JA$, $K_1(r) = K_0(h) - K_0(r)$, $k_0(r) = \phi(r) - r$ and $\phi(r) = 1 - e^{-r}$.

Proof: The representation for $\varepsilon_{th}$ follows from (2.4) and noting that $e^{-rJA} = I_m - \phi(r)JA$. For $\eta_{th}$ the definition in (2.7) yields

$$\eta_{th} = \frac{1}{h}\int_{th-h}^{th} \varepsilon(s)\,ds = \frac{1}{h}\int_{th-h}^{th}\!\int_{s-h}^{s} \big[I - \phi(s - r)JA\big]\,dw(r)\,ds
= \frac{1}{h}\int_{th-h}^{th}\left(\int_r^{th} \big[I - \phi(s - r)JA\big]\,ds\right) dw(r) + \frac{1}{h}\int_{th-2h}^{th-h}\left(\int_{th-h}^{r+h} \big[I - \phi(s - r)JA\big]\,ds\right) dw(r).$$

Evaluation of the integrals with respect to s yields the required expression. A similar procedure applied to $\eta_h$, which is defined by

$$\eta_h = \frac{1}{h}\int_0^h \varepsilon(s)\,ds = \frac{1}{h}\int_0^h\!\int_0^s \big[I - \phi(s - r)JA\big]\,dw(r)\,ds,$$

yields the stated expression.

LEMMA B.3. The non-zero autocovariances of $\varepsilon_{th}$, $\eta_{th}$ and $\delta_{th}$ are as follows:
$$E(\varepsilon_{th}\varepsilon_{th}') = \Sigma^\varepsilon_{h,0} = \int_0^h \big[I - \phi(r)JA\big]\,\Sigma\,\big[I - \phi(r)JA\big]'\,dr
= h\Sigma - (h - \phi_h)\big(JA\Sigma + \Sigma A'J'\big) + \left(h - 2\phi_h + \tfrac{1}{2}\phi_{2h}\right) JA\Sigma A'J'
= h\Sigma + O(h^2), \qquad t = 1,\dots,T,$$

$$E(\eta_h\eta_h') = \Sigma^\eta_{h,11} = \frac{1}{h^2}\int_0^h K_0(r)\,\Sigma\,K_0(r)'\,dr
= \frac{h}{3}\Sigma + \frac{1}{h^2}\left[\frac{h^2}{2} - \frac{h^3}{3} - \big(\phi_h - he^{-h}\big)\right]\big(JA\Sigma + \Sigma A'J'\big) + \frac{1}{h^2}\left[h - h^2 + \frac{h^3}{3} + \frac{1}{2}\phi_{2h} - 2he^{-h}\right] JA\Sigma A'J'
= \frac{h}{3}\Sigma + O(h^2),$$

$$E(\eta_{th}\eta_{th}') = \Sigma^\eta_{h,0} = \frac{1}{h^2}\int_0^h K_0(r)\,\Sigma\,K_0(r)'\,dr + \frac{1}{h^2}\int_0^h K_1(r)\,\Sigma\,K_1(r)'\,dr
= \frac{2h}{3}\Sigma + O(h^2), \qquad t = 2,\dots,T,$$

$$E(\eta_{th}\eta_{th-h}') = \Sigma^\eta_{h,1} = \frac{1}{h^2}\int_0^h K_1(r)\,\Sigma\,K_0(r)'\,dr = \frac{h}{6}\Sigma + O(h^2), \qquad t = 2,\dots,T,$$

$$E(\delta_{th}\delta_{th}') = \Sigma^\delta_{h,0} = \frac{1}{h^2}\int_0^h (h - r)^2\,dr\;\Sigma = \frac{h}{3}\Sigma, \qquad t = 1,\dots,T,$$

$$E(\varepsilon_{th}\eta_{th}') = \Sigma^{\varepsilon\eta}_{h,0} = \frac{1}{h}\int_0^h \big[I - \phi(r)JA\big]\,\Sigma\,K_0(r)'\,dr = \frac{h}{2}\Sigma + O(h^2), \qquad t = 1,\dots,T,$$

$$E(\eta_{th}\delta_{th-h}') = \Sigma^{\eta\delta}_{h,1} = \frac{1}{h^2}\int_0^h (h - r)\,K_1(r)\,\Sigma\,dr = \frac{h}{3}\Sigma + O(h^2), \qquad t = 2,\dots,T,$$

$$E(\eta_{th}\varepsilon_{th-h}') = \Sigma^{\eta\varepsilon}_{h,1} = \frac{1}{h}\int_0^h K_1(r)\,\Sigma\,\big[I - \phi(r)JA\big]'\,dr = \frac{h}{2}\Sigma + O(h^2), \qquad t = 2,\dots,T,$$

$$E(\delta_{th}\varepsilon_{th}') = \Sigma^{\delta\varepsilon}_{h,0} = \frac{1}{h}\int_0^h (h - r)\,\Sigma\,\big[I - \phi(r)JA\big]'\,dr = \frac{h}{2}\Sigma + O(h^2), \qquad t = 1,\dots,T,$$

$$E(\delta_{th}\eta_{th}') = \Sigma^{\delta\eta}_{h,0} = \frac{1}{h^2}\int_0^h (h - r)\,\Sigma\,K_0(r)'\,dr = \frac{h}{6}\Sigma + O(h^2), \qquad t = 1,\dots,T.$$
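The first display can be checked numerically in the scalar case: replacing $JA$ by a scalar $a$ and $\Sigma$ by 1, the integral must match the closed form. A midpoint-rule sketch (the values of h and a are arbitrary):

```python
import math

# Scalar version of the first display in Lemma B.3: with JA -> a and Sigma -> 1,
# int_0^h (1 - phi(r) a)^2 dr = h - 2a (h - phi_h) + a^2 (h - 2 phi_h + phi_{2h}/2)
h, a = 0.7, 0.4
phi = lambda r: 1.0 - math.exp(-r)
phi_h, phi_2h = phi(h), phi(2.0 * h)

n = 100_000
dr = h / n
numeric = sum((1.0 - phi((i + 0.5) * dr) * a) ** 2 for i in range(n)) * dr
closed = h - 2.0 * a * (h - phi_h) + a * a * (h - 2.0 * phi_h + phi_2h / 2.0)
assert abs(numeric - closed) < 1e-8
```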
Proof: The autocovariances are obtained using the results

$$E\left[\int_{th-h}^{th} A(th - r)\,dw(r)\left(\int_{th-h}^{th} B(th - r)\,dw(r)\right)'\right] = \int_0^h A(r)\,\Sigma\,B(r)'\,dr,$$
$$E\left[\int_{th-h}^{th} A(th - r)\,dw(r)\left(\int_{th-jh-h}^{th-jh} B(th - r)\,dw(r)\right)'\right] = 0, \qquad j \ge 1,$$

where $A(r)$ and $B(r)$ are $m \times m$ matrix functions. The expressions in terms of h are obtained by carrying out the relevant integrations, and the expansions in powers of h follow by expanding the relevant terms. As an example, consider $\Sigma^\varepsilon_{h,0}$; the remainder follow in a similar way. The first expression follows from the definition of $\varepsilon_{th}$ in Lemma B.2 and application of the above rules. The integral requiring evaluation is

$$\int_0^h \big[I - \phi(r)JA\big]\,\Sigma\,\big[I - \phi(r)JA\big]'\,dr = h\Sigma - \left(\int_0^h \phi(r)\,dr\right)\big(JA\Sigma + \Sigma A'J'\big) + \left(\int_0^h \phi(r)^2\,dr\right) JA\Sigma A'J'.$$

Because $\phi(r) = 1 - e^{-r}$ it can be shown that

$$\int_0^h \phi(r)\,dr = h - \phi_h, \qquad \int_0^h \phi(r)^2\,dr = h - 2\phi_h + \frac{\phi_{2h}}{2},$$

which yields the second expression. The expansion in h then follows by noting that $\phi_h = 1 - e^{-h}$ and $e^{-h} = 1 - h + h^2/2 + O(h^3)$.

LEMMA B.4. Let $S_{nt} = \sum_{j=1}^t \xi_{nj}$.

(a) If $h_n = h$ and $N_n \uparrow \infty$ as $n \uparrow \infty$, then $N_n^{-1/2} S_{n[T_n r]} \Rightarrow W_h(r)$, where $W_h \sim BM(\Sigma_h)$ and $\Sigma_h = h^{-1}(\Sigma_{h,0} + \Sigma_{h,1} + \Sigma_{h,1}')$. Furthermore,

$$\frac{1}{N_n^2}\sum_{t=1}^{T_n} S_{nt-1}S_{nt-1}' \Rightarrow \frac{1}{h}\int_0^1 W_h W_h'; \qquad \frac{1}{N_n^{3/2}}\sum_{t=1}^{T_n} S_{nt-1} \Rightarrow \frac{1}{h}\int_0^1 W_h;$$
$$\frac{1}{N_n}\sum_{t=1}^{T_n} S_{nt-1}\xi_{nt}' \Rightarrow \frac{1}{h}\left(\int_0^1 W_h\,dW_h' + \Sigma_{h,1}\right); \qquad \frac{1}{N_n}\sum_{t=1}^{T_n} \xi_{nt}\xi_{nt}' \overset{p}{\to} \frac{1}{h}\Sigma_{h,0}.$$

(b) If $h_n \downarrow 0$ and $N_n \uparrow \infty$ as $n \uparrow \infty$, then $N_n^{-1/2} S_{n[T_n r]} \Rightarrow W(r)$, where $W \sim BM(\Omega)$. Also,

$$\frac{h_n}{N_n^2}\sum_{t=1}^{T_n} S_{nt-1}S_{nt-1}' \Rightarrow \int_0^1 WW'; \qquad \frac{h_n}{N_n^{3/2}}\sum_{t=1}^{T_n} S_{nt-1} \Rightarrow \int_0^1 W;$$
$$\frac{1}{N_n}\sum_{t=1}^{T_n} S_{nt-1}\xi_{nt}' \Rightarrow \int_0^1 W\,dW' + \Sigma_1; \qquad \frac{1}{N_n}\sum_{t=1}^{T_n} \xi_{nt}\xi_{nt}' \overset{p}{\to} \Sigma_0.$$

(c) If $h_n \downarrow 0$ and $N_n = N$ as $n \uparrow \infty$, then $S_{n[T_n r]} \Rightarrow N^{1/2}W(r)$, where $W \sim BM(\Omega)$. Also,

$$h_n\sum_{t=1}^{T_n} S_{nt-1}S_{nt-1}' \Rightarrow N^2\int_0^1 WW'; \qquad h_n\sum_{t=1}^{T_n} S_{nt-1} \Rightarrow N^{3/2}\int_0^1 W;$$
$$\sum_{t=1}^{T_n} S_{nt-1}\xi_{nt}' \Rightarrow N\left(\int_0^1 W\,dW' + \Sigma_1\right); \qquad \sum_{t=1}^{T_n} \xi_{nt}\xi_{nt}' \overset{p}{\to} N\,\Sigma_0.$$
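Part (a)'s long-run covariance $\Sigma_h = h^{-1}(\Sigma_{h,0} + \Sigma_{h,1} + \Sigma_{h,1}')$ is the usual MA(1) long-run variance divided by h; the scalar identity underlying it is elementary:

```python
# MA(1): xi_t = u_t + c u_{t-1} with Var(u_t) = v gives
# gamma_0 = (1 + c^2) v, gamma_1 = c v, long-run variance (1 + c)^2 v,
# so gamma_0 + 2 gamma_1 equals the long-run variance; Sigma_h divides it by h.
c, v, h = 0.6, 2.0, 0.25
gamma0, gamma1 = (1 + c * c) * v, c * v
long_run = (1 + c) ** 2 * v
assert abs(gamma0 + 2 * gamma1 - long_run) < 1e-12
sigma_h = (gamma0 + 2 * gamma1) / h
```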
Proof: In each case, the key is to first determine the properties of a suitably normalized version of $\xi_{nt}$ and then to write the quantities of interest in terms of this quantity and the suitably normalized partial sum $S_{nt}$. The limiting distributions then result from application of the continuous mapping theorem (CMT).

(a) Note that $h^{-1/2}\xi_{nt}$ is an MA(1) process with variance $h^{-1}\Sigma_{h,0}$ and autocovariance $h^{-1}\Sigma_{h,1}$, and hence standard results yield the invariance principle:

$$T_n^{-1/2}\sum_{t=1}^{[T_n r]} h^{-1/2}\xi_{nt} = N_n^{-1/2}\sum_{t=1}^{[T_n r]} \xi_{nt} \Rightarrow W_h(r) \quad \text{as } n \uparrow \infty,$$

where $W_h$ has covariance matrix $\Sigma_h = h^{-1}(\Sigma_{h,0} + \Sigma_{h,1} + \Sigma_{h,1}')$ and $[T_n r]$ denotes the integer part of $T_n r$. Next,

$$\sum_{t=1}^{T_n} S_{nt-1}S_{nt-1}' = T_n\int_0^1 S_{n[T_n r]}S_{n[T_n r]}'\,dr = \frac{N_n}{h}\int_0^1 S_{n[T_n r]}S_{n[T_n r]}'\,dr \tag{B.3}$$

and so

$$\frac{1}{N_n^2}\sum_{t=1}^{T_n} S_{nt-1}S_{nt-1}' = \frac{1}{h}\int_0^1 \frac{S_{n[T_n r]}}{N_n^{1/2}}\,\frac{S_{n[T_n r]}'}{N_n^{1/2}}\,dr$$

has the stated distribution by the CMT. Similarly

$$\sum_{t=1}^{T_n} S_{nt-1} = T_n\int_0^1 S_{n[T_n r]}\,dr = \frac{N_n}{h}\int_0^1 S_{n[T_n r]}\,dr, \tag{B.4}$$

implying that

$$\frac{1}{N_n^{3/2}}\sum_{t=1}^{T_n} S_{nt-1} = \frac{1}{h}\int_0^1 \frac{S_{n[T_n r]}}{N_n^{1/2}}\,dr$$

has the stated distribution. Standard results can then be applied to

$$\frac{1}{T_n}\sum_{t=1}^{T_n} \frac{S_{nt-1}}{h^{1/2}}\,\frac{\xi_{nt}'}{h^{1/2}} = \frac{1}{N_n}\sum_{t=1}^{T_n} S_{nt-1}\xi_{nt}' \quad\text{and}\quad \frac{1}{T_n}\sum_{t=1}^{T_n} \frac{\xi_{nt}}{h^{1/2}}\,\frac{\xi_{nt}'}{h^{1/2}} = \frac{1}{N_n}\sum_{t=1}^{T_n} \xi_{nt}\xi_{nt}'.$$

(b) In this case $\xi_{nt}$ is an MA(1) process with variance matrix $\Sigma_{h_n,0} = h_n\Sigma_0 + O(h_n^2)$ and first-order autocovariance matrix $\Sigma_{h_n,1} = h_n\Sigma_1 + O(h_n^2)$ (using Lemma 2.1). Consider the representation

$$\xi_{nt} = u_{nt} + C_n u_{nt-1}, \qquad E(u_{nt}u_{nt}') = V_{h_n} = O(h_n), \qquad E(u_{nt}u_{nt-j}') = 0 \ (j \ne 0).$$

It follows that the following relationships must hold:

$$\Sigma_{h_n,0} = V_{h_n} + C_n V_{h_n} C_n', \qquad \Sigma_{h_n,1} = C_n V_{h_n},$$

implying that $C_n = O(1)$. Because $h_n^{-1/2}u_{nt}$ is i.i.d.$(0, h_n^{-1}V_{h_n})$ it then follows that

$$T_n^{-1/2}\sum_{t=1}^{[T_n r]} h_n^{-1/2}u_{nt} = N_n^{-1/2}\sum_{t=1}^{[T_n r]} u_{nt} \Rightarrow W_u(r) \quad\text{as } n \to \infty,$$

where $W_u(r)$ is Brownian motion with covariance matrix $V = \lim_{n\to\infty} h_n^{-1}V_{h_n}$. Now

$$\sum_{t=1}^{[T_n r]} \xi_{nt} = (I + C)\sum_{t=1}^{[T_n r]} u_{nt} + (C_n - C)\sum_{t=1}^{[T_n r]} u_{nt} + v_{nt} - v_{n0},$$

where $C = \lim_{n\to\infty} C_n$ and $v_{nt} = -C_n u_{nt}$, implying that

$$N_n^{-1/2}\sum_{t=1}^{[T_n r]} \xi_{nt} = (I + C)N_n^{-1/2}\sum_{t=1}^{[T_n r]} u_{nt} + (C_n - C)N_n^{-1/2}\sum_{t=1}^{[T_n r]} u_{nt} + N_n^{-1/2}(v_{nt} - v_{n0})
= (I + C)N_n^{-1/2}\sum_{t=1}^{[T_n r]} u_{nt} + o_p(1) \Rightarrow W(r) \quad\text{as } n \to \infty,$$

where $W(r)$ is Brownian motion with covariance $\Omega = (I + C)V(I + C)' = \Sigma_0 + \Sigma_1 + \Sigma_1'$. From (B.3) and (B.4) we find that

$$\frac{h_n}{N_n^2}\sum_{t=1}^{T_n} S_{nt-1}S_{nt-1}' = \int_0^1 \frac{S_{n[T_n r]}}{N_n^{1/2}}\,\frac{S_{n[T_n r]}'}{N_n^{1/2}}\,dr \quad\text{and}\quad \frac{h_n}{N_n^{3/2}}\sum_{t=1}^{T_n} S_{nt-1} = \int_0^1 \frac{S_{n[T_n r]}}{N_n^{1/2}}\,dr,$$

leading to the stated results. For the final terms we obtain

$$\frac{1}{T_n}\sum_{t=1}^{T_n} \frac{S_{nt-1}}{h_n^{1/2}}\,\frac{\xi_{nt}'}{h_n^{1/2}} = \frac{1}{N_n}\sum_{t=1}^{T_n} S_{nt-1}\xi_{nt}' \quad\text{and}\quad \frac{1}{T_n}\sum_{t=1}^{T_n} \frac{\xi_{nt}}{h_n^{1/2}}\,\frac{\xi_{nt}'}{h_n^{1/2}} = \frac{1}{N_n}\sum_{t=1}^{T_n} \xi_{nt}\xi_{nt}'.$$

(c) Here, because $N_n = N$ is fixed, the results follow straightforwardly from part (b).
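The representation $\xi_{nt} = u_{nt} + C_n u_{nt-1}$ in part (b) determines $(C_n, V_{h_n})$ from the variance and first autocovariance. In the scalar case the invertible solution solves $\gamma_1 c^2 - \gamma_0 c + \gamma_1 = 0$; a sketch (the function name is illustrative):

```python
import math

def ma1_from_autocov(gamma0, gamma1):
    """Recover the invertible MA(1) parameters (c, v) solving
    gamma0 = (1 + c^2) v and gamma1 = c v (gamma1 != 0 assumed)."""
    disc = math.sqrt(gamma0 ** 2 - 4 * gamma1 ** 2)
    c = (gamma0 - disc) / (2 * gamma1)   # root with |c| < 1
    return c, gamma1 / c

# round trip: start from (c, v) = (0.4, 1.5), rebuild from the autocovariances
c, v = ma1_from_autocov((1 + 0.4 ** 2) * 1.5, 0.4 * 1.5)
assert abs(c - 0.4) < 1e-12 and abs(v - 1.5) < 1e-12
```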
The Econometrics Journal (2011), volume 14, pp. 186–203. doi: 10.1111/j.1368-423X.2010.00332.x
Misspecification in moment inequality models: back to moment equalities?

MARIA PONOMAREVA† AND ELIE TAMER‡

†Department of Economics, University of Western Ontario, London, Ontario, Canada, N6A 5C2. E-mail: [email protected]
‡Department of Economics, Northwestern University, 2001 Sheridan Road, Evanston, IL 60208, USA. E-mail: [email protected]

First version received: December 2008; final version accepted: July 2010
Summary. Consider the linear model E[y | x] = x'β, where one is interested in learning about β given data on y and x and when y is interval measured; that is, we observe ([y0, y1], x) such that P(y ∈ [y0, y1]) = 1. Moment inequality procedures use the implication E[y0 | x] ≤ x'β ≤ E[y1 | x]. As compared to least squares in the classical regression model, estimates obtained using an objective function based on these moment inequalities do not provide a clear approximation to the underlying unobserved conditional mean function. Most importantly, under misspecification it is not unusual that no parameter β satisfies the previous inequalities for all values of x, and hence sets of minimizers of an objective function based on these moment inequalities are typically tight. We construct set estimates for β in the linear model that have a clear interpretation when the model is misspecified. These sets are based on moment equality models. We illustrate these sets and compare them to estimates obtained using moment inequality-based methods. In addition to the linear model with interval outcomes, we also analyse the binary missing data model with a monotone instrumental variable (MIV) assumption; we find there that when this assumption is misspecified, bounds can still be non-empty and can differ from parameters obtained via maximum likelihood. We also examine a bivariate discrete game with multiple equilibria. In sum, misspecification in moment inequality models is of a different flavour than in moment equality models, and so care should be taken with (1) the interpretation of the estimates and (2) the size of the 'identified set'.

Keywords: Misspecification, Moment inequality models.
1. INTRODUCTION

It is well known, but perhaps often ignored, that the interpretation of estimates in parametric or semi-parametric models is conditioned on the correct specification of that model. When models are approximations of the true data-generating process, estimates provide a particular approximation of the underlying objects of interest. In misspecified moment equality models, for example, parameter estimates are ones that minimize some distance of the vector of moments to zero.
The interpretation of these parameters from a moment inequality model when the model is misspecified is the subject of this paper. Recently, there has been a flurry of work in econometrics related to inference in partially identified or incomplete models.1 These incomplete ‘models’ are based on a strictly smaller set of assumptions than their complete model counterpart in the hopes that conclusions about those parameters obtained using this partial list of assumptions are more robust. The price to pay is that a model with fewer assumptions is only able to restrict parameters to a non-trivial set—or that the parameter of interest is partially identified. Parametric (or semi-parametric) incomplete models consist of a set of maintained assumptions, some are non-testable and others can be rejected in the data. We point out, through a series of simple examples, that care should be taken when interpreting and comparing parameter estimates in partially identified moment inequality models. In essence, one needs to first define the object of interest, and then ensure that the objective function used to estimate this parameter allows for a clear interpretation in the presence of misspecification. More crucially, care should be taken when comparing parameters obtained from moment inequality to others obtained using, say, moment equality models. In addition, a consequence of misspecification is that partially identified parametric models based on moment inequalities deliver set estimates of the object of interest that are tight (meaning that those set estimates are quite small) when estimated using real data. Under misspecification, it is possible that no parameter vector obeys all the moment inequalities and so what is presented, heuristically, is the set of parameters that minimizes a certain criterion function. 
These small identification 'regions' should not be interpreted as coming from inference under weaker assumptions; rather, this behaviour ought to be investigated further.2 So, even though the motivation behind the partially identified literature has been to gain robustness against certain (or almost all) parametric assumptions that are esoteric to the problem at hand, those more robust (but still misspecified) partially identified models are delivering estimates that are as tight as ones obtained with models based on more assumptions. The main reason for obtaining these tight estimates, as we see below in the examples, is that estimates with these moment inequality models are not comparable to ones that are obtained with a more complete model, since under misspecification these two sets of parameters are estimating different objects (and hence comparing the estimates from these two sets of models is not appropriate). Finally, we shed light on the choice of objective function in moment inequality models. The choice of the objective function is not motivated by a power perspective (or by obtaining the smallest set that obeys a confidence property), but by the fact that objective functions, as is well known, provide estimates of different objects when the model is misspecified. Hence, it is important that one thinks of the misspecification issue also when choosing the objective function.

The paper is organized as follows. Section 2 discusses our main ideas in the context of the ubiquitous linear model. We provide a least squares (LS) set that coincides with the identified set for the best linear predictor constructed by Stoye (2006) and show that this set contains the set of parameters that provide the best approximation to the underlying conditional mean function. Section 3 discusses a non-parametric missing data model, and then adds a monotone instrumental variable (MIV) assumption. There, the set we construct is the one that maximizes a well-defined likelihood (that partially identifies the parameter of interest). Interestingly, we show through a simple example that when the MIV assumption is false, the set that one derives based on moment inequalities can be non-empty and differs from the maximum likelihood estimator (MLE) set, while both of these sets coincide if the model is well specified. Section 4 discusses a discrete game example. Throughout, we clarify the issue raised above. Section 5 concludes.

1 The literature can be classified into identification and estimation in partially identified models. To get a flavour of the identification results, see Manski (1995, 2007), Haile and Tamer (2003) and Beresteanu et al. (2008). For a flavour of the estimation and confidence regions literature, see Imbens and Manski (2004), Chernozhukov et al. (2007), Molinari (2008), Romano and Shaikh (2008), Andrews and Soares (2010), Canay (2010) and others.
2 Typically, papers using moment inequalities report tight estimates and confidence regions, and so at first look these estimates appear as an indicator that the modelling strategy is 'succeeding', in that it is delivering tight estimates with fewer assumptions than required for point identification.
2. LINEAR MEAN REGRESSION WITH INTERVAL DATA

We consider in detail the problem of inference on β = (β0, β1) ∈ R² in the model

$$E[y \mid x] = \beta_0 + \beta_1 x. \tag{2.1}$$

The analysis can be generalized to mean regressions with k regressors. The model partially identifies β since we assume that we observe ([y0, y1], x), where P(y ∈ [y0, y1]) = 1 and where y0 is smaller than y1. We start by considering a complete version of the model in (2.1) above (without censoring) and then introduce censoring and analyse what different partial-identification-based methods estimate when this model is not well specified; that is, when E[y | x] is not linear in x. Throughout the analysis, it is important to keep in mind that the object of interest is the parameter vector β in (2.1) above: if the model is well specified, β0 + β1x is the true conditional mean function, while if the model is misspecified, β0 + β1x represents the best linear approximation to the conditional mean function, which is the ultimate object of interest in this paper.

In the presence of a random sample of observations on y and x, we can use least squares to obtain consistent estimates of β0 and β1. This model is making an important assumption about the relationship between y and x, mainly that the conditional mean function of y given x is linear in x. Hence, estimating β0 and β1 is sufficient to learn this function. It is also well known that if the model above is misspecified, that is, E[y | x] is not linear in x, then β0 + β1x is the best linear predictor of y given x under square loss. This line can equivalently be shown to be the straight line that comes closest, in a mean squared sense, to the true conditional mean function; that is, with the slope and the intercept defined as

$$\beta = (\beta_0, \beta_1) = \arg\min_{b_0, b_1} E\big[(E[y \mid x] - b_0 - b_1 x)^2\big].$$

Now, suppose that we do not observe y; rather, we observe [y0, y1] such that P(y ∈ [y0, y1]) = 1. In this situation, there are a few approaches to inference on β = (β0, β1). One approach is parametric and is based on making an assumption on the censoring mechanism and the underlying distribution of y (conditional on x). For example, if we have fixed censoring at zero (say), then a normality assumption on the conditional distribution of y | x will allow us to use a likelihood-based approach to consistently estimate β (this is the classic Tobit model).3 Under misspecification, the interpretation of parameter estimates changes and depends on the objective function used. In the Tobit situation, the MLE is a quasi-likelihood estimator (White, 1994), and the parameter estimates then are ones that minimize the distance (in the Kullback–Leibler or entropy sense) between the parametric likelihood (model) and the non-parametric likelihood (data).

3 There are other 'semi-parametric' approaches also, notably the censored LAD model of Powell (1984) and its generalizations to random censoring by Honoré et al. (2002).
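The best-linear-predictor characterization above is easy to illustrate numerically: with a quadratic conditional mean on a small design, the arg min coincides with the least squares projection coefficients (a minimal sketch; the data are made up):

```python
import numpy as np

# Quadratic conditional mean E[y|x] = x^2 on the four-point design {0,1,2,3}
# (uniform weights); the best linear predictor solves the arg min display above.
x = np.array([0.0, 1.0, 2.0, 3.0])
ey = x ** 2
b1 = np.mean((x - x.mean()) * ey) / x.var()   # cov(x, E[y|x]) / var(x)
b0 = ey.mean() - b1 * x.mean()
assert abs(b1 - 3.0) < 1e-12 and abs(b0 + 1.0) < 1e-12
```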
Estimates from these parametric models are more transparent, but researchers must pay attention to the sensitivity of their estimates to ad hoc assumptions.4 Under general forms of censoring, where all the information that is available is in terms of upper and lower bounds on y (y0 and y1), one can still use a parametric assumption on the censoring process (for example, assume that y is the middle point of [y0, y1]) and then run least squares. Another approach is semi-parametric, where one does not make assumptions on the censoring mechanism (y can be anywhere in [y0, y1]) but allows for partial identification (see, e.g., Manski and Tamer, 2002, for more on this approach). So, if the model is well specified, this means that, for all x,

$$E[y_0 \mid x] \le \beta_0 + \beta_1 x \le E[y_1 \mid x]. \tag{2.2}$$

The above is a canonical example of a model based on inequality restrictions, or a moment inequality model. One objective function that can be used to make inference on the parameters is the following:

$$Q_{mmd}(b_0, b_1) = E_x\Big[\big(E[y_0 \mid x] - b_0 - b_1 x\big)_+^2 + \big(E[y_1 \mid x] - b_0 - b_1 x\big)_-^2\Big], \tag{2.3}$$

where (u)+ = max{u, 0} and (u)− = min{u, 0}. Notice that Qmmd(b0, b1) ≥ 0 for all (b0, b1) ∈ R². In case the model is well specified, that is, (2.1) holds, we can easily see that Qmmd(β0, β1) = 0. Moreover, with censoring, this objective function is minimized on a non-trivial set of parameters, each of which is a candidate for a conditional mean function. If the model is well specified, the inferential procedure based on minimizing (2.3) will provide the set of (linear) conditional mean functions that are consistent with the data. Hence, one expects, in a generic censoring situation (certainly if the censoring is heavy), that the arg min of the above function is not unique. So for inference in these situations, one can use methods developed for models that are not point identified. One such approach is provided by Chernozhukov et al. (2007).

2.1. Inference under misspecification

The problem arises if the model is misspecified; that is, when the true but unobserved conditional mean of y is not linear in x. Again, the object of interest is still the vector β, the best linear approximation to the (unknown) conditional mean function. Note that minimizers of (2.3) do not have the interpretation of parameters that minimize the squared distance between the true conditional mean function (that is censored) and the set of linear functions. The arg mins of (2.3) are parameters that minimize the distance between a certain non-linear region and the set of linear functions. The non-linear region is the one bounded above and below by E[y0 | x] and E[y1 | x].
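The objective in (2.3) has a direct sample analogue once the bound functions E[y0 | x] and E[y1 | x] have been estimated at the observed design points; the sketch below assumes those estimates are supplied (the function name q_mmd is illustrative):

```python
import numpy as np

def q_mmd(b0, b1, x, ey0_x, ey1_x):
    """Sample analogue of (2.3): x is the design, ey0_x and ey1_x are
    (estimates of) E[y0|x] and E[y1|x] at those design points."""
    lo = ey0_x - b0 - b1 * x  # positive when the line violates the lower bound
    hi = ey1_x - b0 - b1 * x  # negative when the line violates the upper bound
    return np.mean(np.maximum(lo, 0.0) ** 2 + np.minimum(hi, 0.0) ** 2)

x = np.array([0.0, 1.0, 2.0])
assert q_mmd(1.0, 1.0, x, x, x + 2) == 0.0   # a line inside the band: zero objective
```

Any line lying weakly between the two bound functions attains the minimum value of zero, which is exactly the non-uniqueness discussed above.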
This is illustrated in Figure 1, where the unobserved conditional mean function, which is non-linear, is graphed alongside the upper and lower conditional mean functions. One can see that there does NOT exist a straight line that obeys (2.2) for all x, and hence what comes out of minimizing (2.3) is a parameter vector that minimizes the distance between a line and the upper and lower bounds. See Theorem 2.2 for more on this. Hence, with misspecification, the vector β that minimizes (2.3) does not have the same interpretation as the one that we get from the same misspecified least squares model in the absence of censoring. The moment inequality model uses the linearity assumption (2.1) to obtain the bounds in (2.2). The set of minimizers of (2.3) is small (and sometimes it may even be a singleton, as would be the case in Figure 1).5 So, when the model is misspecified, we can get tight estimates even when no assumptions are made on the censoring mechanism. However, these estimates do not compare to β, the best linear approximation to the conditional mean function E[y | x]. So, we are getting (tight) estimates of a different object.

[Figure 1 plots the bounds E[y0 | x] and E[y1 | x] together with the true, non-linear E[y | x].]
Figure 1. Linear mean regression with interval data under misspecification.

The next section provides a method to construct estimates of the set of parameters that minimize the squared distance between a linear function and the (unobserved) conditional mean function. We do that by first constructing the set of all conditional mean functions that are consistent with the data (this will be made precise below) and then, for each of these functions, finding the least squares estimates. This set of estimates is termed the Least Squares Set and consists of the set of parameters that are best linear approximations to the (non-identified) conditional mean function.

2.2. Construction of a least squares (LS) set

We first describe the method heuristically before providing a practical estimator. Suppose that one can construct all the possible models for y | x that obey the assumptions above; that is, P(y ∈ [y0, y1]) = 1. This gives a set of conditional probability distributions P(y | x) that is consistent with the data. For each candidate model in this set, collect the corresponding slope and intercept that minimize the mean squared error by the method of least squares. This is the set of interest. This procedure completes the model to obtain a set of fully specified moment equality models, each of which leads to an estimate of β that has the same interpretation under misspecification that a model without censoring has. This will be illustrated for the linear model in (2.1) above.

4 There seems to be a slight confusion in the literature on the difference between robustness and sensitivity. Sensitivity analysis asks the question of whether, in your data, the estimates you obtained depend on the underlying assumptions, and so sensitivity analysis is linked to a particular data set. For example, if in a data set the least squares estimator is close to the LAD estimator, then we say that the estimates are not sensitive to assumptions on the conditional distribution of y | x; but, of course, the LAD estimator is more robust (it has a bounded influence function in the statistical sense), a statement that one makes before confronting the data.
This intuition of completing the model holds in other setups. We will highlight these below in other examples. In the linear model above, there is a simple procedure that we can use to construct the least squares set estimates for the best linear approximation to the (unobserved) conditional mean function E[y | x]. Consider a random variable λ with support on [0, 1] and then construct yλ as follows: yλ = y0λ + y1(1 − λ). The distribution of λ conditional on all other variables ((y0, y1, x)), Fλ, is left unspecified, and it belongs to the set Λ of probability distributions on [0, 1]. We can then define

$$\beta(F_\lambda) = (\beta_0(F_\lambda), \beta_1(F_\lambda)) = \arg\min_{b_0, b_1} E\big[(y_\lambda - b_0 - b_1 x)^2\big]. \tag{2.4}$$

The least squares (LS) set is

$$\Theta_{LS} = \{\beta(F_\lambda) : F_\lambda \in \Lambda\}. \tag{2.5}$$

5 For this particular case in the picture, the minimum of (2.3) is unique and the modified minimum distance (MMD) objective function is strictly positive. This is also the case for the misspecified model described in Section 2.4. However, we do not rule out the possibility of having an arg min set that is not a singleton and where the MMD objective is strictly positive.
We can derive the asymptotic distribution of the stochastic process β(·) indexed by bounded functions as follows. We examine the case of the slope parameter β1(·) for simplicity. Let μx and μ̂x be the population mean and sample mean of x, respectively. Then,

$$\beta_1(F_\lambda) = \frac{1}{\mathrm{var}(x)}\,E_x\big[(x - \mu_x)\{E(y_1 \mid x) + E(y_0\lambda \mid x) - E(y_1\lambda \mid x)\}\big]. \tag{2.6}$$

To each distribution function Fλ there corresponds the conditional expectation function g(u, v0, v1) = E(λ | x = u, y0 = v0, y1 = v1), which is bounded between 0 and 1, so that 0 ≤ g(·, ·, ·) ≤ 1. Similarly, any such function g defines some β(Fλ), so

$$\beta_1(F_\lambda) \equiv \beta_1(g) = \frac{1}{\mathrm{var}(x)}\,E_x\big[(x - \mu_x)\{E(y_1 \mid x) + E(y_0\,g(x, y_0, y_1) \mid x) - E(y_1\,g(x, y_0, y_1) \mid x)\}\big] \tag{2.7}$$

and its sample analogue estimator is

$$\hat\beta_1(g) = \frac{1}{\widehat{\mathrm{var}}_n(x)}\,\hat E_{x,n}\big[(x - \hat\mu_x)\{\hat E_n(y_1 \mid x) + \hat E_n(y_0\,g(x, y_0, y_1) \mid x) - \hat E_n(y_1\,g(x, y_0, y_1) \mid x)\}\big]. \tag{2.8}$$
Here we use the conventional notation for sample analogues of the population means: given the i.i.d. sample (xi, y0i, y1i) from the distribution of (x, y0, y1), we define Ê_{x,n}[Ên(y0 g(x, y0, y1) | x)] = (1/n) Σ_{i=1}^n y0i g(xi, y0i, y1i). Similarly, Ê_{x,n}[(x − μ̂x)Ên(y1 | x)] = (1/n) Σ_{i=1}^n (xi − μ̂x)y1i and var̂n(x) = (1/n) Σ_{i=1}^n (xi − μ̂x)².

Let Zn(g) = √n(β̂1(g) − β1(g)) and g = g1 − g2. Let (G, ρ) be the metric space where G is the set of all functions g : R³ → R such that 0 ≤ g(z) ≤ 1, with z = (u, v0, v1), and the metric ρ is defined by ρ(g1, g2) = ||g1 − g2|| = sup_z |g1(z) − g2(z)|. Then, by linearity, |Zn(g1) − Zn(g2)| = |Z̃n(g)|, where

Z̃n(g) = √n [ (1/var̂n(x)) Ê_{x,n}[(x − μ̂x){Ên(y0 g | x) − Ên(y1 g | x)}] − (1/var(x)) E[(x − μx){E(y0 g | x) − E(y1 g | x)}] ].   (2.9)
M. Ponomareva and E. Tamer
So, we have sup_{||g|| ≤ δ} |Z̃n(g)| = δ|Op(1)|. Therefore, for every ε > 0 and every η > 0 there exists δ > 0 such that

lim sup_{n→∞} P( sup_{||g|| ≤ δ} |Z̃n(g)| > η ) < ε.   (2.10)
This implies stochastic equicontinuity of Zn(g) = √n(β̂1(g) − β1(g)). Together with standard CLT assumptions and the observation that the metric space (G, ρ) is totally bounded, stochastic equicontinuity of Zn(g) warrants

√n(β̂(Fλ) − β(Fλ)) ⇒ Z(Fλ),   (2.11)

where Z(·) is a Gaussian process indexed by Fλ ∈ Λ.

2.2.1. Extreme points. The extreme points of the least squares set can be estimated by exploiting the form of the solution to the optimization problem in (2.4).6 Starting with the slope, we know that

β1(Fλ) = E[(x − μx)yλ]/var(x) = cov(yλ, x)/var(x).   (2.12)

Hence, to obtain the largest and the smallest values of β1 that belong to Θ, we need to compute the largest and the smallest values of the numerator, since the denominator is positive and can be consistently estimated. It is interesting to note that the above bounds are the same as the ones derived for best linear predictors under outcome censoring by Stoye (2006). Moreover, the bounds for cases where x is a vector can be constructed similarly. For example, suppose x = (x1, x2). Then, bounds on β1, the slope coefficient associated with x1, can be constructed after first premultiplying the regression by the projection matrix for x2, hence transforming the regression into one with a scalar regressor.

THEOREM 2.1. The extreme points of the set Θ are the following:

β1max = (1/var(x)) (E[(x − μx)y1 1[(x − μx) ≥ 0]] + E[(x − μx)y0 1[(x − μx) < 0]]),
β1min = (1/var(x)) (E[(x − μx)y0 1[(x − μx) ≥ 0]] + E[(x − μx)y1 1[(x − μx) < 0]]).   (2.13)

For the intercept, the extreme points are

β0max = (1/var(x)) E[(var(x) − (x − μx)μx) 1[(var(x) − (x − μx)μx) ≥ 0] y1]
      + (1/var(x)) E[(var(x) − (x − μx)μx) 1[(var(x) − (x − μx)μx) < 0] y0],
β0min = (1/var(x)) E[(var(x) − (x − μx)μx) 1[(var(x) − (x − μx)μx) ≥ 0] y0]
      + (1/var(x)) E[(var(x) − (x − μx)μx) 1[(var(x) − (x − μx)μx) < 0] y1].   (2.14)
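The slope formulas in (2.13) can be checked by brute force on a small synthetic sample (the data below are illustrative, not from the paper): since the slope is linear in each yλi, its extremes over all selections yi ∈ {y0i, y1i} are attained exactly at the indicator rule in (2.13).

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 8                                     # small n: all 2**n selections enumerable
x = rng.normal(size=n)
y0 = rng.normal(size=n)
y1 = y0 + rng.uniform(0.5, 2.0, size=n)   # interval outcomes with y0 <= y1

xd = x - x.mean()

def slope(y):
    return xd @ y / (xd @ xd)

# Sample analogue of (2.13): pick y1 where x is above its mean, y0 below.
b1_max = slope(np.where(xd >= 0, y1, y0))
b1_min = slope(np.where(xd >= 0, y0, y1))

# Brute force over every pointwise selection y_i in {y0_i, y1_i}:
slopes = [slope(np.where(np.array(sel, dtype=bool), y1, y0))
          for sel in itertools.product([0, 1], repeat=n)]
assert np.isclose(max(slopes), b1_max) and np.isclose(min(slopes), b1_min)
```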
6 We focus on the outer extreme points of the identified set for ease of computation. The shape of the identified set in this linear model depends on the support of x: it is a polygon if x has finite support, and it is strictly convex if x is continuously distributed. The set is symmetric around E(xx′)⁻¹E(xȳ), where ȳ = (y0 + y1)/2. This is all that can be said about the sharp set without further information on the distribution of (x, y0, y1).
As we can see, the above theorem provides the extreme points of the LS set in (2.5). Each parameter in the LS set represents a best linear approximation to some conditional expectation function that lies between the upper and lower conditional expectation functions. So, any two conditional expectations E[yλ1 | x] and E[yλ2 | x] that are consistent with the model are treated equally, in that β(Fλ1) and β(Fλ2) both belong to Θ. To characterize the whole LS set rather than its extreme points, we can use the support function approach in Stoye (2006) and note that for any vector c = (c0, c1) ∈ R² the sharp upper bound on c0β0(Fλ) + c1β1(Fλ) is given by

(1/var(x)) E[(c0(var(x) − (x − μx)μx) + c1(x − μx)) 1[c0(var(x) − (x − μx)μx) + c1(x − μx) ≥ 0] y1]
+ (1/var(x)) E[(c0(var(x) − (x − μx)μx) + c1(x − μx)) 1[c0(var(x) − (x − μx)μx) + c1(x − μx) < 0] y0].

2.3. The MMD parameter

The previous section derives bounds on the best approximation to the conditional expectation function that is unobserved but is known to lie between observed upper and lower bounds. Here, we provide a link between those best approximations to the conditional expectation and the minimizers of the MMD objective function.

THEOREM 2.2. Let

(β0mmd, β1mmd) ∈ arg min_{b0,b1} Qmmd(b0, b1).

Let h0(x), h1(x) be non-negative functions, and define the function Q as follows:

Q(b0, b1, h0, h1) = E[(E[y0 | x] + h0(x) − b0 − b1x)² + (E[y1 | x] − h1(x) − b0 − b1x)²].   (2.15)

Then,

(β0mmd, β1mmd, h0*(x), h1*(x)) ∈ arg min_{h0(x)≥0, h1(x)≥0, b0, b1} Q(b0, b1, h0, h1),

where h0*(x) = (β0mmd + β1mmd x − E[y0 | x])+ and h1*(x) = (E[y1 | x] − β0mmd − β1mmd x)+.

The minimizers of the MMD objective function also minimize the sum of the two approximation errors in (2.15). One term measures the squared distance between the upper conditional expectation E[y1 | x] and a partly linear function, and the other term is analogous for E[y0 | x]. So, the minimizers of the MMD objective function minimize the sum of squared approximation errors in (2.15), and not a least squares-like objective function; naturally, then, the interpretations of the two under misspecification differ.

2.3.1. What not to compare. In principle, the LS set and the MMD set are both interesting objects to report. For example, the LS set is attractive since it contains parameters that are directly comparable to the parameter one would get if no censoring occurred. This is relevant if the objective of the empirical exercise is the underlying best approximation to the conditional mean function.
This is exactly what one would be after when running a linear regression in the absence of censoring. The MMD set, on the other hand, contains the truth when the model is well specified, but otherwise this set contains parameters that are defined as in Theorem 2.2. Also, when the minimized MMD objective is strictly positive, we know that the model is misspecified and that the conditional expectation is not linear. On the other hand, a zero MMD objective does not indicate that the model is well specified. With misspecification, the two sets are always different.7 Estimates obtained using a complete model (through some least squares procedure such as regressing the midpoint of y0 and y1 on x) should not be compared to estimates from the MMD set, since each is estimating a different object. Care should be taken with the interpretation, and especially testing, in moment inequality models.

2.4. Monte Carlo experiments

For purposes of exposition, we ran a set of Monte Carlo experiments that compares the identified set one obtains using MMD, which relies on the model being well specified, to the least squares method we propose for the linear model. The calculations are done at the population level. In particular, for the MMD model, we plot the contour set that minimizes (2.3) above. The minimum of the objective function is strictly positive if the model is misspecified. First, for a well-specified model, the conditional mean of y | x is a linear function of x. For the misspecified model, there does not exist a straight line that fits between the upper and lower bounds, so that the MMD objective is strictly positive:8

E[y | x] = 5 + x, where x ∼ Uniform[0, 5]   (well-specified model),
E[y | x] = 5 + (x − 2)³ − x², where x ∼ Uniform[0, 5]   (misspecified model).   (2.16)

In both models, we generate the upper bound E[y1 | x] by adding f1(x) = 2.5 + 0.5x² > 0 to E[y | x] and the lower bound E[y0 | x] by subtracting f0(x) = 0.5 + 0.2x² > 0 from E[y | x].
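The two designs in (2.16) can be checked numerically. The sketch below assumes (since (2.3) is not reproduced in this excerpt) that the population MMD objective takes the squared-violation form implied by Theorem 2.2, Qmmd(b0, b1) = E[((E[y0 | x] − b0 − b1x)+)² + ((b0 + b1x − E[y1 | x])+)²], and grid-searches over lines: in the well-specified design the minimum is exactly zero, while in the misspecified design no line fits between the envelopes and the minimum is strictly positive.

```python
import numpy as np

xs = np.linspace(0.0, 5.0, 501)          # x ~ Uniform[0, 5]: average over a grid
f1 = 2.5 + 0.5 * xs**2                   # added to E[y|x] to form E[y1|x]
f0 = 0.5 + 0.2 * xs**2                   # subtracted to form E[y0|x]

def qmmd(mean_y, b0, b1):
    """Squared violations of E[y0|x] <= b0 + b1*x <= E[y1|x], averaged over x."""
    line = b0 + b1 * xs
    upper, lower = mean_y + f1, mean_y - f0
    return np.mean(np.maximum(lower - line, 0.0)**2
                   + np.maximum(line - upper, 0.0)**2)

def min_qmmd(mean_y):
    b0s, b1s = np.meshgrid(np.linspace(-10, 15, 126), np.linspace(-5, 6, 111))
    return min(qmmd(mean_y, b0, b1)
               for b0, b1 in zip(b0s.ravel(), b1s.ravel()))

well = 5.0 + xs                          # well-specified design in (2.16)
mis = 5.0 + (xs - 2.0)**3 - xs**2        # misspecified design in (2.16)

assert min_qmmd(well) == 0.0             # the true line (5, 1) fits between the bounds
assert min_qmmd(mis) > 1e-3              # no line fits: the minimum is strictly positive
```

The strictly positive minimum in the misspecified design is exactly the "tight estimates under misspecification" phenomenon discussed below.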
The left-hand panel in Figure 2 provides a graph of the conditional mean function in the case of a well-specified model, along with the conditional expectations of the upper and lower envelopes, E[y1 | x] and E[y0 | x]. The right-hand panel of the same figure provides an example of a model in which the conditional expectation function is misspecified, along with upper and lower envelopes on it. There is a set of lines that fit between the upper and lower curves in the left-hand display. The set of slopes and intercepts corresponding to this set of lines is the identified set. However, from the second display we see that there is NO line that fits entirely (at least on [0, 5]) between the upper and lower envelopes. So, here, one would expect that an MMD estimator would result in a unique parameter. In Figure 3, we first plot the MMD set and the LS superset [β1min, β1max] × [β0min, β0max] in the well-specified case. As one can see, the true parameter is contained in the MMD set, which is smaller and contained within the LS superset. The MMD set is the set on which the population MMD objective is equal to zero. It contains the set of lines that obey the model; that is, that fit within the upper and lower

7 Under misspecification, the best linear approximation to, for example, E(y0 | x) or E(y1 | x) belongs to the LS set, but does not belong to the MMD set.
8 This model illustrates the case where both E[y0 | x] and E[y1 | x] are non-monotone.
Figure 2. Well-specified model (left) and misspecified model (right). [Each panel plots E[y | x] together with the envelopes E[y1 | x] and E[y0 | x] over x ∈ [0, 5].]
Figure 3. LS versus MMD sets in the well-specified case. [Slope on the horizontal axis and intercept on the vertical axis; the truth lies in the MMD set, which is contained in the LS superset.]
bound functions. On the other hand, the extreme points of the LS set are computed as we did above, and this set contains the best linear approximations to all conditional expectations, both linear and non-linear, that fit between E[y0 | x] and E[y1 | x]. So, obviously, this set is bigger than the MMD set in the well-specified case. This set will shrink to the MMD set if we keep only parameters (b0*, b1*) such that E[(E[yλ | x] − b0* − b1*x)²] = 0. In the case of misspecification, we plot, in Figure 4, the LS set and the MMD set along with the true parameter vector β, which is the best linear approximation to the (unknown) conditional expectation function. The important feature this plot highlights is how small the MMD set is. In fact, with minor changes in the design, it can be reduced to basically a point. If an empirical worker uses
Figure 4. LS and MMD in the misspecified case. [Slope on the horizontal axis and intercept on the vertical axis; the truth lies outside the small MMD set, within the wider LS superset.]
the MMD objective function, he or she will obtain tight estimates even in the presence of censoring, which runs contrary to what one would expect in models with interval-censored outcomes. The MMD set estimates its parameter well, but because of misspecification that parameter is not comparable to β. Notice that the LS set, which estimates the set of best linear approximations to the conditional expectation functions that obey the model and are consistent with the data, is wide.
3. MORE EXAMPLES

We provide other illustrative examples of the issues raised above. The first example examines a non-parametric partially identified model where no testable assumptions are made. There, we can see that the MMD procedure and the LS procedure give the same answers; this is because, in non-parametric models, there is no scope for misspecification. Then, we add an MIV assumption and characterize cases where simple models with inequality restrictions can give different estimates than maximum likelihood-based estimates. The second example is a simple game with multiple equilibria. Again, here we provide inference procedures that are based on the likelihood (as opposed to moment inequalities) and that have a natural interpretation under misspecification.

3.1. Non-parametric missing data model

Here, consider the problem of learning the mean β of a binary variable y when we only observe y if z = 1. Hence, we have

β = P(y = 1) = P(y = 1 | z = 1)P(z = 1) + P(y = 1 | z = 0)P(z = 0)
  = P(y = 1 | z = 1)P1 + α(1 − P1).   (3.1)
Figure 5. Identified set for (β, α) in (3.1). [α on the horizontal axis, from 0 to 1, and β on the vertical axis; the set is the line segment from (0, P1P(y = 1 | z = 1)) to (1, P1P(y = 1 | z = 1) + (1 − P1)).]
In the complete model where y is also observed when z = 0, one can use maximum likelihood to get β = P(y = 1), and hence the MLE for β is the sample choice probability. In addition, this is a well-specified model since it is non-parametric. Now, when we only observe y if z = 1, one possible parametrization of the likelihood is in terms of two parameters: β, the parameter of interest, and α = P(y = 1 | z = 0). The observed data are y (when z = 1) and z, and hence the MLE maximizes the following likelihood:

L(β, α) = E[ y log( (β − (1 − P1)α)/P1 ) + (1 − y) log( (P1 − β + (1 − P1)α)/P1 ) ].

Maximizing this likelihood is meaningful under misspecification also, since it provides the parameters that minimize the entropy distance (or the Kullback–Leibler Information Criterion, KLIC) between the true observed data density and the model density. Again, here the model is non-parametric and so well specified, but the arg max of the likelihood is not unique; that is, the parameter vector (α, β) is not point identified. Note here that we do not use moment inequalities. Rather, we use a parametrization of the likelihood of the observed data (there are other parametrizations), and so the maximizers of the above function L are the set

{(b, a) ∈ [0, 1]² : P(y = 1 | z = 1) = (b − a(1 − P1))/P1}.

This set is the interval graphed in Figure 5. In particular, it maps into the following interval for β:

β ∈ [P1P(y = 1 | z = 1), P1P(y = 1 | z = 1) + 1 − P1],

which is exactly the Manski worst-case bound for β. This case is again not very interesting, since the model is non-parametric and so there is no scope for misspecification.
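This equivalence between the arg max set of L and the Manski worst-case bound can be illustrated numerically. The sketch below uses hypothetical values P1 = 0.6 and P(y = 1 | z = 1) = 0.5 (not from the paper) and recovers the β-range of the likelihood maximizers by a grid search over (b, a):

```python
import numpy as np

P1 = 0.6          # P(z = 1): hypothetical value for illustration
p_obs = 0.5       # P(y = 1 | z = 1): hypothetical value

def loglik(b, a):
    """Expected log-likelihood L(b, a); the implied P(y=1 | z=1) is
    (b - a*(1 - P1)) / P1 and must lie in (0, 1)."""
    p = (b - a * (1.0 - P1)) / P1
    if not 0.0 < p < 1.0:
        return -np.inf
    return p_obs * np.log(p) + (1.0 - p_obs) * np.log(1.0 - p)

grid = np.linspace(0.0, 1.0, 401)
vals = np.array([[loglik(b, a) for a in grid] for b in grid])
rows, _ = np.where(vals > vals.max() - 1e-9)    # all (near-)maximizers
b_lo, b_hi = grid[rows].min(), grid[rows].max()

# The beta-range of the arg max set equals the Manski worst-case bound:
lo, hi = P1 * p_obs, P1 * p_obs + (1.0 - P1)    # here [0.3, 0.7]
assert abs(b_lo - lo) < 0.005 and abs(b_hi - hi) < 0.005
```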
3.2. Adding an MIV assumption

Suppose we make an MIV assumption similar to one made in Manski (2003). Namely, we assume that there exists a binary variable v such that P(y = 1 | v = 1) ≥ P(y = 1 | v = 0). This is in contrast to the (usual) IV assumption, which would require statistical independence between y and v. Consider the following definitions:

P1 = P(z = 1),    P0 = P(z = 0),
π0 = P(v = 0 | z = 0),    π1 = P(v = 1 | z = 0),
q0 = P(z = 0 | v = 0),    q1 = P(z = 0 | v = 1),
Δ0 = P(y = 1 | v = 0, z = 1)P(z = 1 | v = 0),
Δ1 = P(y = 1 | v = 1, z = 1)P(z = 1 | v = 1),
β = P(y = 1),    α = P(y = 1 | v = 0, z = 0),    γ = P(y = 1 | v = 1, z = 0).   (3.2)
Tight bounds on β = P(y = 1) under the MIV assumption have been derived by Manski and Pepper (2000) and Manski (2007):

Θ_MIV = [P(v = 0)Δ0 + P(v = 1) max{Δ0, Δ1},  P(v = 0) min{Δ0 + q0, Δ1 + q1} + P(v = 1)(Δ1 + q1)].   (3.3)
If the set Θ_MIV is empty, then the MIV assumption does not hold. However, the reverse is not true in general: a non-empty Θ_MIV does not necessarily imply that the monotonicity assumption holds. Therefore, in the case of a misspecified model, that is, a model where the DGP violates the monotonicity assumption, the above inference procedure may produce non-empty (and sometimes quite tight) bounds that do not cover the true value of β. It is unclear what the interpretation of a non-empty Θ_MIV is in this case. An alternative approach is to construct the maximum likelihood set (MLS) for the parameter β, which allows for a meaningful interpretation even when the MIV assumption does not hold. To do so, we consider the following constrained likelihood maximization problem:

max_{β,μ,ν} E[ y log( (β − P0(π0μ + π1ν))/P1 ) + (1 − y) log( 1 − (β − P0(π0μ + π1ν))/P1 ) ]
s.t.  Δ1 + q1μ ≥ Δ0 + q0ν,  0 ≤ β, μ, ν ≤ 1.

We see that this gives us the following bounds for β:

Θ_MLS = [ P1 min{P(y = 1 | z = 1), (1 − P0M0)/P1} + P0M0,  P1 min{P(y = 1 | z = 1), (1 − P0M0)/P1} + P0M1 ],   (3.4)

where

M0 = max{0, π0(Δ0 − Δ1)/q1}  and  M1 = min{1, π0 + π1(Δ1 − Δ0 + q1)/q0}.   (3.5)
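As a numerical check, the bound formulas (3.3)–(3.5) can be evaluated in exact rational arithmetic using the probabilities of the first example in Section 3.2.1 below; the sketch reproduces the reported bounds Θ_MIV = [0.19, 0.31] and Θ_MLS = [0.35, 0.78].

```python
from fractions import Fraction as F

# Probabilities from the first example in Section 3.2.1, in exact arithmetic.
p_y_v0 = F(3, 4)                      # P(y=1 | v=0, z=0) = P(y=1 | v=0, z=1)
p_y_v1 = F(1, 6)                      # P(y=1 | v=1, z=0) = P(y=1 | v=1, z=1)
p_z1_v0, p_z1_v1 = F(1, 4), F(5, 6)   # P(z=1 | v)
p_v1 = F(1, 5); p_v0 = 1 - p_v1

beta = p_v0 * p_y_v0 + p_v1 * p_y_v1            # P(y=1) = 0.6333... ~ 0.63
d0 = p_y_v0 * p_z1_v0                           # Delta_0 from (3.2)
d1 = p_y_v1 * p_z1_v1                           # Delta_1 from (3.2)
q0, q1 = 1 - p_z1_v0, 1 - p_z1_v1

# MIV bound (3.3):
miv_lo = p_v0 * d0 + p_v1 * max(d0, d1)
miv_hi = p_v0 * min(d0 + q0, d1 + q1) + p_v1 * (d1 + q1)
assert (round(float(miv_lo), 2), round(float(miv_hi), 2)) == (0.19, 0.31)
assert not miv_lo <= beta <= miv_hi             # the tight MIV bound misses beta

# MLS bound (3.4)-(3.5):
p1 = p_v0 * p_z1_v0 + p_v1 * p_z1_v1            # P(z=1)
p0 = 1 - p1
pi0, pi1 = p_v0 * q0 / p0, p_v1 * q1 / p0       # P(v | z=0)
p_y_z1 = (p_v0 * p_z1_v0 * p_y_v0 + p_v1 * p_z1_v1 * p_y_v1) / p1
m0 = max(F(0), pi0 * (d0 - d1) / q1)
m1 = min(F(1), pi0 + pi1 * (d1 - d0 + q1) / q0)
cap = min(p_y_z1, (1 - p0 * m0) / p1)
mls_lo, mls_hi = p1 * cap + p0 * m0, p1 * cap + p0 * m1
assert (round(float(mls_lo), 2), round(float(mls_hi), 2)) == (0.35, 0.78)
```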
If the underlying model is well specified, then Θ_MLS = Θ_MIV; that is, the two sets coincide. Alternatively, if the model is misspecified, that is, the MIV assumption does not hold, the MLS set contains the parameter values that minimize the Kullback–Leibler divergence between the data and a model with the MIV restriction imposed. So, when using the MIV assumption, three scenarios can occur: (1) the MIV assumption is well specified, and so Θ_MLS = Θ_MIV; (2) the MIV assumption is misspecified, and both Θ_MIV and Θ_MLS are empty; and, most interestingly, (3) the MIV assumption is misspecified but typically Θ_MIV ≠ Θ_MLS, and both lie within the no-information bounds. Typically, both of these sets can be quite tight, and hence under misspecification both (2) and (3) can lead to the 'crossing' problem when constructing Θ_MIV (see Manski and Pepper, 2000). It is also common that under (3) the two sets are disjoint, and hence care should be taken as to which estimates to report. It is not clear what the meaning of Θ_MIV is when the MIV assumption is misspecified. The next simple examples illustrate scenario (3).

3.2.1. Examples.
Consider the following model:

P(y = 1 | v = 0, z = 0) = P(y = 1 | v = 0, z = 1) = 3/4,
P(y = 1 | v = 1, z = 0) = P(y = 1 | v = 1, z = 1) = 1/6,
P(z = 1 | v = 0) = 1/4,    P(z = 1 | v = 1) = 5/6,
P(v = 1) = 1/5.

Here β = 0.63, and the (non-parametric) no-information bound is [0.17, 0.81]; of course, β belongs to the no-information (or non-parametric) bound. This model violates the monotonicity assumption. The set Θ_MIV is not empty: in particular, Θ_MIV = [0.19, 0.31]. We notice that these bounds are somewhat tight and that they do not cover the true β, so that β ∉ Θ_MIV. The MLS set is Θ_MLS = [0.35, 0.78]. So, this is a case where empirical researchers obtain different results (Θ_MIV and Θ_MLS are disjoint) depending on the estimation method, and there is no reason to reject the MIV assumption since both of these sets lie within the no-information bound. However, Θ_MLS contains the set of β's that minimize the distance between the true observed data likelihood and the model likelihood (that is, the non-parametric likelihood under the monotonicity assumption).

A more striking case emerges in the example above when P(z = 1 | v = 0) = 2/5, keeping the other probabilities unchanged, so that the parameter value remains the same, β = 0.63, while the no-information bound is now [0.27, 0.78]. Again, Θ_MIV is non-empty and gives an almost point-like identification region, Θ_MIV = [0.30, 0.31]. The MLS set, Θ_MLS, is also quite tight, Θ_MLS = [0.73, 0.75], although a bit wider than Θ_MIV. In this case there is a striking difference between the information the two sets provide. The message of these stylized examples is that under misspecification, care should be taken in the interpretation of results and in the approach one takes to estimate the parameters of interest.

3.3. A simple bivariate game

Consider the following bivariate game:

y1 = 1[α1 + Δ1y2 − ε1 ≥ 0],    y2 = 1[α2 + Δ2y1 − ε2 ≥ 0].   (3.6)
As usual, we assume that we observe both y1 and y2, that both Δ1 and Δ2 are strictly negative, and that (ε1, ε2) ∼ Fθ. Let the parameter of interest be γ = (θ, Δ1, Δ2, α1, α2). It is well known that this game admits multiple equilibria when

(ε1, ε2) ∈ [α1 + Δ1, α1] × [α2 + Δ2, α2] ≡ Ω.

In particular, there we have three equilibria: (1, 0) and (0, 1) in pure strategies, and

(p1, p2) = (−(ε1 + α1)/Δ1, −(ε2 + α2)/Δ2)

in mixed strategies, where p1 is player 1's probability of playing 1, and similarly for player 2. So the choice probabilities are given by

Pr(1, 1) = Fθ(α1 + Δ1, α2 + Δ2) + ∫_Ω p1p2 S3(ε1, ε2) dFθ(ε1, ε2),
Pr(1, 0) = ∫_{Ω(1,0)} dFθ(ε1, ε2) + ∫_Ω S1(ε1, ε2) dFθ(ε1, ε2) + ∫_Ω p1(1 − p2) S3(ε1, ε2) dFθ(ε1, ε2),
Pr(0, 1) = ∫_{Ω(0,1)} dFθ(ε1, ε2) + ∫_Ω S2(ε1, ε2) dFθ(ε1, ε2) + ∫_Ω (1 − p1)p2 S3(ε1, ε2) dFθ(ε1, ε2),   (3.7)

where Si(ε1, ε2) ∈ [0, 1] for i = 1, 2, 3 and S1(ε1, ε2) + S2(ε1, ε2) + S3(ε1, ε2) = 1,

Ω(0,1) = {(ε1, ε2) ∈ R² : (ε1 ≥ α1, ε2 ≤ α2) ∪ (α1 + Δ1 ≤ ε1 ≤ α1, ε2 ≤ α2 + Δ2)},

and similarly for Ω(1,0). The functions Si above are selection mechanisms that depend on unobservables and hence are general. They do not have a structural interpretation, but are functions that complete the model. The set of equations in (3.7) provides the choice probabilities predicted by the model given the choice of selection mechanisms Si, and constitutes a set of moment equalities that can be used to draw inference on γ in the presence of the functions Si. One approach to inference in the model above is to exploit the fact that the functions Si are probabilities and hence, by monotonicity of the choice probabilities in the Si's, obtain predicted upper and lower bounds on the observed choice probabilities. This transforms the model into one with inequality restrictions on regressions, where an MMD-like approach can be used to estimate its parameters. For example, an implication of the above model is the following set of moment inequality restrictions (abstracting from mixed strategies for simplicity):
Pr(1, 1) = Fθ(α1 + Δ1, α2 + Δ2),
∫_{Ω(1,0)} dFθ(ε1, ε2) ≤ Pr(1, 0) ≤ ∫_{Ω(1,0)} dFθ(ε1, ε2) + ∫_Ω dFθ(ε1, ε2),
∫_{Ω(0,1)} dFθ(ε1, ε2) ≤ Pr(0, 1) ≤ ∫_{Ω(0,1)} dFθ(ε1, ε2) + ∫_Ω dFθ(ε1, ε2).   (3.8)
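The equilibrium structure and the inequalities above can be illustrated by simulation. The sketch below uses entirely hypothetical choices (α1 = α2 = 0.5, Δ1 = Δ2 = −1, Fθ a pair of independent standard normals, and a coin-flip selection rule on the multiplicity region, abstracting from mixed strategies); it enumerates the pure-strategy equilibria draw by draw and checks that the simulated Pr(1, 1) matches Fθ(α1 + Δ1, α2 + Δ2) and that the simulated Pr(1, 0) falls inside its bounds.

```python
import numpy as np
from math import erf, sqrt

def Phi(t):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(t / sqrt(2.0)))

a1 = a2 = 0.5                 # hypothetical intercepts
d1 = d2 = -1.0                # strictly negative strategic effects
rng = np.random.default_rng(1)
n = 100_000

counts = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
for e1, e2 in rng.normal(size=(n, 2)):
    # Pure-strategy equilibria of (3.6) for this draw of (eps1, eps2):
    eqs = [(s, t) for s in (0, 1) for t in (0, 1)
           if s == (a1 + d1 * t - e1 >= 0) and t == (a2 + d2 * s - e2 >= 0)]
    outcome = eqs[0] if len(eqs) == 1 else eqs[rng.integers(len(eqs))]
    counts[outcome] += 1

freq11 = counts[(1, 1)] / n
freq10 = counts[(1, 0)] / n

# (1,1) is the unique equilibrium iff playing 1 is dominant for both players:
assert abs(freq11 - Phi(a1 + d1) * Phi(a2 + d2)) < 0.005

# Bounds on Pr(1,0) from (3.8): unique-(1,0) probability, plus that of Omega.
p_omega = (Phi(a1) - Phi(a1 + d1)) * (Phi(a2) - Phi(a2 + d2))
p_low = Phi(a1) * Phi(-(a2 + d2)) - p_omega      # P(Omega(1,0))
assert p_low < freq10 < p_low + p_omega
```

With a different selection rule, the simulated Pr(1, 0) moves within the same bounds, which is exactly why only inequalities are available without modelling selection.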
Again, the set of parameters that satisfy the above inequality restrictions when the model is well specified is a superset that contains the identified set. However, in the presence of misspecification, this approach estimates a set of parameters that is not necessarily the set of interest. In fact, under misspecification, the inequalities above will not all be satisfied by any parameter vector.9

3.3.1. Partially identified maximum likelihood: the ML set. Define the maximum likelihood set (or ML set) as follows. Let the choice probability vector in the data be P = (Pr(1, 1), Pr(1, 0), Pr(0, 1), Pr(0, 0)). In addition, let the predicted vector of choice probabilities, as a function of γ and of S = (S1, S2, S3) given in (3.7), be P_{γ,S}. Hence, we can define the set of parameters consistent with the observed probabilities as

χ = {(γ, S) : P = P_{γ,S}}.

This set can also be defined as the arg max of the corresponding likelihood,

χ = arg max_{γ,S} E[log(P_{γ,S})].

Now, the attractiveness of the above approach as compared to the one that uses inequalities (and hence MMD) is that under misspecification, the maximum likelihood set χ = arg max_{γ,S} E[log(P_{γ,S})] minimizes the distance between the true choice probabilities (P) and the predicted ones. This distance again is the well-known entropy distance, or KLIC. Hence, a more interpretable approach is to use the maximum likelihood objective function as a function of both γ and the vector of functions S(·). In case the model is well specified, this approach delivers the identified set (and hence is sharp); on the other hand, the maximum likelihood set has a clear meaning when the model is misspecified. The drawback of this approach is that one needs to deal with an infinite dimensional parameter. For a similar approach in a different context, see Honoré and Tamer (2006). This approach of completing the model arises because the reason one obtains inequality restrictions in these games in the first place is an unwillingness to model the selection mechanism. Completing the model to obtain a standard likelihood (a semi-parametric one here, since it depends on unknown functions), or to obtain a moment equality model, leads to inference that retains an interpretation when the underlying model is misspecified. Of course, the ensuing MLE or GMM model will likely be partially identified, and hence one needs to use inference procedures for moment equality models that are robust to non-point identification (of both γ and the vector of functions S in this case).
9 It is irrelevant whether one uses the above non-sharp inequalities to conduct inference or the sharp inequalities of Beresteanu et al. (2008).

4. CONCLUSION

This paper makes a simple point. Under misspecification, the parameters that one obtains using some moment inequalities might not be the ones that researchers are interested in. For example, in linear models with data on both outcomes and regressors, least squares estimates have a
clear meaning under misspecification. These are called the least squares quasi-true parameters, that is, the parameters that provide the best approximation to the conditional mean function. With interval measurement of outcomes, the model implies a set of moment inequalities that, under misspecification, will not provide information about these quasi-true parameters. This 'problem' of interpretation does not arise in moment equality models, since the meaning of those estimates is defined through the objective function, which minimizes some particular distance of the vector of moment conditions from zero. With moment inequalities, the interpretation is not as clear, and hence care should be taken in interpreting set estimates obtained from these models. More importantly, in moment inequality models, where one expects the model parameters to be partially identified with often wide bounds (depending on the assumptions), misspecification tends to produce tight estimates of the identified set. Such tightness can be a direct result of the misspecification and therefore cannot be viewed as an indicator that the underlying model contains a lot of information about the true but partially identified parameter.
ACKNOWLEDGMENTS We thank the editor and referees for comments that greatly improved the paper. We also thank participants at the Partial Identification conference held in London in March 2008 for comments. Financial support from the National Science Foundation is gratefully acknowledged.
REFERENCES

Andrews, D. and G. Soares (2010). Inference for parameters defined by moment inequalities using generalized moment selection. Econometrica 78, 119–57.
Beresteanu, A., F. Molinari and I. Molchanov (2008). Sharp identification regions in games. Working paper, Cornell University.
Canay, I. (2010). EL inference for partially identified models: large deviations optimality and bootstrap validity. Journal of Econometrics 156, 408–25.
Chernozhukov, V., H. Hong and E. Tamer (2007). Estimation and confidence regions for parameter sets in econometric models. Econometrica 75, 1243–84.
Haile, P. and E. Tamer (2003). Inference with an incomplete model of English auctions. Journal of Political Economy 111, 1–51.
Honoré, B., S. Khan and J. Powell (2002). Quantile regression under random censoring. Journal of Econometrics 64, 241–78.
Honoré, B. and E. Tamer (2006). Bounds on parameters in panel dynamic discrete choice models. Econometrica 74, 611–29.
Imbens, G. and C. F. Manski (2004). Confidence intervals for partially identified parameters. Econometrica 72, 1845–57.
Manski, C. F. (1995). Identification Problems in the Social Sciences. Cambridge, MA: Harvard University Press.
Manski, C. F. (2003). Partial Identification of Probability Distributions. Springer Series in Statistics. New York: Springer.
Manski, C. F. (2007). Identification for Prediction and Decision. Cambridge, MA: Harvard University Press.
Manski, C. F. and J. Pepper (2000). Monotone instrumental variables: with an application to the returns to schooling. Econometrica 68, 997–1010.
Manski, C. F. and E. Tamer (2002). Inference on regressions with interval data on a regressor or outcome. Econometrica 70, 519–47.
Molinari, F. (2008). Partial identification of probability distributions with misclassified data. Journal of Econometrics 44, 81–117.
Powell, J. (1984). Least absolute deviations estimation for the censored regression model. Journal of Econometrics, 303–25.
Romano, J. and A. Shaikh (2008). Inference for the identified set in partially identified econometric models. Journal of Statistical Planning and Inference 138, 2786–807.
Stoye, J. (2006). Bounds on generalized linear predictors with incomplete outcome data. Reliable Computing 13, 293–302.
White, H. (1994). Estimation, Inference and Specification Analysis. Econometric Society Monographs, Volume 22. Cambridge: Cambridge University Press.
APPENDIX: PROOFS OF THEOREMS

Proof of Theorem 2.1: Rewrite the slope as follows:

β1(Fλ) = (1/var(x)) E[(x − μx)yλ]
       = (1/var(x)) (E[(x − μx)yλ 1[(x − μx) ≥ 0]] + E[(x − μx)yλ 1[(x − μx) < 0]]).

Hence, the maximum of β1 over Θ can be easily obtained by exploiting the fact that λ ∈ [0, 1]:

β1max = (1/var(x)) (E[(x − μx)y1 1[(x − μx) ≥ 0]] + E[(x − μx)y0 1[(x − μx) < 0]]),
β1min = (1/var(x)) (E[(x − μx)y0 1[(x − μx) ≥ 0]] + E[(x − μx)y1 1[(x − μx) < 0]]).   (A.1)

These can easily be estimated consistently with the observed data. Similarly to the slope, the intercept extreme points can be obtained from

β0(Fλ) = μ_{yλ} − β1(Fλ)μx = E[(1 − (x − μx)μx/var(x)) yλ] = E[((var(x) − (x − μx)μx)/var(x)) yλ].   (A.2)

It is easy to see that the desired result for the intercept follows.
Proof of Theorem 2.2: A direct way to prove this is to show that the MMD objective function can be written as in (2.15). Fix b0, b1. Then the optimal choice of h0 and h1 is

h0(x; b0, b1) = (b0 + b1x − E[y0 | x])+  and  h1(x; b0, b1) = (E[y1 | x] − b0 − b1x)+,

and plugging these into Q(b0, b1, h0, h1) yields exactly Qmmd(b0, b1). Therefore any solution to min Qmmd(b0, b1) is also a solution to min Q(b0, b1, h0, h1) with h0 and h1 as above. The reverse of this claim is also true.
The Econometrics Journal (2011), volume 14, pp. 204–240. doi: 10.1111/j.1368-423X.2010.00336.x

Likelihood estimation of Lévy-driven stochastic volatility models through realized variance measures

ALMUT E. D. VERAART†

† CREATES, School of Economics and Management, Aarhus University, Building 1322, Bartholins Allé 10, DK-8000 Aarhus C, Denmark. E-mail: [email protected]

First version received: November 2008; final version accepted: August 2010
Summary This paper studies the impact of jumps and leverage-type effects on return and realized variance calculations when the logarithmic asset price is given by a stochastically scaled Lévy process. Realized variance is not a consistent estimator of the integrated squared volatility process in such a modelling framework, but it can nevertheless be used within a quasi-maximum likelihood setup to draw inference on the model parameters. This paper introduces a new methodology for deriving all cumulants of the returns and of the realized variance in explicit form by solving a recursive system of inhomogeneous ordinary differential equations.

Keywords: Inference, Leverage effect, Lévy processes, Realized variance, Stochastic volatility, Superposition, Quasi-maximum likelihood.
1. INTRODUCTION This paper focuses on stochastically scaled Lévy processes for modelling logarithmic asset prices. Such processes are defined as stochastic integrals with respect to a Lévy process, where the stochastic integrand reflects stochastic volatility. Although such models are very natural generalizations of the Black and Scholes (1973) model and its stochastic volatility extensions, see e.g. Heston (1993), Barndorff-Nielsen and Shephard (2001), they are still somewhat neglected in the financial literature. This might be because they are in general not as analytically tractable as Brownian motion-driven asset price models or models based on time-changed Lévy processes, since they do not belong to the class of affine models (see e.g. Duffie et al., 2003, and Kallsen, 2006), for which many analytical results can be obtained easily. However, this paper presents a new methodology for handling these analytically more involved models. Before we dive into these new results, let us briefly review related studies in the literature. Because Lévy-driven stochastic volatility models cope with many stylized facts of asset returns particularly well, e.g. they reflect the skewness and fat tails of asset return distributions more appropriately and handle jumps and volatility smiles much better than models based on Brownian motions alone, several types of such Lévy-based stochastic volatility models have been studied recently in the financial literature. Basically, they can be divided into two groups: time-changed Lévy processes, see e.g. Carr et al. (2003), Carr and Wu (2004),
Barndorff-Nielsen and Shephard (2006), and stochastic integrals with respect to a Lévy process, see e.g. Eberlein et al. (2003), Woerner (2003), Klüppelberg et al. (2004), Todorov and Tauchen (2006). As already mentioned, we restrict our attention here to the latter class of Lévy-based stochastic volatility models. Such models can be directly extended to a multivariate framework and, hence, are often regarded as superior to the time-change models. Furthermore, as pointed out by Madan (2009), Lévy-driven models based on stochastic spatial scaling lead to a positive relationship between volatility and skew, a fact which is supported by empirical studies, whereas Lévy-driven stochastic volatility models based on temporal scaling (as e.g. in the Barndorff-Nielsen and Shephard (2006) model) lead to a negative relationship between volatility and skewness of the price process. Note here that Madan (2009) derived these results under the assumption of absence of a leverage effect. Apart from jumps, our new model also allows for very general leverage-type effects, or asymmetric volatility. Over the last decades, many empirical studies have revealed that past stock returns tend to be negatively correlated with innovations of future volatilities. This property is often called the leverage effect, an expression derived from the hypothesis that a negative stock return might increase financial leverage and, hence, lead to a riskier stock and thus higher volatility. Black (1976) was probably the first to investigate this effect, and his finding was further supported by studies by Christie (1982) and Nelson (1991) among others and, more recently, by Harvey and Shephard (1996), Bouchaud et al. (2001), Tauchen (2004, 2005), Yu (2005) and Bollerslev et al. (2006). Although the existence of asymmetric volatility is rarely questioned, its main determinant is still subject to vivid discussion; see e.g.
Bekaert and Wu (2000) and the references therein. Besides the previously mentioned leverage hypothesis, there is also the time-varying risk premium or volatility feedback theory, which essentially relies on the converse causality: increasing volatility leads to decreasing stock returns. Regardless of where asymmetric volatility originates, however, it is an important fact that has to be accounted for in asset pricing, especially in the context of option pricing, because the asymmetric relationship is directly associated with implied volatility smiles. The leverage effect is therefore often regarded as a natural tool for explaining smirks in option price data (see Hull and White, 1987, Garcia et al., 2001). Unfortunately, previous work on the econometric properties of Lévy-driven stochastic volatility has so far only been carried out under the no-leverage assumption (see Woerner, 2003, Barndorff-Nielsen and Shephard, 2006), which is, particularly in equity markets, not realistic. The main contributions of the paper are as follows. We propose a new and highly analytically tractable asset price model, which allows for stochastic volatility, jumps and general leverage-type effects, and which can easily be generalized to a multivariate framework. We then present a novel methodology for deriving all moments/cumulants of interest in explicit form by solving a recursive system of inhomogeneous ordinary differential equations. In addition to the statistical properties of the price and the volatility process, we also focus in detail on the properties of realized variance in our new modelling framework. Realized variance and its use for estimating and forecasting stochastic volatility have been studied extensively in the finance literature in the last decade; see e.g. Andersen and Bollerslev (1998), Barndorff-Nielsen and Shephard (2002, 2003), Jacod (2008) and Veraart (2010).
So far, such studies have mainly focused on asset price models given by Brownian semimartingales rather than scaled Lévy processes. Using the explicit formulae for the cumulants of the realized variance, we show how to draw inference on the model parameters by a quasi-maximum likelihood method. The remaining part of the paper is structured as follows. Section 2 sets up the notation and defines the Lévy-driven stochastic volatility model studied in this paper. Following
recent research on stochastic volatility models, we use the so-called realized variance as a proxy for the accumulated variance over a day. This quantity is defined in Section 3. Sections 4 and 5 contain the main theoretical results of the paper. In Section 4, we present explicit formulae for the moments and second-order properties of the returns, the actual variance and the quadratic variation of the price process. Section 5 addresses the first- and second-order properties of the realized variance, where we study in detail the influence of the jumps and the leverage effect on volatility estimation. Section 6 studies the impact of market microstructure effects on our previously derived results. Next, we focus on parameter estimation and inference in Section 7, where both a simulation and an empirical study are carried out. Finally, Section 8 concludes. All mathematical proofs are relegated to the Appendix.
2. THE MODEL This section introduces a new model for an asset price which is able to account for the key properties of high-frequency financial data, such as the absence of Gaussianity, the existence of time-varying volatility and volatility clusters, and the presence of jumps. Furthermore, it allows for a very general form of leverage effect and is empirically compelling due to its high analytical tractability. In the following, we define this new model, introduce the technical assumptions used throughout the paper, motivate the choice of the model and relate it to other recent stochastic volatility models. Let $(\Omega, \mathcal{A}, P)$ denote a probability space with filtration $F = \{F_t\}_{0 \le t < \infty}$ satisfying the usual conditions; see e.g. Rogers and Williams (2001). Let $S = (S_t)_{t \ge 0}$ denote the logarithmic asset price and $\sigma = (\sigma_t)_{t \ge 0}$ the stochastic volatility (SV). We study models of the form
$$S_t = \mu t + \int_0^t \sigma_{s-}\, d(vW_s + X_s), \qquad (2.1)$$
where $X = (X_t)_{t \ge 0}$ denotes a pure jump Lévy process (without drift) (see e.g. Bertoin, 1996, Sato, 1999, Protter, 2004), $W = (W_t)_{t \ge 0}$ is a standard Brownian motion and $v \ge 0$ is a constant, which could be 0; in that case, we would be in a pure jump setting. Note that we assume throughout the paper that $X \not\equiv 0$. Solely for ease of exposition, we assume that the mean of the logarithmic asset price is zero, i.e. Assumption 2.1 holds.

ASSUMPTION 2.1. $E(X_1) = 0$ and $\mu = 0$.

Furthermore, we work under the following assumption for the variance of $vW + X$.

ASSUMPTION 2.2. $\operatorname{Var}(X_1) < \infty$ and $\operatorname{Var}(X_1 + vW_1) = \operatorname{Var}(X_1) + \operatorname{Var}(vW_1) = 1$.
This assumption ensures that the new model is uniquely identified; otherwise, one could always multiply $vW + X$ by a constant and scale $\sigma$ appropriately and still obtain the same price process $S$. The main feature of the model defined in (2.1) is that the stochastic volatility process enters as a multiplier of both the Brownian motion and the jump component, i.e. we study a stochastic integral of a stochastic volatility process with respect to a Lévy process. Such models have also been studied by, e.g., Eberlein et al. (2003), Woerner (2003), Klüppelberg et al. (2004) and Todorov and Tauchen (2006), since they represent a natural generalization of
Brownian-driven stochastic volatility models. Furthermore, by having a stochastic volatility factor in front of the pure jump component, our new model also nests pure jump-driven stochastic volatility models (when we set v = 0), which are favoured by a stream of literature (see e.g. Carr et al., 2003). Because the model is constructed by stochastic spatial scaling of a Lévy process, i.e. stochastic integration with respect to a Lévy process, rather than by stochastic temporal scaling (time-changing) of the Lévy process $X + vW$, extensions to multivariate models are straightforward. In a next step, we specify the volatility process σ. Here we will focus on non-Gaussian Ornstein–Uhlenbeck (OU) processes. That is, we shall model the volatility process by a stationary Lévy-driven OU process which satisfies the following stochastic differential equation:
$$d\sigma_t = -\lambda\sigma_{t-}\, dt + dY_{\lambda t},$$
(2.2)
where $Y = (Y_t)_{t \ge 0}$ denotes a pure jump subordinator (i.e. a non-decreasing Lévy process) and λ > 0 is the memory parameter. Throughout we assume that $\sigma_0$ is drawn from its stationary distribution. Then we get the well-known representation
$$\sigma_t = e^{-\lambda t}\sigma_0 + \int_0^t e^{-\lambda(t-s)}\, dY_{\lambda s} = \int_{-\infty}^t e^{-\lambda(t-s)}\, dY_{\lambda s}.$$
Similar models have been studied in detail in the Brownian motion framework (i.e. when X ≡ 0) by Barndorff-Nielsen and Shephard (2001), who chose σ² to be a non-Gaussian OU process. However, studies by Konaris (2002) have shown that choosing σ or σ² to be an OU process leads to similar results. So for reasons of mathematical tractability we have chosen the volatility process rather than the variance to be of non-Gaussian OU type. The choice of the volatility process is based on two key facts: empirical findings and analytical tractability. First of all, it is well known that stochastic volatility models based on non-Gaussian OU processes fit empirical data very well, in particular when the Lévy subordinator is drawn from an inverse Gaussian distribution. Furthermore, it turns out that such processes behave in distribution very similarly to a CIR process, another very popular stochastic volatility process. In fact, the second-order properties of the CIR process and the non-Gaussian OU process are identical if the driving Lévy subordinator is drawn from a Gamma distribution; see the discussion in Barndorff-Nielsen and Shephard (2011). Furthermore, there is some recent (empirical) work which suggests that volatility is driven by a pure jump process (see Todorov and Tauchen, 2011). In addition to the empirical findings which favour non-Gaussian OU processes for stochastic volatility modelling, such processes are also analytically tractable to a very high degree, owing to the linear structure of the underlying stochastic differential equation, which makes it possible to derive many important quantities explicitly and hence makes such models applicable in practice. Note that, if σ is defined as in (2.2), then the dynamics of the squared volatility process are given by
$$d\sigma_t^2 = -2\lambda\sigma_{t-}^2\, dt + 2\sigma_{t-}\, dY_{\lambda t} + d[Y]_{\lambda t},$$
which reminds us of a jump-driven version of a square-root process. Clearly, the model for the volatility process given in (2.2) satisfies the essential requirement that volatility be non-negative and, furthermore, it is mean reverting to $E(Y_1)$ at rate λ. The latter property can be seen by rewriting (2.2), using the fact that $E(Y_{\lambda t}) = \lambda t E(Y_1)$, as
$$d\sigma_t = -\lambda\big(\sigma_t - E(Y_1)\big)\, dt + d\big(Y_{\lambda t} - E(Y_{\lambda t})\big).$$
From this representation, it becomes clear why we time-change the subordinator Y by λ: by doing so, we obtain a stationary distribution of σ which does not depend on λ. In particular, λ can be interpreted as the true rate of mean reversion or the memory parameter of σ. Finally, we wish to allow for very general leverage-type effects in the model and, hence, we do not assume that σ and X are independent, but rather work under the following assumption.

ASSUMPTION 2.3. The process $Z = (Z_t)_{t \ge 0}$, where $Z_t = (X_t, Y_{\lambda t})'$, is a bivariate pure jump Lévy process (without drift) whose second component is a Lévy subordinator.

By choosing a bivariate Lévy process $(vW + X, Y_\lambda)'$ as the driving process of $(S, \sigma)'$, we can capture very general leverage-type effects in our new model. Note that, to be able to choose such a bivariate Lévy process, we have to make sure that both driving processes run on the same time scale; the choice of $Y_\lambda$ rather than Y in the second component is hence essential. Otherwise it would be possible that there was already information about the price process available before there was any information about the volatility process, or vice versa, and this could lead to arbitrage opportunities. Note that the leverage effect in our model is introduced solely through the jump component. Recall that two pure jump Lévy processes are dependent if and only if they have common jumps (see e.g. Cont and Tankov, 2004, p. 144). That is, by allowing for co-jumps of the two Lévy jump processes which drive the volatility and the price process, we are able to mimic very general leverage-type effects. Introducing the leverage effect via the jump component rather than via the diffusion component is in line with the modelling approach presented by Barndorff-Nielsen and Shephard (2001) and is based on the intuition that the more extreme movements, represented by the jumps, drive the correlation between asset price and volatility.
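As a rough illustration of the model in (2.1)–(2.2), the sketch below simulates a volatility path from a compound-Poisson-driven OU process and then builds the price path by an Euler scheme. All numerical choices (exponential jumps for Y, zero-mean Gaussian jumps for X, the parameter values) are illustrative assumptions of this sketch, not specifications taken from the paper:

```python
import numpy as np

def simulate_ou_vol(sigma0, lam, dt, n, rate, mean_jump, rng):
    """OU volatility (2.2): between jumps sigma decays at rate lam; the
    subordinator Y is compound Poisson with exponential jumps (illustrative)
    and is time-changed by lam, so arrivals occur at rate lam * rate."""
    sigma = np.empty(n + 1)
    sigma[0] = sigma0
    decay = np.exp(-lam * dt)
    for k in range(n):
        jumps = rng.exponential(mean_jump, rng.poisson(lam * rate * dt))
        sigma[k + 1] = decay * sigma[k] + jumps.sum()
    return sigma  # mean-reverts around E(Y_1) = rate * mean_jump

def simulate_price(sigma, v, jump_rate, jump_scale, dt, rng):
    """Euler scheme for S_t = int_0^t sigma_{s-} d(v W_s + X_s), model (2.1);
    X is approximated by compound Poisson with zero-mean Gaussian jumps."""
    n = len(sigma) - 1
    dW = rng.normal(0.0, np.sqrt(dt), n)
    dX = np.array([rng.normal(0.0, jump_scale, k).sum()
                   for k in rng.poisson(jump_rate * dt, n)])
    dS = sigma[:-1] * (v * dW + dX)   # sigma multiplies both components
    return np.concatenate([[0.0], np.cumsum(dS)])
```

For a genuine leverage effect one would draw the jumps of X and Y from a common arrival stream, as in Assumption 2.3; the sketch keeps them independent for brevity.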
3. RETURNS AND REALIZED VARIANCE Our aim now is to study the econometric properties of the Lévy-type SV model defined earlier. Based on these findings, we will then be able to comment on the impact of jumps and general leverage-type effects on realized variance computations, and we will develop a maximum likelihood-based method for estimating the parameters of our new model. So far, we have defined a model for a logarithmic asset price $(S_t)_{t \ge 0}$ in continuous time. Clearly, its increments represent the corresponding returns. Hence, let h > 0 denote the length of a fixed time interval, typically one day. The returns of the asset price are then denoted by
$$s_i = S_{ih} - S_{(i-1)h}, \qquad i = 1, 2, \ldots,$$
where i indexes the day. Because of the availability of high-frequency data, one is often interested in modelling returns at a higher frequency than just daily. Suppose that we are given M intra-h observations during each time interval of length h. The time gap between these high-frequency observations is denoted by $\delta = h/M$. Then
$$s_{j,i} = S_{(i-1)h+j\delta} - S_{(i-1)h+(j-1)\delta}, \qquad j = 1, \ldots, M,$$
denotes the jth intra-h high-frequency return in the ith period of length h. Often we work with h = 1, representing one day. Based on these high-frequency returns, one can then define the
realized variance for the ith day by
$$[S_\delta]_{[(i-1)h,ih]} = \sum_{j=1}^{M} s_{j,i}^2.$$
This quantity is often used to proxy the variability in financial markets in SV models (see e.g. Andersen et al., 2001, Barndorff-Nielsen and Shephard, 2002). In the empirical part of this paper, we will compute the realized variance based on 5-min returns and will therefore ignore possible market microstructure effects which come into play when analysing ultra-high-frequency returns (i.e. 1-sec returns or tick-by-tick data). In Section 6, we will give a short outlook on how our results change in the presence of such market microstructure effects. Recall that the quadratic variation of a semimartingale $S = (S_t)_{t \ge 0}$ with $S_0 = 0$ is defined by $[S]_t = S_t^2 - 2\int_0^t S_{u-}\, dS_u$. It is well known that
$$[S_\delta]_{[(i-1)h,ih]} \xrightarrow{\text{ucp}} [S]_{ih} - [S]_{(i-1)h}, \qquad \text{as } M \to \infty \text{ (i.e. } \delta \to 0\text{)},$$
where the convergence is uniform on compacts in probability (ucp) (see Protter, 2004). So the realized variance can be used to estimate the (increments of the) quadratic variation of the price process consistently. Hence, in our modelling framework, the realized variance can be used as a consistent estimator of
$$[S]_{ih} - [S]_{(i-1)h} = \int_{(i-1)h}^{ih} \sigma_{s-}^2\, d[vW + X]_s = \int_{(i-1)h}^{ih} \sigma_{s-}^2\big(v^2\, ds + d[X]_s\big).$$
However, one is rather interested in estimating and forecasting the actual variance $\int_{(i-1)h}^{ih}\sigma_s^2\, ds$, since, in the absence of leverage, we have $\operatorname{Var}(s_i \mid \sigma) = \int_{(i-1)h}^{ih}\sigma_s^2\, ds$ because $\operatorname{Var}(vW_1 + X_1) = 1$. So we see that the actual variance appears naturally as the conditional variance of the returns in the absence of leverage. Furthermore, in the absence of jumps it actually equals the quadratic variation of the price increments. However, due to Lévy's theorem we can deduce that, as soon as the Brownian motion is replaced by a more general Lévy process or a semimartingale as the driving process of the logarithmic asset price, the quadratic variation of the SV model is not given by the actual variance. So we see that in the simultaneous presence of jumps and leverage-type effects, the realized variance does not estimate the actual variance consistently. Hence, it is important to study the bias and the degree of inconsistency of the realized variance as a proxy for the actual variance, which we will do in the following.
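A small numerical sketch of this consistency: realized variance computed from M intra-period returns approaches the quadratic variation as M grows. The Brownian benchmark below (σ ≡ 1, v = 1, no jumps, so $[S]_1 = 1$) is an illustrative check, not the paper's empirical setup:

```python
import numpy as np

def realized_variance(prices):
    """RV over one interval: the sum of squared intra-period returns."""
    returns = np.diff(np.asarray(prices, dtype=float))
    return float(np.sum(returns ** 2))

# Benchmark: S = W (sigma = 1, v = 1, X = 0), so [S]_1 = 1 and RV -> 1.
rng = np.random.default_rng(42)
M = 100_000
W = np.concatenate([[0.0], np.cumsum(rng.normal(0.0, np.sqrt(1.0 / M), M))])
rv = realized_variance(W)   # close to 1 for large M
```

The sampling standard deviation of RV in this benchmark is roughly $\sqrt{2/M}$, so deviations from 1 shrink quickly with the sampling frequency.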
4. CUMULANTS OF RETURNS, ACTUAL VARIANCE AND INCREMENTAL QUADRATIC VARIATION We begin our econometric study by analysing the statistical properties of the following three key objects of interest: the price process S, the integrated variance IV and the quadratic variation of the price process [S]. From these properties, we can then directly derive properties of the increments of the corresponding stochastic processes: the returns of the log-price (the increments of S), the actual variance (the increments of IV) and the incremental quadratic variation (the
increments of [S]). These results can then be used for estimating all model parameters (see Section 7). 4.1. Notation for cumulants First of all, we set up the notation we will be using throughout this paper. Let ν denote the Lévy measure of Z, and let $\nu_X$ and $\nu_{Y_\lambda}$ denote the Lévy measures of X and $Y_\lambda$, respectively.
REMARK 4.1. If v = 0 and X is of finite variation, S can be written as $S_t = \sum_{0 \le s \le t} \Delta S_s$. From Jacod and Shiryaev (2003, Chapter IX, Proposition 3) we can deduce that its characteristic triplet is given by $(0, 0, \nu_S)$, where $\nu_S$ is defined by $1_F * \nu_S = 1_F(\sigma_{s-}x) * \nu_X$ for all $F \subseteq \mathbb{R}\setminus\{0\}$, where we denote, for a function f on a subset of $\mathbb{R}^2$ and a random measure ν, the integral process $(f * \nu)_t = \int_{\mathbb{R}\times[0,t]} f(x,s)\, \nu(dx, ds)$.
Recall that the nth cumulant of a stochastic process $Z = (Z_t)_{t \ge 0}$ is defined (provided it exists) by
$$\kappa_n(Z_t) = \frac{1}{i^n} \frac{\partial^n}{\partial u^n} \log\big(E(\exp(iuZ_t))\big)\Big|_{u=0}.$$
Furthermore, it is well known, see e.g. Cont and Tankov (2004, pp. 32, 92), that the moments of a Lévy process can be expressed in terms of the corresponding cumulants by $E(Z_t) = \kappa_1(Z_1)t$, $E(Z_t^2) = \kappa_2(Z_1)t + (\kappa_1(Z_1)t)^2$, $E(Z_t^3) = \kappa_3(Z_1)t + 3\kappa_1(Z_1)\kappa_2(Z_1)t^2 + (\kappa_1(Z_1)t)^3$ and $E(Z_t^4) = \kappa_4(Z_1)t + 3(\kappa_2(Z_1)t)^2 + 4\kappa_1(Z_1)\kappa_3(Z_1)t^2 + 6\kappa_1(Z_1)^2\kappa_2(Z_1)t^3 + (\kappa_1(Z_1)t)^4$.
For the cumulants of these processes, we will use the following notation. The cumulants (denoted by $\kappa_i(\cdot)$, $i = 1, 2, \ldots$) of the random variable $X_1$ are denoted by ξ and those of the process $Y_1$ by η. Hence, $\xi_i = \kappa_i(X_1) = \int_{\mathbb{R}} u^i \nu_X(du)$ and $\eta_i = \kappa_i(Y_1) = \int_{[0,\infty)} v^i \nu_{Y_1}(dv)$ for $i = 1, 2, \ldots$. Note that $\kappa_i(Y_\lambda) = \lambda\eta_i$ for $i = 1, 2, \ldots$. Furthermore, $\kappa_{n,m} = \int_{\mathbb{R}\times[0,\infty)} u^n v^m \nu(du, dv)$ for $n, m \in \mathbb{N}$. Throughout the text, we will assume that at least the first four cumulants of the Lévy process Z are finite. From the Cauchy–Schwarz inequality, the cumulants (if they exist) satisfy $\kappa_{n,m} \le \sqrt{\xi_{2n}\,\lambda\eta_{2m}}$ for $n, m \in \mathbb{N}$. The cumulants of the bivariate Lévy process can be regarded as a measure of the dependence between the two driving processes, which obviously includes the leverage effect: the measure of dependence of first order. In the following, we will deal with the following five cumulants:
$$\kappa_{1,1} = E\big[(X_1 - EX_1)(Y_\lambda - EY_\lambda)\big] = \operatorname{Cov}(X_1, Y_\lambda),$$
$$\kappa_{1,2} = E\big[(X_1 - EX_1)(Y_\lambda - EY_\lambda)^2\big],$$
$$\kappa_{1,3} = E\big[(X_1 - EX_1)(Y_\lambda - EY_\lambda)^3\big] - 3E\big[(X_1 - EX_1)(Y_\lambda - EY_\lambda)\big]\, E\big[(Y_\lambda - EY_\lambda)^2\big],$$
$$\kappa_{2,1} = E\big[(X_1 - EX_1)^2(Y_\lambda - EY_\lambda)\big],$$
$$\begin{aligned}\kappa_{2,2} ={}& E\big(X_1^2 Y_\lambda^2\big) - 2E(X_1)E\big(X_1 Y_\lambda^2\big) - 2E(Y_\lambda)E\big(X_1^2 Y_\lambda\big) - E\big(X_1^2\big)E\big(Y_\lambda^2\big) - 2\{E(X_1 Y_\lambda)\}^2\\ &+ 2\{E(X_1)\}^2 E\big(Y_\lambda^2\big) + 2\{E(Y_\lambda)\}^2 E\big(X_1^2\big) + 8E(Y_\lambda)E(X_1)E(X_1 Y_\lambda) - 6\{E(X_1)\}^2\{E(Y_\lambda)\}^2.\end{aligned}$$
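For a concrete case: if Z is a bivariate compound Poisson process with intensity r and mark law F, then ν = rF and the joint cumulants reduce to $\kappa_{n,m} = r\, E(U^n V^m)$ under F. A minimal sketch (the exponential-mark example is an illustrative assumption, not from the paper):

```python
def cross_cumulant(rate, mark_moment):
    """kappa_{n,m} = int u^n v^m nu(du, dv) for a bivariate compound Poisson
    process: nu = rate * F, so kappa_{n,m} = rate * E(U^n V^m) under F."""
    return rate * mark_moment

# Common jumps U = -J, V = J with J ~ Exp(1): E(UV) = -E(J^2) = -2, so
# kappa_{1,1} = -2 * rate < 0 -- negative price/volatility co-jumps
# generate a leverage effect.
kappa11 = cross_cumulant(3.0, -2.0)
```

If the two components have no common jumps, every mixed mark moment is zero and all joint cumulants vanish, which is the no-leverage case discussed next.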
Note that if X and $Y_\lambda$ are independent, they have no common jumps and, hence, $\nu\{(u,v) \in \mathbb{R}\times[0,\infty): uv \ne 0\} = 0$. That means that, if X and $Y_\lambda$ are independent and, hence, there is no leverage effect, then all joint cumulants $\kappa_{n,m} = 0$. 4.2. Cumulants of returns We start our theoretical study by computing the moments of the log-price process S. Because these moments depend not only on the moments of X but also on those of σ, we calculate the moments of σ first. To do that, we derive a general representation formula for the nth power of σ for $n \in \mathbb{N}$.
PROPOSITION 4.1. Let $n \in \mathbb{N}$. As long as $\int_0^t \sigma_s^n\, ds < \infty$, the nth power of $\sigma_t$ satisfies
$$\sigma_t^n - \sigma_0^n = -\lambda n \int_0^t \sigma_{s-}^n\, ds + \sum_{k=1}^n \binom{n}{k} \sum_{0 \le s \le t} \sigma_{s-}^{n-k} \big(\Delta Y_{\lambda s}\big)^k.$$
From the formula above, one can deduce the moments of σ.
COROLLARY 4.1. Recall that $\eta_i = \kappa_i(Y_1)$. The first four moments of the stationary distribution of σ are, hence, given by $E(\sigma_t) = \eta_1$, $E(\sigma_t^2) = \eta_1^2 + \frac{1}{2}\eta_2$, $E(\sigma_t^3) = \eta_1^3 + \frac{1}{3}\eta_3 + \frac{3}{2}\eta_2\eta_1$ and $E(\sigma_t^4) = \frac{1}{4}\eta_4 + \eta_1^4 + 3\eta_2\eta_1^2 + \frac{4}{3}\eta_1\eta_3 + \frac{3}{4}\eta_2^2$.
Now we focus on the moments of the price process S and, also, on its joint moments with the volatility process σ. It turns out that by repeated applications of Itô's formula and the use of the compensation formula for jump processes, we obtain a recursive system of inhomogeneous ordinary differential equations, which can be solved explicitly. This methodology is described in detail in the Appendix and will be used extensively in the remaining part of the paper. So although our model is generally not affine and, hence, might look complicated to tackle at first sight, the new methodology proposed in this paper enables us to derive all cumulants of interest explicitly.
PROPOSITION 4.2 (Recursive formula for the moments of S). Let $j, k, l, n \in \mathbb{N}$, with $j, k, l \le n$. If $\eta_k, \xi_k < \infty$ for $k \le n$ and $\kappa_{j,l} < \infty$ for $l \le k$, $j \le n-k$, we get the following results. The joint moments of S and σ are given by
$$E\big(S_t^{n-k}\sigma_t^k\big) = e^{-k\lambda t} \int_0^t g(u; n, k)\, e^{k\lambda u}\, du,$$
where
$$g(u; n, k) = \sum_{j=1}^k \binom{k}{j} \lambda\eta_j\, E\big(S_u^{n-k}\sigma_u^{k-j}\big) + \frac{(n-k)(n-k-1)}{2}\big(v^2+\xi_2\big)\, E\big(S_u^{n-k-2}\sigma_u^{k+2}\big) + \sum_{j=3}^{n-k} \binom{n-k}{j} \xi_j\, E\big(\sigma_u^{k+j} S_u^{n-k-j}\big) + \sum_{j=1}^{n-k}\sum_{l=1}^{k} \binom{n-k}{j}\binom{k}{l} \kappa_{j,l}\, E\big(S_u^{n-k-j}\sigma_u^{j+k-l}\big).$$
The moments of S are given by
$$E\big(S_t^n\big) = \frac{n(n-1)}{2}\big(v^2+\xi_2\big) \int_0^t E\big(S_s^{n-2}\sigma_s^2\big)\, ds + \sum_{k=3}^n \binom{n}{k} \xi_k \int_0^t E\big(\sigma_s^k S_s^{n-k}\big)\, ds.$$
Now we can recursively solve the equation above and obtain the corresponding moments of S.
COROLLARY 4.2. If $\xi_1 = 0$, the first two moments are given by
$$E(S_t) = 0, \qquad E\big(S_t^2\big) = \big(v^2+\xi_2\big)\Big(\eta_1^2 + \frac{\eta_2}{2}\Big) t.$$
For $t \to 0$:
$$E\big(S_t^3\big) = \xi_3\Big(\eta_1^3 + \frac{1}{3}\eta_3 + \frac{3}{2}\eta_1\eta_2\Big) t + \frac{3}{2}\Big[\big(v^2+\xi_2\big)\,\kappa_{1,1}\big(2\eta_1^2 + \eta_2\big) + \kappa_{1,2}\,\eta_1\Big] t^2 + O(t^3),$$
$$\begin{aligned}E\big(S_t^4\big) ={}& \xi_4\Big(\frac{1}{4}\eta_4 + \eta_1^4 + \frac{3}{4}\eta_2^2 + 3\eta_1^2\eta_2 + \frac{4}{3}\eta_1\eta_3\Big) t\\ &+ \Big[\Big(3\eta_1^4 + 4\eta_1\eta_3 + \frac{3}{4}\eta_4 + 9\eta_1^2\eta_2 + \frac{9}{4}\eta_2^2\Big)\big(v^2+\xi_2\big)^2\\ &\quad + \Big(\big(6\eta_1^3 + 2\eta_3 + 9\eta_1\eta_2\big)\kappa_{2,1} + \Big(3\eta_1^2 + \frac{3}{2}\eta_2\Big)\kappa_{2,2}\Big)\big(v^2+\xi_2\big)\\ &\quad + \xi_3\Big(\big(6\eta_1^3 + 9\eta_2\eta_1 + 2\eta_3\big)\kappa_{1,1} + \big(3\eta_2 + 6\eta_1^2\big)\kappa_{1,2} + 2\eta_1\kappa_{1,3}\Big)\Big] t^2 + O(t^3).\end{aligned}$$
Also
$$\begin{aligned}\operatorname{Var}\big(S_t^2\big) ={}& \xi_4\Big(\frac{1}{4}\eta_4 + \frac{3}{4}\eta_2^2 + 3\eta_1^2\eta_2 + \frac{4}{3}\eta_1\eta_3 + \eta_1^4\Big) t\\ &+ \Big[\Big(8\eta_1^2\eta_2 + 2\eta_2^2 + 2\eta_1^4 + 4\eta_1\eta_3 + \frac{3}{4}\eta_4\Big)\big(v^2+\xi_2\big)^2\\ &\quad + \Big(\big(6\eta_1^3 + 2\eta_3 + 9\eta_1\eta_2\big)\kappa_{2,1} + \Big(3\eta_1^2 + \frac{3}{2}\eta_2\Big)\kappa_{2,2}\Big)\big(v^2+\xi_2\big)\\ &\quad + \xi_3\Big(\big(6\eta_1^3 + 9\eta_2\eta_1 + 2\eta_3\big)\kappa_{1,1} + \big(3\eta_2 + 6\eta_1^2\big)\kappa_{1,2} + 2\kappa_{1,3}\eta_1\Big)\Big] t^2 + O(t^3).\end{aligned}$$
Since S has stationary increments, we can deduce the moments of the corresponding returns $s_i$ over a time interval of length h by setting t = h and, hence, $E(s_i) = E(S_h)$, $E(s_i^2) = E(S_h^2)$, $E(s_i^3) = E(S_h^3)$, $E(s_i^4) = E(S_h^4)$ and $\operatorname{Var}(s_i^2) = \operatorname{Var}(S_h^2)$.
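The moment formulae collected so far are simple polynomials in the cumulants and can be evaluated directly. A sketch (function names are mine; the expressions are the cumulant-to-moment conversion of Section 4.1 and the stationary moments of Corollary 4.1):

```python
def levy_moments(k1, k2, k3, k4, t):
    """First four moments of Z_t from the cumulants of Z_1."""
    m1 = k1 * t
    m2 = k2 * t + (k1 * t) ** 2
    m3 = k3 * t + 3 * k1 * k2 * t ** 2 + (k1 * t) ** 3
    m4 = (k4 * t + 3 * (k2 * t) ** 2 + 4 * k1 * k3 * t ** 2
          + 6 * k1 ** 2 * k2 * t ** 3 + (k1 * t) ** 4)
    return m1, m2, m3, m4

def sigma_moments(e1, e2, e3, e4):
    """Stationary moments of sigma from eta_i = kappa_i(Y_1), Corollary 4.1."""
    return (e1,
            e1 ** 2 + e2 / 2,
            e1 ** 3 + e3 / 3 + 1.5 * e2 * e1,
            e4 / 4 + e1 ** 4 + 3 * e2 * e1 ** 2 + 4 * e1 * e3 / 3 + 0.75 * e2 ** 2)

# Sanity check: cumulants (0, 1, 0, 0) reproduce the Gaussian moments
# E(W_t^2) = t and E(W_t^4) = 3 t^2.
```
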
4.3. Cumulants by Bartlett's identity In the previous section, we derived the nth moment of S by an application of Itô's formula. An alternative approach is to use Bartlett-type identities for martingales, which have
been derived by Mykland (1994). Because this is a powerful technique which can be applied to virtually all asset price models based on martingales, we briefly review those results and relate them to our previously derived formulae. Let $\Psi = (\Psi_t)_{t \ge 0}$ denote a càdlàg semimartingale, let $0 = t_0, t_1, t_2, \ldots$ denote partitions of $[0,t]$ and let $v = \{\alpha_1, \ldots, \alpha_p\}$ denote an index set. The optional variation is defined as the càdlàg modification of
$$[\Psi; v]_t = [\Psi^{\alpha_1}, \ldots, \Psi^{\alpha_p}]_t = \lim_{\max(t_{i+1}-t_i)\downarrow 0} \sum_i \prod_{\alpha\in v} \big(\Psi^\alpha_{t_{i+1}} - \Psi^\alpha_{t_i}\big).$$
Mykland (1994) shows that the optional variation is well defined, that it is itself a semimartingale and that it has the following form:
$$p = 1{:}\ [\Psi^\alpha]_t = \Psi^\alpha_t, \qquad p = 2{:}\ [\Psi^\alpha, \Psi^\beta]_t = \langle \Psi^{\alpha,c}, \Psi^{\beta,c}\rangle_t + \sum_{0\le s\le t} \Delta\Psi^\alpha_s\, \Delta\Psi^\beta_s, \qquad p \ge 3{:}\ [\Psi^{\alpha_1}, \ldots, \Psi^{\alpha_p}]_t = \sum_{0\le s\le t} \Delta\Psi^{\alpha_1}_s \cdots \Delta\Psi^{\alpha_p}_s,$$
where $\Psi^{\alpha,c}$ denotes the continuous martingale part of $\Psi^\alpha$. Next, one can define the predictable variation $\langle \Psi; v\rangle$ as the compensator of $[\Psi; v]$. The cumulant variation $\kappa(\Psi; v)$ is then defined by
$$\kappa(\Psi; v) = \kappa(\Psi^{\alpha_1}, \ldots, \Psi^{\alpha_p}) = \sum_v (-1)^{q-1}(q-1)!\, \big[\langle\Psi; v_1\rangle, \ldots, \langle\Psi; v_q\rangle\big]_t,$$
where the sum is computed over all partitions $v_1|\cdots|v_q$ of $v$. Conditions for the existence of the predictable and cumulant variation are given in Mykland (1994). Now, we define $U_t(\{\alpha_1,\ldots,\alpha_p\}) = (-1)^{p-1}(p-1)!\,[\Psi^{\alpha_1},\ldots,\Psi^{\alpha_p}]_t$. From Mykland (1994, Theorem 3), we deduce that under certain regularity conditions, we have the following Bartlett-type identities for a local martingale $\Psi$:
$$E\big(U_t(v_1)\cdots U_t(v_p)\big) = 0 \quad\text{and}\quad \operatorname{cum}^{\Upsilon}\big(U_t(v_1),\ldots,U_t(v_p)\big) = 0,$$
where $\operatorname{cum}^{\Upsilon}(U_t(v_1),\ldots,U_t(v_p)) = \sum_{\{1,\ldots,p\}} (-1)^{q-1}(q-1)! \prod_{i=1}^q E\big(\prod_{j\in v_i} U(v_j)\big)$, the sum running over all partitions of $\{1,\ldots,p\}$, and where $\Upsilon$ denotes any set of indices. In this paper, we are in particular interested in the first four moments/cumulants of the local martingale S. From the Bartlett identities, we get (see Mykland, 1994, p. 23): $E(S_t) = 0$, $\operatorname{Var}(S_t) = E([S,S]_t) = (v^2+\xi_2)\int_0^t E(\sigma_s^2)\, ds$ and
$$\kappa_3(S_t) = E\big(S_t^3\big) = 3\operatorname{Cov}(S_t, [S]_t) - 2E\big([S,S,S]_t\big) = 3\big(v^2+\xi_2\big)\int_0^t E\big(S_s\sigma_s^2\big)\, ds + \xi_3 \int_0^t E\big(\sigma_s^3\big)\, ds,$$
$$\begin{aligned}\kappa_4(S_t) &= -8\operatorname{Cov}\big(S_t, [S,S,S]_t\big) - 3\operatorname{Var}\big([S,S]_t\big) + 6\kappa\big(S_t, S_t, [S]_t\big) - 6E\big([S,S,S,S]_t\big)\\ &= -8E\big(S_t[S,S,S]_t\big) - 3\operatorname{Var}\big([S,S]_t\big) + 6\big(E\big([S]_t S_t^2\big) - E\big([S]_t\big)E\big(S_t^2\big)\big) - 6E\big([S,S,S,S]_t\big)\\ &= 6\big(v^2+\xi_2\big)\int_0^t E\big(S_s^2\sigma_s^2\big)\, ds + 4\xi_3\int_0^t E\big(S_s\sigma_s^3\big)\, ds + \xi_4\int_0^t E\big(\sigma_s^4\big)\, ds - 3\big(v^2+\xi_2\big)^2\Big(\int_0^t E\big(\sigma_s^2\big)\, ds\Big)^2.\end{aligned}$$
This result corresponds to the one derived above using the formula which expresses the fourth cumulant in terms of the moments up to fourth order. So, we see that we can use the Bartlett identities for martingales to compute higher-order moments and cumulants of the log-price process. However, to compute the mixed terms, i.e. the joint moments of the log-price and the volatility process, we have to proceed as previously described: by solving a system of inhomogeneous ordinary differential equations. 4.4. First- and second-order properties of the actual variance Recent research has focused on integrated variance as a measure of the variability of financial markets. The integrated variance is defined by $IV_t = \int_0^t \sigma_s^2\, ds$. Note that in the absence of leverage-type effects, we can condition on the volatility process and obtain $\operatorname{Var}(S_t \mid \sigma) = (v^2+\xi_2)\int_0^t \sigma_s^2\, ds$. For ease of exposition we assume throughout this section that $v^2 + \xi_2 = 1$. Often, one is interested in studying the increments of this process over a time interval of length h, say. We denote these increments by
$$\sigma^2_{[(i-1)h,ih]} = IV_{ih} - IV_{(i-1)h} = \int_{(i-1)h}^{ih} \sigma_s^2\, ds,$$
which is generally called the actual variance (AV) on the ith interval of length h and which measures the accumulated variance over a time interval (often chosen to be one day). Now we can compute the mean, variance and covariance of the AV as given in the following proposition. Note that we will be using the following notation throughout the paper:
$$r_\lambda(h) = \frac{1}{\lambda^2}\big(e^{-\lambda h} - 1 + \lambda h\big), \qquad R_\lambda(h,s) = \frac{1}{\lambda^2}\big(e^{-\lambda(s+1)h} - 2e^{-\lambda s h} + e^{-\lambda(s-1)h}\big).$$
PROPOSITION 4.3. The mean, variance and covariance of the actual variance are given by the following formulae:
$$E\big(\sigma^2_{[(i-1)h,ih]}\big) = \Big(\eta_1^2 + \frac{\eta_2}{2}\Big) h,$$
$$\operatorname{Var}\big(\sigma^2_{[(i-1)h,ih]}\big) = \Big(\frac{4}{3}\eta_1\eta_3 + 4\eta_1^2\eta_2\Big) r_\lambda(h) + \Big(\frac{1}{4}\eta_2^2 + \frac{1}{8}\eta_4 + \frac{1}{3}\eta_1\eta_3\Big) r_\lambda(2h),$$
$$\operatorname{Cov}\big(\sigma^2_{[(i-1)h,ih]}, \sigma^2_{[(i+s-1)h,(i+s)h]}\big) = \Big(2\eta_1^2\eta_2 + \frac{2}{3}\eta_1\eta_3\Big) R_\lambda(h,s) + \Big(\frac{1}{8}\eta_2^2 + \frac{1}{16}\eta_4 + \frac{1}{6}\eta_1\eta_3\Big) R_\lambda(2h,s).$$
As already mentioned earlier, in an SV model based on a Brownian motion, the actual variance can be consistently estimated by the realized variance. However, in a more general Lévy-based model, the quadratic variation of the price process does not equal the integrated variance. Hence, we now turn our attention to the quadratic variation of the log-price process and study its first- and second-order properties. We will then be able to compare those with the results just obtained for the integrated variance.
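A sketch of these moment formulae (function names are mine; the inputs are the η cumulants of $Y_1$ defined earlier):

```python
import numpy as np

def r_lam(lam, h):
    """r_lambda(h) = (e^{-lam h} - 1 + lam h) / lam^2."""
    return (np.exp(-lam * h) - 1 + lam * h) / lam ** 2

def R_lam(lam, h, s):
    """R_lambda(h, s) = (e^{-lam(s+1)h} - 2 e^{-lam s h} + e^{-lam(s-1)h}) / lam^2."""
    return (np.exp(-lam * (s + 1) * h) - 2 * np.exp(-lam * s * h)
            + np.exp(-lam * (s - 1) * h)) / lam ** 2

def av_mean(e1, e2, h):
    """E(sigma^2_{[(i-1)h, ih]}) = (eta_1^2 + eta_2 / 2) h, Proposition 4.3."""
    return (e1 ** 2 + e2 / 2) * h

def av_var(e1, e2, e3, e4, lam, h):
    """Variance of the actual variance, Proposition 4.3."""
    return ((4 / 3 * e1 * e3 + 4 * e1 ** 2 * e2) * r_lam(lam, h)
            + (e2 ** 2 / 4 + e4 / 8 + e1 * e3 / 3) * r_lam(lam, 2 * h))
```

Note that $r_\lambda(h) \to h^2/2$ as $\lambda \to 0$, so for slowly mean-reverting volatility the AV variance grows roughly quadratically in the interval length.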
4.5. First- and second-order properties of the quadratic variation The first- and second-order properties of the quadratic variation are described in the following proposition.
PROPOSITION 4.4. Let t, s > 0 and $\xi_1 = 0$. Then
$$E\big([S]_t\big) = \big(v^2+\xi_2\big)\Big(\eta_1^2 + \frac{1}{2}\eta_2\Big) t,$$
$$\operatorname{Var}\big([S]_t\big) = c_1 + c_2 t + c_3 e^{-\lambda t} + c_4 e^{-2\lambda t},$$
$$\operatorname{Cov}\big([S]_t, [S]_{t+s}\big) = c_1 + c_2 t + \frac{1}{2} c_3\big(e^{-\lambda t} - e^{-\lambda s} + e^{-\lambda(t+s)}\big) + \frac{1}{2} c_4\big(e^{-2\lambda t} - e^{-2\lambda s} + e^{-2\lambda(t+s)}\big),$$
where $c_i = c_i(\lambda, v, \xi_2, \xi_4, \eta_1, \eta_2, \eta_3, \eta_4, \kappa_{2,1}, \kappa_{2,2})$ for i = 1, …, 4, with
$$c_1 = \frac{1}{\lambda^2}\Big[\Big(-\frac{5}{3}\eta_1\eta_3 - 4\eta_1^2\eta_2 - \frac{1}{8}\eta_4 - \frac{1}{4}\eta_2^2\Big)\big(v^2+\xi_2\big)^2 + \Big(\Big(-\frac{1}{3}\eta_3 - 3\eta_1\eta_2 - 4\eta_1^3\Big)\kappa_{2,1} + \Big(-\frac{1}{4}\eta_2 - \frac{1}{2}\eta_1^2\Big)\kappa_{2,2}\Big)\big(v^2+\xi_2\big)\Big],$$
$$c_2 = \frac{1}{\lambda}\Big[\Big(2\eta_1\eta_3 + \frac{1}{2}\eta_2^2 + 4\eta_1^2\eta_2 + \frac{1}{4}\eta_4\Big)\big(v^2+\xi_2\big)^2 + \Big(\Big(4\eta_1\eta_2 + 4\eta_1^3 + \frac{2}{3}\eta_3\Big)\kappa_{2,1} + \Big(\eta_1^2 + \frac{1}{2}\eta_2\Big)\kappa_{2,2}\Big)\big(v^2+\xi_2\big)\Big] + \Big(\eta_1^4 + \frac{3}{4}\eta_2^2 + \frac{1}{4}\eta_4 + 3\eta_1^2\eta_2 + \frac{4}{3}\eta_1\eta_3\Big)\xi_4,$$
$$c_3 = \frac{1}{\lambda^2}\Big[\Big(4\eta_1^2\eta_2 + \frac{4}{3}\eta_1\eta_3\Big)\big(v^2+\xi_2\big)^2 + \big(2\eta_1\eta_2 + 4\eta_1^3\big)\kappa_{2,1}\big(v^2+\xi_2\big)\Big],$$
$$c_4 = \frac{1}{\lambda^2}\Big[\Big(\frac{1}{4}\eta_2^2 + \frac{1}{8}\eta_4 + \frac{1}{3}\eta_1\eta_3\Big)\big(v^2+\xi_2\big)^2 + \Big(\Big(\frac{1}{3}\eta_3 + \eta_1\eta_2\Big)\kappa_{2,1} + \Big(\frac{1}{2}\eta_1^2 + \frac{1}{4}\eta_2\Big)\kappa_{2,2}\Big)\big(v^2+\xi_2\big)\Big].$$
From this proposition, we can easily deduce the first- and second-order properties of the incremental quadratic variation (IQV), which is defined by
$$[s]_{[(i-1)h,ih]} = [S]_{ih} - [S]_{(i-1)h} = \int_{(i-1)h}^{ih} \sigma_{u-}^2 \, d[vW + X]_u.$$

PROPOSITION 4.5. Let t, s > 0, ξ1 = 0 and v2 + ξ2 = 1. Then
$$E\big([s]_{[(i-1)h,ih]}\big) = E\big(\sigma^2_{[(i-1)h,ih]}\big),$$
$$\mathrm{Var}\big([s]_{[(i-1)h,ih]}\big) = \mathrm{Var}\big(\sigma^2_{[(i-1)h,ih]}\big) + \xi_4\Big(3\eta_1^2\eta_2 + \frac{3}{4}\eta_2^2 + \eta_1^4 + \frac{4}{3}\eta_1\eta_3 + \frac{1}{4}\eta_4\Big)h + \kappa_{2,1}\big(4\eta_1^3 + 2\eta_1\eta_2\big)r_\lambda(h) + \Big(\kappa_{2,1}\Big(\eta_1\eta_2 + \frac{1}{3}\eta_3\Big) + \kappa_{2,2}\Big(\frac{1}{2}\eta_1^2 + \frac{1}{2}\eta_2\Big)\Big)r_\lambda(2h)$$
and
$$\mathrm{Cov}\big([s]_{[(i-1)h,ih]},\ [s]_{[(i+s-1)h,(i+s)h]}\big) = \mathrm{Cov}\big(\sigma^2_{[(i-1)h,ih]},\ \sigma^2_{[(i+s-1)h,(i+s)h]}\big) + \kappa_{2,1}\big(2\eta_1^3 + \eta_1\eta_2\big)R_\lambda(h,s) + \Big(\kappa_{2,1}\Big(\frac{1}{6}\eta_3 + \frac{1}{2}\eta_1\eta_2\Big) + \kappa_{2,2}\Big(\frac{1}{4}\eta_1^2 + \frac{1}{2}\eta_2\Big)\Big)R_\lambda(2h,s).$$
When we compare the first- and second-order properties of the AV with those of the IQV, we observe the following. First, in the variance of the IQV there is an extra summand given by ξ4 E(σ4)h. Note here that we have assumed that X is a pure jump Lévy process without drift. Hence, νX1 ≠ 0 and, therefore, ξ4 = ∫ℝ x4 νX1(dx) > 0, so this term never vanishes. Secondly, both the variance and the covariance of the IQV have an extra term which is due to a possible leverage-type effect in the model. Clearly, in the absence of this effect (i.e. when X and Y are independent), κ2,1 = κ2,2 = 0, and hence this extra term disappears. So, altogether, we can say that, by choosing a pure jump Lévy process as driving process for the asset price, we obtain an extra term in the variance of the IQV compared to the variance of the AV. If one additionally allows for leverage-type effects, both the variance and the covariance of the IQV have to be generalized by an additional leverage term.

4.6. Covariation of returns

We conclude this section by studying the covariance between returns, squared returns and the IQV. We consider returns over a time interval of length h, denoted by si = Sih − S(i−1)h.

PROPOSITION 4.6.
Let i, s ∈ N. For h → 0, one obtains with ξ1 = 0 and ξ2 + v2 = 1:
$$\mathrm{Cov}(s_i, s_{i+s}) = 0 = \mathrm{Cov}\big(s_i, s^2_{i-s}\big),$$
$$\mathrm{Cov}\big(s_i, s^2_{i+s}\big) = \frac{1}{4}\Big(8\eta_1^2\kappa_{1,1}R_\lambda(h,s) + \big(\kappa_{1,1}\eta_2 + \kappa_{1,2}\eta_1\big)R_\lambda(2h,s)\Big) = \big(\big(2\eta_1^2+\eta_2\big)\kappa_{1,1} + \kappa_{1,2}\eta_1\big)h^2 + O(h^3),$$
$$\mathrm{Cov}\big(s_i^2, s^2_{i+s}\big) = \mathrm{Cov}\big([s]_{[(i-1)h,ih]},\ [s]_{[(i+s-1)h,(i+s)h]}\big) + \kappa_{1,1}\big(2(\eta_2 + 2\eta_1^2)\kappa_{1,1} + 3\kappa_{1,2}\eta_1\big)h^3 + O(h^4).$$
Also
$$\mathrm{Cov}\big(s^2_{[(i-1)h,ih]} - [s]_{[(i-1)h,ih]},\ s^2_{[(i+s-1)h,(i+s)h]} - [s]_{[(i+s-1)h,(i+s)h]}\big) = 0.$$
So we see that the asset returns are uncorrelated. Further, we observe that the covariance between returns and squared returns basically depends on the two leverage parameters κ1,1, κ2,1, which denote the covariation between X1 and Yλ and the joint centred moment of X1² and Yλ, respectively. Recall that the squared returns can also be used for estimating the variance (although such an estimator is noisier than the realized variance). This covariation will damp down
exponentially with the lag length s, and so will the influence of a possible leverage effect. Finally, we observe that the covariance between squared returns can be approximated by the covariance between the IQVs plus terms of lower order which depend on the leverage parameters κ. One can furthermore show that, if κ1,1 = κ1,2 = κ1,3 = 0, then Cov(si², s²_{i+s}) = Cov([s]_{[(i−1)h,ih]}, [s]_{[(i+s−1)h,(i+s)h]}) and, clearly, Cov(si, s²_{i+s}) = 0.
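The small-h approximation in Proposition 4.6 can be checked numerically. The sketch below is our own illustration (with hypothetical parameter values); it compares the exact expression for Cov(s_i, s²_{i+s}) against its leading h² term.

```python
import math

def R_lam(lam, h, s):
    # R_lambda(h, s) = (e^{-lam (s+1) h} - 2 e^{-lam s h} + e^{-lam (s-1) h}) / lam^2
    return (math.exp(-lam * (s + 1) * h) - 2 * math.exp(-lam * s * h)
            + math.exp(-lam * (s - 1) * h)) / lam**2

def cov_ret_sqret(lam, h, s, eta1, eta2, k11, k12):
    # Cov(s_i, s^2_{i+s}) = (1/4) [8 eta1^2 k11 R(h, s) + (k11 eta2 + k12 eta1) R(2h, s)]
    return 0.25 * (8 * eta1**2 * k11 * R_lam(lam, h, s)
                   + (k11 * eta2 + k12 * eta1) * R_lam(lam, 2 * h, s))

def cov_ret_sqret_leading(h, eta1, eta2, k11, k12):
    # leading-order term ((2 eta1^2 + eta2) k11 + k12 eta1) h^2
    return ((2 * eta1**2 + eta2) * k11 + k12 * eta1) * h**2

exact = cov_ret_sqret(lam=1.0, h=1e-4, s=3, eta1=0.5, eta2=0.3, k11=-0.2, k12=-0.1)
lead = cov_ret_sqret_leading(h=1e-4, eta1=0.5, eta2=0.3, k11=-0.2, k12=-0.1)
```

With negative leverage cumulants the covariance between returns and future squared returns is negative, mirroring the empirical leverage effect discussed later.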
5. FIRST- AND SECOND-ORDER PROPERTIES OF THE REALIZED VARIANCE

In this section, we apply the previous results to compute the first- and second-order properties of the realized variance (RV). These results can then be used for studying the degree of inconsistency of the RV as an estimator of the integrated variance. Furthermore, we will use the second-order properties of the realized variance to draw inference on the model parameters.

5.1. First- and second-order properties of the realized variance error

We start by deriving the second-order properties of the realized variance error. Let
$$H_t = S_t^2 - IV_t = S_t^2 - \int_0^t \sigma_s^2 \, ds.$$

PROPOSITION 5.1.
Let t, s > 0, ξ1 = 0 and v2 + ξ2 = 1. Then (for t → 0)
$$E(H_t) = 0, \qquad \mathrm{Cov}(H_t, H_{t+s}) = \mathrm{Var}(H_t),$$
$$E(H_t^2) = \xi_4\Big(\frac{1}{4}\eta_4 + \frac{3}{4}\eta_2^2 + 3\eta_1^2\eta_2 + \frac{4}{3}\eta_1\eta_3 + \eta_1^4\Big)t + \Big\{\Big[\big(6\eta_1^3 + 2\eta_3 + 9\eta_1\eta_2\big)\kappa_{1,1} + \big(6\eta_1^2 + 3\eta_2\big)\kappa_{1,2} + 2\kappa_{1,3}\eta_1\Big]\xi_3 + \Big(\frac{4}{3}\eta_3 + 4\eta_1^3 + 6\eta_1\eta_2\Big)\kappa_{2,1} + \big(2\eta_1^2 + \eta_2\big)\kappa_{2,2} + 2\eta_1^4 + \frac{8}{3}\eta_1\eta_3 + 6\eta_1^2\eta_2 + \frac{3}{2}\eta_2^2 + \frac{1}{2}\eta_4\Big\}t^2 + O(t^3)$$
and
$$\mathrm{Cov}(H_t, IV_t) = \mathrm{Cov}(H_{t+s}, IV_t) = \Big[\kappa_{2,1}\Big(\eta_1^3 + \frac{1}{3}\eta_3 + \frac{3}{2}\eta_1\eta_2\Big) + \Big(\frac{1}{2}\eta_1^2 + \frac{1}{4}\eta_2\Big)\kappa_{2,2}\Big]t^2 + O(t^3).$$
Because both S2 and IV are stationary, we can easily deduce the results for the corresponding increments of the returns and the IV by setting t = h and, hence,
$$E\big(s_i^2 - \sigma^2_{[(i-1)h,ih]}\big) = E(H_h), \qquad E\Big[\big(s_i^2 - \sigma^2_{[(i-1)h,ih]}\big)^2\Big] = E\big(H_h^2\big),$$
$$\mathrm{Cov}\big(s_i^2 - \sigma^2_{[(i-1)h,ih]},\ \sigma^2_{[(i-1)h,ih]}\big) = \mathrm{Cov}(H_h, IV_h).$$
So we observe that even in the presence of leverage, the expectation of H is zero and, hence, there is no bias. However, leverage-type effects (of higher order) do affect the mean square error. Next, we study the properties of the difference between the squared log-price process and the quadratic variation.

PROPOSITION 5.2.
Let t, s > 0 and ξ1 = 0. Define Gt = St² − [S]t. Then (for t → 0)
$$E(G_t) = 0,$$
$$\mathrm{Cov}(G_t, G_{t+s}) = \mathrm{Var}(G_t) = g_1 t^2 + g_2 t^3 + O(t^4),$$
$$\mathrm{Cov}(G_t, [S]_t) = \mathrm{Cov}(G_{t+s}, [S]_t) = g_3 t^2 + g_4 t^3 + O(t^4),$$
$$\mathrm{Cov}(G_t, [S]_{t+s}) = \mathrm{Cov}(G_t, [S]_t) + g_5(s)t^2 + g_6(s)t^3 + O(t^4),$$
where
$$g_1 = \Big(\frac{8}{3}\eta_1\eta_3 + 2\eta_1^4 + \frac{1}{2}\eta_4 + \frac{3}{2}\eta_2^2 + 6\eta_1^2\eta_2\Big)(v^2+\xi_2)^2 + \Big[\Big(\frac{4}{3}\eta_3 + 4\eta_1^3 + 6\eta_1\eta_2\Big)\kappa_{2,1} + \big(2\eta_1^2+\eta_2\big)\kappa_{2,2}\Big](v^2+\xi_2),$$
$$g_2 = \lambda\Big\{\Big(-\frac{1}{3}\eta_4 - \frac{4}{3}\eta_1\eta_3 - \frac{2}{3}\eta_2^2 - \frac{4}{3}\eta_1^2\eta_2\Big)(v^2+\xi_2)^2 + \Big[\Big(-\frac{4}{3}\eta_1^3 - \frac{8}{9}\eta_3 - \frac{10}{3}\eta_1\eta_2\Big)\kappa_{2,1} + \Big(-\frac{2}{3}\eta_2 - \frac{4}{3}\eta_1^2\Big)\kappa_{2,2}\Big](v^2+\xi_2) + \Big[\Big(\frac{8}{3}\eta_2 + \frac{16}{3}\eta_1^2\Big)\kappa_{1,1}^2 + 4\kappa_{1,1}\kappa_{1,2}\eta_1\Big](v^2+\xi_2)\Big\},$$
$$g_3 = \xi_3\Big[\kappa_{1,1}\Big(3\eta_1^3 + \eta_3 + \frac{9}{2}\eta_1\eta_2\Big) + \kappa_{1,2}\Big(3\eta_1^2 + \frac{3}{2}\eta_2\Big) + \kappa_{1,3}\eta_1\Big],$$
$$g_4 = \Big[\Big(\frac{4}{3}\eta_1^2 + \frac{2}{3}\eta_2\Big)\kappa_{1,1}^2 + 2\kappa_{1,1}\kappa_{1,2}\eta_1\Big](v^2+\xi_2) + \Big[\Big(-\eta_1^3 - \frac{5}{2}\eta_1\eta_2 - \eta_3\Big)\kappa_{1,1} + \Big(-2\eta_1^2 - \frac{3}{2}\eta_2\Big)\kappa_{1,2} - \kappa_{1,3}\eta_1\Big]\lambda\xi_3,$$
$$g_5(s) = \frac{1}{\lambda}\Big(-\big(\eta_1^2+\eta_2\big)e^{-2\lambda s} - 2\eta_1^2 e^{-\lambda s} + \eta_2 + 3\eta_1^2\Big)\kappa_{1,1}^2(v^2+\xi_2) + \frac{3}{2}\eta_1\kappa_{1,1}\kappa_{1,2}\big(1 - e^{-2\lambda s}\big),$$
$$g_6(s) = \Big(\frac{4}{3}\big(\eta_1^2+\eta_2\big)e^{-2\lambda s} + \eta_1^2 e^{-\lambda s} - \frac{4}{3}\eta_2 - \frac{7}{3}\eta_1^2\Big)\kappa_{1,1}^2(v^2+\xi_2) + \frac{11}{6}\eta_1\kappa_{1,2}\kappa_{1,1}\big(e^{-2\lambda s} - 1\big).$$
We observe that g3 ≠ 0 if and only if we have leverage-type effects, i.e. if not all of κ1,1, κ1,2, κ1,3 are zero. A similar result holds for Cov(Ht, IVt). That is, without the leverage effect we would have O(t3) terms only (and not O(t2) terms), which corresponds to the results in Barndorff-Nielsen and Shephard (2006, Proposition 4). Recall that the realized variance error (when estimating the IQV
by the RV) is given by
$$[S_\delta]_{[(i-1)h,ih]} - \sigma^2_{[(i-1)h,ih]} = \sum_{j=1}^{M}\big(s^2_{j,i} - \sigma^2_{j,i}\big).$$
Using $s_{j,i} \overset{L}{=} S_\delta$ and $[s]_{j,i} \overset{L}{=} [S]_\delta$ (equality in law), we obtain the following result for the squared returns and the IQV.

COROLLARY 5.1.
Let ξ1 = 0 and v2 + ξ2 = 1. Then
$$E\big([S_\delta]_{[(i-1)h,ih]} - [s]_{[(i-1)h,ih]}\big) = 0, \qquad \mathrm{Var}\big([S_\delta]_{[(i-1)h,ih]} - [s]_{[(i-1)h,ih]}\big) = g_1 h^2 M^{-1} + O(M^{-2})$$
and
$$\mathrm{Cov}\big([S_\delta]_{[(i-1)h,ih]} - [s]_{[(i-1)h,ih]},\ [s]_{[(i-1)h,ih]}\big) = \Big[\Big(3\eta_1^3 + \eta_3 + \frac{9}{2}\eta_1\eta_2\Big)\kappa_{1,1} + \Big(3\eta_1^2 + \frac{3}{2}\eta_2\Big)\kappa_{1,2} + \kappa_{1,3}\eta_1\Big]\xi_3 h^2 M^{-1} + \frac{1}{\lambda}\Big[\Big(\frac{3}{2}\eta_2 + 3\eta_1^2\Big)\kappa_{1,1}^2 + 3\kappa_{1,1}\kappa_{1,2}\eta_1\Big]h^2 M^{-1} + \Big\{\frac{\kappa_{1,1}^2}{2\lambda}\Big[\eta_1^2\big(e^{-2\lambda h} + 4e^{-\lambda h} - 5\big) + \eta_2\big(e^{-2\lambda h} - 1\big)\Big] + \frac{3}{4}\big(1 - e^{-2\lambda h}\big)\eta_1\kappa_{1,2}\kappa_{1,1}\Big\}hM^{-1} + O(M^{-2})$$
and
$$\mathrm{Cov}\big([S_\delta]_{[(i-1)h,ih]} - [s]_{[(i-1)h,ih]},\ [S_\delta]_{[(i+s-1)h,(i+s)h]} - [s]_{[(i+s-1)h,(i+s)h]}\big) = 0.$$
So, the RV is an unbiased estimate of the IQV, the variance of the RV error is of O(M−1) and the RV errors are uncorrelated. These findings correspond to similar results in the Brownian motion case; see Barndorff-Nielsen and Shephard (2002). Furthermore, we see that the covariance between the RV error and the IQV exhibits an O(M−1) term if and only if not all κ = 0. Recall that Barndorff-Nielsen and Shephard (2006) found this term to be of order O(M−2) for time-changed Lévy processes in the absence of leverage.

5.2. Cumulants of realized variance

Using the results earlier, we can now derive the mean, variance and covariance of the realized variance.

PROPOSITION 5.3. Let i, s ∈ N, ξ1 = 0 and v2 + ξ2 = 1. The first- and second-order properties of the realized variance are then given by
$$E\big([S_\delta]_{[(i-1)h,ih]}\big) = \Big(\eta_1^2 + \frac{1}{2}\eta_2\Big)h$$
and
$$\mathrm{Var}\big([S_\delta]_{[(i-1)h,ih]}\big) = \mathrm{Var}\big([s]_{[(i-1)h,ih]}\big) + \Big\{\frac{8}{3}\eta_1\eta_3 + 2\eta_1^4 + \frac{1}{2}\eta_4 + \frac{3}{2}\eta_2^2 + 6\eta_1^2\eta_2 + \frac{\kappa_{1,1}^2}{\lambda}\big(\eta_2 + 3\eta_1^2\big) + \frac{3\kappa_{1,2}\kappa_{1,1}}{\lambda}\eta_1 + \kappa_{2,1}\Big(\frac{4}{3}\eta_3 + 4\eta_1^3 + 6\eta_1\eta_2\Big) + \kappa_{2,2}\big(2\eta_1^2 + \eta_2\big) + \xi_3\Big[\kappa_{1,1}\big(6\eta_1^3 + 2\eta_3 + 9\eta_1\eta_2\big) + \kappa_{1,2}\big(6\eta_1^2 + 3\eta_2\big) + 2\kappa_{1,3}\eta_1\Big]\Big\}h^2 M^{-1} + \Big\{\frac{\kappa_{1,1}^2}{\lambda^2}\Big[\eta_1^2\big(e^{-2\lambda h} + 4e^{-\lambda h} - 5\big) + \eta_2\big(e^{-2\lambda h} - 1\big)\Big] + \frac{3}{2\lambda^2}\kappa_{1,2}\kappa_{1,1}\eta_1\big(e^{-2\lambda h} - 1\big)\Big\}hM^{-1} + O\big(M^{-2}\big)$$
and
$$\mathrm{Cov}\big([S_\delta]_{[(i-1)h,ih]},\ [S_\delta]_{[(i+s-1)h,(i+s)h]}\big) = \mathrm{Cov}\big([s]_{[(i-1)h,ih]},\ [s]_{[(i+s-1)h,(i+s)h]}\big) + \frac{\kappa_{1,1}}{4}\Big[8\kappa_{1,1}\eta_1^2 R_\lambda(h,s) + \big(2\kappa_{1,1}(\eta_1^2+\eta_2) + 3\kappa_{1,2}\eta_1\big)R_\lambda(2h,s)\Big]hM^{-1} + O\big(M^{-2}\big).$$
5.3. Comparing the autocorrelation functions of realized variance, quadratic variation and integrated variance

Now we briefly study some implications of our results for the autocorrelation functions (acfs) of RV, QV and IV. Hereby we follow Barndorff-Nielsen and Shephard (2006), who have studied the same question in the framework of a time-changed Lévy process. From our results earlier, we can deduce that
$$\lim_{M\to\infty}\mathrm{Cor}\big([S_\delta]_{[(i-1)h,ih]},\ [S_\delta]_{[(i+s-1)h,(i+s)h]}\big) = \mathrm{Cor}\big([s]_{[(i-1)h,ih]},\ [s]_{[(i+s-1)h,(i+s)h]}\big) = \frac{\mathrm{Cov}\big(\sigma^2_{[(i-1)h,ih]},\ \sigma^2_{[(i+s-1)h,(i+s)h]}\big) + LC}{\mathrm{Var}\big(\sigma^2_{[(i-1)h,ih]}\big) + \xi_4 E\sigma^4 + LV},$$
where the leverage part in the covariance is denoted by
$$LC = \kappa_{2,1}\big(2\eta_1^3 + \eta_1\eta_2\big)R_\lambda(h,s) + \Big[\kappa_{2,1}\Big(\frac{1}{6}\eta_3 + \frac{1}{2}\eta_1\eta_2\Big) + \kappa_{2,2}\Big(\frac{1}{4}\eta_1^2 + \frac{1}{2}\eta_2\Big)\Big]R_\lambda(2h,s),$$
and the leverage part in the variance is denoted by
$$LV = \kappa_{2,1}\big(4\eta_1^3 + 2\eta_1\eta_2\big)r_\lambda(h) + \Big[\kappa_{2,1}\Big(\eta_1\eta_2 + \frac{1}{3}\eta_3\Big) + \kappa_{2,2}\Big(\frac{1}{2}\eta_1^2 + \frac{1}{2}\eta_2\Big)\Big]r_\lambda(2h).$$
In the absence of leverage-type effects (hence LC = LV = 0), we obtain exactly the same results as derived by Barndorff-Nielsen and Shephard (2006) for time-changed Lévy processes:
• The acf of the RV is monotonically decreasing in ξ4.
• For M → ∞, the acf of the RV is given by
$$\lim_{M\to\infty}\mathrm{Cor}\big([S_\delta]_{[(i-1)h,ih]},\ [S_\delta]_{[(i+s-1)h,(i+s)h]}\big) = \frac{\mathrm{Cov}\big(\sigma^2_{[(i-1)h,ih]},\ \sigma^2_{[(i+s-1)h,(i+s)h]}\big)}{\mathrm{Var}\big(\sigma^2_{[(i-1)h,ih]}\big) + \xi_4 E\sigma^4} < \mathrm{Cor}\big(\sigma^2_{[(i-1)h,ih]},\ \sigma^2_{[(i+s-1)h,(i+s)h]}\big), \quad \text{since } \xi_4 > 0,$$
which implies that the acf of the RV systematically underestimates the acf of the actual variance.
• For ξ4 → ∞, we obtain
$$\lim_{\xi_4\to\infty}\mathrm{Cor}\big([S_\delta]_{[(i-1)h,ih]},\ [S_\delta]_{[(i+s-1)h,(i+s)h]}\big) = 0.$$
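The damping of the RV acf by ξ4 is easy to illustrate in code; the function below is our own sketch of the limiting no-leverage formula, with hypothetical moment values.

```python
def rv_acf_limit(cov_av, var_av, xi4, e_sigma4):
    """Limiting (M -> infinity) autocorrelation of RV in the no-leverage case:
    Cov(AV_i, AV_{i+s}) / (Var(AV) + xi4 * E(sigma^4))."""
    return cov_av / (var_av + xi4 * e_sigma4)

# hypothetical numbers: AV acf at some lag is cov_av / var_av = 0.8
cov_av, var_av, e4 = 0.8, 1.0, 2.0
vals = [rv_acf_limit(cov_av, var_av, x, e4) for x in (0.0, 0.5, 1.0, 10.0)]
```

The sequence `vals` starts at the AV acf (ξ4 = 0), decreases monotonically in ξ4, and tends to zero as ξ4 grows, exactly as the bullet points state.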
However, if we allow for leverage-type effects, we observe the following for the acf of the RV:
• Dependencies between X and (higher) moments of Y (i.e. κ1,1, κ1,2, κ1,3) are asymptotically negligible. In particular, the quantity κ1,1, which describes the classical leverage effect, has asymptotically no influence on the acf of the RV.
• Dependencies between X2 and (higher) moments of Y (i.e. κ2,1, κ2,2) do influence the acf of the RV.

5.4. Superposition model

Let us briefly mention a method for generalizing our model slightly. Many empirical studies have indicated that one-factor stochastic volatility models cannot fit empirical data very satisfactorily. Hence, a standard approach for tackling this problem is to study at least a two-factor (or a multifactor) stochastic volatility model (see e.g. Bates, 1996, Bollerslev et al., 2006). Often, one uses the class of so-called superposition models, where the volatility is not just given by a single OU process (as in our modelling framework), but rather by a convex combination of independent OU processes (see e.g. Barndorff-Nielsen, 2001, Barndorff-Nielsen and Shephard, 2002, and the references therein). We assume that the volatility process is given by a weighted sum of independent OU processes. For J ∈ N and i = 1, . . . , J, let wi ≥ 0 and Σ_{i=1}^{J} wi = 1. Then we define
$$\sigma_t = \sum_{i=1}^{J} w_i \tau_t^{(i)}, \qquad d\tau_t^{(i)} = -\lambda_i \tau_t^{(i)}\,dt + dY^{(i)}_{\lambda_i t},$$
where we assume that the Y(i) are independent (but not necessarily identically distributed). However, as in the one-factor model, we allow for dependence between Xt and Y(i)_{λi t}. In particular, because X is a Lévy process, there is a sequence of independent identically distributed random variables XJ,k for k = 1, . . . , J such that X equals XJ,1 + · · · + XJ,J in law. We write κ(i)_{n,m} = ∫_{ℝ×ℝ+} u^n v^m ν_{X,Y(i)}(du, dv) for the corresponding cumulants of the bivariate Lévy process. In particular, we work in the framework where XJ,i and Y(i) are dependent and XJ,k and Y(i) are independent for all k ≠ i ∈ {1, . . . , J}. When the volatility process is given by a superposition model, the mean, variance and covariance of the realized variance can be derived in a similar way as in the one-factor model. However, due to the fact that we allow
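For independent OU factors, the autocovariance of the superposition is a weight-squared mixture of exponentials. The following sketch is our own illustration of this shape (it is not a formula from the paper; the factor variances are treated as given inputs).

```python
import math

def superposition_acov(s, weights, lams, variances):
    # Cov(sigma_t, sigma_{t+s}) = sum_i w_i^2 Var(tau^{(i)}) exp(-lam_i s)
    # for independent OU factors tau^{(i)} with memory parameters lam_i.
    return sum(w * w * v * math.exp(-l * s)
               for w, v, l in zip(weights, variances, lams))

# hypothetical two-factor example: one slow factor, one fast factor
w, lams, var = [0.7, 0.3], [0.03, 1.0], [1.0, 1.0]
acov = [superposition_acov(s, w, lams, var) for s in range(0, 10)]
```

Mixtures of this kind can reproduce the slowly decaying empirical acf of realized variance better than a single exponential, which is the usual motivation for superposition models.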
for dependencies between the driving process of the asset price and the driving processes of the different components of σ, these formulae become rather lengthy, so we present them in Appendix B.
6. ESTIMATION IN THE PRESENCE OF MARKET MICRO-STRUCTURE EFFECTS

So far, we have ignored any sort of market micro-structure effects in our analysis. Such effects can be caused by, for example, bid/ask spreads, irregular trading and the fact that prices are recorded in discrete time. Recent research on this matter includes articles by Zhou (1996), Aït-Sahalia et al. (2005), Zhang et al. (2005), Bandi and Russell (2007, 2008), Hansen and Lunde (2006), Zhang (2006), Barndorff-Nielsen and Shephard (2007), Barndorff-Nielsen et al. (2008a), Andersen et al. (2010) and the references therein. Because such effects are present in (ultra-)high-frequency data, we investigate in this section the impact of such market micro-structure effects on the second-order properties of the realized variance. These results can then be used in a second step for estimating the cumulants of the noise process in addition to the cumulants of the efficient price process. Now we assume that the observed logarithmic asset price, which we denote by O = (Ot)t≥0, is given by the sum of the efficient logarithmic asset price S (as above) plus a noise term, which is denoted by U = (Ut)t≥0, i.e. Ot = St + Ut. Clearly, we obtain the following result for the realized variance of the observed price process:
$$[O_\delta]_i = \sum_{j=1}^{M} s_{j,i}^2 + 2\sum_{j=1}^{M} s_{j,i}u_{j,i} + \sum_{j=1}^{M} u_{j,i}^2 = [S_\delta]_i + 2\sum_{j=1}^{M} s_{j,i}u_{j,i} + [U_\delta]_i.$$
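The decomposition above is a pathwise identity in the observed returns o_{j,i} = s_{j,i} + u_{j,i}. The toy check below (our own illustration, with arbitrary simulated numbers) verifies it exactly.

```python
import random

random.seed(7)
M = 78
s = [random.gauss(0.0, 0.001) for _ in range(M)]    # efficient-price returns s_{j,i}
u = [random.gauss(0.0, 0.0005) for _ in range(M)]   # noise increments u_{j,i}
o = [sj + uj for sj, uj in zip(s, u)]               # observed returns o_{j,i}

rv_o = sum(x * x for x in o)                        # [O_delta]_i
rv_s = sum(x * x for x in s)                        # [S_delta]_i
rv_u = sum(x * x for x in u)                        # [U_delta]_i
cross = sum(sj * uj for sj, uj in zip(s, u))        # sum_j s_{j,i} u_{j,i}
```

The cross term is what makes the RV of the observed price differ from the sum of the signal and noise RVs, and it is exactly the object whose moments are controlled by the assumptions below.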
To compute the second-order properties of the realized variance in the presence of market micro-structure noise, we will work under some of the following assumptions, which are essentially taken from Hansen and Lunde (2006).

ASSUMPTION 6.1. S is independent of U and E(Ut) = 0 for all t ≥ 0.

ASSUMPTION 6.2. Var(Ut) = u² < ∞ for all t ≥ 0.

ASSUMPTION 6.3. The noise process has zero autocorrelation, i.e. Cor(Ut, Us) = 0 for all s ≠ t, s, t ≥ 0.

ASSUMPTION 6.4. E(Ut4) < ∞ for all t ≥ 0.
PROPOSITION 6.1. Let i ∈ N. The mean of the realized variance is given by
$$E([O_\delta]_i) = E([S_\delta]_i) + 2\sum_{j=1}^{M} E(s_{j,i}u_{j,i}) + E([U_\delta]_i).$$
Under Assumption 6.1,
$$E([O_\delta]_i) = E([S_\delta]_i) + E([U_\delta]_i).$$
Under Assumptions 6.1–6.3,
$$E([O_\delta]_i) = E([S_\delta]_i) + 2Mu^2.$$
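The 2Mu² bias arises because each noise increment u_{j,i} = U_{jδ} − U_{(j−1)δ} has variance 2u² when the noise is serially uncorrelated. A small helper (ours, for illustration) makes this explicit for a general noise autocovariance γ_U:

```python
def expected_noise_rv(M, gamma0, gamma1=0.0):
    # E([U_delta]_i) = M * E(u_{j,i}^2), with u_{j,i} = U_{j delta} - U_{(j-1) delta},
    # so E(u^2) = 2 * (gamma_U(0) - gamma_U(1)); gamma1 = 0 under Assumption 6.3.
    return M * 2.0 * (gamma0 - gamma1)
```

Note that with perfectly positively correlated adjacent noise (γ_U(1) = γ_U(0)) the bias would vanish, which is why Assumption 6.3 is what delivers the clean 2Mu² term.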
PROPOSITION 6.2. Let i, k ∈ N. The variance and covariance of the realized variance (provided they exist) are given by
$$\mathrm{Var}([O_\delta]_i) = \mathrm{Var}([S_\delta]_i) + \mathrm{Var}([U_\delta]_i) + 4\mathrm{Var}\Big(\sum_{j=1}^{M} s_{j,i}u_{j,i}\Big) + 2\mathrm{Cov}([S_\delta]_i, [U_\delta]_i) + 4\mathrm{Cov}\Big(\sum_{j=1}^{M} s_{j,i}u_{j,i},\ [S_\delta]_i\Big) + 4\mathrm{Cov}\Big(\sum_{j=1}^{M} s_{j,i}u_{j,i},\ [U_\delta]_i\Big)$$
and
$$\mathrm{Cov}([O_\delta]_i, [O_\delta]_k) = \mathrm{Cov}([S_\delta]_i, [S_\delta]_k) + \mathrm{Cov}([U_\delta]_i, [U_\delta]_k) + \mathrm{Cov}([S_\delta]_i, [U_\delta]_k) + \mathrm{Cov}([U_\delta]_i, [S_\delta]_k) + 2\mathrm{Cov}\Big(\sum_{j=1}^{M} s_{j,i}u_{j,i},\ [S_\delta]_k\Big) + 2\mathrm{Cov}\Big(\sum_{j=1}^{M} s_{j,i}u_{j,i},\ [U_\delta]_k\Big) + 4\mathrm{Cov}\Big(\sum_{j=1}^{M} s_{j,i}u_{j,i},\ \sum_{j=1}^{M} s_{j,k}u_{j,k}\Big) + 2\mathrm{Cov}\Big([S_\delta]_i,\ \sum_{j=1}^{M} s_{j,k}u_{j,k}\Big) + 2\mathrm{Cov}\Big([U_\delta]_i,\ \sum_{j=1}^{M} s_{j,k}u_{j,k}\Big).$$
Under Assumptions 6.1–6.4,
$$\mathrm{Var}([O_\delta]_i) = \mathrm{Var}([S_\delta]_i) + \mathrm{Var}([U_\delta]_i) + 8Mu^2\mathrm{Var}(S_\delta)$$
and
$$\mathrm{Cov}([O_\delta]_i, [O_\delta]_k) = \mathrm{Cov}([S_\delta]_i, [S_\delta]_k) + \mathrm{Cov}([U_\delta]_i, [U_\delta]_k).$$
So we see that, given a parametric model for the noise process, the method we propose in Section 7 for estimating the cumulants of the driving processes of the asset price and the volatility can in fact be used for additionally estimating the first four cumulants of the noise process and the parameters specifying its autocorrelation function.
7. MODEL ESTIMATION AND INFERENCE

7.1. Quasi-likelihood estimation based on realized variance

Finally, we turn our attention to estimating the parameters of our Lévy-driven stochastic volatility model. It is well known that parameter estimation in such a model framework is difficult because one cannot easily compute the exact likelihood function. Here we follow Barndorff-Nielsen and Shephard (2006) and use a quasi-maximum likelihood approach (see e.g. Gallant, 1997, Chapter 5) based on the Gaussian density function. This methodology leads to consistent and asymptotically normally distributed estimators. Alternative estimation techniques include method of moments (e.g. Bollerslev and Zhou, 2002) and simulation-based methods. For instance, independent work by Roberts et al. (2004), Griffin and Steel (2006) and Frühwirth-Schnatter and Sögner (2009) has focused on the Markov chain Monte Carlo methodology for Bayesian inference in OU stochastic volatility models. Recall that we have shown that we can write the mean, the variance and the covariance of the vector of realized variances [Sδ] = ([Sδ]1, . . . , [Sδ]n) as functions of the model parameters, which we collect in a vector θ, say. We choose the following quasi-maximum likelihood (QML) approach for estimating the parameters. Let
$$l(\theta) = \log(L(\theta)) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\log\big(\det(\mathrm{Cov}([S_\delta]))\big) - \frac{1}{2}\big([S_\delta] - E([S_\delta])\big)'\big(\mathrm{Cov}([S_\delta])\big)^{-1}\big([S_\delta] - E([S_\delta])\big) \tag{7.1}$$
denote the Gaussian realized quasi-likelihood function and let θ̂ = arg maxθ l(θ) denote the QML estimate. To find this estimate, one has to compute the inverse and the determinant of the covariance matrix of the RV vector, which is in general an operation of order O(n3). However, since σ is stationary, [Sδ] is itself stationary. Hence, Cov([Sδ]) is a Toeplitz matrix, which can be inverted by using the Levinson–Durbin algorithm, see Levinson (1947) and Durbin (1960), in O(n2) operations. Basically, one uses a Choleski decomposition of the covariance matrix (see e.g. Doornik, 2001), with Cov([Sδ]) = LDL' = PP', where L is lower triangular with ones on the diagonal and D is a diagonal matrix with the variances of the residuals (which are denoted by E) as entries. So the likelihood function (7.1) can be written as
$$l(\theta) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\log(\det(D)) - \frac{1}{2}E'E,$$
where the residuals E are given by
$$E = D^{-1/2}L^{-1}\big([S_\delta] - E([S_\delta])\big) = P^{-1}\big([S_\delta] - E([S_\delta])\big).$$
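The O(n²) evaluation of the Gaussian quasi-likelihood for a stationary series can be sketched with the Durbin–Levinson recursion, which delivers exactly the one-step prediction errors and variances that appear in the Choleski factorization. This is our own illustrative implementation, not the authors' code.

```python
import numpy as np

def gaussian_loglik_toeplitz(x, mu, gamma):
    """Gaussian log-likelihood of a stationary series x with mean mu and
    autocovariances gamma[0..n-1], via Durbin-Levinson one-step predictions
    (O(n^2) instead of the O(n^3) direct inversion)."""
    y = np.asarray(x, dtype=float) - mu
    gamma = np.asarray(gamma, dtype=float)
    n = len(y)
    v = gamma[0]                 # prediction error variance, starts at gamma(0)
    phi = np.zeros(0)            # coefficients of the best linear one-step predictor
    ll = -0.5 * n * np.log(2 * np.pi) - 0.5 * (np.log(v) + y[0] ** 2 / v)
    for i in range(1, n):
        # Durbin-Levinson update of the predictor of y[i] from y[i-1], ..., y[0]
        a = (gamma[i] - phi @ gamma[i - 1:0:-1]) / v
        phi = np.concatenate([phi - a * phi[::-1], [a]])
        v = v * (1.0 - a * a)
        e = y[i] - phi @ y[i - 1::-1]        # one-step prediction error
        ll += -0.5 * (np.log(v) + e * e / v)
    return ll
```

As a check, the same value is obtained from the dense Toeplitz covariance matrix with a direct determinant and solve; the recursion simply avoids forming and inverting that matrix.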
REMARK 7.1. We can express the likelihood function in terms of the mean, variance and covariance of the linear predictions of the RV. Assume that f denotes the joint density of the time series of the RV. Straightforwardly, we get
$$f([S_\delta]_1, \ldots, [S_\delta]_n) = f([S_\delta]_1)\prod_{i=2}^{n} f\big([S_\delta]_i \mid [S_\delta]_{i-1}, \ldots, [S_\delta]_1\big).$$
Now let EL(yi|Fi−1) and VarL(yi|Fi−1) denote the mean and the variance of the linear prediction for yi = [Sδ]i. To construct a quasi-likelihood, we assume that f is given by a Gaussian density and hence
$$l(\theta) = \log(L(\theta, y)) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\sum_{i=1}^{n}\log\big(\mathrm{Var}_L([S_\delta]_i \mid \mathcal{F}_{i-1})\big) - \frac{1}{2}\sum_{i=1}^{n}\frac{\big([S_\delta]_i - E_L([S_\delta]_i \mid \mathcal{F}_{i-1})\big)^2}{\mathrm{Var}_L([S_\delta]_i \mid \mathcal{F}_{i-1})}.$$
So we observe that the entries in the diagonal matrix D in the Choleski decomposition are exactly the variances of the best linear unbiased one-step-ahead forecasts of the RV. So far, we have only discussed how the model parameters can be estimated. In the remaining part of this section, we briefly describe how one can draw inference on the model parameters. Let
$$J = \lim_{n\to\infty}\frac{1}{n}\mathrm{Var}\Big(\frac{\partial l(\theta)}{\partial\theta}\Big) \quad\text{and}\quad I = \lim_{n\to\infty}-\frac{1}{n}E\Big(\frac{\partial^2 l(\theta)}{\partial\theta\,\partial\theta'}\Big).$$
It is well known (see e.g. Gallant, 1997) that not only the maximum likelihood estimator but also the QML estimator is asymptotically normally distributed, with an adjusted covariance matrix (compared to the ML setting), and hence
$$\sqrt{n}\big(\hat\theta - \theta\big)\ \xrightarrow{d}\ N\big(0,\ I^{-1}JI^{-1}\big).$$
Based on this asymptotic result, we can construct 95% confidence intervals for θ, which are of the form
$$\hat\theta - \frac{1.96}{\sqrt{n}}\big(I^{-1}JI^{-1}\big)^{1/2} \le \theta \le \hat\theta + \frac{1.96}{\sqrt{n}}\big(I^{-1}JI^{-1}\big)^{1/2},$$
where the square root of a positive (semi-)definite matrix Σ, say, is defined as the matrix Σ^{1/2} such that Σ^{1/2}Σ^{1/2} = Σ. Estimating the sandwich matrices I, which only appear in a QML setting and account for the fact that the estimation was not based on the true density function, does not cause any problems, whereas estimating the covariance matrix J is more complicated. Here we have used spectral methods based on the approach popularized by Newey and West (1987). That is, let ĥt = ∂l(yt, θ̂)/∂θ and let m denote the number of non-zero autocorrelations of ht. Then
$$\hat J = \hat\Gamma_0 + \sum_{j=1}^{m}\Big(1 - \frac{j}{m+1}\Big)\big(\hat\Gamma_j + \hat\Gamma_j'\big), \qquad \hat\Gamma_j = \frac{1}{n}\sum_{t=j+1}^{n}\hat h_t \hat h_{t-j}'.$$
The sandwich matrix can be estimated straightforwardly by
$$\hat I = -\frac{1}{n}\frac{\partial^2}{\partial\theta\,\partial\theta'} l(y,\hat\theta) = -\frac{1}{n}\sum_{i=1}^{n}\frac{\partial^2}{\partial\theta\,\partial\theta'} l(y_i,\hat\theta).$$
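A compact sketch of the Newey–West estimator and the resulting sandwich half-widths is given below. This is our own illustration: `scores` stands for the n × p matrix of score contributions ĥ_t, and the matrix square root is taken via an eigendecomposition.

```python
import numpy as np

def newey_west(scores, m):
    """J-hat = Gamma_0 + sum_{j=1}^m (1 - j/(m+1)) (Gamma_j + Gamma_j'),
    with Gamma_j = (1/n) sum_{t=j+1}^n h_t h_{t-j}'."""
    h = np.asarray(scores, dtype=float)
    n = h.shape[0]
    J = h.T @ h / n
    for j in range(1, m + 1):
        G = h[j:].T @ h[:-j] / n
        J += (1.0 - j / (m + 1.0)) * (G + G.T)
    return J

def sandwich_half_width(I, J, n, z=1.96):
    """(z / sqrt(n)) * (I^{-1} J I^{-1})^{1/2}, the matrix half-width
    of the confidence region, with a symmetric matrix square root."""
    S = np.linalg.inv(I) @ J @ np.linalg.inv(I)
    w, V = np.linalg.eigh((S + S.T) / 2.0)
    root = V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T
    return z / np.sqrt(n) * root
```

Coordinate-wise 95% intervals follow by taking the diagonal of the half-width matrix around θ̂.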
7.2. Simulation study

Before we turn to an empirical study, we briefly check the proposed estimation method by means of a simulation study. In principle, we can use the quasi-maximum likelihood method to estimate the cumulants of (X, Yλ) and the parameters λ, v without specifying any parametric model. However, we have found that the performance of the estimation technique in such a very general modelling framework is not particularly good, since some of the cumulants are only weakly identified. Hence, we suggest specifying a fully parametric model and using the described technique to draw inference on the parameters of that model. Clearly, there is a huge class of parametric models in line with our general model. In particular, there are various ways of modelling the dependence between the asset price and the volatility. In this paper, we focus on a very natural choice of a parametric model, which is given as follows. We assume that the pure jump Lévy process X is defined by
$$X_t = \rho\big(Y_{\lambda t} - \bar Y_{\lambda t}\big), \tag{7.2}$$
where Y, Ȳ are i.i.d. driftless subordinators and ρ ∈ ℝ describes the leverage effect. We carry out the simulation study for two different choices of the distribution of Y. Either Y is chosen to be a Gamma process with Yλ(t+dt) − Yλt ∼ Γ(rλ dt, α) for α, r > 0, or Y is chosen to be an inverse Gaussian process with Yλ(t+dt) − Yλt ∼ IG(λμ dt, lλ²dt²) for μ, l > 0. From these models, we simulate price data for 2515 business days, which corresponds to a time horizon of approximately 10 years and is exactly the time period we study in our empirical work. Each day we simulate 4680 data points, which corresponds to a data point every 5 sec in a market which is open for 6.5 h a day. Then we construct the time series of 10-, 5- and 1-min returns, i.e. M = 39, 78 and 390 observations per day. From these returns, we compute the realized variance and we estimate the cumulants of the driving processes based on the quasi-maximum likelihood method described earlier. When we look at the formulae for the second-order properties of realized variance given in Proposition 5.3, we see that the leverage parameter ρ only enters squared. That is, we cannot determine the sign of the leverage parameter, but only its absolute value. To ensure the identifiability of the model parameters, we impose that v2 + ξ2 = 1. We have carried out this simulation study for different choices of parameters. Here we present tables with our estimation results, where we vary the value of the memory parameter λ (which leads to different values of the parameter ρ given the constraint v2 + ξ2 = 1). Note that the number of Monte Carlo replications is 500. Table 1 contains the results of the simulation study when Y is drawn from the Gamma distribution and Table 2 presents the findings for the case that Y is given by an inverse Gaussian process. The findings in both cases are in fact very similar.
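A minimal simulation sketch of the Gamma case is shown below. It is our own illustration, under the assumptions that the Gamma-OU volatility is discretized with an Euler step, that the leverage term follows (7.2), and that v² = 0.85 so the jump part carries 15% of the variation (cf. the table notes); the sample length and parameter values are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(42)
lam, r, alpha, rho = 1.0, 20.0, 20.0, 1.224   # cf. the second panel of Table 1
days, M = 50, 78                              # short sample; one day = unit time
dt = 1.0 / M
n = days * M

# increments of the Gamma subordinator: Y_{lam(t+dt)} - Y_{lam t} ~ Gamma(r lam dt, alpha)
dY = rng.gamma(shape=r * lam * dt, scale=1.0 / alpha, size=n)
dYbar = rng.gamma(shape=r * lam * dt, scale=1.0 / alpha, size=n)  # independent copy

tau = np.empty(n + 1)
tau[0] = r / alpha                            # start at the stationary mean
for t in range(n):
    # Euler step of d tau = -lam * tau dt + dY_{lam t}
    tau[t + 1] = tau[t] - lam * tau[t] * dt + dY[t]

sigma = np.sqrt(tau[:n])
v = np.sqrt(0.85)
# returns driven by v W + X with X_t = rho (Y_{lam t} - Ybar_{lam t}), as in (7.2)
ret = sigma * (v * np.sqrt(dt) * rng.standard_normal(n) + rho * (dY - dYbar))
rv = (ret ** 2).reshape(days, M).sum(axis=1)  # daily realized variances
```

The resulting daily RV series is the input to the quasi-likelihood of Section 7.1.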
We observe that our estimates are pretty close to the true parameter values, and the estimation results tend to become better when M increases, i.e. when we compute the realized variance based on returns at high frequencies. In particular, the root mean square errors (RMSEs) we obtain are much smaller when we use M = 390 rather than M = 39. We also see that we get in most cases particularly good estimates for the memory parameter λ (where the RMSEs are quite small), whereas the RMSEs for the leverage parameter ρ tend to be bigger, in particular for the choice λ = 1. In the Gamma case, it turns out that we can estimate the parameters α and r equally well, whereas in the inverse Gaussian
Table 1. Simulation results for a Gamma process Y.

Parameter   True value   M     Mean     Median   RMSE
λ           0.03         39    0.032    0.032    0.009
                         78    0.033    0.032    0.009
                         390   0.033    0.032    0.008
r           20           39    22.174   21.845   5.300
                         78    22.114   21.704   5.341
                         390   22.186   21.687   5.178
α           20           39    22.151   21.560   5.478
                         78    22.076   21.544   5.403
                         390   22.037   21.513   5.150
ρ           7.071        39    8.089    8.046    1.811
                         78    8.089    7.999    1.626
                         390   8.063    7.998    1.489
λ           1            39    0.999    1.001    0.138
                         78    0.999    0.995    0.116
                         390   0.999    0.998    0.100
r           20           39    20.588   20.164   5.197
                         78    20.113   19.842   3.951
                         390   19.922   20.032   3.031
α           20           39    20.680   20.267   5.176
                         78    20.078   19.832   3.904
                         390   19.789   19.856   2.989
ρ           1.224        39    1.281    1.446    0.959
                         78    1.217    1.338    0.756
                         390   1.227    1.364    0.593
Note: We simulate 4680 data points per day (i.e. one observation every 5 sec in a 6.5 h market) for 2515 days. Then we compute the realized variance based on 10-min returns (M = 39), 5-min returns (M = 78), 1-min returns (M = 390). The number of Monte Carlo replications is 500. We report the true parameter values, the mean and median of the estimates and the root mean square errors (RMSE). Note that the parameter choice implies that ξ 2 = 0.15 and v 2 = 0.85, which means that the jump part of the total variation is given by 15%.
case, the results we obtain for the mean parameter μ are much more precise than the ones for the second parameter l.

7.3. Empirical study

To illustrate the applicability of our previously derived theoretical results, we carry out an empirical study. Because leverage-type effects are usually particularly strong in index data, we have chosen high-frequency data on Standard & Poor's Depository Receipts (SPY). Note that the SPY is a highly liquid, exchange-traded fund which holds all of the S&P 500 Index stocks. We work with a sample from August 3, 1998, to July 31, 2008, i.e. 10 years of data. The data are the
Table 2. Simulation results for an inverse Gaussian process Y.

Parameter   True value   M     Mean      Median    RMSE
λ           0.03         39    0.0333    0.0312    0.0361
                         78    0.0350    0.0317    0.0382
                         390   0.0349    0.0319    0.0434
μ           1            39    1.0314    0.9996    0.1420
                         78    1.0035    1.0039    0.0324
                         390   1.0240    1.0087    0.1705
l           20           39    21.6922   21.3719   7.1717
                         78    22.4757   21.8509   6.7202
                         390   22.6071   22.0778   6.7563
ρ           7.071        39    8.0554    8.1731    3.0555
                         78    8.4517    8.2456    2.4066
                         390   8.3313    8.1008    2.4202
λ           1            39    1.011     1.012     0.207
                         78    1.023     1.016     0.151
                         390   1.024     1.026     0.135
μ           1            39    1.003     0.999     0.043
                         78    1.006     1.006     0.008
                         390   1.011     1.011     0.012
l           20           39    20.236    20.128    5.091
                         78    20.026    19.907    4.133
                         390   19.969    19.943    3.357
ρ           1.224        39    1.433     1.496     0.726
                         78    1.415     1.468     0.556
                         390   1.400     1.407     0.439
Note: We simulate 4680 data points per day (i.e. one observation every 5 sec in a 6.5 h market) for 2515 days. Then we compute the realized variance based on 10-min returns (M = 39), 5-min returns (M = 78), 1-min returns (M = 390). The number of Monte Carlo replications is 500. We report the true parameter values, the mean and median of the estimates and the root mean square errors (RMSE). Note that the parameter choice implies that ξ 2 = 0.15 and v 2 = 0.85, which means that the jump part of the total variation is given by 15%.
collection of trades and quotes taken from the TAQ database through the Wharton Research Data Services (WRDS) system and were recorded at the AMEX from 1998 to 2002 and at the PACIF from 2003 to 2008. In our empirical study, we focus on the mid-quotes of the high-frequency data. Note that the raw high-frequency data have been cleaned using the methods described in Barndorff-Nielsen et al. (2008b); the pre-processed data have been kindly supplied to the author by Asger Lunde. When we sample the data at 5-min intervals (using the previous-tick method), we obtain n = 2515 days of 78 observations each, i.e. 196,170 data points. A plot of the cleaned SPY log-price data and the time series of the realized variances with their autocorrelation is given in Figure 1. Note that the cleaned daily log-mid-prices are shown in Figure 1(a), and the daily log-mid-price returns are given in Figure 1(b). Figure 1(c) contains the time series of the daily realized
Figure 1. SPY data from August 3, 1998, to July 31, 2008.
variances based on 5-min returns and Figure 1(d) provides the corresponding autocorrelation function for daily realized variances based on 5-min returns. Before we estimate the model parameters, we carry out a brief non-parametric check of whether we can find leverage-type effects in our SPY data. To do that, we compute the empirical cross-correlation between returns and realized variances, which we denote by
$$L(\tau) = \frac{\sum_{i=\max(1,1-\tau)}^{\min(n,n-\tau)}\big(s_i - \bar s\big)\big([S_\delta]_{i+\tau} - \overline{[S_\delta]}\big)}{\sqrt{\sum_{i=1}^{n}\big(s_i - \bar s\big)^2 \sum_{i=1}^{n}\big([S_\delta]_i - \overline{[S_\delta]}\big)^2}},$$
for τ ∈ {−50, . . . , 50}, where $\bar s = \frac{1}{n}\sum_{i=1}^{n} s_i$ and $\overline{[S_\delta]} = \frac{1}{n}\sum_{i=1}^{n} [S_\delta]_i$ denote the sample means of the returns and the realized variance, respectively. The results are provided in Figure 2. Similarly to Bouchaud et al. (2001), the function L(τ) can be interpreted as a kind of leverage correlation function. In addition to this function, we plot the Bartlett confidence bounds for the hypothesis that there is no leverage, which are given by ±1.96/√n = ±0.039. We can clearly see that returns and future realized variances are negatively correlated for approximately 11 days. Also, it seems that returns and past realized variances are hardly correlated. These findings are in line with the theoretical results implied by our model, which have been stated in Proposition 4.6. Now, we draw inference on the model parameters based on the time series of daily realized variances computed from 5-min returns. We use the same parametric model as in the simulation study and present the corresponding estimates and confidence bounds in Table 3. Note
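The leverage correlation function is straightforward to compute from the daily return and RV series; the following sketch is our own implementation of the formula above.

```python
import math

def leverage_corr(s, rv, tau):
    """Empirical cross-correlation L(tau) between returns s_i and realized
    variances [S_delta]_{i+tau}, normalized by the full-sample sums of squares."""
    n = len(s)
    sbar = sum(s) / n
    rbar = sum(rv) / n
    num = sum((s[i] - sbar) * (rv[i + tau] - rbar)
              for i in range(max(0, -tau), min(n, n - tau)))
    den = math.sqrt(sum((x - sbar) ** 2 for x in s)
                    * sum((x - rbar) ** 2 for x in rv))
    return num / den

# Bartlett bound for the no-leverage hypothesis: +- 1.96 / sqrt(n)
def bartlett(n):
    return 1.96 / math.sqrt(n)
```

Values of L(τ) outside ±bartlett(n) at positive lags τ are the signature of a leverage effect as in Figure 2.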
C 2011 The Author(s). The Econometrics Journal C 2011 Royal Economic Society.
A. E. D. Veraart
Figure 2. Leverage correlation function.
that Table 3a contains the results when the pure jump processes are given by Gamma processes, whereas Table 3b contains the estimates for the inverse Gaussian case. We see that the estimates of the memory parameter λ are essentially the same in both models, and the estimated parameters α, r, μ, l lead to approximately the same mean and variance of the process Y. Furthermore, we obtain a very high value for the leverage parameter ρ, which can be estimated slightly more precisely in the inverse Gaussian case. In addition to the tables with the parameter estimates, we present a plot of the empirical and the fitted autocorrelation functions of the realized variances (Figure 3). To investigate the impact of the leverage effect due to the jump component on the model fit, we first estimated the model under the assumption that ρ = 0. The corresponding fitted and empirical autocorrelation functions of the realized variance in the inverse Gaussian case are given in Figure 3(a). The model fit is clearly poor; this is also confirmed by the BP statistic computed from the squared residuals. In the absence of leverage, the Gamma case produced an even worse fit than the inverse Gaussian case, so we do not present those results here. It is thus evident that we need a more general model than a Brownian semimartingale. Hence, in a next step, we fit the parametric model specified in (7.2), with leverage, for both the Gamma and the inverse Gaussian case. In terms of model fit, both models perform equally well and, in particular, result in a greatly improved fit compared to the no-leverage case. The corresponding fitted and empirical autocorrelation functions are presented in Figure 3(b).
Note that we only provide one plot, because the figures look almost identical for the Gamma and the inverse Gaussian case.
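The leverage correlation function L(τ) defined above is simple to compute directly from its definition. The sketch below is our own illustration (the synthetic series used to exercise it are not the SPY data):

```python
import numpy as np

def leverage_correlation(returns, rv, max_lag=50):
    """Empirical cross-correlation L(tau) between daily returns s_i and
    realized variances [S_delta]_{i+tau} for tau in {-max_lag, ..., max_lag},
    normalized by the product of the two sample standard deviations."""
    s = np.asarray(returns, dtype=float)
    v = np.asarray(rv, dtype=float)
    n = len(s)
    sc, vc = s - s.mean(), v - v.mean()          # centred series
    denom = np.sqrt((sc ** 2).sum()) * np.sqrt((vc ** 2).sum())
    taus = np.arange(-max_lag, max_lag + 1)
    L = np.empty(len(taus))
    for j, tau in enumerate(taus):
        lo, hi = max(0, -tau), min(n, n - tau)   # valid 0-based indices i
        L[j] = (sc[lo:hi] * vc[lo + tau:hi + tau]).sum() / denom
    return taus, L

# Bartlett-type 95% bound under the null of no leverage: +/- 1.96 / sqrt(n)
```

Under the no-leverage null, values of L(τ) outside ±1.96/√n (≈ ±0.039 for n = 2515) are significant at the 5% level.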
Table 3. Empirical results.

(a) Gamma
Parameter       Estimate     Standard error
λ               0.037        0.023
r               2.664        0.640
α               244.2        16.834
|ρ|             218.59       10.276
Likelihood      17,940.4
m               10
BP (fitted)     61.205
BP (raw data)   3922.16

(b) Inverse Gaussian
Parameter       Estimate     Standard error
λ               0.036        0.028
μ               0.011        0.001
l               0.034        0.014
|ρ|             195.44       4.685
Likelihood      17,940.4
m               10
BP (fitted)     61.154
BP (raw data)   3922.16

Note: Estimation results for the parametric model (7.2) based on the time series of realized variances using 5-min returns. We report the parameter estimates and the corresponding robust standard errors. The last four lines in each panel contain the value of the quasi-likelihood function, the value of the Box–Pierce statistic based on 20 lags computed from the scaled residuals (BP (fitted)) and the value of the Box–Pierce statistic based on 20 lags computed from the raw data (BP (raw data)). Finally, m denotes the lag length of the Newey–West estimate of the asymptotic variance.
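The Box–Pierce diagnostic reported in the table can be reproduced in a few lines. This sketch is our own (the lag length is a generic argument) and computes BP = n Σ_{k=1}^{m} ρ̂_k²:

```python
import numpy as np

def box_pierce(x, lags=20):
    """Box-Pierce statistic BP = n * sum_{k=1}^{lags} rho_k^2, where rho_k is
    the lag-k sample autocorrelation of x; asymptotically chi-squared with
    `lags` degrees of freedom under the null of no serial correlation."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    denom = (xc ** 2).sum()
    rho = np.array([(xc[:-k] * xc[k:]).sum() / denom for k in range(1, lags + 1)])
    return n * (rho ** 2).sum()
```

Large values of BP relative to the chi-squared critical value indicate remaining serial correlation, as for the raw-data rows in the table.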
Figure 3. Empirical and fitted autocorrelation function.
Finally, we have fitted a superposition model consisting of two volatility factors to our data. Again, we have obtained equally good results for the Gamma and the inverse Gaussian case and, hence, provide only one plot of the fitted and empirical autocorrelation function; see Figure 3(c). We see that, by allowing for a second volatility factor, the slower decay of the autocorrelation function at higher lags can be described better.
Altogether, we can say that our one-factor volatility model allowing for leverage and jumps results in a much better fit than a purely Brownian-motion-driven model. By allowing for a second volatility factor, the fit can be improved further. Although the particular choice of the distribution of the subordinator Y did not matter much, the presence of the leverage effect in the jump component improved the model fit greatly.
8. CONCLUSION

In this paper, we have studied the impact of jumps and leverage-type effects on returns and realized variances in Lévy-driven stochastic volatility models based on scaled Lévy processes. In particular, we have derived explicit expressions for the cumulants of the returns and the realized variance by solving a recursive system of inhomogeneous ordinary differential equations. This technique turns out to be very powerful and might be applicable to a wider class of asset price models. We have shown that non-parametric volatility estimators such as realized variance can not only be used for model-free volatility estimation and forecasting, but are also useful tools in a quasi-maximum likelihood (or generalized method of moments) framework for drawing inference on the model parameters of fully parametric stochastic volatility models. We have carried out a simulation study and an empirical study and have illustrated how the new results can be used in practice. In our empirical work based on the SPY mid-quote data, we have found strong evidence for the existence of a leverage effect. In future work, it will be interesting to study other classes of fully parametric models which fit into our general modelling framework. For example, one could model the dependence between the pure jump Lévy process driving the price and the one driving the volatility by means of a Lévy copula, and estimate the parameters of the copula based on our newly established estimation method.
ACKNOWLEDGMENTS This paper is a revised chapter of my D.Phil. thesis and, therefore, I wish to thank my supervisors Neil Shephard and Matthias Winkel for their guidance and support throughout this project. Financial support by the Rhodes Trust and by the Centre for Research in Econometric Analysis of Time Series, CREATES, funded by the Danish National Research Foundation, is gratefully acknowledged.
REFERENCES

Aït-Sahalia, Y., P. A. Mykland and L. Zhang (2005). How often to sample a continuous-time process in the presence of market microstructure noise. Review of Financial Studies 18, 351–416.
Andersen, T. G. and T. Bollerslev (1998). Answering the skeptics: yes, standard volatility models do provide accurate forecasts. International Economic Review 39, 885–905.
Andersen, T. G., T. Bollerslev and F. X. Diebold (2010). Parametric and nonparametric measurement of volatility. In Y. Aït-Sahalia and L. P. Hansen (Eds.), Handbook of Financial Econometrics, 67–138. Amsterdam: North Holland.
Andersen, T. G., T. Bollerslev, F. X. Diebold and H. Ebens (2001). The distribution of realized stock return volatility. Journal of Financial Economics 61, 43–76.
Bandi, F. M. and J. R. Russell (2007). Volatility. In J. R. Birge and V. Linetsky (Eds.), Handbook of Financial Engineering, 183–222. Amsterdam: Elsevier.
Bandi, F. M. and J. R. Russell (2008). Microstructure noise, realized volatility, and optimal sampling. Review of Economic Studies 75, 339–69.
Barndorff-Nielsen, O. E. (2001). Superposition of Ornstein–Uhlenbeck type processes. Theory of Probability and its Applications 45, 175–94.
Barndorff-Nielsen, O. E., P. R. Hansen, A. Lunde and N. Shephard (2008a). Designing realized kernels to measure the ex-post variation of equity prices in the presence of noise. Econometrica 76, 1481–536.
Barndorff-Nielsen, O. E., P. R. Hansen, A. Lunde and N. Shephard (2008b). Multivariate realised kernels: consistent positive semi-definite estimators of the covariation of equity prices with noise and non-synchronous trading. CREATES Working Paper No. 08-63, Aarhus University.
Barndorff-Nielsen, O. E. and N. Shephard (2001). Non-Gaussian Ornstein–Uhlenbeck-based models and some of their uses in financial economics (with discussion). Journal of the Royal Statistical Society, Series B, 63, 167–241.
Barndorff-Nielsen, O. E. and N. Shephard (2002). Econometric analysis of realised volatility and its use in estimating stochastic volatility models. Journal of the Royal Statistical Society, Series B, 64, 253–80.
Barndorff-Nielsen, O. E. and N. Shephard (2003). Realised power variation and stochastic volatility. Bernoulli 9, 243–65.
Barndorff-Nielsen, O. E. and N. Shephard (2006). Impact of jumps on returns and realised variances: econometric analysis of time-deformed Lévy processes. Journal of Econometrics 131, 217–52.
Barndorff-Nielsen, O. E. and N. Shephard (2007). Variation, jumps, market frictions and high frequency data in financial econometrics. In R. Blundell, P. Torsten and W. K. Newey (Eds.), Advances in Economics and Econometrics, Theory and Applications, Ninth World Congress, Econometric Society Monographs, 328–72. Cambridge: Cambridge University Press.
Barndorff-Nielsen, O. E. and N. Shephard (2011). Financial Volatility: Stochastic Volatility and Lévy Based Models. Forthcoming.
Cambridge: Cambridge University Press.
Bates, D. S. (1996). Jumps and stochastic volatility: exchange rate processes implicit in PHLX Deutsche Mark options. Review of Financial Studies 9, 69–107.
Bekaert, G. and G. Wu (2000). Asymmetric volatility and risk in equity markets. Review of Financial Studies 13, 1–42.
Bertoin, J. (1996). Lévy Processes. Cambridge: Cambridge University Press.
Black, F. (1976). Studies of stock price volatility changes. Proceedings of the 1976 Meeting of the Business and Economic Statistics Section, American Statistical Association, 177–81.
Black, F. and M. S. Scholes (1973). The pricing of options and corporate liabilities. Journal of Political Economy 81, 637–54.
Bollerslev, T., J. Litvinova and G. Tauchen (2006). Leverage and volatility feedback effects in high-frequency data. Journal of Financial Econometrics 4, 353–84.
Bollerslev, T. and H. Zhou (2002). Estimating stochastic volatility diffusion using conditional moments of integrated volatility (plus corrections). Journal of Econometrics 109, 33–65.
Bouchaud, J.-P., A. Matacz and M. Potters (2001). Leverage effect in financial markets: the retarded volatility model. Physical Review Letters 87, 228701-1–4.
Carr, P., H. Geman, D. B. Madan and M. Yor (2003). Stochastic volatility for Lévy processes. Mathematical Finance 13, 345–82.
Carr, P. and L. Wu (2004). Time-changed Lévy processes and option pricing. Journal of Financial Economics 71, 113–41.
Christie, A. A. (1982). The stochastic behavior of common stock variances. Journal of Financial Economics 10, 407–32.
Cont, R. and P. Tankov (2004). Financial Modelling with Jump Processes. Financial Mathematics Series. Boca Raton, FL: Chapman and Hall.
Doornik, J. A. (2001). Ox 3.0—An Object-Oriented Matrix Programming Language (4th ed.). London: Timberlake Consultants Press.
Duffie, D., D. Filipovic and W. Schachermayer (2003). Affine processes and applications in finance. Annals of Applied Probability 13, 984–1053.
Durbin, J. (1960). The fitting of time series models. Review of the International Institute of Statistics 28, 233–43.
Eberlein, E., J. Kallsen and J. Kristen (2003). Risk management based on stochastic volatility. Journal of Risk 5, 19–44.
Frühwirth-Schnatter, S. and L. Sögner (2009). Bayesian estimation of stochastic volatility models based on OU processes with marginal Gamma law. Annals of the Institute of Statistical Mathematics 61, 159–79.
Gallant, A. R. (1997). An Introduction to Econometric Theory. Princeton, NJ: Princeton University Press.
Garcia, R., R. Luger and E. Renault (2001). Asymmetric smiles, leverage effects and structural parameters. Cahiers de recherche 2001-09, Département de sciences économiques, Université de Montréal.
Griffin, J. E. and M. F. J. Steel (2006). Inference with non-Gaussian Ornstein–Uhlenbeck processes for stochastic volatility. Journal of Econometrics 134, 605–44.
Hansen, P. R. and A. Lunde (2006). Realized variance and market microstructure noise. Journal of Business and Economic Statistics 24, 127–61.
Harvey, A. and N. Shephard (1996). Estimation of an asymmetric stochastic volatility model for asset returns. Journal of Business and Economic Statistics 14, 429–34.
Heston, S. L. (1993). A closed-form solution for options with stochastic volatility with applications to bond and currency options. Review of Financial Studies 6, 327–43.
Hull, J. C. and A. White (1987). The pricing of options on assets with stochastic volatilities. Journal of Finance 42, 281–300.
Jacod, J. (2008). Asymptotic properties of realized power variations and related functionals of semimartingales. Stochastic Processes and their Applications 118, 517–59.
Jacod, J. and A. N.
Shiryaev (2003). Limit Theorems for Stochastic Processes (2nd ed.). Berlin: Springer.
Kallsen, J. (2006). A didactic note on affine stochastic volatility models. In Y. Kabanov, R. Liptser and J. Stoyanov (Eds.), From Stochastic Calculus to Mathematical Finance, 343–68. Berlin: Springer.
Klüppelberg, C., A. Linder and R. Maller (2004). A continuous time GARCH process driven by a Lévy process: stationarity and second order behaviour. Journal of Applied Probability 41, 601–22.
Konaris, G. (2002). Derivative pricing under non-Gaussian stochastic volatility. Unpublished thesis, Department of Economics, University of Oxford.
Levinson, N. (1947). The Wiener RMS (root mean square) error criterion in filter design and prediction. Journal of Mathematics and Physics 25, 261–78.
Madan, D. B. (2009). A tale of two volatilities. Review of Derivatives Research 12, 213–30.
Mykland, P. A. (1994). Bartlett type identities for martingales. Annals of Statistics 22, 21–38.
Nelson, D. B. (1991). Conditional heteroscedasticity in asset returns: a new approach. Econometrica 59, 347–70.
Newey, W. K. and K. D. West (1987). A simple, positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix. Econometrica 55, 703–08.
Protter, P. E. (2004). Stochastic Integration and Differential Equations (2nd ed.). London: Springer.
Roberts, G. O., O. Papaspiliopoulos and P. Dellaportas (2004). Bayesian inference for non-Gaussian Ornstein–Uhlenbeck stochastic volatility processes. Journal of the Royal Statistical Society, Series B, 66, 369–93.
Rogers, L. C. G. and D. Williams (2001). Diffusions, Markov Processes and Martingales, Foundations (2nd ed.), Volume 1. Cambridge: Cambridge University Press.
Sato, K. (1999). Lévy Processes and Infinitely Divisible Distributions. Cambridge: Cambridge University Press.
Tauchen, G. (2004). Remarks on recent developments in stochastic volatility: statistical modelling and general equilibrium. Working paper, Department of Economics, Duke University.
Tauchen, G. (2005). Stochastic volatility and general equilibrium. Working paper, Department of Economics, Duke University.
Todorov, V. and G. Tauchen (2006). Simulation methods for Lévy-driven CARMA stochastic volatility models. Journal of Business and Economic Statistics 24, 450–69.
Todorov, V. and G. Tauchen (2011). Volatility jumps. Forthcoming in Journal of Business and Economic Statistics.
Veraart, A. E. D. (2010). Inference for the jump part of quadratic variation of Itô semimartingales. Econometric Theory 26, 331–68.
Woerner, J. H. C. (2003). Purely discontinuous Lévy processes and power variation: inference for integrated volatility and the scale parameter. Working paper 2003-MF-08, Working Papers Series in Mathematical Finance, University of Oxford.
Yu, J. (2005). On leverage in a stochastic volatility model. Journal of Econometrics 127, 165–78.
Zhang, L. (2006). Efficient estimation of stochastic volatility using noisy observations: a multi-scale approach. Bernoulli 12, 1019–43.
Zhang, L., P. A. Mykland and Y. Aït-Sahalia (2005). A tale of two time scales: determining integrated volatility with noisy high-frequency data. Journal of the American Statistical Association 100, 1394–411.
Zhou, B. (1996). High-frequency data and volatility in foreign-exchange rates. Journal of Business and Economic Statistics 14, 45–52.
APPENDIX A: PROOFS

Proof of Proposition 4.1: From the binomial formula one obtains
\[ \Delta\sigma_t^n = \sigma_t^n - \sigma_{t-}^n = (\sigma_{t-} + \Delta\sigma_t)^n - \sigma_{t-}^n = \sum_{k=1}^{n}\binom{n}{k}\sigma_{t-}^{n-k}(\Delta Y_{\lambda t})^k. \]
Applying Itô's formula to \(f(x) = x^n\), with \(f'(x) = n x^{n-1}\) and \(f''(x) = n(n-1)x^{n-2}\), one gets
\[ \sigma_t^n - \sigma_0^n = -\lambda n \int_0^t \sigma_{s-}^n\,ds + \sum_{k=1}^{n}\binom{n}{k}\sum_{0\le s\le t}\sigma_{s-}^{n-k}(\Delta Y_{\lambda s})^k. \]
Proof of Proposition 4.2: Note that, for n ∈ N,
\[ \Delta S_t^n = \sum_{k=1}^{n}\binom{n}{k}(\sigma_{t-}\Delta X_t)^k S_{t-}^{n-k}. \]
By applying Itô's formula and taking the expectation, we get
\[ E\big(S_t^n\big) = \frac{n(n-1)}{2}\big(v^2 + \xi_2\big)\int_0^t E\big(S_s^{n-2}\sigma_s^2\big)\,ds + \sum_{k=3}^{n}\binom{n}{k}\xi_k\int_0^t E\big(\sigma_s^k S_s^{n-k}\big)\,ds. \]
From the integration by parts formula, it follows that, for k, n ∈ N and k ≤ n,
\[ S_t^{n-k}\sigma_t^k = \int_0^t S_{s-}^{n-k}\,d\sigma_s^k + \int_0^t \sigma_{s-}^k\,dS_s^{n-k} + \big[S^{n-k}, \sigma^k\big]_t = I + II + III. \]
From Proposition 4.1, one can deduce that
\[ I = -k\lambda\int_0^t S_{s-}^{n-k}\sigma_{s-}^k\,ds + \sum_{j=1}^{k}\binom{k}{j}\sum_{0\le s\le t} S_{s-}^{n-k}\sigma_{s-}^{k-j}(\Delta Y_{\lambda s})^j, \]
\[ II = (n-k)\int_0^t \sigma_{s-}^{k+1}S_{s-}^{n-k-1}\,d(vW_s + X_s) + \frac{(n-k)(n-k-1)}{2}\big(v^2+\xi_2\big)\int_0^t S_{s-}^{n-k-2}\sigma_{s-}^{k+2}\,ds + \sum_{j=3}^{n-k}\binom{n-k}{j}\sum_{0\le s\le t}\sigma_{s-}^{k+j}S_{s-}^{n-k-j}(\Delta X_s)^j, \]
\[ III = \sum_{0\le s\le t}\Delta S_s^{n-k}\,\Delta\sigma_s^k = \sum_{j=1}^{n-k}\sum_{l=1}^{k}\binom{n-k}{j}\binom{k}{l}\sum_{0\le s\le t} S_{s-}^{n-k-j}\sigma_{s-}^{j+k-l}(\Delta X_s)^j(\Delta Y_{\lambda s})^l. \]
When taking the expectation and applying the Master formula and Fubini's theorem, one obtains the following differential equation:
\[ \frac{d}{dt}E\big(S_t^{n-k}\sigma_t^k\big) + k\lambda\,E\big(S_t^{n-k}\sigma_t^k\big) = g(t; n, k), \tag{A.1} \]
where g( · ; n, k) is defined by
\[ g(t; n, k) = \sum_{j=1}^{k}\binom{k}{j}\lambda\eta_j E\big(S_t^{n-k}\sigma_t^{k-j}\big) + \frac{(n-k)(n-k-1)}{2}\big(v^2+\xi_2\big)E\big(S_t^{n-k-2}\sigma_t^{k+2}\big) + \sum_{j=3}^{n-k}\binom{n-k}{j}\xi_j E\big(\sigma_t^{k+j}S_t^{n-k-j}\big) + \sum_{j=1}^{n-k}\sum_{l=1}^{k}\binom{n-k}{j}\binom{k}{l}\kappa_{j,l}E\big(S_t^{n-k-j}\sigma_t^{j+k-l}\big). \]
Because the processes above are continuous in probability, we can write \(E\big(S_{t-}^{n-k}\sigma_{t-}^k\big) = E\big(S_t^{n-k}\sigma_t^k\big)\). Clearly, (A.1) is an inhomogeneous ordinary differential equation of first order. Solving (A.1) with initial value 0 at t = 0, one obtains
\[ E\big(S_t^{n-k}\sigma_t^k\big) = e^{-k\lambda t}\int_0^t g(u; n, k)\,e^{k\lambda u}\,du. \]

Proofs of Propositions 4.3, 4.4, 4.6, 5.1 and 5.2: These proofs are omitted here since they are rather lengthy. However, all computations are based on the same ideas presented above: we use Itô's formula to derive representations of the products of various powers of S and σ. After taking expectations and applying the Master formula and Fubini's theorem, we obtain a system of first-order inhomogeneous ordinary differential equations, which we solve iteratively.

Proof of Corollary 5.1: Clearly, the second-order properties of the increments of G follow directly from the results above. In particular, we have the following for
\[ [S_\delta]_{[(i-1)h,ih]} - [s]_{[(i-1)h,ih]} = \sum_{j=1}^{M}\big(s_{j,i}^2 - [s]_{j,i}\big), \]
when we assume that \(v^2 + \xi_2 = 1\). The mean is given by
\[ E\big([S_\delta]_{[(i-1)h,ih]} - [s]_{[(i-1)h,ih]}\big) = 0, \]
and for the variance and covariance, we get
\[ \mathrm{Var}\big([S_\delta]_{[(i-1)h,ih]} - [s]_{[(i-1)h,ih]}\big) = \sum_{j=1}^{M}\mathrm{Var}\big(s_{j,i}^2 - [s]_{j,i}\big) + 2\sum_{1\le j<k\le M}\mathrm{Cov}\big(s_{j,i}^2 - [s]_{j,i},\ s_{k,i}^2 - [s]_{k,i}\big) = M\,\mathrm{Var}(G_\delta) + 0 = g_1 h^2 M^{-1} + O(M^{-2}) \]
and
\[ \mathrm{Cov}\big([S_\delta]_{[(i-1)h,ih]} - [s]_{[(i-1)h,ih]},\ [S_\delta]_{[(i'-1)h,i'h]} - [s]_{[(i'-1)h,i'h]}\big) = \sum_{j=1}^{M}\sum_{k=1}^{M}\mathrm{Cov}\big(s_{j,i}^2 - [s]_{j,i},\ s_{k,i'}^2 - [s]_{k,i'}\big) = 0, \]
for \(i \neq i'\). Finally, we deduce that
\[ \mathrm{Cov}\big([S_\delta]_{[(i-1)h,ih]} - [s]_{[(i-1)h,ih]},\ [s]_{[(i-1)h,ih]}\big) = \sum_{j=1}^{M}\sum_{k=1}^{M}\Big(\mathrm{Cov}\big(s_{j,i}^2, [s]_{k,i}\big) - \mathrm{Cov}\big([s]_{j,i}, [s]_{k,i}\big)\Big) = \sum_{j=1}^{M}\mathrm{Cov}\big(s_{j,i}^2, [s]_{j,i}\big) - \sum_{j=1}^{M}\mathrm{Var}\big([s]_{j,i}\big) + \sum_{j=1}^{M}\sum_{k=1,\,k\neq j}^{M}\mathrm{Cov}\big(s_{j,i}^2, [s]_{k,i}\big) - 2\sum_{1\le j<k\le M}\mathrm{Cov}\big([s]_{j,i}, [s]_{k,i}\big). \]
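As an aside, the integrating-factor solution of (A.1), \(E(S_t^{n-k}\sigma_t^k) = e^{-k\lambda t}\int_0^t g(u)e^{k\lambda u}du\), which drives all of the moment computations above, can be sanity-checked numerically. In this sketch the forcing term g and the constant kλ are arbitrary stand-ins of our own choosing:

```python
import numpy as np

# Check the integrating-factor solution of m'(t) + k*lam * m(t) = g(t), m(0) = 0,
# namely m(t) = exp(-k*lam*t) * int_0^t g(u) exp(k*lam*u) du,
# against a direct forward-Euler integration of the ODE.
k_lam = 1.3                                  # stands in for k * lambda
g = lambda t: np.sin(t) + 2.0                # arbitrary smooth stand-in for g(t; n, k)

t = np.linspace(0.0, 5.0, 20001)
dt = t[1] - t[0]
integrand = g(t) * np.exp(k_lam * t)
# cumulative trapezoidal integral of g(u) e^{k lam u} on [0, t]
cumint = np.concatenate(([0.0], np.cumsum(0.5 * (integrand[1:] + integrand[:-1]) * dt)))
m_closed = np.exp(-k_lam * t) * cumint       # closed-form solution

m_euler = np.zeros_like(t)                   # forward Euler on m' = g - k*lam*m
for i in range(len(t) - 1):
    m_euler[i + 1] = m_euler[i] + dt * (g(t[i]) - k_lam * m_euler[i])

assert np.max(np.abs(m_closed - m_euler)) < 1e-2
```

The two trajectories agree up to the first-order discretization error of the Euler scheme, confirming the closed form.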
Proof of Proposition 5.3: The first- and second-order properties of the realized variance can be derived from the corresponding results for the squared returns. Let i, s ∈ N. Then
\[ E\big([S_\delta]_{[(i-1)h,ih]}\big) = \sum_{j=1}^{M}E\big(s_{j,i}^2\big), \]
\[ \mathrm{Var}\big([S_\delta]_{[(i-1)h,ih]}\big) = \sum_{j=1}^{M}\mathrm{Var}\big(s_{j,i}^2\big) + 2\sum_{1\le j<k\le M}\mathrm{Cov}\big(s_{j,i}^2, s_{k,i}^2\big), \]
\[ \mathrm{Cov}\big([S_\delta]_{[(i-1)h,ih]},\ [S_\delta]_{[(i+s-1)h,(i+s)h]}\big) = \sum_{1\le j,k\le M}\mathrm{Cov}\big(s_{j,i}^2, s_{k,i+s}^2\big). \]
Finally, we express the (quite lengthy) formulae for the variance and covariance of the realized variance as Taylor series expansions and focus on the leading terms only. This leads to the results given in Proposition 5.3.
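The three realized-variance moments above have direct empirical counterparts. A minimal sketch follows, using i.i.d. Gaussian stand-in returns of our own making (not the paper's model), with the SPY-like dimensions n = 2515 days and M = 78 intraday returns:

```python
import numpy as np

rng = np.random.default_rng(0)
n_days, M = 2515, 78                     # 78 five-minute returns per day, as in the SPY sample

# i.i.d. Gaussian stand-in intraday returns; the paper's model additionally has
# stochastic volatility, jumps and leverage -- this is purely illustrative.
r = rng.standard_normal((n_days, M)) * 0.001

rv = (r ** 2).sum(axis=1)                # daily realized variance [S_delta]_{[(i-1)h, ih]}

mean_rv = rv.mean()                      # counterpart of sum_j E(s_{j,i}^2)
var_rv = rv.var(ddof=1)                  # counterpart of the variance formula

def autocov(x, lag):
    """Empirical lag-`lag` autocovariance, counterpart of the covariance formula."""
    xc = x - x.mean()
    return (xc[:-lag] * xc[lag:]).mean()

# With i.i.d. returns the daily realized variances are themselves i.i.d.,
# so the lag-1 autocovariance is negligible relative to the variance.
assert abs(autocov(rv, 1)) < 0.5 * var_rv
```

In the paper's model the lag-s autocovariance would instead decay at the rates r_λ and R_λ of the volatility process.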
APPENDIX B: SUPERPOSITION MODEL

Now we assume that σ is given by a superposition model as defined in Section 5.4. As already mentioned, we can derive the mean, variance and covariance even in this more general model, but because the driving processes of the asset price and the volatility components are dependent, the formulae become quite lengthy. Here we therefore focus on the case J = 2, and we compute the quadratic variation of the price process and use it as an O(M⁻¹)-approximation of the corresponding formulae for the realized variance. Because we only consider the case J = 2, we can write the weights as w₁ and w₂ = 1 − w₁. Throughout the section, we assume for ease of exposition that the Y⁽ⁱ⁾, i = 1, ..., J, are independent and identically distributed. Hence the moments of τ⁽ⁱ⁾ are identical for i = 1, ..., J, and we drop the superscript in the moments below to simplify the notation. In the following we assume that 1 ≤ i, j ≤ J and \(v^2 + \xi_2 = 1\).

The moments of σ can be derived straightforwardly from the corresponding moments of τ⁽ⁱ⁾ via \(\sigma_t = \sum_{j=1}^{J} w_j \tau_t^{(j)}\). In particular, we have for J = 2:
\[ E(\sigma_t) = w_1 E\big(\tau_t^{(1)}\big) + w_2 E\big(\tau_t^{(2)}\big) = E\big(\tau_t^{(1)}\big), \]
\[ E\big(\sigma_t^2\big) = w_1^2 E\big((\tau_t^{(1)})^2\big) + w_2^2 E\big((\tau_t^{(2)})^2\big) + 2w_1w_2 E\big(\tau_t^{(1)}\big)E\big(\tau_t^{(2)}\big) = \big(w_1^2 + w_2^2\big)E\big(\tau_t^2\big) + 2w_1w_2\big(E(\tau_t)\big)^2. \]
Hence, \(\mathrm{Var}(\sigma_t) = (w_1^2 + w_2^2)\,\mathrm{Var}(\tau_t)\). For the third and fourth moments, we have
\[ E\big(\sigma_t^3\big) = \big(w_1^3 + w_2^3\big)E\big(\tau_t^3\big) + 3\big(w_1^2w_2 + w_2^2w_1\big)E\big(\tau_t^2\big)E(\tau_t), \]
\[ E\big(\sigma_t^4\big) = \big(w_1^4 + w_2^4\big)E\big(\tau_t^4\big) + 4\big(w_1^3w_2 + w_2^3w_1\big)E\big(\tau_t^3\big)E(\tau_t) + 6w_1^2w_2^2\big(E\big(\tau_t^2\big)\big)^2. \]
In the following we use the notation \(\eta_k^{(i)} = \lambda_i \eta_k\), for i = 1, ..., J and k ∈ N. Next, we compute the second-order properties of the actual variance in the superposition model.
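Before turning to the second-order properties, the J = 2 moment identities above are easy to verify by simulation. Here two i.i.d. gamma samples (our own toy choice of distribution) stand in for τ⁽¹⁾ and τ⁽²⁾ at a fixed t:

```python
import numpy as np

rng = np.random.default_rng(1)
w1 = 0.3
w2 = 1.0 - w1

# i.i.d. gamma draws standing in for tau^{(1)}_t and tau^{(2)}_t
tau1 = rng.gamma(shape=2.0, scale=0.5, size=1_000_000)
tau2 = rng.gamma(shape=2.0, scale=0.5, size=1_000_000)
sigma = w1 * tau1 + w2 * tau2

# E(sigma_t) = E(tau_t):
assert abs(sigma.mean() - tau1.mean()) < 0.01
# Var(sigma_t) = (w1^2 + w2^2) * Var(tau_t):
assert abs(sigma.var() - (w1 ** 2 + w2 ** 2) * tau1.var()) < 0.01
```

The same simulation can be extended to the third- and fourth-moment identities by comparing the corresponding sample moments.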
\[
\begin{aligned}
\mathrm{Var}\Big(\int_{(i-1)h}^{ih}\sigma_s^2\,ds\Big)
&= \Big(\big(4\eta_2\eta_1^2 + \tfrac{4}{3}\eta_1\eta_3\big)w_1^4 + \big(12\eta_2\eta_1^2 + \tfrac{4}{3}\eta_1\eta_3\big)w_2w_1^3 + 8w_1^2w_2^2\eta_2\eta_1^2\Big)\,r_{\lambda_1}(h) \\
&\quad + \Big(\big(4\eta_2\eta_1^2 + \tfrac{4}{3}\eta_1\eta_3\big)w_2^4 + \big(12\eta_2\eta_1^2 + \tfrac{4}{3}\eta_1\eta_3\big)w_2^3w_1 + 8w_1^2w_2^2\eta_2\eta_1^2\Big)\,r_{\lambda_2}(h) \\
&\quad + \Big(\big(\tfrac{1}{8}\eta_4 + \tfrac{1}{4}\eta_2^2 + \tfrac{1}{3}\eta_1\eta_3\big)w_1^4 + \big({-\eta_2\eta_1^2} + \tfrac{1}{3}\eta_1\eta_3\big)w_2w_1^3\Big)\,r_{\lambda_1}(2h) \\
&\quad + \Big(\big(\tfrac{1}{8}\eta_4 + \tfrac{1}{4}\eta_2^2 + \tfrac{1}{3}\eta_1\eta_3\big)w_2^4 + \big({-\eta_2\eta_1^2} + \tfrac{1}{3}\eta_1\eta_3\big)w_2^3w_1\Big)\,r_{\lambda_2}(2h) \\
&\quad + 8\eta_2^2 w_1^2 w_2^2\, r_{\lambda_1+\lambda_2}(h).
\end{aligned}
\]
For the covariance function, we get
\[
\mathrm{Cov}\Big(\int_{(i-1)h}^{ih}\sigma_s^2\,ds,\ \int_{(i+s-1)h}^{(i+s)h}\sigma_u^2\,du\Big)
= a_1 R_{\lambda_1}(h,s) + a_2 R_{\lambda_2}(h,s) + a_3 R_{\lambda_1}(2h,s) + a_4 R_{\lambda_2}(2h,s) + a_5 R_{\lambda_1+\lambda_2}(h,s),
\]
where
\[
\begin{aligned}
a_1 &= \big(\tfrac{2}{3}\eta_1\eta_3 + 2\eta_1^2\eta_2\big)w_1^4 + \big(6\eta_1^2\eta_2 + \tfrac{2}{3}\eta_1\eta_3\big)w_2w_1^3 + 4w_1^2w_2^2\eta_1^2\eta_2, \\
a_2 &= \big(\tfrac{2}{3}\eta_1\eta_3 + 2\eta_1^2\eta_2\big)w_2^4 + \big(6\eta_1^2\eta_2 + \tfrac{2}{3}\eta_1\eta_3\big)w_2^3w_1 + 4w_1^2w_2^2\eta_1^2\eta_2, \\
a_3 &= \big(\tfrac{1}{16}\eta_4 + \tfrac{1}{6}\eta_1\eta_3 + \tfrac{1}{8}\eta_2^2\big)w_1^4 + \big(\tfrac{1}{6}\eta_1\eta_3 - \tfrac{1}{2}\eta_1^2\eta_2\big)w_2w_1^3, \\
a_4 &= \big(\tfrac{1}{16}\eta_4 + \tfrac{1}{6}\eta_1\eta_3 + \tfrac{1}{8}\eta_2^2\big)w_2^4 + \big(\tfrac{1}{6}\eta_1\eta_3 - \tfrac{1}{2}\eta_1^2\eta_2\big)w_2^3w_1, \\
a_5 &= 4w_1^2w_2^2\eta_2^2.
\end{aligned}
\]
Next, we compute the second-order properties of the quadratic variation.
For the mean, we get
\[ E\big([s]_{[(i-1)h,ih]}\big) = \Big(\big(\eta_1^2 + \tfrac{1}{2}\eta_2\big)w_1^2 + 2w_1w_2\eta_1^2 + \big(\eta_1^2 + \tfrac{1}{2}\eta_2\big)w_2^2\Big)h. \]
For the variance, we get
\[ \mathrm{Var}\big([s]_{[(i-1)h,ih]}\big) = \mathrm{Var}\Big(\int_{(i-1)h}^{ih}\sigma_s^2\,ds\Big) + E\big(\sigma^4\big)h\xi_4 + b_1 r_{\lambda_1}(h) + b_2 r_{\lambda_2}(h) + b_3 r_{\lambda_1}(2h) + b_4 r_{\lambda_2}(2h) + b_5 r_{\lambda_1+\lambda_2}(h), \]
where
\[
\begin{aligned}
b_1 &= \Big(\big(2\eta_1\eta_2 + 4\eta_1^3\big)w_1^4 + \big(2\eta_1\eta_2 + 12\eta_1^3\big)w_2w_1^3 + \big(2\eta_1\eta_2 + 12\eta_1^3\big)w_2^2w_1^2 + \big(2\eta_1\eta_2 + 4\eta_1^3\big)w_2^3w_1\Big)\kappa_{2,1}^{(1)} - 4w_1^3w_2\eta_1^2\eta_2 - 4w_1^2w_2^2\eta_1^2\eta_2, \\
b_2 &= \Big(\big(2\eta_1\eta_2 + 4\eta_1^3\big)w_2w_1^3 + \big(2\eta_1\eta_2 + 12\eta_1^3\big)w_2^2w_1^2 + \big(2\eta_1\eta_2 + 12\eta_1^3\big)w_2^3w_1 + \big(2\eta_1\eta_2 + 4\eta_1^3\big)w_2^4\Big)\kappa_{2,1}^{(2)} - 4w_2^3w_1\eta_1^2\eta_2 - 4w_1^2w_2^2\eta_1^2\eta_2, \\
b_3 &= \Big(\big(\tfrac{1}{3}\eta_3 + \eta_1\eta_2\big)w_1^4 + w_1^3w_2\eta_1\eta_2\Big)\kappa_{2,1}^{(1)} + \Big(\big(\tfrac{1}{4}\eta_2 + \tfrac{1}{2}\eta_1^2\big)w_1^4 + w_2\eta_1^2w_1^3 + \big(\tfrac{1}{4}\eta_2 + \tfrac{1}{2}\eta_1^2\big)w_2^2w_1^2\Big)\kappa_{2,2}^{(1)} + w_1^3w_2\eta_1^2\eta_2, \\
b_4 &= \Big(\big(\tfrac{1}{3}\eta_3 + \eta_1\eta_2\big)w_2^4 + w_2^3w_1\eta_1\eta_2\Big)\kappa_{2,1}^{(2)} + \Big(\big(\tfrac{1}{4}\eta_2 + \tfrac{1}{2}\eta_1^2\big)w_2^2w_1^2 + w_2^3w_1\eta_1^2 + \big(\tfrac{1}{4}\eta_2 + \tfrac{1}{2}\eta_1^2\big)w_2^4\Big)\kappa_{2,2}^{(2)} + w_2^3w_1\eta_1^2\eta_2, \\
b_5 &= \big(4\eta_1\eta_2 + \tfrac{4}{3}\eta_3\big)\kappa_{2,1}^{(2)}w_2w_1^3 + \big(4\eta_1\kappa_{2,1}^{(1)}\eta_2 - 6\eta_2^2 + 4\eta_1\kappa_{2,1}^{(2)}\eta_2\big)w_2^2w_1^2 + \big(4\eta_1\eta_2 + \tfrac{4}{3}\eta_3\big)\kappa_{2,1}^{(1)}w_2^3w_1.
\end{aligned}
\]
For the covariance, we get for s ∈ N:
\[
\mathrm{Cov}\big([s]_{[(i-1)h,ih]},\ [s]_{[(i+s-1)h,(i+s)h]}\big)
= \mathrm{Cov}\Big(\int_{(i-1)h}^{ih}\sigma_s^2\,ds,\ \int_{(i+s-1)h}^{(i+s)h}\sigma_u^2\,du\Big)
+ c_1 R_{\lambda_1}(h,s) + c_2 R_{\lambda_2}(h,s) + c_3 R_{\lambda_1}(2h,s) + c_4 R_{\lambda_2}(2h,s) + c_5 R_{\lambda_1+\lambda_2}(h,s),
\]
where we simplify the formulae above using \(w_1 + w_2 = 1\) and, hence,
\[
\begin{aligned}
c_1 &= \Big(\big(\eta_2 + 2\eta_1^2\big)\eta_1 w_1^3 + 4\eta_1^3 w_2w_1^2 + \big(\eta_2 + 2\eta_1^2\big)\eta_1 w_2^2w_1\Big)\kappa_{2,1}^{(1)} - 2\eta_1^2\eta_2 w_2w_1^2, \\
c_2 &= \Big(\big(\eta_2 + 2\eta_1^2\big)\eta_1 w_2w_1^2 + 4\eta_1^3 w_2^2w_1 + \big(\eta_2 + 2\eta_1^2\big)\eta_1 w_2^3\Big)\kappa_{2,1}^{(2)} - 2w_2^2w_1\eta_1^2\eta_2, \\
c_3 &= \Big(\big(\tfrac{1}{6}\eta_3 + \tfrac{1}{2}\eta_1\eta_2\big)w_1^4 + \tfrac{1}{2}w_1^3w_2\eta_1\eta_2\Big)\kappa_{2,1}^{(1)} + \Big(\big(\tfrac{1}{8}\eta_2 + \tfrac{1}{4}\eta_1^2\big)w_1^4 + \tfrac{1}{2}\eta_1^2 w_2w_1^3 + \big(\tfrac{1}{8}\eta_2 + \tfrac{1}{4}\eta_1^2\big)w_2^2w_1^2\Big)\kappa_{2,2}^{(1)} + \tfrac{1}{2}w_1^3w_2\eta_1^2\eta_2, \\
c_4 &= \Big(\big(\tfrac{1}{6}\eta_3 + \tfrac{1}{2}\eta_1\eta_2\big)w_2^4 + \tfrac{1}{2}w_1w_2^3\eta_1\eta_2\Big)\kappa_{2,1}^{(2)} + \Big(\big(\tfrac{1}{8}\eta_2 + \tfrac{1}{4}\eta_1^2\big)w_2^2w_1^2 + \tfrac{1}{2}w_2^3w_1\eta_1^2 + \big(\tfrac{1}{8}\eta_2 + \tfrac{1}{4}\eta_1^2\big)w_2^4\Big)\kappa_{2,2}^{(2)} + \tfrac{1}{2}w_2^3w_1\eta_1^2\eta_2, \\
c_5 &= \Big(2w_1^2w_2^2\eta_1\eta_2 + \big(2\eta_1\eta_2 + \tfrac{2}{3}\eta_3\big)w_2^3w_1\Big)\kappa_{2,1}^{(1)} + \Big(2w_1^2w_2^2\eta_1\eta_2 + \big(2\eta_1\eta_2 + \tfrac{2}{3}\eta_3\big)w_2w_1^3\Big)\kappa_{2,1}^{(2)} - 3w_1^2w_2^2\eta_2^2.
\end{aligned}
\]
The Econometrics Journal (2011), volume 14, pp. 241–256. doi: 10.1111/j.1368-423X.2010.00324.x
Quasi-maximum likelihood estimation of discretely observed diffusions

XIAO HUANG†

†
Department of Economics, Finance and Quantitative Analysis, Kennesaw State University, 1000 Chastain Road MD 403, Kennesaw, GA 30144, USA. E-mail:
[email protected] First version received: July 2009; final version accepted: June 2010
Summary This paper introduces a quasi-maximum likelihood estimator for discretely observed diffusions when a closed-form transition density is unavailable. A higher-order Wagner–Platen strong approximation is used to derive the first two conditional moments, and a normal density function is used in estimation. A simulation study shows that the proposed estimator has high numerical precision and good numerical robustness. The method is applicable to a large class of diffusions. Keywords: Diffusion, Quasi-maximum likelihood estimator, Wagner–Platen approximation.
1. INTRODUCTION

Diffusion processes have been widely used in many research fields to model continuous-time phenomena, and they are usually characterized by stochastic differential equations (SDEs). Examples include modelling gene changes due to natural selection in genetics, vertical motion of the ground level in seismology, outflow from a reservoir in hydrology, asset prices in finance, etc. When the drift and diffusion coefficient of an SDE are parametrically specified, it is crucial to obtain precise parameter estimates. A major difficulty in estimation is that data are always recorded discretely, while SDEs are defined in continuous time. This has inspired much research on obtaining good parameter estimates based on discrete observations. In this paper, we consider the estimation of a scalar, time-homogeneous diffusion characterized by the following SDE:
\[ dX_t = a(X_t; \theta)\,dt + b(X_t; \theta)\,dW_t, \]
(1.1)
where \(X_t\) is an observed scalar variable, \(W_t\) is a Wiener process, and \(a(X_t; \theta)\) and \(b(X_t; \theta)\) are the parametric drift and diffusion coefficients with p × 1 parameter vector θ. Given a time discretization \(t_0 (= 0) < \cdots < t_{i-1} < t_i < \cdots < t_n (= T)\) and a sampling interval \(\Delta = t_i - t_{i-1}\), we let \(p(X_{t_i} \mid X_{t_{i-1}}; \theta)\) denote the transition density of \(X_{t_i}\) given \(X_{t_{i-1}}\). If p is known, the maximum likelihood estimator (MLE) is the first choice for efficient estimation. However, closed-form transition densities exist only for a few special SDEs, and this makes MLE inapplicable to a general SDE such as (1.1). Florens-Zmirou (1989) shows that an estimator based on the Euler approximation to (1.1) with a normal transition density converges to the true MLE as the time
discretization interval goes to zero. In practice, the discretization interval is usually larger than zero, and a Euler estimator inevitably introduces approximation error. Much work has been done to improve the approximation to p. Shoji and Ozaki (1998) obtain a closed-form approximation to the transition density by using local linearization. Hermite polynomial expansion is used in Aït-Sahalia (2002) to approximate the transition density. The estimator in Aït-Sahalia (2002) yields high numerical precision and is shown in Hurn et al. (2007) to outperform many existing estimation methods from the perspective of the speed/accuracy trade-off. Simulated MLE (SMLE) and Markov chain Monte Carlo offer alternative approaches to estimation (see e.g. Pedersen, 1995, Elerian et al., 2001, Eraker, 2001, and Brandt and Santa-Clara, 2002). These simulation-based methods can also achieve high numerical precision, but the computation cost is high. More recently, Durham and Gallant (2002) use the Brownian bridge sampler to improve SMLE; Phillips and Yu (2009) propose a two-stage realized volatility approach; Beskos et al. (2009) suggest a simultaneous acceptance method (SAM) by estimating each conditional likelihood independently. However, SAM is applicable only to a restricted class of diffusions. Other approaches include numerically solving the Fokker–Planck equation in Lo (1988), estimating functions based on the low-order Wagner–Platen approximation in Kelly et al. (2004), and method-of-moments approaches in Chan et al. (1992), Gouriéroux et al. (1993), Hansen and Scheinkman (1995), Gallant and Tauchen (1997), etc.1 See Fan (2005), Aït-Sahalia (2007) and Hurn et al. (2007) for surveys of the various estimation methods. This paper develops a quasi-maximum likelihood estimator (QMLE) using higher-order strong Wagner–Platen approximations.
Due to the difficulty of obtaining a closed-form approximate transition density for higher-order approximations, previous research is limited to low-order approximations such as the Euler and Milstein schemes, and the estimates are often less precise than the results in Aït-Sahalia (2002) and Durham and Gallant (2002). We show that higher-order approximations can improve estimation. The idea is to derive the first two conditional moments from a strong numerical solution to (1.1) and to use the QMLE of Bollerslev and Wooldridge (1992) for estimation. The QMLE has the following appealing features. First, its consistency and asymptotic normality are easy to establish. Secondly, simulation shows that an order three or four approximation is often enough for precise estimation, and the estimator is also numerically robust. Thirdly, our approach does not require (1.1) to be first transformed such that b(X; θ) = 1, in contrast to some other existing techniques; simulation shows that the QMLE obtained from the untransformed SDE is also very precise. In a closely related paper, Kessler (1997) approximates the conditional moments directly, while our approach is based on a numerical solution to the SDE. The rest of the paper is organized as follows. Section 2 introduces strong Wagner–Platen approximations and the QMLE. Section 3 presents simulation evidence for the numerical precision and robustness of the QMLE. Section 4 concludes. An example of the QMLE is given in the Appendix.
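To fix ideas, the Gaussian quasi-likelihood construction can be illustrated with the crudest, order 0.5 (Euler) conditional moments for a toy Ornstein–Uhlenbeck process. All parameter values below are our own choices, and the paper's estimator replaces these Euler moments with higher-order Wagner–Platen ones:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy Ornstein-Uhlenbeck process dX = kappa*(mu - X) dt + sigma dW,
# simulated from its exact Gaussian transition (values are our own choices).
kappa, mu, sigma = 2.0, 0.5, 0.3
delta, n = 1.0 / 12, 5000
a = np.exp(-kappa * delta)
s = sigma * np.sqrt((1.0 - a ** 2) / (2.0 * kappa))
x = np.empty(n + 1)
x[0] = mu
for i in range(n):
    x[i + 1] = mu + (x[i] - mu) * a + s * rng.standard_normal()

def neg_quasi_loglik(theta):
    """Negative Gaussian quasi-log-likelihood from the Euler (order 0.5) moments:
    conditional mean x + kappa*(mu - x)*delta, conditional variance sigma^2*delta."""
    k, m, sg = theta
    mean = x[:-1] + k * (m - x[:-1]) * delta
    var = sg ** 2 * delta
    resid = x[1:] - mean
    return 0.5 * np.sum(np.log(2.0 * np.pi * var) + resid ** 2 / var)

# Crude profile over kappa with mu and sigma held at their true values,
# just to show that the quasi-likelihood is informative about the drift.
grid = np.linspace(0.5, 4.0, 36)
k_hat = grid[np.argmin([neg_quasi_loglik((k, mu, sigma)) for k in grid])]
```

Because of the Euler approximation, the profiled estimate concentrates near (1 − e^{−κΔ})/Δ ≈ 1.84 rather than the true κ = 2, illustrating the discretization bias that higher-order conditional moments are designed to reduce.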
2. THE APPROXIMATION AND THE ESTIMATOR

Throughout this paper, we consider the general SDE defined in (1.1), where we assume that the discrete observations are stationary and ergodic. Extension to non-stationary and non-ergodic processes is possible (see Section 2.2 for a discussion). The Euler scheme is the simplest strong
Wagner–Platen approximation is also called Itˆo–Taylor approximation in Kloeden and Platen (1999). C 2011 The Author(s). The Econometrics Journal C 2011 Royal Economic Society.
Quasi-maximum likelihood estimation of diffusions
243
Wagner–Platen approximation and it takes the following form:
X_{t_i} = X_{t_{i−1}} + a(X_{t_{i−1}}; θ)Δ + b(X_{t_{i−1}}; θ)√Δ ε_{t_i},   (2.1)

where ε_{t_i} ~ i.i.d. N(0, 1) for all t_0 < t_i ≤ t_n. Equation (2.1) is an order 0.5 strong Wagner–Platen approximation, and X_{t_i} in (2.1) is a numerical solution to (1.1). The Euler scheme implies a conditional normal distribution with mean X_{t_{i−1}} + a(X_{t_{i−1}}; θ)Δ and standard deviation b(X_{t_{i−1}}; θ)√Δ. Elerian (1998) uses an order 1.0 Milstein scheme approximation to obtain a closed-form density. However, a closed-form approximate transition density is hard to derive for higher-order approximations. An alternative way to explore time-discrete approximation is proposed in Shoji and Ozaki (1998). After transforming b(X; θ) to 1, (1.1) becomes

dY_t = a_Y(Y_t; θ) dt + dW_t,   (2.2)

where

Y ≡ G(X) = ∫^X du/b(u; θ),   (2.3)

a_Y(Y; θ) = a(G⁻¹(Y; θ); θ)/b(G⁻¹(Y; θ); θ) − (1/2) ∂b(G⁻¹(Y; θ); θ)/∂X.
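To make the transform in (2.3) concrete, the sketch below (my own illustration, not code from the paper; the CIR parameter values are borrowed from Section 3) computes G and the transformed drift a_Y for the CIR model dX_t = θ₂(θ₁ − X_t) dt + θ₃√X_t dW_t, and checks numerically that writing a_Y directly in terms of y agrees with a(x)/b(x) − (1/2) ∂b(x)/∂x evaluated at x = G⁻¹(y), as (2.3) requires.

```python
import numpy as np

# CIR model: a(x) = theta2*(theta1 - x), b(x) = theta3*sqrt(x)
theta1, theta2, theta3 = 0.06, 0.5, 0.15

def a(x): return theta2 * (theta1 - x)
def b(x): return theta3 * np.sqrt(x)
def db(x): return theta3 / (2.0 * np.sqrt(x))        # db/dx

# Transform (2.3): G(x) = int^x du / b(u) = 2*sqrt(x)/theta3 (the constant
# of integration is irrelevant), with inverse G^{-1}(y) = (theta3*y/2)**2.
def G(x): return 2.0 * np.sqrt(x) / theta3

# Transformed drift from (2.3), written out directly in y:
#   a_Y(y) = 2*theta1*theta2/(theta3**2 * y) - theta2*y/2 - 1/(2*y)
def aY(y): return 2*theta1*theta2/(theta3**2 * y) - theta2*y/2 - 1/(2*y)

# Check: a_Y(G(x)) must equal a(x)/b(x) - 0.5*db(x) for every x > 0.
x = np.linspace(0.01, 0.5, 50)
lhs = aY(G(x))
rhs = a(x)/b(x) - 0.5*db(x)
assert np.allclose(lhs, rhs)
```

The same two-line check is a convenient sanity test whenever the antiderivative in (2.3) is worked out by hand for a new model.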
We may linearize a_Y(Y_{t_i}; θ) around the point Y_{t_{i−1}} to obtain

a_Y(Y_{t_i}; θ) ≈ a_Y(Y_{t_{i−1}}; θ) + (1/2)(∂²a_Y/∂Y²)(t_i − t_{i−1}) + (∂a_Y/∂Y)(Y_{t_i} − Y_{t_{i−1}})
              = a_Y(Y_{t_{i−1}}; θ) − (1/2)(∂²a_Y/∂Y²) t_{i−1} − (∂a_Y/∂Y) Y_{t_{i−1}} + (1/2)(∂²a_Y/∂Y²) t_i + (∂a_Y/∂Y) Y_{t_i}.   (2.4)
Given Y_{t_{i−1}}, the first three terms on the right-hand side of (2.4) and the coefficients for t_i and Y_{t_i} are constant, and the drift becomes a linear function of Y_{t_i} and t_i. An explicit solution to (2.2) can be obtained, and it follows a conditional normal distribution, making MLE feasible. We note that the values of the first and second derivatives of the drift in (2.4) may also change as Y_t evolves from t_{i−1} to t_i, and this observation motivates us to accommodate these changes to possibly improve estimation.

2.1. Wagner–Platen expansion and strong approximation

When ∂a_Y/∂Y and ∂²a_Y/∂Y² in (2.4) are varying on the interval [t_{i−1}, t_i], they can be further expanded when we approximate Y_t in (2.2). Differentiating ∂a_Y/∂Y gives

d(∂a_Y/∂Y) = (1/2)(∂³a_Y/∂Y³) dt + (∂²a_Y/∂Y²) dY,

and in discrete time, omitting θ, we have

∂a_Y(Y_{t_i})/∂Y ≈ ∂a_Y(Y_{t_{i−1}})/∂Y − (1/2)(∂³a_Y(Y_{t_{i−1}})/∂Y³) t_{i−1} − (∂²a_Y(Y_{t_{i−1}})/∂Y²) Y_{t_{i−1}} + (1/2)(∂³a_Y(Y_{t_{i−1}})/∂Y³) t_i + (∂²a_Y(Y_{t_{i−1}})/∂Y²) Y_{t_i}.   (2.5)
If ∂³a_Y/∂Y³ in (2.5) also varies when Y evolves from Y_{t_{i−1}} to Y_{t_i}, we can again differentiate it w.r.t. Y in the approximation. In theory, if we assume a_Y is infinitely differentiable in Y, the above differentiation can be continued until the desired precision in the approximation is reached. This way of expanding a diffusion process is analogous to a Taylor series expansion and is referred to as the Wagner–Platen expansion in Kloeden and Platen (1999) (henceforth KP). It is applicable to a diffusion defined in (1.1) without normalizing b(X; θ) to one. Consider the solution X_t to (1.1) conditioning on X_{t_{i−1}}:

X_{t_i} = X_{t_{i−1}} + ∫_{t_{i−1}}^{t_i} a(X_u; θ) du + ∫_{t_{i−1}}^{t_i} b(X_u; θ) dW_u.   (2.6)
By Itô's formula, we expand a(X_u; θ) and b(X_u; θ) in (2.6) at X_{t_{i−1}} to have

X_{t_i} = X_{t_{i−1}} + a(X_{t_{i−1}}; θ) ∫_{t_{i−1}}^{t_i} du + b(X_{t_{i−1}}; θ) ∫_{t_{i−1}}^{t_i} dW_u + R,

where

R = ∫_{t_{i−1}}^{t_i} ∫_{t_{i−1}}^{u} L⁰a(X_z; θ) dz du + ∫_{t_{i−1}}^{t_i} ∫_{t_{i−1}}^{u} L¹a(X_z; θ) dW_z du + ∫_{t_{i−1}}^{t_i} ∫_{t_{i−1}}^{u} L⁰b(X_z; θ) dz dW_u + ∫_{t_{i−1}}^{t_i} ∫_{t_{i−1}}^{u} L¹b(X_z; θ) dW_z dW_u,

and

L⁰ = a ∂/∂x + (1/2) b² ∂²/∂x²,   L¹ = b ∂/∂x.   (2.7)
We let a = a(X_{t_{i−1}}; θ) and b = b(X_{t_{i−1}}; θ) to conserve space. This expansion can be continued as long as both a and b are smooth in x. For example, we can further expand L⁰a(X_z; θ) at X_{t_{i−1}} in R to obtain higher-order results. The general result is summarized in Theorem 5.5.1 of KP. The following assumptions are adapted from Section 4.5 in KP to guarantee the existence and uniqueness of a strong solution to (1.1). Let ℝ be the real line and {F_t, t ≥ 0} be a family of σ-algebras generated by W_t for all t ∈ [t_0, T]. For all x in a compact set in ℝ, we assume

ASSUMPTION 2.1. a(x; θ) and b(x; θ) are infinitely differentiable in x.

ASSUMPTION 2.2. For some positive constant K, we have |a(x; θ)|² ≤ K²(1 + |x|²) and |b(x; θ)|² ≤ K²(1 + |x|²).

ASSUMPTION 2.3. X_{t_0} is F_{t_0}-measurable with E(|X_{t_0}|²) < ∞.
Assumption 2.1 is stronger than the Lipschitz condition in KP. It offers the possibility of establishing consistency using an infinite-order approximation with a fixed Δ, although we assume Δ → 0 and a fixed approximation order in this paper. It also helps to establish the global Lipschitz condition in (2.13). Next, we introduce the Wagner–Platen expansion (a detailed discussion can be found in Chapter 5 of KP). Let α be a multi-index of length l such that α = (j_1, j_2, . . . , j_l), j_i ∈ {0, 1} for i = 1, 2, . . . , l and l := l(α) ∈ {1, 2, . . .}. Let M be the set of all multi-indices such that M = {(j_1, j_2, . . . , j_l) : j_i ∈ {0, 1}, i ∈ {1, 2, . . . , l}, for l = 1, 2, . . .} ∪ {v}, where v is the multi-index of length zero. For an α ∈ M with l(α) ≥ 1, we let −α and α− be the multi-index in M obtained by deleting the first and last element of α, respectively. We define a sequence of sets for adapted right-continuous stochastic processes f(t) with left-hand limits: let H_v be the totality of all processes such that |f(t)| < ∞, H_(0) be the totality of all processes such that ∫_{t_0}^{t} |f(u)| du < ∞, H_(1) be the totality of all processes such that ∫_{t_0}^{t} |f(u)|² du < ∞, and H_α be the totality of adapted right-continuous processes with left-hand limits such that I_{α−}[f(t)]_{t_0,t} ∈ H_{(j_l)} for all t_0 ≤ t ≤ T and l(α) ≥ 2. The multiple Itô integral I_α[f(·)]_{t_{i−1},t_i} is defined as

I_α[f(·)]_{t_{i−1},t_i} = f(t_i)                                 if l = 0,
I_α[f(·)]_{t_{i−1},t_i} = ∫_{t_{i−1}}^{t_i} I_{α−}[f(·)]_{t_{i−1},u} du    if l ≥ 1 and j_l = 0,   (2.8)
I_α[f(·)]_{t_{i−1},t_i} = ∫_{t_{i−1}}^{t_i} I_{α−}[f(·)]_{t_{i−1},u} dW_u  if l ≥ 1 and j_l = 1,
where t_0 ≤ t_{i−1} < t_i ≤ T. For example, if α = (0, 1, 1, 0), we have

I_{(0,1,1,0)}[f(·)]_{t_{i−1},t_i} = ∫_{t_{i−1}}^{t_i} ∫_{t_{i−1}}^{u_4} ∫_{t_{i−1}}^{u_3} ∫_{t_{i−1}}^{u_2} f(·) du_1 dW_{u_2} dW_{u_3} du_4.

We write I_α[f(·)]_{t_{i−1},t_i} as I_α when f(t_i) = 1. As an example, we have

I_{(0,0,0)}[1]_{t_{i−1},t_i} = I_{(0,0,0)} = ∫_{t_{i−1}}^{t_i} ∫_{t_{i−1}}^{u_3} ∫_{t_{i−1}}^{u_2} du_1 du_2 du_3 = (1/3!)(t_i − t_{i−1})³ = Δ³/6.

The researcher chooses the length l(α) in Theorem 5.5.1 of KP to decide how many terms to include in the Wagner–Platen expansion. For example, the expansion for l(α) = 2 is

X_{t_i} = X_{t_{i−1}} + a I_{(0)} + b I_{(1)} + (aa′ + 0.5b²a″) I_{(0,0)} + (ab′ + 0.5b²b″) I_{(0,1)} + ba′ I_{(1,0)} + bb′ I_{(1,1)} + R,   (2.9)

where R is the remainder in the expansion.² For each α, we define recursively the Itô coefficient function

f_α = f if l = 0;   f_α = L^{j_1} f_{−α} if l ≥ 1,   (2.10)

where L^{j_1} is defined in (2.7). If we let f(x) ≡ x, it is easy to verify the coefficients in (2.9). For example, f_{(0)} = a and f_{(0,1)} = ab′ + 0.5b²b″. A strong Wagner–Platen approximation can be obtained based on expansions such as (2.9). Let ΔW = W_{t_i} − W_{t_{i−1}}; an example of I_α is

I_{(1,1)} = ∫_{t_{i−1}}^{t_i} ∫_{t_{i−1}}^{u_1} dW_{u_2} dW_{u_1} = 0.5((ΔW)² − Δ).   (2.11)

Replacing all stochastic integrals in (2.9) with expressions similar to (2.11) and evaluating all the coefficients at X_{t_{i−1}}, we obtain a strong Wagner–Platen approximation when l(α) = 2. For stochastic integrals with higher multiplicity, it is not always possible to derive a closed form

² We let a′, a″ and a‴ be the first, second and third derivatives of a(X; θ) w.r.t. X, respectively, and a^(r) the rth derivative of a(X; θ) w.r.t. X when r ≥ 4. The same notation applies to derivatives of b(X; θ) w.r.t. X.
in terms of ΔW and Δ, but they can be approximated (see Section 5.8 in KP). However, we note that closed forms such as (2.11) are not needed for estimation. A general form of the strong Wagner–Platen approximation is given by

Y^Δ(t_i) = X_{t_{i−1}} + Σ_{α ∈ A_γ\{v}} I_α[f_α(X_{t_{i−1}})]_{t_{i−1},t_i},   (2.12)

where A_γ = {α ∈ M : l(α) + n(α) ≤ 2γ or l(α) = n(α) = γ + 1/2}, n(α) is the number of zeros in α, f_α is the coefficient function defined in (2.10) with f ≡ x, and γ = 0.5, 1, 1.5, . . . is the approximation order. The approximation in (2.12) is a special case of (10.6.4) in KP, where we let Y^Δ(t_{i−1}) = X_{t_{i−1}}. Let H_α denote the sets for multi-indices α ∈ M such that f_α(x) is square integrable in time t for l(α) > 1, B(A_γ) = {α ∈ M\A_γ : −α ∈ A_γ}, and C² denote the space of twice continuously differentiable functions in x.

THEOREM 2.1. Let Y^Δ(t_i) be the order γ strong Wagner–Platen approximation defined in (2.12) with t_0 ≤ t_i ≤ T and 0 < Δ < 1. Under Assumptions 2.1 to 2.3, suppose the coefficient functions in (2.10) satisfy

|f_α(x) − f_α(y)| ≤ K₁|x − y|   (2.13)

for all α ∈ A_γ and x, y in a compact set in ℝ; f_{−α} ∈ C² and f_α ∈ H_α for all α ∈ A_γ ∪ B(A_γ); |f_α(x)| ≤ K₂(1 + |x|) for all α ∈ A_γ ∪ B(A_γ) and x in a compact set in ℝ; and the initial condition at t_{i−1} satisfies

(E|X_{t_{i−1}} − Y^Δ(t_{i−1})|²)^{1/2} ≤ K₃ Δ^γ.   (2.14)

Then for all i and every fixed γ, we have

E|X_{t_i} − Y^Δ(t_i)| ≤ K₄ Δ^γ   (2.15)

and

lim_{Δ→0} P(|Y^Δ(t_i) − X_{t_i}| < ε) = 1 for every ε > 0.   (2.16)

K₁, K₂, K₃ and K₄ are positive constants independent of Δ.

Result (2.15) follows Corollary 10.6.4 in KP, and (2.16) is simply the result that convergence in the rth mean implies convergence in probability. In the Lipschitz condition of (2.13), we restrict the domain of f_α(x) to a compact set in ℝ, which rules out SDEs with explosive, unbounded solutions. The assumption is practically relevant since many observed data are bounded. Assumption 2.1 implies f_α(x) is also infinitely differentiable in x, which further implies that it is locally Lipschitz. A locally Lipschitz function with a compact domain is globally Lipschitz, which gives the condition in (2.13). Y^Δ(t_{i−1}) in (2.14) is the first approximation on [t_{i−1}, t_i]. By letting Y^Δ(t_{i−1}) = X_{t_{i−1}}, (2.14) is always satisfied. Result (2.16) is obtained by assuming Δ → 0 with a fixed γ. Note that the same result can also be obtained by letting γ → ∞ with fixed Δ in (2.15), provided K₄ is bounded. However, it was pointed out by a referee that, just as not every smooth function can be approximated by its Taylor series expansion, it is possible that K₄ may explode as γ → ∞. It is for this reason that we adopt the assumption of Δ → 0 with a fixed γ in this paper.
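The closed form in (2.11) can be checked pathwise with a fine-grid Riemann–Itô sum. The seeded sketch below is my own numerical illustration, not code from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
Delta, n = 1.0, 200_000                      # interval length and fine-grid size
dt = Delta / n
dW = rng.normal(0.0, np.sqrt(dt), size=n)
W = np.concatenate(([0.0], np.cumsum(dW)))   # Brownian path on the grid, W[0] = 0

# Riemann-Ito sum for I_(1,1) = int_0^Delta int_0^u dW_z dW_u:
# the inner integral at time t_j is W[j] - W[0]; multiply by the next increment.
I11_num = np.sum(W[:-1] * dW)

# Closed form (2.11): 0.5*((Delta W)^2 - Delta)
I11_exact = 0.5 * (W[-1]**2 - Delta)

assert abs(I11_num - I11_exact) < 0.02       # pathwise agreement on a fine grid
```

The discrepancy shrinks as the grid is refined, since the Riemann–Itô sum converges to the iterated integral in mean square.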
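Theorem 2.1 says the strong error is O(Δ^γ), so keeping the bb′I_{(1,1)} term from (2.9)/(2.11) on top of the Euler scheme (the Milstein scheme, strong order γ = 1.0 versus γ = 0.5) should shrink the pathwise error. A seeded sketch on the BS/GBM model, whose exact solution is known; the parameter values are my own, not from the paper:

```python
import numpy as np

# GBM dX = th2*X dt + th3*X dW with exact solution
# X_T = X_0 * exp((th2 - th3^2/2) T + th3 W_T)
th2, th3, X0, T = 0.2, 0.5, 1.0, 1.0
n_paths, n_steps = 5000, 32
D = T / n_steps                               # step size Delta

rng = np.random.default_rng(1)
dW = rng.normal(0.0, np.sqrt(D), size=(n_paths, n_steps))

eul = np.full(n_paths, X0)
mil = np.full(n_paths, X0)
for j in range(n_steps):
    w = dW[:, j]
    eul = eul + th2*eul*D + th3*eul*w
    # Milstein adds b*b'*I_(1,1) = 0.5*th3^2*X*((dW)^2 - D), cf. (2.9)/(2.11)
    mil = mil + th2*mil*D + th3*mil*w + 0.5*th3**2*mil*(w**2 - D)

exact = X0 * np.exp((th2 - 0.5*th3**2)*T + th3*dW.sum(axis=1))
err_eul = np.mean(np.abs(exact - eul))        # strong error E|X_T - Y(T)|
err_mil = np.mean(np.abs(exact - mil))
assert err_mil < err_eul   # higher strong order -> smaller pathwise error
```

Halving D and re-running shows the Euler error shrinking roughly like √Δ and the Milstein error roughly like Δ, in line with (2.15).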
2.2. Quasi-maximum likelihood estimator

Based on the approximation in (2.12) for a fixed γ, we can proceed to obtain the QMLE. Define the conditional moments based on (2.12) as μ_{t_i,Δ} ≡ E(Y^Δ(t_i)|X_{t_{i−1}}) and σ²_{t_i,Δ} ≡ Var(Y^Δ(t_i)|X_{t_{i−1}}), while the true conditional moments are μ_{t_i} ≡ E(X_{t_i}|X_{t_{i−1}}) and σ²_{t_i} ≡ Var(X_{t_i}|X_{t_{i−1}}). We note that E|X_{t_i}| < ∞ (a consequence of Theorem 4.5.3 in KP) and that Y^Δ(t_i) is uniformly integrable for every choice of Δ in (0, 1); the latter follows because there are a fixed number of terms on the right-hand side of (2.12), each term is bounded when evaluated at X_{t_{i−1}}, and the multiple Itô integrals w.r.t. time in E(Y^Δ(t_i)|X_{t_{i−1}}) are all bounded. Theorem C in Section 1.4 of Serfling (1980) in conjunction with (2.16) then implies that

lim_{Δ→0} μ_{t_i,Δ} = μ_{t_i},   lim_{Δ→0} σ²_{t_i,Δ} = σ²_{t_i}.   (2.17)

Hence the first two conditional moments of X_{t_i} are correctly specified as Δ → 0, and the QMLE in Bollerslev and Wooldridge (1992) can be used for estimation. Details about consistency, asymptotic normality and identification of the QMLE can be found in Bollerslev and Wooldridge (1992). Extension of the asymptotic results to non-stationary and non-ergodic processes is also possible (see Theorem 10.1 in Wooldridge, 1994).
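The mechanics of the Gaussian quasi-likelihood can be sketched as follows. This is a hypothetical illustration of my own: it plugs in only the order 0.5 (Euler) conditional moments, whereas the paper's estimator replaces them with the higher-order moments implied by (2.12); data are simulated from the exact CIR transition (a scaled noncentral chi-square, a standard result not stated in the paper).

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
theta_true = (0.06, 0.5, 0.15)                # (theta1, theta2, theta3), cf. Table 1
D, n = 1.0/12, 1500

# Simulate CIR exactly: X_{t+D} | X_t = c * noncentral chi^2(df, nc)
t1, t2, t3 = theta_true
c = t3**2 * (1 - np.exp(-t2*D)) / (4*t2)
df = 4*t2*t1 / t3**2
x = np.empty(n + 1); x[0] = t1
for i in range(n):
    nc = x[i] * np.exp(-t2*D) / c
    x[i+1] = c * rng.noncentral_chisquare(df, nc)

def neg_quasi_loglik(theta):
    th1, th2, th3 = theta
    if min(th1, th2, th3) <= 0.0:
        return np.inf
    # Order 0.5 conditional moments; higher-order expressions derived
    # from (2.12) would replace these two lines in the paper's estimator.
    mu = x[:-1] + th2*(th1 - x[:-1])*D
    s2 = th3**2 * x[:-1] * D
    return 0.5*np.sum(np.log(2*np.pi*s2) + (x[1:] - mu)**2 / s2)

fit = minimize(neg_quasi_loglik, x0=[0.1, 0.4, 0.2], method="Nelder-Mead")
theta_hat = fit.x    # theta1 and theta3 are typically recovered accurately
```

Only the two moment lines change as the approximation order increases, which is why the approach stays computationally cheap.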
2.3. Remarks

REMARK 2.1. This paper differs from Kessler (1997) in several respects. First, Kessler (1997) uses Taylor series expansions of the inverse and the log of the second moment to prove asymptotic properties (see p. 215 in Kessler, 1997), while we use the approach in Bollerslev and Wooldridge (1992). Secondly, this paper provides simulation results. Thirdly, while Kessler (1997) approximates the conditional moments directly, we approximate the unique strong solution to an SDE. Our approach also permits the possibility of establishing consistency using γ → ∞ with a fixed Δ.

REMARK 2.2. The QMLE is generally less efficient than exact MLE (see e.g. White, 1994). Simulation results in Tables 1–3 nonetheless suggest that, at least for the models and parameter settings used in the simulation study, the efficiency loss of the QMLE is small for both normal and non-normal data.

REMARK 2.3. Expressions for the conditional mean and variance used in the QMLE can easily be derived using software such as Mathematica, even for higher-order approximations. Because the QMLE provides a closed-form approach, it is also computationally efficient.

REMARK 2.4. The QMLE is applicable to a large class of parametric SDEs. Assumptions 2.1 and 2.3 are satisfied by many SDEs. For Assumption 2.2, consider the CIR model in Section 3.1 with positive parameters. It can be shown that the required K for a(x; θ) = θ₂(θ₁ − x) and b(x; θ) = θ₃√x are K ≥ max(θ₁θ₂, θ₂) and K ≥ θ₃/√2, respectively. Assumption 2.2 is satisfied as long as K ≥ max(θ₁θ₂, θ₂, θ₃/√2). Note that θ₃√x is not differentiable at 0, but this can be resolved by replacing Assumption 2.2 with a weaker Yamada condition (see Theorem 3.2 in Chapter IV of Ikeda and Watanabe, 1989). If we assume the domain of a(x; θ) and b(x; θ) is compact, this
Table 1. Estimated bias and standard error for the OU, CIR and BS models.

                             OU model           CIR model       BS model
θ̂1(MLE) − θ1     Bias      −0.0000267218      −0.00010845     N/A
                  S.E.       0.0064456224       0.00799336     N/A
θ̂1(EUL) − θ1     Bias      −0.0000267214      −0.00010819     N/A
                  S.E.       0.0064455361       0.00802514     N/A
θ̂1(l=3) − θ1     Bias      −0.0000267213      −0.00010906     N/A
                  S.E.       0.0064457349       0.00802198     N/A
θ̂1(l=4) − θ1     Bias      −0.0000267217      −0.00011320     N/A
                  S.E.       0.0064456897       0.00806676     N/A
θ̂2(MLE) − θ2     Bias       0.0484283531       0.05311037     −0.00007343
                  S.E.       0.1195788793       0.11960684      0.03300022
θ̂2(EUL) − θ2     Bias       0.0355759566       0.04013549      0.00164653
                  S.E.       0.1136035832       0.11629094      0.03352989
θ̂2(l=3) − θ2     Bias       0.0484253876       0.05158365     −0.00007343
                  S.E.       0.1195557966       0.11967522      0.03300232
θ̂2(l=4) − θ2     Bias       0.0484283861       0.05140656     −0.00007343
                  S.E.       0.1194060203       0.11978155      0.03300232
θ̂3(MLE) − θ3     Bias       0.0000480238       0.00011431      0.00004147
                  S.E.       0.0006872011       0.00344463      0.00670272
θ̂3(EUL) − θ3     Bias      −0.0006258604      −0.00236893      0.00566881
                  S.E.       0.0006562896       0.00341859      0.00703084
θ̂3(l=3) − θ3     Bias       0.0000478822       0.00010803      0.00004147
                  S.E.       0.0006871695       0.00344402      0.00670271
θ̂3(l=4) − θ3     Bias       0.0000480253       0.00009957      0.00004147
                  S.E.       0.0006871488       0.00344323      0.00670271

Note: Bias and S.E. reported in Table 1 are averages over 5000 replications with a sample size of 1000 and Δ = 1/12. Parameter values used in Table 1 are the same as those in Table III of Aït-Sahalia (2002). We let θ = (0.06, 0.5, 0.03) for the OU model, θ = (0.06, 0.5, 0.15) for the CIR model and θ = (N/A, 0.2, 0.3) for the BS model. QMLEs with l = 3 and l = 4 in the BS model are the same because both the drift and diffusion coefficient are constant after transforming the diffusion coefficient to one.
assumption is easily satisfied. The QMLE is also applicable to diffusions for which an analytic solution to (2.3) is unavailable, such as the example in Aït-Sahalia (1996):

dX_t = (θ₁ + θ₂X_t + θ₃X_t² + θ₄/X_t) dt + √(θ₅ + θ₆X_t + θ₇X_t^{θ₈}) dW_t.

This property becomes more attractive in multivariate diffusions, where the transform in (2.3) is not applicable to all diffusion matrices b. The QMLE can also be used when sampling intervals are unequal or random, as in Yu and Phillips (2001).

REMARK 2.5. Efficiency might be improved by using GMM with more moment conditions, such as the third and fourth moments. Deriving higher-order moments involving multiple Itô
Table 2. Estimated bias and standard error for the CIR model.

                               DGP (a)          DGP (b)         DGP (c)
θ̂1(MLE) − θ1       Bias     −0.00010845      −0.00000835      0.00044849
                    S.E.      0.00799336       0.00166959      0.01086030
θ̂1(EUL) − θ1       Bias     −0.00010819      −0.00000641      0.00044838
                    S.E.      0.00802514       0.00166950      0.01086525
θ̂1(l=3) − θ1       Bias     −0.00010906      −0.00000833      0.00044797
                    S.E.      0.00802198       0.00166956      0.01085883
θ̂1(l=4) − θ1       Bias     −0.00011320      −0.00000832      0.00044799
                    S.E.      0.00806676       0.00166956      0.01085778
θ̂1(l=3,U) − θ1     Bias     −0.00010537      −0.00000685      0.00044099
                    S.E.      0.00808349       0.00166954      0.01088451
θ̂1(l=4,U) − θ1     Bias     −0.00010512      −0.00000685      0.00044098
                    S.E.      0.00808601       0.00166954      0.01088446
θ̂1(Hermite) − θ1   Bias     −0.00011272      −0.00031291      0.00044853
                    S.E.      0.00808862       0.00268678      0.01085836
θ̂2(MLE) − θ2       Bias      0.05311037       0.00150107      0.04439189
                    S.E.      0.11960684       0.02001636      0.07823301
θ̂2(EUL) − θ2       Bias      0.04013549      −0.00882021      0.04086283
                    S.E.      0.11629094       0.01912200      0.07615430
θ̂2(l=3) − θ2       Bias      0.05158365       0.00150104      0.04437409
                    S.E.      0.11967522       0.01993963      0.07787763
θ̂2(l=4) − θ2       Bias      0.05140656       0.00150127      0.04437507
                    S.E.      0.11978155       0.01993980      0.07783932
θ̂2(l=3,U) − θ2     Bias      0.05177577       0.00152367      0.04432115
                    S.E.      0.12416308       0.01994102      0.07826028
θ̂2(l=4,U) − θ2     Bias      0.05171348       0.00152528      0.04432130
                    S.E.      0.12423279       0.01994075      0.07832116
θ̂2(Hermite) − θ2   Bias      0.05269981      −0.15188206      0.04439189
                    S.E.      0.11914530       0.00670097      0.07772617
θ̂3(MLE) − θ3       Bias      0.00011431      −0.00001566      0.00006320
                    S.E.      0.00344463       0.00067055      0.00199882
θ̂3(EUL) − θ3       Bias     −0.00236893      −0.00065034     −0.00088656
                    S.E.      0.00341859       0.00065642      0.00196946
θ̂3(l=3) − θ3       Bias      0.00010803      −0.00001574      0.00006326
                    S.E.      0.00344402       0.00067053      0.00199856
θ̂3(l=4) − θ3       Bias      0.00009957      −0.00001566      0.00006334
                    S.E.      0.00344323       0.00067053      0.00199854
θ̂3(l=3,U) − θ3     Bias      0.00008891      −0.00001616      0.00006536
                    S.E.      0.00349839       0.00067077      0.00200521
θ̂3(l=4,U) − θ3     Bias      0.00008640      −0.00001607      0.00006540
                    S.E.      0.00349792       0.00067077      0.00200525
θ̂3(Hermite) − θ3   Bias      0.00011274       0.00266855      0.00006320
                    S.E.      0.00344853       0.00065758      0.00200082

Note: Bias and S.E. reported in Table 2 are averages over 5000 replications with a sample size of 1000 and Δ = 1/12. QMLEs with superscript U are obtained from the untransformed model. DGP (a): θ = (0.06, 0.5, 0.15), DGP (b): θ = (0.06, 0.5, 0.03) and DGP (c): θ = (0.08, 0.24, 0.08838).
integrals is complicated. Even if they are obtained, GMM is still less efficient than MLE. Compared with the GMM estimator in Hansen and Scheinkman (1995), the QMLE is simple because only the first two conditional moments are needed.
3. SIMULATION RESULTS

In this section, we show that the QMLE yields high numerical precision when data are close to normal and is numerically robust when data are non-normal.

3.1. Numerical precision

QMLEs with l(α) = 3 and l(α) = 4 in (2.12) are used. We use l(α) instead of the approximation order γ simply for convenience. The approximation with l(α) = 3 is given in the Appendix, and the result for l(α) = 4 is available from the author upon request. To gauge the efficiency loss of the QMLE, the following models with closed-form transition densities are used: the Ornstein–Uhlenbeck (OU) process dX_t = θ₂(θ₁ − X_t) dt + θ₃ dW_t, the CIR model in Cox et al. (1985) dX_t = θ₂(θ₁ − X_t) dt + θ₃√X_t dW_t, and the Black–Scholes (BS) model in Black and Scholes (1973) dX_t = θ₂X_t dt + θ₃X_t dW_t. Exact MLE can be obtained for these processes. In Tables 1 and 2, θ̂(MLE), θ̂(EUL), θ̂(l=3) and θ̂(l=4) correspond to the exact MLE, the Euler estimator, and the QMLE with l(α) = 3 and l(α) = 4, respectively. Instead of the average bias, we could report the average distance between the QMLE and MLE, similar to Table III in Aït-Sahalia (2002). This average distance can easily be inferred from Tables 1 and 2. Take θ₁ in the OU model, for example. We find θ̂₁(MLE) − θ₁ ≈ −0.0000267218 and θ̂₁(l=3) − θ₁ ≈ −0.0000267213. Hence the distance between θ̂₁(l=3) and the MLE is equal to |(θ̂₁(l=3) − θ₁) − (θ̂₁(MLE) − θ₁)| ≈ 0.0000000005. Parameter values in Table 1 are selected from Table III in Aït-Sahalia (2002). Results for θ̂(l=3) and θ̂(l=4) in the CIR and BS models are obtained after transforming b to one. Table 1 suggests that the QMLE outperforms the Euler estimator in some cases and is usually very close to the exact MLE. The estimates for θ̂(l=3) and θ̂(l=4) in the BS model are identical because, after the transform Y = ln(X)/θ₃, the model has constant drift and diffusion coefficient, and higher-order terms in the approximations are equal to zero.
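As a concrete version of the OU benchmark above, the seeded sketch below (my own implementation, not the paper's code) simulates the OU model with the Table 1 parameter values θ = (0.06, 0.5, 0.03) and Δ = 1/12 using its exact Gaussian AR(1) discretization, and computes the exact (conditional) MLE in closed form from the OLS regression of X_{t+Δ} on X_t.

```python
import numpy as np

rng = np.random.default_rng(3)
t1, t2, t3 = 0.06, 0.5, 0.03                  # OU parameters from Table 1
D, n = 1.0/12, 1000

# Exact discretization: X_{t+D} = t1*(1-phi) + phi*X_t + eta,
# with phi = exp(-t2*D) and Var(eta) = t3^2*(1-phi^2)/(2*t2)
phi = np.exp(-t2*D)
s_eta = t3 * np.sqrt((1 - phi**2) / (2*t2))
x = np.empty(n + 1); x[0] = t1
for i in range(n):
    x[i+1] = t1*(1 - phi) + phi*x[i] + s_eta*rng.normal()

# Exact MLE conditional on X_0 = OLS of x_{t+D} on x_t (Gaussian AR(1))
X = np.column_stack([np.ones(n), x[:-1]])
chat, phihat = np.linalg.lstsq(X, x[1:], rcond=None)[0]
resid = x[1:] - chat - phihat*x[:-1]
t2_hat = -np.log(phihat) / D                  # map AR(1) back to SDE parameters
t1_hat = chat / (1 - phihat)
t3_hat = np.sqrt(2*t2_hat*np.mean(resid**2) / (1 - phihat**2))
```

With this sample size, θ₁ and θ₃ are recovered accurately while θ₂ has a sizeable sampling error, which matches the pattern of standard errors in Table 1.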
In Table 2, we investigate specifically the precision of the QMLE in the CIR model. Untransformed QMLEs, θ̂(l=3,U) and θ̂(l=4,U), and the estimator using all seven Hermite coefficients on page 238 of Aït-Sahalia (2002) are also reported. The data-generating process (DGP) (a) is the same as that in Table 1. DGPs (b) and (c) are selected from Durham and Gallant (2002) and Jensen and Poulsen (2002), respectively. In Table 2, we find that the QMLE offers an improvement over the Euler estimator. The higher-order Hermite expansion yields a negative approximate density for all DGPs in Table 2. All negative densities are replaced with a small positive number (eps ≈ 2.22 × 10⁻¹⁶ in Matlab). After this modification, θ̂(Hermite) is also precise. θ̂(Hermite) performs poorly in DGP (b) because it is sensitive to starting values in the simulated data. After discarding the first 200 observations in DGP (b), the biases of θ̂(Hermite) for (θ₁, θ₂, θ₃) are about −5.76E−05, 0.05 and 3.80E−05, respectively.

3.2. Numerical robustness

In the previous section, the QMLE was shown to have high numerical precision and little efficiency loss. The purpose of this section is two-fold: first, we show that the QMLE continues to work reasonably well for certain non-linear and non-normal diffusions; secondly, higher-order approximations in the QMLE provide an improvement over the Euler scheme when the sampling interval is large. Both findings reveal the good numerical robustness of the QMLE.³

In Table 3, we continue to work with the CIR model but let θ = (0.05, 0.3, 0.15). If the process takes the value of the long-run mean θ₁ (= 0.05), the selected value of θ implies a density with a skewness of 1.7 and a kurtosis of 7.5, a large enough deviation from normality for robustness-test purposes. We consider a sample size of 1000 with 1000 replications and four different sampling intervals. A reasonably large bound of [0.01, 10] is imposed in estimation. Table 3 reports the average bias of the different estimators. θ̂(PDE) is obtained from numerically solving the Fokker–Planck equation using the finite difference method. Details of θ̂(PDE) can be found in Hurn et al. (2007).
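The skewness and kurtosis quoted above can be checked against the stationary distribution of the CIR process, which is Gamma with shape 2θ₁θ₂/θ₃² (a standard result not stated in the paper). For θ = (0.05, 0.3, 0.15) the shape is 4/3, giving skewness 2/√(4/3) ≈ 1.73 and kurtosis 3 + 6/(4/3) = 7.5:

```python
import math

theta1, theta2, theta3 = 0.05, 0.3, 0.15
shape = 2*theta1*theta2 / theta3**2     # Gamma shape k = 4/3 here
skew = 2 / math.sqrt(shape)             # Gamma skewness: 2/sqrt(k)
kurt = 3 + 6 / shape                    # Gamma kurtosis: 3 + 6/k

assert abs(skew - 1.7) < 0.05           # ~1.73, the quoted "skewness of 1.7"
assert abs(kurt - 7.5) < 1e-9           # matches the quoted kurtosis of 7.5
```

A Gaussian density has skewness 0 and kurtosis 3, so this parameter choice is indeed a substantial departure from normality.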
In Table 4, the following DGP in Shoji and Ozaki (1998) is used:

dX_t = (θ₁ + θ₂X_t + θ₃X_t² + θ₄X_t³) dt + θ₅X_t^{θ₆} dW_t,   (3.1)

where θ = (6, −11, 6, −1, 1, 0.5). Because there is no closed-form transition density for (3.1), we simulate data at an interval of 0.005 based on the Euler scheme, and use four different sampling intervals with 1000 observations and 1000 replications. Reasonably large bounds are imposed in estimation: [0.01, 50] or [−50, −0.01] for θ₁ to θ₃; [−10, −0.01] or [0.01, 10] for θ₄ and θ₅; [0, 1] for θ₆. Neither θ̂(EUL) nor the QMLE hits any bound in estimation. By construction, the simulated data are already in favour of the Euler estimator, yet this does not rescue the Euler scheme in estimation.

REMARK 3.1. In Table 3, the QMLE θ̂(l=3,U) provides some improvement over θ̂(EUL), but θ̂(EUL) is generally better than θ̂(l=4,U). In Table 4, there is a very clear pattern that higher-order QMLEs outperform θ̂(EUL) as the sampling interval increases. Results in Tables 1 to 4 find some, but not overwhelming, evidence that higher-order QMLEs improve estimation. One interesting observation in Tables 1 and 2 is that the distance between the QMLE and MLE, measured by the difference in biases, is usually smaller than the distance between θ̂(EUL) and MLE, offering additional evidence

³ Discussion in this section is motivated by a referee's comment on the numerical robustness of various existing estimators. I am very grateful to his/her strong insight on this subject.
Table 3. Estimated bias for the CIR model in Section 3.2: dX_t = θ₂(θ₁ − X_t) dt + θ₃X_t^{0.5} dW_t.

Sampling interval    Bias              θ1 = 0.05     θ2 = 0.3      θ3 = 0.15
Δ = 0.05             θ̂(MLE) − θ       0.000581      0.080117      0.000161
                     θ̂(EUL) − θ       0.000690      0.076252      0.000219
                     θ̂(l=3,U) − θ     0.000810      0.072305     −0.000001
                     θ̂(l=4,U) − θ     0.000897      0.077676     −0.000171
                     θ̂(Hermite) − θ   0.001219      0.078511      0.000394
                     θ̂(PDE) − θ       0.000119      0.150129     −0.000970
Δ = 0.1              θ̂(MLE) − θ       0.000172      0.041170      0.000196
                     θ̂(EUL) − θ       0.004525      0.036714      0.000435
                     θ̂(l=3,U) − θ     0.000567      0.012868     −0.000203
                     θ̂(l=4,U) − θ     0.000673      0.046490     −0.000667
                     θ̂(Hermite) − θ   0.001168      0.024282      0.000809
                     θ̂(PDE) − θ      −0.000263      0.097011     −0.001051
Δ = 0.15             θ̂(MLE) − θ       0.000231      0.023558      0.000267
                     θ̂(EUL) − θ       0.000232      0.014127      0.000603
                     θ̂(l=3,U) − θ     0.000531      0.019219     −0.000521
                     θ̂(l=4,U) − θ     0.001046      0.029633     −0.001198
                     θ̂(Hermite) − θ   0.002367      0.000857      0.001525
                     θ̂(PDE) − θ       0.000368      0.071206     −0.001030
Δ = 0.2              θ̂(MLE) − θ       0.000095      0.018427      0.000260
                     θ̂(EUL) − θ       0.001964      0.005736      0.000835
                     θ̂(l=3,U) − θ     0.000490      0.012888     −0.000710
                     θ̂(l=4,U) − θ     0.001207      0.024207     −0.001653
                     θ̂(Hermite) − θ   0.003693     −0.007201      0.001950
                     θ̂(PDE) − θ       0.000387      0.063130     −0.001124

Note: All biases are averages over 1000 replications for a sample of 1000 observations. QMLEs with superscript U are obtained without transforming the diffusion coefficient to one.
that higher-order estimators might improve efficiency. We also note that Fan and Zhang (2003) find that higher-order non-parametric estimators reduce bias but increase variances.

REMARK 3.2. Hermite polynomials produce negative densities for the DGPs in Tables 3 and 4, and all negative densities are replaced with a small positive number in estimation. The corresponding log-likelihood function tends to have a much larger value on the boundary than in the neighbourhood of θ. To prevent estimates from taking values on the boundary, all estimates in Tables 3 and 4 are obtained from a local search algorithm instead of a global algorithm with multiple searches. After these modifications, θ̂(Hermite) is also precise.
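A minimal version of the simulation design for (3.1) described above — Euler steps of size 0.005, subsampled at interval Δ = 0.05 — is sketched below. The initial value and the sqrt guard are my own choices, not from the paper. Note that the drift factors as −(x − 1)(x − 2)(x − 3), so the process fluctuates around the stable equilibria at 1 and 3.

```python
import numpy as np

rng = np.random.default_rng(4)
th = (6.0, -11.0, 6.0, -1.0, 1.0, 0.5)       # theta_1..theta_6 from (3.1)
dt, obs_D, n_obs = 0.005, 0.05, 1000          # fine Euler step, sampling interval
steps = round(obs_D / dt)                     # Euler sub-steps per observation

def drift(x):   # 6 - 11x + 6x^2 - x^3 = -(x-1)(x-2)(x-3)
    return th[0] + th[1]*x + th[2]*x**2 + th[3]*x**3

x = 1.5                                       # my own starting value
path = np.empty(n_obs)
for i in range(n_obs):
    for _ in range(steps):
        # max(x, 0) guards the power x**0.5 against rare negative Euler excursions
        x = x + drift(x)*dt + th[4]*max(x, 0.0)**th[5]*np.sqrt(dt)*rng.normal()
    path[i] = x

# the cubic drift keeps the process in a band around its stable equilibria
assert np.isfinite(path).all() and 0.0 < path.mean() < 4.0
```

Subsampling every `steps` fine steps gives exactly the kind of data used for the Δ = 0.05 rows of Table 4; the larger sampling intervals reuse the same fine grid with more sub-steps per observation.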
Table 4. Estimated bias for the model in (3.1): dX_t = (θ₁ + θ₂X_t + θ₃X_t² + θ₄X_t³) dt + θ₅X_t^{θ₆} dW_t.

Sampling interval   Bias             θ1 = 6     θ2 = −11    θ3 = 6     θ4 = −1    θ5 = 1     θ6 = 0.5
Δ = 0.05            θ̂(EUL) − θ      −0.5143     1.2670     −0.8066     0.1299    −0.0396     0.0049
                    θ̂(l=3,U) − θ     0.4113    −0.6454      0.2946    −0.0545     0.0059    −0.0044
                    θ̂(l=4,U) − θ     0.5674    −1.0410      0.5768    −0.1116     0.0077    −0.0032
                    θ̂(Hermite) − θ   0.7068    −1.2189      0.7238    −0.1563     0.0041    −0.0051
Δ = 0.1             θ̂(EUL) − θ      −1.3575     3.0268     −1.8406     0.3085    −0.0731     0.0110
                    θ̂(l=3,U) − θ    −0.1419     0.6002     −0.4991     0.0940    −0.0036     0.0010
                    θ̂(l=4,U) − θ     0.2454    −0.2705      0.0417    −0.0025     0.0077    −0.0135
                    θ̂(Hermite) − θ  −0.0967     0.4727     −0.3364    −0.0156    −0.0090     0.0040
Δ = 0.15            θ̂(EUL) − θ      −1.9770     4.2979     −2.5724     0.4317    −0.0979     0.0094
                    θ̂(l=3,U) − θ    −0.8969     2.2662     −1.5219     0.2762    −0.0212     0.0194
                    θ̂(l=4,U) − θ    −0.4304     1.2441     −0.9217     0.1739     0.0086    −0.0477
                    θ̂(Hermite) − θ  −1.2150     2.8853     −1.6679     0.2592    −0.0110     0.0152
Δ = 0.2             θ̂(EUL) − θ      −2.4530     5.2669     −3.1276     0.5246    −0.1169     0.0029
                    θ̂(l=3,U) − θ    −1.3886     3.7118     −2.4003     0.4241    −0.0390     0.0471
                    θ̂(l=4,U) − θ    −1.1086     2.7116     −1.8218     0.3333     0.0109    −0.0838
                    θ̂(Hermite) − θ  −1.9597     4.2395     −2.5136     0.3716    −0.0125     0.0190

Note: All biases are averages over 1000 replications for a sample size of 1000. QMLEs with superscript U are obtained without transforming the diffusion coefficient to one.
REMARK 3.3. θ̂(PDE) requires recursively solving a tri-diagonal system and uses several for-loops in Matlab, and the optimization is extremely slow. Instead, the results for θ̂(PDE) in Table 3 are obtained using the C++ code in Hurn et al. (2007). Whenever the finite difference method produces a non-positive density, it is replaced with a small positive number (1.0E−15). After this modification, θ̂(PDE) gives good results in Table 3. However, θ̂(PDE) fails to obtain sensible estimates in Table 4 and is not reported. This is due to the particular way the data are generated. Note that in estimation we let the unit of discretization of the state space be Δx = 0.001 and that of time be Δ/10, respectively. When Δ = 0.05, Δ/10 = 0.005, which is exactly equal to the interval used in simulating data from (3.1). This poses difficulty for the finite difference method because the data are normal on this interval (0.005), but the model we try to fit is non-normal. Even if we find a way to make the method work, the computation cost remains a big concern. For a sample of 1000 observations from (3.1) and on an Athlon 2.91 GHz desktop with 4 GB of RAM, the computation time in Matlab for θ̂(l=3,U), θ̂(l=4,U) and θ̂(Hermite) (with seven Hermite polynomial coefficients) is about 4, 11 and 16 seconds, respectively. Using the C++ code, the computation time for θ̂(PDE) is about 5000 seconds without producing a sensible result. To possibly improve estimation, a finer discretization in both x and time is needed, which will further increase the computation cost. Similar results are found in Table 5 of Hurn et al. (2007), where θ̂(Hermite) is more than 3000 times faster than θ̂(PDE) in a four-parameter model. For the six-parameter model in (3.1), we conjecture that other estimators will be chosen over θ̂(PDE) because of the computation cost.
Our limited simulation results suggest that the proposed QMLE works better, relative to the Euler estimator, when both the drift and the diffusion coefficient are (highly) non-linear.
4. CONCLUSION

This paper introduces a QMLE for discretely observed diffusions. The estimator is based on higher-order numerical solutions to an SDE, and it is applicable to a large class of parametric diffusions. A simulation study reveals good finite-sample properties. In summary, the QMLE is conceptually simple, computationally efficient, and numerically precise and robust. Extending the current method to multivariate diffusions is straightforward, and the QMLE works without transforming the diffusion matrix to an identity matrix (see Huang, 2010). It would be interesting to investigate the efficiency gain from using higher-order approximations in SMLE. We leave this for future research.
ACKNOWLEDGMENTS The author thanks a co-editor and three referees for outstanding comments and suggestions. The author also thanks Eckhard Platen, Yongjun Tang and seminar participants at the SNDE 17th Annual Symposium for helpful comments. All remaining errors are mine.
REFERENCES

Aït-Sahalia, Y. (1996). Testing continuous-time models of the spot interest rate. Review of Financial Studies 9, 385–426.
Aït-Sahalia, Y. (2002). Maximum likelihood estimation of discretely sampled diffusions: a closed-form approach. Econometrica 70, 223–62.
Aït-Sahalia, Y. (2007). Estimating continuous-time models using discretely sampled data. In R. Blundell, W. K. Newey and T. Persson (Eds.), Advances in Economics and Econometrics, Theory and Applications, Ninth World Congress, Volume 3, 261–327. Cambridge: Cambridge University Press.
Beskos, A., O. Papaspiliopoulos and G. Roberts (2009). Monte-Carlo maximum likelihood estimation for discretely observed diffusion processes. Annals of Statistics 37, 223–45.
Black, F. and M. Scholes (1973). The pricing of options and corporate liabilities. Journal of Political Economy 81, 637–54.
Bollerslev, T. and J. M. Wooldridge (1992). Quasi-maximum likelihood estimation and inference in dynamic models with time-varying covariances. Econometric Reviews 11, 143–72.
Brandt, M. W. and P. Santa-Clara (2002). Simulated likelihood estimation of diffusions with an application to exchange rate dynamics in incomplete markets. Journal of Financial Economics 63, 161–210.
Chan, K. C., G. A. Karolyi, F. A. Longstaff and A. B. Sanders (1992). An empirical comparison of alternative models of the short-term interest rate. Journal of Finance 47, 1209–28.
Cox, J. C., J. E. Ingersoll and S. A. Ross (1985). A theory of the term structure of interest rates. Econometrica 53, 385–407.
Durham, G. B. and A. R. Gallant (2002). Numerical techniques for maximum likelihood estimation of continuous-time diffusion processes. Journal of Business and Economic Statistics 20, 297–316.
Elerian, O. (1998). A note on the existence of a closed form conditional transition density for the Milstein scheme. Working paper, Oxford University.
Quasi-maximum likelihood estimation of diffusions
Elerian, O., S. Chib and N. Shephard (2001). Likelihood inference for discretely observed non-linear diffusions. Econometrica 69, 959–93.
Eraker, B. (2001). MCMC analysis of diffusion models with application to finance. Journal of Business and Economic Statistics 19, 177–91.
Fan, J. (2005). A selective overview of nonparametric methods in financial econometrics (with discussion). Statistical Science 20, 316–40.
Fan, J. and C. Zhang (2003). A re-examination of diffusion estimations with applications to financial model validation. Journal of the American Statistical Association 98, 118–34.
Florens-Zmirou, D. (1989). Approximate discrete-time schemes for statistics of diffusion processes. Statistics 20, 547–57.
Gallant, A. R. and G. Tauchen (1997). Estimation of continuous-time models for stock returns and interest rates. Macroeconomic Dynamics 1, 135–68.
Gouriéroux, C., A. Monfort and E. Renault (1993). Indirect inference. Journal of Applied Econometrics 8, 85–118.
Hansen, L. P. and J. A. Scheinkman (1995). Back to the future: generating moment implications for continuous-time Markov processes. Econometrica 63, 767–804.
Huang, X. (2010). Quasi-maximum likelihood estimation of multivariate diffusions. Working paper, Kennesaw State University.
Hurn, A. S., J. I. Jeisman and K. A. Lindsay (2007). Seeing the wood for the trees: a critical evaluation of methods to estimate the parameters of stochastic differential equations. Journal of Financial Econometrics 5, 390–455.
Ikeda, N. and S. Watanabe (1989). Stochastic Differential Equations and Diffusion Processes (2nd ed.). New York: North-Holland.
Jensen, B. and R. Poulsen (2002). Transition densities of diffusion processes: numerical comparison of approximation techniques. Journal of Derivatives 9, 18–32.
Kelly, L., E. Platen and M. Sørensen (2004). Estimation for discretely observed diffusions using transform functions. Journal of Applied Probability 41, 99–118.
Kessler, M. (1997). Estimation of an ergodic diffusion from discrete observations. Scandinavian Journal of Statistics 24, 211–29.
Kloeden, P. and E. Platen (1999). Numerical Solution of Stochastic Differential Equations (corrected 3rd printing). New York: Springer.
Lo, A. W. (1988). Maximum likelihood estimation of generalized Ito processes with discretely sampled data. Econometric Theory 4, 231–47.
Pedersen, A. R. (1995). A new approach to maximum-likelihood estimation for stochastic differential equations based on discrete observations. Scandinavian Journal of Statistics 22, 55–71.
Phillips, P. C. B. and J. Yu (2009). A two-stage realized volatility approach to estimation of diffusion processes with discrete data. Journal of Econometrics 150, 139–50.
Serfling, R. J. (1980). Approximation Theorems of Mathematical Statistics. New York: John Wiley.
Shoji, I. and T. Ozaki (1998). Estimation for nonlinear stochastic differential equations by a local linearization method. Stochastic Analysis and Applications 16, 733–52.
White, H. (1994). Estimation, Inference and Specification Analysis. Cambridge: Cambridge University Press.
Wooldridge, J. M. (1994). Estimation and inference for dependent processes. In R. F. Engle and D. McFadden (Eds.), Handbook of Econometrics, Volume 4, 2639–738. Amsterdam: North-Holland.
Yu, J. and P. C. B. Phillips (2001). A Gaussian approach for continuous time models of the short-term interest rate. Econometrics Journal 4, 210–24.
X. Huang
APPENDIX A: PROOF OF THEOREM 2.1

Proof of Theorem 2.1: Equation (2.15) is a special case of the uniform convergence result in Corollary 10.6.4 in KP. Using Chebychev's inequality, as Δ → 0, we obtain

P(|X_{t_i} − Y^Δ(t_i)| > ε) ≤ ε⁻¹ E|X_{t_i} − Y^Δ(t_i)| ≤ ε⁻¹ K₄ Δ^γ → 0,

which implies Y^Δ(t_i) → X_{t_i} in probability in (2.16).
APPENDIX B: AN EXAMPLE OF QMLE

Recall that f_α is the coefficient function defined in (2.10). When evaluated at X_{t_{i−1}}, f_α can be taken outside the integral. The following approximation when l(α) = 3 is obtained from Section 5.5 in KP:

Y^Δ(t_i) = X_{t_{i−1}} + f_{(0)} I_{(0)} + f_{(1)} I_{(1)} + f_{(0,0)} I_{(0,0)} + f_{(0,1)} I_{(0,1)} + f_{(1,0)} I_{(1,0)} + f_{(1,1)} I_{(1,1)} + f_{(0,0,0)} I_{(0,0,0)} + f_{(0,0,1)} I_{(0,0,1)} + f_{(0,1,0)} I_{(0,1,0)} + f_{(0,1,1)} I_{(0,1,1)} + f_{(1,0,0)} I_{(1,0,0)} + f_{(1,0,1)} I_{(1,0,1)} + f_{(1,1,0)} I_{(1,1,0)} + f_{(1,1,1)} I_{(1,1,1)},

where

f_{(0)} = a,  f_{(1)} = b,
f_{(0,0)} = a a' + 0.5 b² a'',  f_{(0,1)} = a b' + 0.5 b² b'',
f_{(1,0)} = b a',  f_{(1,1)} = b b',
f_{(0,0,0)} = a(a a'' + (a')² + b b' a'' + 0.5 b² a''') + 0.5 b²(a a''' + 3 a' a'' + ((b')² + b b'') a'' + 2 b b' a''') + 0.25 b⁴ a^(4),
f_{(0,0,1)} = a(a' b' + a b'' + b b' b'' + 0.5 b² b''') + 0.5 b²(a'' b' + 2 a' b'' + a b''' + ((b')² + b b'') b'' + 2 b b' b''' + 0.5 b² b^(4)),
f_{(0,1,0)} = a(b' a' + b a'') + 0.5 b²(b'' a' + 2 b' a'' + b a'''),
f_{(0,1,1)} = a((b')² + b b'') + 0.5 b²(b'' b' + 2 b' b'' + b b'''),
f_{(1,0,0)} = b(a a'' + (a')² + b b' a'' + 0.5 b² a'''),
f_{(1,0,1)} = b(a b'' + a' b' + b b' b'' + 0.5 b² b'''),
f_{(1,1,0)} = b(a' b' + a'' b),
f_{(1,1,1)} = b((b')² + b b'').

The conditional expectation and variance of Y^Δ(t_i) are

μ_{t_i,Δ} = X_{t_{i−1}} + f_{(0)} Δ + f_{(0,0)} Δ²/2 + f_{(0,0,0)} Δ³/6,

σ²_{t_i,Δ} = f_{(1)}² Δ + (f_{(0,1)} f_{(1)} + f_{(1)} f_{(1,0)} + f_{(1,1)}²/2) Δ²
+ (f_{(0,1)}²/3 + f_{(0,0,1)} f_{(1)}/3 + f_{(1)} f_{(1,0,0)}/3 + f_{(0,1,1)} f_{(1,1)}/3 + f_{(0,1,0)} f_{(1)}/3 + f_{(0,1)} f_{(1,0)}/3 + f_{(1,0)}²/3 + f_{(1,0,1)} f_{(1,1)}/3 + f_{(1,1)} f_{(1,1,0)}/3 + f_{(1,1,1)}²/6) Δ³
+ (f_{(0,0,1)} f_{(0,1)}/4 + f_{(0,1)} f_{(0,1,0)}/6 + f_{(0,1,1)}²/12 + f_{(0,0,1)} f_{(1,0)}/12 + f_{(0,1,0)} f_{(1,0)}/6 + f_{(0,1)} f_{(1,0,0)}/12 + f_{(1,0)} f_{(1,0,0)}/4 + f_{(0,1,1)} f_{(1,0,1)}/12 + f_{(1,0,1)}²/12 + f_{(0,1,1)} f_{(1,1,0)}/12 + f_{(1,0,1)} f_{(1,1,0)}/12 + f_{(1,1,0)}²/12) Δ⁴
+ (f_{(0,0,1)}²/20 + f_{(0,0,1)} f_{(0,1,0)}/20 + f_{(0,1,0)}²/30 + f_{(0,0,1)} f_{(1,0,0)}/60 + f_{(0,1,0)} f_{(1,0,0)}/20 + f_{(1,0,0)}²/20) Δ⁵,

where terms such as Var(I_{(1,0)}) and Cov(I_{(1,0,1)}, I_{(1,1)}) are calculated using Lemma 5.7.2 in KP. Finally, a normal density function with mean μ_{t_i,Δ} and variance σ²_{t_i,Δ} is used for estimation.
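As a quick sanity check on the expansion, consider the Ornstein–Uhlenbeck process dX = κ(μ − X) dt + σ dW, an illustrative choice on our part (not the paper's example), for which the exact conditional moments are available in closed form. Only six of the coefficient functions above are non-zero in this case, and a short Python sketch compares the approximate moments with the exact ones:

```python
import math

# Ornstein-Uhlenbeck: a(x) = kappa*(mu - x), b(x) = sigma (constant), so
# a' = -kappa and every other derivative of a and b vanishes. The only
# non-zero coefficient functions are f_(0), f_(0,0), f_(0,0,0), f_(1),
# f_(1,0) and f_(1,0,0).
kappa, mu, sigma = 0.5, 0.06, 0.02   # illustrative parameter values
x, delta = 0.03, 0.1                 # current state and sampling interval

f0 = kappa * (mu - x)     # f_(0)     = a
f00 = -kappa * f0         # f_(0,0)   = a a'
f000 = kappa**2 * f0      # f_(0,0,0) = a (a')^2
f1 = sigma                # f_(1)     = b
f10 = -sigma * kappa      # f_(1,0)   = b a'
f100 = sigma * kappa**2   # f_(1,0,0) = b (a')^2

# Approximate conditional moments from the expansion above
mean_approx = x + f0 * delta + f00 * delta**2 / 2 + f000 * delta**3 / 6
var_approx = (f1**2 * delta
              + f1 * f10 * delta**2
              + (f1 * f100 / 3 + f10**2 / 3) * delta**3
              + (f10 * f100 / 4) * delta**4
              + (f100**2 / 20) * delta**5)

# Exact OU conditional moments over an interval of length delta
mean_exact = mu + (x - mu) * math.exp(-kappa * delta)
var_exact = sigma**2 * (1 - math.exp(-2 * kappa * delta)) / (2 * kappa)
```

For these values the approximate moments agree with the exact ones up to terms of order Δ⁴, which is where the truncated expansion begins to deviate from the exact variance.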
The Econometrics Journal (2011), volume 14, pp. 257–277. doi: 10.1111/j.1368-423X.2010.00337.x

On the efficiency of a semi-parametric GARCH model

JIANING DI†,‡ AND ASHIS GANGOPADHYAY‡

†Johnson & Johnson PRD, LLC, 3210 Merryfield Row, San Diego, CA 92121, USA. E-mail: [email protected]

‡Department of Mathematics and Statistics, Boston University, 111 Cummington Street, Boston, MA 02215, USA. E-mail: [email protected]

First version received: October 2009; final version accepted: October 2010
Summary Financial time series exhibit time-varying volatilities and non-Gaussian distributions. There has been considerable research on the GARCH models for dealing with these issues related to financial data. Since in practice the true error distribution is unknown, various quasi-maximum likelihood methods based on different assumptions on the error distribution have been studied in the literature. However, the specification of the distribution family or in particular the shape parameter of the density function is often incorrect. This leads to an efficiency loss quite common in such estimation procedures. To avoid the inaccuracy, semi-parametric maximum likelihood approaches were introduced, where the estimators of the GARCH parameters are derived from the likelihood based on a non-parametrically estimated density function. In general, the semi-parametric likelihood function is trimmed for the extreme observations in order to derive a √n-consistent estimator. In this paper we consider the situation in which the untrimmed likelihood function is maximized to develop a new semi-parametric estimator. The resulting estimator is consistent, asymptotically Gaussian with a vanishing bias term, and a limiting variance–covariance matrix that attains the information lower bound. This work also provides insight into the efficiencies (bias–variability trade-off) of a general class of semi-parametric estimators of GARCH models. Keywords: Adaptive estimation, Efficiency, GARCH, Kernel, Semi-parametric.
1. INTRODUCTION

During the last several decades, a significant body of knowledge has been developed in the area of modelling financial time series. In particular, considerable research has focused on the modelling of the observed financial returns, especially after the introduction of the ARCH model by Engle (1982) and the GARCH model by Bollerslev (1986). The GARCH model for the volatility of returns assumes the following mathematical form:

r_t = σ_t ε_t,   σ_t² = α₀ + Σ_{i=1}^{p} α_i r²_{t−i} + Σ_{j=1}^{q} β_j σ²_{t−j},   (1.1)

where ε_t ∼ f, i.i.d., with E[ε_t] = 0.
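As a concrete illustration (not part of the paper), the recursion in (1.1) can be simulated directly. The sketch below uses the parameter values of the Section 4 design, θ* = (0.2, 0.05, 0.9), with standard Gaussian errors as an illustrative choice of f:

```python
import math
import random

def simulate_garch11(n, alpha0, alpha1, beta1, seed=0):
    """Simulate r_t = sigma_t * eps_t with
    sigma_t^2 = alpha0 + alpha1 * r_{t-1}^2 + beta1 * sigma_{t-1}^2."""
    rng = random.Random(seed)
    # start the recursion at the unconditional variance alpha0/(1 - alpha1 - beta1)
    sigma2 = alpha0 / (1.0 - alpha1 - beta1)
    r = []
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)       # eps_t ~ f, here N(0,1) for illustration
        ret = math.sqrt(sigma2) * eps   # r_t = sigma_t * eps_t
        r.append(ret)
        sigma2 = alpha0 + alpha1 * ret**2 + beta1 * sigma2  # next sigma_t^2
    return r

returns = simulate_garch11(5000, 0.2, 0.05, 0.9)
```

With these values the unconditional variance is α₀/(1 − α₁ − β₁) = 0.2/0.05 = 4, so the sample variance of a long simulated path should hover around 4.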
When f is known, the maximum likelihood estimator (MLE) of the GARCH model parameter θ = (α₀, ..., α_p, β₁, ..., β_q) can be derived. Under regularity conditions, the MLE is √n-consistent with an asymptotic variance–covariance matrix that reflects the Fisher information (e.g. González-Rivera and Drost, 1999, Francq and Zakoïan, 2004). When f is unknown, two estimation procedures are often considered. The first approach is to specify a parametric form for the error distribution f, and estimate the model in a parametric manner (e.g. Bollerslev, 1986, Weiss, 1986, Newey and Steigerwald, 1997, Mukherjee, 2006); see also Berkes and Horváth (2003, 2004) for a slightly different approach. The second approach does not assume a parametric form for f, but addresses the problem either by estimating f via non-parametric approaches (e.g. Engle and González-Rivera, 1991, Linton, 1993, Drost and Klaassen, 1997), or by getting around the problem by treating f as a nuisance component of the model, based on ideas such as the method of moments (Jacquier et al., 1994). Compared with the first approach, the second method avoids the inaccuracy introduced by an incorrect specification of the form of f, and therefore improves the estimation efficiency. Comparisons between the two classes of estimators are available in the literature (e.g. Engle and González-Rivera, 1991, González-Rivera and Drost, 1999, Berkes and Horváth, 2004). In this paper, we have studied a semi-parametric MLE of a stationary and ergodic GARCH model with finite variance, where the true error density is estimated via the usual kernel method. Unlike most of the existing semi-parametric estimation procedures, the corresponding semi-parametric likelihood (or the score function) is not trimmed for extreme values. Thus no information is left out of the estimation procedure. As a result, the corresponding estimator gains efficiency in the sense that it has an asymptotic variance–covariance matrix identical to that of the MLE. However, on the other hand, the estimator also loses efficiency due to the slowly decaying bias introduced by the untrimmed kernel estimate of the unknown density. This suggests that the flexibility and efficiency of semi-parametric estimation are accompanied by the complexity associated with the bias–variance trade-off. The paper is organized as follows. In Section 2, we discuss the parametric and semi-parametric estimation of the GARCH parameters and the corresponding efficiency loss. In Section 3, we discuss the proposed semi-parametric estimator and establish its asymptotic characteristics. Simulation results are reported in Section 4. The proofs of the results are given in the Appendix. Throughout the paper, we assume all conditions for the existence of an ergodic and stationary GARCH process. This requires the parameter of interest to lie inside a specific parameter space. For instance, in the case of the GARCH(1,1) model, it is required that E[log(α₁ε²_t + β₁)] < 0 (Nelson, 1991). See Bougerol and Picard (1992) for the generalization to the GARCH(p, q) case with p ≥ 1 and q ≥ 1. Finally, we assume all conditions that ensure the consistency and asymptotic normality of the MLE (González-Rivera and Drost, 1999). To simplify the proofs, without loss of generality, all results are given in terms of a GARCH(1,1) model without a conditional mean term. Hence the variance equation in (1.1) reduces to σ²_t = α₀ + α₁ r²_{t−1} + β₁ σ²_{t−1}. We use θ* = (α₀*, α₁*, β₁*) to represent the true value of the GARCH parameter. In addition, we use ·^[k] to represent the kth derivative of a function w.r.t. the underlying variable, and the usual prime ·' to represent the derivative of a function w.r.t. its input. For example, we have f^[1](x) = f'(x) but f^[1](g(x)) = f'(g(x)) g^[1](x).
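Nelson's strict-stationarity condition E[log(α₁ε²_t + β₁)] < 0 is straightforward to verify by Monte Carlo. The sketch below is our own illustrative check, assuming Gaussian errors and the parameter values α₁ = 0.05, β₁ = 0.9 used later in Section 4:

```python
import math
import random

def nelson_condition(alpha1, beta1, n=100_000, seed=1):
    """Monte Carlo estimate of E[log(alpha1*eps^2 + beta1)] for eps ~ N(0,1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)
        total += math.log(alpha1 * eps**2 + beta1)
    return total / n

# strict stationarity of GARCH(1,1) requires this expectation to be negative
cond = nelson_condition(0.05, 0.9)
```

For these parameter values the expectation is comfortably negative (roughly −0.05), so the strict-stationarity requirement is met even though α₁ + β₁ = 0.95 is close to one.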
2. PARAMETRIC VERSUS SEMI-PARAMETRIC ESTIMATION IN GARCH

Both parametric and semi-parametric estimators of the GARCH parameters have been studied in the literature under the assumption that the error distribution f is unknown. These approaches are often likelihood based due to the simple parametric structure of the GARCH model, but they differ in the way the unknown error distribution is specified: parametrically or non-parametrically. Unlike the situation where f is known, with an unknown f additional assumptions need to be made to ensure parameter identifiability, as both the model parameter θ and the variance of the error term ε_t affect the overall variability of the observation r_t. Two approaches are often considered for this purpose. One approach, Engle's (G)ARCH, assumes the variance of ε_t to be known, so that the variation of r_t is then completely captured by θ. In particular, ε_t is often assumed to have unit variance; thus σ²_t in (1.1) can be interpreted as the conditional variance (e.g. Engle, 1982, Bollerslev, 1986, Weiss, 1986); see also Berkes and Horváth (2004) for a different way of characterizing this assumption. The other method, Linton's GARCH, allows the error ε_t to 'absorb' the variability, and re-parametrizes the model with a new parameter of interest θ̇ = (α₁/α₀, β₁). Therefore, this approach only explains the relative effect of the model parameters on future volatility. Similarly, one may modify the parameter setup by adding additional terms related to σ_t to the mean equation in (1.1). See, for example, the GARCH-M model or the general CH model (e.g. Engle et al., 1987, Drost and Klaassen, 1997, Newey and Steigerwald, 1997). See also Redner (1981) for a discussion on ensuring parameter identifiability by reconstructing the parameter space.

2.1. Parametric estimation of the GARCH model

Parametric estimation procedures often use Engle's approach to identify the parameters, as the error density f needs to be fully specified anyway. In particular, a quasi-MLE (QMLE) is defined to be the maximizer of the quasi-likelihood function that is derived by assuming a specific parametric form for f. Empirical and theoretical properties of the QMLE have been studied under several scenarios (e.g. Weiss, 1986, Newey and Steigerwald, 1997, Berkes and Horváth, 2003, Francq and Zakoïan, 2004). When f is specified correctly, the QMLE becomes the MLE. When f is specified incorrectly, the results are mixed. In particular, if f is specified as Gaussian, then under some moment conditions on ε_t or r_t the corresponding Gaussian QMLE has been shown to be consistent and asymptotically normal, even if the true f is not Gaussian (e.g. Weiss, 1986, Bollerslev and Wooldridge, 1992). Large-sample characteristics of the Gaussian QMLE have also been studied in situations where the moment assumptions are violated (e.g. Mikosch and Straumann, 2002, Hall and Yao, 2003, Berkes and Horváth, 2004). However, in contrast to the Gaussian QMLE, QMLEs based on other assumed distributions are not consistent in general (Drost and Klaassen, 1997, Newey and Steigerwald, 1997). In a later study, Mukherjee (2006) quantified the inconsistency of a QMLE based on an assumed density g by a bias term (1 − c_{f,g})θ*, where c_{f,g} > 0 satisfies E[q(ε/c_{f,g}^{1/2})] = 1 with q(x) = −x g'(x)/g(x). The value of this constant can be easily calculated. For example, c_{f,g} = 1 when g is taken to be f or N(0, 1), and c_{f,g} = 0.82 when f and g are taken to be the standardized F_{10,50} and Student t_{15} distributions, respectively. In addition to the potential bias, the estimation inaccuracy of the parametric approaches is also reflected in their large estimation variability. Such loss of efficiency is due to the fact that
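The defining equation of c_{f,g} is easy to work with in the Gaussian case: for g = N(0, 1), q(x) = −x g'(x)/g(x) = x², so E[q(ε/c^{1/2})] = E[ε²]/c and the unique root of E[q(ε/c^{1/2})] = 1 is c = E[ε²] = 1 for any unit-variance f, as stated above. A small Monte Carlo sketch (the standardized t₁₅ errors and the sample size are our illustrative choices):

```python
import math
import random

rng = random.Random(42)
nu = 15  # degrees of freedom of the Student t errors (illustrative)

def standardized_t():
    """Draw eps ~ t_nu rescaled to unit variance, via t = Z / sqrt(V/nu)."""
    z = rng.gauss(0.0, 1.0)
    v = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(nu))  # chi-square(nu)
    t = z / math.sqrt(v / nu)
    return t / math.sqrt(nu / (nu - 2))  # Var(t_nu) = nu/(nu-2)

eps = [standardized_t() for _ in range(20_000)]

# For Gaussian g: q(x) = x^2, so E[q(eps/sqrt(c))] = E[eps^2]/c = 1 gives
# c = E[eps^2], which is 1 because the errors have unit variance.
c_hat = sum(e * e for e in eps) / len(eps)
```

The Monte Carlo estimate c_hat should be close to 1; reproducing the value c_{f,g} = 0.82 for the (F₁₀,₅₀, t₁₅) pair would require solving the moment equation with the t₁₅ score, which we do not attempt here.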
there is a permanent source of error due to the incorrect model specification, unless the assumed function g perfectly matches the true error distribution f. This kind of extra estimation variability cannot be eliminated by simply increasing the sample size. For example, Francq and Zakoïan (2004) showed that the Gaussian QMLE inflates the estimation variability by a factor of (κ − 1)/2 over the true MLE, where κ is the kurtosis of f (see also Engle and González-Rivera, 1991). González-Rivera and Drost (1999) showed that the efficiency bound of the Gaussian QMLE was larger than that of the MLE in a general location-scale model. The first three rows of Figure 1 illustrate the sampling distributions of the MLE, Gaussian QMLE and t₁₅-QMLE of a GARCH(1,1) model. It is clear that both QMLEs have larger variability, and the t₁₅-QMLE also introduces a negative bias.

2.2. Semi-parametric estimation of the GARCH model

It is expected that the QMLE can be improved if the quasi-likelihood can be derived from a more accurately specified error distribution. Since an accurate initial specification of the error distribution is often difficult, one alternative is to approximate the true error distribution f with a sequence of specifications, which, ideally, should approach f as information accumulates. Semi-parametric maximum likelihood estimators (SMLEs) are introduced based on this idea, and have been applied to the estimation of GARCH parameters under particular model (re-)parametrizations (e.g. Engle and González-Rivera, 1991, Linton, 1993, Drost and Klaassen, 1997, Hafner and Rombouts, 2007). The underlying algorithm of semi-parametric estimation usually consists of two steps. In the first step, a consistent (parametric) estimator θ_n of the parameter of interest is derived and plugged into the model so that a sample of residuals can be generated. These residuals are subsequently used to estimate the density f, generally in a non-parametric framework. In the second step, the estimated density f̂_n is used to construct the semi-parametric likelihood function, which is maximized to derive the SMLE. To avoid the computational burden, one may apply an asymptotically equivalent approach that 'corrects' the original estimator θ_n with the influence function derived from the likelihood scores (Le Cam, 1969, Bickel, 1982, Drost and Klaassen, 1997, Sun and Stengos, 2006). The semi-parametric approaches do not impose artificial assumptions in specifying the unknown density f, and therefore exhibit advantages over the parametric methods in reducing estimation inaccuracy. Assuming f has unit variance (Engle's parameter specification), Engle and González-Rivera (1991) reported a clear efficiency gain by comparing the Gaussian QMLE to their SMLE with the density f estimated via the discrete maximum penalized likelihood (DMPL) method (Tapia and Thompson, 1978). González-Rivera and Drost (1999) further showed that the efficiency bounds of the considered approaches satisfy V_MLE < V_SMLE < V_QMLE in the matrix sense (see also Hafner and Rombouts, 2007, for similar results in multivariate cases). Following Linton's parameter specification, stronger results can be established. Adaptive SMLEs for identified parameters were constructed under various assumptions on the model structure (e.g. Linton, 1993, Drost and Klaassen, 1997, Yang, 1998, Hafner and Rombouts, 2007). Adaptation is quite generally available under this setup because the re-parametrization creates the necessary orthogonality between the target parameter and the nuisance space (Bickel, 1982). The performances of these adaptive SMLEs are asymptotically equivalent to the MLE, and therefore they generally outperform the QMLEs.
3. A NEW SEMI-PARAMETRIC ESTIMATOR

Adaptive estimators exist under Linton's parameter specification. However, the asymptotics of the SMLEs under Engle's parameter specification have been less thoroughly studied. Therefore in this paper we will focus on the evaluation of the SMLEs under Engle's setup. In this section, we develop a general SMLE, in which the unknown density f is estimated using kernel methods, as compared to the DMPL in Engle and González-Rivera (1991). We discuss the asymptotic properties of this SMLE, the efficiency of such an estimator, and other related factors. Similar approaches, such as those of Steigerwald (1993) and Hafner and Rombouts (2007), will be discussed in Section 3.2.

3.1. Estimation procedure and its asymptotic characteristics

Let θ_n = (α_{0n}, α_{1n}, β_{1n}) be any consistent estimator based on the observations r₁, ..., r_n. Also let σ_t(θ_n) be the estimated volatility at time t, derived iteratively using θ_n and all observations prior to time t. Define the estimated residuals as ε_t(θ_n) = r_t/σ_t(θ_n). Consequently, we define the semi-parametric density function (based on θ_n) as the usual kernel estimate

f̂_n(z) = (1/(n h_n)) Σ_{t=1}^{n} K((z − ε_t(θ_n))/h_n),

where K(·) and h_n represent the kernel and the bandwidth, respectively. Define the corresponding semi-parametric likelihood function at any given θ (dropping the initial component defined for time t ∈ (−∞, 0]) as

L̂_n(θ) = (1/n) Σ_{t=1}^{n} l̂_t(θ) = (1/n) Σ_{t=1}^{n} log[ (1/σ_t(θ)) f̂_n(r_t/σ_t(θ)) ],

and the proposed semi-parametric estimator of θ is given by θ̂_n^SMLE = argmax_{θ∈Θ} L̂_n(θ). The following results confirm the intuition behind the semi-parametric estimation. In particular, Theorem 3.1 shows the consistency of the kernel density estimate f̂_n and its derivatives. Theorems 3.2 and 3.3 show the asymptotic properties of the SMLE.

THEOREM 3.1. In addition to the semi-parametric density estimator f̂_n, for any √n-consistent estimator θ_n of the model parameter, define the estimators of the first and second derivatives as

f̂'_n(z) = (1/(n h_n²)) Σ_{t=1}^{n} K'((z − ε_t(θ_n))/h_n)

and

f̂''_n(z) = (1/(n h_n³)) Σ_{t=1}^{n} K''((z − ε_t(θ_n))/h_n).
Then, under Assumptions B.1–B.3 and h_n → 0, we have, for all z ∈ R,

f̂_n(z) − f(z) →_p 0,  if nh_n² → ∞;
f̂'_n(z) + γ₁ f'(z) →_p 0,  if nh_n⁴ → ∞;
f̂''_n(z) − (1/2) γ₂ f''(z) →_p 0,  if nh_n⁶ → ∞.
Consistency of f̂_n, f̂'_n and f̂''_n follows from implications b1 and b2 of Appendix A. Note that the typical requirements on the bandwidth h_n to ensure the consistency of the kernel estimator of a density function (f) and its derivatives (f' and f'') are nh_n → ∞, nh_n³ → ∞ and nh_n⁵ → ∞, respectively (Härdle, 1990). Therefore the requirements in Theorem 3.1 are stronger. This is because, when the residuals are created using the consistent estimator θ_n, they do not truly represent a sample from f unless θ_n perfectly matches the model parameter θ. In particular, these residuals are dependent. Therefore the kernel bandwidth h_n may need to be enlarged to smooth out the additional variability introduced by this 'not-so-good' sample. However, as illustrated by the following corollary, by choosing a smoother kernel, the typical requirements on h_n may be sufficient.

COROLLARY 3.1. If K is chosen such that ∫ u^i K^[j](u) du = 0 for all positive integers i < j, e.g. the Gaussian kernel, then the convergences established in Theorem 3.1 are true if, respectively, nh_n → ∞, nh_n³ → ∞ and nh_n⁵ → ∞.

THEOREM 3.2. Under Assumptions A.1–A.2 and B.1–B.3, θ̂_n^SMLE is consistent if h_n → 0 and nh_n⁴ → ∞.
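A minimal implementation sketch of the estimation procedure for the GARCH(1,1) case may help fix ideas. The Gaussian kernel and all function names below are our own illustrative choices; the paper's assumptions on K and h_n still apply, and in practice the residuals would come from a pilot estimate θ_n such as the Gaussian QMLE:

```python
import math

def gauss_kernel(u):
    """Standard Gaussian kernel K(u)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def sigma2_path(r, theta):
    """Iterate sigma_t^2 = a0 + a1*r_{t-1}^2 + b1*sigma_{t-1}^2 for GARCH(1,1)."""
    a0, a1, b1 = theta
    s2 = [a0 / (1.0 - a1 - b1)]  # initialize at the unconditional variance
    for t in range(1, len(r)):
        s2.append(a0 + a1 * r[t - 1] ** 2 + b1 * s2[t - 1])
    return s2

def f_hat(z, resid, h):
    """Kernel estimate of the error density at z from pilot residuals."""
    return sum(gauss_kernel((z - e) / h) for e in resid) / (len(resid) * h)

def sp_loglik(theta, r, resid, h):
    """Semi-parametric likelihood (1/n) sum_t log[f_hat(r_t/sigma_t)/sigma_t]."""
    s2 = sigma2_path(r, theta)
    n = len(r)
    return sum(math.log(f_hat(r[t] / math.sqrt(s2[t]), resid, h)
                        / math.sqrt(s2[t]))
               for t in range(n)) / n
```

The SMLE is then the maximizer of sp_loglik over θ, with resid held fixed at the pilot residuals r_t/σ_t(θ_n); any numerical optimizer can be used for the maximization.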
THEOREM 3.3. Consider the random array

Z_{n,t} = ( f̂'_n(ε_t)/f̂_n(ε_t) − f'(ε_t)/f(ε_t) ) · ε_t σ'_t(θ*)/σ_t(θ*),

and define μ_n = E[Z_{n,t}]. Then, under the conditions of Theorem 3.2 and the additional assumption nh_n⁶ → ∞, we have

√n (θ̂_n^SMLE − θ* − B⁻¹μ_n) →_d N(0, Ω),

where Ω = B⁻¹AB⁻¹ with A = E[l_t^[1](θ*) l_t^[1](θ*)'], B = −E[l_t^[2](θ*)] and l_t being the tth log-score of the true likelihood function based on f; i.e. Ω reflects the Cramér–Rao lower bound.

The bias in Theorem 3.3 serves as a correction term that turns the SMLE into an estimator that is asymptotically equivalent to the true MLE. This bias has a natural analogue in partial linear models (Härdle, 1990, Theorem 9.1.1). From a slightly different perspective, one can also view the SMLE as a special case of the QMLE that assumes the unknown density to be f̂_n. Therefore, according to Mukherjee (2006), a bias of (1 − c_{f,f̂_n})θ* is expected, where c_{f,f̂_n} > 0 is the unique constant such that E[q(ε/c_{f,f̂_n}^{1/2})] = 1 with q(x) = −x f̂'_n(x)/f̂_n(x). Here, as f̂_n and f̂'_n converge to f and f', respectively, we know c_{f,f̂_n} → 1 due to its uniqueness. A direct consequence of this fact is that the bias term in Theorem 3.3 vanishes in the limit.
Efficiency of semi-parametric GARCH
3.2. Asymptotic and finite-sample efficiency (adaptiveness) of the SMLE

In the asymptotic sense, given the limiting distribution of the SMLE, it is natural to ask 'how efficient (adaptive) can the SMLE be?'. González-Rivera (1997) showed that, under Engle's specification, SMLEs could be adaptive only if f belonged to certain families of probability density functions (e.g. Gaussian or some bimodal distributions). See also Hafner and Rombouts (2007) for the multivariate generalization of this result. In the case of our SMLE, the loss of adaptiveness is captured by the asymptotic bias B⁻¹μ_n in Theorem 3.3. Specifically, we know from the proof of Theorem 3.3 that

θ̂_n^SMLE − θ* ∼ O_P(n^{−1/2}) + (1/n) Σ_t Z_{n,t} ∼ O_P(n^{−1/2}) + (1/n) Σ_t ( f̂'_n(ε_t)/f̂_n(ε_t) − f'(ε_t)/f(ε_t) ) · ε_t σ'_t(θ*)/σ_t(θ*).

The second term introduces both kernel-related bias and variability into the SMLE. By a Taylor expansion, it can be seen that Z_{n,t}, and therefore (1/n) Σ_{t=1}^{n} Z_{n,t}, preserves the same order of bias as f̂'_n and f̂_n, which is slower than O(n^{−1/2}). On the other hand, the variability of (1/n) Σ_{t=1}^{n} Z_{n,t} is expected to decay faster due to the averaging and the asymptotic independence between Z_{n,t} and Z_{n,t'} for t ≠ t'. More specifically, according to Lemma B.3 we know (1/n) Σ_{t=1}^{n} Z_{n,t} − μ_n = o_P(n^{−1/2}). As a result, there is no choice of h_n that can balance the bias and the variability of the model parameter estimate, i.e. the bias of the SMLE will always dominate its variability in terms of the convergence rate. Thus, it is clear that the optimal asymptotic property of the SMLE is achieved by choosing the smallest possible h_n while ensuring that the asymptotic properties of the SMLE are preserved. One approach to avoiding this slowly decaying bias is to eliminate the effect of extreme residuals by using a trimming function. Such a trimming function assumes a set of boundaries for the density, the score or the residual.
It has been shown that an appropriate trimming approach could produce √n-consistent estimators (Bickel, 1982, Kreiss, 1987, Linton, 1993, Steigerwald, 1993, Andrews, 1994, Hafner and Rombouts, 2007). Among these studies, Steigerwald (1993) and Hafner and Rombouts (2007) considered Engle's setup. However, because of the trimming function, any information carried by the data at the distribution tails will not contribute to the likelihood. This information loss is expected to be larger in the case of leptokurtic data, or of asymmetric data if the trimming is set up in a symmetric manner. As a result, the SMLEs based on a trimmed semi-parametric likelihood often have a larger asymptotic variance–covariance matrix than the MLE, i.e. there is also a bias–variance trade-off that is controlled by the trimming function. On a final note, it can be seen that the bandwidth requirement for the asymptotic normality (Theorem 3.3) of the proposed SMLE is stronger than some of the results in the literature. Hafner and Rombouts (2007), as a typical example, allowed h_n to be selected from almost O(n^{−1/3}) and above. The key differences between the conditions stated in that paper and the conditions derived in our work can be summarized as follows:

REMARK 3.1. In Hafner and Rombouts (2007), the bandwidth is given in combination with the requirements on the trimming function. More specifically, they require c_n h_n → 0 and n h_n³ c_n^{−2} e_n^{−2} → ∞, where c_n and e_n are the trimming boundaries that control the kernel density estimate. This shows that the overall smoothing of the estimation is a combined effect of the bandwidth h_n and the trimming boundaries. For example, if we choose c_n = n^{1/6} and e_n = n^{1/32}, as in the simulation discussed later in the paper, then the requirement on h_n becomes approximately nh_n⁵ → ∞. But if we choose c_n = e_n ∼ log(n), a slowly expanding trimming boundary, the requirement on h_n then becomes almost like nh_n³ → ∞. In terms of
our estimation procedure, because it does not require the trimming of the likelihood, it may need a more restrictive bandwidth h_n to provide enough smoothing.

REMARK 3.2. The proposed semi-parametric estimator in Hafner and Rombouts (2007) does not have an asymptotic variance–covariance matrix that is identical to that of the MLE. With respect to our proposed SMLE, it can be seen that the condition nh_n⁶ → ∞ is required to ensure f̂''_n → f'' (Theorem 3.1) and the consistency of the second derivative of the semi-parametric likelihood, which is the condition necessary to guarantee that our proposed SMLE has the same asymptotic variance–covariance matrix as the MLE (Theorem 3.3 and Lemma B.4). This condition may be relaxed, as in Hafner and Rombouts (2007), at the expense of a larger asymptotic variance–covariance matrix.

3.3. Regression with GARCH errors

The asymptotics established in Theorems 3.2 and 3.3 are for the model (1.1), where r_t is assumed to be the observation. In practice, it is also quite common to assume a GARCH structure for regression residuals. In this case, the model becomes

Y_t = ϕX_t + r_t,  r_t = σ_t ε_t,  σ_t² = α₀ + Σ_{i=1}^{p} α_i r²_{t−i} + Σ_{j=1}^{q} β_j σ²_{t−j},  (3.1)

where {X_t, Y_t} are the observed data. Denote ϑ = (ϕ, θ); then the semi-parametric likelihood function becomes

L̂_n(ϑ) = (1/n) Σ_{t=1}^{n} log[ (1/σ_t(ϑ)) f̂_n((Y_t − ϕX_t)/σ_t(ϑ)) ].

With any initial estimate of ϑ, the two-step estimation procedure can still be applied and a corresponding ϑ̂_n^SMLE can be derived. If the initial estimator is √n-consistent, results similar to Theorems 3.2 and 3.3 would still hold.

COROLLARY 3.2. Under Assumptions A.1–A.2 and B.1–B.3, ϑ̂_n^SMLE is consistent if h_n → 0 and nh_n⁴ → ∞.

COROLLARY 3.3. Consider the random array

Z_{n,t} = ( f̂'_n(ε_t)/f̂_n(ε_t) − f'(ε_t)/f(ε_t) ) · ( ε_t σ'_t(ϑ*)/σ_t(ϑ*) + X_t ),

and define μ_n = E[Z_{n,t}]. Then, under the conditions of Corollary 3.2 and the additional assumption nh_n⁶ → ∞, we have

√n (ϑ̂_n^SMLE − ϑ* − B⁻¹μ_n) →_d N(0, Ω),

where Ω = B⁻¹AB⁻¹ with A = E[l_t^[1](ϑ*) l_t^[1](ϑ*)'], B = −E[l_t^[2](ϑ*)] and l_t being the tth log-score of the true likelihood function based on f; i.e. Ω reflects the Cramér–Rao lower bound.

In view of these results, in the rest of this paper we will only consider the model specified in (1.1).
4. A NUMERICAL EXAMPLE

In this section, we study the performance of the proposed semi-parametric estimator using simulated examples. We first compare the performance of the proposed SMLE with other typical GARCH estimators. In the second part we evaluate the impact of bandwidth selection on the performance of the estimator. The model being considered is a GARCH(1,1) model with true parameter values θ* = (0.2, 0.05, 0.9). Two error distributions are considered. The first one is the standardized F_{10,50}-distribution, which is chosen for its asymmetry. The second one is the standardized Student t_{15}-distribution, which is chosen for its fat tails. For brevity, only the results for α₁ are included in this paper.

4.1. Estimation accuracy of the SMLE

Figure 1 compares the sampling distributions of the candidate estimators of the parameter α₁ (100 replications). All semi-parametric approaches use the Gaussian QMLE as the initial estimate. In our proposed SMLE, f was estimated using the Gaussian kernel. In Engle's SMLE (SMLE.EG), f was estimated using the DMPL method with c = 5 and q = 0.2. In Steigerwald's SMLE (SMLE.ST), f was estimated using the quartic kernel with bandwidth h_n = 0.6 n^{−0.27} and a fixed trimming boundary of 6 for the density estimate (i.e. f̂_n ≤ 6). In Hafner and Rombouts's SMLE (SMLE.HR), f was estimated using the Gaussian kernel with bandwidth h_n = n^{−1/5} and the bounding parameters c_n = n^{1/6}, d_n = n^{−1/2} and e_n = n^{1/32}. More simulation details can be found in Table 1. In both simulation examples (F_{10,50} and t_{15}), the MLE exhibits the smallest bias and variability across the tested sample sizes. The performances of the two parametric approaches (QMLEs) are highly dependent on the 'similarity' between their assumed density and the true density f. For example, in the F_{10,50} case both QMLEs show larger estimation variance compared with the semi-parametric approaches. In particular, the t_{15}-QMLE also shows a clear bias (recall Mukherjee, 2006).
In the t15 case, by contrast, the QMLEs actually outperform all semi-parametric estimators. It thus appears that it is the asymmetry of the true distribution f that reduces the efficiency of the QMLE. This corresponds to the findings of Engle and González-Rivera (1991), who suggested that the gain of the SMLE.EG over the QMLE existed when f was Gamma but not when f was t5. The semi-parametric estimators are more robust to the underlying distribution in the sense that their performance patterns are well retained when f changes from F10,50 to t15.

The two untrimmed approaches, the proposed SMLE and the SMLE.EG, perform similarly in the F10,50 case. A small negative bias is observed in the proposed SMLE. In the t15 case, however, the SMLE.EG exhibits larger variability. This is because the DMPL method (Tapia and Thompson, 1978) used in the SMLE.EG is a discrete penalized density estimation approach that does not perform well when the underlying density has heavy tails.

The two trimmed approaches, the SMLE.ST (Steigerwald, 1993) and the SMLE.HR (Hafner and Rombouts, 2007), are in general outperformed by the proposed SMLE and the SMLE.EG. The SMLE.ST is always associated with large variability, and the SMLE.HR introduces a clear bias. While the trimming technique improves the convergence rate, it is not clear how the symmetric and finite boundaries used in the trimming function affect the estimator's performance at finite sample sizes. It should be noted, however, that in the original publications of the SMLE.ST and the SMLE.HR the boundaries of the trimming functions were not defined in detail (in Hafner and Rombouts, 2007, for example, the boundaries were given only in the sense
J. Di and A. Gangopadhyay
Figure 1. Sampling distribution of the estimates of α1 = 0.05 with data generated from F10,50 or t15 .
of o(·) and O(·)); hence it is likely that our simulation may not reflect the optimal performance of these estimators.

4.2. Impact of bandwidth hn on the SMLE

In Section 3 we discussed the selection of the bandwidth $h_n$ and its impact on the asymptotic behaviour of $\frac{1}{n}\sum_t Z_{n,t}$, which affects the convergence of $\hat\theta_n^{\mathrm{SMLE}}$ via its mean $\mu_n$ and mean
Table 1. Comparison of estimation accuracy of different estimators of α1 with true value α1* = 0.05.

Dist.     Estimator             Sample size:
                               500      2000     5000     8000     10,000
F10,50    MLE            Mean  0.0532   0.0495   0.0504   0.0499   0.0496
                         MSE   0.0217   0.0090   0.0059   0.0045   0.0039
          Gaussian-QMLE  Mean  0.0606   0.0484   0.0508   0.0485   0.0487
                         MSE   0.0484   0.0173   0.0111   0.0076   0.0072
          t15-QMLE       Mean  0.0477   0.0458   0.0473   0.0469   0.0466
                         MSE   0.0358   0.0140   0.0090   0.0060   0.0055
          SMLE-EG        Mean  0.0595   0.0496   0.0500   0.0494   0.0492
                         MSE   0.0455   0.0115   0.0078   0.0051   0.0047
          SMLE-ST        Mean  0.0636   0.0526   0.0562   0.0534   0.0549
                         MSE   0.0480   0.0245   0.0221   0.0223   0.0223
          SMLE-HR        Mean  0.0471   0.0428   0.0462   0.0464   0.0463
                         MSE   0.0197   0.0084   0.0060   0.0046   0.0040
          SMLE           Mean  0.0515   0.0470   0.0480   0.0481   0.0496
                         MSE   0.0284   0.0117   0.0069   0.0049   0.0049
t15       MLE            Mean  0.0567   0.0526   0.0502   0.0494   0.0506
                         MSE   0.0422   0.0138   0.0080   0.0070   0.0067
          Gaussian-QMLE  Mean  0.0590   0.0527   0.0505   0.0493   0.0509
                         MSE   0.0496   0.0139   0.0086   0.0068   0.0068
          SMLE-EG        Mean  0.0621   0.0528   0.0518   0.0497   0.0517
                         MSE   0.0461   0.0156   0.0106   0.0082   0.0078
          SMLE-ST        Mean  0.0544   0.0627   0.0563   0.0543   0.0566
                         MSE   0.0391   0.0539   0.0315   0.0267   0.0288
          SMLE-HR        Mean  0.0488   0.0450   0.0449   0.0461   0.0469
                         MSE   0.0427   0.0131   0.0082   0.0074   0.0067
          SMLE           Mean  0.0553   0.0501   0.0497   0.0483   0.0496
                         MSE   0.0452   0.0131   0.0083   0.0067   0.0068
square error w.r.t. $\mu_n$. In this section, simulations are performed to evaluate these relationships. In particular, we investigate them by choosing a bandwidth ranging from $O(n^{-1/5})$ to $O(n^{-1/9})$. Figure 2 shows the level and the variability of $\frac{1}{n}\sum_t Z_{n,t}$ computed using bandwidth $h_n = n^{-1/\gamma}$ for various choices of $\gamma$. The sample mean and sample standard deviation are based on 100 replications at each sample size; the displayed curves are further smoothed by a moving average to illustrate the rate of convergence, the data are simulated from the standardized F10,50 distribution, and only the α1 component is displayed. The plots clearly show that a wider bandwidth is associated with a larger average (i.e. $\mu_n$) but a smaller variability, reflecting the bias–variance trade-off controlled by the bandwidth $h_n$.
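The bias–variance trade-off controlled by $h_n = n^{-1/\gamma}$ can be reproduced in a stripped-down form. The sketch below is our own toy setup, not the paper's experiment: it estimates a density at a single point from standard normal data, a stand-in for the paper's statistic $\frac{1}{n}\sum_t Z_{n,t}$, which is built from such kernel estimates. A larger γ gives a wider bandwidth, which moves the mean of the estimate further from its target while shrinking its variability.

```python
import numpy as np

rng = np.random.default_rng(1)

def kde_at0(x, h):
    """Gaussian-kernel density estimate of f at z = 0."""
    u = x / h
    return np.mean(np.exp(-0.5 * u * u)) / (h * np.sqrt(2 * np.pi))

n, reps = 2000, 200
results = {}
for gamma in (5, 7, 9):
    h = n ** (-1.0 / gamma)          # larger gamma => wider bandwidth
    est = np.array([kde_at0(rng.standard_normal(n), h) for _ in range(reps)])
    results[gamma] = (est.mean(), est.std())
    print(gamma, round(h, 3), round(est.mean(), 4), round(est.std(), 4))
```

Across the three bandwidths the sample mean drifts away from f(0) ≈ 0.3989 while the sample standard deviation falls, the same trade-off that motivates choosing $h_n$ as small as the asymptotic conditions allow.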
Figure 2. Convergence of $\mu_n = E[\frac{1}{n}\sum_t Z_{n,t}]$ and $\mathrm{Std}[\frac{1}{n}\sum_t Z_{n,t}]$ associated with different bandwidths $h_n = n^{-1/\gamma}$.
5. CONCLUSION

This paper focuses on the SMLE of the GARCH parameters under Engle's parametrization (with known innovation variance). The corresponding SMLE is shown to be consistent and asymptotically Gaussian; it shares the asymptotic variance–covariance matrix of the true MLE but exhibits a bias that converges to zero slowly. This SMLE is a special case of the general QMLE in the sense that both implement maximum likelihood estimation based on an approximated density. They differ in that, in our SMLE, the fixed parametric form of the density assumed in the general QMLE is replaced by a consistent kernel density estimate. As a result, the constant bias of the general QMLE, which measures the 'distance' between the real and the assumed density, is replaced by an asymptotically vanishing bias that reflects the consistency of the estimated density used in the SMLE.

The finite-sample behaviour of the term that creates the slowly converging bias is also studied. It has been shown that a bias–variance trade-off exists and is controlled by the kernel bandwidth: a larger bandwidth increases this term's mean but decreases its variability. It is further shown that this term's mean always dominates its variability in terms of the convergence rate; the kernel bandwidth should therefore be selected at the smallest possible level to minimize the bias, while ensuring that the desired asymptotic properties of the SMLE still hold.

This slowly decaying bias term can be eliminated by trimming the estimated semi-parametric likelihood, a method that has been used in several studies. However, trimming also has several drawbacks. First, trimming the likelihood enlarges the asymptotic variance–covariance matrix. In addition, the selection of the trimming boundary requires extensive investigation.
To the best of our knowledge, there are no universally accepted rules for defining the trimming function, and the performance of an estimator may vary dramatically if the trimming function is defined differently.
In general, semi-parametric estimation approaches do not pre-specify the likelihood function and are therefore more robust than traditional quasi-likelihood methods. One problem with the SMLEs discussed in this paper is that they are generally second-stage estimators: they modify or update an initial parameter estimate. As a result, the properties of the initial estimate may be carried over to the final SMLE. This fact largely restricts the use of the approach when a 'good' initial estimator is difficult to find. To avoid this, Di (2007) and Di and Gangopadhyay (2010) study another type of semi-parametric estimation procedure in which an initial consistent estimator is no longer required. The results of this study will be published elsewhere.
ACKNOWLEDGMENTS The authors gratefully acknowledge Douglas Steigerwald, Oliver Linton and the two anonymous referees for their comments that led to a significant improvement of the paper.
REFERENCES

Andrews, D. (1994). Asymptotics for semiparametric econometric models via stochastic equicontinuity. Econometrica 62, 43–72.
Berkes, I. and L. Horváth (2003). The rate of consistency of the quasi maximum likelihood estimator. Statistics and Probability Letters 61, 133–43.
Berkes, I. and L. Horváth (2004). The efficiency of the estimators of the parameters in GARCH processes. Annals of Statistics 32, 633–55.
Bickel, P. (1982). On adaptive estimation. Annals of Statistics 10, 647–71.
Bollerslev, T. (1986). Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics 31, 307–27.
Bollerslev, T. and J. Wooldridge (1992). Quasi-maximum likelihood estimation and inference in dynamic models with time-varying covariances. Econometric Reviews 11, 143–72.
Bougerol, P. and N. Picard (1992). Stationarity of GARCH processes and of some nonnegative time series. Journal of Econometrics 52, 115–27.
Di, J. (2007). New developments in time series analysis (Part I: on the construction and estimation of GARCH models). Unpublished Ph.D. thesis, Department of Mathematics and Statistics, Boston University.
Di, J. and A. Gangopadhyay (2010). One-step semiparametric estimation of the GARCH model. Working paper, Department of Mathematics and Statistics, Boston University.
Drost, F. and C. Klaassen (1997). Efficient estimation in semi-parametric GARCH models. Journal of Econometrics 81, 193–221.
Engle, R. (1982). Autoregressive conditional heteroskedasticity with estimates of the variance of United Kingdom inflation. Econometrica 50, 987–1007.
Engle, R. and G. González-Rivera (1991). Semiparametric ARCH models. Journal of Business and Economic Statistics 9, 345–59.
Engle, R., D. Lilien and R. Robins (1987). Estimating time varying risk premia in the term structure: the ARCH-M model. Econometrica 55, 391–407.
Francq, C. and J. Zakoïan (2004). Maximum likelihood estimation of pure GARCH and ARMA-GARCH processes. Bernoulli 10, 605–37.
González-Rivera, G. (1997). A note on adaption in GARCH models. Econometric Reviews 16, 55–68.
González-Rivera, G. and F. Drost (1999). Efficiency comparisons of maximum-likelihood-based estimators in GARCH models. Journal of Econometrics 93, 93–111.
Hafner, C. and J. Rombouts (2007). Semiparametric multivariate volatility models. Econometric Theory 23, 251–80.
Hall, P. and Q. Yao (2003). Inference in ARCH and GARCH models with heavy-tailed errors. Econometrica 71, 285–317.
Härdle, W. (1990). Applied Nonparametric Regression. Cambridge: Cambridge University Press.
Jacquier, E., N. G. Polson and P. Rossi (1994). Bayesian analysis of stochastic volatility models. Journal of Business and Economic Statistics 12, 371–417.
Kreiss, J. (1987). On adaptive estimation in autoregression processes when there are nuisance functions. Statistics and Decisions 5, 59–76.
Le Cam, L. (1969). Théorie asymptotique de la décision statistique. Montreal: Les Presses de l'Université de Montréal.
Linton, O. (1993). Adaptive estimation in ARCH models. Econometric Theory 9, 539–69.
McLeish, D. (1974). Dependent central limit theorems and invariance principles. Annals of Probability 2, 620–28.
Mikosch, T. and D. Straumann (2002). Whittle estimation in a heavy-tailed GARCH(1,1) model. Stochastic Processes and their Applications 100, 187–222.
Mukherjee, K. (2006). Pseudo-likelihood estimation in ARCH models. Canadian Journal of Statistics 34, 143–72.
Nelson, D. (1991). Conditional heteroskedasticity in asset returns: a new approach. Econometrica 59, 347–70.
Newey, W. and D. Steigerwald (1997). Asymptotic bias for quasi-maximum-likelihood estimators in conditional heteroskedasticity models. Econometrica 65, 587–99.
Redner, R. (1981). Note on the consistency of the maximum likelihood estimate for nonidentifiable distributions. Annals of Statistics 9, 225–28.
Steigerwald, D. (1993). Efficient estimation of models with conditional heteroscedasticity. Working Paper in Economics #5-93, Department of Economics, University of California at Santa Barbara.
Sun, Y. and T. Stengos (2006). Semi-parametric efficient adaptive estimation of asymmetric GARCH models. Journal of Econometrics 133, 373–86.
Tapia, R. and J. Thompson (1978). Nonparametric Probability Density Estimation. Baltimore, MD: Johns Hopkins University Press.
Weiss, A. (1986). Asymptotic theory for ARCH models: estimation and testing. Econometric Theory 2, 107–31.
Yang, J. (1998). Semiparametric maximum likelihood estimators of GARCH models. Working paper, Department of Economics, University of Western Ontario.
APPENDIX A: ASSUMPTIONS

Assumptions about the distribution f:

ASSUMPTION A.1. f has uniformly bounded, continuous and differentiable derivatives up to order 2.

ASSUMPTION A.2. f and f' reach 0 at the support boundary.

These assumptions have the following implications:

a1: $\lambda_1 = \int \epsilon f'(\epsilon)\,d\epsilon = -1$.
a2: $\lambda_2 = \int \epsilon^2 f''(\epsilon)\,d\epsilon = 2$.

Assumptions about the kernel K:

ASSUMPTION B.1. K is a probability density, with continuous and differentiable derivatives of every order.

ASSUMPTION B.2. K and K' reach 0 at the support boundary.

ASSUMPTION B.3. $\int |u^{i} K^{[j]}(u)|\,du < \infty$ for all i and j.

These assumptions have the following implications:

b1: $\gamma_1 = \int u K'(u)\,du = -1$.
b2: $\gamma_2 = \int u^2 K''(u)\,du = 2$.
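Implications a1 and a2 follow from integration by parts under A.2, e.g. $\int \epsilon f'(\epsilon)\,d\epsilon = [\epsilon f(\epsilon)] - \int f(\epsilon)\,d\epsilon = -1$ (and b1, b2 follow in the same way for K). As a quick numerical sanity check, one can verify a1 and a2 for a concrete density; below f is taken to be the standard normal, which is our illustrative choice and not required by the paper:

```python
import numpy as np
from scipy.integrate import quad

# Standard normal density and its first two derivatives.
phi   = lambda e: np.exp(-0.5 * e * e) / np.sqrt(2 * np.pi)
dphi  = lambda e: -e * phi(e)
d2phi = lambda e: (e * e - 1) * phi(e)

lam1, _ = quad(lambda e: e * dphi(e), -np.inf, np.inf)       # lambda_1
lam2, _ = quad(lambda e: e * e * d2phi(e), -np.inf, np.inf)  # lambda_2
print(round(lam1, 6), round(lam2, 6))   # -1.0 2.0
```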
APPENDIX B: PROOFS OF RESULTS

In our notation, the square of a matrix always denotes the inner product of the matrix with itself, i.e. $A^2 = A^\top A$. In addition, we continue to use the notation $\epsilon_t(\theta) = r_t/\sigma_t(\theta)$ as in Section 3.1, where $\sigma_t(\theta)$ is the model volatility derived iteratively using the parameter value $\theta$; thus $\epsilon_t(\theta^*) = \epsilon_t$ is the true underlying innovation sequence.

Proof of Theorem 3.1: For $m = 0, 1, 2$, recall
$$\hat f_n^{[m]}(z) = \frac{1}{n h_n^{m+1}} \sum_{t=1}^{n} K^{[m]}\!\left(\frac{z - \epsilon_t(\theta_n)}{h_n}\right),$$
and consider the kernel estimator
$$\tilde f_n^{[m]}(z) = \frac{1}{n h_n^{m+1}} \sum_{t=1}^{n} K^{[m]}\!\left(\frac{z - \epsilon_t(\theta^*)}{h_n}\right).$$
By the consistency of $\tilde f_n^{[m]}(z)$, it suffices to show, for $m = 0, 1, 2$,
$$\hat f_n^{[m]}(z) - \tilde f_n^{[m]}(z) \overset{p}{\to} 0.$$
To see this, note
$$\hat f_n^{[m]}(z) - \tilde f_n^{[m]}(z) = \frac{1}{n h_n^{m+1}} \sum_{t=1}^{n} \left[ K^{[m]}\!\left(\frac{z - \epsilon_t(\theta_n)}{h_n}\right) - K^{[m]}\!\left(\frac{z - \epsilon_t(\theta^*)}{h_n}\right) \right]$$
$$= \frac{1}{n h_n^{m+1}} \sum_{t=1}^{n} \sum_{j=1}^{\infty} T_j\, K^{[m+j]}\!\left(\frac{z - \epsilon_t(\theta^*)}{h_n}\right) \left(\frac{\epsilon_t(\theta_n) - \epsilon_t(\theta^*)}{h_n}\right)^{j}$$
$$= \sum_{j=1}^{\infty} T_j \, \frac{1}{n h_n^{m+1}} \sum_{t=1}^{n} K^{[m+j]}\!\left(\frac{z - \epsilon_t(\theta^*)}{h_n}\right) \left(\frac{\epsilon_t(\theta_n) - \epsilon_t(\theta^*)}{h_n}\right)^{j} = \sum_{j=1}^{\infty} T_j M_{m,n,j},$$
where $T_j = 1/j!$ represents the Taylor coefficient. Therefore, to show the consistency it suffices to show that $T_j M_{m,n,j}$, $m = 0, 1, 2$, converge uniformly w.r.t. $j$, so that $\lim_{n\to\infty}$ can be interchanged with $\sum_{j=1}^{\infty}$. To see this, note
$$T_j M_{m,n,j} = T_j \, \frac{1}{n h_n^{m+1}} \sum_{t=1}^{n} K^{[m+j]}\!\left(\frac{z - \epsilon_t(\theta^*)}{h_n}\right) \left(\frac{\epsilon_t(\theta_n) - \epsilon_t(\theta^*)}{h_n}\right)^{j}$$
$$= \frac{1}{n h_n^{m+j+1}} \sum_{t=1}^{n} K^{[m+j]}\!\left(\frac{z - \epsilon_t(\theta^*)}{h_n}\right) T_j \left[ r_t \left( \frac{1}{\sigma_t(\theta_n)} - \frac{1}{\sigma_t(\theta^*)} \right) \right]^{j}$$
$$\sim \frac{1}{n h_n^{m+j+1}} \sum_{t=1}^{n} K^{[m+j]}\!\left(\frac{z - \epsilon_t(\theta^*)}{h_n}\right) T_j \, r_t^{j} (\theta_n - \theta^*)^{j} \left(\frac{\sigma_t'(\theta^*)}{\sigma_t^2(\theta^*)}\right)^{j}$$
$$= \frac{(\theta_n - \theta^*)^{j}}{n h_n^{m+j+1}} \sum_{t=1}^{n} K^{[m+j]}\!\left(\frac{z - \epsilon_t(\theta^*)}{h_n}\right) T_j \, \epsilon_t(\theta^*)^{j} \left(\frac{\sigma_t'(\theta^*)}{\sigma_t(\theta^*)}\right)^{j}.$$
Note that $\epsilon_t(\theta^*)$ represents the real innovation; then, due to the independence between $K^{[m+j]}\big(\frac{z - \epsilon_t(\theta^*)}{h_n}\big)\epsilon_t(\theta^*)^{j}$ and $\big(\frac{\sigma_t'(\theta^*)}{\sigma_t(\theta^*)}\big)^{j}$, we know
$$T_j M_{m,n,j} \sim \frac{(\theta_n - \theta^*)^{j}}{h_n^{m+j+1}} \cdot E\!\left[ K^{[m+j]}\!\left(\frac{z - \epsilon_t(\theta^*)}{h_n}\right) \epsilon_t(\theta^*)^{j} \right] \cdot E\!\left[ T_j \left(\frac{\sigma_t'(\theta^*)}{\sigma_t(\theta^*)}\right)^{j} \right].$$
It is easy to see that $\sigma_t'(\theta^*)/\sigma_t(\theta^*)$ is bounded, and therefore $E[T_j (\sigma_t'(\theta^*)/\sigma_t(\theta^*))^{j}] = \frac{1}{j!} E[(\sigma_t'(\theta^*)/\sigma_t(\theta^*))^{j}]$ is bounded for all $j$. In addition, the usual change of variable suggests that
$$E\!\left[ K^{[m+j]}\!\left(\frac{z - \epsilon_t(\theta^*)}{h_n}\right) \epsilon_t(\theta^*)^{j} \right] = h_n \int (z - u h_n)^{j} f(z - u h_n)\, K^{[m+j]}(u)\,du$$
$$= \sum_{\ell=0}^{\infty} O\big(h_n^{\ell+1}\big) \int u^{\ell} K^{[m+j]}(u)\,du \quad \text{(Taylor expansion of } f\text{)}$$
$$= O(h_n),$$
as long as $\int K^{[m+j]}(u)\,du \neq 0$ for all $m = 0, 1, 2$ and some $j \ge 1$. In this case, we have
$$T_j M_{m,n,j} \sim \frac{1}{h_n^{m+j}} (\theta_n - \theta^*)^{j}.$$
Since $\theta_n$ is a $\sqrt{n}$-consistent estimator of $\theta^*$, we know, in probability, that
$$T_j M_{0,n,j} \sim \frac{1}{n^{j/2} h_n^{j}} \to 0 \quad \text{if } n h_n^{2} \to \infty,$$
$$T_j M_{1,n,j} \sim \frac{1}{n^{j/2} h_n^{j+1}} \to 0 \quad \text{if } n h_n^{2+2/j^{*}} \to \infty, \qquad j^{*} = \min\left\{ j \ge 1 : \int K^{[j+1]}(u)\,du \neq 0 \right\},$$
$$T_j M_{2,n,j} \sim \frac{1}{n^{j/2} h_n^{j+2}} \to 0 \quad \text{if } n h_n^{2+4/j^{**}} \to \infty, \qquad j^{**} = \min\left\{ j \ge 1 : \int K^{[j+2]}(u)\,du \neq 0 \right\}.$$
Therefore, by taking $j^{*} = j^{**} = 1$, we know $\hat f_n(z) - \tilde f_n(z) \overset{p}{\to} 0$, $\hat f_n^{[1]}(z) - \tilde f_n^{[1]}(z) \overset{p}{\to} 0$ and $\hat f_n^{[2]}(z) - \tilde f_n^{[2]}(z) \overset{p}{\to} 0$ if, respectively, $n h_n^{2} \to \infty$, $n h_n^{4} \to \infty$ and $n h_n^{6} \to \infty$; these convergences at $j^{*} = j^{**} = 1$ are uniform w.r.t. $j$. Finally, given the consistency of $\tilde f_n$, $\tilde f_n^{[1]}$ and $\tilde f_n^{[2]}$ to $f$, $-\gamma_1 f'$ and $\frac{1}{2}\gamma_2 f''$ under the conditions $n h_n \to \infty$, $n h_n^{3} \to \infty$ and $n h_n^{5} \to \infty$ (Härdle, 1990), the result is proved. □
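The kernel-derivative limits invoked in this final step can be checked numerically. For the Gaussian kernel, $\gamma_1 = \int u K'(u)\,du = -1$, so the estimator $\tilde f_n^{[1]}(z) = (n h_n^2)^{-1} \sum_t K'((z - \epsilon_t)/h_n)$ should approach $-\gamma_1 f'(z) = f'(z)$. Below is our own quick check at $z = 1$ with standard normal data (an illustration, not part of the paper):

```python
import numpy as np

rng = np.random.default_rng(2)

def f1_tilde(z, x, h):
    """(n h^2)^{-1} sum_t K'((z - x_t)/h) with the Gaussian kernel K."""
    u = (z - x) / h
    kp = -u * np.exp(-0.5 * u * u) / np.sqrt(2 * np.pi)   # K'(u)
    return kp.mean() / h ** 2

n = 200_000
x = rng.standard_normal(n)
h = n ** (-1.0 / 7)                        # satisfies n h^3 -> infinity
est = f1_tilde(1.0, x, h)
true = -np.exp(-0.5) / np.sqrt(2 * np.pi)  # f'(1) = -phi(1) for the standard normal
print(est, true)
```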
REMARK B.1. Note that when $\theta_n$ has a slower rate of convergence, the bandwidth $h_n$ needs to be enlarged to ensure the convergence. For example, if $\theta_n - \theta^* = o_P(n^{-1/4})$, then we have $M_{0,n,j} \sim \frac{1}{n^{j/4} h_n^{j}}$, $M_{1,n,j} \sim \frac{1}{n^{j/4} h_n^{j+1}}$ and $M_{2,n,j} \sim \frac{1}{n^{j/4} h_n^{j+2}}$. Thus, by taking $j = 1$, the sufficient conditions for the convergence become $n h_n^{4} \to \infty$, $n h_n^{8} \to \infty$ and $n h_n^{12} \to \infty$.

Proof of Corollary 3.1: Recall that in the proof of Theorem 3.1 we have
$$E\!\left[ K^{[m+j]}\!\left(\frac{z - \epsilon_t(\theta^*)}{h_n}\right) \epsilon_t(\theta^*)^{j} \right] = \sum_{\ell=0}^{\infty} O\big(h_n^{\ell+1}\big) \int u^{\ell} K^{[m+j]}(u)\,du.$$
By the corollary assumption we have $\int u^{\ell} K^{[m+j]}(u)\,du = 0$ for all $\ell \le m + j - 1$. Therefore
$$E\!\left[ K^{[m+j]}\!\left(\frac{z - \epsilon_t(\theta^*)}{h_n}\right) \epsilon_t(\theta^*)^{j} \right] = O\big(h_n^{m+j+1}\big).$$
As a result, we know
$$M_{m,n,j} \sim (\theta_n - \theta^*)^{j} \overset{p}{\to} 0,$$
regardless of $h_n$. Thus the only requirement on $h_n$ is imposed by the convergence of $\tilde f_n$, $\tilde f_n^{[1]}$ and $\tilde f_n^{[2]}$. □
Proof of Theorem 3.2: Based on a Taylor expansion of $\hat L_n^{[1]}(\hat\theta_n^{\mathrm{SMLE}}) = 0$ at $\theta^*$, we know
$$\hat\theta_n^{\mathrm{SMLE}} - \theta^* = -\big[\hat L_n^{[2]}(\theta_n^*)\big]^{-1} \hat L_n^{[1]}(\theta^*)$$
for some $\theta_n^* \in B(\theta^*, \|\hat\theta_n^{\mathrm{SMLE}} - \theta^*\|)$. Thus it is sufficient to show that $\sup_{\theta^*} \hat L_n^{[1]}(\theta^*) \overset{p}{\to} 0$ and $\liminf_{\theta \in B(\theta^*, r)} |\det(\hat L_n^{[2]}(\theta))| > 0$. The first limit is given by Lemma B.2 (see below) and implications a1 and b1. The second result can easily be seen as follows. For any fixed $\theta_0 \in B(\theta^*, r)$, $\epsilon_t(\theta_0) = r_t/\sigma_t(\theta_0)$ represents a sample that follows a distribution $h$ different from $f$. Consider a new GARCH model with true parameter $\theta_0$ and new innovations $\epsilon_t \sim h$. In this case the likelihood $\hat L_n$ based on the density $\hat f_n$ represents a pseudo likelihood function, and therefore, based on the asymptotic distribution of the QMLE, we know that $\hat L_n^{[2]}(\theta_0)$ converges to a negative-definite matrix. □
LEMMA B.1. Consider a random array sequence $Z_{n,t}$ defined on the triangular filtration $\{\mathcal F_{n,t}\}$. Assume that the $Z_{n,t}$ are identically distributed for fixed $n$ and that $Z_{n,t} \overset{p}{\to} 0$ as $n \to \infty$. Define $\mu_n = E[Z_{n,t}]$; then we have $\frac{1}{\sqrt n} \sum_{t=1}^{n} (Z_{n,t} - \mu_n) \overset{p}{\to} 0$.

Proof: Define $\mu_{n,t} = E[Z_{n,t} \mid \mathcal F_{n,t-1}]$. By the tower property of the conditional expectation, it suffices to show that $\frac{1}{\sqrt n} \sum_{t=1}^{n} (Z_{n,t} - \mu_{n,t}) \overset{p}{\to} 0$. Now let $U_{n,t} = Z_{n,t} - \mu_{n,t}$. It is easy to see that $U_{n,t}$ is a martingale difference array w.r.t. $\mathcal F_{n,t}$ (McLeish, 1974), with $E[U_{n,t}] = 0$ and, by the lemma assumption, $U_{n,t} = o_P(1)$. Finally, the result is proved by verifying the conditions of McLeish (1974, Theorem 2.3):

(a) $\max_{t \le n} |\frac{1}{\sqrt n} U_{n,t}|$ is uniformly bounded in $L^2$ norm. This is true since
$$E\!\left[ \max_{t \le n} \left| \frac{1}{\sqrt n} U_{n,t} \right|^2 \right] = \frac{1}{n}\, E\!\left[ \max_{t \le n} |U_{n,t}|^2 \right] \le \frac{1}{n} \sum_{t=1}^{n} E\big[U_{n,t}^2\big] = o(1).$$

(b) $\max_{t \le n} |\frac{1}{\sqrt n} U_{n,t}| \overset{p}{\to} 0$. This is true since, for all fixed $\varepsilon > 0$, we have by Chebyshev's inequality
$$P\!\left( \max_{t \le n} \left| \frac{1}{\sqrt n} U_{n,t} \right| > \varepsilon \right) \le \sum_{t=1}^{n} P\big(|U_{n,t}| > \sqrt n\, \varepsilon\big) \le \frac{1}{n \varepsilon^2} \sum_{t=1}^{n} \mathrm{Var}[U_{n,t}] = o(1).$$

(c) $\sum_{t=1}^{n} \big(\frac{1}{\sqrt n} U_{n,t}\big)^2 \overset{p}{\to} c < \infty$. In fact, in this situation we have $c = 0$ because, for all fixed $\varepsilon > 0$, we have by Markov's inequality
$$P\!\left( \sum_{t=1}^{n} \left( \frac{1}{\sqrt n} U_{n,t} \right)^2 > \varepsilon \right) \le \frac{1}{\varepsilon} \sum_{t=1}^{n} E\!\left[ \left( \frac{1}{\sqrt n} U_{n,t} \right)^2 \right] = o(1). \qquad \square$$

LEMMA B.2. Assume $h_n \to 0$ and $n h_n^4 \to \infty$. Then, under Assumptions A.1–A.2 and B.1–B.3, we have $\hat L_n^{[1]}(\theta^*) \overset{p}{\to} 0$.
Proof: First, note that
$$\hat L_n^{[1]}(\theta^*) = -\frac{1}{n} \sum_{t=1}^{n} \frac{\sigma_t'(\theta^*)}{\sigma_t(\theta^*)} - \frac{1}{n} \sum_{t=1}^{n} \frac{\hat f_n^{[1]}(\epsilon_t(\theta^*))}{\hat f_n(\epsilon_t(\theta^*))}\, r_t\, \frac{\sigma_t'(\theta^*)}{\sigma_t^2(\theta^*)}$$
$$= -\frac{1}{n} \sum_{t=1}^{n} \frac{\sigma_t'(\theta^*)}{\sigma_t(\theta^*)} - \frac{1}{n} \sum_{t=1}^{n} \frac{\hat f_n^{[1]}(\epsilon_t)}{\hat f_n(\epsilon_t)}\, \epsilon_t\, \frac{\sigma_t'(\theta^*)}{\sigma_t(\theta^*)} \qquad (\text{denote } \epsilon_t = \epsilon_t(\theta^*))$$
$$= -\frac{1}{n} \sum_{t=1}^{n} \frac{\sigma_t'(\theta^*)}{\sigma_t(\theta^*)} + \gamma_1 \frac{1}{n} \sum_{t=1}^{n} \frac{f'(\epsilon_t)}{f(\epsilon_t)}\, \epsilon_t\, \frac{\sigma_t'(\theta^*)}{\sigma_t(\theta^*)} - \frac{1}{n} \sum_{t=1}^{n} \left( \frac{\hat f_n^{[1]}(\epsilon_t)}{\hat f_n(\epsilon_t)} + \gamma_1 \frac{f'(\epsilon_t)}{f(\epsilon_t)} \right) \epsilon_t\, \frac{\sigma_t'(\theta^*)}{\sigma_t(\theta^*)}$$
$$= R_n + R_n' + R_n''.$$
Define $M_1 = E[\sigma_t'(\theta^*)/\sigma_t(\theta^*)]$. It is easy to see that $R_n \overset{p}{\to} -M_1$. Additionally, since $\frac{f'(\epsilon_t)}{f(\epsilon_t)}\epsilon_t$ and $\frac{\sigma_t'(\theta^*)}{\sigma_t(\theta^*)}$ are independent, we know $R_n' \overset{p}{\to} \lambda_1 \gamma_1 M_1$. Therefore it suffices to show that $R_n'' \overset{p}{\to} 0$. To see this, let
$$Z_{n,t} = \left( \frac{\hat f_n^{[1]}(\epsilon_t)}{\hat f_n(\epsilon_t)} + \gamma_1 \frac{f'(\epsilon_t)}{f(\epsilon_t)} \right) \epsilon_t\, \frac{\sigma_t'(\theta^*)}{\sigma_t(\theta^*)}.$$
Thus $Z_{n,t}$ is a random array defined on the triangular filtration generated by $\{\epsilon_1(\theta_n), \ldots, \epsilon_n(\theta_n)\} \otimes \{\epsilon_1(\theta^*), \ldots, \epsilon_{t-1}(\theta^*)\}$. In addition, by Theorem 3.1 we have $Z_{n,t} \overset{p}{\to} 0$ (recall that $\hat f_n^{[1]} \to -\gamma_1 f'$). Thus we know by Lemma B.1 that
$$R_n'' = -\frac{1}{n} \sum_{t=1}^{n} Z_{n,t} = -\frac{1}{\sqrt n} \cdot \frac{1}{\sqrt n} \sum_{t=1}^{n} (Z_{n,t} - \mu_n) - \mu_n \overset{p}{\to} 0.$$
Finally, we have $\hat L_n^{[1]}(\theta^*) \overset{p}{\to} (\lambda_1 \gamma_1 - 1) M_1 = 0$ under implications a1 and b1. □
Proof of Theorem 3.3: As pointed out in the proof of Theorem 3.2, by Taylor's expansion we have $\hat\theta_n^{\mathrm{SMLE}} - \theta^* = [-\hat L_n^{[2]}(\theta_n^*)]^{-1} \hat L_n^{[1]}(\theta^*)$. Therefore Theorem 3.3 is proved if we have $\sqrt n\,(\hat L_n^{[1]}(\theta^*) - \mu_n) \overset{d}{\to} N(0, A)$ and $-\hat L_n^{[2]}(\theta^*) \overset{p}{\to} B$. These two results are given, respectively, by Lemmas B.3 and B.4.

LEMMA B.3. Under Assumptions A.1–A.2 and B.1–B.3, we have $\sqrt n\,(\hat L_n^{[1]}(\theta^*) - \mu_n) \overset{d}{\to} N(0, A)$ if $n h_n^4 \to \infty$.

Proof: Denote
$$X_{n,t} = \frac{\sigma_t'(\theta^*)}{\sigma_t(\theta^*)} \left( 1 + \epsilon_t \frac{\hat f_n^{[1]}(\epsilon_t)}{\hat f_n(\epsilon_t)} \right), \qquad X_t = \frac{\sigma_t'(\theta^*)}{\sigma_t(\theta^*)} \left( 1 + \epsilon_t \frac{f'(\epsilon_t)}{f(\epsilon_t)} \right).$$
In addition, define
$$Z_{n,t} = \left( \frac{\hat f_n^{[1]}(\epsilon_t)}{\hat f_n(\epsilon_t)} - \frac{f'(\epsilon_t)}{f(\epsilon_t)} \right) \epsilon_t\, \frac{\sigma_t'(\theta^*)}{\sigma_t(\theta^*)}$$
and let $E[Z_{n,t}] = \mu_n$. Then we have
$$-\sqrt n\, \hat L_n^{[1]}(\theta^*) = \frac{1}{\sqrt n} \sum_{t=1}^{n} X_{n,t} = \frac{1}{\sqrt n} \sum_{t=1}^{n} X_t + \frac{1}{\sqrt n} \sum_{t=1}^{n} (X_{n,t} - X_t)$$
$$= \frac{1}{\sqrt n} \sum_{t=1}^{n} X_t + \frac{1}{\sqrt n} \sum_{t=1}^{n} \epsilon_t \left( \frac{\hat f_n^{[1]}(\epsilon_t)}{\hat f_n(\epsilon_t)} - \frac{f'(\epsilon_t)}{f(\epsilon_t)} \right) \frac{\sigma_t'(\theta^*)}{\sigma_t(\theta^*)} = \frac{1}{\sqrt n} \sum_{t=1}^{n} X_t + \frac{1}{\sqrt n} \sum_{t=1}^{n} (Z_{n,t} - \mu_n) + \sqrt n\, \mu_n.$$
By Lemma B.1 we know
$$\frac{1}{\sqrt n} \sum_{t=1}^{n} (Z_{n,t} - \mu_n) \overset{p}{\to} 0.$$
Also note that $\frac{1}{\sqrt n} \sum_{t=1}^{n} X_t$ is the term involved in deriving the asymptotic distribution of the true MLE; therefore we know
$$\frac{1}{\sqrt n} \sum_{t=1}^{n} X_t \overset{d}{\to} N(0, A),$$
where $A$ is defined in Theorem 3.3. This proves the result. □
LEMMA B.4. Define $M_2 = E\big[(\sigma_t'(\theta^*)/\sigma_t(\theta^*))^2\big]$ and $\tau_0 = E\big[(\epsilon_t f'(\epsilon_t)/f(\epsilon_t))^2\big]$, the squares being understood in the inner-product sense. Then, under Assumptions A.1–A.2 and B.1–B.3, we have $\hat L_n^{[2]}(\theta^*) \overset{p}{\to} (1 - \tau_0) M_2$, which is also the probability limit of $L_n^{[2]}(\theta^*)$, where $L_n$ is the real likelihood function.
Proof: By Theorem 3.1, we have
$$\hat L_n^{[2]}(\theta^*) = -\frac{1}{n} \sum_{t=1}^{n} \frac{\sigma_t''(\theta^*)}{\sigma_t(\theta^*)} + \frac{1}{n} \sum_{t=1}^{n} \left( \frac{\sigma_t'(\theta^*)}{\sigma_t(\theta^*)} \right)^2 + \frac{1}{n} \sum_{t=1}^{n} \frac{\hat f_n^{[2]}(\epsilon_t(\theta^*))}{\hat f_n(\epsilon_t(\theta^*))} - \frac{1}{n} \sum_{t=1}^{n} \left( \frac{\hat f_n^{[1]}(\epsilon_t(\theta^*))}{\hat f_n(\epsilon_t(\theta^*))} \right)^2$$
$$= -\frac{1}{n} \sum_{t=1}^{n} \frac{\sigma_t''(\theta^*)}{\sigma_t(\theta^*)} + \frac{1}{n} \sum_{t=1}^{n} \left( \frac{\sigma_t'(\theta^*)}{\sigma_t(\theta^*)} \right)^2 + \frac{\gamma_2}{2} \frac{1}{n} \sum_{t=1}^{n} \frac{f^{[2]}(\epsilon_t(\theta^*))}{f(\epsilon_t(\theta^*))} - \gamma_1^2 \frac{1}{n} \sum_{t=1}^{n} \left( \frac{f^{[1]}(\epsilon_t(\theta^*))}{f(\epsilon_t(\theta^*))} \right)^2$$
$$\quad + \left[ \frac{1}{n} \sum_{t=1}^{n} \frac{\hat f_n^{[2]}(\epsilon_t(\theta^*))}{\hat f_n(\epsilon_t(\theta^*))} - \frac{\gamma_2}{2} \frac{1}{n} \sum_{t=1}^{n} \frac{f^{[2]}(\epsilon_t(\theta^*))}{f(\epsilon_t(\theta^*))} \right] - \left[ \frac{1}{n} \sum_{t=1}^{n} \left( \frac{\hat f_n^{[1]}(\epsilon_t(\theta^*))}{\hat f_n(\epsilon_t(\theta^*))} \right)^2 - \gamma_1^2 \frac{1}{n} \sum_{t=1}^{n} \left( \frac{f^{[1]}(\epsilon_t(\theta^*))}{f(\epsilon_t(\theta^*))} \right)^2 \right]$$
$$= -\frac{1}{n} \sum_{t=1}^{n} \frac{\sigma_t''(\theta^*)}{\sigma_t(\theta^*)} + \frac{1}{n} \sum_{t=1}^{n} \left( \frac{\sigma_t'(\theta^*)}{\sigma_t(\theta^*)} \right)^2 + R_n - R_n' + T_n - T_n',$$
where $f^{[m]}(\epsilon_t(\theta))$ denotes the $m$th derivative of $f(\epsilon_t(\theta))$ w.r.t. $\theta$, and we let
$$R_n = \frac{\gamma_2}{2} \frac{1}{n} \sum_{t=1}^{n} \frac{f^{[2]}(\epsilon_t(\theta^*))}{f(\epsilon_t(\theta^*))}, \qquad R_n' = \gamma_1^2 \frac{1}{n} \sum_{t=1}^{n} \left( \frac{f^{[1]}(\epsilon_t(\theta^*))}{f(\epsilon_t(\theta^*))} \right)^2,$$
$$T_n = \frac{1}{n} \sum_{t=1}^{n} \frac{\hat f_n^{[2]}(\epsilon_t(\theta^*))}{\hat f_n(\epsilon_t(\theta^*))} - \frac{\gamma_2}{2} \frac{1}{n} \sum_{t=1}^{n} \frac{f^{[2]}(\epsilon_t(\theta^*))}{f(\epsilon_t(\theta^*))}, \qquad T_n' = \frac{1}{n} \sum_{t=1}^{n} \left( \frac{\hat f_n^{[1]}(\epsilon_t(\theta^*))}{\hat f_n(\epsilon_t(\theta^*))} \right)^2 - \gamma_1^2 \frac{1}{n} \sum_{t=1}^{n} \left( \frac{f^{[1]}(\epsilon_t(\theta^*))}{f(\epsilon_t(\theta^*))} \right)^2.$$
Similar to Lemma B.1, we know $T_n \overset{p}{\to} 0$ and $T_n' \overset{p}{\to} 0$ under implications b1 and b2. For $R_n$, note that
$$f^{[2]}(\epsilon_t(\theta^*)) = f''(\epsilon_t(\theta^*))\, r_t^2 \frac{(\sigma_t'(\theta^*))^2}{\sigma_t^4(\theta^*)} - f'(\epsilon_t(\theta^*))\, r_t \frac{\sigma_t''(\theta^*)}{\sigma_t^2(\theta^*)} + 2 f'(\epsilon_t(\theta^*))\, r_t \frac{(\sigma_t'(\theta^*))^2}{\sigma_t^3(\theta^*)}$$
$$= f''(\epsilon_t)\, \epsilon_t^2 \left( \frac{\sigma_t'(\theta^*)}{\sigma_t(\theta^*)} \right)^2 - f'(\epsilon_t)\, \epsilon_t \frac{\sigma_t''(\theta^*)}{\sigma_t(\theta^*)} + 2 f'(\epsilon_t)\, \epsilon_t \left( \frac{\sigma_t'(\theta^*)}{\sigma_t(\theta^*)} \right)^2 \qquad (\text{denote } \epsilon_t = \epsilon_t(\theta^*)).$$
Therefore we know
$$R_n = \frac{\gamma_2}{2} \frac{1}{n} \sum_{t=1}^{n} \left[ \frac{f''(\epsilon_t)}{f(\epsilon_t)} \epsilon_t^2 \left( \frac{\sigma_t'(\theta^*)}{\sigma_t(\theta^*)} \right)^2 - \frac{f'(\epsilon_t)}{f(\epsilon_t)} \epsilon_t \frac{\sigma_t''(\theta^*)}{\sigma_t(\theta^*)} + 2 \frac{f'(\epsilon_t)}{f(\epsilon_t)} \epsilon_t \left( \frac{\sigma_t'(\theta^*)}{\sigma_t(\theta^*)} \right)^2 \right] \overset{p}{\to} \frac{1}{2} \lambda_2 \gamma_2 M_2 - \frac{1}{2} \lambda_1 \gamma_2 M_3 + \lambda_1 \gamma_2 M_2,$$
where $M_3 = E[\sigma_t''(\theta^*)/\sigma_t(\theta^*)]$. For $R_n'$, since
$$f^{[1]}(\epsilon_t(\theta^*)) = -f'(\epsilon_t(\theta^*))\, \epsilon_t \frac{\sigma_t'(\theta^*)}{\sigma_t(\theta^*)},$$
we have $R_n' \overset{p}{\to} \gamma_1^2 \tau_0 M_2$.
In summary, under implications a1, a2, b1 and b2, we have
$$\hat L_n^{[2]}(\theta^*) \overset{p}{\to} \left( 1 + \frac{1}{2} \lambda_2 \gamma_2 + \lambda_1 \gamma_2 - \gamma_1^2 \tau_0 \right) M_2 - \left( 1 + \frac{1}{2} \lambda_1 \gamma_2 \right) M_3 = (1 - \tau_0) M_2.$$
On the other hand, consider the true log-likelihood function
$$L_n(\theta) = \frac{1}{n} \sum_{t=1}^{n} \left[ -\log \sigma_t(\theta) + \log f\!\left( \frac{r_t}{\sigma_t(\theta)} \right) \right].$$
It is easy to see that $L_n^{[2]}(\theta^*) \overset{p}{\to} (1 - \tau_0) M_2$. Therefore, Theorem 3.3 is proved. □
REMARK B.2. Note that $\tau_0 > 1$, since
$$\tau_0 - 1 = \int \frac{\epsilon^2 [f'(\epsilon)]^2}{f(\epsilon)}\, d\epsilon - 1 = \int \frac{\epsilon^2 [f'(\epsilon)]^2}{f(\epsilon)}\, d\epsilon + 2\lambda_1 + 1 = \int \left( \frac{\epsilon f'(\epsilon)}{f(\epsilon)} + 1 \right)^2 f(\epsilon)\, d\epsilon > 0.$$
Therefore $(1 - \tau_0) M_2$ is negative definite.

Proof of Corollaries 3.2 and 3.3: When the regression equation is added into the model, the true likelihood function becomes
$$L_n(\vartheta) = \frac{1}{n} \sum_{t=1}^{n} \log \left[ \frac{1}{\sigma_t(\vartheta)}\, f\!\left( \frac{Y_t - \vartheta X_t}{\sigma_t(\vartheta)} \right) \right],$$
and the two-step semi-parametric likelihood becomes
$$\hat L_n(\vartheta) = \frac{1}{n} \sum_{t=1}^{n} \log \left[ \frac{1}{\sigma_t(\vartheta)}\, \hat f_n\!\left( \frac{Y_t - \vartheta X_t}{\sigma_t(\vartheta)} \right) \right].$$
It is easy to see that the additional terms in $L_n^{[1]}(\vartheta)$ and $\hat L_n^{[1]}(\vartheta)$ introduced by the regression equation are, respectively,
$$-\frac{1}{n} \sum_{t=1}^{n} \frac{f'(\epsilon_t)}{f(\epsilon_t)} \frac{X_t}{\sigma_t(\vartheta)} \qquad \text{and} \qquad -\frac{1}{n} \sum_{t=1}^{n} \frac{\hat f_n^{[1]}(\epsilon_t)}{\hat f_n(\epsilon_t)} \frac{X_t}{\sigma_t(\vartheta)}.$$
Therefore, following the same argument as in the proof of Theorem 3.2, we know
$$\frac{1}{n} \sum_{t=1}^{n} \left( \frac{\hat f_n^{[1]}(\epsilon_t)}{\hat f_n(\epsilon_t)} - \frac{f'(\epsilon_t)}{f(\epsilon_t)} \right) \frac{X_t}{\sigma_t(\vartheta)} \overset{p}{\to} 0,$$
and therefore the consistency of $\hat\vartheta_n^{\mathrm{SMLE}}$ is obtained. The asymptotic normality of $\hat\vartheta_n^{\mathrm{SMLE}}$ can be derived similarly. □
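As a concrete instance of Remark B.2, take $f$ to be the standard normal density (our illustrative choice): then $f'(\epsilon)/f(\epsilon) = -\epsilon$, so $\tau_0 = E[\epsilon^4] = 3 > 1$. The numerical integration below confirms this:

```python
import numpy as np
from scipy.integrate import quad

phi = lambda e: np.exp(-0.5 * e * e) / np.sqrt(2 * np.pi)
# For the standard normal, e * f'(e)/f(e) = -e^2, so tau_0 = E[e^4].
tau0, _ = quad(lambda e: e ** 4 * phi(e), -np.inf, np.inf)
print(round(tau0, 6))   # 3.0
```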
The Econometrics Journal (2011), volume 14, pp. 278–303. doi: 10.1111/j.1368-423X.2011.00348.x
Test statistics for prospect and Markowitz stochastic dominances with applications

ZHIDONG BAI†, HUA LI†,‡, HUIXIA LIU‡,§ AND WING-KEUNG WONG¶

† KLASMOE, School of Mathematics and Statistics, Northeast Normal University, 5268 Ren Min Street, Chang Chun City, Ji Lin Province, China (130024). E-mail: [email protected]

‡ School of Mathematics and Statistics, Northeast Normal University, 5268 Ren Min Street, Chang Chun City, Ji Lin Province, China (130024). E-mail: [email protected]

§ Department of Statistics and Applied Probability, National University of Singapore, 21 Lower Kent Ridge Road, Singapore 119077. E-mail: [email protected]

¶ Department of Economics, Hong Kong Baptist University, WLB, Shaw Campus, Kowloon Tong, Hong Kong. E-mail: [email protected]
First version received: February 2008; final version accepted: March 2011.
Summary: Levy and Levy (2002, 2004) extend the stochastic dominance (SD) theory for risk averters and risk seekers by developing the prospect SD (PSD) and Markowitz SD (MSD) theory for investors with S-shaped and reverse S-shaped (RS-shaped) utility functions, respectively. Davidson and Duclos (2000) develop SD tests for risk averters, whereas Sriboonchita et al. (2009) modify their statistics to obtain SD tests for risk seekers. In this paper, we extend their work by developing new statistics for both PSD and MSD of the first three orders. These statistics provide a tool to examine the preferences of investors with the S-shaped utility functions proposed by Kahneman and Tversky (1979) in their prospect theory and of investors with the RS-shaped utility functions proposed by Markowitz (1952a). We also derive the limiting distributions of the test statistics, which take the form of stochastic processes. In addition, we propose a bootstrap method to determine the critical points of the tests and prove the consistency of the bootstrap tests. To illustrate the applicability of our proposed statistics, we apply them to study the preferences of investors with the corresponding S-shaped and RS-shaped utility functions vis-à-vis returns on iShares and vis-à-vis returns on traditional stocks and Internet stocks before and after the Internet bubble.

Keywords: Hypothesis testing, Markowitz stochastic dominance, Prospect stochastic dominance, Risk averse, Risk seeking, RS-shaped utility function, S-shaped utility function, Test statistics.
1. INTRODUCTION There are two basic approaches to the problem of portfolio selection under uncertainty. One approach is the mean-risk (MR) analysis.1 In this approach, the portfolio choice is made with respect to two measures: the expected portfolio mean return and portfolio risk. A portfolio is preferred if it has higher expected return and smaller risk. Another approach is based on the concept of utility theory; see for example, Markowitz (1952a) and Post and Levy (2005) for more information. Davidson and Duclos (2000), Barrett and Donald (2003), Linton et al. (2005, 2010), Horv´ath et al. (2006), Schechtman et al. (2008) and others have developed several SD test statistics using this approach.2 There are convenient computational recipes and geometric interpretations of the trade-off between the two measures. A disadvantage of the former is that it is derived by assuming the von Neumann-Morgenstern quadratic utility function and that returns are normally distributed (Feldstein, 1969, Hanoch and Levy, 1969). The latter offers a mathematically rigorous treatment for portfolio selection. In addition, the SD test statistics are superior to the MR test statistics because the conclusions drawn by these SD test statistics between the assets being examined could be used by investors to compare their expected utility on these assets since they do not require investors to possess a quadratic utility function nor any form of the distribution for the assets being analysed. Among the MR analysis, the most popular measure is the Sharpe ratio (SR) introduced by Sharpe (1966). As the SR requires strong assumptions that the assets being analysed have to be i.i.d., various measures for MR analysis have been developed to improve the SR, including the Value-at-Risk, VaR hereafter, (Jorion, 2000), conditional VaR (Rockafellar and Uryasev, 2000), and expected shortfall (Chen, 2008). We note that there are some relationships between MR and SD. 
For example, Ogryczak and Ruszczy´nski (2002) establish the equivalence between VaR and the first degree SD (FSD) whereas Ma and Wong (2010) show that the conditionalVaR is equivalent to the second degree SD (SSD) under certain conditions. We also note that Markowitz (1959) justifies in the mean-variance (MV) analysis that return distributions are not only Gaussian and utility functions are not only quadratic. Rather, it is that for the kinds of return distributions in reasonably conservative portfolios, MV approximations to expected utility are quite robust. This is amply confirmed in the experiments reported in Levy and Markowitz (1979).3 Meyer (1987) and Wong (2006, 2007) have also shown that the conclusion drawn from the MR comparison is equivalent to the comparison of expected utility maximization for any risk-averse investor and for assets with any distribution if the assets being examined belong to the same location-scale family. In addition, one could also apply Theorem 10 in Li and Wong (1999) to generalize the result so that it is valid for any risk-averse investor and for portfolios with any distribution if the portfolios being examined belong to the same convex combinations of (same or different) location-scale families. So far, the aforementioned theories or statistics could rely on the von Neumann and Morgenstern (1944) expected utility theory. However, some experimental studies (see, e.g. Markowitz, 1952a) reveal a few contradictions to the expected utility paradigm. To overcome this problem, Kahneman and Tversky (1979) propose the prospect theory to study the behaviour of investors with value (S-shaped utility) functions whereas Markowitz (1952a) first suggests to 1
1 See, for example, Leung and Wong (2008) and the references therein for more discussion.
2 Koning and Ridder (2003) develop statistics to test whether choice probabilities are consistent with maximization of random utilities. Dardanoni and Forcina (1999) develop tests to draw statistical inference for Lorenz curve orderings. We note that these approaches could be modified to examine the preference of risk averters.
3 We would like to thank Professor Harry Markowitz for providing us with the information in this paragraph.
C 2011 The Author(s). The Econometrics Journal C 2011 Royal Economic Society.
Z. Bai et al.
examine behaviours of investors with reverse S-shaped (RS-shaped) utility functions. Levy and Levy (2002, 2004) are the first to extend the work of Markowitz (1952a) and others by developing a new criterion, Markowitz stochastic dominance (MSD), to determine the dominance of one investment alternative over another for investors with RS-shaped utility functions, and another criterion, prospect stochastic dominance (PSD), to determine the dominance of one investment alternative over another for investors with S-shaped utility functions. In addition, Wong and Chan (2008) extend the PSD and MSD theory to the first three orders. Among the SD tests, Davidson and Duclos (2000) and others develop SD techniques to test for statistically significant differences between the distribution functions of two returns for risk averters, whereas Sriboonchita et al. (2009) modify the Davidson and Duclos (DD) test to examine the choice of assets for risk seekers. In this paper, we extend their work by developing new SD statistics for both PSD and MSD of the first three orders. One could use these statistics to identify assets suited to the preferences of investors with S-shaped utility functions proposed by Kahneman and Tversky (1979) and of investors with RS-shaped utility functions proposed by Markowitz (1952a). We also derive the limiting distributions of the test statistics, which are Gaussian processes. In addition, we propose a bootstrap method to determine the critical points of the tests and prove the consistency of the bootstrap tests. To illustrate the applicability of our proposed statistics, we first apply the SD test statistics developed in this paper to examine whether there is any MSD or PSD relationship among iShares. Thereafter, we study investors' behaviours vis-à-vis returns of traditional stocks and Internet stocks during the Internet bubble.
These findings could be used to rank preferences among iShares, and over this cycle, for investors with S-shaped and RS-shaped utility functions; this could, in turn, draw inferences for utility theory and behavioural finance. The paper is organized as follows. We begin by introducing definitions and notation and stating some basic SD properties for investors with S-shaped and RS-shaped utility functions in the next section. In Section 3, we develop the theory of SD tests for investors with S-shaped and RS-shaped utility functions. Section 4 discusses the bootstrap method used to determine the critical points of the tests. Section 5 illustrates the applicability of our proposed SD test statistics to study the behaviours of investors with the corresponding S-shaped and RS-shaped utility functions vis-à-vis returns of iShares and vis-à-vis returns of traditional stocks and Internet stocks before and after the Internet bubble. Section 6 concludes our findings. Proofs are relegated to the Appendices.
2. DEFINITIONS, NOTATIONS AND BASIC PROPERTIES

Let R be the set of extended real numbers and \Omega = [a, b] be a subset of R in which a < 0 and b > 0. Let \mathcal{B} be the Borel σ-field of \Omega and \mu be a measure on (\Omega, \mathcal{B}). We first define the functions F^A and F^D of the measure \mu on the support \Omega as

F_1^A(x) = \mu[a, x]   and   F_1^D(x) = \mu[x, b]   for all x \in \Omega.   (2.1)

Function F is a (cumulative) distribution function (CDF) and \mu is a probability measure if \mu(\Omega) = 1.4 All functions are assumed to be measurable and all random variables are assumed to satisfy F_1^A(a) = 0 and F_1^D(b) = 0. By basic knowledge of measure and probability theory, for any random variable X with an associated probability measure P, there exists a unique induced probability measure \mu on (\Omega, \mathcal{B}) and a distribution function F such that F satisfies (2.1) and
4 In this paper, the definition of F is slightly different from the 'traditional' definition of a distribution function.
Prospect and Markowitz stochastic dominance tests
\mu(B) = P(X^{-1}(B)) = P(X \in B) for any B \in \mathcal{B}. An integral written in the form \int_A f(t) d\mu(t) or \int_A f(t) dF(t) is understood as a Lebesgue-Stieltjes integral for an integrable function f(t). If the integrals have the same value for all A among (c, d], [c, d) or [c, d], then we write the integral as \int_c^d f(t) d\mu(t). In addition, if \mu is a Borel measure with \mu(c, d] = d − c, then we write the integral as \int_c^d f(t) dt. Random variables, denoted by Y and Z, defined on \Omega are considered together with their corresponding distribution functions F and G, respectively. The following notation will be used throughout this paper:

\mu_F = \mu_Y = E(Y) = \int_a^b x dF(x),   \mu_G = \mu_Z = E(Z) = \int_a^b x dG(x);
H_j^A(x) = \int_a^x H_{j-1}^A(y) dy,   H_j^D(x) = \int_x^b H_{j-1}^D(y) dy,   j = 2, 3;   (2.2)

where H = F or G.5 In (2.2), \mu_F = \mu_Y is the mean of Y, whereas \mu_G = \mu_Z is the mean of Z. For H = F or G, we define the following functions for PSD:

H_1^a(x) = H_1^A(x) = H(x),   H_1^d(x) = H_1^D(x) = 1 − H(x);
H_j^d(y) = \int_y^0 H_{j-1}^d(t) dt, y \le 0;   and   H_j^a(x) = \int_0^x H_{j-1}^a(t) dt, x \ge 0,   for j = 2, 3.   (2.3)
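As a concrete anchor for the definitions in (2.2), the iterated integrals of the empirical CDF have the closed-form sample analogues used later in the test statistics. The following is a minimal numerical sketch (our own illustrative Python code, with an arbitrary five-point sample; all variable names are ours) checking that integrating the empirical CDF once reproduces the first-moment form:

```python
import numpy as np

# Check on a toy sample that H_2^A(x) = \int_a^x H_1^A(y) dy equals the
# closed form (1/N) * sum_i (x - h_i)_+ that reappears in (3.3).
sample = np.array([-0.8, -0.3, 0.1, 0.4, 0.9])
a, x = -1.0, 0.5

grid = np.linspace(a, x, 200001)
ecdf = (sample[None, :] <= grid[:, None]).mean(axis=1)   # H_1^A on the grid
h2_quad = float((ecdf[:-1] * np.diff(grid)).sum())       # left Riemann sum of H_1^A
h2_closed = float(np.maximum(x - sample, 0.0).mean())    # (1/N) sum (x - h_i)_+

assert abs(h2_quad - h2_closed) < 1e-3
```

The same recursion with one more integration yields the (x − h_i)_+^2 / 2! form for j = 3.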
In order to make the computation easier, we further define the following functions for MSD and PSD:

H_j^M(x) = { H_j^A(x) if x \le 0; H_j^D(x) if x > 0 }   and   H_j^P(x) = { H_j^d(x) if x \le 0; H_j^a(x) if x > 0 },   (2.4)

where H = F and G and j = 1, 2 and 3. As pointed out by Markowitz (1952a) and others, investors' behaviours can be different in the positive and negative domains of the return. Without loss of generality, in this paper, 'upside profit' refers to the positive domain of the return and 'downside risk' to the negative domain of the return. We first consider the function H_j^M, which is equal to H_j^A in downside risk and equal to H_j^D in upside profit. By comparing F_j^M and G_j^M for the two assets F and G, one could find an asset that shows a smaller probability in downside risk and a bigger probability in upside profit. Once investors find an F with a smaller F_j^A integral in downside risk and a higher F_j^D integral in upside profit, they may believe that F has the best of both worlds: a smaller probability of losing in downside risk and a larger probability of gaining in upside profit. On the other hand, one could compare H_j^P for H = F and G, which is equal to the H_j^a integral in upside profit and the H_j^d integral in downside risk. Wong and Chan (2008) show that H_j^M can be used to develop the MSD theory, whereas H_j^P can be used to develop the PSD theory.6 We first state the following definitions for this purpose:

DEFINITION 2.1. Given two random variables Y and Z with F and G as their respective distribution functions, Y dominates Z and F dominates G in the sense of FMSD (SMSD, TMSD), denoted by Y \succ_1^M Z or F \succ_1^M G (Y \succ_2^M Z or F \succ_2^M G, Y \succ_3^M Z or F \succ_3^M G), if and only if F_1^M(−x) \le G_1^M(−x) (F_2^M(−x) \le G_2^M(−x), F_3^M(−x) \le G_3^M(−x)) and F_1^M(x) \ge G_1^M(x) (F_2^M(x) \ge G_2^M(x), F_3^M(x) \ge G_3^M(x)) for each x \ge 0, where FMSD, SMSD and TMSD stand for first-, second- and third-order MSD, respectively.

5 The above definitions are commonly used in the literature; see, for example, Wong and Li (1999) and Anderson (2004).
6 Thus, we call the function H_j^M the j-th order MSD integral and the function H_j^P the j-th order PSD integral.
DEFINITION 2.2. Given two random variables Y and Z with F and G as their respective distribution functions, Y dominates Z and F dominates G in the sense of FPSD (SPSD, TPSD), denoted by Y \succ_1^P Z or F \succ_1^P G (Y \succ_2^P Z or F \succ_2^P G, Y \succ_3^P Z or F \succ_3^P G), if and only if F_1^P(−x) \ge G_1^P(−x) (F_2^P(−x) \ge G_2^P(−x), F_3^P(−x) \ge G_3^P(−x)) and F_1^P(x) \le G_1^P(x) (F_2^P(x) \le G_2^P(x), F_3^P(x) \le G_3^P(x)) for each x \ge 0, where FPSD, SPSD and TPSD stand for first-, second- and third-order PSD, respectively.

We note that in Definitions 2.1 and 2.2, if in addition there exists a strict inequality for some x in [a, b], we say that Y dominates Z and F dominates G in the sense of SFwSD, SSwSD and STwSD, denoted by Y \succ\succ_1^w Z or F \succ\succ_1^w G, Y \succ\succ_2^w Z or F \succ\succ_2^w G, and Y \succ\succ_3^w Z or F \succ\succ_3^w G, respectively, where SFwSD, SSwSD and STwSD stand for strict first-, second- and third-order wSD, respectively, with w = M or P. We next state the different types of utility functions in the following definition.7

DEFINITION 2.3. For j = 1, 2, 3, U_j^A, U_j^D, U_j^S and U_j^R are the sets of utility functions u such that
U_j^A (U_j^SA) = {u : (−1)^i u^{(i)} \le (<) 0, i = 1, ..., j},
U_j^D (U_j^SD) = {u : u^{(i)} \ge (>) 0, i = 1, ..., j},
U_j^S (U_j^SS) = {u : u_+ \in U_j^A (U_j^SA) and u_− \in U_j^D (U_j^SD)} and
U_j^R (U_j^SR) = {u : u_+ \in U_j^D (U_j^SD) and u_− \in U_j^A (U_j^SA)},
where u^{(i)} is the i-th derivative of the utility function u, u_+ = max{u, 0} and u_− = min{u, 0}.

We note that investors in U_j^A are risk averse, whereas investors in U_j^D are risk seeking. We also note that, without loss of generality, the reference point (status quo) for U_j^S and U_j^R is assumed to be zero in this definition.8 Thus, we refer to positive outcomes as gains and negative outcomes as losses. In this situation, investors in U_j^R with RS-shaped utility functions are risk seeking for gains but risk averse for losses, whereas investors in U_j^S with S-shaped utility functions are risk averse for gains but risk seeking for losses.
We note that Tversky and Kahneman (1992) develop prospect theory and call u \in U_2^S a value function.9 For convenience, we call investors with utility functions in U_j^S prospect investors, or investors with prospect preference, and investors with utility functions in U_j^R Markowitz investors, or investors with Markowitz preference. Choosing between F and G in accordance with a consistent set of preferences will satisfy the von Neumann and Morgenstern (1944) consistency properties. Accordingly, Y is (strictly) preferred to Z if \Delta Eu \equiv u(F) − u(G) \equiv u(Y) − u(Z) \ge 0 (> 0), where u(F) \equiv u(Y) \equiv \int_a^b u(x) dF(x) and u(G) \equiv u(Z) \equiv \int_a^b u(x) dG(x). The theories of MSD and PSD are very useful in ranking investment prospects under uncertainty because ranking assets by MSD and PSD is equivalent to expected utility maximization
7 We note that the theory can easily be extended to utilities that are non-differentiable and/or to non-expected utility functions. In this paper, we skip the discussion of non-differentiable utilities and non-expected utility functions; readers may refer to Wong and Ma (2008) for the discussion.
8 One could easily extend the theory to include U_j^S and U_j^R with a non-zero status quo.
9 They have proposed u(x) = x^{\gamma_G} if x \ge 0 with \gamma_G \in (0, 1), and u(x) = −\lambda(−x)^{\gamma_L} if x < 0 with \lambda > 0 and \gamma_L \in (0, 1), as a value function.
for the preferences of investors with RS-shaped utility functions and S-shaped utility functions, respectively.10 We note that a hierarchical relationship exists in MSD and PSD: FwSD implies SwSD, which in turn implies TwSD, where w = M or P. However, the converse may not be true: the existence of SwSD does not imply the existence of FwSD and, likewise, the existence of TwSD does not imply the existence of SwSD or FwSD, where w = M or P. Thus, only the lowest dominance order of MSD and PSD is reported. Readers may refer to Levy and Levy (2002, 2004), Wong and Chan (2008) and Broll et al. (2010) for other properties of MSD and/or PSD.
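To make the definitions of this section concrete, the following Python sketch (our own illustrative code, not part of the paper's formal development; function names are ours) computes the empirical MSD integrals of (2.4) and performs the first-order dominance check of Definition 2.1 on toy data:

```python
import math
import numpy as np

def msd_integral(sample, x, j):
    """Empirical analogue of the j-th order MSD integral H_j^M(x) in (2.4):
    the A-integral for x <= 0 and the D-integral for x > 0."""
    s = np.asarray(sample, dtype=float)
    if j == 1:                    # H_1^A(x) = H(x);  H_1^D(x) = P(X >= x)
        return float(np.mean(s >= x) if x > 0 else np.mean(s <= x))
    u = np.maximum(s - x, 0.0) if x > 0 else np.maximum(x - s, 0.0)
    return float(np.mean(u ** (j - 1))) / math.factorial(j - 1)

def fmsd_dominates(f, g, grid):
    """First-order check of Definition 2.1: F_1^M <= G_1^M on the negative
    part of the grid and F_1^M >= G_1^M on the positive part."""
    neg_ok = all(msd_integral(f, x, 1) <= msd_integral(g, x, 1)
                 for x in grid if x <= 0)
    pos_ok = all(msd_integral(f, x, 1) >= msd_integral(g, x, 1)
                 for x in grid if x > 0)
    return neg_ok and pos_ok

g = np.linspace(-1.0, 1.0, 201)   # toy 'returns'
f = g + 0.5                       # pointwise better outcomes
grid = np.linspace(-1.5, 1.5, 61)
```

On this toy data, f FMSD-dominates g but not conversely; the higher-order checks follow by raising j in `msd_integral`.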
3. MSD AND PSD TESTS

Now we turn to developing the new SD statistics for prospect investors with S-shaped utility functions and Markowitz investors with RS-shaped utility functions. The new SD statistics are used to test the equality or dominance of F_j^w and G_j^w, where w = M or P and H_j^M and H_j^P are defined in (2.4) with H = F or G. That is, we construct statistics to test the equality of F and G under the following null:

H_0^w : F_j^w \equiv G_j^w,   (3.1)

against the following three alternatives:

H_1^w : F_j^w \not\equiv G_j^w, two-sided test,
H_{1l}^w : F \succ_j^w G, one-sided test,   (3.2)
H_{1r}^w : F \prec_j^w G, one-sided test.

When w = M, the three alternative hypotheses are equivalent to

H_1^M : F_j^M(x) \ne G_j^M(x), where the inequality is strict for at least one x;
H_{1l}^M : F_j^M(x) \le G_j^M(x) for all x \le 0 and F_j^M(x) \ge G_j^M(x) for all x > 0, where one or both of the inequalities is/are strict for at least one x;
H_{1r}^M : F_j^M(x) \ge G_j^M(x) for all x \le 0 and F_j^M(x) \le G_j^M(x) for all x > 0, where one or both of the inequalities is/are strict for at least one x;

while when w = P,

H_1^P : F_j^P(x) \ne G_j^P(x), where the inequality is strict for at least one x;
H_{1l}^P : F_j^P(x) \ge G_j^P(x) for all x \le 0 and F_j^P(x) \le G_j^P(x) for all x > 0, where one or both of the inequalities is/are strict for at least one x;
H_{1r}^P : F_j^P(x) \le G_j^P(x) for all x \le 0 and F_j^P(x) \ge G_j^P(x) for all x > 0, where one or both of the inequalities is/are strict for at least one x.

10 Let Y and Z be random variables with distribution functions F and G, respectively, and let u be a utility function. For j = 1, 2 and 3, we have F \succ_j^M (\succ\succ_j^M) G if and only if u(F) \ge (>) u(G) for any u in U_j^R (U_j^SR), and F \succ_j^P (\succ\succ_j^P) G if and only if u(F) \ge (>) u(G) for any u in U_j^S (U_j^SS). Readers may refer to Levy and Levy (2002, 2004) and Wong and Chan (2008) for more information.
In this paper, we propose to use the following test statistic to test the null in (3.1) against the alternatives in (3.2) for w = M:

T_j^M(x) \equiv { T_j^D(x) if x > 0; T_j^A(x) if x \le 0 }
         = { [\hat F_j^D(x) − \hat G_j^D(x)] / \sqrt{\hat V_j^D(x)} if x > 0;
             [\hat G_j^A(x) − \hat F_j^A(x)] / \sqrt{\hat V_j^A(x)} if x \le 0 },   (3.3)

where

\hat H_j^A(x) = (1/N_h) \sum_{i=1}^{N_h} (x − h_i)_+^{j-1}/(j−1)!,   \hat V_j^A(x) = \hat V_{F_j}^A(x) + \hat V_{G_j}^A(x),
\hat V_{H_j}^A(x) = (1/N_h) [ (1/(N_h((j−1)!)^2)) \sum_{i=1}^{N_h} (x − h_i)_+^{2(j-1)} − \hat H_j^A(x)^2 ],

and

\hat H_j^D(x) = (1/N_h) \sum_{i=1}^{N_h} (h_i − x)_+^{j-1}/(j−1)!,   \hat V_j^D(x) = \hat V_{F_j}^D(x) + \hat V_{G_j}^D(x),
\hat V_{H_j}^D(x) = (1/N_h) [ (1/(N_h((j−1)!)^2)) \sum_{i=1}^{N_h} (h_i − x)_+^{2(j-1)} − \hat H_j^D(x)^2 ],

and we propose to use the following test statistic to test the null in (3.1) against the alternatives in (3.2) for w = P:

T_j^P(x) \equiv { T_j^a(x) if x > 0; T_j^d(x) if x \le 0 }
         = { [\hat G_j^a(x) − \hat F_j^a(x)] / \sqrt{\hat V_j^a(x)} if x > 0;
             [\hat F_j^d(x) − \hat G_j^d(x)] / \sqrt{\hat V_j^d(x)} if x \le 0 },   (3.4)

where

\hat H_j^d(x) = (1/N_h) \sum_{i=1}^{N_h} (h_i − x)_+^{j-1} I_{\{h_i \le 0\}}/(j−1)!,   \hat V_j^d(x) = \hat V_{F_j}^d(x) + \hat V_{G_j}^d(x),
\hat V_{H_j}^d(x) = (1/N_h) [ (1/(N_h((j−1)!)^2)) \sum_{i=1}^{N_h} (h_i − x)_+^{2(j-1)} I_{\{h_i \le 0\}} − \hat H_j^d(x)^2 ],

and

\hat H_j^a(x) = (1/N_h) \sum_{i=1}^{N_h} (x − h_i)_+^{j-1} I_{\{h_i \ge 0\}}/(j−1)!,   \hat V_j^a(x) = \hat V_{F_j}^a(x) + \hat V_{G_j}^a(x),
\hat V_{H_j}^a(x) = (1/N_h) [ (1/(N_h((j−1)!)^2)) \sum_{i=1}^{N_h} (x − h_i)_+^{2(j-1)} I_{\{h_i \ge 0\}} − \hat H_j^a(x)^2 ],
for H = F, G and h = f, g.

We develop the properties of the aforementioned tests in the following theorem.

THEOREM 3.1. Let {f_i} (i = 1, 2, ..., N_f) and {g_i} (i = 1, 2, ..., N_g) be observations drawn from the independent random variables Y and Z, respectively, with distribution functions F and G, respectively, such that their normalized empirical processes tend to Brownian bridges. The integrals H_j^w for w = A, D, a and d are defined in (2.2) and (2.3), respectively, for H = F and G, and for j = 1, 2 and 3. Under the null hypothesis F \equiv G, if N_h \to \infty, then
(a) for MSD, the j-th order MSD test statistic T_j^M(x) (j = 1, 2 and 3) defined in (3.3) tends to a Gaussian process with mean 0, variance 1 and correlation function r_j^M(x, y), where r_j^M(x, y) is defined in Appendix A; and
(b) for PSD, the j-th order PSD test statistic T_j^P(x) (j = 1, 2 and 3) defined in (3.4) tends to a Gaussian process with mean 0, variance 1 and correlation function r_j^P(x, y), where r_j^P(x, y) is defined in Appendix A.

Let A = min{f_i, g_j} and B = max{f_i, g_j}. Based on this theorem, to test the hypotheses in (3.2), we propose to reject the null hypothesis H_0^w if

max_{A<x<B} |T_j^w(x)| > M_{\infty,\alpha/2}^w, for the alternative H_1^w;
max_{A<x<B} T_j^w(x) \ge M_{\infty,\alpha}^w, for the alternative H_{1l}^w;
min_{A<x<B} T_j^w(x) \le −M_{\infty,\alpha}^w, for the alternative H_{1r}^w;

in which we suggest computing the critical value M_{\infty,\alpha}^w (w = M, P) by the bootstrap approach discussed in the next section.
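The statistic in (3.3) is straightforward to compute from data. Below is a minimal Python sketch (our own code; the function names, and the assumption of a continuous sample so that ties at x are negligible for j = 1, are ours, and no Appendix A quantities are involved):

```python
import math
import numpy as np

def _moments(h, x, j):
    """Empirical integral H_j(x) and variance V_{H_j}(x) from (3.3): the D-form
    (h_i - x)_+ is used for x > 0 and the A-form (x - h_i)_+ for x <= 0."""
    h = np.asarray(h, dtype=float)
    u = np.maximum(h - x, 0.0) if x > 0 else np.maximum(x - h, 0.0)
    z = (u > 0).astype(float) if j == 1 else u ** (j - 1) / math.factorial(j - 1)
    H = z.mean()
    V = (np.mean(z ** 2) - H ** 2) / len(h)   # V = (1/N_h)[mean z^2 - H^2]
    return H, V

def t_msd(f, g, x, j):
    """T_j^M(x) of (3.3): studentized difference of the empirical integrals,
    with the sign convention G - F on the negative domain."""
    Fh, Vf = _moments(f, x, j)
    Gh, Vg = _moments(g, x, j)
    num = (Fh - Gh) if x > 0 else (Gh - Fh)
    return num / math.sqrt(Vf + Vg)
```

For a sample f that is pointwise larger than g, the statistic is positive on both domains; taking the sup over a grid of x in (A, B) gives the quantities entering the rejection rules above.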
4. DETERMINATION OF CRITICAL VALUES

Suppose that the variables being examined are independent and that the sample series {f_i : i = 1, 2, ..., N_f} and {g_i : i = 1, 2, ..., N_g} are i.i.d. We draw two resamples {f_i^* : i = 1, 2, ..., N_f} and {g_i^* : i = 1, 2, ..., N_g} from the pooled sample {f_i, g_j : i = 1, 2, ..., N_f, j = 1, 2, ..., N_g}; the MSD test statistic \hat T_j^M and the PSD test statistic \hat T_j^P (j = 1, 2, 3) can then be bootstrapped.
Using this method, one could approximate the null distributions of the test statistics and obtain their critical values for max_{A<x<B} \hat T_j^{*M}(x) and max_{A<x<B} \hat T_j^{*P}(x).

Step 2  Draw a sample {f_i^* : i = 1, 2, ..., N_f} from {f_i, g_j : i = 1, 2, ..., N_f, j = 1, 2, ..., N_g} with replacement and draw another sample {g_i^* : i = 1, 2, ..., N_g} in the same way. Compute

M_j = max_{A<x<B} \hat T_j^{*M}(x) = max_{A<x<B} { [\hat F_j^{*D}(x) − \hat G_j^{*D}(x)]/\sqrt{\hat V_j^D(x)} if x > 0; [\hat G_j^{*A}(x) − \hat F_j^{*A}(x)]/\sqrt{\hat V_j^A(x)} if x \le 0 },   (4.1)

P_j = max_{A<x<B} \hat T_j^{*P}(x) = max_{A<x<B} { [\hat F_j^{*d}(x) − \hat G_j^{*d}(x)]/\sqrt{\hat V_j^d(x)} if x \le 0; [\hat G_j^{*a}(x) − \hat F_j^{*a}(x)]/\sqrt{\hat V_j^a(x)} if x > 0 }.

Step 3  Repeat Step 2 M times to obtain M values of M_j and of P_j, denoted by M_{jk} and P_{jk} (k = 1, 2, ..., M). Find M_j(\alpha) such that #{|M_{jk}| \ge M_j(\alpha), k \le M} = [M\alpha], that is, the percentile of the distribution of \hat T_j^{*M}; find P_j(\alpha) such that #{|P_{jk}| \ge P_j(\alpha), k \le M} = [M\alpha], the corresponding percentile of the distribution of \hat T_j^{*P}.
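Steps 2 and 3 can be coded directly. A hedged Python sketch (our own code; the `sup_stat` callable, standing in for max_{A<x<B} \hat T_j^{*w}(x) computed on a grid, is our abstraction):

```python
import numpy as np

def bootstrap_critical_values(f, g, sup_stat, M=199, alpha=0.05, seed=0):
    """Steps 2-3 of Section 4 in sketch form: resample f* and g* from the
    pooled sample with replacement, recompute the sup statistic on each
    resample, and return the upper-alpha percentile of the M bootstrap draws."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([np.asarray(f, float), np.asarray(g, float)])
    draws = np.empty(M)
    for k in range(M):
        f_star = rng.choice(pooled, size=len(f), replace=True)
        g_star = rng.choice(pooled, size=len(g), replace=True)
        draws[k] = sup_stat(f_star, g_star)
    return float(np.quantile(draws, 1.0 - alpha))
```

In applications, `sup_stat` would evaluate the studentized differences of (4.1) over a grid of x between the pooled sample minimum and maximum and return their maximum.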
REMARK 4.1. The samples {f_i^*, g_j^* : i = 1, 2, ..., N_f, j = 1, 2, ..., N_g} are i.i.d. random samples with the distribution function F_N^* = (N_f F_{N_f} + N_g G_{N_g})/N, where N = N_f + N_g. When the null hypothesis F = G holds, the limit of F_N^*, the distribution function of the resample {f_i^*, g_j^* : i = 1, 2, ..., N_f, j = 1, 2, ..., N_g}, is F (or G), so it is reasonable to simulate the critical value from the resamples. Under the alternative hypothesis, although the statistic computed from the resamples no longer approximates the true value, the resamples {f_i^* : i = 1, 2, ..., N_f} and {g_j^* : j = 1, 2, ..., N_g} still share the same distribution function F_N^*, so the simulated value should not be too far from the true one. In this case, however, the statistic computed from the original samples {f_i, g_j : i = 1, 2, ..., N_f, j = 1, 2, ..., N_g} will almost always be bigger than that computed from the resamples, because the original samples have different distribution functions F and G.
REMARK 4.2. Start with N = 10,000 simulation replications, N_f = 200, N_g = 200 and \alpha = 0.05. Given that F and G are from some specific distributions, we compare the size (the probability of a type I error) and the power for several pairs of F and G. For example, suppose F = N(0, 1) and G = N(0, 1), N(0.1, 1), N(0.2, 1), N(0.3, 1), N(0.4, 1) and N(0.5, 1), respectively; we obtain the simulation results shown in Tables 1 and 2.
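The experiment of Remark 4.2 can be sketched in Python. The sketch below is our own simplification: it uses a simple first-order sup statistic on a fixed grid rather than the full T_j^w of (3.3)-(3.4), and far fewer replications than the 10,000 of the remark, so it only illustrates the design:

```python
import numpy as np

def rejection_rate(mu, n=80, M=49, n_rep=60, alpha=0.05, seed=42):
    """Monte Carlo sketch of Remark 4.2 (much smaller numbers): F = N(0,1),
    G = N(mu,1); the sup statistic is the largest absolute difference of the
    empirical CDFs on a grid, and the critical value is bootstrapped from the
    pooled sample as in Section 4."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(-2.0, 2.0, 25)

    def sup_stat(a, b):
        Fa = (a[None, :] <= grid[:, None]).mean(axis=1)   # empirical CDFs on grid
        Gb = (b[None, :] <= grid[:, None]).mean(axis=1)
        return float(np.max(np.abs(Fa - Gb)))

    rejections = 0
    for _ in range(n_rep):
        f = rng.standard_normal(n)
        g = rng.standard_normal(n) + mu
        pooled = np.concatenate([f, g])
        draws = [sup_stat(rng.choice(pooled, n, replace=True),
                          rng.choice(pooled, n, replace=True)) for _ in range(M)]
        if sup_stat(f, g) > np.quantile(draws, 1.0 - alpha):
            rejections += 1
    return rejections / n_rep
```

Here `rejection_rate(0.0)` estimates the size and `rejection_rate(mu)` for mu > 0 the power; as in Tables 1 and 2, the power grows as F and G move further apart.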
Table 1. The size and the power of MSD.
         N(0,1)   N(0.1,1)   N(0.2,1)   N(0.3,1)   N(0.4,1)   N(0.5,1)
j = 1    0.045    0.133      0.373      0.734      0.930      0.991
j = 2    0.037    0.117      0.385      0.720      0.925      0.989
j = 3    0.068    0.127      0.307      0.638      0.866      0.983

Table 2. The size and the power of PSD.
         N(0,1)   N(0.1,1)   N(0.2,1)   N(0.3,1)   N(0.4,1)   N(0.5,1)
j = 1    0.058    0.098      0.1597     0.344      0.618      0.845
j = 2    0.051    0.054      0.098      0.215      0.481      0.722
j = 3    0.052    0.063      0.085      0.171      0.298      0.510
From the tables, one could easily conclude that (a) the size matches the pre-determined value reasonably well and (b) when F and G are further apart, the power becomes greater for both the MSD and PSD tests.

We now describe how to construct T_j^{*M}(x) and T_j^{*P}(x) defined in (4.1) for j = 1, 2 and 3 and prove the consistency of the bootstrap tests. We denote

H_{N_h}(x) = (1/N_h) \sum_{i=1}^{N_h} \delta_{h_i}(x),   H_{N_h}^*(x) = (1/N_h) \sum_{i=1}^{N_h} \delta_{h_i^*}(x),
R_N(x) = \lambda F_{N_f}(x) + (1 − \lambda) G_{N_g}(x),

in which H = F or G, h = f or g, N = N_f + N_g and \lambda = N_f/N. Then we have

\hat H_j^D(x) = (1/N_h) \sum_{i=1}^{N_h} (h_i − x)_+^{j-1}/(j−1)! = (1/(j−1)!) \int_x^b (t − x)^{j-1} dH_{N_h}(t)   if x > 0,
\hat H_j^A(x) = (1/N_h) \sum_{i=1}^{N_h} (x − h_i)_+^{j-1}/(j−1)! = (1/(j−1)!) \int_a^x (x − t)^{j-1} dH_{N_h}(t)   if x \le 0,
\hat R_j^D(x) = \lambda \hat F_j^D(x) + (1 − \lambda) \hat G_j^D(x),
\hat R_j^A(x) = \lambda \hat F_j^A(x) + (1 − \lambda) \hat G_j^A(x),
\hat V_{R_j}^D(x) = (1/N_f + 1/N_g) [ (1/(N((j−1)!)^2)) \sum_{h=f,g} \sum_{i=1}^{N_h} (h_i − x)_+^{2(j-1)} − \hat R_j^D(x)^2 ],
\hat V_{R_j}^A(x) = (1/N_f + 1/N_g) [ (1/(N((j−1)!)^2)) \sum_{h=f,g} \sum_{i=1}^{N_h} (x − h_i)_+^{2(j-1)} − \hat R_j^A(x)^2 ],

and

\hat H_j^{*D}(x) = (1/N_h) \sum_{i=1}^{N_h} (h_i^* − x)_+^{j-1}/(j−1)! = (1/(j−1)!) \int_x^b (t − x)^{j-1} dH_{N_h}^*(t)   if x > 0,
\hat H_j^{*A}(x) = (1/N_h) \sum_{i=1}^{N_h} (x − h_i^*)_+^{j-1}/(j−1)! = (1/(j−1)!) \int_a^x (x − t)^{j-1} dH_{N_h}^*(t)   if x \le 0.

Now, we are ready to construct T_j^{*M}(x) (j = 1, 2 and 3) as follows:

T_j^{*M}(x) \equiv { T_j^{*D}(x) if x > 0; T_j^{*A}(x) if x \le 0 }
           = { [\hat F_j^{*D}(x) − \hat G_j^{*D}(x)] / \sqrt{\hat V_R^D(x)}   if x > 0;
               [\hat F_j^{*A}(x) − \hat G_j^{*A}(x)] / \sqrt{\hat V_R^A(x)}   if x \le 0 }.   (4.2)

Similarly, one could construct T_j^{*P}(x). We now develop the consistency results for T_j^{*M}(x) and T_j^{*P}(x) as stated in the following theorem.

THEOREM 4.1. Let {f_i} (i = 1, 2, ..., N_f) and {g_i} (i = 1, 2, ..., N_g) be observations drawn from the independent random variables Y and Z, respectively, with distribution functions F and G, respectively, such that their normalized empirical processes tend to Brownian bridges. Let {f_i^* : i = 1, 2, ..., N_f} and {g_j^* : j = 1, 2, ..., N_g} be resamples drawn from the pooled sample {f_i, g_j : i = 1, 2, ..., N_f, j = 1, 2, ..., N_g}, and let T_j^{*M}(x) and T_j^{*P}(x) be defined in (4.2). Under the null hypothesis F \equiv G, if N_h \to \infty, then
(a) for MSD, T_j^{*M}(x) (j = 1, 2 and 3) defined in (4.2) tends to a Gaussian process with mean 0, variance 1 and correlation function r_j^M(x, y), where r_j^M(x, y) is defined in Appendix A; and
(b) for PSD, T_j^{*P}(x) (j = 1, 2 and 3) defined in (4.2) tends to a Gaussian process with mean 0, variance 1 and correlation function r_j^P(x, y), where r_j^P(x, y) is defined in Appendix A.
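The studentization in (4.2) can be sketched in Python (our own code; note that (4.2), unlike (3.3), keeps F − G on both domains and uses the pooled-mixture variance \hat V_{R_j} computed from the original samples):

```python
import math
import numpy as np

def t_star_msd(f_star, g_star, f, g, x, j):
    """Sketch of T_j^{*M}(x) in (4.2): bootstrap integrals from the resamples
    f*, g*, studentized by the variance of the pooled mixture
    R_N = lambda*F + (1-lambda)*G of the original samples."""
    def z_values(h):
        h = np.asarray(h, float)
        u = np.maximum(h - x, 0.0) if x > 0 else np.maximum(x - h, 0.0)
        return (u > 0).astype(float) if j == 1 else u ** (j - 1) / math.factorial(j - 1)
    nf, ng = len(f), len(g)
    lam = nf / (nf + ng)
    zf, zg = z_values(f), z_values(g)
    r_hat = lam * zf.mean() + (1 - lam) * zg.mean()          # pooled R-integral
    pooled_sq = (np.sum(zf ** 2) + np.sum(zg ** 2)) / (nf + ng)
    v_r = (1.0 / nf + 1.0 / ng) * (pooled_sq - r_hat ** 2)   # V_{R_j}(x)
    num = z_values(f_star).mean() - z_values(g_star).mean()  # F*_j - G*_j
    return num / math.sqrt(v_r)
```

Combined with the resampling of Step 2 and a maximum over a grid of x, this yields the bootstrap draws whose percentile gives the critical value.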
5. ILLUSTRATION

To illustrate the applicability of our proposed SD statistics, we first apply our proposed test statistics to study the preferences of investors with the corresponding S-shaped and RS-shaped utility functions vis-à-vis returns on iShares. Thereafter, we examine their preferences vis-à-vis returns of traditional stocks and Internet stocks before and after the Internet bubble. Our first
Table 3. Ticker symbols and the date of inception for the 18 iShares.

Country fund     Symbol   Inception date   Country fund     Symbol   Inception date
U.S. SPY         SPY      Jan-93           Japan            EWJ      Mar-96
Australia        EWA      Mar-96           Malaysia         EWM      Mar-96
Austria          EWO      Mar-96           Mexico           EWW      Mar-96
Belgium          EWK      Mar-96           Netherlands      EWN      Mar-96
Canada           EWC      Mar-96           Singapore        EWS      Mar-96
France           EWQ      Mar-96           Spain            EWP      Mar-96
Germany          EWG      Mar-96           Sweden           EWD      Mar-96
Hong Kong        EWH      Mar-96           Switzerland      EWL      Mar-96
Italy            EWI      Mar-96           United Kingdom   EWU      Mar-96

Note: This information was obtained from Gasbarro et al. (2007).
Table 4. MSD and PSD among iShares.

MSD                   PSD
EWY \succ_1^M EWZ     EFA \succ_1^P EWT
EWY \succ_2^M EWZ     EFA \succ_2^P EWT
EWY \succ_3^M EWZ     EFA \succ_3^P EWT
EFA \succ_3^M EWU     EWU \succ_3^P EWZ
                      EWY \succ_3^P EWU

Note: \succ_j^M and \succ_j^P are defined in Definitions 2.1 and 2.2, respectively, for j = 1, 2 and 3.
illustration is to complement the work of Gasbarro et al. (2007) by checking whether there is any MSD or PSD relationship among iShares. Our second illustration is to complement the work of Fong et al. (2008) by using our proposed MSD and PSD test statistics to study the preferences of Markowitz investors and prospect investors regarding the investment in traditional stocks and Internet stocks during the Internet bubble. We first discuss our illustration on iShares. Standard and Poor's depository receipts (SPDRs, or 'spiders', ticker symbol SPY) began trading in January 1993. They track the S&P 500 Index and are created and redeemed via 'creation units' of 50,000 shares. The acceptance and wide use of SPDRs led to the introduction, in March 1996, of seventeen exchange-traded funds (ETFs), now known as iShares, which began to trade on the American Stock Exchange and were designed to track the Morgan Stanley Capital International (MSCI) foreign stock market indices. A listing of the country market indices, their ticker symbols and inception dates is presented in Table 3. We use the daily returns of the 17 iShares and the daily returns of the US SPY (treated as the 18th iShare for comparison) from 19 March 1996 to 31 December 2003, the same data set used in Gasbarro et al. (2007), for comparison. To complement their work, we aim to examine whether there is any MSD or PSD among iShares. To achieve this, we apply our proposed MSD and PSD test statistics to the iShares data and report the results in Table 4. From Table 4, we find that EWY dominates EWZ in the sense of the first, second and third
orders of MSD, whereas EFA dominates EWU in the sense of the third order of MSD. This shows that first-, second- and third-order Markowitz investors will prefer to invest in EWY rather than EWZ, whereas third-order Markowitz investors will prefer to invest in EFA rather than EWU. On the other hand, we find that EFA dominates EWT in the sense of the first, second and third orders of PSD, whereas EWU dominates EWZ and EWY dominates EWU in the sense of the third order of PSD. This shows that first-, second- and third-order prospect investors will prefer to invest in EFA rather than EWT, whereas third-order prospect investors will prefer to invest in EWU rather than EWZ and in EWY rather than EWU. This information could help investors with the corresponding S-shaped and RS-shaped utility functions make wiser decisions when investing in iShares.

We now turn to illustrating the applicability of our proposed MSD and PSD statistics in studying the preferences of Markowitz and prospect investors between Internet stocks and traditional S&P 500 stocks. The spectacular rise and fall of Internet stocks in the late 1990s have stimulated recent research into the causes of the Internet stock bubble. Several papers have focused on the role of market sentiment and investor overconfidence in explaining the Internet bubble. For example, Baker and Stein (2004) develop a model of market sentiment with irrationally overconfident investors and short-sale constraints. The ability of asset-pricing models to explain market anomalies is constrained by the standard assumptions that investors are risk averse and that asset returns are normally distributed. In reality, actual risk preferences may not be risk averse, and the distributions of assets are not normally distributed. To circumvent these limitations, Fong et al.
(2008) apply SD tests to conduct the analysis and find that risk averters and risk seekers show a distinct difference in preference for Internet versus 'old economy' stocks. However, Fong et al. (2008) do not find any clear preferences of Markowitz investors and prospect investors between Internet stocks and old economy stocks. To complement their work, this paper adopts the SD test statistics developed here to identify the preferences of Markowitz investors and prospect investors over Internet stocks and old economy stocks before and after the Internet bubble. The data for this illustration consist of daily returns on two stock indices: the S&P 500 and the NASDAQ 100 Index. We use the S&P 500 Index to represent non-technology or 'old economy' firms. Our proxy for the Internet and technology sectors is the NASDAQ 100 Index, which comprises 100 of the largest domestic and international technology firms on the NASDAQ stock market, including those in the computer hardware and software, telecommunications and biotechnology sectors. Our sample period is from 1 January 1996 through 31 December 2005. Our interest centres on three distinct sub-periods: the bull market period from 1996 through 9 March 2000 (the peak of the Internet bubble), the bear market period from 10 March 2000 to 31 October 2002, and the subsequent recovery period from 1 November 2002 to 31 December 2005. All data for this study are obtained from Datastream. We now report the results of our proposed test statistics in Table 5 for the entire period as well as for the three sub-periods by displaying the percentages of significant values of the corresponding statistics T_j^M and T_j^P in the negative and positive domains, in order to study the preferences of investors with S-shaped and RS-shaped utility functions.
We note that T_j^M = T_j^A in the negative domain and T_j^M = T_j^D in the positive domain, whereas T_j^P = T_j^d in the negative domain and T_j^P = T_j^a in the positive domain, in which T_j^A, T_j^D, T_j^a, T_j^d, T_j^M and T_j^P are defined as in (3.3) and (3.4), respectively, with F = S&P 500 and G = NASDAQ 100.
Table 5. Percentages of significant modified Davidson-Duclos tests.

                 FSD (j = 1)     SSD (j = 2)     TSD (j = 3)
                 −       +       −       +       −       +

Panel A: SP500 with NASDAQ (January 1996–December 2005)
%T_j^A > 0       0       46      0       99      0       99
%T_j^A < 0       24      0       22      0       17      0
%T_j^D > 0       59      0       98      0       97      0
%T_j^D < 0       0       27      1       22      2       14

Panel B: SP500 with NASDAQ (January 1996–March 2000)
%T_j^A > 0       0       56      0       99      0       99
%T_j^A < 0       22      0       23      0       16      0
%T_j^D > 0       42      0       95      0       92      0
%T_j^D < 0       1       23      4       20      5       16

Panel C: SP500 with NASDAQ (April 2000–October 2002)
%T_j^A > 0       0       46      0       99      0       99
%T_j^A < 0       24      0       21      0       16      0
%T_j^D > 0       59      0       99      0       98      0
%T_j^D < 0       0       26      0       22      0       13

Panel D: SP500 with NASDAQ (November 2002–December 2005)
%T_j^A > 0       0       53      0       97      0       96
%T_j^A < 0       21      0       25      0       20      0
%T_j^D > 0       61      0       99      0       99      0
%T_j^D < 0       0       28      0       30      0       23

Note: For simplicity, we use T_j^A to represent T_j^a in the positive domain and T_j^D to represent T_j^d in the negative domain. Refer to (3.3) for the formulae of T_j^A and T_j^D, and to (3.4) for the formulae of T_j^a and T_j^d, for j = 1, 2, 3 with F = S&P 500 and G = NASDAQ 100.
We first study the preferences of Markowitz investors with RS-shaped utility functions. Table 5 reveals MSD dominance of the S&P 500 over the NASDAQ 100 for all orders in the negative domain and the reverse MSD dominance in the positive domain. For simplicity, we examine only the entire period for illustration; the conclusions for the sub-periods can be obtained similarly. From Table 5, we find that 24% (22%, 17%) of T_1^A (T_2^A, T_3^A) in the negative domain are significantly negative, whereas no portion of T_j^A in the negative domain is significantly positive for j = 1, 2 and 3. On the other hand, we find that 27% (22%, 14%) of T_1^D (T_2^D, T_3^D) in the positive domain are significantly negative, whereas no portion of T_j^D in the positive domain is significantly positive for j = 1, 2 and 3. In short, our findings imply that there is no overall MSD preference between the S&P 500 and the NASDAQ 100, but that Markowitz investors prefer the S&P 500 to the NASDAQ 100 in downside risk and the NASDAQ 100 to the S&P 500 in upside profit. We next examine the preferences of prospect investors with S-shaped utility functions in these markets. As in the case of Markowitz investors, for simplicity we examine only the entire period for illustration. From Table 5, we find that in the positive domain, 46% (99%, 99%)
292
Z. Bai et al.
of T1a (T2a, T3a) are significantly positive, whereas no portion of Tja is significantly negative for j = 1, 2 and 3. This indicates that prospect investors prefer the NASDAQ 100 to the S&P 500 in upside profit. On the other hand, we find that in the negative domain 59% (98%, 97%) of T1d (T2d, T3d) are significantly positive, whereas only 1% (2%) of T2d (T3d) are significantly negative. This indicates that prospect investors prefer the S&P 500 to the NASDAQ 100 in downside risk for the entire period. The same conclusion can be drawn for the periods before and after the Internet bubble.

Using SD theory for risk averters, risk seekers, and investors with S-shaped and RS-shaped utility functions, one can examine investors' preferences over the cycle of the Internet bubble; this, in turn, yields inferences for the utility theory of gambling and for behavioural finance (Lam et al., 2010, 2011). Our findings can also be used to test two competing theories of choice under risk. The first is the prospect theory of Kahneman and Tversky (1979), which has recently been applied to behavioural finance; see, for example, Barberis et al. (2001). The second theory, which stems from the experimental work of Thaler and Johnson (1990), indicates that, contrary to prospect theory, investors are risk seeking when it comes to gains and risk averse when it comes to losses.
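The percentages discussed above are obtained by evaluating the test statistics on a grid of points and counting how often they exceed a critical value in each domain. A minimal first-order sketch of this bookkeeping (hypothetical data and critical value, not the paper's implementation; the higher-order TjA/TjD statistics of (3.3)–(3.4) are more involved):

```python
import numpy as np

def dd_first_order_stats(f, g, grid):
    """First-order Davidson-Duclos-type statistics comparing the empirical
    CDFs of two return samples f and g on a grid: T(x) = (F_N - G_N)/se."""
    nf, ng = len(f), len(g)
    F = np.array([(f <= x).mean() for x in grid])
    G = np.array([(g <= x).mean() for x in grid])
    var = F * (1 - F) / nf + G * (1 - G) / ng
    se = np.sqrt(np.maximum(var, 1e-12))
    return (F - G) / se

rng = np.random.default_rng(0)
f = rng.normal(0.0, 1.0, 1000)   # stand-in for S&P 500 returns
g = rng.normal(0.0, 1.5, 1000)   # stand-in for NASDAQ 100 returns (more dispersed)
grid = np.linspace(-3, 3, 100)
T = dd_first_order_stats(f, g, grid)
crit = 3.0                       # made-up critical value; the paper bootstraps it
neg_dom = grid <= 0
print("% significantly positive in negative domain:",
      100 * (T[neg_dom] > crit).mean())
print("% significantly negative in negative domain:",
      100 * (T[neg_dom] < -crit).mean())
```

The tabulated entries are exactly such percentages, computed domain by domain and order by order.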
6. CONCLUDING REMARKS

In this paper, we first extend the work of Davidson and Duclos (2000) and others by developing statistics for both MSD and PSD of the first three orders. These statistics enable academics and practitioners to examine the preferences of investors with RS-shaped utility functions proposed by Markowitz (1952a) and investors with S-shaped utility functions proposed by Kahneman and Tversky (1979) in their prospect theory. We also show that the limiting distributions of the test statistics are stochastic processes. In addition, we propose a bootstrap method to decide the critical points of the tests, prove the consistency of the bootstrap tests, and provide illustrations.

These days, it is popular to apply SD to explain financial theories and anomalies; see, for example, McNamara (1998), Post and Levy (2005), Fong et al. (2005), Broll et al. (2006), Wong et al. (2008) and Lean et al. (2010). So far, most of the literature examines only the preferences of risk averters. We recommend that financial analysts apply the tests developed in our paper to examine the MSD and PSD relationships of different orders, so that they can detect opportunities for Markowitz and prospect investors of different orders.

Finally, we note that Markowitz (1952b, 1959), Levy and Markowitz (1979), Bai et al. (2009a,b, 2011) and others develop estimators for the optimal return and its asset allocation. Markowitz (1952b) first proposes the MV criteria, while Markowitz (1959) introduces an algorithm for computing efficient frontiers for large numbers of securities.11 Post and Versijp (2007) have developed a new SD test for multiple comparisons, whereas Post (2008) constructs an SSD dominating benchmark portfolio from the optimal solution. Recently, Egozcue and Wong (2010) have presented a general theory and a unifying framework for determining the second-order SD efficient set. Further research could include extending the DD tests to determine the SD efficient set statistically.
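The bootstrap idea mentioned above, resampling from the pooled sample so that the null of equal distributions holds in the bootstrap world, can be sketched for a supremum-type first-order statistic as follows (the simplified statistic and all names are illustrative assumptions, not the paper's code):

```python
import numpy as np

def sup_stat(f, g, grid):
    # supremum over the grid of the studentized first-order statistic
    nf, ng = len(f), len(g)
    F = (f[:, None] <= grid).mean(axis=0)
    G = (g[:, None] <= grid).mean(axis=0)
    se = np.sqrt(np.maximum(F * (1 - F) / nf + G * (1 - G) / ng, 1e-12))
    return np.max(np.abs(F - G) / se)

def bootstrap_critical_value(f, g, grid, B=499, alpha=0.05, seed=0):
    """Resample both samples from the pooled data, which imposes the null
    F = G, and take the (1 - alpha) quantile of the resampled statistics."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([f, g])
    stats = [sup_stat(rng.choice(pooled, size=len(f)),
                      rng.choice(pooled, size=len(g)), grid)
             for _ in range(B)]
    return np.quantile(stats, 1 - alpha)

rng = np.random.default_rng(1)
f, g = rng.normal(size=500), rng.normal(size=500)  # the null is true here
grid = np.linspace(-2, 2, 50)
cv = bootstrap_critical_value(f, g, grid)
print("bootstrap 5% critical value:", round(cv, 2))
print("observed statistic:", round(sup_stat(f, g, grid), 2))
```

The consistency result proved in the paper is what justifies using such resampled quantiles as critical points.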
We also note that the SD test introduced in this paper provides useful information to investors for decision making. Investors could combine our statistics with other advanced econometric techniques (see, for example, Bai et al., 2010) to make better investment decisions.

11 We would like to express our great appreciation to Professor Harry Markowitz, who provided us with this piece of information.
Prospect and Markowitz stochastic dominance tests
293
ACKNOWLEDGMENTS

The authors are grateful to Professor Oliver B. Linton and anonymous referees for substantive comments that have significantly improved this paper. The authors thank Professors Jean-Yves Duclos, Harry M. Markowitz and Ričardas Zitikis for their valuable comments and suggestions. In addition, the authors are grateful to participants at the 56th meeting of the International Statistical Institute for their valuable comments. This research is partially supported by Northeast Normal University, the National University of Singapore and Hong Kong Baptist University. The research is also supported by CNSF grant 1087-1036, grant 10ssxt149 from the Fundamental Research Funds for the Central Universities, and grant 202809 from the Research Grants Council of Hong Kong.
REFERENCES

Anderson, G. J. (2004). Toward an empirical analysis of polarization. Journal of Econometrics 6, 1–26.
Bai, Z. D., H. X. Liu and W. K. Wong (2009a). Enhancement of the applicability of Markowitz's portfolio optimization by utilizing random matrix theory. Mathematical Finance 19, 639–7.
Bai, Z. D., H. X. Liu and W. K. Wong (2009b). On the Markowitz mean-variance analysis of self-financing portfolios. Risk and Decision Analysis 1, 35–42.
Bai, Z. D., H. X. Liu and W. K. Wong (2011). Asymptotic properties of eigenmatrices of a large sample covariance matrix. Forthcoming in Annals of Applied Probability.
Bai, Z. D., W. K. Wong and B. Z. Zhang (2010). Multivariate linear and non-linear causality tests. Mathematics and Computers in Simulation 81, 5–17.
Baker, M. and J. C. Stein (2004). Market liquidity as a sentiment indicator. Journal of Financial Markets 7, 271–300.
Barberis, N., M. Huang and T. Santos (2001). Prospect theory and asset prices. Quarterly Journal of Economics 116, 1–53.
Barrett, G. and S. Donald (2003). Consistent tests for stochastic dominance. Econometrica 71, 71–104.
Billingsley, P. (1968). Convergence of Probability Measures. New York, Toronto: John Wiley.
Broll, U., M. Egozcue, W. K. Wong and R. Zitikis (2010). Prospect theory, indifference curves, and hedging risks. Applied Mathematics Research Express 2, 142–53.
Broll, U., J. E. Wahl and W. K. Wong (2006). Elasticity of risk aversion and international trade. Economics Letters 92, 126–30.
Chen, S. X. (2008). Nonparametric estimation of expected shortfall. Journal of Financial Econometrics 6, 87–107.
Dardanoni, V. and A. Forcina (1999). Inference for Lorenz curve orderings. Econometrics Journal 2, 49–75.
Davidson, R. and J. Y. Duclos (2000). Statistical inference for stochastic dominance and for the measurement of poverty and inequality. Econometrica 68, 1435–64.
Donsker, M. D. (1952). Justification and extension of Doob's heuristic approach to the Kolmogorov–Smirnov theorems. Annals of Mathematical Statistics 23, 277–81.
Egozcue, M. and W. K. Wong (2010). Gains from diversification: a majorization and stochastic dominance approach. European Journal of Operational Research 200, 893–900.
Feldstein, M. S. (1969). Mean variance analysis in the theory of liquidity preference and portfolio selection. Review of Economic Studies 36, 5–12.
Fong, W. M., H. H. Lean and W. K. Wong (2008). Stochastic dominance and behavior towards risk: the market for internet stocks. Journal of Economic Behavior and Organization 68, 194–20.
Fong, W. M., W. K. Wong and H. H. Lean (2005). International momentum strategies: a stochastic dominance approach. Journal of Financial Markets 8, 89–109.
Gasbarro, D., W. K. Wong and J. K. Zumwalt (2007). Stochastic dominance analysis of iShares. European Journal of Finance 13, 89–101.
Hanoch, G. and H. Levy (1969). The efficiency analysis of choices involving risk. Review of Economic Studies 36, 335–46.
Horváth, L., P. Kokoszka and R. Zitikis (2006). Testing for stochastic dominance using the weighted McFadden type statistic. Journal of Econometrics 133, 191–205.
Jorion, P. (2000). Value-at-Risk: The New Benchmark for Managing Financial Risk. New York: McGraw-Hill.
Kahneman, D. and A. Tversky (1979). Prospect theory: an analysis of decision under risk. Econometrica 47, 263–91.
Koning, R. H. and G. Ridder (2003). Discrete choice and stochastic utility maximization. Econometrics Journal 6, 1–27.
Lam, K., T. Liu and W. K. Wong (2010). A pseudo Bayesian model in financial decision making with implications to market volatility, under and overreaction. European Journal of Operational Research 203, 166–75.
Lam, K., T. Liu and W. K. Wong (2011). A new pseudo Bayesian model with implications to financial anomalies and investors' behaviors. Forthcoming in Journal of Behavioral Finance.
Lean, H. H., M. McAleer and W. K. Wong (2010). Market efficiency of oil spot and futures: a mean-variance and stochastic dominance approach. Energy Economics 32, 979–86.
Levy, H. and M. Levy (2004). Prospect theory and mean-variance analysis. Review of Financial Studies 17, 1015–41.
Levy, H. and H. M. Markowitz (1979). Approximating expected utility by a function of mean and variance. American Economic Review 69, 308–17.
Levy, M. and H. Levy (2002). Prospect theory: much ado about nothing? Management Science 48, 1334–49.
Leung, P. L. and W. K. Wong (2008). On testing the equality of the multiple Sharpe ratios, with application on the evaluation of iShares. Journal of Risk 10, 1–16.
Li, C. K. and W. K. Wong (1999). Extension of stochastic dominance theory to random variables. RAIRO Recherche Opérationnelle 33, 509–24.
Linton, O., E. Maasoumi and Y. J. Whang (2005). Consistent testing for stochastic dominance under general sampling schemes. Review of Economic Studies 72, 735–65.
Ma, C. and W. K. Wong (2010). Stochastic dominance and risk measure: a decision-theoretic foundation for VaR and C-VaR. European Journal of Operational Research 207, 927–35.
Markowitz, H. M. (1952a). The utility of wealth. Journal of Political Economy 60, 151–6.
Markowitz, H. M. (1952b). Portfolio selection. Journal of Finance 7, 77–91.
Markowitz, H. M. (1959). Portfolio Selection: Efficient Diversification of Investments. New York: John Wiley.
McNamara, J. R. (1998). Portfolio selection using stochastic dominance criteria. Decision Sciences 29, 785–801.
Meyer, J. (1987). Two moment decision models and expected utility maximization. American Economic Review 77, 421–30.
Ogryczak, W. and A. Ruszczyński (2002). Dual stochastic dominance and related mean-risk models. SIAM Journal of Optimization 13, 60–78.
Post, T. (2008). On the dual test for SSD efficiency: with an application to momentum investment strategies. European Journal of Operational Research 185, 1564–73.
Post, T. and H. Levy (2005). Does risk seeking drive asset prices? A stochastic dominance analysis of aggregate investor preferences and beliefs. Review of Financial Studies 18, 925–53.
Post, T. and P. Versijp (2007). Multivariate tests for stochastic dominance efficiency of a given portfolio. Journal of Financial and Quantitative Analysis 42, 489–516.
Rockafellar, R. T. and S. P. Uryasev (2000). Optimization of conditional value-at-risk. Journal of Risk 2, 21–42.
Schechtman, E., A. Shelef, S. Yitzhaki and R. Zitikis (2008). Testing hypotheses about absolute concentration curves and marginal conditional stochastic dominance. Econometric Theory 24, 1044–62.
Sharpe, W. F. (1966). Mutual fund performance. Journal of Business 39, 119–38.
Sriboonchita, S., W. K. Wong, S. Dhompongsa and H. T. Nguyen (2009). Stochastic Dominance and Applications to Finance, Risk and Economics. Boca Raton, FL: Chapman and Hall.
Thaler, R. H. and E. J. Johnson (1990). Gambling with the house money and trying to break even: the effects of prior outcomes on risky choice. Management Science 36, 643–60.
Tversky, A. and D. Kahneman (1992). Advances in prospect theory: cumulative representation of uncertainty. Journal of Risk and Uncertainty 5, 297–323.
von Neumann, J. and O. Morgenstern (1944). Theory of Games and Economic Behavior. Princeton, NJ: Princeton University Press.
Wolfowitz, J. (1954). Generalization of the theorem of Glivenko–Cantelli. Annals of Mathematical Statistics 25, 131–8.
Wong, W. K. (2006). Stochastic dominance theory for location-scale family. Journal of Applied Mathematics and Decision Sciences 2006, 1–10.
Wong, W. K. (2007). Stochastic dominance and mean-variance measures of profit and loss for business planning and investment. European Journal of Operational Research 182, 829–43.
Wong, W. K. and R. Chan (2008). Markowitz and prospect stochastic dominances. Annals of Finance 4, 105–29.
Wong, W. K. and C. K. Li (1999). A note on convex stochastic dominance theory. Economics Letters 62, 293–300.
Wong, W. K. and C. Ma (2008). Preferences over location-scale family. Economic Theory 37, 119–46.
Wong, W. K., K. F. Phoon and H. H. Lean (2008). Performance of Asian hedge funds: stochastic dominance and mean-variance approaches. Pacific-Basin Finance Journal 16, 204–23.
APPENDIX A: PROOF OF THEOREM 3.1

Before we prove the theorem, we define the functions $r_j^M(x,y)$ and $r_j^P(x,y)$ stated in Theorem 3.1 as follows. For $j>1$,
$$
r_j^M(x,y)=
\begin{cases}
\dfrac{\int_a^x\int_a^y (x-t)^{j-2}(y-s)^{j-2}\,(F(t\wedge s)-F(t)F(s))\,dt\,ds}{\sqrt{V_j^A(x)\,V_j^A(y)}} & \text{if } x\le 0,\ y\le 0,\\[2ex]
\dfrac{\int_x^b\int_y^b (t-x)^{j-2}(s-y)^{j-2}\,(F(t\wedge s)-F(t)F(s))\,dt\,ds}{\sqrt{V_j^D(x)\,V_j^D(y)}} & \text{if } x>0,\ y>0,\\[2ex]
-\dfrac{\int_x^b\int_a^y (t-x)^{j-2}(s-y)^{j-2}\,F(s)(1-F(t))\,dt\,ds}{\sqrt{V_j^D(x)\,V_j^A(y)}} & \text{if } x>0\ge y,
\end{cases}
$$
where
$$
V_j^A(x)=\int_a^x\int_a^x (x-t)^{j-2}(x-s)^{j-2}\,(F(t\wedge s)-F(t)F(s))\,dt\,ds,
$$
$$
V_j^D(x)=\int_x^b\int_x^b (t-x)^{j-2}(s-x)^{j-2}\,(F(t\wedge s)-F(t)F(s))\,dt\,ds.
$$
For the case $j=1$,
$$
r_1^M(x,y)=
\begin{cases}
\dfrac{F(x\wedge y)-F(x)F(y)}{\sqrt{F(x)F(y)(1-F(x))(1-F(y))}} & \text{if } xy>0,\\[2ex]
-\dfrac{F(y)(1-F(x))}{\sqrt{F(x)F(y)(1-F(x))(1-F(y))}} & \text{if } x>0\ge y,
\end{cases}
$$
with $V_1^A(x)=F(x)(1-F(x))$ if $x\le 0$ and $V_1^D(x)=F(x)(1-F(x))$ if $x>0$.

For $j>1$,
$$
r_j^P(x,y)=
\begin{cases}
\dfrac{\int_x^0\int_y^0 (t-x)^{j-2}(s-y)^{j-2}\,(F(t\wedge s)-F(t)F(s))\,dt\,ds}{\sqrt{V_j^d(x)\,V_j^d(y)}} & \text{if } x,y\le 0,\\[2ex]
\dfrac{\int_0^x\int_0^y (x-t)^{j-2}(y-s)^{j-2}\,(F(t\wedge s)-F(t)F(s))\,dt\,ds}{\sqrt{V_j^a(x)\,V_j^a(y)}} & \text{if } x,y>0,\\[2ex]
-\dfrac{\int_0^x\int_y^0 (x-t)^{j-2}(s-y)^{j-2}\,F(s)(1-F(t))\,dt\,ds}{\sqrt{V_j^a(x)\,V_j^d(y)}} & \text{if } x>0\ge y,
\end{cases}
$$
where
$$
V_j^d(x)=\int_x^0\int_x^0 (t-x)^{j-2}(s-x)^{j-2}\,(F(t\wedge s)-F(t)F(s))\,dt\,ds,
$$
$$
V_j^a(x)=\int_0^x\int_0^x (x-t)^{j-2}(x-s)^{j-2}\,(F(t\wedge s)-F(t)F(s))\,dt\,ds,
$$
and for $j=1$,
$$
r_1^P(x,y)=
\begin{cases}
\dfrac{(F(0)-F(x))(1-F(0)+F(y))}{\sqrt{V_1^P(x)\,V_1^P(y)}} & \text{if } 0>x>y,\\[2ex]
\dfrac{(F(y)-F(0))(1-F(x)+F(0))}{\sqrt{V_1^P(x)\,V_1^P(y)}} & \text{if } x>y>0,\\[2ex]
-\dfrac{(F(x)-F(0))(F(0)-F(y))}{\sqrt{V_1^P(x)\,V_1^P(y)}} & \text{if } x>0>y,
\end{cases}
$$
where $V_1^P(x)$ stands for
$$
V_1^d(x)=F(0)-F(x)-(F(0)-F(x))^2 \ \text{ if } x\le 0, \qquad
V_1^a(x)=F(x)-F(0)-(F(x)-F(0))^2 \ \text{ if } x>0.
$$

Now we come back to prove Theorem 3.1. Without loss of generality, we assume $N_f=N_g=N$ in the proof. According to the empirical process theorem,12 for any continuous distribution function $F(x)$ we have
$$
\sqrt N\,(F_N(x)-F(x)) \Rightarrow B(F(x)),
$$
where $F_N$ is the empirical distribution of $N$ i.i.d. observations drawn from $F$ and $B(\cdot)$ is the standardized Brownian bridge on the interval $[0,1]$.

12 Readers may refer to Donsker (1952) and Wolfowitz (1954) for more information about the empirical process theorem.

Recall that
$$
\hat F_j^D(x)=\frac{1}{N}\sum_{i=1}^{N}\frac{(f_i-x)_+^{j-1}}{(j-1)!}=\frac{1}{(j-1)!}\int_x^b (t-x)^{j-1}\,dF_N(t) \quad \text{if } x>0,
$$
$$
\hat F_j^A(x)=\frac{1}{N}\sum_{i=1}^{N}\frac{(x-f_i)_+^{j-1}}{(j-1)!}=\frac{1}{(j-1)!}\int_a^x (x-t)^{j-1}\,dF_N(t) \quad \text{if } x\le 0.
$$
Applying the empirical process theorem, for $j\ge 1$ we have
$$
\sqrt N\,(\hat F_j^D(x)-F_j^D(x)) \Rightarrow \frac{1}{(j-1)!}\int_x^b (t-x)^{j-1}\,dB(F(t)) \quad \text{if } x>0,
$$
$$
\sqrt N\,(\hat F_j^A(x)-F_j^A(x)) \Rightarrow \frac{1}{(j-1)!}\int_a^x (x-t)^{j-1}\,dB(F(t)) \quad \text{if } x\le 0,
$$
and the same statements hold for $\hat G_j^D$ and $\hat G_j^A$ with $B(G(t))$ in place of $B(F(t))$.

Because $F_j^A(x)=G_j^A(x)$ under the null hypothesis, and $(\hat F_j^A(x),\hat F_j^D(y))$ and $(\hat G_j^A(x),\hat G_j^D(y))$ are independent, we get
$$
\sqrt N\,(\hat F_j^D(x)-\hat G_j^D(x)) \Rightarrow \frac{\sqrt 2}{(j-1)!}\int_x^b (t-x)^{j-1}\,dB(F(t)) \quad \text{if } x>0,
$$
$$
\sqrt N\,(\hat F_j^A(x)-\hat G_j^A(x)) \Rightarrow \frac{\sqrt 2}{(j-1)!}\int_a^x (x-t)^{j-1}\,dB(F(t)) \quad \text{if } x\le 0.
$$
By the law of large numbers, with probability 1 one has
$$
N\hat V_{F_j}^D \to \frac{1}{((j-1)!)^2}\int_x^b (t-x)^{2j-2}\,dF(t)-F_j^D(x)^2, \qquad
N\hat V_{G_j}^D \to \frac{1}{((j-1)!)^2}\int_x^b (t-x)^{2j-2}\,dG(t)-G_j^D(x)^2,
$$
$$
N\hat V_{FG_j}^A \to 0,
$$
$$
N\hat V_{F_j}^A \to \frac{1}{((j-1)!)^2}\int_a^x (x-t)^{2j-2}\,dF(t)-F_j^A(x)^2, \qquad
N\hat V_{G_j}^A \to \frac{1}{((j-1)!)^2}\int_a^x (x-t)^{2j-2}\,dG(t)-G_j^A(x)^2.
$$
Therefore, under the null hypothesis and for $j>1$,
$$
N(\hat V_{F_j}^A+\hat V_{G_j}^A) \to \frac{2}{((j-1)!)^2}\int_a^x (x-t)^{2j-2}\,dF(t)-2F_j^A(x)^2
=\frac{2}{((j-2)!)^2}\int_a^x\int_a^x (x-t)^{j-2}(x-s)^{j-2}\,(F(t\wedge s)-F(t)F(s))\,dt\,ds,
$$
$$
N(\hat V_{F_j}^D+\hat V_{G_j}^D) \to \frac{2}{((j-1)!)^2}\int_x^b (t-x)^{2j-2}\,dF(t)-2F_j^D(x)^2
=\frac{2}{((j-2)!)^2}\int_x^b\int_x^b (t-x)^{j-2}(s-x)^{j-2}\,(F(t\wedge s)-F(t)F(s))\,dt\,ds.
$$
By integration by parts, when $j>1$ we have
$$
\frac{\sqrt 2}{(j-1)!}\int_x^b (t-x)^{j-1}\,dB(F(t))=\frac{\sqrt 2}{(j-2)!}\int_x^b (t-x)^{j-2}B(F(t))\,dt,
$$
$$
\frac{\sqrt 2}{(j-1)!}\int_a^x (x-t)^{j-1}\,dB(F(t))=-\frac{\sqrt 2}{(j-2)!}\int_a^x (x-t)^{j-2}B(F(t))\,dt.
$$
Therefore, when $j>1$ we have
$$
T_j^M(x) \Rightarrow
\begin{cases}
\dfrac{\int_x^b (t-x)^{j-2}B(F(t))\,dt}{\sqrt{V_j^D(x)}} & \text{if } x>0,\\[2ex]
-\dfrac{\int_a^x (x-t)^{j-2}B(F(t))\,dt}{\sqrt{V_j^A(x)}} & \text{if } x\le 0.
\end{cases}
$$
From this expression we see that the limiting process of $T_j^M$ is Gaussian with mean 0, variance 1 and covariance function $r_j^M(x,y)$ as defined above. For the case $j=1$ we have
$$
T_1^M(x) \Rightarrow
\begin{cases}
\dfrac{B(F(x))}{\sqrt{F(x)-F^2(x)}} & \text{if } x>0,\\[2ex]
-\dfrac{B(F(x))}{\sqrt{F(x)-F^2(x)}} & \text{if } x\le 0,
\end{cases}
$$
and the limiting process is again Gaussian with mean 0, variance 1 and covariance function $r_1^M(x,y)$.

To find the limiting distribution of $T_j^P$, we note that
$$
\hat F_j^d(x)=\frac{1}{(j-1)!}\int_x^0 (t-x)^{j-1}\,d\hat F_N(t) \quad \text{for } x\le 0,
$$
$$
\hat F_j^a(x)=\frac{1}{(j-1)!}\int_0^x (x-t)^{j-1}\,d\hat F_N(t) \quad \text{for } x>0.
$$
By an argument similar to that for $T_j^M(x)$, we can prove that, for $j>1$,
$$
T_j^P(x) \Rightarrow
\begin{cases}
\dfrac{\int_x^0 (t-x)^{j-2}B(F(t))\,dt}{\sqrt{V_j^d(x)}} & \text{if } x\le 0,\\[2ex]
-\dfrac{\int_0^x (x-t)^{j-2}B(F(t))\,dt}{\sqrt{V_j^a(x)}} & \text{if } x>0,
\end{cases}
$$
with $V_j^d$ and $V_j^a$ as defined above, while for $j=1$ the normalizations are $V_1^d(x)=F(0)-F(x)-(F(0)-F(x))^2$ for $x\le 0$ and $V_1^a(x)=F(x)-F(0)-(F(x)-F(0))^2$ for $x>0$. From these results we conclude that the limiting process of $T_j^P(x)$ is Gaussian with mean 0, variance 1 and covariance function $r_j^P(x,y)$ ($r_1^P(x,y)$ in the case $j=1$) as defined above.
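The proof above rests on the empirical process theorem, under which √N(F_N(x) − F(x)) ⇒ B(F(x)), so the limit has variance F(x)(1 − F(x)) and covariance F(x ∧ y) − F(x)F(y). A quick Monte Carlo check of these two moments (a sketch assuming a standard uniform F, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(42)
N, reps = 1000, 3000
x, y = 0.3, 0.7
samples = rng.uniform(size=(reps, N))                 # reps samples of size N from F = U(0, 1)
Zx = np.sqrt(N) * ((samples <= x).mean(axis=1) - x)   # sqrt(N)(F_N(x) - F(x))
Zy = np.sqrt(N) * ((samples <= y).mean(axis=1) - y)
print("var at x :", Zx.var(), " theory:", x * (1 - x))
print("cov(x, y):", np.cov(Zx, Zy)[0, 1], " theory:", min(x, y) - x * y)
```

Both simulated moments match the Brownian-bridge values up to Monte Carlo noise.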
APPENDIX B: PROOF OF THEOREM 4.1

We only prove Part (a) of Theorem 4.1; Part (b) can be obtained similarly. Proving Part (a) of the theorem is equivalent to proving that, under the null hypothesis stated in Theorem 4.1, $T_j^{*M}(x)$ tends to a Gaussian process with mean 0, variance 1 and correlation function $r_j^M(x,y)$ as defined in Appendix A. We first prove that
$$
\sqrt{\frac{N_fN_g}{N}}\,(\hat F_j^{*D}(x)-\hat G_j^{*D}(x)) \Rightarrow \frac{1}{(j-1)!}\int_x^b (t-x)^{j-1}\,dB(F(t)) \quad \text{if } x>0,
$$
$$
\sqrt{\frac{N_fN_g}{N}}\,(\hat F_j^{*A}(x)-\hat G_j^{*A}(x)) \Rightarrow \frac{1}{(j-1)!}\int_a^x (x-t)^{j-1}\,dB(F(t)) \quad \text{if } x\le 0.
\tag{B.1}
$$
To prove (B.1), we let
$$
X_{N_f,N_g}(s)=\sqrt{\frac{N_fN_g}{N}}\,(F_{N_f}^*(x_s)-G_{N_g}^*(x_s)),
$$
in which $F(x_s)=s$ ($s\in[0,1]$). We need to prove that
$$
X_{N_f,N_g}(s) \Rightarrow B(s) \quad \text{as } N_f,N_g\to\infty.
\tag{B.2}
$$
To do so, we first show that the finite-dimensional distributions of $X_{N_f,N_g}$ converge to those of $B$; that is, we show that
$$
\sqrt{\frac{N_fN_g}{N}}\,(F_{N_f}^*(x)-G_{N_g}^*(x)) \Rightarrow B(F(x)), \quad N_f,N_g\to\infty,
$$
for any finite collection of points $x$. Consider the conditional distribution of $\sqrt{N_f}\,(F_{N_f}^*(x)-R_N(x))$ given $f_1,\dots,f_{N_f},g_1,\dots,g_{N_g}$. By the central limit theorem, it is easy to prove that for any fixed $x\in(a,b)$,
$$
\left.\frac{F_{N_f}^*(x)-R_N(x)}{\sqrt{R_N(x)(1-R_N(x))/N_f}}\,\right|\,f_1,\dots,f_{N_f},g_1,\dots,g_{N_g} \Rightarrow N(0,1)
\tag{B.3}
$$
when $N_f,N_g\to\infty$. Similarly, we get
$$
\left.\frac{G_{N_g}^*(x)-R_N(x)}{\sqrt{R_N(x)(1-R_N(x))/N_g}}\,\right|\,f_1,\dots,f_{N_f},g_1,\dots,g_{N_g} \Rightarrow N(0,1)
\tag{B.4}
$$
when $N_f,N_g\to\infty$. From (B.3) and (B.4), we obtain
$$
\left.\sqrt{\frac{N_fN_g}{N}}\,\frac{F_{N_f}^*(x)-G_{N_g}^*(x)}{\sqrt{R_N(x)(1-R_N(x))}}\,\right|\,f_1,\dots,f_{N_f},g_1,\dots,g_{N_g} \Rightarrow N(0,1)
$$
when $N_f,N_g\to\infty$. For any $a<x_1<x_2<\dots<x_k<b$, according to the multivariate central limit theorem we have
$$
\left.(\Sigma^N_{k\times k})^{-1/2}\sqrt{\frac{N_fN_g}{N}}
\begin{pmatrix} F_{N_f}^*(x_1)-G_{N_g}^*(x_1)\\ \vdots \\ F_{N_f}^*(x_k)-G_{N_g}^*(x_k) \end{pmatrix}
\right|\,f_1,\dots,f_{N_f},g_1,\dots,g_{N_g} \Rightarrow N(0,I)
$$
when $N_f,N_g\to\infty$, where $(\Sigma^N_{k\times k})_{ij}=F_{N_f}(x_{i\wedge j})-F_{N_f}(x_i)F_{N_f}(x_j)$. Since
$$
(\Sigma^N_{k\times k})^{1/2}(\Sigma_{k\times k})^{-1/2} \to I \quad \text{as } N\to\infty,
$$
in which $(\Sigma_{k\times k})_{ij}=F(x_{i\wedge j})-F(x_i)F(x_j)$, the result can be rewritten as
$$
\sqrt{\frac{N_fN_g}{N}}
\begin{pmatrix} F_{N_f}^*(x_1)-G_{N_g}^*(x_1)\\ \vdots \\ F_{N_f}^*(x_k)-G_{N_g}^*(x_k) \end{pmatrix}
\Rightarrow N(0,\Sigma_{k\times k})
$$
when $N_f,N_g\to\infty$. Here $N(0,\Sigma_{k\times k})$ is a distribution that does not involve $\{f_i,g_j: i=1,\dots,N_f;\ j=1,\dots,N_g\}$. Thus the finite-dimensional distributions of $X_{N_f,N_g}$ converge.

If we can prove that $\{X_{N_f,N_g}\}$ is tight, (B.2) will follow. By Theorem 13.5 of Billingsley (1968), it suffices to show that, for any $0\le s_1\le s\le s_2\le 1$ (writing $X=X_{N_f,N_g}$ for brevity),
$$
E\bigl[(X(s)-X(s_1))^2 (X(s_2)-X(s))^2\bigr]
= E\,E\bigl[(X(s)-X(s_1))^2 (X(s_2)-X(s))^2 \,\big|\, f_1,\dots,f_{N_f},g_1,\dots,g_{N_g}\bigr]
\le E\bigl[C\,(F_N(x_s)-F_N(x_{s_1}))(F_N(x_{s_2})-F_N(x_s))\bigr]
$$
$$
= C\,E\Bigl[\Bigl(\frac{1}{N}\sum_{h=f,g}\sum_{i=1}^{n_h} I(x_{s_1}<h_i\le x_s)\Bigr)\Bigl(\frac{1}{N}\sum_{h=f,g}\sum_{i=1}^{n_h} I(x_s<h_i\le x_{s_2})\Bigr)\Bigr]
= \frac{C}{N^2}\sum_{h_i\ne h_j} E\bigl[I(x_{s_1}<h_i\le x_s)\,I(x_s<h_j\le x_{s_2})\bigr]
$$
$$
= \frac{CN(N-1)}{N^2}(s-s_1)(s_2-s) \le C(s_2-s_1)^2,
$$
in which $C$ is a positive constant. Thus (B.2) is correct.

Next, define
$$
Y_{N_f,N_g}(s)=Y^D_{N_f,N_g}(x_{1-s})=
\begin{cases}
\sqrt{\dfrac{N_fN_g}{N}}\,(\hat F_j^{*D}(x_{1-s})-\hat G_j^{*D}(x_{1-s})) & \text{if } x_{1-s}>0,\\[1ex]
\sqrt{\dfrac{N_fN_g}{N}}\,(\hat F_j^{*D}(0)-\hat G_j^{*D}(0)) & \text{if } x_{1-s}\le 0,
\end{cases}
$$
and
$$
Y(s)=Y^D(x_{1-s})=
\begin{cases}
\dfrac{1}{(j-1)!}\displaystyle\int_{x_{1-s}}^b (t-x_{1-s})^{j-1}\,dB(F(t)) & \text{if } x_{1-s}>0,\\[1ex]
\dfrac{1}{(j-1)!}\displaystyle\int_0^b t^{j-1}\,dB(F(t)) & \text{if } x_{1-s}\le 0,
\end{cases}
$$
in which $F(x_s)=s$. We need to prove that $Y_{N_f,N_g}(s)\Rightarrow Y(s)$ as $N_f,N_g\to\infty$. For the finite-dimensional distributions this can be done as follows: for any $a\le x_1<x_2<\dots<x_k\le b$,
$$
Y^D_{N_f,N_g}(x_i)=\frac{1}{(j-1)!}\int_{x_i}^b (l-x_i)^{j-1}\,d\Bigl(\sqrt{\tfrac{N_fN_g}{N}}\,(F_{N_f}^*(l)-G_{N_g}^*(l))\Bigr)
\Rightarrow \frac{1}{(j-1)!}\int_{x_i}^b (l-x_i)^{j-1}\,dB(F(l)), \quad i=1,\dots,k,
$$
jointly, as $N_f,N_g\to\infty$. That is, the finite-dimensional distributions of $Y_{N_f,N_g}$ converge properly.

In addition, we need the tightness of $\{Y_{N_f,N_g}\}$. To establish it, we first obtain the following: (a) $Y_{N_f,N_g}(0)=0$; (b) for $a\le x_1<x_2\le b$ with $|x_2-x_1|\le\delta<1$, we have
$$
\bigl|Y^D_{N_f,N_g}(x_1)-Y^D_{N_f,N_g}(x_2)\bigr|
\le \frac{|x_2-x_1|^{j-1}}{(j-1)!}\,\bigl|X_{N_f,N_g}(F(x_2))-X_{N_f,N_g}(F(x_1))\bigr|
\le \frac{\delta^{j-1}}{(j-1)!}\,\bigl|X_{N_f,N_g}(F(x_2))-X_{N_f,N_g}(F(x_1))\bigr|
\le \frac{1}{(j-1)!}\,\bigl|X_{N_f,N_g}(F(x_2))-X_{N_f,N_g}(F(x_1))\bigr|.
$$
From (a) and (b), together with the fact that $\{X_{N_f,N_g}\}$ is tight, $\{Y_{N_f,N_g}\}$ is tight. So we obtain $Y_{N_f,N_g}\Rightarrow Y$ as $N_f,N_g\to\infty$, and thus the first equation in (B.1) holds. Using a similar argument, one can prove that the second equation in (B.1) holds.

In addition, by the law of large numbers, with probability 1 one has
$$
\frac{N_fN_g}{N}\hat V^D_{R_j} \to \frac{1}{((j-1)!)^2}\int_x^b (t-x)^{2j-2}\,dF(t)-F_j^D(x)^2, \qquad
\frac{N_fN_g}{N}\hat V^A_{R_j} \to \frac{1}{((j-1)!)^2}\int_a^x (x-t)^{2j-2}\,dF(t)-F_j^A(x)^2.
\tag{B.5}
$$
From (B.1) and (B.5), when $j>1$ we have
$$
T_j^{*M}(x) \Rightarrow
\begin{cases}
\dfrac{\int_x^b (t-x)^{j-2}B(F(t))\,dt}{\Bigl(\int_x^b\int_x^b (t-x)^{j-2}(s-x)^{j-2}\,(F(t\wedge s)-F(t)F(s))\,dt\,ds\Bigr)^{1/2}} & \text{if } x>0,\\[2ex]
-\dfrac{\int_a^x (x-t)^{j-2}B(F(t))\,dt}{\Bigl(\int_a^x\int_a^x (x-t)^{j-2}(x-s)^{j-2}\,(F(t\wedge s)-F(t)F(s))\,dt\,ds\Bigr)^{1/2}} & \text{if } x\le 0,
\end{cases}
$$
and when $j=1$ we have
$$
T_1^{*M}(x) \Rightarrow
\begin{cases}
\dfrac{B(F(x))}{\sqrt{F(x)-F^2(x)}} & \text{if } x>0,\\[2ex]
-\dfrac{B(F(x))}{\sqrt{F(x)-F^2(x)}} & \text{if } x\le 0.
\end{cases}
$$
That is, $\{T_j^{*M}\}$ and $\{T_j^M\}$ have the same limiting process, and thus the assertion holds.
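Equation (B.3) says that, conditional on the data, the studentized bootstrap empirical CDF drawn from the pooled sample is asymptotically N(0, 1). A small numerical check of this building block (an illustrative setup, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(7)
f = rng.normal(size=800)
g = rng.normal(size=800)
pooled = np.concatenate([f, g])
x = 0.25
RN = (pooled <= x).mean()                    # pooled empirical CDF R_N at x
Nf = len(f)
se = np.sqrt(RN * (1 - RN) / Nf)
reps = 5000
boot = rng.choice(pooled, size=(reps, Nf))   # bootstrap draws from the pooled sample
Z = ((boot <= x).mean(axis=1) - RN) / se     # studentized as in (B.3)
print("mean:", round(Z.mean(), 3), "var:", round(Z.var(), 3))
```

The simulated mean and variance are close to 0 and 1, as (B.3) predicts.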
The Econometrics Journal (2011), volume 14, pp. 304–320. doi: 10.1111/j.1368-423X.2010.00334.x
Regressions with asymptotically collinear regressors

KAIRAT T. MYNBAEV†

†International School of Economics, Kazakh–British Technical University, Tolebi 59, Almaty 050000, Kazakhstan. E-mail: kairat[email protected]

First version received: August 2009; final version accepted: August 2010
Summary We investigate the asymptotic behaviour of the OLS estimator for regressions with two slowly varying regressors. It is shown that the possibilities include a definite case, when third-order regular variation is sufficient to determine the asymptotic distribution, and an indefinite case, when higher-order regular variation is required to find the distribution. In the definite case the asymptotic distribution is normal and one-dimensional and may belong to one of six types, depending on the relative rates of growth of the regressors. The dependence of the asymptotic variance on the parameters of the model is discontinuous. The analysis establishes, in particular, a new link between slow variation and Lp-approximability.

Keywords: Asymptotically collinear regressors, Asymptotic distribution, Lp-approximability, OLS estimator.
1. INTRODUCTION

Regressions with asymptotically collinear regressors arise in a number of applications, both in linear and non-linear settings. Examples are the log-periodogram analysis of long memory (see Robinson, 1995, Hurvich et al., 1998, Phillips, 1999, and references therein), the study of growth convergence (Barro and Sala-i-Martin, 1995) and non-linear least squares estimation (Wu, 1981). Phillips (2007) has developed a powerful method to analyse such regressions. Using the theory of slowly varying functions (for the definition see Bingham et al., 1987), he has proved asymptotic normality of OLS estimators (with an appropriate standardization). He has also shown that the usual regression formulas for asymptotic standard errors are valid, and that the limit distribution of the regression coefficients is one-dimensional.

In the paper just cited, Phillips has considered a variety of situations, from simple regression to non-linear regression. In the case of simple regression, and of a polynomial regression in a slowly varying function, his treatment is complete. However, in the case of two different slowly varying regressors, as in
$$
y_s = \beta_0 + \beta_1 L_1(s) + \beta_2 L_2(s) + u_s, \tag{1.1}
$$
Phillips limited himself to a heuristic argument. The purpose of this paper is to provide a rigorous result for (1.1).
Following Phillips (2007), let us consider slowly varying functions L with Karamata representation
$$
L(x) = c_L \exp\Bigl(\int_a^x \varepsilon(t)\,\frac{dt}{t}\Bigr), \qquad x \ge a > 0, \tag{1.2}
$$
and call the function ε in this representation an ε-function of L. We say that two models of form (1.1) with pairs of SV functions (L1, L2) and (L̃1, L̃2) are of different (asymptotic) types if their asymptotic distributions contain functions of the sample size n with different asymptotic behaviour as n → ∞. Phillips (2007, Theorem 5.1) suggests that there are two types of model (1.1): one kind of asymptotics holds when the ε-functions of L1, L2 satisfy ε2(n) = o(ε1(n)), and another holds when ε1(n) = o(ε2(n)). Our classification theorem below shows that the number of different types is at least six (this number increases as one considers higher-order approximations) and is determined by such fine characteristics of the regressors as the ε-functions of their ε-functions. In all cases we prove a Phillips-type result: the limit distribution is normal and one-dimensional.

Let L be a real-valued function defined on the half-axis (0, ∞). We say that L is first-order regularly varying if, with some functions A1, B1, the relation L(xt) = L(t) + A1(t)(B1(x) + o(1)), t → ∞, holds for all x > 0. The definition of second-order regular variation (RV) imposes the additional requirement that, with some A2, B2, the relation L(xt) = L(t) + A1(t)B1(x) + A1(t)A2(t)(B2(x) + o(1)), t → ∞, holds for all x > 0. Similarly, in the case of third-order RV the extra condition is L(xt) = L(t) + A1(t)B1(x) + A1(t)A2(t)B2(x) + A1(t)A2(t)A3(t)(B3(x) + o(1)). Higher-order RV is defined inductively. The purpose of these definitions is to separate the influence of the arguments x and t while controlling the rate of approximation. For details see Alves et al. (2006).

Starting from the heuristic idea of Phillips, we suggest a new transformation of the regressor space, which gives rise to a more elaborate analysis of the transition matrix associated with the modified transformation.
We also obtain a sufficient condition for the third-order RV of functions with representation (1.2). The present study reveals the distinction between the definite case, when third-order RV is enough to determine the asymptotics, and the indefinite case, when higher-order RV is necessary to derive the asymptotic distribution. The situation is similar to the sufficient conditions for optima in terms of first- and second-order derivatives: if the second-order condition is not satisfied, one has to check higher-order derivatives. As the situation with regressions is more complex than with derivatives, we cannot say for sure whether the process stops at some higher-order RV, but the calculations we have done indicate that the number of different asymptotic types is probably infinite. Another unexpected outcome is that the asymptotic variances jump along certain rays.

The method is in principle applicable to regressions with more than two different slowly varying regressors. However, we are not sure that such generalizations are required for empirical work.

In Section 2, we state the main results. Their proofs are given in the Appendix. All constants denoted by c, with or without subscripts, are inconsequential.
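For readers who want to experiment, the defining property lim L(rx)/L(x) = 1 of slow variation and the ε-function ε(x) = xL′(x)/L(x) from (1.2) are easy to check numerically (a sketch; the helper eps approximates the derivative by central differences):

```python
import math

def eps(L, x, h=1e-4):
    # eps-function eps(x) = x L'(x) / L(x), derivative by central differences
    return x * (L(x * (1 + h)) - L(x * (1 - h))) / (2 * h * x) / L(x)

L1 = math.log                          # l1(x) = log x
L2 = lambda t: math.log(math.log(t))   # l2(x) = log(log x)

x = 1e8
print("log x:     L(2x)/L(x) =", L1(2 * x) / L1(x))   # close to 1
print("log log x: L(2x)/L(x) =", L2(2 * x) / L2(x))   # even closer to 1
print("eps of log x:", eps(L1, x), "(theory 1/log x =", 1 / math.log(x), ")")
print("eps of log log x:", eps(L2, x), "(theory 1/(l1 l2) =",
      1 / (math.log(x) * math.log(math.log(x))), ")")
```

Both ε-values match the closed forms that reappear in Table 1 below.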
2. MAIN RESULTS

2.1. Slowly varying functions

Here the properties of slowly varying functions are reviewed to the extent required later.

© 2011 The Author(s). The Econometrics Journal © 2011 Royal Economic Society.
K. T. Mynbaev
Table 1. Basic SV functions (l1(x) = log x, l2(x) = log(log x)).

L            ε^(1)           ε^(2)                 μ                      δ
L1 = l1      1/l1            −1/l1                 0                      0
L2 = l2      1/(l1 l2)       −(1 + l2)/(l1 l2)     −1/(2 l1)              −1/(2 l1²)
L3 = 1/l1    −1/l1           −1/l1                 −1/l1                  1/l1³
L4 = 1/l2    −1/(l1 l2)      −(1 + l2)/(l1 l2)     −(2 + l2)/(2 l1 l2)    (2 + l2)/(2 l1² l2³)
The name slowly varying and its abbreviation SV will be used for a positive measurable function on [a, ∞), where a > 0, satisfying the condition

lim_{x→∞} L(rx)/L(x) = 1  for any r > 0.  (2.1)
Functions with representation (1.2), where ε is continuous and lim_{x→∞} ε(x) = 0, constitute a special case of SV functions. In (1.2) the constant cL is allowed to be negative and the ε-function of L can be found as ε(x) = xL′(x)/L(x). For third-order RV we have to assume that the ε-function is also of form (1.2). That is, denoting ε^(1) the ε-function from (1.2), we suppose that

ε^(1)(x) = c_ε exp( ∫_a^x ε^(2)(t) dt/t ),

where lim_{x→∞} ε^(2)(x) = 0 and ε^(2) is continuous. Denote

μ(x) = (ε^(1)(x) + ε^(2)(x))/2,   δ(x) = L(x) ε^(1)(x) μ(x).

These functions will be called the μ- and δ-functions of L, respectively. Without loss of generality, L and ε^(1) can be extended continuously from [a, ∞) to [0, ∞) (this does not change their asymptotic behaviour at infinity, nor does it affect the OLS asymptotics). In (1.2) it is still better to keep a positive because some properties may hold only in a neighbourhood of infinity. When the argument of ε^(1), ε^(2), μ or δ is the sample size n, that argument will usually be suppressed.

Table 1 contains a summary of practically important cases. We observe in Table 1 that for l1 both μ and δ are identically zero. In all other cases ε^(1), ε^(2), μ and δ are non-zero for all large x. To avoid multicollinearity, in the pair (L1, L2) only one function is allowed to be log x (and have a vanishing δ-function). By changing the notation, if necessary, one can assume that if one of L1, L2 is log x, then it is always L1. When L1(x) = log x, model (1.1) is called semi-reduced. When the δ-functions δ1 and δ2 of L1 and L2, respectively, are both non-zero, model (1.1) is called non-reduced. To exclude constant regressors, we also assume that neither of the ε-functions ε1^(1) and ε2^(1) of L1 and L2 vanishes for all large n. Our analysis shows that the asymptotic theory of model (1.1) depends on the asymptotic behaviour of the ratios δ1/δ2 and ε1^(1)/ε2^(1).
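These closed forms are easy to sanity-check numerically. The short script below is an illustration added here (not part of the original argument; the helper names are ad hoc): it recovers the ε-function ε(x) = xL′(x)/L(x) by central differences and compares it with the ε^(1) column of Table 1 for all four basic SV functions, then checks the μ- and δ-functions of l2.

```python
import math

def eps(L, x, h=1e-4):
    """Numerical epsilon-function eps(x) = x * L'(x) / L(x), via a central difference."""
    dL = (L(x * (1 + h)) - L(x * (1 - h))) / (2 * x * h)
    return x * dL / L(x)

l1 = math.log                          # l1(x) = log x
l2 = lambda x: math.log(math.log(x))   # l2(x) = log(log x)

x = 1e6
a, b = l1(x), l2(x)
# epsilon^(1) column of Table 1
table1 = [
    (l1, 1 / a),                          # L1 = l1
    (l2, 1 / (a * b)),                    # L2 = l2
    (lambda t: 1 / l1(t), -1 / a),        # L3 = 1/l1
    (lambda t: 1 / l2(t), -1 / (a * b)),  # L4 = 1/l2
]
for L, predicted in table1:
    assert abs(eps(L, x) - predicted) < 1e-6

# mu and delta of L2 = l2: mu = (eps1 + eps2)/2, delta = L * eps1 * mu
eps1 = eps(l2, x)
eps2 = eps(lambda t: eps(l2, t), x)   # epsilon-function of epsilon^(1)
mu = (eps1 + eps2) / 2
assert abs(mu - (-1 / (2 * a))) < 1e-4            # Table 1: -1/(2 l1)
assert abs(l2(x) * eps1 * mu - (-1 / (2 * a * a))) < 1e-4  # Table 1: -1/(2 l1^2)
```

The nested call for ε^(2) simply applies the same finite-difference recipe to ε^(1) itself, mirroring the inductive definition in the text.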
In the next assumption we impose on these ratios conditions general enough to include all possible pairs of functions from Table 1.

ASSUMPTION 2.1. (a) The limit λε = lim_{n→∞} ε1^(1)/ε2^(1) (finite or infinite) exists and (b) in the non-reduced case we assume that the limit λδ = lim_{n→∞} δ1/δ2 (finite or infinite) exists.

Condition SVR (Slow Variation Reinforced). We write L = K(ε, φε, θε) if (a) L is continuous on [0, ∞) and has Karamata representation (1.2) for some a > 0, where ε is SV, continuous and ε(x) → 0 as x → ∞, and (b) there exist a constant c > 0 and a function φε on [0, ∞) with the following properties:

(i) φε is positive, non-decreasing on [0, ∞), and φε(x) → ∞ as x → ∞;
(ii) there exist positive numbers θε, Xε such that x^(−θε) φε(x) is non-increasing on [Xε, ∞); and
Asymptotically collinear regressors
(iii) 1/(c φε(x)) ≤ |ε(x)| ≤ c/φε(x)  for all x ≥ c.  (2.2)
Condition SVR will be used as a building block of more complex conditions. The right inequality in (2.2) means that L is slowly varying with remainder φε (see Aljanˇci´c et al., 1955). All practical examples from Table 1 satisfy the above definition with φε (x) = 1/|ε(x)| and the number θε > 0 can be chosen as close to 0 as desired. This follows from the next property of SV functions: if L is SV, then for any θ > 0, x θ L(x) → ∞ and x −θ L(x) → 0 as x → ∞. Phillips (2007, Assumption SSV) does not have the (b) part while it seems to be essential for the most important statements both in his paper (see Mynbaev, 2009) and here. 2.2. Phillips’ transformation of the regressor space Assuming that μ(n) = 0 denote L(t) − L(n) , H (t, n) = L(n)ε(1) (n) (1)
1 t (1) H (t, n) = H (t, n) − log , n μ(n) (2)
0 < t ≤ n.
Under certain conditions Phillips (2007, equation (63)) H (2) (rn, n) = log2 r + o(1)
uniformly in r ∈ [a, b],
(2.3)
for any 0 < a < b < ∞. Phillips (2007, p. 573) suggested transforming the regressor space as follows. Letting s = rn in (2.3) we get after rearranging s s + [1 + o(1)]Lj (n)εj(1) (n)μj (n) log2 n n uniformly in s ∈ [na, nb], j = 1, 2,
Lj (s) = Lj (n) + Lj (n)εj(1) (n) log
(2.4)
(in fact, we need only b = 1). Using this expansion and suppressing the argument n in Lj , εj(1) , μj , (1.1) can be rewritten as s s + β1 L1 ε1(1) μ1 log2 [1 + o(1)] n n s s + β2 L2 + β2 L2 ε2(1) log + β2 L2 ε2(1) μ2 log2 [1 + o(1)] + us . n n
ys = β0 + β1 L1 + β1 L1 ε1(1) log
(2.5)
Dropping the o(1) terms produces an approximation to (2.5):

ys = β0 + β1 L1 + β2 L2 + (β1 L1 ε1^(1) + β2 L2 ε2^(1)) log(s/n) + (β1 δ1 + β2 δ2) log²(s/n) + us.  (2.6)

Denoting

β = (β0, β1, β2)′,   γn = (γn0, γn1, γn2)′,   An = ( 1   L1          L2
                                                    0   L1 ε1^(1)   L2 ε2^(1)
                                                    0   δ1          δ2 ),  (2.7)
we obtain

ys = γn0 + γn1 log(s/n) + γn2 log²(s/n) + us,   γn = An β.  (2.8)
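The practical content of the transformation can be seen numerically. The sketch below is an added illustration (the regressor pair L1 = log s, L2 = log log s and the sample size are arbitrary choices): the Gram matrix of the raw regressors (1, L1, L2) is badly conditioned, while the Gram matrix of the transformed regressors (1, log(s/n), log²(s/n)) from (2.8) stays close to a fixed positive definite matrix.

```python
import numpy as np

n = 10_000
s = np.arange(2, n + 1, dtype=float)  # start at 2 so log(log s) is defined
l1, l2 = np.log(s), np.log(np.log(s))

# Raw regressors (1, L1, L2): asymptotically collinear
X_orig = np.column_stack([np.ones_like(s), l1, l2])
# Transformed regressors (1, log(s/n), log^2(s/n)) as in (2.8)
u = np.log(s / n)
X_new = np.column_stack([np.ones_like(s), u, u ** 2])

cond_orig = np.linalg.cond(X_orig.T @ X_orig / n)
cond_new = np.linalg.cond(X_new.T @ X_new / n)
assert cond_new < cond_orig  # the transformed design is far better conditioned
```

The ill-conditioning of the raw design is exactly why the asymptotic distribution of the β's cannot be obtained by directly normalizing the OLS estimator.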
We call the γi's good coefficients and the βi's bad coefficients. The matrix An is called a transition matrix. Because of the asymptotic collinearity of the regressors in (1.1), the asymptotic distribution of the bad coefficients is degenerate (one-dimensional) and cannot be found directly by normalizing the OLS estimator. In (2.8) the regressors are not asymptotically collinear and therefore the asymptotic distribution of the OLS estimator γ̂n is good (normal, with a positive definite variance–covariance matrix). The Phillips idea is to extract the asymptotic distribution of the β's from that of the γ's using β = An^(−1) γn. An^(−1) is not well behaved as n → ∞. Thus, the study of the transition matrix is at the heart of the method.

The problem with this transformation is that it is impossible to prove that (2.6) approximates (2.5). Relationship (2.4) does not cover the segment 1 ≤ s < na, whose length is proportional to the sample size. For such s the approximation cannot be good. For example, at s = 2, for the functions L1, L3 from Table 1 one has L1(2)/L1(n) = log 2/log n → 0 and L3(2)/L3(n) = log n/log 2 → ∞. Therefore Phillips (2007, Theorem 5.1) is true for (2.6) and not for the original regression. Our main result shows that, indeed, replacing (2.5) by (2.6) results in a loss of information in terms of the variety of different asymptotic types. We are able to show that the values of s for which there is no approximation are negligible because (1) Condition SVR has part (b), which Phillips' Assumption SSV does not have, and (2) instead of the sup-norm used by Phillips we use the integral L2-norm contained in the definition of L2-approximability.

2.3. Modification of the Phillips transformation

Now we describe a modification of this approach that allows us to avoid dropping any terms. This section and Lemmas A.3 and A.4 are the main contributions of this paper. To explain the idea, we consider only a non-reduced model with |λδ| < ∞
and β1 λδ + β2 ≠ 0.  (2.9)
The construction uses the notion of Lp-approximability from Mynbaev (2001). Its purpose is to approximate a sequence of vectors with a function of a continuous argument; see Mynbaev (2009) for an intuitive introduction and history. Let {wn} be a sequence of vectors such that wn ∈ R^n for each n and denote ‖f‖p = (∫_0^1 |f(x)|^p dx)^(1/p), where 1 ≤ p < ∞. Let Δnp denote an interpolation operator defined by Δnp wn = n^(1/p) Σ_{t=1}^n wnt 1_{it}. Here wnt are the coordinates of wn; the intervals it = [(t − 1)/n, t/n), t = 1, ..., n, form a partition of [0, 1) and 1_A is an indicator of a set A; that is, 1_A = 1 on A and 1_A = 0 outside A. We say that {wn} is Lp-approximable if there exists a function W on [0, 1] such that ‖W‖p < ∞ and ‖Δnp wn − W‖p → 0. In this case we also say that {wn} is Lp-close to W.

For the non-reduced model both δ1 and δ2 are non-zero. Therefore, we can write

Lj(s) = Lj(n) + Lj(n) εj^(1)(n) log(s/n) + Lj(n) εj^(1)(n) [ (Lj(s) − Lj(n))/(Lj(n) εj^(1)(n)) − log(s/n) ]
      = Lj(n) + Lj(n) εj^(1)(n) log(s/n) + δj(n) Hj^(2)(s, n),   j = 1, 2.  (2.10)
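The L2-closeness just mentioned can be previewed numerically. The sketch below is an added illustration (it anticipates Lemma A.3 of the Appendix): for the single function L = log log x, with ε^(1) and μ taken from Table 1, it computes a discrete analogue of the L2 distance between n^(−1/2)H^(2) and log² x and checks that it decreases as n grows. The smallest values of t are skipped, as their contribution is asymptotically negligible.

```python
import math

def H2(t, n):
    """H^(2)(t, n) for L = log log x, using Table 1: L(n)eps^(1)(n) = 1/log n,
    mu(n) = -1/(2 log n)."""
    l1n = math.log(n)
    l2n = math.log(l1n)
    H1 = (math.log(math.log(t)) - l2n) * l1n   # (L(t) - L(n)) / (L(n) eps^(1)(n))
    return (H1 - math.log(t / n)) / (-1.0 / (2.0 * l1n))

def l2_distance(n):
    """Discrete L2 distance between H^(2)(t, n) and log^2(t/n), t = 3..n."""
    sq = sum((H2(t, n) - math.log(t / n) ** 2) ** 2 for t in range(3, n + 1))
    return math.sqrt(sq / n)

d = [l2_distance(n) for n in (10**3, 10**4, 10**5)]
assert d[2] < d[0]  # the distance shrinks as n grows, at a slow logarithmic rate
```

The rate of decrease is only logarithmic in n, which is consistent with the slow-variation setting.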
Hj^(2) is not equal to log²(s/n), but the normalized sequence {n^(−1/2) Hj^(2)} is L2-close to it (see the Appendix). Substitution of (2.10) in (1.1) yields

ys = γn0 + γn1 log(s/n) + Δn + us,  (2.11)

where

γn0 = β0 + β1 L1 + β2 L2,   γn1 = β1 L1 ε1^(1) + β2 L2 ε2^(1)  (2.12)

and

Δn = β1 δ1 H1^(2)(s, n) + β2 δ2 H2^(2)(s, n).  (2.13)

The first two rows of our transition matrix are the same as in (2.7). As for the elements of the last row, we put a31 = 0, and the crucial step is to define γn2 = a32 β1 + a33 β2 and H̃(s, n) in such a way that

Δn = γn2 H̃(s, n)  and  {n^(−1/2) H̃} is L2-close to log² x.  (2.14)
Then (2.11) becomes

ys = γn0 + γn1 log(s/n) + γn2 H̃(s, n) + us  (2.15)

and the transition matrix is

An = ( 1   L1          L2
       0   L1 ε1^(1)   L2 ε2^(1)
       0   a32         a33 ).
We show how this is done in case (2.9). Continuing (2.13) we get β1 δ1 H1(2) (s, n) β2 δ2 H2(2) (s, n) n = (β1 δ1 + β2 δ2 ) + β1 δ1 + β2 δ2 β1 δ1 + β2 δ2 β1 δ1 /δ2 H1(2) (s, n) β2 H2(2) (s, n) + = (β1 δ1 + β2 δ2 ) . β1 δ1 /δ2 + β2 β1 δ1 /δ2 + β2
(2.16)
Letting γn2 = β1 δ1 + β2 δ2 ,
β1 δ1 /δ2 H1(2) (s, n) β2 H2(2) (s, n) H˜ (s, n) = + , β1 δ1 /δ2 + β2 β1 δ1 /δ2 + β2
(2.17)
we satisfy the first part of (2.14). By Assumption 2.1 and (2.9) β1 δ1 /δ2 β1 λδ → , β1 δ1 /δ2 + β2 β1 λδ + β2
β2 β2 → . β1 δ1 /δ2 + β2 β1 λδ + β2
(2.18)
It will be shown in the Appendix that this implies the second part of (2.14). Note that H˜ (s, n) defined in (2.17) depends on β1 , β2 in a non-linear fashion but in the limit that dependence disappears. The analysis in Lemma A.4 shows that the elements a32 , a33 are as described in Table 2. The case, where the transition matrix is not defined and higher-order RV is necessary to determine it, is marked as indefinite. The dependence of the transition matrix on the true β is not continuous. C 2011 The Author(s). The Econometrics Journal C 2011 Royal Economic Society.
310
K. T. Mynbaev Table 2. Transition matrix summary. Subcase
Case Non-reduced model, |λδ | < ∞
Non-reduced model, |λδ | = ∞
Coefficients
β 1 λδ + β 2 = 0 β1 λδ + β2 = 0, β2 = 0
I. a32 = δ1 , a33 = δ2 II. Indefinite
β1 λδ + β2 = 0, β2 = 0
III. a32 = δ1 , a33 = 0
β1 = 0
IV. a32 = δ1 , a33 = δ2
β1 = 0
V. a32 = 0, a33 = δ2
Semi-reduced model (L1 (x) = log x, δ2 = 0)
VI. a32 = 0, a33 = δ2
2.4. Convergence statements As explained in Section 2.2 the asymptotic distribution of γˆn should be derived first and that of βˆ next. In principle, convergence of the good coefficients is described by Phillips (2007, Theorem 4.1), where they are denoted αn . However, Theorem 4.1 depends on Phillips’s Lemma 7.4, the proof of which is incomplete, and Lemma 2.1(iii), which can be proved under more general assumptions on the linear process ut . Therefore, we provide an independent proof. A SSUMPTION 2.2 (On the regressors). (a) In the non-reduced case we assume that Li = K(εi(1) , φε(1) , θε(1) ), εi(1) = K(εi(2) , φε(2) , θε(2) ) and εi(2) is slowly varying for i = 1, 2. Further, we i i i i suppose that the μ-functions of Li are different from 0 in some neighbourhood of infinity and satisfy 1 max εi(1) (x), εi(2) (x) ≤ |μi (x)| ≤ max εi(1) (x), εi(2) (x) , cμ
(2.19)
with some constant cμ > 0. Finally, max{2θε(1) , θε(2) } < 1/2 for i = 1, 2. (b) In the semi-reduced i i case L1 (x) = log x and L2 satisfies part (a). The next assumption is less restrictive than the corresponding condition by Phillips. A SSUMPTION 2.3 (On the linear process). For representation ut = ∞ all t > 0, ut has ∞ ∞ j =−∞ cj et−j , where (a) the numbers cj satisfy j =−∞ |cj | < ∞, j =−∞ cj = 0 and (b) the sequence of random variables {ej } is a martingale difference sequence (et is Ft -measurable and E(et | Ft−1 ) = 0) such that E(et2 | Ft−1 ) = σe2 (a constant) for all t and et2 are uniformly integrable. Here {Ft } is an increasing sequence of σ -fields. 2 Henceforth we denote σ 2 = σe ∞ and G is the Gram matrix of the system j =−∞ cj 1 j −1 fj (x) = log x, j = 1, 2, 3, i.e. the element gij of G equals gij = 0 fi (x)fj (x) dx. T HEOREM 2.1.
Let Assumptions 2.2 and 2.3 hold. Then √ d n(γˆn − γn ) → N (0, σ 2 G−1 ).
(2.20)
C 2011 The Author(s). The Econometrics Journal C 2011 Royal Economic Society.
311
Asymptotically collinear regressors
Case Non-reduced
Subcase
Table 3. Type-wise OLS asymptotics. |λε | < ∞
(|λδ | < ∞,
A. (μ1 − μ2 )Bn(1) d
|λε | = ∞
|λε | ≤ ∞
B. (μ1 − μ2 )Bn(2)
Indefinite
d
model
β1 λδ + β2 = 0) or
(δ1 = 0,
(|λδ | = ∞, β1 = 0)
→ f (λε ) if μ1 = μ2
δ2 = 0)
(|λδ | < ∞, β2 = 0,
C. μ1 Bn(1) → f (λε )
if μ1 = μ2
→ g if μ1 = μ2 d
D. μ1 Bn(2) → g
d
d
F. μ2 Bn(2) → g Indefinite
d
H. μ2 Bn(2) → g
β1 λδ + β2 = 0) (|λδ | = ∞, β1 = 0) (|λδ | < ∞, β2 = 0,
E. μ2 Bn(1) → f (λε ) Indefinite
d
β1 λδ + β2 = 0) Semi-reduced model
G. μ2 Bn(1) → f (λε )
d
(L1 (x) = log x, δ2 = 0)
To describe the behaviour of the bad coefficients denote ⎞ ⎞ ⎛ (1) ⎛ (1) ε1 (βˆ0 − β0 ) ε2 (βˆ0 − β0 ) ⎟ ⎟ √ ⎜ √ ⎜ (1) ˆ (1) ˆ (2) ⎟ ⎟ ⎜ Bn(1) = n ⎜ ⎝ L1 ε1 (β1 − β1 ) ⎠ ; Bn = n ⎝ L1 ε1 (β1 − β1 ) ⎠ ; L2 ε2(1) (βˆ2 − β2 ) L2 ε2(1) (βˆ2 − β2 ) f (λε ) = λε − 1, 1, −1 , g = 1, 1, −1 . √ Let be a normal variable distributed as N (0, σ 2 /4) (it arises from the limit of n(γˆn2 − γn2 )). T HEOREM 2.2 (Classification theorem). Let Assumptions 2.1–2.3 hold. Then the relation between the bad coefficients (contained in Bn(i) ) and good coefficients (represented by ) is as presented in Table 3. In the cases marked ‘indefinite’ RV of orders higher than 3 is required to determine the type. Since Phillips (2007, Theorem 5.1) is actually about regression with a quadratic form in log(s/n), no wonder its predictions are different from those in Table 3. In particular, the classification theorem captures a new effect that the asymptotic variance depends on the true β. The case of more than two different SV regressors should present an even larger number of different asymptotic types, and Phillips (2007, Theorem 5.2) does not cover all possibilities. The following example from Phillips (2007) has iterated logarithmic growth, a trend decay component, and a constant regressor: ys = β0 + β1 / log s + β2 log(log s) + us . Such a model is relevant in empirical research where one wants to capture simultaneously two different opposing trends in the data. Here L1 (s) = 1/ log s, L2 (s) = log(log s). From Table 1 ε1 = −1/l1 , μ1 = −1/l1 , δ1 (n) = l1−3 , ε2 = 1/(l1 l2 ), μ2 = −1/(2l1 ) and δ2 (n) = −1/(2l12 ). Since δ1 /δ2 = −2/l1 → 0 and ε2 /ε1 = −1/l2 → 0, we have εmin = ε2 and by Table 3, cases B and D, ⎛ ⎞ 1 ˆ0 − β0 ) ( β ⎟ log(log n) √ ⎜ ⎟ n ⎜ ⎜ ⎟ d 2g if β2 = 0; 1 ˆ ⎜ ⎟→ 2 ⎜ − ( β − β ) ⎟ 1 1 g if β2 = 0. log n ⎝ log n ⎠ βˆ2 − β2 C 2011 The Author(s). The Econometrics Journal C 2011 Royal Economic Society.
312
K. T. Mynbaev
The formula from Phillips (2007, pp. 575–76), after correction of two typographical errors, gives the same asymptotics: ⎛ ⎞ 1 ˆ0 − β0 ) ( β ⎟ log(log n) √ ⎜ ⎟ n ⎜ ⎜ ⎟ d 1 ˆ ⎟ → g , 2 ⎜ − − β ) ( β ⎜ ⎟ 1 1 log n ⎝ log n ⎠ βˆ2 − β2 regardless of β2 . The comments by Phillips apply: the coefficient of the growth term converges √ fastest but at less that an n rate. The intercept converges next fastest, and finally the coefficient of the evaporating trend. All of these outcomes relate to the strength of the signal from the respective regressor. 2.5. What can be done in the indefinite cases? From the proofs of Lemma A.4 and Theorem 2.2 one can see that indefiniteness can occur when the rate of approximation is not good enough to define the transition matrix or when it is defined but is not invertible. We use the first possibility as an example. To this end, basic notation related to approximation of slowly varying functions will be necessary. The first-order RV is contained in the basic definition of slow variation (2.1). For slowly varying functions with representation (1.2) Phillips (2007, pp. 563, 573) has derived the secondorder RV H (1) (rn, n) = log r + o(1) and the third-order one (2.3). Let Li = K εi(1) , φε(1) , θε(1) , εi(1) = K εi(2) , φε(2) , θε(2) , εi(2) = K εi(3) , φε(3) , θε(3) i
and suppose
εi(3)
i
i
i
i
i
is slowly varying for i = 1, 2. One can prove then
(1) (1) (2) Li (rn) (1) (1) (2) 2 3 − 1 = εi(1) log r + εi(1) μ(1) , i log r + εi μi μi log r + o εi μi μi Li (n)
(2.21)
where
2 εi(2) (εi(2) + εi(3) ) + 3εi(1) εi(2) + εi(1) εi(1) + εi(2) (2) , μi = = . (2.22) 2 εi(1) + εi(2) Denoting H (3) (t, n) = H (2) (t, n) − log2 nt μ(2)1(n) , from (2.21) we have the fourth-order RV H (3) (rn, n) = log3 r + o(1). With this information, an extension of (2.10) is μ(1) i
s + δj (n)Hj(2) (s, n) n s s (3) = Lj (n) + Lj (n)εj(1) (n) log + δj (n) log2 + δj (n)μ(2) j Hj (s, n). n n
Lj (s) = Lj (n) + Lj (n)εj(1) (n) log
(2.23)
Equation (2.23) allows us to rewrite (2.13) as (3) (2) (3) 2 n = β1 δ1 μ(2) 1 H1 (s, n) + β2 δ2 μ2 H2 (s, n) + (β1 δ1 + β2 δ2 ) log
s . n
(2.24)
(2) Denote κ = (β1 δ1 + β2 δ2 )/(β1 δ1 μ(2) 1 + β2 δ2 μ2 ). C 2011 The Author(s). The Econometrics Journal C 2011 Royal Economic Society.
Asymptotically collinear regressors
Case |λδμ | < ∞
313
Table 4. Transition matrix summary in case II: |λδ | < ∞, β1 λδ + β2 = 0, β2 = 0. Subcase β1 λδμ + β2 = 0 β1 λδμ + β2 = 0
VII. Indefinite (2) |λκ | < ∞ VIII. a32 = δ1 μ(2) 1 , a33 = δ2 μ2 . |λκ | = ∞ IX. a32 = δ1 , a33 = δ2 .
|λδμ | = ∞
β1 = 0
(2) |λκ | < ∞ X. a32 = δ1 μ(2) 1 , a33 = δ2 μ2 .
β1 = 0
|λκ | = ∞ XI. a32 = δ1 , a33 = δ2 . |λμ | < ∞ XII. a32 = 0, a33 = δ2 . |λμ | = ∞ XIII. a32 = 0, a33 = δ2 μ(2) 2 .
Note: The numbering continues that of Table 2. Case VII starts a new indefinite branch.
(2) A SSUMPTION 2.4. (a) Let β1 δ1 μ(2) 1 + β2 δ2 μ2 = 0 for all large n. (b) Suppose the limits λδμ = (2) (2) −1/2 (3) Hj } is limn→∞ δ1 μ1 /(δ2 μ2 ), λκ = lim κ, λμ = lim μ(2) 2 (finite or infinite) exist. (c) {n L2 -close to log3 x, j = 1, 2.
Under this assumption the last row of the transition matrix is described by Table 4. Equation (2.21) can be called a point-wise RV. The method requires establishing an integral version of the fourth-order RV in the form Assumption 2.4(c). The proofs of second-order Mynbaev (2009, Theorem 3) and third-order RV (Lemma A.3) are pretty complex. The proof of the fourth-order RV must be even more complex given the fact that the function μ(2) in (2.22) depends non-linearly on the ε-functions. Therefore, trying to obtain a general result is not recommended. If indefiniteness arises in an applied problem with specific L1 , L2 , it would be easier to prove Assumption 2.4(c) for those specific functions. In Phillips (2007, Theorems 3.1, 4.2, 4.3) the Phillips condition LP can be replaced by our Assumption 2.3, which is weaker. On the other hand, the conditions on the slowly varying functions should be made stronger: L = K(ε, φε , θε ) with 2θε < 1 in Phillips (2007, Theorem 3.1) and L = K(ε, φε , θε ) with 2θε p < 1 in Phillips (2007, Theorems 4.2, 4.3). With our Lemma A.3 one can fix Phillips’s central limit theorem for the linear regression. Corrected statements and complete proofs of all results, except for those related to the log-periodogram analysis of long memory and a linear regression in more than two different SV regressors, can be found in Mynbaev (2011).
ACKNOWLEDGMENTS The paper benefited substantially from remarks by the Editor and two anonymous referees. In particular, the existence of the indefinite case was discovered due to their insistence.
REFERENCES Aljanˇci´c, S., R. Bojani´c and M. Tomi´c (1955). Deux th´eor`emes relatifs au comportement asymptotique des s´eries trigonom´etriques. Srpska Akademija Nauka. Zbornik Radova. Mat. Inst. 43, 15–26. C 2011 The Author(s). The Econometrics Journal C 2011 Royal Economic Society.
314
K. T. Mynbaev
Alves, I. F., L. de Haan and T. Lin (2006). Third order extended regular variation. Publications de L’Institut Math´ematique, Nouvelle s´erie 80, 109–20. Barro, R. J. and X. Sala-i-Martin (1995). Economic Growth. New York: McGraw-Hill. Bingham, N. H., C. M. Goldie and J. L. Teugels (1987). Regular Variation. Encyclopedia of Mathematics and its Applications, Volume 27. Cambridge: Cambridge University Press. Hurvich, C. M., R. Deo and J. Brodsky (1998). The mean squared error of Geweke and Porter-Hudak’s estimator of the memory parameter of a long-memory time series. Journal of Time Series Analysis 19, 19–46. Mynbaev, K. T. (2001). Lp -approximable sequences of vectors and limit distribution of quadratic forms of random variables. Advances in Applied Mathematics 26, 302–29. Mynbaev, K. T. (2009). Central limit theorems for weighted sums of linear processes: Lp -approximability versus Brownian motion. Econometric Theory 25, 748–63. Mynbaev, K. T. (2011). Short-Memory Linear Processes and Econometric Applications. Forthcoming.New York: John Wiley. Phillips, P. C. B. (1999). Discrete Fourier transforms of fractional processes. Cowles Foundation Discussion Paper No. 1243, Yale University. Phillips, P. C. B. (2007). Regression with slowly varying regressors and nonlinear trends. Econometric Theory 23, 557–614. Robinson, P. M. (1995). Log-periodogram regression of time series with long range dependence. Annals of Statistics 23, 1048–72. Seneta, E. (1985). Pravil’no Menyayushchiesya Funktsii. Moscow: Nauka (Translated from English by I. S. Shiganov, translation edited and with a preface by V. M. Zolotarev, with appendices by I. S. Shiganov and V. M. Zolotarev.) Wu, C.-F. (1981). Asymptotic theory of nonlinear least squares estimation. Annals of Statistics 9, 501–13.
APPENDIX: PROOFS OF RESULTS A.1. Bounds for second- and third-order regular variation The behaviour of L(λx) for small λ presents a problem. In Lemmas A.1 and A.2 we prove that the larger x, the closer λ is allowed to be to 0. L EMMA A.1. If L = K(ε, φε , θε ), then for any b > θε there exist numbers Mb > 0 and ab ≥ max{a, c} such that |L(λx)/L(x) − 1| ≤ Mb λ−b /φε (x) for all x ≥ ab and ab /x ≤ λ ≤ 1. This lemma is a special case of Seneta (1985, Lemma A.1.1). For the proof see also Mynbaev (2009). Since in practical cases the number θε can be arbitrarily close to 0, the number b > θε can also be as close to 0 as desired. L EMMA A.2. If L = K(ε (1) , φε(1) , θε(1) ) and ε(1) = K(ε(2) , φε(2) , θε(2) ), then for any constant b > max{2θε(1) , θε(2) } there exist constants Mb > 0 and ab ≥ max{a, c} such that |H (λx, x) − log λ| ≤ Mb λ (1)
−b
1 1 + φε(1) (x) φε(2) (x)
for x ≥ ab
and
ab ≤ λ ≤ 1. x
C 2011 The Author(s). The Econometrics Journal C 2011 Royal Economic Society.
315
Asymptotically collinear regressors
Proof: Denote r(λ, x) = L(λx)/L(x), U (λ, x) = log r(λ, x). Let x ≥ c and c/x ≤ λ ≤ 1, where c is the constant from (2.2). Since λx ≤ x, (1.2) implies x dt (A.1) ε(1) (t) . U (λ, x) = − t λx Using the right inequality from (2.2) and the fact that φε(1) is non-decreasing we get x x x c c log λ 1 dt dt dt ≤ =− . |U (λ, x)| ≤ |ε(1) (t)| ≤ c t φε(1) (λx) λx t φε(1) (λx) λx λx φε(1) (t) t
(A.2)
Fix some bε(1) > θε(1) . Using monotonicity of φε(1) and the fact that it increases to ∞ at ∞, from ab ≤ xλ we have c/φε(1) (λx) ≤ c/φε(1) (ab ) < (bε(1) − θε(1) )/2 for a sufficiently large ab > 0. Then by (A.2) |U (λ, x)| ≤ −
bε(1) − θε(1) log λ. 2
(A.3)
On the other hand, by part (b) of the definition of the class K(ε, φε , θε ) the inequality Xε(1) ≤ λx ≤ x implies (λx)−θε(1) φε(1) (λx) ≥ x −θε(1) φε(1) (x) and 1/φε(1) (λx) ≤ λ−θε(1) /φε(1) (x). Hence, from (A.2) |U (λ, x)| ≤ −cλ−θε(1) (log λ)/φε(1) (x).
(A.4)
r(λ, x) − 1 − ε (1) (x) log λ = eU (λ,x) − 1 − U (λ, x) + U (λ, x) − ε (1) (x) log λ.
(A.5)
Now consider
By Lemma A.1 applied to ε (1) |ε(1) (λx)/ε(1) (x) − 1| ≤ c1 λ−bε(2) /φε(2) (x)
for all
x ≥ ab
and
ab /x ≤ λ ≤ 1,
(A.6)
where bε(2) is an arbitrary number greater than θε(2) and c1 depends on bε(2) . From (A.1) we have x x dt dt ε(1) (t) + ε(1) (x) |U (λ, x) − ε(1) (x) log λ| = − t λx λx t x (1) 1 (1) ds ε (sx) dt ε (t) = ε(1) (x) ≤ |ε(1) (x)| −1 ε(1) (x) − 1 s . ε(1) (x) t λx
λ
The conditions ab ≤ λx and λ ≤ s ≤ 1 imply ab ≤ sx ≤ x, so we can use (A.6) to get c1 |ε(1) (x)| 1 −b (2) −1 c2 |ε(1) (x)| −b (2) (λ ε − 1) s ε ds = |U (λ, x) − ε (1) (x) log λ| ≤ φε(2) (x) λ φε(2) (x) ab c2 |ε(1) (x)| −b (2) λ ε ≤ λ ≤ 1. for x ≥ ab and ≤ φε(2) (x) x
(A.7)
Applying bounds (A.3) and (A.4) and an elementary inequality |ex − 1 − x| ≤ x 2 e|x| we obtain |eU (λ,x) − 1 − U (λ, x)| ≤ U 2 (λ, x)e|U (λ,x)| ≤ c3
λ−2θε(1) log2 λ − 1 (b (1) −θ (1) ) ε , λ 2 ε φε2(1) (x)
where bε(1) > θε(1) . Combining (A.5), (A.7) and (A.8) gives L(λx) |ε(1) (x)| −b (2) log2 λ − 1 b (1) − 3 θ (1) (1) 2 2 ε ε ε + 2 λ . L(x) − 1 − ε (x) log λ ≤ c4 φ (2) (x) λ φε(1) (x) ε
(A.8)
(A.9)
On the interval (0, 1] the function log2 λ can be dominated by c(δ)λ−δ with any δ > 0. Since the number bε(1) > θε(1) is arbitrarily close to θε(1) , the number aε(1) ≡ 12 bε(1) + 32 θε(1) + δ is larger than, and arbitrarily C 2011 The Author(s). The Econometrics Journal C 2011 Royal Economic Society.
316
K. T. Mynbaev
close to, 2θε(1) . Hence, the left inequality in (2.2) and (A.9) imply λ−bε(2) λ−aε(1) 1 1 (1) + 2 + λ−b . |H (λx, x) − log λ| ≤ c5 ≤ c 6 φε(2) (x) φε(1) (x)|ε(1) (x)| φε(2) (x) φε(1) (x) Taking b > max{2θε(1) , θε(2) } and putting bε(2) = aε(1) = b we satisfy both bε(2) > θε(2) and aε(1) > 2θε(1) . The constant c6 depends on b.
A.2. Lp -approximability Here we prove an integral version of the third-order RV (2.3) that uses the notion of Lp -approximability. L EMMA A.3. Suppose that L = K(ε (1) , φε(1) , θε(1) ), ε(1) = K(ε(2) , φε(2) , θε(2) ) and ε(2) is slowly varying. Assume, further, that the μ-function of L is different from zero for all large x and satisfies the condition of type (2.19) with some constant cμ > 0. For p ∈ [1, ∞) define a vector wn ∈ Rn by wnt = n−1/p H (2) (t, n), t = 1, . . . , n. If max{2θε(1) , θε(2) } < 1/p, then {wn } is Lp -close to f (x) = log2 x. Proof: The definitions of wn and np give np wn = nt=1 H (2) (t, n)1it . This is equivalent to n equations ( np wn )(u) = H (2) (t, n) for u ∈ it , t = 1, . . . , n. The condition u ∈ it is equivalent to the condition that t is an integer satisfying t ≤ nu + 1 < t + 1 which, in turn, is equivalent to t = [nu + 1]. Hence, the above n equations take a compact form ( np wn )(u) = H (2) ([nu + 1], n), 0 ≤ u < 1. 1/p b Denote f p,(a,b) = a |f (x)|p dx , to reflect dependence on the domain of integration. Let 0 < δ ≤ 1/2 and with the number ab from Lemma A.2 put n1 ≡ ab /δ. For n > n1 the interval (ab /n, δ) is not empty and by the triangle inequality
np wn − f p,(0,1) ≤ np wn − f p,(δ,1) + f p,(0,δ) + np wn p,(0,ab /n) + np wn p,(ab /n,δ) .
(A.10)
Since |f | is integrable on (0, 1), we have f p,(0,δ) → 0 as δ → 0. For the other three terms at the right of (A.10) we consider three cases. p
Case δ ≤ u < 1. Under the conditions of this lemma (2.3) is true and therefore 1 . H (2) (rn, n) = [1 + o(1)] log2 r uniformly in r ∈ δ, 1 + 2ab
(A.11)
Defining r = [nu + 1]/n, from the inequality nu < [nu + 1] ≤ nu + 1 we have δ ≤ u < r ≤ u + 1/n < 1 + 1/n1 ≤ 1 + 1/(2ab ).
(A.12)
This leads to r = u + o(1) and r ∈ [δ, 1 + 1/(2ab )]. From these equations and (A.11) we see that H (2) ([nu + 1], n) − log2 u = o(1) uniformly in u ∈ [δ, 1) which allows us to conclude that np wn − f p,(δ,1) → 0, n → ∞. Case a b /n ≤ u<δ. Let n > n2 ≡ max{n1 , 2}. Then (A.12) and the conditions u ∈ [ab /n, δ), n > n2 imply ab /n ≤ u < r ≤ u + 1/n < δ + 1/n2 ≤ 1. This means we can successively apply Lemma A.2, (2.2), condition (2.19) and (A.12) to get (1) H (rn, n) − log r Mb r −b 1 1 ≤ + |H (2) ([nu + 1], n)| = μ(n) |μ(n)| φε(2) (n) φε(1) (n) a −b c1 r b max{|ε (1) (n)|, |ε(2) (n)|} ≤ c2 r −b ≤ c2 u−b for u ∈ ,δ , ≤ |μ(n)| n δ where b > max{2θε(1) , θε(2) }. Hence, ab /n | np wn |p du ≤ c3 δ 1−pb . Here the right-hand side tends to zero when δ → 0 if b < 1/p. This is possible because of max{2θε(1) , θε(2) } < 1/p. C 2011 The Author(s). The Econometrics Journal C 2011 Royal Economic Society.
Asymptotically collinear regressors
317
Case 0 < u < a b /n. By monotonicity the inequality [nu + 1]/n > u implies | log([nu + 1]/n)| ≤ | log u|. On the other hand, [nu + 1] ≤ nu + 1 < ab + 1 and L([nu + 1]) ≤ c by continuity of L. Hence, |H (1) ([nu + 1], n)| ≤ (c/L(n) + 1)/|ε (1) (n)| and (1) H ([nu + 1], n) log([nu + 1]/n) + |H (2) ([nu + 1], n)| ≤ μ(n) μ(n) c + L(n) log u + . ≤ L(n)ε(1) (n)μ(n) μ(n) All functions of n here are slowly varying and | log u| can be dominated by cu−a with 0 < a < 1/p. 1/p 1/p−a ab ab c + |μ(n)| → 0. This tends to zero as n → ∞ Therefore np wn p,(0,ab /n) ≤ L(n)εc+L(n) (1) (n)μ(n) n n because (a) sums, products and real powers of SV functions are SV and (b) for any a > 0 and any SV function L, the product n−a L(n) tends to zero as n → ∞. L EMMA A.4. Table 2.
Under Assumptions 2.1 and 2.2 the last row of the transition matrix is described by
Proof: We shall need the following linearity property of Lp -approximable sequences: if {wn } is Lp -close to W , {vn } is Lp -close to V , {an } and {bn } are numerical sequences converging to a and b, respectively, then {an vn + bn wn } is Lp -close to aV + bW . This follows from
np (an vn + bn wn ) − (aV + bW ) p ≤ |an − a| np vn p + |bn − b| np wn p + |a| np vn − V p + |b| np wn − W p → 0, where, by Lp -approximability, np vn p and np wn p are uniformly bounded. Non-reduced model. The functions L1 and L2 satisfy part (a) of Assumption 2.2. By Lemma A.3, where i = n−1/2 Hi(2) (t, n), t = 1, . . . , n, i = 1, 2, are we take p = 2, the sequences wn1 , wn2 with components wnt L2 -close to log2 x. By linearity {an wn1 + bn wn2 } is L2 -close to (a + b) log2 x whenever an → a, bn → b. Now we consider one by one the six cases listed in the last column of Table 2. Case I. Suppose |λδ | < ∞, β1 λδ + β2 = 0. This is exactly the model situation of Section 2.3. By (2.18) the second part of (2.14) is true and we can put a32 = δ1 , a33 = δ2 . Case II. |λδ | < ∞, β1 λδ + β2 = 0, β2 = 0. This is the indefinite case discussed in Section 2.5. Case III. |λδ | < ∞, β1 λδ + β2 = 0, β2 = 0. Obviously, from n = β1 δ1 H1(2) one has a32 = δ1 , a33 = 0. Case IV. Assume that |λδ | = ∞, β1 = 0. From the first line of (2.16) β1 β2 δ2 /δ1 n = (β1 δ1 + β2 δ2 ) H1(2) (s, n) + H2(2) (s, n) . β1 + β2 δ2 /δ1 β1 + β2 δ2 /δ1 Here β1 /(β1 + β2 δ2 /δ1 ) → 1, β2 (δ2 /δ1 )/(β1 + β2 δ2 /δ1 ) → 0. As above, by linearity (2.14) holds and the definition from Case I can be used again. Case V. Let |λδ | = ∞, β1 = 0. Obviously, n = β2 δ2 H2(2) (s, n) and the choice γn2 = β2 δ2 , H˜ (s, n) = H2(2) (s, n) satisfies (2.14) and gives a32 = 0, a33 = δ2 . Semi-reduced model (case VI). In this case by definition L1 (s) = log s, δ1 = 0, δ2 = 0. We can still apply (2.10) to L2 . For L1 we use simply L1 (s) = L1 (n) + (L1 (s) − L1 (n)) = L1 (n) + log ns . Since L1 ε1(1) ≡ 1, (2.12) is true and (2.13) formally holds with δ1 = 0. Therefore the choice is the same as in (s, n) = H2(2) (s, n). case V: γn2 = β2 δ2 , H
C 2011 The Author(s). The Econometrics Journal C 2011 Royal Economic Society.
318
K. T. Mynbaev
A.3. Proof of Theorem 2.1 Non-reduced model.
Denote ⎛
1
⎜ Xn = ⎝ . . . 1
log(1/n) ... log(n/n)
⎞ H˜ (2) (1, n) ⎟ ... ⎠ (2) ˜ H (n, n)
the matrix of regressors in (2.14). We know from (2.14) that the third column of Wn ≡ n−1/2 Xn is L2 -close to f3 . The first column of this matrix, wn = n−1/2 (1, . . . , 1) , is L2 -close to f 1 because n2 wn is identically 1 on (0, 1). Letting p = 2, k = 1 in Mynbaev (2009, Theorem 3) we see that the second column, wn = n−1/2 (log(1/n), . . . , log(n/n)) is L2 -close to f2 . By Mynbaev (2001, Theorems 3.1(b) and 4.1(D)) we have √ d Wn u(n) → N (0, σ 2 G), Wn Wn → G, where u(n) = (u1 , . . . , un ) . Now (2.20) follows from n(γˆn − γn ) = −1 (n) (Wn Wn ) Wn u . Semi-reduced model. In this case the situation is simpler because the first and second columns of Xn are the same, whereas H˜ (2) (s, n) = H2(2) (s, n).
A.4. Proof of Theorem 2.2

We need the following well-known fact. Let An be a non-singular matrix. If the parameter vector β in the linear model y = Xβ + u has been transformed as γn = An β, to obtain y = XAn^{−1}γn + u, then

γ̂n − γn = An(β̂ − β).   (A.13)
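The identity behind (A.13) is easy to see numerically. The following sketch (simulated data and an arbitrary invertible A, both ours) fits the same model in the β- and γ-parametrizations and confirms γ̂ = Aβ̂ up to rounding:

```python
import numpy as np

# Sketch (illustrative data, not from the paper): in y = X b + u,
# reparameterizing as gamma = A b and regressing y on X A^{-1} gives OLS
# estimates with gamma_hat = A b_hat exactly, i.e. gamma_hat - gamma = A (b_hat - b).
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)
A = np.array([[1.0, 0.5, 0.0], [0.0, 2.0, 1.0], [0.0, 0.0, 3.0]])  # any invertible A
b_hat = np.linalg.lstsq(X, y, rcond=None)[0]
g_hat = np.linalg.lstsq(X @ np.linalg.inv(A), y, rcond=None)[0]
print(np.max(np.abs(g_hat - A @ b_hat)))  # ~ 0 up to floating-point error
```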
It turns out that only γ̂n2 affects the limit distribution of β̂. By Phillips (2007, Lemma 7.8iii) the element g³³ in the lower right corner of G^{−1} equals 1/4. Therefore Theorem 2.1 implies

√n(γ̂n2 − γn2) →d N(0, σ²/4).   (A.14)
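The value g³³ = 1/4 can be reproduced directly if the limit functions are taken as f1 = 1, f2 = log x and f3 = log² x on (0, 1) (an assumption on the normalization of f3, consistent with the L2-closeness to log² x in Lemma A.4; the paper's exact f3 may differ). The Gram matrix then has entries ∫₀¹ log^{i+j} x dx = (−1)^{i+j}(i + j)!, and its inverse has lower-right element 1/4, matching the variance σ²/4 in (A.14):

```python
import numpy as np
from math import factorial

# Gram matrix of (1, log x, log^2 x) in L2(0,1): entry (i,j) = (-1)^(i+j) (i+j)!
# (assumed normalization of f3; illustrative check, not the paper's code).
G = np.array([[((-1) ** (i + j)) * factorial(i + j) for j in range(3)]
              for i in range(3)], dtype=float)
g33 = np.linalg.inv(G)[2, 2]
print(G)    # entries (-1)^(i+j) (i+j)!
print(g33)  # ~ 0.25
```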
We go through the eight cases itemized in Table 3.

Case A. Let us restrict our attention to the non-reduced model and in the first two cases assume that either (|λδ| < ∞, β1λδ + β2 ≠ 0) or (|λδ| = ∞, β1 ≠ 0). In terms of Table 2, we are looking at Cases I and IV. In both cases the matrix An is the same as in the Phillips analysis. From (2.7) we have det An = L1ε1(1)L2ε2(1)(μ2 − μ1). For An to be invertible, we have to require μ1 ≠ μ2 for all large n. One can check that

          ⎛ 1   (μ2/ε1(1) − μ1/ε2(1))/(μ1 − μ2)   (1/ε2(1) − 1/ε1(1))/(μ1 − μ2) ⎞
An^{−1} = ⎜ 0   −μ2/(L1ε1(1)(μ1 − μ2))            1/(L1ε1(1)(μ1 − μ2))          ⎟   (A.15)
          ⎝ 0   μ1/(L2ε2(1)(μ1 − μ2))             −1/(L2ε2(1)(μ1 − μ2))         ⎠

satisfies An^{−1}An = I and that

                        ⎛ ε1(1)(μ1 − μ2)   μ2 − μ1ε1(1)/ε2(1)   ε1(1)/ε2(1) − 1 ⎞
(μ1 − μ2)Dn(1)An^{−1} = ⎜ 0                −μ2                  1               ⎟,   (A.16)
                        ⎝ 0                μ1                   −1              ⎠
Asymptotically collinear regressors
where Dn(i) = diag[εi(1), L1ε1(1), L2ε2(1)], i = 1, 2. In the case under consideration |λε| < ∞. From (A.13), (A.15) and (A.16) we have

(μ1 − μ2)Bn(1) = √n(μ1 − μ2)Dn(1)(β̂ − β) = (μ1 − μ2)Dn(1)An^{−1} √n(γ̂n − γn).

Now take into account that εi(1), εi(2) and μi vanish at infinity by the Karamata theorem, that ε1(1)/ε2(1) → λε by assumption and that √n(γ̂n − γn) converges in distribution by Theorem 2.1. Then the preceding equation and (A.14) imply

(μ1 − μ2)Bn(1) = f(λε)√n(γ̂n2 − γn2) + op(1) →d f(λε)ξ,   ξ ∼ N(0, σ²/4).   (A.17)
In the other cases the argument is similar, and we indicate only the analogues of (A.15), (A.16) and (A.17).

Case B. In this case |λε| = ∞ and the other assumptions do not change, so we are still in cells I and IV of Table 2. To obtain the fraction ε2(1)/ε1(1) → 0, we change the diagonal matrix in (A.16), as in

                        ⎛ ε2(1)(μ1 − μ2)   μ2ε2(1)/ε1(1) − μ1   1 − ε2(1)/ε1(1) ⎞
(μ1 − μ2)Dn(2)An^{−1} = ⎜ 0                −μ2                  1               ⎟.
                        ⎝ 0                μ1                   −1              ⎠

Then

(μ1 − μ2)Bn(2) = √n(μ1 − μ2)Dn(2)An^{−1}(γ̂n − γn) = g√n(γ̂n2 − γn2) + op(1) →d gξ,   ξ ∼ N(0, σ²/4).
Case C. By Table 2, cell III, in the last row of An there is only one non-zero element,

     ⎛ 1   L1        L2      ⎞             ⎛ 1   −1/ε2(1)       (1/ε2(1) − 1/ε1(1))/μ1 ⎞
An = ⎜ 0   L1ε1(1)   L2ε2(1) ⎟,  An^{−1} = ⎜ 0   0              1/δ1                   ⎟.
     ⎝ 0   δ1        0       ⎠             ⎝ 0   1/(L2ε2(1))    −1/(μ1L2ε2(1))         ⎠

Noting that

                 ⎛ μ1ε1(1)   −μ1ε1(1)/ε2(1)   ε1(1)/ε2(1) − 1 ⎞
μ1Dn(1)An^{−1} = ⎜ 0         0                1               ⎟,
                 ⎝ 0         μ1               −1              ⎠

we get

μ1Bn(1) = √n μ1Dn(1)An^{−1}(γ̂n − γn) = f(λε)√n(γ̂n2 − γn2) + op(1) →d f(λε)ξ,   ξ ∼ N(0, σ²/4).
Case D. Here |λε| = ∞ and all other assumptions are as in Case C (Table 2, cell III). So

                 ⎛ μ1ε2(1)   −μ1   1 − ε2(1)/ε1(1) ⎞
μ1Dn(2)An^{−1} = ⎜ 0         0     1               ⎟
                 ⎝ 0         μ1    −1              ⎠

and, as a result,

μ1Bn(2) = √n μ1Dn(2)An^{−1}(γ̂n − γn) = g√n(γ̂n2 − γn2) + op(1) →d gξ,   ξ ∼ N(0, σ²/4).
Case E. We continue looking at the non-reduced model and assume that |λδ| = ∞, β1 = 0. From Table 2, cell V, we see that An is triangular,

     ⎛ 1   L1        L2      ⎞             ⎛ 1   −1/ε1(1)       (1/ε1(1) − 1/ε2(1))/μ2 ⎞
An = ⎜ 0   L1ε1(1)   L2ε2(1) ⎟,  An^{−1} = ⎜ 0   1/(L1ε1(1))    −1/(μ2L1ε1(1))         ⎟.   (A.18)
     ⎝ 0   0         δ2      ⎠             ⎝ 0   0              1/δ2                   ⎠

Suppose that |λε| < ∞. To make use of this condition, consider

                 ⎛ μ2ε1(1)   −μ2   1 − ε1(1)/ε2(1) ⎞
μ2Dn(1)An^{−1} = ⎜ 0         μ2    −1              ⎟.
                 ⎝ 0         0     1               ⎠

It follows (using the symmetry of the normal limit ξ ∼ N(0, σ²/4)) that

μ2Bn(1) = −f(λε)√n(γ̂n2 − γn2) + op(1) →d f(λε)ξ.   (A.19)
Case F. We are still in cell V of Table 2. Unlike the previous case, now we have |λε| = ∞ and use

                 ⎛ μ2ε2(1)   −μ2ε2(1)/ε1(1)   ε2(1)/ε1(1) − 1 ⎞
μ2Dn(2)An^{−1} = ⎜ 0         μ2               −1              ⎟.
                 ⎝ 0         0                1               ⎠

Hence,

μ2Bn(2) = −g√n(γ̂n2 − γn2) + op(1) →d gξ,   ξ ∼ N(0, σ²/4).   (A.20)
Cases G and H. As is clear from the proof of Lemma A.4 (case VI), the transition matrix and its inverse are the same as in (A.18). Therefore, the conclusions in (A.19) and (A.20) apply. The case μ1 = μ2 is marked indefinite because, as mentioned in Case A, the matrix An is not invertible. The case |λδ| < ∞, β1λδ + β2 = 0, β2 ≠ 0 (case II of Table 2) is also indefinite because the transition matrix is not defined.
The Econometrics Journal (2011), volume 14, pp. 321–329. doi: 10.1111/j.1368-423X.2011.00346.x
Large deviations of generalized method of moments and empirical likelihood estimators

TAISUKE OTSU†

† Cowles Foundation and Department of Economics, Yale University, P.O. Box 208281, New Haven, CT 06520-8281, USA. E-mail: [email protected]

First version received: August 2009; final version accepted: February 2011
Summary This paper studies large deviation properties of the generalized method of moments and generalized empirical likelihood estimators for moment restriction models. We consider two cases for the data generating probability measure: the model assumption and local deviations from the model assumption. For both cases, we derive conditions where these estimators have exponentially small error probabilities for point estimation. Keywords: Empirical likelihood, Generalized method of moments, Large deviations.
1. INTRODUCTION

This paper studies large deviation properties of the generalized method of moments (GMM) and generalized empirical likelihood (GEL) estimators for moment restriction models. Since Hansen (1982), there have been numerous empirical applications and theoretical studies on the GMM and related methods. If the model is just identified, we can apply the conventional method of moments estimator. The large deviation properties of this estimator have been studied elsewhere (e.g. Jensen and Wood, 1998, and Inglot and Kallenberg, 2003). If the model is over-identified, the method of moments is not directly applicable. Instead the GMM (Hansen, 1982) or GEL (Smith, 1997, and Newey and Smith, 2004) should be applied. Special cases of GEL include empirical likelihood (Qin and Lawless, 1994), continuous updating GMM (Hansen et al., 1996) and exponential tilting (Kitamura and Stutzer, 1997, and Imbens et al., 1998).1 In contrast with the literature on the method of moments estimator, and to the best of our knowledge, there is no theoretical work on the large deviation properties of the GMM and GEL estimators for the over-identified case.2 The purpose of this paper is to derive some regularity conditions that guarantee exponentially small large deviation error probabilities for the GMM and GEL estimators both when the model is correctly specified (we refer to this case as the model assumption) and also when there exist local
1 See Kitamura (2007) for a review.
2 Kitamura and Otsu (2006) proposed a large deviation minimax optimal estimator for moment restriction models, which is different from the existing GMM or GEL estimator. Our focus is on the large deviation properties of the conventional GMM and GEL estimators.
deviations or contaminations from the model assumption. The first setup serves as a benchmark. The second setup is useful to evaluate robustness of the estimators under local misspecification. It should be noted that although our large deviation results are extensions of the previous results on the method of moments estimator to the over-identified case, theoretical arguments for these extensions are not trivial because (i) the GMM estimator is defined as a minimizer of some quadratic form in the sample mean of the moment function; (ii) the objective function of the two-step GMM estimator contains the first-step estimator; and (iii) the GEL estimator is defined as a minimax solution of the GEL criterion function. Existing technical tools to analyse large deviation estimation errors are not directly applicable to these estimators. Finally, although our large deviation results are important in their own right, they can be employed as a building block for more detailed estimation error analysis. For example, Otsu (2009) used our large deviation results to derive moderate deviation rate functions for the GMM and GEL estimators.
2. MAIN RESULTS

Suppose we observe a random sample (X1n, . . . , Xnn) with support X ⊆ R^dx and wish to estimate a vector of unknown parameters θ0 ∈ Θ ⊆ R^dθ defined by the moment restrictions

E[g(X, θ0)] = ∫ g(x, θ0) dP(x) = 0,   (2.1)

where g : X × Θ → R^dg is a vector of measurable functions with dg ≥ dθ. Although our results apply to the just-identified case (i.e. dg = dθ), we focus on the over-identified case (i.e. dg > dθ). Let ĝ(θ) = (1/n) Σ_{i=1}^n g(Xin, θ). We consider the following point estimators.

• GMM estimator: θ̂1 = arg min_{θ∈Θ} ĝ(θ)'Wn ĝ(θ), with some weight matrix Wn.

• Two-step GMM estimator: θ̂2 = arg min_{θ∈Θ} ĝ(θ)'Ω̂^{−1} ĝ(θ), with Ω̂ = (1/n) Σ_{i=1}^n g(Xin, θ̂1)g(Xin, θ̂1)'.

• GEL estimator: θ̂3 = arg min_{θ∈Θ} max_{λ∈Λ} Σ_{i=1}^n ρ[λ'g(Xin, θ)], for some ρ(·).3
This paper studies large deviation properties of these estimators under the model assumption (2.1) or local deviations from the model assumption. More specifically, we consider the following data generating measure on the triangular array {(X1n, . . . , Xnn)}_{n∈N}.

3 For example, ρ(v) = −(1 + v)²/2 (continuous updating GMM), ρ(v) = log(1 − v) (empirical likelihood) and ρ(v) = −exp(v) (exponential tilting).
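As a quick sanity check (ours, not the paper's), each of the three example criteria ρ(·) in footnote 3 satisfies the normalization ρ1(0) = ρ2(0) = −1 imposed below in Assumption 2.5 (a); here the derivatives are approximated by finite differences:

```python
import math

# The three GEL criteria from footnote 3; derivatives at 0 should equal -1.
rhos = {
    "CUE": lambda v: -(1 + v) ** 2 / 2,   # continuous updating GMM
    "EL":  lambda v: math.log(1 - v),     # empirical likelihood
    "ET":  lambda v: -math.exp(v),        # exponential tilting
}
h = 1e-5
for name, rho in rhos.items():
    rho1 = (rho(h) - rho(-h)) / (2 * h)               # central first difference
    rho2 = (rho(h) - 2 * rho(0) + rho(-h)) / h ** 2   # central second difference
    print(name, round(rho1, 4), round(rho2, 4))       # -1.0 and -1.0 for each
```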
Large deviations
323
ASSUMPTION 2.1. (a) For each n ∈ N, (X1n, . . . , Xnn) is an i.i.d. sample from the measure Pn with the density dPn/dP = 1 + an An(x) with respect to P for some an → 0 and An : X → R satisfying sup_{n∈N} sup_{x∈X} |An(x)| < ∞ and ∫ An(x) dP(x) = 0. (b) There exists a unique solution θ0 ∈ Θ for ∫ g(x, θ0) dP(x) = 0.

Hereafter the expectations under P and Pn are denoted by E[·] and En[·], respectively. Assumption 2.1, adapted from Inglot and Kallenberg (2003) to the moment restriction setup, allows two cases for the data generating measure Pn: (a) model assumption (i.e. an = 0), where the data are generated from Pn = P and the moment restrictions (2.1) are satisfied; and (b) local contamination (i.e. an ≠ 0), where the data are generated from Pn ≠ P and the moment restrictions may or may not be satisfied for n ∈ N even though Pn converges to P as n → ∞.

The focus of this paper is on large deviation properties of the GMM and GEL estimators under the model assumption and local contaminations. In particular, we investigate whether these estimators have exponentially small estimation error probabilities for θ0. Exponentially small probability events and estimation error probabilities are defined as follows.

DEFINITION 2.1. (i) (ESP) A sequence of events (or subsets in X^n) {Bn}_{n∈N} has exponentially small probability under {Pn}_{n∈N} (we say 'Bn has ESP') when (a) there exist C, c > 0 such that Pn(Bn) ≤ Ce^{−cn} for all n large enough; and (b) under the model assumption, Pn = P, there exist C̃, c̃ > 0 such that P(Bn) ≤ C̃e^{−c̃n} for all n ∈ N. (ii) (ESEP) An estimator θ̂ for θ0 has exponentially small error probability (we say 'θ̂ has ESEP') when (a) the event {|θ̂ − θ0| > ε} has ESP for each ε > 0; and (b) there exists C̄ > 0 such that the event {|θ̂ − θ0| > ε or θ̂ is not unique} has ESP for each ε ∈ (0, C̄).

The following assumptions will imply that the GMM and GEL estimators have ESEP.
Let |A| = √(trace(A'A)) be the Euclidean norm of a scalar, vector or matrix A, int(B) be the interior of a set B, 'a.e.' mean 'almost every', ρ1(v) = dρ(v)/dv and ρ2(v) = d²ρ(v)/dv².

ASSUMPTION 2.2. (a) Θ is compact and θ0 ∈ int(Θ). There exist L : X → [0, ∞) and α, T1 > 0 such that |g(x, θ1) − g(x, θ2)| ≤ L(x)|θ1 − θ2|^α for all θ1, θ2 ∈ Θ and a.e. x, and E[exp(T1 L(X))] < ∞. For each θ ∈ Θ, there exists T2 > 0 satisfying E[exp(T2 |g(X, θ)|)] < ∞. (b) There exist H : X → [0, ∞), β, T3 > 0 and a neighbourhood N around θ0 such that |∂g(x, θ)/∂θ' − ∂g(x, θ0)/∂θ'| ≤ H(x)|θ − θ0|^β for all θ ∈ N and a.e. x, and E[exp(T3 H(X))] < ∞. There exists T4 > 0 satisfying E[exp(T4 |∂g(X, θ0)/∂θ'|)] < ∞. E[∂g(X, θ0)/∂θ'] has the full column rank.

ASSUMPTION 2.3. For each ε > 0, the event {|Wn − W| > ε} has ESP for some positive definite symmetric matrix W.

ASSUMPTION 2.4. For each n ∈ N, there exist T5, T6, T7 > 0 such that E[exp(T5 L(X)²)] < ∞, E[exp(T6 L(X)|g(X, θ0)|)] < ∞ and E[exp(T7 |g(X, θ0)g(X, θ0)'|)] < ∞. E[g(X, θ0)g(X, θ0)'] is positive definite.

ASSUMPTION 2.5. (a) ρ(v) is strictly concave and ρ1(0) = ρ2(0) = −1. Λ is compact and 0 ∈ int(Λ). For each θ ∈ Θ, λ̄(θ) = arg max_{λ∈Λ} E[ρ(λ'g(X, θ))] belongs to int(Λ). g(x, θ) is differentiable on Θ for a.e. x. There exists T8 > 0 satisfying E[exp(T8 |g(X, θ0)|)] < ∞. For each θ ∈ Θ, there exist T9 > 0 and neighbourhoods Nθ and N_{λ̄(θ)} around θ and λ̄(θ), respectively, satisfying E[exp(T9 sup_{ϑ∈Nθ} sup_{λ∈N_{λ̄(θ)}} |ρ1(λ'g(X, ϑ)) ∂g(X, ϑ)/∂θ'|)] < ∞. (b) There exist T10 > 0 and neighbourhoods Nρ and Nρ around θ0 and 0, respectively, satisfying E[exp(T10 sup_{θ∈Nρ} sup_{λ∈Nρ} |ρ2(λ'g(X, θ)) g(X, θ)g(X, θ)'|)] < ∞. E[g(X, θ0)g(X, θ0)'] is positive definite.
Assumption 2.2 (a) restricts the global shape of the moment function g over the parameter space Θ. The Lipschitz-type condition on g is common in the literature (e.g. Jensen and Wood, 1998) and is satisfied with α = 1 if g is differentiable on Θ for a.e. x and E[exp(T̃1 sup_{θ∈Θ} |∂g(X, θ)/∂θ'|)] < ∞ for some T̃1 > 0. The conditions for exponential moments are typically required to control large deviation probabilities. Assumption 2.2 (b), which controls the local shape of the moment function around θ0, is required only to guarantee the uniqueness of θ̂1. Assumption 2.3, a high-level assumption on the weight matrix Wn, should be checked for each choice of Wn. Assumption 2.4 is required only for the two-step GMM estimator θ̂2, to guarantee that Assumption 2.3 holds for Wn = Ω̂^{−1}. Assumption 2.5 is used for the GEL estimator. Assumption 2.5 (a) replaces Assumption 2.2 (a). The conditions on the GEL criterion function ρ are satisfied by the examples listed in Section 2. Although the technical arguments become more complicated, the compactness assumption on Λ may be avoided by adding an assumption similar to the one used by Inglot and Kallenberg (2003, Assumption (R2')), which controls the global behaviour of the objective function outside some compact set for λ. The last condition in Assumption 2.5 (a), which corresponds to the condition for L(X) in Assumption 2.2 (a), restricts the slope of the GEL objective function with respect to θ. This condition needs to be checked for each choice of ρ. Assumption 2.5 (b) contains additional assumptions to guarantee the uniqueness of the GEL estimator θ̂3. This assumption restricts the local curvature of the GEL objective function with respect to λ in a neighbourhood of 0.

Based on these assumptions, our main theorem is presented as follows.

THEOREM 2.1. (a) Under Assumptions 2.1, 2.2 and 2.3, the GMM estimator θ̂1 has ESEP. (b) Under Assumptions 2.1, 2.2, 2.3 and 2.4, the two-step GMM estimator θ̂2 has ESEP. (c) Under Assumptions 2.1, 2.2 (b) and 2.5, the GEL estimator θ̂3 has ESEP.

REMARK 2.1. Based on Definition 2.1 (ii), this theorem says that (a) under the model assumption the error probabilities of the GMM and GEL estimators are exponentially small for all sample sizes, and under local contaminations the convergence rates of the estimation error probabilities to zero are exponentially fast; and (b) the probabilities for multiple solutions of the GMM and GEL minimization problems are also exponentially small. If one wants to guarantee only Definition 2.1 (ii)-(a), Assumptions 2.2 (b) and 2.5 (b) are unnecessary.

REMARK 2.2. For the GMM estimator θ̂1 (i.e. part (a) of Theorem 2.1), Definition 2.1 (ii)-(a) is shown by verifying conditions for a general lemma to establish the ESEP property for extremum estimators (Lemma A.2). Lemma A.2 is a modification of general consistency theorems for extremum estimators (e.g. Newey and McFadden, 1994, Theorem 3.1) to our large deviation context, and thus can be applied to other contexts. On the other hand, for the GEL estimator θ̂3 (i.e. part (c) of Theorem 2.1), the minimax form of the estimator prevents us from applying directly the general lemma. Thus, we followed the proof strategy of Newey and Smith (2004,
Theorem 3.1), which effectively utilized the minimax form of the estimator. Part (b) of Theorem 2.1 is shown by verifying that the optimal weight Ω̂^{−1} satisfies Assumption 2.3.

REMARK 2.3. Although Theorem 2.1 is important in its own right, it can be used as a building block for more detailed estimation error analysis, such as formal derivations of the large deviation rate functions for the GMM and GEL estimators. To this end, we need to derive not only concrete forms of the constants in the definition of ESEP, but also lower bounds for the large deviation error probabilities (we conjecture that the lower bounds will be characterized by the Kullback–Leibler divergence between P and the set of measures satisfying the moment restrictions). Otsu (2009) used the above theorem to derive moderate deviation rate functions for the GMM and GEL estimators.
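As a small numerical illustration of the local-contamination device in Assumption 2.1 (all functional forms below are invented for the example): take P = Uniform(0, 1), An(x) = x − 1/2 (bounded, with ∫ An dP = 0) and g(x, θ) = x − θ with θ0 = 1/2. Then En[g(X, θ0)] = an E[An(X)g(X, θ0)] = an/12, so the moment restriction fails under Pn, but only at the vanishing rate an:

```python
import numpy as np

# dP_n/dP = 1 + a_n A_n(x) with P = Uniform(0,1), A_n(x) = x - 1/2,
# g(x, theta) = x - theta, theta0 = 1/2 (all choices illustrative).
# Then E_n[g(X, theta0)] = a_n * E[(x - 1/2)^2] = a_n * Var(X) = a_n / 12.
rng = np.random.default_rng(2)
x = rng.uniform(size=1_000_000)                       # draws from P
g0 = x - 0.5
for a_n in (0.2, 0.02):
    drift = np.mean((1 + a_n * (x - 0.5)) * g0)       # E_n[g] by reweighting
    print(a_n, drift, a_n / 12)                       # Monte Carlo vs. a_n/12
```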
ACKNOWLEDGMENTS The author thanks the managing editor and referees for helpful comments. Financial support from the National Science Foundation (SES-0720961) is gratefully acknowledged.
REFERENCES

Hansen, L. P. (1982). Large sample properties of generalized method of moments estimators. Econometrica 50, 1029–54.
Hansen, L. P., J. Heaton and A. Yaron (1996). Finite-sample properties of some alternative GMM estimators. Journal of Business & Economic Statistics 14, 262–80.
Imbens, G. W., R. H. Spady and P. Johnson (1998). Information theoretic approaches to inference in moment condition models. Econometrica 66, 333–57.
Inglot, T. and W. C. M. Kallenberg (2003). Moderate deviations of minimum contrast estimators under contamination. Annals of Statistics 31, 852–79.
Jensen, J. L. and A. T. A. Wood (1998). Large deviation and other results for minimum contrast estimators. Annals of the Institute of Statistical Mathematics 50, 673–95.
Kitamura, Y. (2007). Empirical likelihood methods in econometrics: theory and practice. In R. Blundell, W. K. Newey and T. Persson (Eds.), Advances in Economics and Econometrics, Theory and Applications, Ninth World Congress, Volume 3, 174–237. Cambridge: Cambridge University Press.
Kitamura, Y. and T. Otsu (2006). Minimax estimation and testing for moment condition models via large deviations. Working paper, Yale University.
Kitamura, Y. and M. Stutzer (1997). An information-theoretic alternative to generalized method of moments estimation. Econometrica 65, 861–74.
Newey, W. K. and D. L. McFadden (1994). Large sample estimation and hypothesis testing. In R. F. Engle and D. McFadden (Eds.), Handbook of Econometrics, Volume 4, 2213–45. Amsterdam: North-Holland.
Newey, W. K. and R. J. Smith (2004). Higher order properties of GMM and generalized empirical likelihood estimators. Econometrica 72, 219–55.
Otsu, T. (2009). Moderate deviations of generalized method of moments and empirical likelihood estimators. Working paper, Yale University.
Qin, J. and J. Lawless (1994). Empirical likelihood and general estimating equations. Annals of Statistics 22, 300–25.
Smith, R. J. (1997). Alternative semi-parametric likelihood approaches to generalized method of moments estimation. Economic Journal 107, 503–19.
APPENDIX: MATHEMATICAL APPENDIX

We repeatedly use the following lemma to show that events associated with means have ESP.

LEMMA A.1. Let f : X → R be a measurable function. Suppose that Assumption 2.1 (a) holds and there exists T > 0 satisfying E[exp(Tf(X))] < ∞. Then the event {(1/n)Σ_{i=1}^n f(Xin) > z} has ESP for each z ∈ (E[f(X)], ∞).

Proof: Pick any n ∈ N and z ∈ (E[f(X)], ∞). Let M(t) = E[exp{t(f(X) − z)}]. Since M(0) = 1, dM(t)/dt|_{t=0} = E[f(X)] − z < 0, and M(t) is continuous at each t ∈ [0, T], there exists t* = arg min_{t∈[0,T]} M(t) with M(t*) < 1. The Markov inequality and Assumption 2.1 (a) imply

Pn((1/n)Σ_{i=1}^n f(Xin) > z) ≤ (En[exp{t*(f(Xin) − z)}])^n ≤ {(1 + CA an)M(t*)}^n,

where CA = sup_{n∈N} sup_{x∈X} |An(x)| < ∞. Since an → 0 and M(t*) < 1, it holds (1 + CA an)M(t*) < 1 for all n large enough. Thus Definition 2.1 (a) is satisfied. The same argument with setting an = 0 guarantees Definition 2.1 (b).

The next lemma, an adaptation of Newey and McFadden (1994, Theorem 3.1), provides conditions where an extremum estimator θ̂ = arg min_{θ∈Θ} Qn(θ) has ESEP.

LEMMA A.2. Suppose that (a) Θ is compact; (b) the event {sup_{θ∈Θ} |Qn(θ) − Q0(θ)| > ε1} has ESP for each ε1 > 0; (c) the limiting objective function Q0(θ) is continuous at each θ ∈ Θ and is uniquely minimized at θ0 ∈ Θ. Then the event {|θ̂ − θ0| > ε} has ESP for each ε > 0.

Proof: Pick any n ∈ N and ε > 0. Let δ = inf_{θ∈Θ, |θ−θ0|≥ε} Q0(θ) − Q0(θ0) > 0 (by conditions (a) and (c)). Set inclusion relations imply

Pn(|θ̂ − θ0| > ε) ≤ Pn(Q0(θ̂) ≥ Q0(θ0) + δ) ≤ Pn(Q0(θ̂) ≥ Q0(θ0) + δ, sup_{θ∈Θ} |Qn(θ) − Q0(θ)| ≤ δ/3) + Pn(sup_{θ∈Θ} |Qn(θ) − Q0(θ)| > δ/3).

Since the first term is dominated by Pn(Qn(θ̂) ≥ Qn(θ0) + δ/3) = 0, condition (b) implies the conclusion.

Hereafter, let x^n = (x1n, . . . , xnn), Ĝ(θ) = (1/n)Σ_{i=1}^n ∂g(Xin, θ)/∂θ', L̂ = (1/n)Σ_{i=1}^n L(Xin), Ĥ = (1/n)Σ_{i=1}^n H(Xin), G = E[∂g(X, θ0)/∂θ'] and Ω = E[g(X, θ0)g(X, θ0)'].

Proof of Theorem 2.1.

Proof of (a). Verification of Definition 2.1 (a): To this end, we check the conditions of Lemma A.2 with Qn(θ) = ĝ(θ)'Wnĝ(θ) and Q0(θ) = E[g(X, θ)]'W E[g(X, θ)]. From Assumptions 2.1 (b), 2.2 (a) and 2.3, condition (c) of Lemma A.2 is satisfied. Since Θ is compact, it remains to check condition (b) of Lemma A.2. Now pick any ε > 0 and n ∈ N. Since the compact set Θ is covered by a finite sequence of balls {Θj}_{j=1}^J with radius εc > 0 and centres {θj}_{j=1}^J, the triangle inequality
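Lemma A.1's Chernoff-type bound can be seen in numbers. Taking f(X) ∼ N(0, 1) and an = 0, so that Pn = P (this concrete specification is ours), M(t) = exp(t²/2 − tz) is minimized at t* = z, giving the exponentially small bound P((1/n)Σ f(Xi) > z) ≤ M(t*)^n = e^{−nz²/2}:

```python
import numpy as np

# For f(X) ~ N(0,1): M(t) = E[exp{t(f(X)-z)}] = exp(t^2/2 - t z), minimized
# at t* = z, so P(mean > z) <= exp(-n z^2 / 2). Compare the bound with a
# Monte Carlo frequency (illustrative specification, not the paper's).
rng = np.random.default_rng(3)
z, n, reps = 0.5, 20, 200_000
means = rng.normal(size=(reps, n)).mean(axis=1)
freq = np.mean(means > z)
bound = np.exp(-n * z ** 2 / 2)
print(freq, bound)  # empirical frequency lies well below the Chernoff bound
```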
yields

sup_{θ∈Θ} |Qn(θ) − Q0(θ)| ≤ max_{1≤j≤J} sup_{θ∈Θj} |Qn(θ) − Qn(θj)| + max_{1≤j≤J} |Qn(θj) − Q0(θj)| + max_{1≤j≤J} sup_{θ∈Θj} |Q0(θj) − Q0(θ)| = max_{1≤j≤J} T1j + max_{1≤j≤J} T2j + max_{1≤j≤J} T3j,

where T1j, T2j and T3j are implicitly defined. Define the event

B1n = {max_{1≤j≤J} |ĝ(θj) − E[g(X, θj)]| ≤ εc, |Wn − W| ≤ εc, L̂ ≤ E[L(X)] + 1}.

From the triangle inequality and Assumptions 2.2 (a) and 2.3, there exist C1, C2 > 0 such that

T1j ≤ sup_{θ∈Θj} |(ĝ(θ) − ĝ(θj))'Wn(ĝ(θ) − ĝ(θj))| + 2 sup_{θ∈Θj} |ĝ(θj)'Wn(ĝ(θ) − ĝ(θj))| ≤ C1(εc^{2α} + εc^α),

T2j ≤ |(ĝ(θj) − E[g(X, θj)])'Wn(ĝ(θj) − E[g(X, θj)])| + 2|E[g(X, θj)]'Wn(ĝ(θj) − E[g(X, θj)])| + |E[g(X, θj)]'(Wn − W)E[g(X, θj)]| ≤ C2(εc² + 3εc),

for a.e. x^n ∈ B1n and all j = 1, . . . , J. Similarly, we obtain T3j ≤ C3(εc^{2α} + εc^α) for some C3 > 0. By choosing εc small enough to satisfy C1(εc^{2α} + εc^α) + C2(εc² + 3εc) + C3(εc^{2α} + εc^α) < ε, we obtain Pn({sup_{θ∈Θ} |Qn(θ) − Q0(θ)| > ε} ∩ B1n) = 0, which implies Pn(sup_{θ∈Θ} |Qn(θ) − Q0(θ)| > ε) ≤ Pn(B1n^c). Since B1n^c has ESP from Assumption 2.3 and Lemma A.1 (setting f(x) = L(x) and ±gl(x, θj) for l = 1, . . . , dg and j = 1, . . . , J), condition (b) of Lemma A.2 is satisfied.

Verification of Definition 2.1 (b): Pick any n ∈ N. Let B2n = {inf_{θ∈Θ, |θ−θ0|>ε} Qn(θ) > Qn(θ0)} and B3n = {|ĝ(θ0)| ≤ ε, |Ĝ(θ0) − G| ≤ ε, |Wn − W| ≤ ε, L̂ ≤ E[L(X)] + 1, Ĥ ≤ E[H(X)] + 1} for ε > 0. By a similar argument to the proof of Lemma A.2, we see that B2n^c has ESP for each ε > 0. Also, by Assumption 2.3 and Lemma A.1, B3n^c has ESP for each ε > 0. Therefore, it is sufficient for the conclusion to show that there exists C̄1 > 0 such that B2n ∩ B3n ⊆ {|θ̂1 − θ0| ≤ ε and θ̂1 is unique} for all ε ∈ (0, C̄1). Since θ0 ∈ int(Θ), we can find C̄1 > 0 such that {θ ∈ Θ : |θ − θ0| ≤ ε} ⊂ int(Θ) for all ε ∈ (0, C̄1). Note that for each ε ∈ (0, C̄1) and a.e. x^n ∈ B2n, there exists a minimum θ̂1 which solves the first-order condition Sn(θ) = Ĝ(θ)'Wnĝ(θ) = 0 with respect to θ. Thus, it is sufficient to show that there exists C̄1' ∈ (0, C̄1) satisfying B2n ∩ B3n ⊆ {Sn(θ) is one-to-one in {θ ∈ Θ : |θ − θ0| ≤ ε}} for all ε ∈ (0, C̄1'). Now, pick any ε ∈ (0, C̄1') and then pick any θ and ϑ ≠ 0 to satisfy θ, θ + ϑ ∈ {θ ∈ N : |θ − θ0| ≤ ε}, where N appears in Assumption 2.2 (b). By the triangle inequality,

|Sn(θ + ϑ) − Sn(θ)| ≥ |G'WGϑ| − |Ĝ(θ0)'WnĜ(θ0)ϑ − G'WGϑ| − |Sn(θ + ϑ) − Sn(θ) − Ĝ(θ0)'WnĜ(θ0)ϑ| = |G'WGϑ| − |A1| − |A2|,

where A1 and A2 are implicitly defined. For a.e. x^n ∈ B3n, there exists C1 > 0 such that |A1| ≤ C1ε. By Taylor expansions and Assumptions 2.2 (b) and 2.3,

|A2| ≤ |(Ĝ(θ + ϑ) − Ĝ(θ))'Wnĝ(θ0)| + |(Ĝ(θ + ϑ) − Ĝ(θ))'WnĜ(θ̃)||θ̃ − θ0| + |Ĝ(θ + ϑ̃)'WnĜ(θ + ϑ) − Ĝ(θ0)'WnĜ(θ0)||ϑ| ≤ C2(ε + ε^β),

for some C2 > 0, where θ̃ is a point on the line joining θ and θ0, and ϑ̃ is a point on the line joining ϑ and 0. Combining these results with |G'WGϑ| > 0 (because G has the full column rank and W is positive definite), for a.e. x^n ∈ B2n ∩ B3n we can find a constant C̄1' ∈ (0, C̄1) such that {θ ∈ Θ : |θ − θ0| ≤ ε} ⊆ N and |Sn(θ + ϑ) − Sn(θ)| > 0 for all ε ∈ (0, C̄1'), which implies Definition 2.1 (b).
Proof of (b): Omitted. It is obtained by showing that Ω̂^{−1} satisfies Assumption 2.3. A detailed proof is available from the author upon request.

Proof of (c). Verification of Definition 2.1 (a): Let P̂(λ, θ) = (1/n)Σ_{i=1}^n ρ(λ'g(Xin, θ)) and λ̂(θ) = arg max_{λ∈Λ} P̂(λ, θ). Pick any n ∈ N and θ ∈ Θ. We first derive some properties of λ̂(θ). Also pick ε > 0 small enough to satisfy {λ : |λ − λ̄(θ)| ≤ ε} ⊂ Λ. Define the event BR1n(θ) = {sup_{λ∈Λ, |λ−λ̄(θ)|>ε} P̂(λ, θ) < P̂(λ̄(θ), θ)}. By applying Inglot and Kallenberg (2003, Theorem 2.1), BR1n^c(θ) has ESP. Since ρ(·) is strictly concave, the maximizer λ̂(θ) exists uniquely and satisfies |λ̂(θ) − λ̄(θ)| ≤ ε for a.e. x^n ∈ BR1n(θ).

Now define Qρn(θ) = sup_{λ∈Λ} P̂(λ, θ) and B4n = {inf_{θ∈Θ, |θ−θ0|>ε} Qρn(θ) > Qρn(θ0)}. Pick any ε > 0 and n ∈ N again. Using a finite cover of {θ ∈ Θ : |θ − θ0| > ε} by a sequence of balls {Θj}_{j=1}^J with centres {θj}_{j=1}^J and radius εc, a set inclusion relation implies

inf_{θ∈Θ, |θ−θ0|>ε} Qρn(θ) ≥ min_{1≤j≤J} [ inf_{θ∈Θj, |θ−θ0|>ε} {P̂(λ̂(θj), θ) − P̂(λ̂(θj), θj)} + P̂(λ̂(θj), θj) ].

Let ρg1ij = sup_{θ∈Nθj} sup_{λ∈N_{λ̄(θj)}} |ρ1(λ'g(Xin, θ)) ∂g(Xin, θ)/∂θ'| and BR2n = {(1/n)Σ_{i=1}^n ρg1ij ≤ E[ρg1ij] + 1}. For a.e. x^n ∈ BR1n(θj) ∩ BR2n, an expansion around θ = θj yields

sup_{θ∈Θj, |θ−θ0|>ε} |P̂(λ̂(θj), θ) − P̂(λ̂(θj), θj)| ≤ C1εc,

for some C1 > 0. Also, for a.e. x^n ∈ BR1n(θj), λ̂(θj) uniquely maximizes P̂(λ, θj) with respect to λ ∈ Λ, that is, δ = P̂(λ̂(θj), θj) − ρ(0) > 0. On the other hand, for a.e. x^n ∈ BR1n(θ0),

Qρn(θ0) = ρ(0) − λ̂(θ0)'ĝ(θ0) + (1/2)λ̂(θ0)'[(1/n)Σ_{i=1}^n ρ2(λ̃'g(Xin, θ0)) g(Xin, θ0)g(Xin, θ0)']λ̂(θ0) ≤ ρ(0) + C2|ĝ(θ0)|,

for some C2 > 0, where λ̃ is a point on the line joining λ̂(θ0) and 0; the equality follows from an expansion around λ̄(θ0) = 0, and the inequality follows from the concavity of ρ and |λ̂(θ0) − λ̄(θ0)| ≤ ε for a.e. x^n ∈ BR1n(θ0). Combining these results and choosing εc small enough, there exists C3 > 0 such that |ĝ(θ0)| ≥ C3 for a.e. x^n ∈ B4n^c ∩ (∩_{j=1}^J BR1n(θj) ∩ BR1n(θ0)) ∩ BR2n. Thus, Lemma A.1 implies that B4n^c ∩ (∩_{j=1}^J BR1n(θj) ∩ BR1n(θ0)) ∩ BR2n has ESP. Since BR1n^c(θ0) and BR1n^c(θj) (j = 1, . . . , J) have ESP by Inglot and Kallenberg (2003, Theorem 2.1) and BR2n^c has ESP by Lemma A.1, B4n^c has ESP, which implies Definition 2.1 (a).

Verification of Definition 2.1 (b): Let ρg1i = sup_{θ∈Nρ} sup_{λ∈Nρ} |ρ1(λ'g(Xin, θ)) ∂g(Xin, θ)/∂θ'|, ρg2i = sup_{θ∈Nρ} sup_{λ∈Nρ} |ρ2(λ'g(Xin, θ)) g(Xin, θ)g(Xin, θ)'|, and

B5n = B4n ∩ BR1n(θ0) ∩ {|Ĝ(θ0) − G| ≤ ε, Ĥ ≤ E[H(X)] + 1, (1/n)Σ_{i=1}^n ρg1i ≤ E[ρg1i] + 1, (1/n)Σ_{i=1}^n ρg2i ≤ E[ρg2i] + 1}.

Pick any n ∈ N. Since we have already seen that B4n^c has ESP, it is sufficient to show that there exists C̄2 > 0 satisfying B5n ⊆ {|θ̂3 − θ0| ≤ ε and θ̂3 is unique} for all ε ∈ (0, C̄2). Observe that: (a) by the strict concavity of ρ(v) and compactness of Λ (Assumption 2.5 (a)), the maximum theorem implies that the maximizer λ̂(θ) is continuous in θ ∈ Nρ; and (b) Assumptions 2.1 (b) and 2.5 (a) guarantee that λ̄(θ0) = 0 ∈ int(Λ) and |λ̂(θ0) − λ̄(θ0)| = |λ̂(θ0)| ≤ ε for a.e. x^n ∈ B5n (by Inglot and Kallenberg, 2003, Theorem 2.1). Thus, for a.e. x^n ∈ B5n, we can pick a constant C̄2 > 0 such that for each ε ∈ (0, C̄2), |λ̂(θ)| ≤ |λ̂(θ) − λ̂(θ0)| + |λ̂(θ0)| ≤ C1ε for some C1 > 0 and all θ ∈ {θ ∈ Nρ : |θ − θ0| ≤ ε}. On
the other hand, from θ0 ∈ int(Θ), we can find a constant C̄2' > 0 such that {θ ∈ Nρ : |θ − θ0| ≤ ε} ⊂ int(Θ) for all ε ∈ (0, C̄2'). Combining these results, for each ε ∈ (0, min{C̄2, C̄2'}) and a.e. x^n ∈ B5n, there exists a minimum θ̂3 which solves the first-order condition Sρn(θ) = (1/n)Σ_{i=1}^n ρ1(λ̂(θ)'g(Xin, θ)) (∂g(Xin, θ)/∂θ')'λ̂(θ) = 0 with respect to θ. Thus, it is sufficient for the conclusion to show that there exists C̄3 ∈ (0, min{C̄2, C̄2'}) satisfying B5n ⊆ {Sρn(θ) is one-to-one in {θ ∈ Θ : |θ − θ0| ≤ ε}} for all ε ∈ (0, C̄3). Now, pick any ε ∈ (0, min{C̄2, C̄2'}) and then pick any θ and ϑ ≠ 0 to satisfy θ, θ + ϑ ∈ {θ ∈ Nρ : |θ − θ0| ≤ ε}. Since G'Ω^{−1}G is positive definite (Assumption 2.5 (b)) and |Sρn(θ + ϑ) − Sρn(θ)| ≥ |G'Ω^{−1}Gϑ| − |Sρn(θ + ϑ) − Sρn(θ) + G'Ω^{−1}Gϑ|, it is sufficient to show that |Sρn(θ + ϑ) − Sρn(θ) + G'Ω^{−1}Gϑ| ≤ C2ε for some C2 > 0. Observe that

|Sρn(θ + ϑ) − Sρn(θ) + G'Ω^{−1}Gϑ| ≤ |(1/n)Σ_{i=1}^n ρ1(λ̂(θ + ϑ)'g(Xin, θ + ϑ)) (∂g(Xin, θ + ϑ)/∂θ')'(λ̂(θ + ϑ) − λ̂(θ)) + G'Ω^{−1}Gϑ| + |λ̂(θ)| |(1/n)Σ_{i=1}^n ρ1(λ̂(θ + ϑ)'g(Xin, θ + ϑ)) ∂g(Xin, θ + ϑ)/∂θ'| + |λ̂(θ)| |(1/n)Σ_{i=1}^n ρ1(λ̂(θ)'g(Xin, θ)) ∂g(Xin, θ)/∂θ'| = A1 + A2 + A3,

where A1, A2 and A3 are implicitly defined. Note that there exists C3 > 0 such that A2 + A3 ≤ C3ε for a.e. x^n ∈ B5n. Also for a.e. x^n ∈ B5n, the triangle inequality implies that

A1 ≤ |(1/n)Σ_{i=1}^n ρ1(λ̂(θ + ϑ)'g(Xin, θ + ϑ)) ∂g(Xin, θ + ϑ)/∂θ' + G'| |λ̂(θ + ϑ) − λ̂(θ)| + |G'(λ̂(θ + ϑ) − λ̂(θ)) + G'Ω^{−1}Gϑ| ≤ |G| |λ̂(θ + ϑ) − λ̂(θ) + Ω^{−1}Gϑ| + C4ε,

for some C4 > 0. On the other hand, for a.e. x^n ∈ B5n, λ̂(θ) is an interior solution and satisfies the first-order condition, which is expanded as

0 = −ĝ(θ) + [(1/n)Σ_{i=1}^n ρ2(λ̃'g(Xin, θ)) g(Xin, θ)g(Xin, θ)'] λ̂(θ),

with a point λ̃ on the line joining λ̂(θ) and 0. We can obtain a similar expansion for λ̂(θ + ϑ). Thus, by an expansion around ϑ = 0, there exist C5, C6, C7 > 0 such that

|λ̂(θ + ϑ) − λ̂(θ) + Ω^{−1}Gϑ| ≤ |Ω^{−1}| |−{ĝ(θ + ϑ) − ĝ(θ)} + Gϑ| + C5ε ≤ |Ω^{−1}| |−Ĝ(θ + ϑ̃) + G||ϑ| + C5ε ≤ C6{|−Ĝ(θ0) + G| + |Ĥ|ε^β + ε} ≤ C7(ε^β + ε),

for a.e. x^n ∈ B5n, where ϑ̃ is a point on the line joining ϑ and 0. Combining these results, we verify Definition 2.1 (b).
© 2011 The Author(s). The Econometrics Journal © 2011 Royal Economic Society.
The Econometrics Journal (2011), volume 14, pp. 330–342. doi: 10.1111/j.1368-423X.2010.00338.x

Simple regression-based tests for spatial dependence

BENJAMIN BORN† AND JÖRG BREITUNG‡

†Bonn Graduate School of Economics, University of Bonn, Kaiserstrasse 1, D-53113 Bonn, Germany. E-mail: [email protected]

‡Institute of Econometrics, University of Bonn, Kaiserplatz 7-9, D-53113 Bonn, Germany. E-mail: [email protected]
First version received: October 2009; final version accepted: October 2010
Summary We propose simple and robust diagnostic tests for spatial error autocorrelation and spatial lag dependence. The idea is to reformulate the testing problem such that the outer product of gradients (OPG) variant of the LM test can be employed. Our versions of the tests are based on simple auxiliary regressions, where ordinary regression t- and F-statistics can be used to test for spatial autocorrelation and lag dependence. An important advantage of the proposed test statistics is that they are robust against heteroscedastic errors. Therefore, our approach gives practitioners an easy-to-implement and robust alternative to existing tests. Keywords: Heteroscedasticity, LM test, Spatial dependence.
1. INTRODUCTION

Recent years have seen an increasing availability of regional datasets, leading to a growing awareness of spatial dependence (see Anselin, 2007), an issue that can render ordinary least squares (OLS) estimation and inference inefficient or even biased and inconsistent (see e.g. Krämer and Donninger, 1987, Anselin, 1988b, Krämer, 2003). Arguably the most commonly used test for spatial dependence is Moran's I (see Moran, 1948, Cliff and Ord, 1972, 1981), which is based on regression residuals and which has been shown to be best locally invariant by King (1981). In a Gaussian maximum likelihood framework, Lagrange multiplier (LM) test statistics were proposed by Burridge (1980) against a spatial error alternative and by Anselin (1988a) against a spatial lag alternative and against the joint alternative of spatial lag and spatial error. We show how to compute the outer product of gradients (OPG) variants of these LM tests (see e.g. Davidson and MacKinnon, 2004, p. 427) based on a simple transformation of the spatial weight matrix. This allows us to compute the test statistics as n (the sample size) times the R² from an auxiliary regression. An important advantage of the OPG variant is that it is robust against heteroscedastic and non-normal disturbances. In an alternative regression-based approach, Baltagi and Li (2001) use Davidson and MacKinnon's (1984, 1988) double-length artificial regression approach to test for spatial error and spatial lag dependence, but this is computationally more demanding and not robust to heteroscedasticity.
Monte Carlo simulations demonstrate that (under standard assumptions) our versions of the tests perform similarly to the original LM tests. However, if the errors are heteroscedastic, the latter tests suffer from severe size distortions, whereas the OPG variants turn out to be robust against heteroscedastic error processes. The remainder of the paper is organized as follows. Section 2 reviews the existing maximum likelihood-based test procedures. The regression-based OPG variants of the LM test are presented in Section 3. Section 4 analyses the asymptotic properties of these tests. Sizes and powers in finite samples are compared in Section 5. Section 6 concludes.
2. LM TEST STATISTICS

Consider the linear spatial first-order autoregressive model with spatially autocorrelated disturbances (see e.g. Anselin, 1988b) given by
$$ y = \phi W_1^n y + X\beta + u, \qquad u = \rho W_2^n u + \varepsilon, \qquad (2.1) $$
where y is an n × 1 vector of observations on a dependent variable, X is an n × k matrix of regressors, β is the associated k × 1 vector of coefficients, φ and ρ are spatial autoregressive parameters, and ε is a vector of independent and identically normally distributed random variables.¹ $W_1^n$ and $W_2^n$ are spatial weight matrices of known constants with zero diagonals. The spatial error model is obtained by setting φ = 0, yielding
$$ y = X\beta + u, \quad \text{where } u = (I_n - \rho W_2^n)^{-1}\varepsilon. \qquad (2.2) $$
Setting ρ = 0, the linear spatial autoregressive model (2.1) with first-order autoregressive disturbances becomes the spatial lag model:
$$ y = \phi W_1^n y + X\beta + \varepsilon. \qquad (2.3) $$
Accordingly, we will consider the three null hypotheses: $H_0^a: \rho = 0$ in (2.2), $H_0^b: \phi = 0$ in (2.3), $H_0^c: \rho = 0$ and $\phi = 0$ in (2.1). Burridge (1980) shows that the LM statistic for $H_0^a$ results as²
$$ LM^a = \frac{(\hat u' W_2^n \hat u)^2}{\hat\sigma^4\,\mathrm{tr}[(W_2^n)^2 + W_2^{n\prime} W_2^n]}, \qquad (2.4) $$
where $\hat u = y - X\hat\beta$, $\hat\beta$ is the OLS estimator of β in the regression $y = X\beta + u$, and $\hat\sigma^2 = n^{-1}\hat u'\hat u$.
¹ The normality assumption is only required to derive the test statistics from the LM principle.
² Note that the square of the well-known Moran's I statistic is asymptotically equivalent to $LM^a$.
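As an illustration of (2.4), the statistic takes only a few lines to compute. The sketch below is plain Python on a toy dataset with k = 2 regressors (so the normal equations can be inverted by hand); all function names, data and the weight matrix are our own illustration, not code from the paper.

```python
# Burridge's LM^a statistic, eq. (2.4): (u'Wu / sigma^2)^2 / tr(W^2 + W'W)
def lm_a(y, X, W):
    n, k = len(y), 2
    # OLS via the normal equations, 2x2 case inverted by hand
    XtX = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(k)] for a in range(k)]
    Xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(k)]
    det = XtX[0][0] * XtX[1][1] - XtX[0][1] * XtX[1][0]
    beta = [(XtX[1][1] * Xty[0] - XtX[0][1] * Xty[1]) / det,
            (XtX[0][0] * Xty[1] - XtX[1][0] * Xty[0]) / det]
    u = [y[i] - X[i][0] * beta[0] - X[i][1] * beta[1] for i in range(n)]
    sigma2 = sum(ui * ui for ui in u) / n
    uWu = sum(u[i] * W[i][j] * u[j] for i in range(n) for j in range(n))
    # tr(W^2 + W'W) = sum_ij (w_ij * w_ji + w_ij^2)
    tr = sum(W[i][j] * W[j][i] + W[i][j] ** 2 for i in range(n) for j in range(n))
    return (uWu / sigma2) ** 2 / tr

# toy data: intercept plus one regressor; two pairs of mutual neighbours
X = [[1.0, -1.0], [1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [0.3, -0.6, 1.1, 0.9]
W = [[0.0, 1.0, 0.0, 0.0],
     [1.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 1.0],
     [0.0, 0.0, 1.0, 0.0]]
stat = lm_a(y, X, W)
assert stat >= 0.0  # a square: compare with chi-squared(1) critical values
```

Under $H_0^a$ the statistic is asymptotically $\chi^2_1$, so in practice it would be compared with, e.g., the 5% critical value 3.84.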
To test hypothesis $H_0^b$, Anselin (1988a) derives the LM test statistic for the null hypothesis φ = 0:
$$ LM^b = \frac{(\hat u' W_1^n y)^2}{\hat\sigma^4\,\mathrm{tr}[(W_1^n)^2 + W_1^{n\prime} W_1^n] + \hat\sigma^2\,\tilde y' W_1^{n\prime} M W_1^n \tilde y}, \qquad (2.5) $$
where $\tilde y = X\hat\beta$ and $M = I_n - X(X'X)^{-1}X'$.

The LM test of the joint null hypothesis $H_0^c$ is obtained as (Anselin, 1988a)
$$ LM^c = \frac{1}{\hat\sigma^4}\begin{pmatrix} \hat u' W_2^n \hat u \\ \hat u' W_1^n y \end{pmatrix}' \begin{pmatrix} \mathrm{tr}[(W_2^n)^2 + W_2^{n\prime}W_2^n] & \mathrm{tr}[W_2^n W_1^n + W_2^{n\prime}W_1^n] \\ \mathrm{tr}[W_2^n W_1^n + W_2^{n\prime}W_1^n] & \mathrm{tr}[(W_1^n)^2 + W_1^{n\prime}W_1^n] + \hat\sigma^{-2}\tilde y' W_1^{n\prime} M W_1^n \tilde y \end{pmatrix}^{-1} \begin{pmatrix} \hat u' W_2^n \hat u \\ \hat u' W_1^n y \end{pmatrix}. \qquad (2.6) $$
Although these test statistics are derived by applying the LM principle, they cannot be computed as nR² from a regression of a vector of ones on the gradients of the log-likelihood function (see Engle, 1982). For illustration, consider the gradient of the spatial error model (2.2) with respect to the parameter ρ:
$$ g(\beta, \sigma^2, \rho) = -\mathrm{tr}\big[(I_n - \rho W_2^n)^{-1} W_2^n\big] + \sigma^{-2}(y - X\beta)'(I_n - \rho W_2^n)' W_2^n (y - X\beta). $$
Inserting the estimates under the null hypothesis $H_0^a$, we obtain
$$ \hat s^a = g(\hat\beta, \hat\sigma^2, 0) \equiv \sum_{i=1}^n \hat s_i^a = \frac{1}{\hat\sigma^2}\sum_{i=1}^n \hat u_i z_i^n, \qquad (2.7) $$
where $z_i^n$ denotes the ith element of the vector $z^n = W_2^n \hat u$ and $\hat s_i^a = \hat\sigma^{-2}\hat u_i z_i^n$. It is important to note that in general $\hat s_i^a$ is (asymptotically) correlated with $\hat s_j^a$ for $i \ne j$ and, therefore, $n^{-1}\sum_{i=1}^n(\hat s_i^a)^2$ does not converge in probability to the information of the likelihood function. Hence, the usual OPG variant of the LM test is invalid.
3. REGRESSION VARIANTS

To compute the OPG variants of the LM tests, the scores are decomposed into uncorrelated components. Let us first consider the spatial error model. To focus on the main issues, we assume that β is known so that $\hat u$ is replaced by $u = y - X\beta$. Let
$$ u'z^n = \sum_{i=1}^n \sum_{j \ne i} w_{ij,2}^n u_i u_j = \sum_{i=2}^n \sum_{j=1}^{i-1}(w_{ij,2}^n + w_{ji,2}^n) u_i u_j = \sum_{i=2}^n u_i \xi_i^n, $$
where $\xi_i^n = \sum_{j=1}^{i-1}(w_{ij,2}^n + w_{ji,2}^n)u_j$ and $w_{ij,2}^n$ is the (i, j) element of the matrix $W_2^n$. Defining $\xi^n = (0, \xi_2^n, \ldots, \xi_n^n)'$, we have
$$ u'z^n = u'(C_2^n + D_2^n)u = u'(C_2^n + D_2^{n\prime})u = u'\xi^n, $$
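This decomposition can be illustrated numerically. The following plain-Python sketch (toy weights and residuals of our own choosing, not from the paper) verifies that the two quadratic forms coincide and that each $\xi_i$ depends only on $u_1, \ldots, u_{i-1}$:

```python
# Score decomposition of Section 3: xi_i = sum_{j<i} (w_ij + w_ji) u_j,
# so that u'W u = u'xi, with xi_1 = 0 by construction.
W = [[0.0, 0.3, 0.7],
     [0.5, 0.0, 0.5],
     [0.2, 0.8, 0.0]]
u = [1.0, -2.0, 0.5]
n = len(u)

# xi_i uses only u_1, ..., u_{i-1}: a martingale-difference construction
xi = [sum((W[i][j] + W[j][i]) * u[j] for j in range(i)) for i in range(n)]

uWu = sum(u[i] * W[i][j] * u[j] for i in range(n) for j in range(n))
u_xi = sum(u[i] * xi[i] for i in range(n))

assert xi[0] == 0.0                 # first element is zero by definition
assert abs(uWu - u_xi) < 1e-12      # the two quadratic forms coincide
```

The point of rewriting $u'W_2^n u$ this way is exactly the conditioning structure checked by the first assertion: each summand $u_i\xi_i^n$ is uncorrelated with the preceding ones.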
where $C_2^n$ and $D_2^n$ are the lower and upper triangular matrices such that $W_2^n = C_2^n + D_2^n$ and $\xi^n = (C_2^n + D_2^{n\prime})u$. However, there is an important difference between the two formulations of the sum $u'z^n$. Whereas $\xi_i^n$ is associated with an increasing σ-field generated by $\{u_1, \ldots, u_{i-1}\}$, this is not the case for $z_i^n = \sum_{j \ne i} w_{ij,2}^n u_j$, as this variable depends on $\{u_j \mid j \ne i\}$. This has important consequences for the variance of $u'z^n$. Specifically, under the null hypothesis we have
$$ \mathrm{Var}(u'\xi^n) = \sigma^2 E(\xi^{n\prime}\xi^n), \quad \text{but} \quad \mathrm{Var}(u'z^n) \ne \sigma^2 E(z^{n\prime}z^n). $$
If $W_2^n$ is symmetric, it is not difficult to show that $\mathrm{Var}(u'z^n) = 2\sigma^2 E(z^{n\prime}z^n)$. The factor 2 results from the fact that, due to the symmetric nature of the sum, the product $u_i u_j$ occurs two times for each combination of i and j. We therefore suggest to use $\xi^n$ instead of $z^n = W_2^n u$ for constructing the test statistic. Using these results, the scores (2.7) are represented as
$$ s^a = \frac{1}{\sigma^2} u' W_2^n u = \sum_{i=2}^n s_i^a = \frac{1}{\sigma^2}\sum_{i=2}^n u_i \xi_i^n, \qquad (3.1) $$
where $\xi_i^n$ is the ith element of the vector $\hat\xi^n = (C_2^n + D_2^{n\prime})\hat u$ and $\hat s_i^a = \hat\sigma^{-2}\hat u_i \hat\xi_i^n$. Since $\hat s_i^a$ is (asymptotically) uncorrelated with $\hat s_j^a$ for $i \ne j$, we can construct the OPG variant of the LM statistic as
$$ \widehat{LM}^a = \frac{\big(\sum_{i=1}^n \hat s_i^a\big)^2}{\sum_{i=1}^n (\hat s_i^a)^2}. \qquad (3.2) $$
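A minimal sketch of (3.2) in plain Python. Note that the factor $\hat\sigma^{-2}$ in $\hat s_i^a$ cancels between numerator and denominator, so only the products $\hat u_i\hat\xi_i^n$ are needed; the statistic is therefore invariant to rescaling the residuals. Toy residuals and weights of our own choosing:

```python
# OPG variant (3.2): (sum_i u_i*xi_i)^2 / sum_i (u_i*xi_i)^2
def lm_hat_a(u, W):
    n = len(u)
    # xi_i = sum_{j<i} (w_ij + w_ji) u_j, the martingale-difference regressor
    xi = [sum((W[i][j] + W[j][i]) * u[j] for j in range(i)) for i in range(n)]
    s = [u[i] * xi[i] for i in range(n)]
    return sum(s) ** 2 / sum(si ** 2 for si in s)

u = [1.0, -2.0, 0.5, 1.5]
W = [[0.0, 0.3, 0.7, 0.0],
     [0.5, 0.0, 0.5, 0.0],
     [0.2, 0.8, 0.0, 0.0],
     [0.4, 0.3, 0.3, 0.0]]
stat = lm_hat_a(u, W)
assert stat >= 0.0
# sigma^2 cancels: rescaling the residuals leaves the statistic unchanged
assert abs(lm_hat_a([3.0 * x for x in u], W) - stat) < 1e-10
```

Equivalently, the same number is obtained as the squared heteroscedasticity-robust t-statistic in the auxiliary regression of $\hat u_i$ on $\hat\xi_i^n$, as discussed next.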
The test statistic $\widehat{LM}^a$ can be seen as a heteroscedasticity-robust version of the squared t-statistic in the least-squares regression $\hat u_i = \rho^*\hat\xi_i^n + e_i$, where the estimated variance of the estimator $\hat\rho^*$ is replaced by the estimator
$$ \widehat{\mathrm{Var}}(\hat\rho^*) = \frac{\sum_{i=2}^n (\hat\xi_i^n)^2 \hat u_i^2}{\big(\sum_{i=1}^n (\hat\xi_i^n)^2\big)^2}. $$
This estimator is similar to the heteroscedasticity-robust variance estimator suggested by Eicker (1963, 1967) and White (1980), where the residuals are estimated under the null hypothesis.

For $H_0^b$ the scores result as
$$ s^b = \frac{1}{\hat\sigma^2}\hat u' W_1^n y = \frac{1}{\hat\sigma^2}\big(\hat u' W_1^n \hat u + \hat u' M W_1^n \tilde y\big) = \frac{1}{\hat\sigma^2}\hat u'\hat\zeta^n, \qquad (3.3) $$
where
$$ \hat\zeta^n = (C_1^n + D_1^{n\prime})\hat u + M W_1^n \tilde y $$
and $C_1^n$ and $D_1^n$ are the lower and upper triangular matrices such that $W_1^n = C_1^n + D_1^n$. Note that we have introduced the projection matrix M in the last term of (3.3). Due to the idempotency of M,
this matrix does not affect the product $\hat u' W_1^n \tilde y$ (since $\hat u' M = \hat u'$). However, introducing the matrix M yields a consistent estimator of the asymptotic variance (see the proof of Proposition 4.1 for more details). We can now form the OPG variant of the score statistic as
$$ \widehat{LM}^b = \frac{\big(\sum_{i=1}^n \hat u_i \hat\zeta_i^n\big)^2}{\sum_{i=1}^n (\hat u_i \hat\zeta_i^n)^2}. \qquad (3.4) $$
Finally, for the hypothesis $H_0^c$ the scores are given by
$$ \hat s^c = \frac{1}{\hat\sigma^2}\begin{pmatrix} \hat u' W_2^n \hat u \\ \hat u' W_1^n y \end{pmatrix} = \frac{1}{\hat\sigma^2}\sum_{i=2}^n \hat u_i \hat\Upsilon_i^n, $$
where $\hat\Upsilon_i^n = (\hat\xi_i^n, \hat\zeta_i^n)'$. The OPG variant of the score statistic results as
$$ \widehat{LM}^c = \Big(\sum_{i=1}^n \hat u_i \hat\Upsilon_i^n\Big)'\Big(\sum_{i=1}^n \hat u_i^2 \hat\Upsilon_i^n \hat\Upsilon_i^{n\prime}\Big)^{-1}\Big(\sum_{i=1}^n \hat u_i \hat\Upsilon_i^n\Big), \qquad (3.5) $$
which is equivalent to nR² obtained from a regression of a constant on $\hat u_i\hat\Upsilon_i^n$.
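The nR² equivalence can be checked numerically. The sketch below uses made-up gradient products $g_i = (\hat u_i\hat\xi_i^n, \hat u_i\hat\zeta_i^n)$ (all numbers and names are our own illustration): regressing a vector of ones on the two gradient columns and forming the uncentred R² reproduces the quadratic form (3.5), because $nR^2 = S'(G'G)^{-1}S$ with $S$ the column sums of $G$.

```python
# Verify nR^2 = S'(G'G)^{-1}S on toy gradient products (2 columns, so the
# 2x2 inverse is written out by hand).
g = [(1.0, 0.5), (-0.8, 1.2), (0.3, -0.4), (0.7, 0.9)]   # stand-in gradients
n = len(g)
S = [sum(gi[0] for gi in g), sum(gi[1] for gi in g)]      # column sums G'1
A = [[sum(gi[a] * gi[b] for gi in g) for b in range(2)] for a in range(2)]  # G'G
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]

# quadratic form (3.5): S' (G'G)^{-1} S
lm_c = (S[0] * (A[1][1] * S[0] - A[0][1] * S[1])
        + S[1] * (-A[1][0] * S[0] + A[0][0] * S[1])) / det

# n * R^2 from regressing a vector of ones on G (uncentred R^2):
b = [(A[1][1] * S[0] - A[0][1] * S[1]) / det,
     (-A[1][0] * S[0] + A[0][0] * S[1]) / det]             # (G'G)^{-1} G'1
yhat = [gi[0] * b[0] + gi[1] * b[1] for gi in g]
r2 = sum(y * y for y in yhat) / n                          # yhat'yhat / 1'1

assert abs(lm_c - n * r2) < 1e-10
```

The identity holds for any number of gradient columns; with two columns the statistic is compared with $\chi^2_2$ critical values.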
4. ASYMPTOTIC PROPERTIES

In the previous section, some regression variants of the LM test statistics were suggested. These test statistics are equivalent to heteroscedasticity-robust t- and F-statistics of the following regressions:
$$ H_0^a: \rho^* = 0 \ \text{in}\ \hat u_i = \rho^*\hat\xi_i^n + e_i, \qquad (4.1) $$
$$ H_0^b: \phi^* = 0 \ \text{in}\ \hat u_i = \phi^*\hat\zeta_i^n + e_i, \qquad (4.2) $$
$$ H_0^c: \rho^* = 0 \ \text{and}\ \phi^* = 0 \ \text{in}\ \hat u_i = \rho^*\hat\xi_i^n + \phi^*\hat\zeta_i^n + e_i, \qquad (4.3) $$
where $\hat u_i = y_i - x_i'\hat\beta$ and $\hat\beta$ denotes the OLS estimator of β. The regressors $\hat\xi_i^n$ and $\hat\zeta_i^n$ are defined in Section 3. To analyse the asymptotic properties, we make the following assumptions:

ASSUMPTION 4.1. (a) The errors $\varepsilon_i$ are independent random variables with $E(\varepsilon_i) = 0$, $E(\varepsilon_i^2) = \sigma_i^2 < c < \infty$ and $E(|\varepsilon_i|^{4+\delta}) < \infty$ for all i and some δ > 0. (b) The vector $x_i$ is a k × 1 vector of constants with $\lim_{n\to\infty} n^{-1}\sum_{i=1}^n x_i x_i' = C_X$ (positive definite).

ASSUMPTION 4.2. (a) The diagonal elements of $W_h^n = (w_{ij,h}^n)$ are zero. (b) All row and column sums of $W_h^n$ and $W_h^{n\prime} W_h^n$ are uniformly bounded for all n and $h \in \{1, 2\}$.

These assumptions are standard in the asymptotic analysis of spatial models (e.g. Kelejian and Prucha, 2001, Lee, 2007).
The following proposition states that the OPG variants of the LM tests suggested in Section 3 possess the usual asymptotic distributions even if the errors are heteroscedastic. Furthermore, if the errors are homoscedastic, the test statistics are asymptotically equivalent to the original LM tests suggested by Burridge (1980) and Anselin (1988a).

PROPOSITION 4.1. (a) Under Assumptions 4.1 and 4.2 and hypotheses $H_0^a$, $H_0^b$ and $H_0^c$, we have $\widehat{LM}^a \overset{d}{\to} \chi_1^2$, $\widehat{LM}^b \overset{d}{\to} \chi_1^2$ and $\widehat{LM}^c \overset{d}{\to} \chi_2^2$. (b) If $\sigma_1^2 = \cdots = \sigma_n^2$ (homoscedastic errors), it follows that $\widehat{LM}^a - LM^a \overset{p}{\to} 0$, $\widehat{LM}^b - LM^b \overset{p}{\to} 0$ and $\widehat{LM}^c - LM^c \overset{p}{\to} 0$.

An important implication of this proposition is that if the errors are homoscedastic, the LM tests can be performed by using the ordinary t-statistics for $\rho^* = 0$ or $\phi^* = 0$ in (4.1) and (4.2). The hypothesis $H_0^c$ can be tested by computing the F-statistic of the joint hypothesis in (4.3).
5. MONTE CARLO SIMULATIONS

In this section, we conduct a small Monte Carlo study to demonstrate the finite sample properties of our new tests and investigate their relative performance compared to the original LM approaches of Burridge (1980) and Anselin (1988a).³

We simulate three different models. Models (2.2) and (2.3) are employed to evaluate the spatial error test and the spatial lag test, respectively. The test for the joint hypothesis is based on model (2.1). The matrix of exogenous regressors, X, contains two regressors, $x_1$ and $x_2$, with associated parameters $\beta_1$ and $\beta_2$, where $\beta = (1, 1)'$. $x_1$ is a vector of ones and the elements of $x_2$ are drawn independently from a standard normal distribution. The elements of the vector ε are generated as independent normally distributed random variables such that $E(\varepsilon\varepsilon') = I$. Furthermore, we set $W_1 = W_2$.

Our weight matrix design closely follows Arraiz et al. (2010).⁴ The authors use a setup that mimics the spacing of US states, in which units located in the northeast portion of the model space are closer to each other and have more neighbours than the units in the other three quadrants. They refer to a weight matrix defined in such a way as a northeast modified rook matrix. We choose a specification where the share of units located in the northeast is approximately 75%.⁵ The distance between any two units is defined as the Euclidean distance $d(i_1, i_2) = [(x_1 - x_2)^2 + (y_1 - y_2)^2]^{1/2}$. The elements of the row-normalized weighting matrix are then defined as
$$ w_{ij} = \frac{w_{ij}^*}{\sum_{j=1}^n w_{ij}^*}, \quad \text{where } w_{ij}^* = \begin{cases} 1 & \text{if } 0 < d(i_1, i_2) \le 1, \\ 0 & \text{else.} \end{cases} $$

³ There are alternative tests developed for the heteroscedastic case. Kelejian and Robinson (1998) propose a joint test for spatial error dependence and heteroscedasticity, where the variance is a (possibly unknown) function of explanatory variables. Kelejian and Robinson (2004) propose a heteroscedasticity-robust version of Moran's I that is based on a consistent estimator of $\sigma_i^2$.
Since we focus on tests that do not require any knowledge about the variance function, we do not include these alternative tests in our Monte Carlo experiment. 4 See the online appendix (http://www.ect.uni-bonn.de/spatialtest webappendix.pdf) for additional results using alternative weight matrices. 5 See Arraiz et al. (2010) for a detailed description of this weight matrix design.
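The binary-distance rule and row normalization just described can be sketched in a few lines. The coordinates below are our own toy example, not the Arraiz et al. (2010) design, but the construction of $w_{ij}$ is the same:

```python
# Row-normalized, distance-based spatial weight matrix:
# w*_ij = 1 if 0 < d(i,j) <= 1, else 0; then each row is divided by its sum.
coords = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.9), (3.0, 3.0)]  # unit 4 is isolated
n = len(coords)

def dist(a, b):
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

w_star = [[1.0 if i != j and 0 < dist(coords[i], coords[j]) <= 1 else 0.0
           for j in range(n)] for i in range(n)]

W = []
for row in w_star:
    s = sum(row)
    # rows with no neighbours are left as zeros
    W.append([x / s if s > 0 else 0.0 for x in row])

for i in range(n):
    assert W[i][i] == 0.0                      # zero diagonal (Assumption 4.2)
    s = sum(W[i])
    assert s == 0.0 or abs(s - 1.0) < 1e-12    # rows sum to one (or zero)
```

In the toy layout, units 1–3 sit close together (unit 1 has two neighbours, each weighted 0.5) while unit 4 has none, mimicking in miniature the uneven neighbour counts of the northeast design.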
Table 1. Empirical sizes under heteroscedasticity, 5% level.

| n   | $LM^a$ | $\widehat{LM}^a$ | $LM^b$ | $\widehat{LM}^b$ | $LM^c$ | $\widehat{LM}^c$ | $LM^a_{boot}$ | $LM^b_{boot}$ | $LM^c_{boot}$ |
|-----|--------|------------------|--------|------------------|--------|------------------|---------------|---------------|---------------|
| 105 | 0.229  | 0.050            | 0.243  | 0.045            | 0.399  | 0.047            | 0.052         | 0.049         | 0.066         |
| 166 | 0.235  | 0.044            | 0.243  | 0.046            | 0.406  | 0.041            | 0.051         | 0.053         | 0.057         |
| 241 | 0.265  | 0.044            | 0.267  | 0.050            | 0.457  | 0.045            | 0.046         | 0.054         | 0.063         |
| 486 | 0.305  | 0.048            | 0.310  | 0.048            | 0.494  | 0.047            | 0.050         | 0.049         | 0.053         |
| 974 | 0.353  | 0.051            | 0.355  | 0.050            | 0.581  | 0.042            | 0.051         | 0.049         | 0.050         |

Note: Empirical sizes are calculated using 5000 replications.
The original LM tests remain valid under heteroscedasticity as long as the heteroscedasticity is not itself spatially correlated (Kelejian and Robinson, 2004). In practice, however, it is reasonable to assume that the heteroscedasticity possesses a spatial pattern. We therefore introduce a disturbance $\psi_i = \varepsilon_i x_{2i}$ with a 'medium' extent of heteroscedasticity (see Kelejian and Robinson, 1998), where the spatial correlation in the heteroscedasticity is induced by the sorted vector $x_2$.

Table 1 presents the empirical sizes obtained from the Monte Carlo simulations under heteroscedasticity. Not shown here are the empirical sizes under homoscedasticity, which are all close to the nominal size. The results change considerably in the presence of heteroscedastic errors, as the original LM tests are now strongly oversized. Our OPG variants, on the other hand, do not exhibit any notable size distortions. An alternative approach to produce heteroscedasticity-robust test statistics is to bootstrap the original LM tests.⁶ We report the empirical sizes of the bootstrapped LM tests in the last three columns of Table 1.

Size-corrected power curves for the tests are depicted in Figure 1.⁷ The left column of plots shows size-corrected power curves of original and OPG versions of the LM tests under homoscedasticity. The OPG variants are nearly as powerful as the original LM tests. In the right column, we only plot the size-corrected power curves of our proposed OPG variants, as the original LM tests suffer from massive size distortions under heteroscedasticity.
6. CONCLUSION In this paper, we propose simple and robust diagnostic tests for spatial error autocorrelation and spatial lag dependence. We reformulate the testing problem such that the OPG variant of the LM tests can be employed. Our versions of the tests are based on simple auxiliary regressions, where ordinary regression t- and F-statistics can be used to test for spatial autocorrelation and lag dependence. We show that these tests are asymptotically equivalent to the existing LM tests, yet simpler to implement. An important advantage of the proposed test statistics is that they are robust against heteroscedastic errors. 6 We thank an anonymous referee for suggesting the bootstrap test. Specifically, we employ the wild bootstrap (Liu, 1988). In this approach, the true OLS residuals ui are replaced in the bootstrap DGP by u∗i = ui εi , where ε i = 1 with probability 0.5 and ε i = −1 with probability 0.5 (see Davidson and Flachaire, 2008). 7 As pointed out by Kr¨ amer (2005) and Martellosio (2010), the power of spatial autocorrelation tests can drop to zero for some combinations of X and W n2 . Since the regression-based test is asymptotically equivalent to Moran’s I, our test suffers from the same deficiency.
[Figure 1. Size-corrected power under homo- and heteroscedasticity (n = 241). Six panels of power curves.]
Monte Carlo simulations suggest that our new tests have good size properties, even under heteroscedasticity, where the original LM tests suffer from size distortions. Hence, we believe that the proposed tests will give researchers a robust and easily implementable tool for their applied work.
ACKNOWLEDGMENTS

The authors thank the editor, Richard J. Smith, and two anonymous referees for their suggestions. The authors also thank Peter Burridge, Raymond Florax, Walter Krämer, and participants at ESEM 2009 and the 3rd World Conference of Spatial Econometrics for helpful comments. Benjamin Born gratefully acknowledges financial support by the German Research Foundation (DFG).
REFERENCES

Anselin, L. (1988a). Lagrange multiplier test diagnostics for spatial dependence and spatial heterogeneity. Geographical Analysis 20, 1–17.
Anselin, L. (1988b). Spatial Econometrics: Methods and Models. Dordrecht: Kluwer.
Anselin, L. (2007). Spatial econometrics in RSUE: retrospect and prospect. Regional Science and Urban Economics 37, 450–56.
Arraiz, I., D. M. Drukker, H. H. Kelejian and I. R. Prucha (2010). A spatial Cliff–Ord-type model with heteroskedastic innovations: small and large sample results. Journal of Regional Science 50, 592–614.
Baltagi, B. and D. Li (2001). Double length artificial regressions for testing spatial dependence. Econometric Reviews 20, 31–40.
Burridge, P. (1980). On the Cliff–Ord test for spatial autocorrelation. Journal of the Royal Statistical Society, Series B, 42, 107–08.
Cliff, A. D. and J. K. Ord (1972). Testing for spatial autocorrelation among regression residuals. Geographical Analysis 4, 267–84.
Cliff, A. D. and J. K. Ord (1981). Spatial Processes: Models and Applications. London: Pion.
Davidson, R. and E. Flachaire (2008). The wild bootstrap, tamed at last. Journal of Econometrics 146, 162–69.
Davidson, R. and J. G. MacKinnon (1984). Model specification tests based on artificial linear regressions. International Economic Review 25, 485–502.
Davidson, R. and J. G. MacKinnon (1988). Double-length artificial regressions. Oxford Bulletin of Economics and Statistics 50, 203–17.
Davidson, R. and J. G. MacKinnon (2004). Econometric Theory and Methods. New York: Oxford University Press.
Eicker, F. (1963). Asymptotic normality and consistency of the least squares estimators for families of linear regressions. Annals of Mathematical Statistics 34, 447–56.
Eicker, F. (1967). Limit theorems for regressions with unequal and dependent errors. In L. M. LeCam and J. Neyman (Eds.), Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Theory of Statistics, 59–82. Berkeley, CA: University of California Press.
Engle, R. F. (1982). A general approach to Lagrange multiplier model diagnostics. Journal of Econometrics 20, 83–104.
Kelejian, H. H. and I. R. Prucha (2001). On the asymptotic distribution of the Moran I test statistic with applications. Journal of Econometrics 104, 219–57.
Kelejian, H. H. and D. P. Robinson (1998). A suggested test for spatial autocorrelation and/or heteroskedasticity and corresponding Monte Carlo results. Regional Science and Urban Economics 28, 389–417.
Kelejian, H. H. and D. P. Robinson (2004). The influence of spatially correlated heteroskedasticity on tests for spatial correlation. In L. Anselin, R. J. G. M. Florax and S. J. Rey (Eds.), Advances in Spatial Econometrics: Methodology, Tools and Applications, 79–98. Berlin: Springer.
King, M. (1981). A small sample property of the Cliff–Ord test for spatial correlation. Journal of the Royal Statistical Society, Series B, 43, 263–64.
Krämer, W. (2003). The robustness of the F-test to spatial autocorrelation among regression disturbances. Statistica 3, 435–40.
Krämer, W. (2005). Finite sample power of Cliff–Ord-type tests for spatial disturbance correlation in linear regression. Journal of Statistical Planning and Inference 128, 489–96.
Krämer, W. and C. Donninger (1987). Spatial autocorrelation among errors and the relative efficiency of OLS in the linear regression model. Journal of the American Statistical Association 82, 577–79.
Lee, L.-F. (2007). GMM and 2SLS estimation of mixed regressive, spatial autoregressive models. Journal of Econometrics 137, 489–514.
Liu, R. Y. (1988). Bootstrap procedures under some non-i.i.d. models. Annals of Statistics 16, 1696–708.
Martellosio, F. (2010). Power properties of invariant tests for spatial autocorrelation in linear regression. Econometric Theory 26, 152–86.
Moran, P. (1948). The interpretation of statistical maps. Biometrika 35, 255–60.
White, H. (1980). A heteroscedasticity-consistent covariance matrix estimator and a direct test for heteroscedasticity. Econometrica 48, 817–38.
APPENDIX: PROOF OF PROPOSITION 4.1

(a) We first assume that β is known such that $u_i = y_i - x_i'\beta$, $u = (u_1, \ldots, u_n)'$, and
$$ u' W_2^n u = \sum_{i=2}^n \Big(\sum_{j=1}^{i-1}(w_{ij,2}^n + w_{ji,2}^n) u_j\Big) u_i = \sum_{i=2}^n u_i \xi_i^n, $$
where $\xi_i^n$ is the ith element of the vector $\xi^n = (C_2^n + D_2^{n\prime})u$. It is important to note that although the sequence $\{(u_2\xi_2^n), \ldots, (u_n\xi_n^n)\}$ is different for a different ordering of the cross-section units and the respective weights, the sum $\sum u_i\xi_i^n$ is invariant to the ordering of the units i. Thus, since we are only interested in the distribution of $\sum u_i\xi_i^n$, the ordering of the units does not matter for our asymptotic results.⁸

Let $\mathcal{F}_i$ be the increasing σ-algebra generated by $\{u_1, \ldots, u_i\}$ and $Z_n = \sum_{i=2}^n u_i\xi_i^n$. Note that the summands $u_i\xi_i^n$ form a martingale difference sequence with respect to the filtration $\{\mathcal{F}_i\}$. From Assumptions 4.1 and 4.2 it follows that⁹
$$ \frac{1}{n} Z_n \overset{p}{\to} 0, $$
$$ \frac{1}{n}\sum_{i=2}^n u_i^2(\xi_i^n)^2 \overset{p}{\to} \lim_{n\to\infty}\frac{1}{n}\sum_{i=2}^n \sum_{j=1}^{i-1}(w_{ij,2}^n + w_{ji,2}^n)^2 \sigma_j^2 \sigma_i^2 \equiv s_Z^2, $$
$$ \frac{1}{\sqrt{n}} Z_n \overset{d}{\to} N(0, s_Z^2). $$
It is not difficult to see that the limiting distribution does not change if $u_i$ is replaced by the OLS residual $\hat u_i$. To see this, consider
$$ \hat Z_n = \sum_{i=2}^n \hat u_i\hat\xi_i^n = \sum_{i=2}^n \Big(\sum_{j=1}^{i-1}(w_{ij,2}^n + w_{ji,2}^n)\hat u_j\Big)\hat u_i. $$
Using $\hat u = u - X(\hat\beta - \beta)$, we obtain
$$ \hat Z_n = Z_n + (\hat\beta - \beta)' X' W_2^n X(\hat\beta - \beta) - (\hat\beta - \beta)' X'(W_2^n + W_2^{n\prime})u. $$
Assumptions 4.1 and 4.2 imply that $\hat\beta - \beta$ is $O_p(n^{-1/2})$ and, therefore,
$$ \frac{1}{\sqrt{n}}\hat Z_n = \frac{1}{\sqrt{n}} Z_n + O_p(n^{-1/2}). $$
8 As pointed out by a referee, this invariance property may be lost if the weight matrices are renormalized for different orderings. 9 Using different techniques, a similar result is derived by Kelejian and Prucha (2001, Theorem 1).
In a similar manner, it can be shown that
$$ \frac{1}{n}\sum_{i=2}^n \hat u_i^2(\hat\xi_i^n)^2 \overset{p}{\to} s_Z^2. $$
It follows that
$$ \widehat{LM}^a = \frac{\hat Z_n^2}{\sum \hat u_i^2(\hat\xi_i^n)^2} \overset{d}{\to} \chi_1^2. $$

Regarding the asymptotic distribution of $\widehat{LM}^b$, we first consider the case that β is known. Let $\zeta^n$ be constructed as $\hat\zeta^n$, where $\hat\beta$ is replaced by β:
$$ Z_n^* = u'\zeta^n = u' W_1^n u + u' W_1^n X\beta = u'(C_1^n + D_1^{n\prime})u + u' W_1^n X\beta = \sum_{i=1}^n u_i(\xi_i + \mu_i), $$
where $\mu_i$ is the ith element of the vector $\mu = W_1^n X\beta$. It follows that
$$ \frac{1}{n}\mathrm{Var}(Z_n^*) \to s_Z^2 + \lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^n \mu_i^2\sigma_i^2 \equiv s_{Z^*}^2 $$
and
$$ \frac{1}{\sqrt{n}} Z_n^* \overset{d}{\to} N(0, s_{Z^*}^2). $$
Using similar arguments as for $\hat Z_n$, we can show that
$$ \frac{1}{\sqrt{n}}\hat Z_n^* = \frac{1}{\sqrt{n}} Z_n^* + O_p(n^{-1/2}), $$
where $\hat Z_n^*$ is constructed as $Z_n^*$, with β replaced by $\hat\beta$. However, some caution is necessary to derive the variance
$$ \mathrm{Var}(\hat Z_n^*) = \mathrm{Var}(u' W_1^n u) + 2E\big(u' W_1^n u \cdot u' W_1^n X\hat\beta\big) + E\big(\hat\beta' X' W_1^{n\prime} uu' W_1^n X\hat\beta\big). $$
Let $\hat\mu_i$ denote the ith element of the vector $\hat\mu = W_1^n X\hat\beta$. Then,
$$ \frac{1}{n} E\big(u' W_1^n u \cdot u' W_1^n X\hat\beta\big) = \frac{1}{n} E\Big[\Big(\sum_{i=2}^n u_i\xi_i\Big)\Big(\sum_{i=1}^n u_i\hat\mu_i\Big)\Big] = E\Big[\Big(\frac{1}{\sqrt{n}}\sum_{i=2}^n u_i\xi_i + O_p(n^{-1/2})\Big)\Big(\frac{1}{\sqrt{n}}\sum_{i=1}^n u_i\mu_i + O_p(n^{-1/2})\Big)\Big] = O(n^{-1/2}). $$
Finally,
$$ \frac{1}{n} E\big(\hat\beta' X' W_1^{n\prime} uu' W_1^n X\hat\beta\big) = \frac{1}{n}\beta' X' W_1^{n\prime} M\,E(uu')\,M W_1^n X\beta + O(n^{-1/2}) = \frac{1}{n}\sum_{i=1}^n \tilde\mu_i^2\sigma_i^2 + O(n^{-1/2}), $$
where $\tilde\mu_i$ is the ith element of the vector $\tilde\mu = M W_1^n X\beta$. It follows that
$$ \hat s_{Z^*}^2 = \frac{1}{n}\sum_{i=1}^n \hat u_i^2(\hat\zeta_i^n)^2, $$
with $\hat\zeta_i^n$ as defined in (3.3), converges to the limiting variance of $n^{-1/2}\hat Z_n^*$. Hence,
$$ \widehat{LM}^b = \frac{(\hat Z_n^*)^2}{\sum_{i=1}^n \hat u_i^2(\hat\zeta_i^n)^2} \overset{d}{\to} \chi_1^2. $$
Finally, using these results, it is easy to verify that
$$ \widehat{LM}^c = \hat Y_n'\Big(\sum_{i=1}^n \hat u_i^2 \hat\Upsilon_i^n\hat\Upsilon_i^{n\prime}\Big)^{-1}\hat Y_n \overset{d}{\to} \chi_2^2, $$
where $\hat Y_n = (\hat Z_n, \hat Z_n^*)'$ and $\hat\Upsilon_i^n = (\hat\xi_i^n, \hat\zeta_i^n)'$.

(b) If the errors are homoscedastic, the LM statistic suggested by Burridge (1980) has the asymptotic representation
$$ LM^a = \frac{(u' W_2^n u)^2}{\sigma^4\,\mathrm{tr}[(W_2^n)^2 + W_2^{n\prime} W_2^n]} + o_p(1). $$
To analyse the asymptotic distribution of the statistic $\widehat{LM}^a$ under homoscedasticity, we first note that
$$ \mathrm{tr}\big[(W_2^n)^2 + W_2^{n\prime} W_2^n\big] = \mathrm{tr}\big[(C_2^n + D_2^n)^2 + (C_2^n + D_2^n)'(C_2^n + D_2^n)\big] = \mathrm{tr}(C_2^{n\prime} C_2^n) + \mathrm{tr}(D_2^{n\prime} D_2^n) + 2\,\mathrm{tr}(D_2^n C_2^n) = \mathrm{tr}\big[(C_2^n + D_2^{n\prime})(C_2^n + D_2^{n\prime})'\big]. $$
It follows that
$$ \frac{1}{n}\sum_{i=1}^n \hat u_i^2(\hat\xi_i^n)^2 \overset{p}{\to} \lim_{n\to\infty}\frac{\sigma^4}{n}\mathrm{tr}\big[(C_2^n + D_2^{n\prime})(C_2^n + D_2^{n\prime})'\big] = \lim_{n\to\infty}\frac{\sigma^4}{n}\mathrm{tr}\big[(W_2^n)^2 + W_2^{n\prime} W_2^n\big]. $$
Therefore,
$$ \widehat{LM}^a = \frac{\big[u'(C_2^n + D_2^{n\prime})u\big]^2}{\sigma^4\,\mathrm{tr}[(W_2^n)^2 + W_2^{n\prime} W_2^n]} + o_p(1) $$
and $\widehat{LM}^a - LM^a \overset{p}{\to} 0$.
To analyse the asymptotic properties of $\widehat{LM}^b$ under homoscedastic errors, we first note that from the results in (a) we obtain
$$ \frac{1}{\sqrt{n}}\hat Z_n^* = \frac{1}{\sqrt{n}}\big[u'(C_1^n + D_1^{n\prime})u + u' W_1^n X\beta\big] + o_p(1) $$
and
$$ \frac{1}{n}\sum_{i=2}^n \hat u_i^2(\hat\zeta_i^n)^2 \overset{p}{\to} \lim_{n\to\infty}\frac{\sigma^2}{n} E\Big[\big((C_1^n + D_1^{n\prime})u + M W_1^n X\beta\big)'\big((C_1^n + D_1^{n\prime})u + M W_1^n X\beta\big)\Big] = \lim_{n\to\infty}\frac{1}{n}\Big\{\sigma^4\,\mathrm{tr}\big[(W_1^n)^2 + W_1^{n\prime} W_1^n\big] + \sigma^2\beta' X' W_1^{n\prime} M W_1^n X\beta\Big\}. $$
It follows that
$$ \widehat{LM}^b = \frac{\big[u'(C_1^n + D_1^{n\prime})u + u' W_1^n X\beta\big]^2}{\sigma^4\,\mathrm{tr}[(W_1^n)^2 + W_1^{n\prime} W_1^n] + \sigma^2\beta' X' W_1^{n\prime} M W_1^n X\beta} + o_p(1) $$
and, therefore, the test statistic $\widehat{LM}^b$ is asymptotically equivalent to the original $LM^b$ statistic presented in (2.5). The asymptotic equivalence of the joint test statistics $LM^c$ and $\widehat{LM}^c$ follows directly from these results.
The Econometrics Journal (2011), volume 14, pp. 343–350. doi: 10.1111/j.1368-423X.2011.00347.x

Non-parametric identification of the mixed proportional hazards model with interval-censored durations

CHRISTIAN N. BRINCH†,‡

†Statistics Norway, Research Department, Pb 8131 Dep, N-0033 Oslo, Norway

‡Department of Economics, University of Oslo, Pb 1095 Blindern, N-0371 Oslo, Norway. E-mail: [email protected]

First version received: August 2009; final version accepted: February 2011
Summary This note presents identification results for the mixed proportional hazards model when duration data are interval-censored. Earlier positive results on identification under interval-censoring require both parametric specification on how covariates enter the hazard functions and assumptions of unbounded support for covariates. New results provided here show how one can dispense with both of these assumptions. The mixed proportional hazards model is non-parametrically identified with interval-censored duration data, provided covariates have support on an open set and the hazard function is a non-constant continuous function of covariates. Keywords: Duration analysis, Interval-censoring, Non-parametric identification.
1. INTRODUCTION The Mixed Proportional Hazards (MPH) model is the main workhorse in econometric duration analysis with a focus on separating heterogeneity from structural duration dependence. A large and growing literature has been concerned with identification of the MPH model under different assumptions, see e.g. Van den Berg (2001) for a survey. This note describes identification results for MPH models when durations are not observed exactly, but are interval-censored. Existing identification results for the MPH model with interval-censored data require both parametric specification of how covariates enter the model and assumptions of unbounded support for regressors. I here demonstrate that it is possible to dispense with both assumptions: the MPH model is non-parametrically identified under interval-censoring provided covariates have support on an open set; and the hazard function is a continuous non-constant function of the covariates. Non-parametric identification is an important issue for duration models with unobserved heterogeneity. The non-linearities in commonly applied models often ensure parametric identification of structural duration dependence. In the absence of non-parametric identification, estimation results depend crucially on parametric specifications—which may often be ad hoc. With non-parametric identification results to fall back on, one can at least hope for results that do not depend crucially on parametric specifications, even if full non-parametric estimation is often not feasible, and parametric or semi-parametric estimators are applied.
C 2011 The Author(s). The Econometrics Journal C 2011 Royal Economic Society. Published by Blackwell Publishing Ltd, 9600
Garsington Road, Oxford OX4 2DQ, UK and 350 Main Street, Malden, MA, 02148, USA.
344
C. N. Brinch
Elbers and Ridder (1982) prove identification of the MPH model with minimal requirements on variation in covariates, under an assumption of finite mean for the heterogeneity distribution, while Heckman and Singer (1984) prove a similar result with alternative tail assumptions for the heterogeneity distribution. Ridder (1990) clarifies the differences within the Generalized Accelerated Failure Time (GAFT) class that generalizes the MPH model. Heckman and Honoré (1989) and Abbring and van den Berg (2003) generalize these results to dependent competing risks models. Lee (2006) provides semi-parametric identification results for a more general class of competing risks models. The above results rely crucially on exact observation of durations. In practice, duration data should usually be considered interval-censored or discrete. That is, durations are not observed exactly, but only observed to lie within some interval, e.g. one observes spell lengths that are less than one month, between one and two months, etc. The combination of continuous time hazard rate models and interval-censored duration data is common enough to have generated a voluminous literature. There are basically three approaches to estimation of models in this setting. The first is to simply assume away the interval-censoring, in the sense that data are treated as if they were not censored. Not surprisingly, this may lead to problems, see e.g. Bergström and Edin (1992) or Røed and Zhang (2002). The second approach is to derive the likelihood of the interval-censored observations from a continuous time model and use this likelihood as a basis for estimation. Flinn and Heckman (1982) give an early discussion of this. The third approach is to specify the model as a discrete duration model. A discrete duration model may or may not be consistent with a hazard rate model.
For cases where the discrete-time models are consistent with such underlying continuous-time models, the second and third approaches are equivalent. Han and Hausman (1990) and Sueyoshi (1995) estimate discrete duration models that are consistent with hazard rate models, while e.g. Van den Berg and Van Ours (1994) estimate discrete duration models that are not consistent with hazard rate models, but that on the other hand allow for simplification of some estimation procedures.

There are some identification results for MPH models with interval-censoring in the literature. Clearly, it is not possible to recover hazard function behaviour within intervals; see Sueyoshi (1995). Ridder (1990) shows that the GAFT class is not identified under assumptions corresponding to the classical results for uncensored data, but that the model is identified in a corresponding way if covariates are assumed to enter the log structural hazard function linearly and covariates have support on the full real line. McCall (1994) shows that the model is still identified when the coefficients associated with the linear function of covariates are interval-specific. Meyer (1995) contains an identification result for the MPH model similar to the positive result in Ridder (1990) and also comments that the result holds in the more general case where the structural hazard function is a known function of the linear function of covariates. Bierens (2008) proves identification of the same model, while relaxing the assumption on the support of covariates somewhat and, in addition, providing alternative conditions on the heterogeneity distribution.

All identification results for MPH models with interval-censored durations in the literature are semi-parametric, in that they require a known function of the structural hazard function to be linear in covariates. All results also rely on unbounded covariate support.
In the next section, I first show how the unbounded support assumption may be relaxed within the semi-parametric framework. Second, I show that full non-parametric identification can be achieved, notwithstanding the negative identification result in Ridder (1990). The identification results provided here are isomorphic to known identification results for corresponding models
with continuous durations, but with little or no variation in covariates. The results provided here require only discrete variation in duration, but continuous covariates.
2. IDENTIFICATION RESULTS

The MPH model describes the family of distributions of a positive random variable T, the duration, conditional on covariates x ∈ X. Assuming continuous distribution functions for T, these are fully described through hazard functions. The MPH model is specified in terms of an independent random variable V with support on R_+, representing unobserved heterogeneity, and a hazard function, conditional on both the covariates x and V = v, specified as vψ(t)φ(x). The survival function of the MPH model, G(t, x) := Pr(T ≥ t | x), follows as

    G(t, x) = E[exp(−V z(t)φ(x))] = L(z(t)φ(x)),    (2.1)

where E denotes expectation with respect to V, z(t) = ∫_0^t ψ(r) dr, and L is the Laplace transform of the random variable V; see e.g. Feller (1971).

In addition, I will discuss the Generalized Accelerated Failure Time (GAFT) class introduced by Ridder (1990), a generalization of the MPH model. Define the GAFT class directly by

    G(t, x) = L̄(z(t)φ(x)),    (2.2)

where L̄ is a continuously differentiable, strictly decreasing, positive function defined on R_+ with L̄(0) = 1. L̄ plays the role of L in (2.1); any Laplace transform L satisfies the restrictions imposed on L̄, but L has more properties. The essential extra property in our context is that L is analytic, and hence uniquely determined by its values on an open set.

Here, identification of the model will be studied under discrete, or more precisely interval-censored, duration data. With interval-censoring, durations are not observed exactly, but only observed to fall within a certain interval. Equivalently, whether or not durations 'have ended' is only observed at a finite number of points in time. In the literature, identification under interval-censoring has been studied in models with parametric functional-form restrictions and unbounded support assumptions on covariates. Let us first see how we can dispense with the latter assumption.

ASSUMPTION 2.1. The random variable V has finite mean, normalized to unity. Thus, L′(0) = −1.

ASSUMPTION 2.2. φ is restricted to the parametric specification φ(x) = exp(xβ), with β ≠ 0.

Scalar covariates are assumed throughout; it is straightforward to extend the results to vector-valued covariates.

ASSUMPTION 2.3. x takes on values on an open set X ⊆ R.

ASSUMPTION 2.4. G(t, x) is only known at t = t_a, with G(t_a, x) < 1 for some x ∈ X.

Assumption 2.4 corresponds to an observation plan where it is only observed whether durations have ended at one point in time. A structure of the MPH model is a set {L, ψ, φ} that conforms to the definitions. We say that the MPH model is identified if the structure of the model is uniquely determined from the survival function G(t, x). Under Assumptions 2.3 and 2.4, the starting point is what one can identify from G(t, x) = L(z(t)φ(x)) for t = t_a and x ∈ X. Clearly, it is then impossible to identify z(t) for t ≠ t_a. Use the notation z_a = z(t_a). Under Assumptions 2.2, 2.3 and 2.4, a structure of the MPH model can then be represented by the set {L, z_a, β}.

THEOREM 2.1. Under Assumptions 2.2, 2.3 and 2.4, observationally equivalent structures {L_1, z_a1, β_1} and {L_2, z_a2, β_2} of the MPH model must satisfy

    z_a2 = A z_a1^b,    (2.3)

    β_2 = b β_1    (2.4)

and

    L_2(A s^b) = L_1(s),    (2.5)

for positive constants A and b. Under Assumptions 2.1, 2.2, 2.3 and 2.4, the MPH model is identified.

Theorem 2.1 is very similar to Theorem 2 in Ridder (1990), an identification result for the corresponding GAFT class, where Assumption 2.1 is not invoked. The main difference from the first part of Theorem 2.1 is that analytic continuation cannot be applied for the GAFT class, so Assumption 2.3 must be strengthened to require that x takes on values on R. This is precisely the point of Theorem 2.1: to demonstrate that, for the MPH model, the unbounded covariate support assumption is not necessary for identification. Theorem 2.1 contains the main identification result for interval-censored durations in Meyer (1995) and Theorem 5 in Bierens (2008) as special cases. These apply stronger conditions: either that x takes on values on R or, in the case of Bierens (2008), that xβ has no lower bound.

I have been made aware of the very close correspondence between Theorem 2.1 and existing identification results for MPH models without variation in covariates, but with a Weibull structural hazard function; see Heckman and Singer (1984) or Lancaster (1990, p. 152). If the covariate is transformed to τ := exp(x), the survival function can be expressed as G(t_a; x) = L(z_a τ^β), which is the same expression as the survival function of a Weibull MPH model without variation in covariates, with τ as the duration. (This argument requires β > 0, but the sign of β is trivially identified in our model, and β can thus be taken to be positive without loss of generality.) The formal results in the literature do not consider the case where the duration varies only over an open set, although the necessary analytic extension is straightforward. Hence, an alternative proof of Theorem 2.1 could be constructed from existing proofs for Weibull hazard models.
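The equivalence classes described by Theorem 2.1 are easy to verify numerically. The sketch below is my own illustration, not from the paper: it takes L_1 to be the Laplace transform of a mean-one gamma frailty, builds a rival structure via (2.3)–(2.5) for arbitrary A, b > 0, and checks that the two structures imply identical survival probabilities at t_a.

```python
import numpy as np

def L1(s, k=2.0):
    # Laplace transform of a mean-one Gamma(k, 1/k) frailty, so L1'(0) = -1
    return (1.0 + s / k) ** (-k)

A, b = 1.7, 1.3                        # arbitrary positive constants
za1, beta1 = 0.5, 0.9                  # original structure {L1, za1, beta1}
za2, beta2 = A * za1 ** b, b * beta1   # rival structure via (2.3) and (2.4)

def L2(s):
    # Defined so that L2(A * s**b) = L1(s), i.e. equation (2.5)
    return L1((s / A) ** (1.0 / b))

x = np.linspace(-1.0, 1.0, 5)
G1 = L1(za1 * np.exp(beta1 * x))       # survival at t_a under structure 1
G2 = L2(za2 * np.exp(beta2 * x))       # survival at t_a under structure 2
print(np.allclose(G1, G2))
```

Assumption 2.1 is what breaks the tie: with b ≠ 1, the derivative of L2 at zero is 0 or −∞ rather than −1, so only one structure in each equivalence class has a mean-one frailty.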
It follows from the discussion in Ridder (1990) that, if L1 is part of the unique structure conforming to Assumption 2.1, then the constant b characterizing observationally equivalent structures violating Assumption 2.1 must be larger than one. Other values of b lead to L2 that do not conform to the requirements of Laplace transforms.

Note that there are alternatives to Assumption 2.1 for pinning down the constants A and b. Bierens (2008) discusses two such alternative assumptions. As should be clear from the GAFT definition, 1 − L can be interpreted as a cumulative distribution function, say for a random variable Y, with support on R_+. Bierens (2008) considers the distribution of W := exp(−Y), which has support on the unit interval. Identification results such as Theorem 2.1 identify this distribution up to two parameters (A and b). The first of the alternative identification assumptions
in Bierens (2008) is to pre-specify two quantiles of W, or equivalently of Y. Fixing two quantiles is sufficient for pinning down the constants A and b. The second alternative identification assumption is to pre-specify the first two moments of W, which again suffices for determining A and b. Thus, the alternative conditions in Bierens (2008) can be substituted for Assumption 2.1 in Theorem 2.1, although Assumption 2.3 is weaker than the corresponding assumption in Theorems 6 and 7 in Bierens (2008). Similarly, identification based on Theorem 2.1 could use the tail assumptions from Heckman and Singer (1984) in place of Assumption 2.1. These different ways of achieving identification pin down different combinations of the constants A and b, and potentially lead to qualitatively different structural duration dependence.

At first glance, one can hardly claim to be identifying structural duration dependence through Theorem 2.1, as the integrated structural hazard rate is only identified at one point. However, identification of structural duration dependence is trivial once the heterogeneity distribution is identified, since z(t) can easily be recovered at any point in time where G(t, x) is known, as

    z(t)φ(x) = L^{-1}(G(t, x))    (2.6)

for any such t, since φ(x) and L are recovered through Theorem 2.1. Indeed, it is straightforward to generalize this result to the case where the conditional hazard function beyond t_a is any function that factors into v and a function of t and x. A special case of such a model is studied in McCall (1994), where the function φ(x) may differ over intervals while retaining the exponential structure from Assumption 2.2. Hence, unbounded support of covariates is not necessary for identification of the model in McCall (1994) either.

It is now clear that the parametric restriction in Assumption 2.2 suffices for identification of the MPH model without assuming unbounded support of covariates. Let us now see how far we get without imposing parametric restrictions.

ASSUMPTION 2.5. φ is a continuous, non-constant function of x.

Assumption 2.5 is strictly weaker than Assumption 2.2. It follows from Assumptions 2.3 and 2.5 that φ(x) takes on values on an open set. Ridder (1990) demonstrates that even unbounded support is not sufficient for identification without parametric restrictions in the GAFT class. Since this non-identification result may not carry over to the specialized MPH model, we provide the following argument. The identification problem under Assumptions 2.1, 2.3, 2.4 and 2.5 completely mirrors the known non-identification result for continuous distributions without variation in covariates, provided by Lancaster and Nickell (1980). Any survival function G(t_a, x) is consistent with any heterogeneity distribution: if G(t_a, x) = L(φ(x)), with z_a absorbed into φ, then G(t_a, x) = L_0(φ_0(x)) for an arbitrary distribution function with Laplace transform L_0, provided φ_0 = L_0^{-1} ∘ L ∘ φ. In view of the negative result on non-parametric identification in Ridder (1990) for the GAFT class, and the straightforward extension of this result to the MPH model, it is not surprising that positive identification results for this case have not been sought.
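The Lancaster and Nickell style construction just described can be made concrete. In this sketch (my own illustration, with z_a absorbed into φ as in the text), a gamma-frailty structure and an exponential-frailty structure, both with mean-one heterogeneity, produce identical survival probabilities at the single observation time t_a.

```python
import numpy as np

def L(s):
    # Laplace transform of a mean-one Gamma(2, 1/2) frailty
    return (1.0 + s / 2.0) ** (-2)

def L0(s):
    # Laplace transform of a unit-exponential frailty (also mean one)
    return 1.0 / (1.0 + s)

def L0_inv(u):
    return 1.0 / u - 1.0

phi = lambda x: np.exp(0.7 * x)      # original structural hazard term
phi0 = lambda x: L0_inv(L(phi(x)))   # phi0 = L0^{-1} o L o phi

x = np.linspace(-2.0, 2.0, 9)
G = L(phi(x))       # survival at t_a under (L, phi)
G0 = L0(phi0(x))    # survival at t_a under (L0, phi0): identical by construction
print(np.allclose(G, G0))
```

Note that φ_0 is no longer log-linear in x, which is why this construction does not contradict Theorem 2.1.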
These negative results do, however, depend critically on the extreme interval-censoring imposed by Assumption 2.4. To obtain positive results, we will instead use the following alternative.

ASSUMPTION 2.6. G(t, x) is known at t_a and t_b > t_a, with G(t_a, x) < 1 and G(t_b, x) < G(t_a, x) for some x ∈ X.
Thus, whether durations have ended is observed at two points in time, and some durations end between these points. The following key result shows that neither the parametric assumptions nor the unbounded support assumptions are necessary for identification.

THEOREM 2.2. The MPH model is identified under Assumptions 2.1, 2.3, 2.5 and 2.6.

Theorem 2.2 is closely related to the classical result for continuous durations from Elbers and Ridder (1982). In the notation of (2.1), Elbers and Ridder (1982) prove that L, z and φ are uniquely determined from G(t, x), given Assumption 2.1. It is required that z(t) takes on values on an open set with 0 as a limit point, and that φ(x) takes on two distinct values. Kortram et al. (1995) show that the result still holds when the requirement on z(t) is relaxed so that z(t) is only required to take on values on any open set. Assumptions 2.3, 2.5 and 2.6 in the present paper ensure that φ(x) takes on values on an open set and that z(t) takes on two distinct values. Hence, with the obvious swap of the roles of z and φ, there is no need for a separate proof of Theorem 2.2.

It is pointed out in Meyer (1995) that the proof of his main identification result (which is covered by Theorem 2.1) also applies beyond the parametric specification in Assumption 2.2. Specifically, in the notation applied here, it is required that φ is a known, strictly monotone, continuously differentiable function of a linear function of x. Clearly, Theorem 2.2 goes beyond this result from Meyer (1995), as it does not require that the function φ is known.
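A numerical sketch (my own illustration, not the proof) shows why the second observation time in Assumption 2.6 restores identification: a rival frailty distribution built to match G at t_a would need a covariate-dependent value of z(t_b) to also match at t_b, which no valid structure can deliver.

```python
import numpy as np

def L(s):       return (1.0 + s / 2.0) ** (-2)   # mean-one gamma frailty
def L0(s):      return 1.0 / (1.0 + s)           # rival mean-one exponential frailty
def L0_inv(u):  return 1.0 / u - 1.0

phi = lambda x: np.exp(0.7 * x)
phi0 = lambda x: L0_inv(L(phi(x)))   # rival structure matching G(t_a, x) exactly

x = np.linspace(-2.0, 2.0, 9)
zb = 2.5                             # true z(t_b), with z(t_a) normalized to one

# Value of z0(t_b) the rival structure would need, solved pointwise in x:
z0_required = L0_inv(L(zb * phi(x))) / phi0(x)
print(z0_required.round(3))          # varies with x: no single z0(t_b) works
```

Since z0_required is not constant in x, the rival structure cannot reproduce both G(t_a, x) and G(t_b, x), in line with Theorem 2.2.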
3. DISCUSSION

The results provided here close the gap between identification results for the MPH model with exact and with interval-censored duration data. The model is non-parametrically identified under interval-censoring, assuming the structural hazard function is a continuous non-constant function of covariates with support on an open set.

It is clearly not possible to extend the results straightforwardly to the case of covariates with finite support. The combination of interval-censored durations and covariates with finite support gives us only a finite number of cell probabilities as empirical predictions, hardly enough for full identification of infinite-dimensional models. Still, sets of observationally equivalent models may be sufficiently similar for identification in the intuitive sense to hold in practice; see Bierens (2008) or Honoré and Lleras-Muney (2006) for related discussions.

The identification results provided here do not generalize directly to the case with dependent competing risks. Dependent competing risks models with interval-censoring are difficult to work with: state-dependent integrated structural hazard functions are not even directly identifiable when the unobserved heterogeneity distribution is known, since within-interval behaviour of transition rates to one state may affect the population at risk for transitions to other states. Honoré and Lleras-Muney (2006) show how bounds may still be obtained on interesting parameters in a closely related model.

Finally, the identification results provided here rely crucially on the proportional hazards assumption and the finite-mean assumption on the heterogeneity distribution. Brinch (2008) provides results showing that we can dispense with these assumptions if covariates vary over time as well as across observations, corresponding to the results in Brinch (2007) for models without interval-censoring.
ACKNOWLEDGMENTS

Thanks to Rolf Aaberge, Magne Mogstad, Arvid Raknerud and an anonymous referee for helpful comments. While carrying out this research, I have been associated with the Centre of Equality, Social Organization, and Performance (ESOP) at the Department of Economics at the University of Oslo. ESOP is supported by the Research Council of Norway.
REFERENCES

Abbring, J. H. and G. J. van den Berg (2003). The identifiability of the mixed proportional hazards competing risks model. Journal of the Royal Statistical Society, Series B 65, 701–10.
Bergström, R. and P. A. Edin (1992). Time aggregation and the distributional shape of unemployment duration. Journal of Applied Econometrics 7, 5–30.
Bierens, H. (2008). Semi-nonparametric interval-censored mixed proportional hazard models: identification and consistency results. Econometric Theory 24, 749–94.
Brinch, C. N. (2007). Nonparametric identification of the mixed hazards model with time-varying covariates. Econometric Theory 23, 349–54.
Brinch, C. N. (2008). Non-parametric identification of the mixed hazards model with interval-censored durations. Discussion Paper No. 539, Statistics Norway.
Elbers, C. and G. Ridder (1982). True and spurious duration dependence: the identifiability of the proportional hazard model. Review of Economic Studies 49, 403–9.
Feller, W. (1971). An Introduction to Probability Theory and Its Applications, Vol. II. New York: John Wiley.
Flinn, C. and J. J. Heckman (1982). Models for the analysis of labor force dynamics. In R. Bassman and G. Rhodes (Eds.), Advances in Econometrics, Volume 1, 35–95. Greenwich, CT: JAI Press.
Han, A. and J. Hausman (1990). Flexible parametric estimation of duration and competing risk models. Journal of Applied Econometrics 5, 1–28.
Heckman, J. and B. Honoré (1989). The identifiability of the competing risks model. Biometrika 76, 325–30.
Heckman, J. and B. Singer (1984). The identifiability of the proportional hazard model. Review of Economic Studies 51, 231–41.
Honoré, B. and A. Lleras-Muney (2006). Bounds in competing risks models and the war on cancer. Econometrica 74, 1675–98.
Kortram, R., A. van Rooij, A. Lenstra and G. Ridder (1995). Constructive identification of the mixed proportional hazards model. Statistica Neerlandica 49, 269–81.
Lancaster, T. (1990). The Analysis of Transition Data. New York: Cambridge University Press.
Lancaster, T. and S. Nickell (1980). The analysis of re-employment probabilities for the unemployed. Journal of the Royal Statistical Society, Series A 143, 141–65.
Lee, S. (2006). Identification of a competing risks model with unknown transformations of latent failure times. Biometrika 93, 996–1002.
McCall, B. (1994). Testing the proportional hazards assumption in the presence of unmeasured heterogeneity. Journal of Applied Econometrics 9, 321–34.
Meyer, B. D. (1995). Semiparametric estimation of hazard models. Working paper, Northwestern University.
Ridder, G. (1990). The non-parametric identification of generalized accelerated failure-time models. Review of Economic Studies 57, 167–81.
Røed, K. and T. Zhang (2002). A note on the Weibull distribution and time aggregation bias. Applied Economics Letters 9, 469–72.
Sueyoshi, G. (1995). A class of binary response models for grouped duration data. Journal of Applied Econometrics 10, 411–31.
Van den Berg, G. (2001). Duration models: specification, identification and multiple durations. In J. J. Heckman and E. Leamer (Eds.), Handbook of Econometrics, Volume 5, 3381–460. Amsterdam: North-Holland.
Van den Berg, G. and J. Van Ours (1994). Unemployment dynamics and duration dependence in France, The Netherlands and the United Kingdom. Economic Journal 104, 432–43.
APPENDIX: PROOFS OF RESULTS

Proof of Theorem 2.1: Assume that two structures {L_1, z_a1, β_1} and {L_2, z_a2, β_2} are observationally equivalent. That is,

    L_1(z_a1 exp(β_1 x)) = L_2(z_a2 exp(β_2 x)), for all x ∈ X.    (A.1)

Equivalently,

    β_2 x + log z_a2 = h_1(β_1 x + log z_a1), for all x ∈ X,    (A.2)

where h_1 = log ∘ L_2^{-1} ∘ L_1 ∘ exp, and ∘ denotes composition of functions. Then h_1 must be a linear function for x ∈ X. Let h_1(z) = log(A) + bz, with two arbitrary constants A > 0 and b > 0 (h_1 is increasing, by the properties of the component functions). Next, let h_2 = L_2^{-1} ∘ L_1. Then

    h_2(z) = A z^b.    (A.3)

Thus, for all s on some open set,

    L_2(A s^b) = L_1(s).    (A.4)

When this equation holds for all s on an open set, it holds for all s > 0 through the analyticity of Laplace transforms. Substituting for L_1 in (A.1), we find

    L_2(A z_a1^b exp(b β_1 x)) = L_2(z_a2 exp(β_2 x)), for all x ∈ X,    (A.5)

hence

    A z_a1^b exp(b β_1 x) = z_a2 exp(β_2 x),    (A.6)

leading to (2.3) and (2.4). Differentiation of both sides of (A.4) with respect to s gives

    L_2′(A s^b) A b s^{b−1} = L_1′(s), s ∈ R_+.    (A.7)

Under Assumption 2.1, both L_1′(s) and L_2′(A s^b) are required to approach −1 as s → 0, which in turn requires that A b s^{b−1} → 1 as s → 0, giving A = b = 1.
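The final step of the proof can be checked symbolically. The sketch below is my own illustration under a concrete gamma choice for L_1, not part of the paper: it builds a rival L_2 satisfying (A.4) with A = 3 and b = 2, and confirms that this L_2 violates Assumption 2.1 because its derivative at zero is −∞ rather than −1, so that A = b = 1 is forced.

```python
import sympy as sp

s = sp.symbols('s', positive=True)
A, b = sp.Integer(3), sp.Integer(2)       # candidate constants other than A = b = 1

L1 = (1 + s / 2) ** (-2)                  # mean-one gamma Laplace transform
L2 = L1.subs(s, (s / A) ** sp.Rational(1, 2))  # built so that L2(A*s**b) = L1(s)

# Equation (A.4) holds identically in s:
assert sp.simplify(L2.subs(s, A * s ** b) - L1) == 0

d1 = sp.limit(sp.diff(L1, s), s, 0)       # L1'(0) = -1: satisfies Assumption 2.1
d2 = sp.limit(sp.diff(L2, s), s, 0)       # L2'(0) = -oo: violates Assumption 2.1
print(d1, d2)
```

With b < 1 the derivative at zero would instead be 0; either way, only A = b = 1 is compatible with L′(0) = −1 for both structures.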
The Econometrics Journal (2011), volume 14, pp. B1–B4. doi: 10.1111/j.1368-423X.2010.00322.x

BOOK REVIEW: A Review of Micro-Econometrics: Methods of Moments and Limited Dependent Variables (2nd Ed.), by Myoung-jae Lee (New York, NY: Springer, 2010. Pp. 770. £81.00, hardcover, ISBN 978-0-387-95376-2)
INTRODUCTION

When the first edition of this book was published, almost 15 years ago, it broke new ground: it was the first book to provide both a systematic coverage of standard microeconometric topics and a detailed description of a number of semi-parametric methods for modelling limited dependent variables. The book also had the innovative feature of providing a data set and GAUSS code that allowed the reader to gain hands-on experience with some of the methods described in it. In the last decade, however, several excellent books on microeconometrics have appeared. Some of them are quite general (e.g. Wooldridge, 2002; Cameron and Trivedi, 2005; Winkelmann and Boes, 2006), while others have a more specific focus (e.g. Arellano, 2003; Yatchew, 2003; Koenker, 2005; Winkelmann, 2008; Train, 2009; Cameron and Trivedi, 2010). This second edition will therefore face tough competition, and the first question to ask is whether it adds anything to the texts currently available.

My answer to this question is certainly positive, because this book has features that make it useful to a number of readers. In particular, as in the first edition, it covers in some detail not only the standard inference tools but also more advanced topics, including a vast collection of semi- and non-parametric methods that are rarely, if at all, discussed in books of this nature.
AN OVERVIEW

It is important to note from the outset that, more than a book on microeconometrics, this is a book on microeconometric methods. For example, there is no introductory chapter on the nature and purpose of microeconometrics, nor is there a reference to the challenges of using observational rather than experimental data. In this respect, therefore, this book is very different from those of Wooldridge (2002) and Cameron and Trivedi (2005), which define the gold standard in the area. Compared to the current alternatives, this book is generally more advanced and somewhat more compactly written. Throughout, the emphasis is on the methods and on their asymptotic properties, with little attention paid to their performance in finite samples. This focus on the methods does not detract from the book's usefulness; instead, it is its strength. Indeed, this book provides
much more detailed descriptions of the different methods and their asymptotic properties than comparable textbooks. Because the reader needs a reasonable background in statistics and econometrics, this book is particularly useful both for advanced graduate students and for researchers, whether they are theoretically or empirically oriented.

If this book is to be used as the basis for a graduate course, there are a number of issues to consider. First, it is important to supplement its contents with additional material. The book contains many illustrative examples of empirical applications and gives references to applied papers using the methods presented, but students would benefit from additional material, both illustrating the practical importance of the methods and giving some guidance in the always difficult task of taking models to data. Another characteristic to keep in mind if the book is used as a textbook is that there are no end-of-chapter exercises that students can use to test their knowledge. I also note that the style of the book is somewhat unusual (for example, equations are not numbered), being perhaps too informal from time to time. Although some readers may take a while to get used to this idiosyncratic style, I do not see it as a drawback, and many students may even welcome it as a refreshing change. Because of these characteristics, this book is perhaps best used as a very useful second textbook in which students can find more technical details on some of the methods taught in the course, or more advanced methods not covered in other books. Alternatively, it can serve as an excellent basis for an advanced course on semi- and non-parametric econometrics, or simply as a valuable reference book.
One thing that would greatly increase the attractiveness of the book to students, instructors and researchers alike is to make available on the author's webpage at least some of the data sets used in the empirical illustrations presented in the book.
WHAT IS IN IT

Chapter 1 deals with single-equation linear models, presenting least squares, instrumental variables and GMM estimators and tests. Although some illustrative examples are provided, this is a very compact chapter that is ideal for students to refresh their knowledge of basic econometric results. Chapter 2 extends the methods of the previous chapter to the case of multiple equations, including panel data models. The most notable feature of this chapter is the unusual approach to the rank condition for identification in simultaneous equation models. Although the proposed approach is ingenious and shows the rank condition in a new light, I still find more illuminating the way this problem is treated in Davidson and MacKinnon (1993).

The third chapter starts the treatment of non-linear models by covering maximum likelihood estimation and M-estimation in general, including a discussion of issues related to hypothesis tests and estimation algorithms. I found particularly useful the discussion of the computation of the covariance matrix of two-step estimators, which is richer and clearer than what is typically found in similar textbooks. Chapter 4 deals with a variety of other non-linear estimators, including quantile regression and mode regression. Chapters 5 and 6 complete the second part of the book, covering an extensive set of single- and multiple-equation models for limited dependent variables. In these chapters, the focus is essentially on fully parametric models, including a brief discussion of simulation-based estimators.

The third part of the book, Chapters 7–9, is arguably its most interesting part. Chapter 7 provides a relatively standard, but advanced, treatment of kernel density and regression
estimation, including kernel hazard estimators and kernel regression with both continuous and discrete regressors. Chapter 8 introduces a number of bandwidth-free semi-parametric estimators, including estimators for binary, censored, truncated and duration data. Besides some of the now standard semi-parametric methods, like the maximum score estimator and estimators for censored and truncated data, the chapter also covers less popular estimators based on ranking and pairwise differencing. Naturally, this chapter also contains a useful description of the author's own contributions in the area. Chapter 9, the last chapter of the main part of the book, covers bandwidth-dependent semi-parametric estimators. Most of the methods discussed in this chapter (e.g. two-stage least squares with a non-parametric first stage, and semi-linear and additive models) are not covered in other microeconometrics or general econometrics textbooks. This catalogue of semi-parametric estimators, including some reasonably exotic ones, can provide applied researchers with interesting ideas about how to relax strong parametric assumptions that are implicit in many popular estimators. In short, the third part of the book provides a most useful bridge between general textbooks and more advanced and specialized monographs or journal articles.

The book concludes with three long appendices. Appendix I starts by providing the traditional mathematical background, which is pitched at the level of this book and is therefore more advanced than usual. Additionally, this appendix provides extensions to Chapters 2–9, the last three of which are particularly interesting. This idea of supplementing each chapter with additional material in the form of an appendix works very well, in that it helps the book achieve a good balance between readability and completeness.
Appendix II is, however, even more interesting, containing a variety of additional topics, most of which would deserve to be covered in the main part of the book. In this appendix we find, for example, sections on estimation with stratified samples, on the empirical likelihood estimator and on the bootstrap. Finally, Appendix III gives examples of GAUSS code implementing some of the methods discussed in the book, including some of the more advanced ones. Although practitioners will mostly rely on off-the-shelf software to apply these methods, providing the actual code used to implement the estimators is very useful even for those who are not GAUSS users. I am a great believer in learning-by-programming, and many students will benefit from implementing these estimators in their preferred software. Moreover, the code provided is also a great starting point for those wishing to have more control over their results, or wishing to experiment with modified versions of the estimators.
CONCLUSION

Books written by successful researchers, in contradistinction to books written by merely successful authors, are always interesting and inspiring because they provide the reader with many unique insights into areas where the author has made original contributions. This second edition of Micro-Econometrics: Methods of Moments and Limited Dependent Variables certainly falls into this category. Indeed, Myoung-jae Lee has published on a very wide range of topics, and that allows him to present many results with a rare 'insider knowledge'. In summary, although the book obviously has some features that do not exactly match my preferences, I have found it extremely useful for my own work, and I believe that many other readers, whether students or researchers, will share that positive experience.
REFERENCES

Arellano, M. (2003). Panel Data Econometrics. Oxford: Oxford University Press.
Cameron, A. C. and P. K. Trivedi (2005). Microeconometrics: Methods and Applications. Cambridge: Cambridge University Press.
Cameron, A. C. and P. K. Trivedi (2010). Microeconometrics Using Stata (revd ed.). College Station, TX: Stata Press.
Davidson, R. and J. G. MacKinnon (1993). Estimation and Inference in Econometrics. Oxford: Oxford University Press.
Koenker, R. (2005). Quantile Regression. Cambridge: Cambridge University Press.
Train, K. (2009). Discrete Choice Methods with Simulation (2nd ed.). Cambridge: Cambridge University Press.
Winkelmann, R. (2008). Econometric Analysis of Count Data (5th ed.). Berlin: Springer.
Winkelmann, R. and S. Boes (2006). Analysis of Microdata. Berlin: Springer.
Wooldridge, J. M. (2002). Econometric Analysis of Cross Section and Panel Data. Cambridge, MA: MIT Press.
Yatchew, A. (2003). Semiparametric Regression for the Applied Econometrician. Cambridge: Cambridge University Press.
João M. C. Santos Silva
University of Essex