Econometrics Journal (2008), volume 11, pp. 409–442. doi: 10.1111/j.1368-423X.2008.00258.x
Seasonal unit root tests and the role of initial conditions D AVID I. H ARVEY † , S TEPHEN J. L EYBOURNE † AND A. M. R OBERT T AYLOR † †
School of Economics and Granger Centre for Time Series Econometrics, University of Nottingham, Nottingham, NG7 2RD, UK E-mails:
[email protected],
[email protected],
[email protected] First version received: March 2008; final version accepted: August 2008
Summary In the context of regression-based (quarterly) seasonal unit root tests, we examine the impact of initial conditions (one for each quarter) of the process on test power. We investigate the behaviour of the well-known OLS detrended HEGY seasonal unit root tests together with their quasi-differenced (QD) detrended analogues, when the initial conditions are not asymptotically negligible. We show that the asymptotic local power of a test at a given frequency depends on the value of particular linear (frequency specific) combinations of the initial conditions. Consistent with previous findings in the nonseasonal case, the QD detrended test at a given spectral frequency dominates on power for relatively small values of this combination, while the OLS detrended test dominates for larger values. Since, in practice, the seasonal initial conditions are not observed, in order to maintain good power across both small and large initial conditions, we develop tests based on a union of rejections decision rule; rejecting the unit root null at a given frequency (or group of frequencies) if either of the relevant QD and OLS detrended HEGY tests rejects. This procedure is shown to perform well in practice, simultaneously exploiting the superior power of the QD (OLS) detrended HEGY test for small (large) combinations of the initial conditions. Moreover, our procedure is particularly adept in the seasonal context since, by design, it exploits the power advantage of the QD (OLS) detrended HEGY tests at a particular frequency when the relevant initial condition is small (large) without imposing that same method of detrending on tests at other frequencies. Keywords: Asymptotic local power, HEGY seasonal unit root tests, Initial conditions, Union of rejections decision rule.
1. INTRODUCTION The role of the initial condition (defined as the deviation of the first observation from its deterministic component) on standard (zero frequency) unit root tests has attracted considerable attention in recent years. While unit root tests which include a constant in their detrending procedure are exact similar with respect to the initial condition, their local power functions depend crucially on the magnitude of the initial condition, even asymptotically; see, inter alia, C The Author(s). Journal compilation C Royal Economic Society 2008. Published by Blackwell Publishing Ltd, 9600 Garsington Road,
Oxford OX4 2DQ, UK and 350 Main Street, Malden, MA, 02148, USA.
410
D.I. Harvey, S.J. Leybourne and A.M. Robert Taylor
Elliott et al. (1996), Elliott (1999), M¨uller and Elliott (2003), Elliott and M¨uller (2006), Harvey and Leybourne (2005, 2006) and Harvey et al. (2008). As discussed in Elliott and M¨uller (2006, pp. 286–90), while there may be situations in which one would not necessarily expect the initial condition to be unusually large or, indeed, unusually small, relative to the other data points, equally the initial condition might be relatively large in other situations. The former case occurs, for example, where the first observation in the sample is dated quite some time after the inception of a mean-reverting process, while the latter can happen if the sample data happen to be chosen to start after a break (perceived or otherwise) in the series or where the beginning of the sample coincides with the start of the process. This latter example can also allow for the case where an unusually small (even zero) initial condition occurs. In practice it is therefore hard to rule out small or large initial conditions, a priori. This is problematic, given the substantial impact of the magnitude of the initial condition on the power properties of standard unit root tests. In the seasonal unit root testing context we have not one initial condition but S (seasonal) initial conditions, one for each of the S seasons. Similar arguments to those put forward by Elliott and M¨uller (2006) also apply in the seasonal case, but with a wider range of possible effects. As a simple example, a large break in the average level of a monthly series might occur, taking effect half way through the year. In such a case one would expect the initial conditions for the first (second) half of the initial year to be large (small) relative to the rest of the sample. Moreover, with S seasons, breaks might occur in some of the seasons but not others, again effecting different magnitudes for the initial conditions in different seasons. It is also well known that the relative importance of each season within a (near-) seasonal unit root process also has a tendency to evolve over time; the so-called ‘spring becomes summer’ phenomenon referred to by, for example, Hylleberg et al. (1993). This effect implies that, where the sample is dated some time after the inception of the series, the relative position and magnitude of each of the seasonal initial conditions will vary as to where the sample data starts. It therefore seems worthwhile and of practical relevance to investigate the role played by the magnitude of the initial conditions in determining the power properties of seasonal unit root tests. Working with a rather general formulation for the seasonal initial conditions (which includes seasonal extensions of the non-seasonal set-ups of Elliott, 1999, M¨uller and Elliott, 2003, and Elliott and M¨uller, 2006, as special cases), we find that for a test at a given frequency it is the magnitude of a specific linear combination of the seasonal initial conditions that matters, rather than the initial conditions themselves. For example, in the case of the zero frequency it is the sum of the initial conditions for each season that turns out to be the important quantity. We term these quantities spectral initial conditions. Where the spectral initial condition for a test at a given frequency is not asymptotically negligible, the quasi-differenced (QD) detrended Hylleberg et al. 
(1990) [HEGY]-type tests of Rodrigues and Taylor (2007) can perform very badly indeed, with their power against a given alternative rapidly decreasing towards zero as the magnitude of the spectral initial condition is increased. In sharp contrast, the OLS detrended HEGY tests show an increase in power, other things equal, as the magnitude of the spectral initial condition increases, albeit their powers are considerably lower than those of corresponding QD detrended tests when the initial condition is small. Powers of joint frequency unit root tests of the type proposed in Ghysels et al. (1994) [GLN] are also shown to depend on the method of detrending and relevant set of spectral initial conditions. Our findings are made relevant because in practice the seasonal initial conditions are neither known nor are they amenable to estimation. Consequently, uncertainty surrounds the appropriate choice of detrending method. In the non-seasonal setting, such considerations led Harvey et al. C The Author(s). Journal compilation C Royal Economic Society 2008.
Seasonal unit root tests and the role of initial conditions
411
(2008) to investigate whether it is possible in practice to construct unit root test strategies that maintain good power properties across both large and small initial conditions. They showed that a union of rejections decision rule between the QD- and OLS-based ADF tests (whereby the unit root null is rejected if either of the QD detrended ADF and OLS detrended ADF tests rejects) works well. This approach exploits the superior power properties of the QD (OLS) detrended tests when the initial condition is small (large) and is capable of outperforming the more sophisticated testing procedures proposed in Elliott and M¨uller (2006) and Harvey and Leybourne (2005, 2006). Our findings on the relative power behaviour of the QD and OLS detrended HEGY tests indicates that a union of rejections decision rule between the QD and OLS detrended HEGY tests, either at a given frequency or set of frequencies, can also be fruitfully employed in a seasonal context. We provide asymptotic and finite sample evidence to suggest that this procedure is again highly effective, despite its relative simplicity. The plan of the remainder of the paper is as follows. In Section 2, we outline our reference seasonal unit root testing model and detail the unit root tests on which we focus our attention. These are the OLS detrended seasonal unit root tests of HEGY and the corresponding QD detrended HEGY-type tests of Rodrigues and Taylor (2007). Although we restrict our analysis primarily to the case of quarterly (S = 4) data, generalisations to an arbitrary seasonal aspect follow quite straightforwardly. The limiting distributions of these statistics are derived under near-seasonal integration in Section 3. This enables us to show, and to illustrate numerically, the precise nature of the dependence of the asymptotic local power functions of these tests on the initial conditions of the process. In Section 4, we detail our union of rejections testing strategy and compare its large sample performance with that of the corresponding OLS and QD detrended HEGY tests. Section 5 reports corresponding finite sample results. We offer some conclusions in Section 6. Proofs of the main technical results in this paper are given in the Appendix. Throughout the paper we use the following notation: ‘x := y’ to indicate that x is defined by y; p
d
· to denote the integer part of the argument; ‘→’ and ‘→’ denote convergence in probability and weak convergence, respectively, as the sample size diverges and I(·) to denote the indicator function.
2. THE SEASONAL UNIT ROOT FRAMEWORK 2.1. The seasonal model Consider the case where we have T := 4N observations on the quarterly time series process {x 4t+s }, where N denotes the span in years of the sample data, generated according to the model x4t+s = μ4t+s + v4t+s , a(L)v4t+s = u4n+s , vi = ξi ,
s = −3, . . . , 0, t = 1, 2, . . . , N ,
s = −3, . . . , 0, t = 2, . . . , N ,
i = 1, . . . , 4,
(2.1)
(2.2)
(2.3)
where a(L) := 1 − 4j =1 aj Lj is a fourth order AR polynomial in the lag operator L, L4j +k x 4t+s := x 4(t−j )+s−k , and the deterministic component μ 4t+s = γ s + δ (4t + s); that is, seasonal C The Author(s). Journal compilation C Royal Economic Society 2008.
412
D.I. Harvey, S.J. Leybourne and A.M. Robert Taylor
intercepts and a (non-seasonal) time trend. 1 The shocks, {u 4t+s }, are assumed to follow a stationary AR(p), 0 ≤ p < ∞, process, viz., p
φ(L)u4t+s = ε4t+s ,
(2.4)
where φ(z) := 1 − i=1 φi zi , the roots of φ(z) = 0 all lie outside the unit circle, |z| = 1, and the error process, {ε 4t+s }, is a martingale difference sequence with constant conditional variance, σ 2 ; see Fuller (1996, theorem 5.3.5, pp. 236–37) for precise assumptions on {ε 4t+s }. We denote the long run variance of u t by ω2u := σ 2 ψ(1)2 , where ψ(z) denotes the (unique) inverse of φ(z). The initial conditions of the process are given by ξ 1 , . . ., ξ 4 in (2.3), so that ξ 1 is the initial condition associated with the first quarter, ξ 2 the second quarter, and so on. Precise assumptions on the initial conditions will be detailed and discussed in Section 2.3 below. 2.2. The seasonal unit root hypotheses In this paper, we are concerned with the behaviour of tests for seasonal unit roots in the AR(S) polynomial, α(L), against near seasonally integrated alternatives; that is, the null hypothesis of interest is H0 : a(L) = 1 − L4 =: 4 ,
(2.5)
while, following Tanaka (1996, pp. 355–6), Rodrigues (2001), Taylor (2002) and Rodrigues and Taylor (2004b), inter alia, the near seasonally integrated alternative takes the form, c (2.6) L4 , c ≤ 0. Hc : a(L) = 1 − 1 + N Notice that H c of (2.6) reduces to H 0 of (2.5) for c = 0. Under H 0 of (2.5) the DGP (2.1)–(2.2) of {x 4t+s } is that of a quarterly random walk process with (non-seasonal) drift δ, admitting unit roots at each of the zero frequency, ω 0 = 0, the Nyquist (or biannual) frequency, ω 2 = π and the annual frequency ω 1 = π /2. Under H c of (2.6) the process {x 4t+s } is locally stationary. Rodrigues and Taylor (2004b) demonstrate that H c of (2.6) can be partitioned into H c ≡ ∩2k=0 H c,k , where the hypotheses H c,0 and H c,2 correspond to a local to unit root at the zero and biannual frequencies, respectively, while H c,1 yields a pair of complex conjugate local to unit roots at the annual frequency. The null hypothesis of unit roots at the zero, biannual and annual frequencies are therefore individually denoted as H 0,0 , H 0,2 and H 0,1 , respectively. 2.3. The initial conditions As discussed in the introduction, a number of recent papers have highlighted the strong dependence of the power functions of non-seasonal unit root tests on the deviation of the initial 1 For expositional purposes we have chosen to focus our attention on the case of most practical relevance where the deterministic component consists of seasonal intercepts and a (non-seasonal) trend. Other choices of the deterministic component are possible; see, in particular, the typology of cases in Smith and Taylor (1998). However, Smith and Taylor (1998) show that allowing for seasonal intercepts ensures that the resulting seasonal unit root tests will be exact similar with respect to the initial conditions, which is especially important given our focus in this paper. If the drift should appear seasonal, then μ 4t+s could be augmented with seasonal time trends, as in Smith and Taylor (1998), while if no drift was apparent the linear trend could be omitted from μ 4t+s . C The Author(s). Journal compilation C Royal Economic Society 2008.
Seasonal unit root tests and the role of initial conditions
413
observation of the series from its underlying deterministic component (see, inter alia, Elliott, 1999, M¨uller and Elliott, 2003, Elliott and M¨uller, 2006 and Harvey and Leybourne, 2005, 2006). The following assumption provides a generalisation of the conditions discussed by these authors to the seasonal case, and contains as special cases the assumptions made by previous authors in the seasonal case. A SSUMPTION 2.1. Under H c of (2.6) with c < 0, the initial conditions in (2.3) are generated according to
ξi = αi ωu2 1 − ρN2 , i = 1, . . . , 4, (2.7) where ρN := 1 + Nc , and where αi ∼ I N (μα,i I(σα2 = 0), σα2 ), i = 1, . . . , 4, independently of u 4t+s , s = −3, . . . , 0, t = 2, . . . , N . For c = 0, that is under H 0 of (2.5), we may set ξ i = 0, i = 1, . . . , 4, without loss of generality, due to the exact similarity of the seasonal unit root tests considered in this paper to the initial conditions; see Smith and Taylor (1998) and Rodrigues and Taylor (2007). In Assumption 2.1, α i controls the magnitude of the initial condition in season i, ξ i , relative to the magnitude of the standard deviation of a stationary seasonal AR(1) process with parameter ρ N and innovation long-run variance ω2u . The form given for the ξ i allow the initial conditions to be either random and of O p (N 1/2 ), or fixed and of O(N 1/2 ). If σ 2α > 0, then the initial conditions are random; σ 2α = 1 yields the so-called unconditional case considered in the non-seasonal case by Elliott (1999) and in the seasonal case by Rodrigues and Taylor (2004b), inter alia. If, on the other hand, σ 2α = 0 then the ξ i are non-random and of the form given in M¨uller and Elliott (2003), Elliott and M¨uller (2006). 2 By considering both the random and fixed scenarios in this way, we try to allow for some flexibility in how the initial conditions may be generated. Notice finally that Rodrigues and Taylor (2007) assume that the initial conditions are asymptotically p vanishing, such that N −1/2 ξi → 0, i = 1, . . . , 4, which is equivalent to setting α i = 0, i = 1, . . . , 4, in (2.7). 2.4. Regression-Based Seasonal Unit Root Tests Following HEGY, Smith and Taylor (1998) and Rodrigues and Taylor (2007), inter alia, the regression-based approach to testing for seasonal unit roots in α(L) consists of two stages. In the first stage one detrends the data in order to achieve (exact) invariance to the seasonal intercept and linear trend parameters, γ s , s = −3, . . . , 0 and δ of (2.1). In the case of the OLS detrending approach of HEGY and Smith and Taylor (1998), the detrended series is β z4t+s , where z 4t+s := (D 1,4t+s , . . . , D 4,4t+s , (4t + s)) where given by xˆ4t+s := x4t+s − β is the OLS estimator of β := (γ−3 , . . . , γ0 , δ) , Dj ,4t+s := I(j = s), j = −3, . . . , 0, and obtained from regressing x 4t+s onto z 4t+s along 4t + s = 1, . . . , T . Under the QD detrending z4t+s , where β is the QD estimator approach of Rodrigues and Taylor (2007), xˆ4t+s := x4t+s − β
2 Notice that where the seasonal initial conditions are random, Assumption 2.1 imposes the condition that they are mutually independent. This assumption seems quite natural given that the individual seasons within a (near-) seasonally integrated model starting in the remote past are approximately independent of one another.
C The Author(s). Journal compilation C Royal Economic Society 2008.
414
D.I. Harvey, S.J. Leybourne and A.M. Robert Taylor
of β obtained from the OLS regression of x c on Z c , where
xc := x1 , x2 − α1c x1 , x3 − α1c x2 − α2c x1 , x4 − α1c x3 − · · · − α4c x1 , c x5 , . . . c xT
Zc := z1 , z2 − α1c z1 , z3 − α1c z2 − α2c z1 , z4 − α1c z3 − · · · − α4c z1 , c z5 , . . . , c zT and
c¯1 c := 1 − 1 + T
4 c¯2 c¯3 2 2 L 1+ 1+ L L =: 1 − αjc Lj 1+ 1+ T T j =1
where for tests run at the 5% level, c¯1 = −13.5, c¯2 = −7 and c¯3 = −3.75. 3 In the second stage, using the proposition √ in HEGY (pp. 221–2), we expand a(L) of (2.2) around the seasonal unit roots ±1, ±i, i := (−1), to obtain the auxiliary regression equation 4 xˆ4t+s =
4 j =1
πj xˆj ,4t+s−1 +
p
φj∗ 4 xˆ4t+s−j + uˆ 4t+s ,
(2.8)
j =1
where 4 xˆ4t+s := xˆ4t+s − xˆ4(t−1)+s and, corresponding to the zero and biannual frequencies
xˆ1,4t+s := a1 (L)xˆ4t+s , a1 (L) := 1 + L + L2 + L3 (2.9) and
xˆ2,4t+s := −a2 (L)xˆ4t+s , a2 (L) := 1 − L + L2 − L3 ,
(2.10)
respectively, and corresponding to the annual frequency, xˆ3,4t+s := −a3 (L)xˆ4t+s , a3 (L) := L(1 − L2 ) xˆ4,4t+s := −a4 (L)xˆ4t+s , a4 (L) := (1 − L2 )
(2.11)
cf. HEGY and Smith and Taylor (1998). The parameters π j , j = 1, . . . , 4, of (2.8) are of focal interest. As demonstrated in HEGY, a unit root occurs at the zero and biannual frequencies when π 1 = 0 and π 2 = 0, respectively, while a pair of complex conjugate unit roots occur at the annual frequency when π 3 = π 4 = 0. In order to test H 0 of (2.5) against the alternative of stationarity at at least one of the zero, biannual and harmonic seasonal frequencies, HEGY therefore propose using the following regression statistics in (2.8): t 1 (left-sided) for the exclusion of xˆ1,4t+s−1 ; t2 (left-sided) for the exclusion of xˆ2,4t+s−1 , and F 34 for the exclusion of xˆ3,4t+s−1 and xˆ3,4t+s−1 . 4 GLN also propose the joint frequency F-statistics, F 234 , for the exclusion of xˆ2,4t+s−1 , xˆ3,4t+s−1 and xˆ4,4t+s−1 , and F 1234 , for the exclusion of all of xˆ1,4t+s−1 , xˆ2,4t+s−1 , xˆ3,4t+s−1 and xˆ4,4t+s−1 . The former tests the null hypothesis of unit roots at all of the seasonal frequencies, while the latter tests the null hypothesis, H 0 of (2.5). 3 If, as discussed in footnote 1, the linear trend variable is omitted from z 4t+s , then c¯1 should be changed to −7, while if seasonal trends are also included in z4t+s , c¯2 and c¯3 should be changed to −13.5 and −8.65, respectively. 4 In their original article HEGY also suggest a testing procedure for the annual frequency pair of unit roots based on the pair of regression statistics t 3 for the exclusion of xˆ3,4t+s−1 and t 4 for the exclusion of xˆ4,4t+s−1 . However, these statistics have subsequently been shown to have non-pivotal asymptotic limiting null distributions when p > 0 in (2.4), rendering them unusable in practice; see, e.g. Smith et al. (2007) and Burridge and Taylor (2001).
C The Author(s). Journal compilation C Royal Economic Society 2008.
Seasonal unit root tests and the role of initial conditions
415
In what follows, we use a superscript OLS (QD) on these tests to denote that OLS (QD) denotes the QD detrending has been performed in the first stage, so that for example t QD 2 denotes the detrended biannual frequency test of Rodrigues and Taylor (2007), while t OLS 2 corresponding OLS detrended test of HEGY. Where no superscript is present, reference to the test is understood to be made in a generic sense. Finite sample and asymptotic null critical values for these tests are provided in Table 1, Panels A and B. The finite sample critical values were obtained via Monte Carlo simulation, setting p = 0 in the fitted regression (2.8) with φ(z) = 1 and γ −3 = · · ·γ 0 = δ = 0 in (2.1) and generating {ε 4t+s } as an NIID(0, 1) sequence, for T = 52, 100, 152, 300. 5 Here and throughout the paper, simulations were programmed in Gauss 7.0 using 50,000 replications. See the discussion following Remark 3.5 below, regarding computation of the asymptotic critical values.
3. ASYMPTOTIC REPRESENTATIONS For the set of OLS and QD detrended seasonal unit root tests considered in Section 2, the following lemma details their asymptotic behaviour. L EMMA 3.1. Let {x 4t+s } be generated according to (2.1)–(2.3) and let Assumption 2.1 hold. For i = 1, 2, 3, 4, let Wi (r) c=0 Kic (r) := rc −1/2 + Wic (r) c < 0, α¯ i (e − 1)(−2c) where W i (r), i = 1, . . . , 4, are independent standard Brownian motion processes, W ic (r), i = 1, . . . , 4, are independent standard Ornstein-Uhlenbeck processes given by r e(r−s)c dWi (s) Wic (r) := 0
and the spectral magnitudes which are defined as α¯ 1 := (α1 + α2 + α3 + α4 )/2 α¯ 2 := (−α1 + α2 − α3 + α4 )/2 √ α¯ 3 := (α4 − α2 )/ 2 √ α¯ 4 := (α3 − α1 )/ 2. Observe that the spectral magnitudes, α¯ i , i = 1, . . . , 4, are mutually independent, being mutually orthogonal transformations of the (mutually independent) seasonal initial condition magnitudes, α i , i = 1, . . . , 4, of (2.7). Also define 1 μ Kic (r) := Kic (r) − Kic (s)ds i = 1, 2, 3, 4. 0
Then under H c of (2.6), the asymptotic distributions of the OLS and QD detrended t 1 and t 2 statistics from (2.8) are given by j
d
ti →
5
j
j
Kic (1)2 − Kic (0)2 − 1 j =: τi 1 j 2 0 Kic (r)2 dr
i = 1, 2;
j = OLS, QD
Note that the chosen values of T are consistent with complete years of data (i.e. N = 13, 25, 38, 75).
C The Author(s). Journal compilation C Royal Economic Society 2008.
416
D.I. Harvey, S.J. Leybourne and A.M. Robert Taylor Table 1. Critical values and λ ζ values for ζ -level quarterly seasonal unit root tests. t1
T
ζ = 0.10
ζ = 0.05
t2 ζ = 0.01
ζ = 0.10
ζ = 0.05
ζ = 0.01
52 100
−3.18 −3.15
Panel A. Critical values for OLS detrended tests −3.49 −4.12 −2.63 −2.96 −3.45 −4.04 −2.61 −2.92
152 300 ∞
−3.15 −3.14 −3.13
−3.44 −3.43 −3.42
−2.90 −2.87 −2.86
−3.52 −3.48 −3.44
52 100
−3.07 −2.91
Panel B. Critical values for QD detrended tests −3.37 −4.00 −2.34 −2.64 −3.19 −3.75 −2.13 −2.41
−3.27 −3.02
152 300
−2.83 −2.72
−3.11 −3.01
−3.67 −3.59
−2.01 −1.86
−2.31 −2.16
−2.90 −2.75
∞
−2.56
−2.85
−3.41
−1.62
−1.94
−2.56
∞
1.070
1.058
Panel C. λ ζ values for UR tests 1.043 1.126
1.095
1.065
−4.00 −3.99 −3.96
−2.59 −2.58 −2.57
F 34 T
ζ = 0.10
ζ = 0.05
−3.62 −3.53
F 234 ζ = 0.01
ζ = 0.10
ζ = 0.05
52 100
6.01 5.82
Panel A. Critical values for OLS detrended tests 7.23 9.97 5.75 6.79 6.92 9.33 5.42 6.33
152 300 ∞
5.71 5.68 5.62
6.73 6.71 6.62
8.97 9.00 8.78
5.30 5.22 5.13
ζ = 0.01
9.12 8.30
6.14 6.04 5.87
7.90 7.84 7.52
52
3.69
Panel B. Critical values for QD detrended tests 4.53 6.50 3.84
4.57
6.23
100 152 300
3.14 2.91 2.66
3.92 3.66 3.38
5.72 5.38 5.07
3.14 2.82 2.50
3.77 3.42 3.08
5.24 4.75 4.34
∞
2.39
3.07
4.70
2.20
2.74
3.89
∞
1.197
1.163
1.131
1.101
Panel C. λ ζ values for UR tests 1.118
1.163
C The Author(s). Journal compilation C Royal Economic Society 2008.
Seasonal unit root tests and the role of initial conditions
417
Table 1. Continued. F 1234 T
ζ = 0.10
ζ = 0.05
ζ = 0.01
52
6.41
Panel A. Critical values for OLS detrended tests 7.40
9.61
100 152
5.95 5.80
6.77 6.53
8.54 8.18
300 ∞
5.68 5.52
6.41 6.19
7.90 7.61
Panel B. Critical values for QD detrended tests 52 100
4.79 3.96
5.51 4.56
7.21 5.92
152 300 ∞
3.62 3.25 2.81
4.18 3.77 3.32
5.38 4.91 4.35
1.118
Panel C. λ ζ values for UR tests 1.100
1.075
∞
where
OLS K1c (r)
:=
μ K1c (r)
1 − 12 r − 2
0
1
1 s− K1c (s)ds 2
μ
OLS K2c (r) := K2c (r) QD K1c (r)
:= K1c (r) −
c¯1∗ rK1c (1)
− 3 1 − c¯1∗ r
1
sK1c (s)ds 0
QD K2c (r) := K2c (r)
with c¯1∗ := (1 − c¯1 )(1 − c¯1 + c¯12 /3). Moreover, the asymptotic distributions of the F 34 , F 234 and F 1234 statistics from (2.8) under H c are given by j
1 j 2 j 2 j A + B =: τ34 , j = OLS, QD 2 2 d 1 j 2 j τ2 + Aj + (B j )2 =: τ234 , → j = OLS, QD 3 j 2 d 1 j 2 j τ1 + τ2 + (Aj )2 + (B j )2 =: τ1234 , → j = OLS, QD 4 d
F34 → j
F234 j
F1234
C The Author(s). Journal compilation C Royal Economic Society 2008.
418
D.I. Harvey, S.J. Leybourne and A.M. Robert Taylor
where
OLS
A
1
:= c 0
QD
A
:= c
B QD
1
+ 0
1
K3c 0
B OLS
μ K3c (r)2 dr
(r)2 dr
1 μ K4c (r)2 dr
+
0
1 μ μ K3c (r)dW3 (r) + 0 K4c (r)dW4 (r) 1 μ 1 μ 2 2 0 K3c (r) dr + 0 K4c (r) dr
1
+
K4c (r)2 dr 0
1 1 α¯ 3 c(−2c)−1/2 0 K3c (r)dr + α¯ 4 c(−2c)−1/2 0 K4c (r)dr + 1 1 2 2 0 K3c (r) dr + 0 K4c (r) dr 1 1 0 K3c (r)dW3 (r) + 0 K4c (r)dW4 (r) + 1 1 2 2 0 K3c (r) dr + 0 K4c (r) dr 1 μ 1 μ 0 K3c (r)dW4 (r) − 0 K4c (r)dW3 (r) := 1 μ 1 μ 2 2 0 K3c (r) dr + 0 K4c (r) dr 1 1 α¯ 4 c(−2c)−1/2 0 K3c (r)dr − α¯ 3 c(−2c)−1/2 0 K4c (r)dr := 1 1 2 2 0 K3c (r) dr + 0 K4c (r) dr 1 1 K3c (r)dW4 (r) − 0 K4c (r)dW3 (r) + 0 . 1 2 dr + 1 K (r)2 dr K (r) 3c 4c 0 0
R EMARK 3.1. Under the null hypothesis H 0 of (2.5) the test statistics do not depend on the initial conditions {ξ j }4j =1 (see footnote 1), so they play no role in their asymptotic null distributions. It is under the alternative hypothesis, H c of (2.6) with c < 0, that the initial conditions have an effect. For a given statistic, setting the relevant value(s) of the {α¯ i }4i=1 to zero, the limiting representation given in Lemma 3.1 reduces to the corresponding representation for the statistic when the initial conditions are asymptotically negligible, as given in, inter alia, Rodrigues and Taylor (2004b, 2007). R EMARK 3.2. Observe that the limiting distributions of the OLS and QD detrended HEGY tests from (2.8) do not depend on the magnitudes, α i , of the initial conditions, ξ i , i = 1, . . . , 4, of (2.7) directly. Rather, they depend on the magnitude of frequency specific linear combinations of these initial conditions, what we will term spectral initial conditions. The zero frequency initial condition is given by ξ¯1 := ξ1 + ξ2 + ξ3 + ξ4 , that for the biannual frequency by ξ¯2 := −ξ1 + ξ2 − ξ3 + ξ4 , and those for the annual frequency by ξ¯3 := ξ4 − ξ2 and ξ¯4 := ξ3 − ξ1 . Notice from (2.7) that the spectral magnitudes therefore satisfy α¯ i ∼ I N (μ¯ i I(σα2 = 0), σα2 ), i = 1, . . . , 4, with μ¯ 1 :=√(μα,1 + μα,2 + μα,3 + μα,4 )/2, √ μ¯ 2 := (−μα,1 + μα,2 − μα,3 + μα,4 )/2, μ¯ 3 := (μα,4 − μα,2 )/ 2 and μ¯ 4 := (μα,3 − μα,1 )/ 2. Consequently if, for example, the magnitude of the initial conditions from each of the seasons happened to sum to zero (which would imply that α¯ 1 = 0), then the asymptotic and t QD local power functions of the t OLS 1 tests would be the same as if these initial conditions 1 were asymptotically vanishing. Notice that the asymptotic local power function of the joint C The Author(s). Journal compilation C Royal Economic Society 2008.
Seasonal unit root tests and the role of initial conditions
419
frequency F 234 test depends on the spectral initial conditions relating to both the biannual and annual frequencies, while that for the F 1234 test additionally depends on the zero frequency initial condition. QD QD R EMARK 3.3. From Lemma 3.1, it is seen that the limiting distributions of the t QD 1 , t 2 and F 34 OLS statistics are mutually independent under H c , as are the limiting distributions of the t OLS 1 , t2 QD QD OLS OLS OLS and F 34 statistics. Moreover, the limiting distributions of the t 1 and F 234 and t 1 and F 1234 statistics are also mutually independent. In each case this follows from the independence of the K ic (r), i = 1, . . . , 4, limiting processes. Indeed, this implies more generally that the limiting distributions of different frequency statistics will be mutually independent regardless of whether OLS they be based on OLS or QD detrended data, so that, for example, the t QD 2 and F 34 statistics also have independent limiting distributions. However, it should be noted that, for example, the and t QD t OLS 1 1 statistics will not have independent limiting distributions owing to the fact that they are both functionals of K 1c (r).
R EMARK 3.4. If, as discussed in footnote 1, the linear trend variable is omitted from z 4t+s , μ then the representation given in Lemma 3.1 for t 1OLS would hold on re-defining K OLS 1c := K 1c (r), and for t QD by re-defining K OLS 1 1c := K 1c (r). In this case the stated representations given for both the OLS and QD detrended versions of the t 2 , F 34 and F 234 statistics would be unchanged, while those for the OLS and QD detrended F 1234 statistics would still be of the form given in and τ QD Lemma 3.1, noting the change in τ OLS 1 from above. Should seasonal trends be included 1 and t QD would remain unchanged, the in z 4t+s , then while the limiting distributions of t OLS 1 1 μ OLS GLS OLS limiting distributions for t 2 and t 2 would obtain on re-defining K2c := K2c (r) − 12(r − 1 1 QD 1 ) 0 (s − 12 )K2c (s)ds and K2c := K2c (r) − c¯2∗ rK2c (1) − 3(1 − c¯2∗ )r 0 sK2c (s)ds, where c¯2∗ := 2 OLS (1 − c¯2 )(1 − c¯2 + c¯22 /3). Similarly, in this case the representations for the F QD 34 and F 34 statistics QD QD OLS OLS would hold (and, hence, together with the changes for t 2 and t 2 , for the F 1234 , F 1234 , F QD 234 and 1 QD ∗ ∗ ¯ ¯ F OLS statistics) on replacing K (r) by K := K (r) − c rK (1) − 3(1 − c )r sK jc jc jc j c (s)ds, jc 234 3 3 0 ∗ 2 QD where c¯3 := (1 − c¯3 )(1 − c¯3 + c¯3 /3), for j = 3, 4 in the expressions for A and B QD , and 1
:= Kjμc (r) − 12(r − 12 ) 0 s − 12 Kj c (s)ds, for j = 3, 4 in the replacing K μjc (r) with KjOLS c expressions for AOLS and B OLS . R EMARK 3.5. As is standard in the literature on HEGY-type tests, Lemma 3.1 is derived under the condition that u 4t+s is a finite-order stationary AR(p) process, with a lag truncation of p used in (2.8), the results holding more generally for a lag length p∗ ≥ p. However, where u 4t+s is generated according to a stationary and invertible ARMA process the assumption that φ(z) is of finite order must be dropped, and here we conjecture that the stated results will continue to hold provided p is of o (T 1/3 ); cf. Chang and Park (2002) and Said and Dickey (1984), inter alia. In Figures 1–4, we graph the asymptotic local powers of the OLS and corresponding QD detrended HEGY tests from (2.8) run at the nominal 0.05 level, of each of the tests from Lemma 3.1 for c = −5, −10. The results reported in Figures 1 (c = −5) and 2 (c = −10) pertain to the fixed initial conditions case, while Figures 3 (c = −5) and 4 (c = −10) are for QD the case of random initial conditions. In the case of the t QD 1 and F 1234 statistics, whose limiting distribution depends on the QD parameter c¯1 , the reported results pertain to c¯1 = −13.5 . The null critical values and local powers were obtained by direct simulation of the limiting functionals in C The Author(s). Journal compilation C Royal Economic Society 2008.
420
D.I. Harvey, S.J. Leybourne and A.M. Robert Taylor
Figure 1. Asymptotic local power: c = −5; j = OLS: - - -, j = GLS: – –, j = UR:—.
C The Author(s). Journal compilation C Royal Economic Society 2008.
Seasonal unit root tests and the role of initial conditions
421
Figure 1. Continued.
Lemma 3.1 approximating the Wiener processes using NIID(0, 1) random variates, and with the integrals approximated by normalized sums of 1000 steps. For the results in Figures 1 and 2 we report in parts (a), (b) and (c) for the pairs QD QD QD OLS OLS of tests, (t OLS 1 , t 1 ), (t 2 , t 2 ) and (F 34 , F 34 ), respectively, the local powers as functions of the absolute values of the relevant magnitude parameters μ¯ 1 , μ¯ 2 and μ¯ 3 = μ¯ 4 , ¯ = {0.0, 0.1, 0.2, . . . , 6.0}, i = 1, . . . , 4. 6 For the joint frequency respectively, for |μ¯ i | = |μ| QD OLS (F 234 , F 234 ) pair of tests we report in parts (d), (e) and (f) of Figures 1–2 the local powers ¯ = {0.0, 0.1, 0.2, . . . , 6.0} with μ¯ 3 = μ¯ 4 = 0; |μ¯ 3 | = |μ¯ 4 | = |μ| ¯ = as functions of: |μ¯ 2 | = |μ| ¯ = {0.0, 0.1, 0.2, . . . , 6.0}, {0.0, 0.1, 0.2, . . . , 6.0} with μ¯ 2 = 0, and |μ¯ 2 | = |μ¯ 3 | = |μ¯ 4 | = |μ| OLS , respectively. Parts (g), (h) and (i) of Figures 1 and 2 report local powers of the (F 1234 QD F 1234 ) pair of tests as functions of |μ¯ 1 | = |μ| ¯ = {0.0, 0.1, 0.2, . . . , 6.0} with μ¯ 2 = μ¯ 3 = μ¯ 4 = ¯ = {0.0, 0.1, 0.2, . . . , 6.0} with μ¯ 1 = μ¯ 2 = 0, and |μ¯ 1 | = |μ¯ 2 | = |μ¯ 3 | = 0; |μ¯ 3 | = |μ¯ 4 | = |μ| ¯ = {0.0, 0.1, 0.2, . . . , 6.0}, respectively. Corresponding results for the case of random |μ¯ 4 | = |μ| starting values are reported in Figures 3 and 4 as functions of σ α = {0.0, 0.1, 0.2, . . . , 6.0}. 6 It should be clear from the representations given in Lemma 3.1 that the asymptotic local power functions of the tests do not depend on the signs of the α¯ i , i = 1, . . . , 4. C The Author(s). Journal compilation C Royal Economic Society 2008.
422
D.I. Harvey, S.J. Leybourne and A.M. Robert Taylor
Figure 2. Asymptotic local power: c = −10; j = OLS: - - -, j = GLS: – –, j = UR:–––.
C The Author(s). Journal compilation C Royal Economic Society 2008.
Seasonal unit root tests and the role of initial conditions
423
Figure 2. Continued.
Consider first the results for the t 1 , t 2 and F 34 tests. 7 We immediately see from these results that with either random or fixed initial conditions, the power curves of the QD detrended HEGY tests in each case exhibit monotonic decrease in σ α or |μ¯ i |, whilst the power of the OLS test is seen to have detrended HEGY tests increase monotonically. In the fixed case, the t OLS 1 test when (approximately) | μ ¯ | = 1.5 and 1.3 for c = −5 and −10, higher power than the t QD 1 1 QD and t tests these crossing points both occur at about |μ¯ 2 | = 1.0, respectively. For the t OLS 2 2 QD OLS while for the F 34 and F 34 tests these occur at about |μ¯ 3 | = |μ¯ 4 | = 0.9. A key feature here is the drastic speed with which the power of the QD detrended version of the tests approaches zero QD ¯ 1 | ≥ 4, while for t QD with |μ|: ¯ the t QD 1 test has power which is effectively zero for |μ 2 and F 34 power is effectively zero even by |μ¯ 2 | ≥ 2 and |μ¯ 3 | = |μ¯ 4 | ≥ 2, respectively. For the random and t QD case the crossing points for t OLS 1 1 occur at about σ α = 1.8 and 1.6 for c = −5 and −10, QD
Notice that the powers of the t 1 and t OLS tests are in general rather lower than the power functions of the 1 corresponding tests at other frequencies. This is because of the presence of a non-seasonal linear trend in the detrending routine. It is well known that this causes a significant reduction in available power; cf. Elliott et al. (1996), Harvey et al. (2008) and Rodrigues and Taylor (2004a, 2007). 7
C The Author(s). Journal compilation C Royal Economic Society 2008.
424
D.I. Harvey, S.J. Leybourne and A.M. Robert Taylor
Figure 3. Asymptotic local power: c = −5, α i ∼ N (0, σ 2α ), i = 1, 2, 3, 4; j = OLS: - - -, j = GLS: – –, j = UR:–––.
C The Author(s). Journal compilation C Royal Economic Society 2008.
Seasonal unit root tests and the role of initial conditions
425
Figure 4. Asymptotic local power: c = −10, α i ∼ N (0, σ α2 ), i = 1, 2, 3, 4; j = OLS: - - -, j = GLS: – –, j = UR:–––.
C The Author(s). Journal compilation C Royal Economic Society 2008.
426
D.I. Harvey, S.J. Leybourne and A.M. Robert Taylor
OLS respectively. For t OLS and t QD 2 they occur at about 1.6 and 1.4, respectively, while for the F 34 2 QD and F 34 tests these occur at about 1.1 and 0.9, respectively. For each of these pairs of tests, the extent of the power dominance of the QD detrended variant over the OLS detrended variant increases as σ α shrinks towards zero. Now consider the results for the joint frequency F 234 and F 1234 tests. As with the results for the single frequency tests discussed above, we see that in both the fixed and random cases the QD detrended HEGY tests dominate the corresponding OLS detrended tests on power for small initial conditions with the pattern reversing for large initial conditions. In the case of the fixed initial conditions, the limiting distributions of the joint tests now depend on more than one spectral initial condition (precisely, the F 234 tests depends on ξ¯2 , ξ¯3 and ξ¯4 , while the F 1234 test additionally depends on ξ¯1 ) so that the relationship between the power properties of the QD and OLS detrended variants of the test and the underlying initial conditions is more complex than for the single frequency tests. In the fixed case, the crossing points of the power functions of a given joint frequency tests are consequently related to the magnitude of all of the spectral intercepts which feature in the statistic’s limiting distribution. As can clearly be seen by comparing, for example, parts (g) and (i) of Figure 1, the crossing point for the F 1234 tests when c = −5 is at about |μ¯ 1 | = 4.4 when ¯ = 0.8 when |μ¯ 1 | = |μ¯ 2 | = |μ¯ 3 | = |μ¯ 4 | = |μ|, ¯ indicating μ¯ 2 = μ¯ 3 = μ¯ 4 = 0, but is at about |μ| that, as might be expected, the point at which the QD version of the joint frequency tests becomes inferior on power to the OLS version occurs for smaller magnitudes of the spectral initial conditions the more of these there are that are non-zero. Moreover, as can be seen by comparing, for example, (d) and (e) with (f), and (g) and (h) with (i) in Figure 1 it is only when all of the spectral initial values relevant to a particular test are non-zero that the power of the QD test collapses to zero as the magnitude of the initial conditions increases. To explain this QD statistic. Now, asymptotically, this is equal to the phenomenon, consider, for example, the F 1234 QD average of the squared t j , j = 1, . . . , 4, statistics. Consider then part (g) of Figures 1 and 2. 2 statistic (and, hence, also the power of the (t QD Here while the power of the t QD 1 1 ) statistic), QD QD collapses to zero as |μ¯ 1 | increases, the spectral intercepts relating to the t 2 and F 34 statistics QD are all zero and so these tests maintain power, such that the power of the F 1234 test will not drop to zero. This also explains why the crossing point for the joint tests moves to the left, other things equal, as the number of non-zero spectral intercepts which affect the statistic increases. In the random case, similar patterns are seen in the joint frequency F 1234 and F 234 tests as for the t 1 , t 2 and F 34 tests, with the crossing points occurring at about 1.0 for each of the tests for c = −5, and at about 0.9 for c = −10. 
An interesting implication of the findings above is that, depending on the magnitudes of the individual spectral initial conditions it is possible that at one frequency, due to a large spectral initial condition at that frequency, the OLS detrended variant of the HEGY test could dominate the corresponding QD detrended test on power, but that if the initial conditions at the other spectral frequencies were small, here the QD variants of the tests would dominate on power. Consequently, while constructing the HEGY regression from QD detrended data would be appropriate for tests at those frequencies where small spectral initial conditions pertained, it would be a very inefficient thing to do for any frequencies where the initial condition was large, and vice versa.
C The Author(s). Journal compilation C Royal Economic Society 2008.
Seasonal unit root tests and the role of initial conditions
427
4. A UNION OF REJECTIONS TESTING STRATEGY Given the clear results of Figures 1–4, it seems sensible to consider whether it is possible to devise a testing strategy which, for small values of σ α in the random case or the relevant |μ i |, i = 1, . . . , 4, magnitudes in the fixed case, captures the power advantages of the QD detrended HEGY tests over the corresponding OLS detrended tests and, at the same time, exploits the reverse relationship that exists between the tests’ power when σ α or |μ i |, i = 1, . . . , 4, is large. As noted in the introduction, in the context of initial condition uncertainty in the non-seasonal case, Harvey et al. (2008) suggest a simple union of rejections decision rule between the QDand OLS-based ADF tests; the unit root null being rejected if either of the QD detrended ADF and OLS detrended ADF tests rejects. This approach effectively combines the superior power properties of the QD detrended ADF test when the initial condition is small with those of the OLS detrended ADF when the initial condition is large and, as such, represents a near admissible procedure; see M¨uller (2008). Compared with other more involved procedures, such as those of Elliott and M¨uller (2006) and Harvey and Leybourne (2005, 2006), it is extremely competitive in terms of power. Whilst it is not immediately clear how these competing procedures might be extended to the current seasonal case, extension of the union of rejections approach is quite straightforward. We simply take the union of rejections of the QD and OLS detrended versions of each of the t 1 , t 2 , F 34 , F 234 and F 1234 statistics. Of course, Bonferroni’s inequality makes clear that none of these individual strategies will be size controlled for c = 0, being oversized even asymptotically. However, as in the non-seasonal work of Harvey et al. (2008), it is straightforward to correct these sizes in the limit, by multiplying the critical values associated with both the QD and OLS detrended versions of a given statistic by a common constant, chosen such that the composite union of rejections strategy has correct asymptotic size. OLS be used generically to denote the asymptotic ζ significance level critical Let cv QD ζ and cv ζ values of the QD and OLS detrended HEGY tests, and let the size-correction constant be denoted generically by λ ζ for nominal ζ -level tests. Then: (i)
For the zero frequency, the relevant size-corrected union of rejections procedure is given by
t1U R (ζ ) := t1QD I t1QD < λζ cvζQD + t1OLS I t1QD ≥ λζ cvζQD , QD QD UR where if t UR 1 (ζ ) = t 1 , a rejection of H 0,0 is recorded if t 1 (γ ) < λ ζ cvζ ; otherwise if UR OLS UR OLS t 1 (ζ ) = t 1 , a rejection is recorded if t 1 (ζ ) < λ ζ cv ζ . In the limit, using the relevant expressions from Lemma 3.1
d t1U R (ζ ) → τ1QD I τ1QD < λζ cvζQD + τ1OLS I τ1QD ≥ λζ cvζQD , OLS where, for example, for tests run at the asymptotic 0.05 level, cv QD = ζ = −2.85 and cv ζ −3.42; cf. Table 1, Panels A and B. In order to determine λ ζ in a computationally efficient way, we recognise that the decision rule associated with t UR 1 (ζ ) can also be written as
Reject H0,0 if min
t1QD ,
C The Author(s). Journal compilation C Royal Economic Society 2008.
cvζQD cvζOLS
t1OLS
< λζ cvζQD .
(4.1)
428
(ii)
D.I. Harvey, S.J. Leybourne and A.M. Robert Taylor
The representation in (4.1) makes it very straightforward to calculate λ ζ . Specifically, setting c = 0 we find the limit distribution of the min function in (4.1) using the (joint) OLS limit distributions of t QD 1 and t 1 , then obtain an asymptotic ζ -level critical value from QD UR this empirical cdf, say cv ζ . Then λ ζ is given by λ ζ := cv UR ζ /cv ζ . Values of λ ζ for conventional significance levels are reported in Table 1, Panel C; for example, at the asymptotic 0.05 level, λ ζ = 1.058. For the biannual frequency, the union of rejections is given by
t2U R (ζ ) := t2QD I t2QD < λζ cvζQD + t2OLS I t2QD ≥ λζ cvζQD , QD QD UR where if t UR 2 (ζ ) = t 2 , a rejection of H 0,2 is recorded if t 2 (ζ ) < λ ζ cv ζ ; otherwise if OLS UR OLS t UR 2 (ζ ) = t 2 , a rejection is recorded if t 2 (ζ ) < λ ζ cv ζ . In the limit, from Lemma 3.1
d t2U R (ζ ) → τ2QD I τ2QD < λζ cvζQD + τ2OLS I τ2QD ≥ λζ cvζQD
(iii)
OLS and at the asymptotic 0.05 level, cv QD = −2.86. Values of λ ζ can be ζ = −1.94 and cv ζ UR found in an analogous way to those for t 1 (ζ ); for example at the 0.05 level, λ ζ = 1.095. For the annual frequency, the union of rejections is
QD QD QD UR OLS (ζ ) := F34 I F34 > λζ cvζQD + F34 I F34 ≤ λζ cvζQD , F34 QD QD UR where if F UR 34 (ζ ) = F 34 , a rejection of H 0,1 is recorded if F 34 (ζ ) > λ ζ cv ζ ; otherwise if OLS UR OLS F UR 34 (ζ ) = F 34 , a rejection is recorded if F 34 (ζ ) > λ ζ cv ζ . From Lemma 3.1, we have that
QD d QD QD UR OLS (ζ ) → τ34 I τ34 > λζ cvζQD + τ34 I τ34 ≤ λζ cvζQD . F34 OLS For tests run at the asymptotic 0.05 level, cv QD = 6.62. To find λ ζ for a ζ = 3.07 and cv ζ given significance level, the decision rule analogous to (4.1) here involves the maximum rather than minimum function, so that the decision rule of F UR 34 (ζ ) can also be written as
QD cvζ QD OLS Reject H0,1 if max F34 , F34 > λζ cvζQD . cvζOLS
(iv)
Aside from this change, the λ ζ values can be obtained in the same way as for t UR 1 (ζ ) and t UR 2 (ζ ). At the 0.05 level, λ ζ = 1.163; cf. Table 1, Panel C. For testing the joint null hypothesis of unit roots at the biannual and annual frequencies, the union of rejections is QD QD QD UR OLS (ζ ) := F234 I F234 > λζ cvζQD + F234 I F234 ≤ λζ cvζQD , F234 QD QD UR where if F 234 (ζ ) = F 234 , a rejection of H 0,1 ∩ H 0,2 is recorded if F UR 234 (ζ ) > λ ζ cv ζ ; UR OLS UR OLS otherwise if F 234 (ζ ) = F 234 , a rejection is recorded if F 234 (ζ ) > λ ζ cv ζ . Again using Lemma 3.1, we have that
QD d QD QD UR OLS (ζ ) → τ234 I τ234 > λζ cvζQD + τ234 I τ234 ≤ λζ cvζQD . F234 OLS For tests run at the asymptotic 0.05 level, cv QD = 5.87 and λ ζ = 1.131, with ζ = 2.74, cv ζ λ ζ values computed in an analogous way to those for F UR (ζ ). 34 C The Author(s). Journal compilation C Royal Economic Society 2008.
Seasonal unit root tests and the role of initial conditions
(v)
429
For testing the joint null hypothesis of unit roots at the zero, biannual and annual frequencies, the union of rejections is
QD QD QD UR OLS F1234 (ζ ) := F1234 I F1234 > λζ cvζQD + F1234 I F1234 ≤ λζ cvζQD , QD UR UR (ζ ) = F 1234 , a rejection of H 0 is recorded if F 1234 (ζ ) > λ ζ cv QD where if F 1234 ζ ; otherwise UR OLS UR OLS if F 1234 (ζ ) = F 1234 , a rejection is recorded if F 1234 (ζ ) > λ ζ cv ζ . Using Lemma 3.1 we have that
QD d QD QD UR OLS F1234 (ζ ) → τ1234 I τ1234 > λζ cvζQD + τ1234 I τ1234 ≤ λζ cvζQD .
= 3.32, cv OLS = 6.19 and λ ζ = 1.100; For tests run at the asymptotic 0.05 level, cv QD ζ ζ cf. Table 1. Observe that these procedures share the same asymptotic independence properties as were detailed in Remark 3.2, so that, for example, t 1UR (ζ ) is asymptotically independent of t 2UR (ζ ). Notice also that the union of rejections approach yields testing strategies which are correctly sized in the limit, regardless of the value of σ α or the μ α,i , i = 1, . . . , 4, since the (exact) null distributions of the tests involved do not depend on these parameters. R EMARK 4.1. Before continuing, for readers wishing to apply the procedures outlined in this paper to monthly data, Table 2 reports the corresponding finite sample and asymptotic critical values for the OLS and QD detrended monthly seasonal unit root tests together with the λ ζ values for constructing the corresponding union of rejections tests. These were obtained by Monte Carlo simulation in an analogous way to those reported in Table 1, again using 50,000 replications; in line with Beaulieu and Miron (1993), the finite sample critical values for the Fodd,even tests were computed from a vector of length 250,000, comprised of the 50,000 simulated observations on all five similar statistics (i.e. F 3,4 , F 5,6 , F 7,8 , F 9,10 and F 11,12 ). The notation used for these tests is consistent with that adopted by Beaulieu and Miron (1993) and Taylor (1998) who developed OLS detrended monthly seasonal unit root tests. The corresponding QD detrended tests are as outlined in Rodrigues and Taylor (2007). The union of rejections between the OLS and corresponding QD detrended tests are then formed using exactly the same principles as outlined above for the quarterly case. The asymptotic power curves for the union of rejections tests in the case of quarterly data UR UR UR are shown in Figures 1–4. As we would conjecture, the power curves of the t UR 1 , t 2 , F 34 , F 234 UR and F 1234 tests tend to mimic (lie a little below due to the implicit size correction) those of QD QD QD QD t QD 1 , t 2 , F 34 , F 234 and F 1234 , respectively, for small magnitudes of the relevant spectral initial OLS OLS OLS OLS conditions, then mimic those of t OLS 1 , t 2 , F 34 , F 234 and F 1234 , respectively, for large initial conditions. Thus, for the smaller initial conditions the union of rejections tests pick up a good deal of the extra power available to the QD detrended tests over OLS detrended variants, while for the larger initial conditions they avoid the dramatic power losses often associated with QD detrended tests and follow, with only a modest loss in power, the (typically rising) power profile of the OLS detrended tests. It is interesting to observe, also, that when the initial conditions are fixed, there is an almost exact common point of intersection for the QD detrended, OLS detrended and union of rejections tests. R EMARK 4.2. Although the union of rejections based tests outlined in this paper are easy to compute they are not motivated from formal optimality criteria. That being the case, it would C The Author(s). Journal compilation C Royal Economic Society 2008.
430
D.I. Harvey, S.J. Leybourne and A.M. Robert Taylor Table 2. Critical values and λ ζ values for ζ -level monthly seasonal unit root tests. t1
T
ζ = 0.10
ζ = 0.05
t2 ζ = 0.01
ζ = 0.10
ζ = 0.05
ζ = 0.01
156
−3.05
Panel A. Critical values for OLS detrended tests −3.33 −3.87 −2.55 −2.85
−3.39
300 456
−3.10 −3.12
−3.39 −3.40
−3.92 −3.96
−2.55 −2.56
−2.85 −2.85
−3.41 −3.42
900 ∞
−3.13 −3.13
−3.40 −3.42
−3.98 −3.96
−2.57 −2.57
−2.85 −2.86
−3.45 −3.44
156 300
−2.92 −2.85
Panel B. Critical values for QD detrended tests −3.20 −3.73 −2.24 −2.52 −3.11 −3.67 −2.08 −2.36
−3.08 −2.92
456 900
−2.79 −2.71
−3.07 −2.98
∞
−2.56
−2.85
∞
1.070
1.058
−3.63 −3.54
−1.97 −1.85
−2.26 −2.14
−2.82 −2.75
−3.41 −1.62 Panel C. λ ζ values for UR tests 1.043 1.126
−1.94
−2.56
1.095
1.065
F odd,even T
ζ = 0.10
ζ = 0.05
F 2...12 ζ = 0.01
ζ = 0.10
ζ = 0.05
ζ = 0.01
Panel A. Critical values for OLS detrended tests 156 300
5.48 5.53
6.47 6.55
8.64 8.69
4.50 4.29
4.94 4.68
5.84 5.49
456 900 ∞
5.58 5.61 5.62
6.59 6.61 6.62
8.73 8.74 8.78
4.24 4.16 4.11
4.62 4.55 4.46
5.42 5.29 5.15
156
3.44
Panel B. Critical values for QD detrended tests 4.19 5.90 2.73
3.03
3.64
300 456 900
3.04 2.84 2.63
3.78 3.56 3.34
2.25 2.03 1.84
2.51 2.28 2.09
3.03 2.80 2.58
∞
2.39
3.07
4.70 1.65 Panel C. λ ζ values for UR tests
1.86
2.34
∞
1.197
1.163
1.084
1.060
5.46 5.23 4.97
1.118
1.094
C The Author(s). Journal compilation C Royal Economic Society 2008.
Seasonal unit root tests and the role of initial conditions
431
Table 2. Continued. F 1...12 ζ = 0.10
ζ = 0.05
ζ = 0.01
156
4.71
Panel A. Critical values for OLS detrended tests 5.15
6.08
300 456 900
4.49 4.43 4.35
4.87 4.79 4.71
5.67 5.60 5.43
∞
4.26
4.61 Panel B. Critical values for QD detrended tests
5.30
156 300 456
3.02 2.53 2.30
3.32 2.79 2.56
3.95 3.32 3.07
900 ∞
2.09 1.84
2.33 2.07 Panel C. λ ζ values for UR tests
2.84 2.52
∞
1.086
1.073
1.058
T
be of interest to explore how close the union of rejections tests developed in this paper are to being efficient. For the zero and biannual frequencies, where only a single (spectral) initial condition is involved, standard spectral independence results entail that the analysis of M¨uller and Elliott (2003) and Elliott and M¨uller (2006) can be directly applied. Consequently, the asymptotic relative efficiency of our union of rejections tests to these efficient tests will be identical to the corresponding comparison in the non-seasonal case. M¨uller and Elliott (2003) derive a family of asymptotically efficient tests, denoted Qμ (g, k) and Qτ (g, k) for the mean and trend cases, respectively, constructed to maximise weighted average power over different initial conditions for a given local alternative. Elliott and M¨uller (2006) further recommend feasible versions ˆ τ (15, 3.968), chosen so as to ˆ μ (10, 3.8) and Q of particular members of this family of tests, Q minimise the influence of the initial condition on test power. In Harvey et al. (2008, section 4), it is demonstrated that the non-seasonal union of rejections tests compare very favourably with the ˆ τ (15, 3.968) tests, with neither approach dominating the other overall across ˆ μ (10, 3.8) and Q Q initial conditions. In the current setting, the same favourable properties then carry over to the union of rejections tests for the zero and biannual frequencies. In the case of the annual and joint frequency tests, more than one spectral initial condition is involved (two in the case of testing at the annual frequency, three when testing jointly at the annual and biannual frequency, and four when testing jointly at the zero, biannual and annual frequencies; cf. Lemma 3.1). To derive asymptotically efficient tests in this context, the analysis of M¨uller and Elliott (2003) and Elliott and M¨uller (2006) would need extending to the multiple initial condition case. The weighted average power criterion then requires the specification of a multivariate distribution to govern the the relevant initial conditions, and the analysis becomes considerably more involved. While outside the scope of the present paper, a formal analysis of this problem would be interesting for future research. However, given the competitiveness of the union of rejections approach at the C The Author(s). Journal compilation C Royal Economic Society 2008.
432
D.I. Harvey, S.J. Leybourne and A.M. Robert Taylor
zero and biannual frequencies, we would expect it to also remain competitive with Elliott and M¨uller (2006)-type efficient tests at the annual and joint frequencies.
5. FINITE SAMPLE COMPARISONS

In this section we investigate the finite sample size and power properties of the QD and OLS detrended HEGY-type tests of Section 2.4, together with the corresponding union of rejections tests from Section 4. Our simulations are based on the DGP (2.1)–(2.3) under H_c of (2.6). Results are reported for samples of size T = 152 and T = 300. Without loss of generality, we set γ_{-3} = · · · = γ_0 = δ = 0 in (2.1). Throughout these simulations, we adopt the lag selection approach of Ng and Perron (1995), selecting p in the fitted regression (2.8) by downward testing at the 0.10 level from a maximum lag length of p_max = 12(T/100)^{1/4}. Finite sample critical values for the tests are taken from Table 1, Panels A and B.⁸ The scaling constants applied to the size-corrected union of rejections tests are the asymptotically valid ones given in Table 1, Panel C.
In Table 3, we first report the empirical sizes of the various tests for the case where c = 0 and where the innovation process u_{4t+s} is assumed to follow an ARMA(1,4) process of the form (1 − φL)u_{4t+s} = (1 − θL^4)ε_{4t+s}, for φ ∈ {0.0, 0.3, 0.6}, θ ∈ {−0.4, 0.0, 0.4} and ε_{4t+s} ∼ NIID(0, 1), with u_1 = ε_1 and ε_s = 0, s = −3, ..., 0.⁹ For the size results all initial conditions ξ_i, i = 1, ..., 4, are set to zero with no loss of generality. The sizes of the QD and OLS detrended tests are fairly similar throughout Table 3 and are generally free from serious size distortion (outside of the negative moving average cases), particularly for the larger sample size. If anything, the QD detrended tests display slightly less upward size distortion than the corresponding OLS detrended tests. For any given combination of φ and θ, the (asymptotically) size-corrected union of rejections tests tend to have sizes similar to the larger of the individual sizes of the QD and OLS detrended tests. This occurs because the union essentially selects whichever of the QD and OLS detrended tests is least favourable to the null. Overall, then, the results of Table 3 indicate that the union of rejections tests display reasonable size control.
Finite sample powers are given in Figures 5–6. Here we set φ = θ = 0 in the generating process to abstract from any confounding effects that might arise from size distortions. For brevity, we report results for the random initial conditions case only, and for the representative setting c = −10, using the same values of σ_α that underlie Figures 3 and 4. One notable feature is that all the tests tend to display lower power than in the asymptotic case, most noticeably when T = 152; we expect that this is partly explained by the lag selection procedure, which remains in operation in these experiments. Otherwise, the finite sample relationships between the QD detrended, OLS detrended and corresponding union of rejections tests across σ_α qualitatively resemble those of their asymptotic counterparts when T = 152; for T = 300, the resemblance is generally much closer.

⁸ Following conventional practice in the seasonal unit root literature, we base our tests on finite sample critical values. As was seen in Table 1, the critical values (especially for the QD detrended tests in Panel B) are quite sensitive to the sample size. Consequently, the use of asymptotic critical values in the case of i.i.d. errors would be likely to result in substantial finite sample over-size for many of the tests considered. This effect is mitigated by employing the finite sample critical values from Table 1 in cases of dependent as well as i.i.d. errors, even though they are strictly only appropriate in the latter case.
⁹ Although MA behaviour is not formally allowed under our assumption that u_{4t+s} follows a finite order AR(p) process, we nonetheless include it in our simulations so that we may informally assess the robustness of the tests to MA behaviour; cf. Remark 3.5.
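To make the decision rule behind the UR entries in Table 3 and Figures 5–6 concrete, the following minimal sketch (not the authors' code) applies the union of rejections frequency by frequency. The statistics, critical values and scaling constant are placeholders to be supplied by the user (cf. Table 1, Panels A–C); the function name and arguments are ours.

```python
# Minimal sketch of the union of rejections decision rule: reject the unit root
# null at a given frequency if EITHER the QD detrended or the OLS detrended
# HEGY statistic rejects, using critical values scaled by a size-correcting
# constant.  All inputs are placeholders supplied by the user.

def union_of_rejections(stat_qd, stat_ols, cv_qd, cv_ols, scale=1.0, left_tail=True):
    """stat_qd, stat_ols : the two detrended HEGY statistics at this frequency
    cv_qd, cv_ols     : the corresponding finite sample critical values
    scale             : size-correcting scaling constant for the union
    left_tail         : True for t-type statistics (reject for small values),
                        False for F-type statistics (reject for large values)"""
    if left_tail:
        return (stat_qd < scale * cv_qd) or (stat_ols < scale * cv_ols)
    return (stat_qd > scale * cv_qd) or (stat_ols > scale * cv_ols)

# Applied separately at each frequency, the rule does not impose one method of
# detrending on frequencies where the relevant spectral initial condition makes
# the other method preferable, e.g.
# reject_zero    = union_of_rejections(t1_qd, t1_ols, cv1_qd, cv1_ols, scale1, left_tail=True)
# reject_harmonic = union_of_rejections(F34_qd, F34_ols, cv34_qd, cv34_ols, scale34, left_tail=False)
```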
Table 3. Empirical sizes of nominal 0.05-level quarterly seasonal unit root tests.

                      t_1^j                      t_2^j
  T    φ     θ      OLS    GLS    UR         OLS    GLS    UR
 152  0.0  −0.4    0.126  0.110  0.118      0.093  0.090  0.090
            0.0    0.060  0.066  0.057      0.049  0.053  0.046
            0.4    0.065  0.071  0.061      0.052  0.057  0.050
      0.3  −0.4    0.112  0.102  0.107      0.102  0.100  0.099
            0.0    0.058  0.063  0.054      0.050  0.051  0.045
            0.4    0.075  0.082  0.072      0.047  0.052  0.045
      0.6  −0.4    0.092  0.090  0.088      0.102  0.097  0.099
            0.0    0.058  0.064  0.055      0.047  0.050  0.043
            0.4    0.063  0.070  0.061      0.047  0.054  0.045
 300  0.0  −0.4    0.090  0.086  0.085      0.070  0.073  0.070
            0.0    0.057  0.060  0.056      0.049  0.053  0.049
            0.4    0.056  0.060  0.054      0.049  0.053  0.048
      0.3  −0.4    0.080  0.079  0.076      0.076  0.078  0.076
            0.0    0.057  0.059  0.054      0.048  0.051  0.048
            0.4    0.058  0.061  0.056      0.049  0.052  0.048
      0.6  −0.4    0.071  0.073  0.070      0.074  0.078  0.076
            0.0    0.056  0.060  0.055      0.048  0.051  0.048
            0.4    0.052  0.056  0.051      0.049  0.054  0.049

                      F_34^j                     F_234^j
  T    φ     θ      OLS    GLS    UR         OLS    GLS    UR
 152  0.0  −0.4    0.100  0.073  0.092      0.117  0.093  0.115
            0.0    0.054  0.056  0.054      0.052  0.057  0.054
            0.4    0.053  0.057  0.053      0.054  0.059  0.056
      0.3  −0.4    0.105  0.078  0.096      0.127  0.103  0.125
            0.0    0.053  0.057  0.053      0.052  0.057  0.054
            0.4    0.052  0.057  0.053      0.050  0.057  0.052
      0.6  −0.4    0.107  0.078  0.100      0.134  0.104  0.130
            0.0    0.052  0.056  0.053      0.050  0.055  0.053
            0.4    0.051  0.055  0.052      0.050  0.056  0.051
 300  0.0  −0.4    0.067  0.059  0.062      0.076  0.068  0.074
            0.0    0.051  0.055  0.053      0.050  0.053  0.052
            0.4    0.049  0.055  0.051      0.050  0.052  0.051
      0.3  −0.4    0.070  0.060  0.065      0.083  0.073  0.081
            0.0    0.050  0.054  0.052      0.049  0.052  0.051
            0.4    0.050  0.054  0.051      0.049  0.052  0.051
      0.6  −0.4    0.069  0.060  0.066      0.082  0.071  0.081
            0.0    0.050  0.054  0.051      0.049  0.052  0.051
            0.4    0.049  0.055  0.051      0.049  0.053  0.051

                      F_1234^j
  T    φ     θ      OLS    GLS    UR
 152  0.0  −0.4    0.152  0.123  0.150
            0.0    0.056  0.063  0.057
            0.4    0.061  0.068  0.063
      0.3  −0.4    0.155  0.125  0.154
            0.0    0.057  0.062  0.056
            0.4    0.064  0.073  0.067
      0.6  −0.4    0.155  0.122  0.151
            0.0    0.055  0.061  0.056
            0.4    0.057  0.066  0.061
 300  0.0  −0.4    0.097  0.090  0.097
            0.0    0.053  0.059  0.056
            0.4    0.054  0.060  0.056
      0.3  −0.4    0.097  0.089  0.097
            0.0    0.053  0.059  0.055
            0.4    0.054  0.060  0.058
      0.6  −0.4    0.093  0.085  0.094
            0.0    0.052  0.059  0.055
            0.4    0.051  0.057  0.053
On the basis of this finite sample evidence, it appears that a size-corrected union of rejections approach provides an effective practical strategy for seasonal unit root testing when there is uncertainty about the initial conditions and, consequently, about whether it is best to employ QD or OLS detrending. This issue is particularly pertinent in the seasonal case considered here because, while QD detrending might constitute
Figure 5. Finite sample power: T = 152, c = −10, α_i ∼ N(0, σ_α²), i = 1, 2, 3, 4; panels (a) t_1^j, (b) t_2^j, (c) F_34^j, (d) F_234^j, (e) F_1234^j; j = OLS: - - -, j = GLS: – –, j = UR: ———.
Figure 6. Finite sample power: T = 300, c = −10, α_i ∼ N(0, σ_α²), i = 1, 2, 3, 4; panels (a) t_1^j, (b) t_2^j, (c) F_34^j, (d) F_234^j, (e) F_1234^j; j = OLS: - - -, j = GLS: – –, j = UR: ———.
the best approach at one frequency, it may also be totally unsuitable for a different frequency, depending on the values of the spectral initial conditions. Taking unions of rejections at each frequency essentially ensures that we employ the most appropriate method of detrending at any particular frequency.
6. CONCLUSIONS

In this paper, we have investigated the impact that the magnitude of the spectral initial condition has on the power of commonly used seasonal unit root tests. For a given frequency we have shown that when the relevant spectral initial condition of the process is not asymptotically negligible, QD detrended implementations of the HEGY-type seasonal unit root tests, as developed by Rodrigues and Taylor (2007), can have very low power against a given alternative, with power typically decreasing towards zero as the magnitude of the relevant spectral initial condition(s) increases. In contrast, we showed that the corresponding OLS detrended HEGY tests display increasing power, other things equal, as the magnitude of the spectral initial condition(s) increases. At the same time, the power of such tests can lie well below that of their QD detrended counterparts for small (or asymptotically negligible) values of the initial condition. The relevance of these results lies in the fact that the magnitude of the initial condition is unknown in practice, so that uncertainty surrounds the best choice of detrending method, which can, moreover, differ across frequencies. Given these considerations, we followed a strategy shown to work well in the non-seasonal case by Harvey et al. (2008) and proposed a union of rejections decision rule, whereby the relevant null hypothesis is rejected if either of the QD and OLS detrended variants rejects. Asymptotic and finite sample evidence suggested that, despite its simplicity, this procedure performs well in practice, simultaneously exploiting the superior power of the QD (OLS) detrended HEGY test for small (large) combinations of the initial conditions.
ACKNOWLEDGMENTS We would like to thank Pierre Perron and two anonymous referees for their helpful comments on earlier versions of this paper.
REFERENCES
Beaulieu, J. J. and J. A. Miron (1993). Seasonal unit roots in aggregate U.S. data. Journal of Econometrics 55, 305–28.
Burridge, P. and A. M. R. Taylor (2001). On the properties of regression-based tests for seasonal unit roots in the presence of higher-order serial correlation. Journal of Business and Economic Statistics 19, 374–79.
Chang, Y. and J. Y. Park (2002). On the asymptotics of ADF tests for unit roots. Econometric Reviews 21, 431–47.
Elliott, G. (1999). Efficient tests for a unit root when the initial observation is drawn from its unconditional distribution. International Economic Review 40, 767–83.
Elliott, G. and U. K. Müller (2006). Minimizing the impact of the initial condition on testing for unit roots. Journal of Econometrics 135, 285–310.
Elliott, G., T. J. Rothenberg and J. H. Stock (1996). Efficient tests for an autoregressive unit root. Econometrica 64, 813–36.
Fuller, W. A. (1996). Introduction to Statistical Time Series (2nd ed.). New York: John Wiley.
Ghysels, E., H. S. Lee and J. Noh (1994). Testing for unit roots in seasonal time series: some theoretical extensions and a Monte Carlo investigation. Journal of Econometrics 62, 415–42.
Harvey, D. I. and S. J. Leybourne (2005). On testing for unit roots and the initial observation. The Econometrics Journal 8, 97–111.
Harvey, D. I. and S. J. Leybourne (2006). Unit root test power and the initial condition. Journal of Time Series Analysis 27, 739–52.
Harvey, D. I., S. J. Leybourne and A. M. R. Taylor (2008). Unit root testing in practice: dealing with uncertainty over the trend and initial condition (with commentaries and rejoinder). Forthcoming in Econometric Theory.
Hylleberg, S., R. F. Engle, C. W. J. Granger and B. S. Yoo (1990). Seasonal integration and cointegration. Journal of Econometrics 44, 215–38.
Hylleberg, S., C. Jørgensen and N. K. Sørensen (1993). Seasonality in macroeconomic time series. Empirical Economics 18, 321–35.
Müller, U. K. and G. Elliott (2003). Tests for unit roots and the initial condition. Econometrica 71, 1269–86.
Ng, S. and P. Perron (1995). Unit root tests in ARMA models with data dependent methods for selection of the truncation lag. Journal of the American Statistical Association 90, 268–81.
Rodrigues, P. M. M. (2001). Near seasonal integration. Econometric Theory 17, 70–86.
Rodrigues, P. M. M. and A. M. R. Taylor (2004a). Alternative estimators and unit root tests for seasonal autoregressive processes. Journal of Econometrics 120, 35–73.
Rodrigues, P. M. M. and A. M. R. Taylor (2004b). Asymptotic distributions for regression-based seasonal unit root test statistics in a near-integrated model. Econometric Theory 20, 645–70.
Rodrigues, P. M. M. and A. M. R. Taylor (2007). Efficient tests of the seasonal unit root hypothesis. Journal of Econometrics 141, 548–73.
Said, S. E. and D. A. Dickey (1984). Testing for unit roots in autoregressive-moving average models of unknown order. Biometrika 71, 599–607.
Smith, R. J. and A. M. R. Taylor (1998). Additional critical values and asymptotic representations for seasonal unit root tests. Journal of Econometrics 85, 269–88.
Smith, R. J., A. M. R. Taylor and T. del Barrio Castro (2007). Regression-based seasonal unit root tests. Granger Centre Discussion Paper 07/05, School of Economics, University of Nottingham.
Tanaka, K. (1996). Time Series Analysis: Nonstationary and Noninvertible Distribution Theory. New York: John Wiley.
Taylor, A. M. R. (1998). Testing for unit roots in monthly time series. Journal of Time Series Analysis 19, 349–68.
Taylor, A. M. R. (2002). Regression-based unit root tests with recursive mean adjustment for seasonal and nonseasonal time series. Journal of Business and Economic Statistics 20, 269–81.
APPENDIX

Proof of Lemma 3.1. Consider first the $t_2^{QD}$ statistic. When the initial conditions are asymptotically negligible, it follows from Rodrigues and Taylor (2004a, 2004b, 2007) that under the stated conditions the limit distribution of the statistic can be written as
$$ t_2^{QD} \xrightarrow{d} c\left(\int_0^1 W_{2c}(r)^2\,dr\right)^{1/2} + \frac{\int_0^1 W_{2c}(r)\,dW_{20}(r)}{\left(\int_0^1 W_{2c}(r)^2\,dr\right)^{1/2}}, $$
where
$$ W_{2c}(r) := \left\{-W^*_{-3,c}(r) + W^*_{-2,c}(r) - W^*_{-1,c}(r) + W^*_{0,c}(r)\right\}/2 $$
is a standard Ornstein–Uhlenbeck [OU] process, formed from the independent standard OU processes $W^*_{s,c}(r)$, $s = -3,\dots,0$. Noting that $dW_{2c}(r) = cW_{2c}(r)\,dr + dW_{20}(r)$, we can equivalently write
$$ t_2^{QD} \xrightarrow{d} \frac{\int_0^1 W_{2c}(r)\,dW_{2c}(r)}{\left(\int_0^1 W_{2c}(r)^2\,dr\right)^{1/2}} = \frac{W_{2c}(1)^2 - W_{2c}(0)^2 - 1}{2\left(\int_0^1 W_{2c}(r)^2\,dr\right)^{1/2}} \qquad (A.1) $$
using the Itô integral. When the initial conditions are as defined in Assumption 2.1, the analysis of Müller and Elliott (2003) implies that we need to replace $W^*_{s,c}(r)$ with $K^*_{s,c}(r) := \alpha_{s+4}(e^{rc}-1)(-2c)^{-1/2} + W^*_{s,c}(r)$ for $s = -3,\dots,0$. Consequently, $W_{2c}(r)$ in (A.1) is replaced with
$$ K_{2c}(r) = \left\{-K^*_{-3,c}(r) + K^*_{-2,c}(r) - K^*_{-1,c}(r) + K^*_{0,c}(r)\right\}/2 = \{(-\alpha_1 + \alpha_2 - \alpha_3 + \alpha_4)/2\}(e^{rc}-1)(-2c)^{-1/2} + W_{2c}(r) = \bar{\alpha}_2(e^{rc}-1)(-2c)^{-1/2} + W_{2c}(r), $$
which completes the proof of the stated result for $t_2^{QD}$ in Lemma 3.1.
The result for the $t_2^{OLS}$ statistic follows in exactly the same way as for $t_2^{QD}$, replacing $W^*_{s,c}(r)$ with $W^{*\mu}_{s,c}(r) := W^*_{s,c}(r) - \int_0^1 W^*_{s,c}(t)\,dt$, $s = -3,\dots,0$, and, hence, $K^*_{s,c}(r)$ with $K^{*\mu}_{s,c}(r) := K^*_{s,c}(r) - \int_0^1 K^*_{s,c}(t)\,dt$, $s = -3,\dots,0$. The limit of $t_2^{OLS}$ then has the same form as that for $t_2^{QD}$ but with $K_{2c}(r)$ now replaced by $K^{\mu}_{2c}(r) := K_{2c}(r) - \int_0^1 K_{2c}(s)\,ds$.
Next consider the $t_1^{OLS}$ statistic. When the initial conditions are asymptotically negligible, the limit distribution can be written as
$$ t_1^{OLS} \xrightarrow{d} \frac{W^{\tau}_{1c}(1)^2 - W^{\tau}_{1c}(0)^2 - 1}{2\left(\int_0^1 W^{\tau}_{1c}(r)^2\,dr\right)^{1/2}}, \qquad (A.2) $$
where
$$ W^{\tau}_{1c}(r) := \left\{W^{*\tau}_{-3,c}(r) + W^{*\tau}_{-2,c}(r) + W^{*\tau}_{-1,c}(r) + W^{*\tau}_{0,c}(r)\right\}/2 $$
is a demeaned and detrended standard OU process, formed from the independent demeaned and detrended standard OU processes
$$ W^{*\tau}_{s,c}(r) := W^*_{s,c}(r) - \int_0^1 W^*_{s,c}(t)\,dt - 12\left(r - \tfrac{1}{2}\right)\int_0^1 \left(t - \tfrac{1}{2}\right)W^*_{s,c}(t)\,dt, \quad s = -3,\dots,0. $$
When the initial conditions are governed by Assumption 2.1, as before we replace $W^*_{s,c}(r)$ with $K^*_{s,c}(r) := \alpha_{s+4}(e^{rc}-1)(-2c)^{-1/2} + W^*_{s,c}(r)$ for $s = -3,\dots,0$; thus $W^{\tau}_{1c}(r)$ in (A.2) is replaced with
$$ K^{OLS}_{1c}(r) = \frac{1}{2}\sum_{s=-3}^{0}\left\{K^*_{s,c}(r) - \int_0^1 K^*_{s,c}(t)\,dt - 12\left(r - \tfrac{1}{2}\right)\int_0^1 \left(t - \tfrac{1}{2}\right)K^*_{s,c}(t)\,dt\right\} $$
$$ = \bar{\alpha}_1\left\{(e^{rc}-1)(-2c)^{-1/2} - \int_0^1 (e^{sc}-1)(-2c)^{-1/2}\,ds - 12\left(r - \tfrac{1}{2}\right)\int_0^1 \left(s - \tfrac{1}{2}\right)(e^{sc}-1)(-2c)^{-1/2}\,ds\right\} + W^{\tau}_{1c}(r) $$
$$ = K^{\mu}_{1c}(r) - 12\left(r - \tfrac{1}{2}\right)\int_0^1 \left(s - \tfrac{1}{2}\right)K_{1c}(s)\,ds, $$
which completes the proof of the result for $t_1^{OLS}$ in Lemma 3.1.
The result for the $t_1^{QD}$ statistic follows in exactly the same way as for $t_1^{OLS}$, replacing $W^{*\tau}_{s,c}(r)$ with $W^{*\tau,\bar{c}_1}_{s,c}(r) := W^*_{s,c}(r) - \bar{c}^*_1\, r\, W^*_{s,c}(1) - 3(1-\bar{c}^*_1)\, r \int_0^1 t\, W^*_{s,c}(t)\,dt$, $s = -3,\dots,0$, where $\bar{c}^*_1$ is as defined in Lemma 3.1. The limit of $t_1^{QD}$ then has the same form as that for $t_1^{OLS}$ but with $K^{OLS}_{1c}(r)$ now replaced by $K^{QD}_{1c}(r) := K_{1c}(r) - \bar{c}^*_1\, r\, K_{1c}(1) - 3(1-\bar{c}^*_1)\, r \int_0^1 s\, K_{1c}(s)\,ds$.
Consider next the result for the $F^{QD}_{34}$ statistic. Drawing again on results established in Rodrigues and Taylor (2004a, 2004b, 2007), we can write the limit distribution in the asymptotically negligible initial conditions case as
$$ F^{QD}_{34} \xrightarrow{d} \frac{1}{2}\left\{(A^*)^2 + (B^*)^2\right\}, $$
where
$$ A^* := c\left(\int_0^1 W_{3c}(r)^2\,dr + \int_0^1 W_{4c}(r)^2\,dr\right)^{1/2} + \frac{\int_0^1 W_{3c}(r)\,dW_{30}(r) + \int_0^1 W_{4c}(r)\,dW_{40}(r)}{\left(\int_0^1 W_{3c}(r)^2\,dr + \int_0^1 W_{4c}(r)^2\,dr\right)^{1/2}} $$
and
$$ B^* := \frac{\int_0^1 W_{3c}(r)\,dW_{40}(r) - \int_0^1 W_{4c}(r)\,dW_{30}(r)}{\left(\int_0^1 W_{3c}(r)^2\,dr + \int_0^1 W_{4c}(r)^2\,dr\right)^{1/2}}, $$
with
$$ W_{3c}(r) := \left\{-W^*_{-2,c}(r) + W^*_{0,c}(r)\right\}/\sqrt{2} \qquad \text{and} \qquad W_{4c}(r) := \left\{-W^*_{-3,c}(r) + W^*_{-1,c}(r)\right\}/\sqrt{2} $$
constituting a pair of mutually independent standard OU processes defined via the independent standard OU processes $W^*_{s,c}(r)$, $s = -3,\dots,0$. Now since $dW_{ic}(r) = cW_{ic}(r)\,dr + dW_{i0}(r)$, $i = 3, 4$, we can alternatively write
$$ A^* = \frac{\int_0^1 W_{3c}(r)\,dW_{3c}(r) + \int_0^1 W_{4c}(r)\,dW_{4c}(r)}{\left(\int_0^1 W_{3c}(r)^2\,dr + \int_0^1 W_{4c}(r)^2\,dr\right)^{1/2}}, \qquad B^* = \frac{\int_0^1 W_{3c}(r)\,dW_{4c}(r) - \int_0^1 W_{4c}(r)\,dW_{3c}(r)}{\left(\int_0^1 W_{3c}(r)^2\,dr + \int_0^1 W_{4c}(r)^2\,dr\right)^{1/2}}. $$
Introducing initial conditions of the form given in Assumption 2.1, we again need to replace $W^*_{s,c}(r)$ with $K^*_{s,c}(r)$, $s = -3,\dots,0$. The limit processes $W_{3c}(r)$ and $W_{4c}(r)$ are then correspondingly replaced with
$$ K_{3c}(r) = \left\{-K^*_{-2,c}(r) + K^*_{0,c}(r)\right\}/\sqrt{2} = \{(-\alpha_2 + \alpha_4)/\sqrt{2}\}(e^{rc}-1)(-2c)^{-1/2} + W_{3c}(r) = \bar{\alpha}_3(e^{rc}-1)(-2c)^{-1/2} + W_{3c}(r) $$
and
$$ K_{4c}(r) = \left\{-K^*_{-3,c}(r) + K^*_{-1,c}(r)\right\}/\sqrt{2} = \{(-\alpha_1 + \alpha_3)/\sqrt{2}\}(e^{rc}-1)(-2c)^{-1/2} + W_{4c}(r) = \bar{\alpha}_4(e^{rc}-1)(-2c)^{-1/2} + W_{4c}(r), $$
respectively. Consequently,
$$ F^{QD}_{34} \xrightarrow{d} \frac{1}{2}\left\{(A^{QD})^2 + (B^{QD})^2\right\}, $$
where
$$ A^{QD} = \frac{\int_0^1 K_{3c}(r)\,dK_{3c}(r) + \int_0^1 K_{4c}(r)\,dK_{4c}(r)}{\left(\int_0^1 K_{3c}(r)^2\,dr + \int_0^1 K_{4c}(r)^2\,dr\right)^{1/2}}, \qquad B^{QD} = \frac{\int_0^1 K_{3c}(r)\,dK_{4c}(r) - \int_0^1 K_{4c}(r)\,dK_{3c}(r)}{\left(\int_0^1 K_{3c}(r)^2\,dr + \int_0^1 K_{4c}(r)^2\,dr\right)^{1/2}}. $$
Now, it is straightforward to show that $dK_{ic}(r) = cK_{ic}(r)\,dr + \bar{\alpha}_i c(-2c)^{-1/2}\,dr + dW_{i0}(r)$, $i = 3, 4$, and so we obtain on substitution that
$$ A^{QD} = c\left(\int_0^1 K_{3c}(r)^2\,dr + \int_0^1 K_{4c}(r)^2\,dr\right)^{1/2} + \frac{\bar{\alpha}_3 c(-2c)^{-1/2}\int_0^1 K_{3c}(r)\,dr + \bar{\alpha}_4 c(-2c)^{-1/2}\int_0^1 K_{4c}(r)\,dr}{\left(\int_0^1 K_{3c}(r)^2\,dr + \int_0^1 K_{4c}(r)^2\,dr\right)^{1/2}} + \frac{\int_0^1 K_{3c}(r)\,dW_{30}(r) + \int_0^1 K_{4c}(r)\,dW_{40}(r)}{\left(\int_0^1 K_{3c}(r)^2\,dr + \int_0^1 K_{4c}(r)^2\,dr\right)^{1/2}} $$
and
$$ B^{QD} = \frac{\bar{\alpha}_4 c(-2c)^{-1/2}\int_0^1 K_{3c}(r)\,dr - \bar{\alpha}_3 c(-2c)^{-1/2}\int_0^1 K_{4c}(r)\,dr}{\left(\int_0^1 K_{3c}(r)^2\,dr + \int_0^1 K_{4c}(r)^2\,dr\right)^{1/2}} + \frac{\int_0^1 K_{3c}(r)\,dW_{40}(r) - \int_0^1 K_{4c}(r)\,dW_{30}(r)}{\left(\int_0^1 K_{3c}(r)^2\,dr + \int_0^1 K_{4c}(r)^2\,dr\right)^{1/2}}, $$
which completes the result.
The result for $F^{OLS}_{34}$ again follows in exactly the same way as for $F^{QD}_{34}$, replacing $W^*_{s,c}(r)$ with $W^{*\mu}_{s,c}(r)$, $s = -3,\dots,0$, and $K^*_{s,c}(r)$ with $K^{*\mu}_{s,c}(r)$, $s = -3,\dots,0$. The limit of $F^{OLS}_{34}$ is then obtained as that for $F^{QD}_{34}$ above, but with $K_{ic}(r)$ replaced by $K^{\mu}_{ic}(r) := K_{ic}(r) - \int_0^1 K_{ic}(s)\,ds$, $i = 3, 4$. Note that $\int_0^1 K^{\mu}_{ic}(r)\,dr = 0$, $i = 3, 4$, so that the expression simplifies to the form given in Lemma 3.1.
The stated representations for the $F^{OLS}_{234}$, $F^{QD}_{234}$, $F^{OLS}_{1234}$ and $F^{QD}_{1234}$ statistics then follow immediately from the representations given above, noting the asymptotic orthogonality of the HEGY regressors $\hat{x}_{1,4t+s-1}$, $\hat{x}_{2,4t+s-1}$, $\hat{x}_{3,4t+s-1}$ and $\hat{x}_{4,4t+s-1}$ from (2.8) under $H_c$ of (2.6) for both OLS and QD detrended data; see Rodrigues and Taylor (2004b, 2007).
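The limiting representations in Lemma 3.1 can also be simulated directly. The following short sketch (an illustration under stated assumptions, not the authors' code) approximates the limit of the Nyquist-frequency statistic by an Euler discretisation of the OU functional, with the spectral initial condition entering through ᾱ₂; the grid size, number of replications and the critical value in the usage comment are arbitrary illustrative choices.

```python
# Illustrative simulation of the limit of t_2 when W_2c is replaced by
# K_2c(r) = alpha_bar_2 (e^{rc} - 1)(-2c)^{-1/2} + W_2c(r), for c < 0.
import numpy as np

def simulate_t2_limit(c, alpha_bar_2, n_grid=1000, n_rep=5000, seed=0):
    assert c < 0.0, "local-to-unity parameter assumed negative here"
    rng = np.random.default_rng(seed)
    dr = 1.0 / n_grid
    dW = rng.standard_normal((n_rep, n_grid)) * np.sqrt(dr)
    W = np.zeros((n_rep, n_grid + 1))
    for j in range(n_grid):                 # Euler scheme for dW_2c = c W_2c dr + dW_20
        W[:, j + 1] = W[:, j] + c * W[:, j] * dr + dW[:, j]
    r = np.linspace(0.0, 1.0, n_grid + 1)
    K = alpha_bar_2 * (np.exp(r * c) - 1.0) / np.sqrt(-2.0 * c) + W
    integral = 0.5 * dr * ((K[:, :-1] ** 2 + K[:, 1:] ** 2).sum(axis=1))   # trapezoid rule
    return (K[:, -1] ** 2 - K[:, 0] ** 2 - 1.0) / (2.0 * np.sqrt(integral))

# Usage sketch: rejection frequencies at an illustrative critical value show the
# QD-type limit losing power as |alpha_bar_2| grows, in line with Lemma 3.1.
# print(np.mean(simulate_t2_limit(-10.0, 0.0) < -1.94),
#       np.mean(simulate_t2_limit(-10.0, 4.0) < -1.94))
```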
Econometrics Journal (2008), volume 11, pp. 443–477. doi: 10.1111/j.1368-423X.2008.00247.x
Bootstrap inference in a linear equation estimated by instrumental variables
RUSSELL DAVIDSON† AND JAMES G. MACKINNON‡
† Department of Economics, McGill University, Montreal, QC, H3A 2T7, Canada. E-mail: [email protected]
‡ Department of Economics, Queen's University, Kingston, ON, K7L 3N6, Canada. E-mail: [email protected]
First version received: March 2007; final version accepted: March 2008
Summary We study several tests for the coefficient of the single right-hand-side endogenous variable in a linear equation estimated by instrumental variables. We show that writing all the test statistics—Student’s t, Anderson–Rubin, the LM statistic of Kleibergen and Moreira (K), and likelihood ratio (LR)—as functions of six random quantities leads to a number of interesting results about the properties of the tests under weak-instrument asymptotics. We then propose several new procedures for bootstrapping the three non-exact test statistics and also a new conditional bootstrap version of the LR test. These use more efficient estimates of the parameters of the reduced-form equation than existing procedures. When the best of these new procedures is used, both the K and conditional bootstrap LR tests have excellent performance under the null. However, power considerations suggest that the latter is probably the method of choice. Keywords: Bootstrap test, weak instruments, Anderson–Rubin test, conditional LR test, Wald test, K test.
1. INTRODUCTION This paper is concerned with tests for the value of the coefficient of the single right-hand side endogenous variable in a linear structural equation estimated by instrumental variables (IV). We consider the Wald (or t) test, the LM test that was independently proposed by Kleibergen (2002) and Moreira (2001), which we refer to as the K test, and the likelihood ratio (LR) test, as well as its conditional variant due to Moreira (2003), which we refer to as CLR. Both asymptotic and bootstrap versions of these tests are studied, and their relationships to the Anderson–Rubin (AR) test of Anderson and Rubin (1949) are explored. The analysis allows for instruments that may be either strong or weak. The paper’s main practical contribution is to propose some new bootstrap procedures for models estimated by IV and to elucidate the widely differing properties of procedures that use different test statistics and different bootstrap data-generating processes (DGPs). Our main concern is with finite-sample behaviour, and so our results are supported by extensive simulations. Asymptotic considerations are also explored for the insights they may provide. The C The Author(s). Journal compilation C Royal Economic Society 2008. Published by Blackwell Publishing Ltd, 9600 Garsington Road,
weak-instrument asymptotic paradigm of Staiger and Stock (1997), in particular, is very useful for explaining the properties of asymptotic tests, although less so for bootstrap tests. The new procedures we propose use more efficient estimates of the parameters of the reduced-form equation than existing procedures, and what seems to be the best procedure also employs a form of bias correction. Using this procedure instead of more conventional ones greatly improves the performance under the null of all the tests. The improvement is generally greatest for the Wald test and least for the K test, because the latter already works very well in most cases. Using the new procedures also severely reduces the apparent, but actually spurious, power of the Wald test when the instruments are weak, making its power properties much more like those of the other tests. The theoretical contributions of the paper are based on a well-known result that allows all the test statistics of interest to be written as functions of six random quantities. This makes it easy to understand and analyse the asymptotic properties of the tests under both weak and strong instruments. In particular, we are able to distinguish bootstrap tests that are asymptotically valid with weak instruments from those that are not. The distinctions show up very clearly in the simulations. With the assumption of normally distributed disturbances, it is straightforward to determine the joint distribution of the six random quantities under the null hypothesis. This joint distribution can be simulated very inexpensively. It is therefore attractive to employ parametric bootstrap DGPs that generate bootstrap test statistics directly, as functions of model parameters and realizations of the six quantities, without actually generating bootstrap samples. The assumption of normal disturbances may often be felt to be too strong. For each of the parametric bootstrap DGPs we consider, there exists a corresponding semi-parametric bootstrap DGP based on resampling residuals. In practice, it is probably preferable to use these resampling methods, although they are, of course, a good deal more computationally intensive with large samples. Our experiments suggest that results from the parametric bootstrap under normality provide a very good guide to the performance of the corresponding semi-parametric bootstrap. The practical conclusions we come to are consistent with those of Andrews et al. (2006). Two tests seem to be particularly reliable under the null when the instruments are weak. One is the K test when it is bootstrapped using one of our new methods. The other is a conditional bootstrap version of the CLR test that uses one of our new bootstrap methods. Power considerations suggest that the latter test is probably the best procedure overall. In the next section, we discuss the four test statistics and show that they are all functions of six random quantities. Then, in Section 3, we show how all the statistics can be simulated very efficiently under the assumption of normally distributed disturbances. In Section 4, we consider the asymptotic properties of the statistics under both strong and weak instruments. In Section 5, we discuss some new and old ways of bootstrapping the statistics and show how, in some cases, the properties of bootstrap tests differ greatly from those of asymptotic tests. Finally, in Section 6, we present extensive simulation evidence on the performance of asymptotic and bootstrap tests based on all of the test statistics.
2. THE FOUR TEST STATISTICS

The model treated in this paper consists of just two equations,
$$ y_1 = \beta y_2 + Z\gamma + u_1, \qquad (2.1) $$
$$ y_2 = W\pi + u_2. \qquad (2.2) $$
Here y 1 and y 2 are n-vectors of observations on endogenous variables, Z is an n × k matrix of observations on exogenous variables, and W is an n × l matrix of instruments such that S(Z) ⊂ S(W ), where the notation S( A) means the linear span of the columns of the matrix A. The disturbances are assumed to be serially uncorrelated and, for many of the analytical results, normally distributed. We assume that l > k, so that the model is either exactly identified or, more commonly, overidentified. The parameters of this model are the scalar β, the k-vector γ , the l-vector π , and the 2 × 2 contemporaneous covariance matrix of the disturbances u 1 and u 2 : 2 σ1 ρσ1 σ2 ≡ . (2.3) ρσ1 σ2 σ22 Equation (2.1) is the structural equation we are interested in, and equation (2.2) is a reducedform equation for the second endogenous variable y 2 . We wish to test the hypothesis that β = 0. There is no loss of generality in considering only this null hypothesis, since we could test the hypothesis that β = β 0 for any non-zero β 0 by replacing the left-hand side of (2.1) by y1 − β 0 y2. Since we are not directly interested in the parameters contained in the l-vector π , we may without loss of generality suppose that W = [Z W1 ], with Z W1 = O. Notice that W1 can easily be constructed by projecting the columns of W that do not belong to S(Z) off Z. We consider four test statistics: an asymptotic t statistic on which we may base a Wald test, the AR statistic, the K statistic, and a likelihood ratio (LR) statistic. The 2SLS (or IV) estimate βˆ from (2.1), with instruments the columns of W, satisfies the estimating equation ˆ y 2 P1 ( y1 − β y2 ) = 0,
(2.4)
where P1 ≡ PW1 is the matrix that projects orthogonally on to S(W1 ). This follows because Z W1 = O, but equation (2.4) would hold even without this assumption if we defined P1 as PW − P Z , where the matrices PW and P Z project orthogonally on to S(W ) and S(Z), respectively. It is not hard to see that the asymptotic t statistic for a test of the hypothesis that β = 0 is t=
$$ t = \frac{n^{1/2}\, y_2' P_1 y_1}{\left\{ y_2' P_1 y_2 \left( y_1 - y_2\, \dfrac{y_2' P_1 y_1}{y_2' P_1 y_2} \right)' M_Z \left( y_1 - y_2\, \dfrac{y_2' P_1 y_1}{y_2' P_1 y_2} \right) \right\}^{1/2}}, \qquad (2.5) $$
where $M_Z \equiv I - P_Z$. It can be seen that the right-hand side of (2.5) is homogeneous of degree zero with respect to $y_1$ and also with respect to $y_2$. Consequently, the distribution of the statistic is invariant to the scales of each of the endogenous variables. In addition, the expression is unchanged if $y_1$ and $y_2$ are replaced by the projections $M_Z y_1$ and $M_Z y_2$, since $P_1 M_Z = M_Z P_1 = P_1$, given the orthogonality of $W_1$ and $Z$. It follows that, if $M_W \equiv I - P_W$, the statistic (2.5) depends on the data only through the six quantities
$$ y_1' P_1 y_1, \quad y_1' P_1 y_2, \quad y_2' P_1 y_2, \quad y_1' M_W y_1, \quad y_1' M_W y_2, \quad \text{and} \quad y_2' M_W y_2; \qquad (2.6) $$
notice that y i MZ y j = yi (MW + P1 ) y j , for i, j = 1, 2. It has been known for some time—see Mariano and Sawa (1972)—that the 2SLS and LIML estimators of β depend only on these six quantities. We can think of them as sufficient statistics C The Author(s). Journal compilation C Royal Economic Society 2008.
for all the model parameters. They can easily be calculated by means of four OLS regressions on just two sets of regressors. By regressing y i on Z and W for i = 1, 2, we obtain four sets of residuals. Using the fact that P1 y i = (MZ − MW ) y i , all six quantities can be obtained as sums of squared residuals, differences of sums of squared residuals, inner products of residual vectors, or inner products of differences of residual vectors. Another way to test a hypothesis about β is to use the famous test statistic of Anderson and Rubin (1949). The AR statistic for the hypothesis that β = β 0 can be written as AR(β0 ) =
$$ AR(\beta_0) = \frac{n-l}{l-k}\; \frac{(y_1 - \beta_0 y_2)' P_1 (y_1 - \beta_0 y_2)}{(y_1 - \beta_0 y_2)' M_W (y_1 - \beta_0 y_2)}. \qquad (2.7) $$
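The computational recipe described above (four OLS regressions and the identity $P_1 y_i = (M_Z - M_W) y_i$) is easy to code. The following sketch (ours, not the authors' code; function and variable names are assumptions) returns the six quantities (2.6) and evaluates the AR statistic (2.7) at a given $\beta_0$.

```python
# Sketch: the six quantities in (2.6) from residuals of y_i on Z and on W,
# and the AR statistic (2.7).  Inputs are numpy arrays; names are ours.
import numpy as np

def ols_residuals(y, X):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

def six_quantities(y1, y2, Z, W):
    """Return y1'P1y1, y1'P1y2, y2'P1y2, y1'MWy1, y1'MWy2, y2'MWy2,
    using P1 y = (M_Z - M_W) y, i.e. differences of the two residual vectors."""
    mz1, mw1 = ols_residuals(y1, Z), ols_residuals(y1, W)
    mz2, mw2 = ols_residuals(y2, Z), ols_residuals(y2, W)
    p1_1, p1_2 = mz1 - mw1, mz2 - mw2
    return (p1_1 @ p1_1, p1_1 @ p1_2, p1_2 @ p1_2,
            mw1 @ mw1, mw1 @ mw2, mw2 @ mw2)

def ar_statistic(y1, y2, Z, W, beta0=0.0):
    n, l = W.shape
    k = Z.shape[1]
    u = y1 - beta0 * y2
    p1u = ols_residuals(u, Z) - ols_residuals(u, W)   # P1 u
    mwu = ols_residuals(u, W)                         # M_W u
    return (n - l) / (l - k) * (p1u @ p1u) / (mwu @ mwu)
```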
Note that, when β 0 = 0, the AR statistic depends on the data only through the first and fourth of the six quantities (2.6). Under the normality assumption, this statistic is exactly distributed as F (l − k, n − l) under the null hypothesis. However, because it has l − k degrees of freedom, it has lower power than statistics with only one degree of freedom when l − k > 1. Kleibergen (2002) and Moreira (2001) therefore proposed a modification of the AR statistic which has only one degree of freedom. Their statistic for testing β = 0, which can also be interpreted as an LM statistic, is K = (n − l)
$$ \frac{y_1'\, P_{M_Z W \tilde{\pi}}\, y_1}{y_1' M_W y_1}, \qquad (2.8) $$
which is asymptotically distributed as χ 2 (1) under the null hypothesis that β = 0. The matrix PMZ W π˜ projects orthogonally on to the one-dimensional subspace generated by the vector ˜ where π˜ is a vector of efficient estimates of the reduced-form parameters. Under our MZ W π, assumptions, the vector MZ W π˜ is equal to the vector W1 π˜ 1 , where π˜ 1 is the vector of OLS estimates from the artificial regression MZ y2 = W1 π 1 + δ MZ y1 + residuals.
(2.9)
The estimator $\tilde{\pi}_1$ will be discussed in Section 5 in the context of bootstrapping. A somewhat lengthy calculation shows that the K statistic (2.8) is given explicitly by
$$ K = \frac{(n-l)\left( y_1' P_1 y_2\; y_1' M_W y_1 - y_1' P_1 y_1\; y_1' M_W y_2 \right)^2}{y_2' P_1 y_2\,( y_1' M_W y_1)^3 + y_1' P_1 y_1\; y_1' M_W y_1\,( y_1' M_W y_2)^2 - 2\, y_1' P_1 y_2\; y_1' M_W y_2\,( y_1' M_W y_1)^2}. \qquad (2.10) $$
From this, it can be seen that the K statistic, like the t statistic and the AR statistic, depends on the data only through the six quantities (2.6) and is invariant to the scales of $y_1$ and $y_2$. It is well known that, except for an additive constant, the concentrated loglikelihood function for the model specified by (2.1), (2.2) and (2.3) can be written as
$$ -\frac{n}{2}\log\left(1 + \frac{l-k}{n-l}\,AR(\beta)\right), \qquad (2.11) $$
where $AR(\beta)$ is the AR statistic (2.7) evaluated at $\beta$. It can then be shown, using results in Anderson and Rubin (1949), that the likelihood ratio statistic for testing the hypothesis that $\beta = 0$ can be written as
$$ LR = n\log(1 + SS/n) - n\log\left(1 + \frac{1}{2n}\left\{SS + TT - \sqrt{(SS - TT)^2 + 4ST^2}\right\}\right), \qquad (2.12) $$
where
$$ SS \equiv n\,\frac{y_1' P_1 y_1}{y_1' M_W y_1}, \qquad ST \equiv \frac{n}{\Delta^{1/2}}\left( y_1' P_1 y_2 - \frac{y_1' P_1 y_1\; y_1' M_W y_2}{y_1' M_W y_1} \right), $$
$$ TT \equiv \frac{n}{\Delta}\left( y_2' P_1 y_2\; y_1' M_W y_1 - 2\, y_1' P_1 y_2\; y_1' M_W y_2 + \frac{y_1' P_1 y_1\,( y_1' M_W y_2)^2}{y_1' M_W y_1} \right), \qquad (2.13) $$
and
$$ \Delta \equiv y_1' M_W y_1\; y_2' M_W y_2 - ( y_1' M_W y_2)^2. \qquad (2.14) $$
The notation is chosen so as to be reminiscent of that used by Moreira (2003) in his discussion of a conditional LR test. Moreira's development is different from ours in that he assumes for most of his analysis that the contemporaneous disturbance correlation matrix is known. Moreira also introduces a simplified statistic, $LR_0$, which is obtained by Taylor expanding the logarithms in (2.12) and discarding terms of order smaller than unity as $n \to \infty$. This procedure yields
$$ LR_0 = \frac{1}{2}\left( SS - TT + \sqrt{(SS - TT)^2 + 4ST^2} \right). \qquad (2.15) $$
We see that both LR and $LR_0$ are invariant to the scales of $y_1$ and $y_2$ and depend only on the six quantities (2.6). Some tedious algebra shows that the K statistic (2.10) can also be expressed in terms of the quantities ST and TT, as follows:
$$ K = \frac{n-l}{n}\,\frac{ST^2}{TT}. \qquad (2.16) $$
Moreira (2003) demonstrates an asymptotic version of this relation without the degrees-of-freedom adjustment. Finally, it is worth noting that, except for the initial deterministic factors, SS is equal to the AR statistic AR(0).
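Given the six quantities, SS, ST, TT and the statistics K, LR and LR_0 follow mechanically from (2.12)–(2.16). A minimal sketch (ours, not the authors' code), assuming the six quantities have been computed as in the earlier sketch and writing Δ for the determinant in (2.14):

```python
# Sketch: SS, ST, TT of (2.13)-(2.14) and the statistics K, LR, LR0 of
# (2.16), (2.12) and (2.15), as functions of the six quantities (2.6).
import numpy as np

def ss_st_tt(q, n):
    """q = (y1'P1y1, y1'P1y2, y2'P1y2, y1'MWy1, y1'MWy2, y2'MWy2)."""
    p11, p12, p22, m11, m12, m22 = q
    delta = m11 * m22 - m12 ** 2                          # (2.14)
    SS = n * p11 / m11
    ST = n / np.sqrt(delta) * (p12 - p11 * m12 / m11)
    TT = n / delta * (p22 * m11 - 2.0 * p12 * m12 + p11 * m12 ** 2 / m11)
    return SS, ST, TT

def k_lr_lr0(q, n, l):
    SS, ST, TT = ss_st_tt(q, n)
    K = (n - l) / n * ST ** 2 / TT                        # (2.16)
    root = np.sqrt((SS - TT) ** 2 + 4.0 * ST ** 2)
    LR = n * np.log(1.0 + SS / n) - n * np.log(1.0 + (SS + TT - root) / (2.0 * n))   # (2.12)
    LR0 = 0.5 * (SS - TT + root)                          # (2.15)
    return K, LR, LR0
```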
3. SIMULATING THE TEST STATISTICS Now that we have expressions for the test statistics of interest in terms of the six quantities (2.6), we can explore the properties of these statistics and how to simulate them efficiently. Our results will also be used in the next two sections when we discuss asymptotic and bootstrap tests. In view of the scale invariance that we have established for all the statistics, the contemporaneous covariance matrix of the disturbances u 1 and u 2 can without loss of generality be set equal to 1 ρ = , (3.1) ρ 1 with both variances equal to unity. Thus we can represent the disturbances in terms of two independent n-vectors, say v 1 and v 2 , of independent standard normal elements, as follows: u1 = v 1 ,
$$ u_2 = \rho v_1 + r v_2, \qquad (3.2) $$
where r ≡ (1 − ρ 2 )1/2 . We now show that we can write all the test statistics as functions of v 1 , v 2 , the exogenous variables, and just three parameters. With the specification (3.2), we see from (2.2) that y 2 MW y2 = (ρv 1 + rv 2 ) MW (ρv 1 + rv 2 ) 2 = ρ 2v 1 MW v 1 + r v 2 MW v 2 + 2ρrv 1 MW v 2 ,
(3.3)
and y 2 P1 y2 = π 1 W1 W1 π 1 + 2π 1 W1 (ρv 1 + rv 2 ) 2 + ρ 2v 1 P1 v 1 + r v 2 P1 v 2 + 2ρrv 1 P1 v 2 .
(3.4)
Now let W1 π 1 = aw 1 , with w 1 = 1. The square of the parameter a is the so-called scalar concentration parameter; see Phillips (1983) and Stock et al. (2002). Further, let w 1 v i = x i , for i = 1, 2. Clearly, x 1 and x 2 are independent standard normal variables. Then we see that 2 π 1 W1 W1 π 1 = a and π 1 W1 v i = axi ,
i = 1, 2.
(3.5)
Thus (3.4) becomes 2 2 2 y 2 P1 y2 = a + 2a(ρx1 + rx2 ) + ρ v 1 P1 v 1 + r v 2 P1 v 2 + 2ρrv 1 P1 v 2 .
(3.6)
From (2.1), we find that 2 y 1 MW y1 = v 1 MW v 1 + 2β ρv 1 MW v 1 + rv 1 MW v 2 + β y2 MW y2 .
(3.7)
Similarly, 2 y 1 P1 y1 = v 1 P1 v 1 + 2β y2 P1 v 1 + β y2 P1 y2 2 = v 1 P1 v 1 + 2β(ax1 + ρv 1 P1 v 1 + rv 1 P1 v 2 ) + β y2 P1 y2 .
(3.8)
Further, from both (2.1) and (2.2) y 1 MW y2 = ρv 1 MW v 1 + rv 1 MW v 2 + β y2 MW y2 , and y 1 P1 y2 = ax1 + ρv 1 P1 v 1 + rv 1 P1 v 2 + β y2 P1 y2 .
(3.9) (3.10)
The relations (3.3), (3.6), (3.7), (3.8), (3.9) and (3.10) show that the six quantities given in (2.6) can be generated in terms of eight random variables and three parameters. The eight random variables are x 1 and x 2 , along with six quadratic forms of the same sort as those in (2.6), v 1 P1 v 1 ,
v 1 P1 v 2 ,
v 2 P1 v 2 ,
v 1 MW v 1 ,
v 1 MW v 2 , and v 2 MW v 2 ,
(3.11)
and the three parameters are a, ρ, and β. Under the null hypothesis, of course, β = 0. Since P1 MW = O, the first three variables of (3.11) are independent of the last three. If we knew the distributions of the eight random variables on which all the statistics depend, we could simulate them directly. We now characterize these distributions. The symmetric matrix v v 1 P1 v 1 1 P1 v 2 (3.12) v v 2 P1 v 1 2 P1 v 2 C The Author(s). Journal compilation C Royal Economic Society 2008.
follows the Wishart distribution $W(I_2, l - k)$, and the matrix
$$ \begin{bmatrix} v_1' M_W v_1 & v_1' M_W v_2 \\ v_2' M_W v_1 & v_2' M_W v_2 \end{bmatrix} $$
follows the distribution W(I 2 , n − l). It follows from the analysis of the Wishart distribution M in Anderson (1984) (see Section 7.2) that v 1 MW v 1 is equal to a random variable t 11 which M v is the square root follows the chi-squared distribution with n − l degrees of freedom, v W 2 1 2 multiplied by a standard normal variable z independent of it, and v M v is z plus a of t M M W 2 M 11 2 M with n − l − 1 degrees of freedom, independent of z and t . chi-squared variable t M M 22 11 The elements of the matrix (3.12) can, of course, be characterized in the same way. However, since the elements of the matrix are not independent of x 1 and x 2 , it is preferable to define 2 P P 2 v 2 P1 v 2 as x 2 + t 22 , v 1 P1 v 2 as x 1 x 2 plus the square root of t 22 times z P , and v 1 P1 v 1 as x 1 + 2 P P P zP + t 11 . Here t 11 and t 22 are both chi-squared, with l − k − 2 and l − k − 1 degrees of freedom, respectively, and z P is standard normal. All these variables are mutually independent, and they are also independent of x 1 and x 2 . Of course, if l − k ≤ 2, chi-squared variables with zero or negative degrees of freedom are to be set to zero, and z P = 0 if l − k = 0. An alternative way to simulate the test statistics, which does not require the normality assumption, is to make use of a much simplified model. This model may help to provide an intuitive understanding of the results in the next two sections. The simplified model is y1 = β y2 + u1 ,
(3.13)
y2 = aw 1 + u2 ,
(3.14)
where the disturbances are generated according to (3.2). Here the n-vector w 1 ∈ S(W ) with w 1 = 1, where W, as before, is an n × l matrix of instruments. By normalizing w 1 in this way, we are implicitly using weak-instrument asymptotics; see Staiger and Stock (1997). Clearly, we may choose a ≥ 0. The DGPs of this simple model, which are completely characterized by the parameters β, ρ, and a, can generate the six quantities (2.6) so as to have the same distributions as those generated by any DGP of the more complete model specified by (2.1), (2.2) and (2.3). If the disturbances are not Gaussian, the distributions of the statistics depend not only on the parameters a, ρ and β but also on the vector w 1 and the linear span of the instruments. We may suspect, however, that this dependence is weak, and limited simulation evidence (not reported) strongly suggests that this is indeed the case. The distribution of the disturbances seems to have a much greater effect on the distributions of the test statistics than the features of W.
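The simulation scheme described in this section can be coded directly from the exact distributions given above. The following sketch (ours, not the authors' code; names are assumptions) draws x_1, x_2 and the six quadratic forms in (3.11) and assembles the six quantities (2.6) via (3.3) and (3.6)–(3.10) for given parameters (a, ρ, β).

```python
# Sketch of the eight-random-variable simulation: draw x1, x2 and the six
# quadratic forms (3.11) from their exact distributions under normality, then
# build the six quantities (2.6) for given (a, rho, beta).
import numpy as np

def draw_six_quantities(a, rho, beta, n, l, k, rng):
    r = np.sqrt(1.0 - rho ** 2)
    x1, x2, zP, zM = rng.standard_normal(4)
    if l - k < 2:
        zP = 0.0
    tP11 = rng.chisquare(l - k - 2) if l - k > 2 else 0.0
    tP22 = rng.chisquare(l - k - 1) if l - k > 1 else 0.0
    tM11 = rng.chisquare(n - l)
    tM22 = rng.chisquare(n - l - 1)
    # quadratic forms (3.11)
    v1P1v1 = x1 ** 2 + zP ** 2 + tP11
    v1P1v2 = x1 * x2 + zP * np.sqrt(tP22)
    v2P1v2 = x2 ** 2 + tP22
    v1MWv1 = tM11
    v1MWv2 = zM * np.sqrt(tM11)
    v2MWv2 = zM ** 2 + tM22
    # six quantities (2.6), via (3.3) and (3.6)-(3.10)
    y2MWy2 = rho ** 2 * v1MWv1 + r ** 2 * v2MWv2 + 2 * rho * r * v1MWv2
    y2P1y2 = (a ** 2 + 2 * a * (rho * x1 + r * x2)
              + rho ** 2 * v1P1v1 + r ** 2 * v2P1v2 + 2 * rho * r * v1P1v2)
    y1MWy1 = v1MWv1 + 2 * beta * (rho * v1MWv1 + r * v1MWv2) + beta ** 2 * y2MWy2
    y1P1y1 = v1P1v1 + 2 * beta * (a * x1 + rho * v1P1v1 + r * v1P1v2) + beta ** 2 * y2P1y2
    y1MWy2 = rho * v1MWv1 + r * v1MWv2 + beta * y2MWy2
    y1P1y2 = a * x1 + rho * v1P1v1 + r * v1P1v2 + beta * y2P1y2
    return y1P1y1, y1P1y2, y2P1y2, y1MWy1, y1MWy2, y2MWy2
```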
4. ASYMPTOTIC THEORY To fix ideas, we begin with a short discussion of the conventional asymptotic theory of the tests discussed in Section 2. By ‘conventional’, we mean that the instruments are assumed to be strong, in a sense made explicit below. Under this assumption, the tests are all classical. In particular, Kleibergen (2002) and Moreira (2001) show that the K statistic is a version of the Lagrange Multiplier test. The reduced-form equation (3.14) of the simplified model of the previous section is written in terms of an instrumental variable w 1 such that w 1 = 1. Conventional asymptotics would set w 1 2 = O p (n) and let the parameter a be independent of the sample size. Our setup is C The Author(s). Journal compilation C Royal Economic Society 2008.
better suited to the weak-instrument asymptotics of Staiger and Stock (1997). For conventional asymptotics, we may suppose that a = n1/2 α, for α constant as the sample size n → ∞. Under the null, β = 0. Under local alternatives, we let β = n−1/2 b, for b constant as n → ∞. Conventional asymptotics applied to (3.3), (3.6), (3.7), (3.8), (3.9) and (3.10) then give a
P 2 y 1 P1 y1 = (x1 + αb) + t11 , a
2 n−1/2 y 1 P1 y2 = αx1 + α b, a
2 n−1 y 2 P1 y2 = α , a
n−1 y 1 MW y1 = 1,
(4.1)
a
n−1 y 1 MW y2 = ρ, and a
n−1 y 2 MW y2 = 1. Using these results, it is easy to check that the statistics t2 , K, and LR, given by (2.5) squared, (2.10), and (2.12), respectively, are all equal to (x 1 + α b)2 asymptotically. They have a common asymptotic distribution of χ 2 with one degree of freedom and non-centrality parameter α 2 b2 = a 2 β 2 . We can also see that the AR statistic AR(0), as given by (2.7), is asymptotically equal to (x 1 + α b)2 + z2P + t P11 , with l − k degrees of freedom and the same non-centrality parameter. Thus AR(0) is asymptotically equal to the same non-central χ 2 (1) random variable as the other three statistics, plus an independent central χ 2 (l − k − 1) random variable. We now turn to the more interesting case of weak-instrument asymptotics, for which a is kept constant as n → ∞. The last three results of (4.1) are unchanged, but the first three have to be replaced by the following equations, which involve no asymptotic approximation, but hold even in finite samples: 2 2 P y 1 P1 y1 = x1 + zP + t11 , 2 2 P P y 1 P1 y2 = ax1 + ρ x1 + zP + t11 + r x1 x2 + zP t22 , and 2 P 2 2 2 y 2 P1 y2 = a + 2a(ρx1 + rx2 ) + ρ x1 + zP + t11 P P + r 2 (x22 + t22 . ) + 2ρr x1 x2 + zP t22
(4.2)
Because the AR statistic AR(0) is exactly pivotal for the model we are studying, its distribution under the null that β = 0 depends neither on a nor on ρ. Since the quantity SS in (2.13) is equal to AR(0) except for degrees-of-freedom factors, it too is exactly pivotal. Its asymptotic distribution under weak-instrument asymptotics is that of y 1 P1 y 1 . Thus, as we see from the first line of (4.2), a
P , SS = x12 + zP2 + t11
(4.3)
which follows the central χ 2 (l − k) distribution. Although the K statistic is not exactly pivotal, it is asymptotically pivotal under both weakinstrument and strong-instrument asymptotics. From (2.10), and using (4.1) and (4.2), we can C The Author(s). Journal compilation C Royal Economic Society 2008.
see, after some algebra, that, under weak-instrument asymptotics and under the null, 2 y1 P1 y2 − ρ y a 1 P1 y1 K= 2 y2 P1 y2 − 2ρ y 1 P1 y2 + ρ y1 P1 y1
2 2 P P ax1 + r x1 x2 + zP t22 x1 (a + rx2 ) + rzP t22 = = . P P (a + rx2 )2 + r 2 t22 a 2 + 2arx2 + r 2 x22 + t22
(4.4)
Although the last expression above depends on a and ρ, it is in fact just a chi-squared variable with one degree of freedom. To see this, argue conditionally on all random variables except x 1 and z P , recalling that all the random variables in the expression are mutually independent. The numerator is the square of a linear combination of the standard normal variables x 1 and z P , and the denominator is the conditional variance of this linear combination. Thus the conditional asymptotic distribution of K is χ 21 , and so also its unconditional distribution. As Kleibergen (2002) remarks, this implies that K is asymptotically pivotal in all configurations of the instruments, including that in which a = 0 and the instruments are completely invalid. For the LR statistic, we can write down expressions asymptotically equal to the quantities ST a and TT in (2.13). First, from (2.14), we have /n2 = 1 − ρ 2 . It is then straightforward to check that 1 a P ST = ax , + r x x + z t22 1 1 2 p (1 − ρ 2 )1/2 and a
TT =
1 2 P a + 2arx2 + r 2 x22 + t22 . 2 1−ρ
(4.5)
a
Comparison with (4.4) then shows that, in accordance with (2.16), K = ST 2 /T T . It is clear from (4.5) and (4.3) that SS and TT are asymptotically independent, since the on x 2 former depends only on the random variables x 1 , z P and t P11 , while the latter depends only √ and t P22 . The discussion based on (4.4) shows that, conditional on TT, ST is distributed as T T times a standard normal variable, and that K is asymptotically distributed as χ 21 . Even though SS and ST are not conditionally independent, the variables ST and SS − ST 2 /T T are so asymptotically. This follows because, conditionally on x 2 and t P22 , the normally distributed
P is a linear combination of the standard normal variables x 1 variable x1 (a + rx2 ) + rzP t22 and z P that partially constitute the asymptotically chi-squared variable SS. These properties led Moreira (2003) to suggest that the distribution of the statistics LR and LR 0 , which are deterministic functions of SS, ST and TT, conditional on a given value of TT, say tt, can be estimated by a simulation experiment in which ST and SS − ST 2 /T T are generated as independent variables distributed respectively as N (0, tt) and χ 2l−k−1 . The variable SS is then generated by combining these two variables, replacing TT by tt. Such an experiment has a bootstrap interpretation that we develop in the next section. It may be helpful to make explicit the link between the quantities SS, ST and TT, defined in (2.13), and the vectors S and T used in Andrews et al. (2006). For the simplified model given by (3.13) and (3.14), these vectors can be expressed as
S = W v1
and
T=
1 W ( y2 − ρ y1 ). r
It is straightforward to check that, with these definitions, S S, S T and T T are what the expressions SS, ST and TT would become if the last three quadratic forms in (4.1) were replaced by their asymptotic limits. It should also be noted that all of the weak-instrument asymptotic results continue to hold with non-Gaussian disturbances under a very few additional regularity conditions. First, the disturbances must have moments of order at least 2. Second, the instrument matrix W must be such as to allow us to apply a law of large numbers to obtain the last three results of (4.1) and to apply a central limit theorem to show that the random variables in (4.2) are asymptotically standard normal or chi-squared, as required. For this, it is enough to be able to apply a central limit theorem to the vectors n−1/2 W 1 v i , i = 1, 2. The results (4.2) are independent of b. This shows that, for local alternatives of the sort used in conventional asymptotic theory, no test statistic that depends only on the six quantities (2.6) can have asymptotic power greater than asymptotic size under weak-instrument asymptotics. However, if instead we consider fixed alternatives, with parameter β independent of n, then the expressions do depend on β. For notational ease, denote the three right-hand sides in (4.2) by Y 11 , Y 12 and Y 22 , respectively. Then it can be seen that the weak-instrument results become 2 y 1 P1 y1 = Y11 + 2βY12 + β Y22 ,
y 1 P1 y2 = Y12 + βY22 , y 2 P1 y2 = Y22 , a
2 n−1 y 1 MW y1 = 1 + 2βρ + β ,
(4.6)
a
n−1 y 1 MW y2 = ρ + β, and a
n−1 y 2 MW y2 = 1. Note that, if we specialize the above results, letting β be O(n−1/2 ) and letting a be O(n1/2 ), then we obtain the conventional strong-instrument results (4.1). We have not written down the weak-instrument asymptotic expression for the Wald t statistic given in (2.5), because it is complicated and not very illuminating. Suffice it to say that it depends non-trivially on the parameters a and ρ, as does its distribution. Consequently, the statistic t is not asymptotically pivotal. Indeed, in the terminology of Dufour (1997), it is not even boundedly pivotal, by which we mean that rejection probabilities of tests based on it cannot be bounded away from one. We will see this explicitly in a moment. The estimating equations (2.4) imply that the IV estimate of β is βˆ = y 1 P1 y2 / y2 P1 y2 . Under weak-instrument asymptotics, we see from (4.6) that a βˆ = β + Y12 /Y22 .
(4.7)
Since E (Y 12 /Y 22 ) = 0, it follows that βˆ is biased and inconsistent. The square of the t statistic (2.5) can be seen to be asymptotically equal to Y22 (Y12 + βY22 )2 2 − 2ρY12 Y22 + Y12
2 Y22
(4.8)
under weak-instrument asymptotics. Observe that this expression is of the order of β 2 as β → ∞. Thus, for fixed a and ρ, the distributions of t2 for β = 0 and β = 0 can be arbitrarily far apart. C The Author(s). Journal compilation C Royal Economic Society 2008.
For β = 0, however, the distribution of t2 , for a and ρ sufficiently close to 0 and 1, respectively, is also arbitrarily far from that with fixed a = 0 and ρ = 1. It is this fact that leads to the failure of t2 to be boundedly pivotal. Let a and r = (1 − ρ 2 )1/2 be treated as small quantities, and then expand the denominator of expression (4.8) through the second order in small quantities. Note that, to this order, ρ = 1 − r 2 /2. Then we see that, to the desired order, 1 2 Y12 = v 1 P1 v 1 + ax1 + rv 1 P2 v 2 − r v 1 P1 v 1 , and 2
1 2 r P v − v P v Y22 = v 1 P1 v 1 + 2 ax1 + rv 1 P1 v 2 + a 2 + 2arx2 + r 2 v 2 1 2 1 1 1 . 2 Consequently, to the order at which we are working, 2 2 2 Y22 − 2ρY12 Y22 + Y12 = r 2 (v 1 P1 v 1 ) . 3 To leading order, the numerator of (4.8) for β = 0 is just (v 1 P1 v 1 ) , and so, in the neighborhood of a = 0 and r = 0, we have from (4.8) that 2 t 2 = v 1 P1 v 1 /r . a
(4.9)
The numerator here is just a chi-squared variable, but the denominator can be arbitrarily close to zero. Thus the distribution of t2 can be moved arbitrarily far away from any finite distribution by letting a tend to zero and ρ tend to one. The points in the parameter space at which a = 0 and ρ = ±1, which implies that r = 0, are points at which β is completely unidentified. To see this, consider the DGP from model (3.13) and (3.14) that corresponds to these parameter values. The DGP can be written as y2 = v 1 ,
y1 = (1 ± β) y2 .
(4.10)
It follows from (4.7) that βˆ = 1 ± β. This is not surprising, since the second equation in (4.10) fits perfectly. This fact then accounts for the t statistic tending to infinity. All the other tests have power that does not tend to 1 when β → ∞ under weak-instrument asymptotics. For K, some algebra shows that 2 2 a Y12 − ρY11 + β(Y22 − Y11 ) + β (ρY22 − Y12 ) , (4.11) K= D(1 + 2βρ + β 2 ) where D ≡ Y22 − 2ρY12 + ρ 2 Y11 + 2β ρY22 − (1 + ρ 2 )Y12 + ρY11 + β 2 Y11 − 2ρY12 + ρ 2 Y22 . (4.12) As β → ∞, the complicated expression (4.11) tends to the much simpler limit of (ρY22 − Y12 )2 . Y11 − 2ρY12 + ρ 2 Y22
(4.13)
Thus, unlike t2 , the K statistic does not become unbounded as β → ∞. Consequently, under weak-instrument asymptotics, the test based on K is inconsistent for any non-zero β, in the sense that the rejection probability does not tend to 1 however large the sample size may be. C The Author(s). Journal compilation C Royal Economic Society 2008.
A similar result holds for the AR test. It is easy to see that a
SS =
Y11 + 2βY12 + β 2 Y22 −→ Y22 , β→∞ 1 + 2βρ + β 2
(4.14)
which does not depend on β. Thus, since AR is proportional to SS, we see that the asymptotic distribution of AR tends under weak-instrument asymptotics to a bounded distribution as β → ∞. Similar results also hold for the LR and LR 0 statistics. By an analysis like the one that produced (4.14), we have that ρY22 − Y12 Y12 − ρY11 + β(Y22 − Y11 ) + β 2 (ρY22 − Y12 ) −→ , and β→∞ r(1 + 2βρ + β 2 ) r Y11 − 2ρY12 + ρ 2 Y22 D a −→ TT = 2 , r (1 + 2βρ + β 2 ) β→∞ r2 a
ST =
where D is given by (4.12). From (2.15), therefore, 1 LR0 −→ 2 Y22 (1 − 2ρ 2 ) + 2ρY12 − Y11 β→∞ 2r 1/2 2 2 2 . + Y11 − 2Y11 Y22 (1 − 2ρ 2 ) − 4ρY11 Y12 + Y22 + 4Y12 − 4ρY12 Y22 The inconsistency of the LR 0 test follows from the fact that this random variable has a bounded distribution. This is true for the LR test as well, but we will spare readers the details. We saw above that, when a = 0 and ρ = 1, the parameter β is unidentified. We expect, therefore, that a test statistic for the hypothesis that β = 0 would have the same distribution whatever the value of β. This turns out to be the case for the K statistic. If one computes the 2 limit of expression (4.11) for a = 0, r → 0, the limiting expression is just (v 1 P1 v 2 ) /v 2 P1 v 2 , independently of the value of β. Presumably a more complicated calculation would show that the same is true for LR and LR 0 . The result that the AR, K, and LR tests are inconsistent under weak-instrument asymptotics appears to contradict some of the principal results of Andrews et al. (2006). The reason for this apparent contradiction is that we have made a different, and in our view more reasonable, assumption about the covariance matrix of the disturbances. We assume that the matrix , defined in (2.3) as the covariance matrix of the disturbances in the structural equation (2.1) and the reduced form equation (2.2), remains constant as β varies. In contrast, Andrews et al. (2006) assumes that the covariance matrix of the reduced form disturbances does so. In terms of our parametrization, the covariance matrix of the disturbances in the two reduced form equations is σ12 + 2ρβσ1 σ2 + β 2 σ22 ρσ1 σ2 + βσ22 . (4.15) ρσ1 σ2 + βσ22 σ22 This expression depends on β in a non-trivial way. In order for it to remain constant as β changes, both ρ and σ 1 must be allowed to vary. Thus the assumption that (4.15) is fixed, which was made by Andrews et al. (2006), implies that the matrix cannot remain constant as |β| → ∞. A little algebra shows that, as β → ±∞ with the covariance matrix held fixed, the parameters β and ρ of the observationally equivalent DGP of the model given by (3.13) and (3.14) along with (3.2) tend to ±1 and ∓1 respectively. Thus the omnipresent denominator C The Author(s). Journal compilation C Royal Economic Society 2008.
1 + 2βρ + β 2 tends to zero in either of these limits. But it is clear from (4.6) that this means that the estimate of σ 21 from the full model (2.1) and (2.2) tends to zero. Based on the above remark, it seems to us much more reasonable that should remain constant than that (4.15) should. The parameter ρ is a much more interesting parameter than the correlation in (4.15). Even when ρ = 0, in which case the OLS estimator of β is consistent, the correlation between the two reduced form disturbances tends to ±1 as β → ±∞. Thus we believe that the latter correlation is not a sensible quantity to hold fixed. In any case, it is rather disturbing that something as seemingly innocuous as the parametrization of the covariance matrix of the disturbances can have profound consequences for the analysis of power when the instruments are weak.
5. BOOTSTRAPPING THE TEST STATISTICS There are several ways to bootstrap the non-exact test statistics that we have been discussing (Wald, K and LR). In this section, we discuss five different parametric bootstrap procedures, three of which are new. We obtain a number of interesting theoretical results. We also discuss the pairs bootstrap, show how to convert the new procedures into semiparametric bootstraps, and propose a new, semiparametric, conditional bootstrap LR test. In the next section, we will see that two of the new procedures, and one of them in particular, perform extremely well. For all the statistics (squaring it first in the case of the t statistic), we perform B bootstrap simulations and calculate the bootstrap P value as
pˆ ∗ (τˆ ) =
B 1 ∗ I τj > τˆ , B j =1
(5.1)
where τˆ denotes the actual test statistic, and τ ∗j denotes the statistic calculated using the j th bootstrap sample. Dufour (1997) makes it clear that bootstrapping is not in general a cure for the difficulties associated with t2 , the Wald statistic. However, since the Wald statistic is still frequently used in practice when there is no danger of weak instruments, it is interesting to look at the performance of the bootstrapped Wald test when instruments are strong. When they are weak, we confirm Dufour’s result about the ineffectiveness of bootstrapping. In the context of our new bootstrap methods, this manifests itself in an almost complete loss of power, for reasons that we analyse. Since the K statistic is asymptotically pivotal under weak-instrument asymptotics, it should respond well to bootstrapping, at least under the null. The LR statistic is not asymptotically pivotal, but, as shown by Moreira (2003), a conditional LR test gives the asymptotically pivotal statistic we call CLR. As we explain below, the implementation of this conditional likelihood ratio test is in fact a form of bootstrapping. Thus it is computationally quite intensive to bootstrap the conditional LR test, since doing so involves a sort of double bootstrap. As an alternative, we propose below a new conditional bootstrap version of the CLR test. Since we have assumed up to now that the disturbances of our model are Gaussian, it is appropriate to use a parametric bootstrap in which the disturbances are normally distributed. In practice, however, investigators will often be reluctant to make this assumption. At the end of this section, we therefore discuss semiparametric versions of our new bootstrap techniques C The Author(s). Journal compilation C Royal Economic Society 2008.
that resample the residuals. We also discuss the pairs bootstrap that was proposed by Freedman (1984) and has been used by Moreira et al. (2007). For any bootstrapping procedure, the first task, and usually the most important one, is to choose a suitable bootstrap DGP; see Davidson and MacKinnon (2006a). An obvious but important point is that the bootstrap DGP must be able to handle both of the endogenous variables, that is, y 1 and y 2 . A straightforward, conventional approach is to estimate the parameters β, γ , π, σ 1 , σ 2 and ρ of the model specified by (2.1), (2.2) and (2.3) and then to generate simulated data using these equations with the estimated parameters. However, the conventional approach estimates more parameters than it needs to. The bootstrap DGP should take advantage of the fact that the simple model specified by (3.13) and (3.14) can generate statistics with the same distributions as those generated by the full model. Equation (3.13) becomes especially simple when the null hypothesis is imposed: It says simply that y 1 =u 1 . If this approach is used, then only the parameters a and ρ need to be estimated. In order to estimate a, we may substitute an estimate of π into the definition (3.5) with an appropriate scaling factor to take account of the fact that a is defined for DGPs with unit disturbance variances. We investigate five different ways of estimating the parameters ρ and a. To estimate ρ, we just need residuals from equations (3.13) and (3.14), or, in the general case, (2.1) and (2.2). To estimate a, we need estimates of the vector π 1 from the reduced-form equation (3.14), or from (2.2), along with residuals from that equation. If u¨ 1 and u¨ 2 denote the two residual vectors, and π¨ 1 denotes the estimate of π 1 , then our estimates are ρ¨ =
a¨ =
¨2 u¨ 1u ¨ 1 u¨ ¨2 u¨ 1u 2u
1/2 , and
¨ 1 /u¨ ¨ 2. n π¨ 1 W1 W1 π 2u
(5.2)
(5.3)
Existing methods, and the new ones that we propose, use various estimates of π₁ and various residual vectors. The simplest way to estimate ρ and a is probably to use the restricted residuals ũ₁ = M_Z y₁ = M_W y₁ + P₁ y₁, which, in the case of the simple model, are just equal to y₁, along with the OLS estimates π̂₁ and OLS residuals û₂ from regression (2.2). We call this widely-used method the RI bootstrap, for 'Restricted, Inefficient'. It can be expected to work better than the pairs bootstrap, and better than other parametric procedures that do not impose the null hypothesis.

As the name implies, the problem with the RI bootstrap is that π̂₁ is not an efficient estimator. That is why Kleibergen (2002) did not use π̂₁ in constructing the K statistic. Instead, the estimates π̃₁ from equation (2.9) were used. It can be shown that these estimates are asymptotically equivalent to the ones that would be obtained by using 3SLS or FIML on the system consisting of equations (2.1) and (2.2). The estimated vector of disturbances from equation (2.9) is not the vector of OLS residuals but rather the vector ũ₂ = M_Z y₂ − W₁π̃₁. Instead of equation (2.9), it may be more convenient to run the regression
$$y_2 = W_1\pi_1 + Z\pi_2 + \delta M_Z y_1 + \text{residuals}. \tag{5.4}$$
This is just the reduced-form equation augmented by the residuals from restricted estimation of the structural equation. Because of the orthogonality between Z and W₁, the vector ũ₂ is equal to the vector of OLS residuals from regression (5.4) plus δ̂ M_Z y₁. We call the bootstrap that uses ũ₁, π̃₁ and ũ₂ the RE bootstrap, for 'Restricted, Efficient'.

Two other bootstrap methods do not impose the restriction that β = 0 when estimating ρ and a. For the purposes of testing, it is a bad idea not to impose this restriction, as we argued in Davidson and MacKinnon (1999). However, it is quite inconvenient to impose restrictions when constructing bootstrap confidence intervals, and, since confidence intervals are implicitly obtained by inverting tests, it is of interest to see how much harm is done by not imposing the restriction. The UI bootstrap, for 'Unrestricted, Inefficient', uses the unrestricted residuals û₁ from IV estimation of (2.1), along with the estimates π̂₁ and residuals û₂ from OLS estimation of (2.2). The UE bootstrap, for 'Unrestricted, Efficient', also uses û₁, but the other quantities come from the artificial regression
$$M_Z y_2 = W_1\pi_1 + \delta\hat u_1 + \text{residuals}, \tag{5.5}$$
which is similar to regression (2.9).
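To make the estimation of ρ and a concrete, here is a minimal sketch of the RE variant, assuming the simple model in which the structural equation contains no exogenous regressors and Z is absent, so that the restricted residuals are just y₁. The function and variable names are illustrative rather than taken from the paper.

```python
import numpy as np

def re_rho_a(y1, y2, W1):
    """RE bootstrap estimates of (rho, a) under the null beta = 0.

    Sketch for the simple model: restricted residuals u1 = y1, efficient
    reduced-form estimates from regressing y2 on W1 and u1 (regression (5.4)
    with Z absent), and u2 = y2 - W1 @ pi1, the reduced-form disturbance
    estimate described in the text.
    """
    n = y1.shape[0]
    u1 = y1                                     # restricted residuals
    X = np.column_stack([W1, u1])               # augmented reduced form
    coef, *_ = np.linalg.lstsq(X, y2, rcond=None)
    pi1 = coef[:-1]                             # coefficients on W1
    u2 = y2 - W1 @ pi1                          # OLS residuals plus delta*u1
    rho = (u1 @ u2) / np.sqrt((u1 @ u1) * (u2 @ u2))
    a = np.sqrt(n * (pi1 @ (W1.T @ W1) @ pi1) / (u2 @ u2))
    return rho, a, pi1, u1, u2
```

Replacing u1 by the unrestricted IV residuals, or pi1 and u2 by their OLS counterparts from (2.2), would give sketches of the UE, RI and UI variants described above.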
Of course, a regression analogous to (5.4) could be used instead of (5.5). A fifth bootstrap method will be proposed after we have obtained some results on which it depends.

It is possible to write the estimates of a and ρ used by all four of these bootstrap schemes as functions solely of the six quantities (2.6). This makes it possible to program the bootstrap very efficiently. Because many of the functions are quite complicated, we will spare readers most of the details. However, we need the following results for the RE bootstrap:
$$\tilde\rho = \frac{\displaystyle y_1^{\top}M_W y_2 + \frac{y_1^{\top}P_1 y_1\; y_1^{\top}M_W y_2}{y_1^{\top}M_W y_1}}{\left(\bigl(y_1^{\top}M_W y_1 + y_1^{\top}P_1 y_1\bigr)\left(y_2^{\top}M_W y_2 + \frac{y_1^{\top}P_1 y_1\,\bigl(y_1^{\top}M_W y_2\bigr)^{2}}{\bigl(y_1^{\top}M_W y_1\bigr)^{2}}\right)\right)^{1/2}}, \quad\text{and} \tag{5.6}$$
$$\tilde a^{2} = \frac{n\left(y_2^{\top}P_1 y_2 - 2\,\dfrac{y_1^{\top}M_W y_2}{y_1^{\top}M_W y_1}\; y_1^{\top}P_1 y_2 + \left(\dfrac{y_1^{\top}M_W y_2}{y_1^{\top}M_W y_1}\right)^{2} y_1^{\top}P_1 y_1\right)}{y_2^{\top}M_W y_2 + y_1^{\top}P_1 y_1\left(\dfrac{y_1^{\top}M_W y_2}{y_1^{\top}M_W y_1}\right)^{2}}. \tag{5.7}$$
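As a check on the expressions above, the following sketch evaluates (5.6) and (5.7) directly from the six quadratic forms. It again assumes the simple model with no Z, so that P₁ projects onto the span of W₁ and M_W = I − P₁, and the names are illustrative. If the reconstruction of (5.6) and (5.7) is correct, its output should agree with the regression-based sketch given earlier up to rounding error.

```python
import numpy as np

def rho_a_from_quadratic_forms(y1, y2, W1):
    """Evaluate (5.6) and (5.7) from the six quadratic forms in (2.6).

    Sketch for the simple model: P1 projects onto W1 and MW = I - P1.
    """
    n = y1.shape[0]
    P1 = W1 @ np.linalg.solve(W1.T @ W1, W1.T)
    MW = np.eye(n) - P1
    y1MWy1, y1MWy2, y2MWy2 = y1 @ MW @ y1, y1 @ MW @ y2, y2 @ MW @ y2
    y1P1y1, y1P1y2, y2P1y2 = y1 @ P1 @ y1, y1 @ P1 @ y2, y2 @ P1 @ y2

    d = y1MWy2 / y1MWy1          # coefficient on y1 in regression (5.4)
    rho = (y1MWy2 + y1P1y1 * d) / np.sqrt(
        (y1MWy1 + y1P1y1) * (y2MWy2 + y1P1y1 * d ** 2))
    a2 = n * (y2P1y2 - 2.0 * d * y1P1y2 + d ** 2 * y1P1y1) \
         / (y2MWy2 + y1P1y1 * d ** 2)
    return rho, np.sqrt(a2)
```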
Davidson and MacKinnon (1999) show that the size distortion of bootstrap tests may be reduced by use of a bootstrap DGP that is asymptotically independent of the statistic that is bootstrapped. In general, this is true only for bootstrap DGPs that are based on efficient estimators. Thus it makes sense to use the efficient estimator π̃₁ rather than the inefficient estimator π̂₁ in order to estimate a, and, via the reduced-form residuals, ρ. Either restricted or unrestricted residuals from (2.1) can be used as the extra regressor in estimating π₁ without interfering with the desired asymptotic independence, but general considerations of efficiency suggest that restricted residuals are the better choice. We would therefore expect that, when conventional asymptotics yield a good approximation, the best choice for bootstrap DGP is RE.
Under weak-instrument asymptotics, things are rather different. We use the results of (4.2) and the last three results of (4.1) to see that, with data generated by the model (3.13) and (3.14) under the null hypothesis,
$$\tilde\sigma_1^{2} \overset{a}{=} 1, \qquad \tilde\sigma_2^{2} \overset{a}{=} 1, \qquad\text{and}\qquad \tilde\rho\,\tilde\sigma_1\tilde\sigma_2 \overset{a}{=} \rho. \tag{5.8}$$
Thus the RE bootstrap estimator ρ̃, as defined by (5.6), is a consistent estimator, as is the estimator used by the RI bootstrap. It can be checked that this result does not hold for any of the estimators that use unrestricted residuals from equation (3.13), since they depend on the inconsistent IV estimate of β; recall (4.7). The weak-instrument asymptotic version of (5.7) under the null can be seen to be
$$\tilde a^{2} \overset{a}{=} a^{2} + 2arx_2 + r^{2}\bigl(x_2^{2} + t_2^{2}\bigr). \tag{5.9}$$
Unless r = 0, then, ã² is inconsistent. It is also biased, the bias being equal to r²(l − k). It seems plausible, therefore, that the bias-corrected estimator
$$\tilde a_{\mathrm{BC}}^{2} \equiv \max\bigl(0,\ \tilde a^{2} - (l - k)(1 - \tilde\rho^{2})\bigr) \tag{5.10}$$
may be better for the purposes of defining the bootstrap DGP. Thus we consider a fifth bootstrap method, REC, for 'Restricted, Efficient, Corrected'. It differs from RE in that it uses ã_BC instead of ã. This has the effect of reducing the R² of the reduced-form equation in the bootstrap DGP.

For the purposes of an analysis of power, it is necessary to look at the properties of the estimates ρ̃ and ã² under the alternative, that is, for non-zero β. From (4.6), we see that
$$\tilde\sigma_1^{2} \overset{a}{=} 1 + 2\beta\rho + \beta^{2}, \qquad \tilde\sigma_2^{2} \overset{a}{=} 1, \qquad\text{and}\qquad \tilde\rho\,\tilde\sigma_1\tilde\sigma_2 \overset{a}{=} \rho + \beta,$$
from which we find that
$$\tilde\rho \overset{a}{=} \frac{\rho + \beta}{\bigl(1 + 2\beta\rho + \beta^{2}\bigr)^{1/2}}.$$
As β → ∞, then, we see that ρ̃ → 1, for all values of a and ρ. For the rate of convergence, it is better to reason in terms of the parameter r ≡ (1 − ρ²)^{1/2}. We have
$$\tilde r^{2} = 1 - \tilde\rho^{2} \overset{a}{=} 1 - \frac{(\rho + \beta)^{2}}{1 + 2\beta\rho + \beta^{2}} = \frac{r^{2}}{1 + 2\beta\rho + \beta^{2}}.$$
Thus r̃ = Oₚ(β⁻¹) as β → ∞. The calculation for ã² is a little more involved. From (5.7) and (4.6), we find that
$$\tilde a^{2} \overset{a}{=} Y_{22} - \frac{2(\rho + \beta)}{1 + 2\beta\rho + \beta^{2}}\bigl(Y_{12} + \beta Y_{22}\bigr) + \left(\frac{\rho + \beta}{1 + 2\beta\rho + \beta^{2}}\right)^{2}\bigl(Y_{11} + 2\beta Y_{12} + \beta^{2} Y_{22}\bigr)$$
$$\overset{a}{=} \frac{1}{\bigl(1 + 2\beta\rho + \beta^{2}\bigr)^{2}}\Bigl((1 + \beta\rho)^{2} Y_{22} - 2(\rho + \beta)(1 + \beta\rho) Y_{12} + (\rho + \beta)^{2} Y_{11}\Bigr).$$
Clearly, this expression is of the order of β⁻² in probability as β → ∞, so that ã → 0, again for all a and ρ. In fact, it is clear that ã = Oₚ(β⁻¹) as β → ∞, from which we conclude that ã and r̃ tend to zero at the same rate as β → ∞, as in the calculation that led to (4.9).

These results can be understood intuitively by considering (3.13) and (3.14). Estimation of ρ uses residuals which, for that model, are just the vector y₁ = βy₂ + v₁. For large β, this residual vector is almost collinear with y₂, and so also with the residual vector ũ₂. The estimated
correlation coefficient therefore tends to 1. Similarly, when y₁ is introduced as an extra regressor for the estimation of a, it is highly collinear with the dependent variable and explains almost all of it, leaving no apparent explanatory power for the weak instruments.

For large β, then, the RE bootstrap DGP is characterized by parameters a and ρ close to 0 and 1, respectively. As we saw near the end of the last section, at this point in the parameter space, β is unidentified, and the Wald statistic has an unbounded distribution. These facts need not be worrisome for the bootstrapping of statistics that are asymptotically pivotal with weak instruments, but they mean that the bootstrap version of the Wald test, like the K and LR tests, is inconsistent, having a probability of rejecting the null hypothesis that does not tend to one as β → ∞. To see this, note from expression (4.9) that the distribution of the Wald statistic t₂, for a and r small and of the same order and β = 0, is of order r⁻². For large β, therefore, the distribution of the bootstrap Wald statistic, under the null, is of order r̃⁻², which we have just seen is of the same order as β². But the distribution of the Wald statistic itself for large β is also of order β², unlike the K and LR statistics. Although the distribution of the actual statistic t₂ for large β and that of the bootstrap statistic (t*)² are not the same, and are unbounded, the distributions of t₂²/β² and (t*)²/β² are of order unity in probability, and their supports overlap. Thus the probability of rejection of the null by the bootstrap test does not tend to 1 however large β may be.

This conclusion, which is borne out by the simulation experiments of the next section, merits some discussion. In Horowitz and Savin (2000), it is pointed out that, unless one is working with pivotal statistics, it is not in general possible to give an empirically relevant definition of the power of a test whose true level is not equal to its nominal level. They conclude that the best measure in practice is the rejection probability of a well-constructed bootstrap test. In Davidson and MacKinnon (2006b), we point out that, even for well-constructed bootstrap tests, ambiguity remains in general. Only when the bootstrap DGP is asymptotically independent of the asymptotically pivotal statistic being bootstrapped can level adjustment be performed unambiguously, on the basis of the null DGP whose parameters are the probability limits of the estimators used to define the bootstrap DGP. This result, as proved, applies only to the parametric bootstrap, and, more importantly here, to cases in which these estimators have non-random probability limits. But, as we have seen, that is not the case here. It therefore seems that there is no theoretically satisfying measure of the power of tests for which the bootstrap DGP is not asymptotically non-random. It is thus pointless to try to refine our earlier result for the Wald test, whereby we learn merely that its bootstrap version is inconsistent.

Because the Wald statistic is not boundedly pivotal, a test based on it has size equal to one, as shown by Dufour (1997). Dufour also draws the conclusion that no Wald-type confidence set based on a statistic that is not at least boundedly pivotal can be valid, whether or not the confidence set is constructed by bootstrapping.
If, however, instead of using a conventional bootstrap confidence set, we invert the bootstrap Wald test to obtain a confidence set that contains all parameter values which are not rejected by a bootstrap Wald test, we may well obtain confidence sets with a level of less than one, since, on account of the inconsistency of the bootstrap test, unbounded confidence sets can arise with positive probability.

We mentioned earlier that the conditional LR, or CLR, test of Moreira (2003) has a bootstrap interpretation. We may consider the variable TT defined in (2.13) as a random variable on which a bootstrap distribution is conditioned. In fact, as can be seen from (4.5) and (5.9), TT is equivalent under weak-instrument asymptotics to ã²/r². Conditional on TT, Moreira shows that the statistics LR and LR₀ are asymptotically pivotal. Thus, rather than estimating both a
and ρ and using the estimates to generate bootstrap versions of the six sufficient statistics (2.6), we can evaluate TT and then generate s simulated versions of the two (conditionally) sufficient statistics SS and ST based on their asymptotic conditional distributions, as discussed earlier. From this, we can obtain conditional empirical distributions for either LR or LR₀ which may be used to compute P values in the usual way. This procedure is not quite a real bootstrap, although it is almost as computationally intensive as a fully parametric bootstrap based on simulating the six quantities. Moreover, the CLR test as originally proposed involves an approximation which may not be a good one in small samples. The 'bootstrap' conditional distributions of SS and ST are not known exactly. Instead, they are approximated on the basis of the distributions when the contemporaneous covariance matrix is known.

Recently, Moreira et al. (2007) have proposed a 'conditional bootstrap' CLR test which uses the pairs bootstrap to generate the two sufficient statistics, but still conditions on TT. In the case of the LR₀ statistic, for each of B bootstrap samples, they compute the statistic
$$\mathrm{LR}^{*}_{0j} = \tfrac{1}{2}\left(SS^{*}_{j} - TT + \sqrt{\bigl(SS^{*}_{j} + TT\bigr)^{2} - 4\,TT\bigl(SS^{*}_{j} - (ST^{*}_{j})^{2}/TT^{*}_{j}\bigr)}\right). \tag{5.11}$$
The quantities SS*ⱼ, ST*ⱼ and TT*ⱼ here are computed from the jth bootstrap dataset, but TT is computed from the actual data. The 'statistic' that we will call CLRb then has the form of a bootstrap P value: it is simply the fraction of the bootstrap samples for which LR*₀ⱼ exceeds LR₀.

In principle, the CLRb test can be based on any bootstrap DGP. The pairs bootstrap is not a good choice, however, because its bootstrap DGP does not satisfy the null hypothesis. This makes it quite tricky to compute SS*ⱼ, ST*ⱼ and TT*ⱼ, as we discuss at the end of this section. Moreover, as our simulation results show, the pairs bootstrap tends to perform much less well than our new RE and REC bootstraps for all the test statistics. It therefore seems attractive to consider CLRb tests based on the semiparametric versions of the RE and REC bootstraps that are described below. We study these in the next section, and they turn out to work very well indeed.

Some remarks concerning bootstrap validity are in order at this point. If the statistic that is bootstrapped is asymptotically pivotal, then the bootstrap is valid asymptotically, in the sense that the difference between the bootstrap distribution and the distribution under the true DGP, provided the latter satisfies the null hypothesis, converges to zero as the sample size tends to infinity; see, among many others, Davidson and MacKinnon (2006a). The bootstrap provides higher-order refinements if, in addition, the bootstrap DGP consistently estimates the true DGP under the null; see Beran (1988). A further level of refinement can be attained if the statistic bootstrapped is asymptotically independent of the bootstrap DGP; see Davidson and MacKinnon (1999). All of these requirements are satisfied by any of the statistics considered here under strong-instrument asymptotics when either the RE or REC bootstrap is used.

Besides the AR statistic, only the K statistic and the CLR test P value are asymptotically pivotal with weak instruments, and so it is only for them that we can conclude without further ado that the bootstrap is valid under weak-instrument asymptotics. However, even if the statistic bootstrapped is not asymptotically pivotal, the bootstrap may still be valid if the bootstrap DGP consistently estimates the true DGP under the null. But, with weak instruments, this is true of no conceivable bootstrap method, since there is no consistent estimate of the parameter a. Consequently, however large the sample size may be, the bootstrap distributions of the Wald statistic and the LR statistic are different from their true distributions.
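Returning to the CLRb procedure just described, the following is a minimal sketch of the P value computation. It takes SS, ST and TT from the actual data and the triples (SS*ⱼ, ST*ⱼ, TT*ⱼ) from the bootstrap samples as given, relies on the reconstruction of (5.11) above, and uses illustrative names throughout.

```python
import numpy as np

def lr0_conditional(ss, st, tt, TT):
    """Equation (5.11): LR0 computed from (ss, st, tt) but conditioned on the
    actual-data TT; the max() guards against tiny negative rounding errors."""
    inner = (ss + TT) ** 2 - 4.0 * TT * (ss - st ** 2 / tt)
    return 0.5 * (ss - TT + np.sqrt(max(inner, 0.0)))

def clrb_pvalue(SS, ST, TT, boot_triples):
    """CLRb: the fraction of bootstrap samples whose LR0*_j exceeds LR0."""
    LR0 = lr0_conditional(SS, ST, TT, TT)   # (5.11) applied to the actual data
    exceed = [lr0_conditional(ss, st, tt, TT) > LR0
              for (ss, st, tt) in boot_triples]
    return float(np.mean(exceed))
```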
Although our discussion has focused on parametric bootstrap procedures, we do not necessarily recommend that they should be used in practice, since the assumption of Gaussian disturbances may often be uncomfortably strong. Any parametric bootstrap procedures that are valid with Gaussian disturbances, under either weak- or strong-instrument asymptotics, remain valid with non-Gaussian disturbances, provided only that laws of large numbers and central limit theorems can be applied, as discussed in Section 4. This follows because, under those conditions, asymptotically pivotal statistics remain so, and the parameter estimators are consistent or not regardless of whether the disturbances are Gaussian.

If the disturbances are very far from being normally distributed, we may reasonably expect that a semiparametric resampling bootstrap will work better than a parametric bootstrap. All of the parametric bootstrap procedures that we have discussed have semiparametric analogues which do not require that the disturbances should be normally distributed. We discuss only the RE and REC bootstraps, partly because they are new, partly because they will be seen in the next section to work very well, and partly because it will be obvious how to construct semiparametric analogues of the other procedures.

For the semiparametric RE bootstrap, we first estimate equation (2.1) under the null hypothesis to obtain restricted residuals M_Z y₁. We then run regression (2.9) or, equivalently, regression (5.4). The residual vector ũ₂ that we want to resample from is the vector of residuals from the regression plus δ̂ M_Z y₁. The bootstrap DGP is then
$$y_1^{*} = u_1^{*}, \qquad y_2^{*} = W_1\tilde\pi_1 + u_2^{*}, \qquad [\,u_1^{*}\ \ u_2^{*}\,] \sim \mathrm{EDF}[\,\tilde u_1\ \ \tilde u_2\,]. \tag{5.12}$$
Thus the bootstrap disturbances are resampled from the joint EDF of the two residual vectors. This preserves the sample correlation between them.
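A minimal sketch of one draw from the semiparametric DGP (5.12), assuming the quantities π̃₁, ũ₁ and ũ₂ have already been computed; the `scale` argument, if set to ã_BC/ã, gives the REC variant discussed next. Names are illustrative.

```python
import numpy as np

def re_bootstrap_draw(u1_tilde, u2_tilde, W1, pi1_tilde, rng, scale=1.0):
    """One bootstrap sample from (5.12): resample the residual *pairs* jointly,
    which preserves their sample correlation, and rebuild (y1*, y2*)."""
    n = u1_tilde.shape[0]
    idx = rng.integers(0, n, size=n)        # draw rows with replacement
    u1_star, u2_star = u1_tilde[idx], u2_tilde[idx]
    y1_star = u1_star                        # structural equation under the null
    y2_star = scale * (W1 @ pi1_tilde) + u2_star
    return y1_star, y2_star

# Example use: rng = np.random.default_rng(12345); passing
# scale = a_bc / a_tilde, with a_bc from (5.10), gives the REC version.
```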
For the REC bootstrap, we need to use a different set of fitted values in the reduced-form equation. To do so, we first compute ã² using either the formula (5.7) or, more conveniently,
$$\tilde a^{2} = \frac{\tilde\pi_1^{\top}W_1^{\top}W_1\tilde\pi_1}{\tilde u_2^{\top}\tilde u_2/n},$$
which is the square of (5.3) evaluated at the appropriate values of π₁ and u₂. Then we calculate ã²_BC from (5.10). The bootstrap DGP is almost the same as (5.12), except that the fitted values W₁π̃₁ are replaced by ã_BC/ã times W₁π̃₁. This reduces the length of the vector of fitted values somewhat. The fitted values actually shrink to zero in the extreme case in which ã_BC = 0.

The quantity ã² is very closely related to the 'test statistic' for weak instruments recently proposed by Stock and Yogo (2005); recall that Z′W₁ = O. When it is large, the instruments are almost certainly not weak, and even asymptotic inference should be reasonably reliable. When it is very small, however, many tests are likely to overreject severely, and those that do not are likely to be seriously lacking in power. There is evidence on these points in the next section.

An alternative to the parametric and semiparametric bootstrap methods that we have discussed in this section is the pairs bootstrap, which was proposed by Freedman (1984). The simplest way to implement the pairs bootstrap is just to resample from the rows of the matrix [y₁  y₂  Z  W].
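For comparison, here is a sketch of the pairs bootstrap just mentioned: complete rows of the data matrix are resampled jointly, so no null hypothesis is imposed. The names are illustrative, and W is taken to contain all the exogenous variables and instruments.

```python
import numpy as np

def pairs_bootstrap_draw(y1, y2, W, rng):
    """One pairs-bootstrap sample: resample rows of [y1 y2 W] jointly."""
    n = y1.shape[0]
    idx = rng.integers(0, n, size=n)
    return y1[idx], y2[idx], W[idx]
```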
Moreira et al. (2007) describe an alternative, semiparametric, resampling procedure, but it yields exactly the same results as the ordinary pairs bootstrap when applied to any of the tests that
we study. One potential advantage of the pairs bootstrap is that it is valid in the presence of heteroskedasticity of unknown form. However, it is quite easy to create wild bootstrap versions of all the semiparametric bootstrap procedures that we have discussed, which are also valid for this case; see Davidson and MacKinnon (2008). Because the pairs bootstrap does not impose the null hypothesis, it is necessary to modify the bootstrap test statistics so that what they test is true in the bootstrap samples. For the t statistic, this simply means replacing β₀ = 0 by β₀ = β̂ in the numerator of the statistic. For K and LR, however, it means computing the quantities SS*, ST* and TT* under the null hypothesis that β = β̂_LIML, where β̂_LIML denotes the LIML estimate of β.¹ Thus the pairs bootstrap is relatively difficult to implement. Moreover, as we will see in the next section, it generally performs much less well than the RE and REC bootstraps.
6. SIMULATION EVIDENCE

In this section, we report the results of a large number of simulation experiments with data generated by the simplified model (3.13) and (3.14). All experiments have 100,000 replications for each set of parameter values. In many of the experiments, we use fully parametric bootstrap DGPs, but we also investigate the pairs bootstrap and the semiparametric RE and REC bootstraps. In most of the experiments, n = 100. Using larger values of n, but for the same value(s) of a, would have had only a modest effect on the results for most of the bootstrap tests; see Figures 9 and 10. Many of the asymptotic tests are quite sensitive to n, however; we provide some evidence on this point in Figure 3.

For the base cases, we consider every combination of a = 2 and a = 8 with ρ = 0.1 and ρ = 0.9. The limiting R² of the reduced-form regression (3.14) is a²/(n + a²); with n = 100, this is roughly 0.04 when a = 2 and roughly 0.39 when a = 8. Thus, when a = 2, the instruments are very weak, and, when a = 8, they are moderately strong. When ρ = 0.1, there is not much correlation between the structural and reduced-form disturbances; when ρ = 0.9, there is a great deal of correlation.

All our results are presented graphically. Note that, within each figure, the vertical scale often changes, because otherwise it would be impossible to see many important differences between alternative tests and bootstrap methods. Readers should check the vertical scales carefully when comparing results across figures or in different panels of the same figure.

Figures 1–4 concern the properties of asymptotic tests. Figure 1 shows the rejection frequencies for the Wald, K, LR and CLR tests (the last of these uses LR rather than LR₀ and is based on 199 simulations) for the four base cases as functions of l − k. These are all increasing functions, so test performance generally deteriorates as the number of overidentifying restrictions, l − k − 1, increases. Of particular note are the extremely poor performance of the Wald test when ρ = 0.9 and the surprisingly poor performance of the LR test when ρ = 0.1. It is also of interest that the Wald test underrejects severely when ρ, a and l − k are all small. This is a case that is rarely investigated in simulation experiments.

Figure 2 shows rejection frequencies as functions of ρ for four values of a. In this and all subsequent figures, l − k = 9, so that there are eight overidentifying restrictions. As expected, performance improves dramatically as a increases. The Wald test is extremely sensitive to ρ.
¹ It would also be possible to compute SS*, ST* and TT* under the null that β = β̂. However, since the K and LR tests are based on LIML estimates rather than IV ones, this does not seem appropriate, and it works much less well.
[Figure 1 appears here: rejection frequencies of the Wald (t), K, LR and CLR tests plotted against l − k, in four panels (ρ = 0.1, a = 2; ρ = 0.1, a = 8; ρ = 0.9, a = 2; ρ = 0.9, a = 8).]
Figure 1. Rejection frequencies as functions of l − k, n = 100.
The others, especially K and CLR, are much less so. Only when ρ is small and a is large does the Wald test perform at all well.

In Figure 3, we consider values of n between 20 and 1280 that increase by factors of approximately √2. This figure makes it clear that the somewhat mediocre performance of CLR evident in the first two figures is a consequence of using n = 100. The performance of CLR, and also of K, always improves dramatically as n increases. Recall that CLR and K are asymptotically pivotal under both weak-instrument and conventional asymptotics. Thus it is not surprising that they can safely be used as asymptotic tests when the sample size is large but the instruments are weak. The performance of LR also improves as n increases, but it continues to overreject, sometimes very severely, even for large values of n. The Wald test is the least sensitive to n, but its performance often deteriorates as n increases.

Figure 4 shows what happens as a varies. We consider values from a = 1 to 64 that increase by factors of √2. As expected, the performance of the Wald and LR tests improves dramatically
[Figure 2 appears here: rejection frequencies of the Wald (t), K, LR and CLR tests plotted against ρ, in four panels (very weak instruments, a = 2; weak instruments, a = 4; moderately strong instruments, a = 8; very strong instruments, a = 16).]
Figure 2. Rejection frequencies as functions of ρ for l − k = 9, n = 100.
as a increases. There is a modest effect on K, which performs quite well even for small values of a, and a somewhat larger effect on CLR. The latter tests would instead benefit from a larger sample size, holding a constant.

Figures 5–7 concern the properties of parametric bootstrap tests under the null hypothesis. In all cases, B = 199. Results are presented for the five different bootstrap DGPs that were discussed in Section 5 and for the pairs bootstrap. Procedures with an 'R' employ restricted estimates of the structural equation, while procedures with a 'U' employ unrestricted estimates. Procedures with an 'E' employ efficient estimates of the reduced-form equation, while procedures with an 'I' employ inefficient ones. The REC procedure bias-corrects the estimate of a.

Figure 5 shows rejection frequencies for the Wald test as a function of a for two values of ρ in the top two panels, and as a function of ρ for two values of a in the bottom two panels. Our new RE and REC bootstraps perform reasonably well, although they do lead to significant underrejection in some cases. The other four methods lead to very severe overrejection when ρ is not small and a is not large. Despite this, the UE and pairs bootstraps actually underreject very
[Figure 3 appears here: rejection frequencies of the Wald (t), K, LR and CLR tests plotted against n, in four panels (a = 2, ρ = 0.1; a = 8, ρ = 0.1; a = 2, ρ = 0.9; a = 8, ρ = 0.9).]
Figure 3. Rejection frequencies as a function of n for l − k = 9.
severely when both a and ρ are small. It is interesting that, in all four panels, the performance of the UE and pairs bootstraps is very similar. It is also apparent, most clearly in the lower left-hand panel, that RI and RE yield similar results when ρ is small. This makes sense, because π̃ cannot be much more efficient than π̂ when there is little correlation between the structural and reduced-form equations.

Figure 6 shows rejection frequencies for the K test in the same format as Figure 5. All methods except the pairs bootstrap, which always underrejects, work very well, with REC being arguably the best of a remarkably good bunch. Both a and n must be quite large for the pairs bootstrap to perform as well as RE and REC do when a = 2 and n = 100.

Figure 7 deals with the LR test, for which the REC bootstrap is unquestionably the best method, overall, in every one of the four panels. Using REC leads to only very modest overrejection in the worst cases, when ρ and a are both small. Except in the upper right-hand panel, the pairs bootstrap performs quite well here, but it is the only method for which the rejection frequency does not seem to converge to 0.05 as a becomes large. For that to happen, n must be quite large.
[Figure 4 appears here: rejection frequencies of the Wald (t), K, LR and CLR tests plotted against a, in three panels (ρ = 0.1; ρ = 0.5; ρ = 0.9).]
Figure 4. Rejection frequencies as functions of a for l − k = 9, n = 100.
Figure 8 deals with the CLR and CLRb tests. The former are based on LR and the latter on LR₀. With n = 100, it would have made almost no difference if they had both been based on either LR or LR₀. As we have remarked, bootstrapping CLR is almost as computationally intensive as performing a double bootstrap.² In contrast, calculating the CLRb test is no more expensive than bootstrapping any of the other tests. To avoid cluttering the figure, we present results for only four cases, namely, the CLR test bootstrapped using the RE and REC bootstraps and the CLRb
² A recent paper by Hillier (2006) derives the exact conditional distribution of the LR statistic under the assumption that the disturbance covariance matrix is known. Critical values for the test can be obtained from this conditional distribution by numerical methods. Use of these critical values might considerably reduce the computational burden of bootstrapping the CLR test.
[Figure 5 appears here: rejection frequencies for bootstrap Wald (t) tests using the REC, RE, RI, UE, UI and pairs bootstraps, in four panels (a = 2 and a = 8 as functions of ρ; ρ = 0.1 and ρ = 0.9 as functions of a).]
Figure 5. Rejection frequencies for bootstrap Wald (t) tests for l − k = 9, n = 100.
test computed using the RE and REC bootstraps.³ Other bootstrap methods performed less well overall. In Figure 8, the CLRb tests always perform extraordinarily well, as does the REC bootstrap version of the CLR test. The two CLRb tests tend to overreject very slightly when a is small, less for the version based on REC than for the one based on RE. Since CLRb, when based on REC, is very much faster and easier to compute than CLR bootstrapped using REC and performs just as well, there appears to be no reason to consider the latter any further.
³ When bootstrapping the CLR test, we used B = 199 together with s = 299 simulations. It is important that s and B not be the same, because, if they are, the actual and bootstrap statistics will be equal with probability approximately 1/(B + 1). Different choices for s and B would have resulted in slightly different results.
[Figure 6 appears here: rejection frequencies for bootstrap K tests using the REC, RE, RI, UE, UI and pairs bootstraps, in four panels (a = 2 and a = 8 as functions of ρ; ρ = 0.1 and ρ = 0.9 as functions of a).]
Figure 6. Rejection frequencies for bootstrap K tests for l − k = 9, n = 100.
As we mentioned at the end of the last section, it is probably better in practice to use a semiparametric rather than a fully parametric bootstrap, because the normality assumption is likely to be false. We therefore undertook a number of experiments to compare parametric and semiparametric versions of the REC and RE bootstraps. In every case, the semiparametric and fully parametric bootstraps yield very similar results. Of course, this would quite possibly not be the case if the disturbances in the DGP were not normally distributed.

Figures 9 and 10 plot rejection frequencies for parametric and semiparametric bootstrap tests as functions of the sample size for four sets of parameter values. For the Wald, K and LR tests, they show rejection frequencies under the null as a function of the sample size for the REC and RE bootstraps, respectively, both parametric and semiparametric. The similarity of the results from the parametric and semiparametric bootstraps is striking. Note that the same random numbers were used to generate the underlying data, but different ones were used for
[Figure 7 appears here: rejection frequencies for bootstrap LR tests using the REC, RE, RI, UE, UI and pairs bootstraps, in four panels (a = 2 and a = 8 as functions of ρ; ρ = 0.1 and ρ = 0.9 as functions of a).]
Figure 7. Rejection frequencies for bootstrap LR tests for l − k = 9, n = 100.
bootstrapping, because generating pseudo-random normal variates does not work the same way as resampling.

The information provided by Figures 9 and 10 for the CLR tests differs from that for the other three tests. They show rejection frequencies for the original CLR test bootstrapped using the semiparametric REC or RE bootstraps and for the REC or RE versions of the CLRb test. Even when bootstrapped, the CLR tests perform poorly when n and a are small, while the simpler CLRb tests perform very much better. The CLR test here is based on LR. If it had instead been based on LR₀, it would have overrejected quite a bit more for very small values of n, but results for n ≥ 50 would have been essentially identical. Strikingly, the performance of CLRb is remarkably similar to that of K, especially for the REC bootstrap.

Figure 9, which presents the results for the REC bootstrap, makes it clear that the decision to focus on the case n = 100 in most of our experiments is not entirely inconsequential. The
[Figure 8 appears here: rejection frequencies, in four panels (a = 2 and a = 8 as functions of ρ; ρ = 0.1 and ρ = 0.9 as functions of a), for the CLR test bootstrapped using the REC and RE bootstraps and for the CLRb test based on the REC and RE bootstraps.]
Figure 8. Rejection frequencies for bootstrap LR tests for l − k = 9, n = 100.
K test works very well indeed for all but the smallest sample sizes, as does the CLRb test. The Wald test underrejects for the smaller sample sizes in three cases out of four, but its performance improves as n increases. When a = 2, the LR test overrejects moderately for small sample sizes and underrejects moderately for large ones.

Figure 10 is similar to Figure 9, but it presents results for the RE bootstrap. Once again, we see that the fully parametric and semiparametric bootstraps produce almost identical results with normally distributed disturbances. At least in some cases, the RE procedure is substantially inferior to the REC one. In particular, the LR test overrejects quite severely when a = 2 and ρ = 0.1, and the Wald test performs less well for cases where a is small and n is large. However, for the K and CLRb tests, performance is once again excellent for n ≥ 50.

It emerges clearly from these last two figures that the rejection frequencies of the RE bootstrap Wald and LR tests do not seem to converge to the nominal level as n → ∞ when a = 2, whereas those of the K and CLRb tests do so. This is in accord with our discussion of the
[Figure 9 appears here: rejection frequencies plotted against n, in four panels (a = 2, ρ = 0.1; a = 8, ρ = 0.1; a = 2, ρ = 0.9; a = 8, ρ = 0.9), for the parametric and semiparametric REC bootstrap versions of the Wald, K and LR tests, together with the semiparametric CLR and CLRb tests.]
Figure 9. Rejection frequencies for REC bootstrap tests as a function of n for l − k = 9.
previous section. We echo Moreira et al. (2007), however, in noting that it is remarkable that the bootstrap Wald and LR tests perform as well as they do.

Figures 11–13 concern power. Because there is no point comparing the powers of tests that do not perform reliably under the null, only the (semiparametric) REC bootstrap is used. In these figures, we present results for the AR test (not bootstrapped, since it is exact), the Wald test, the K test, and the CLRb test. The Wald and K tests were bootstrapped using the semiparametric REC bootstrap rather than the parametric one for comparability with CLRb. Since it is no more expensive to compute CLRb than to bootstrap the LR test, results for LR (which is generally less reliable under the null) are not reported. To reduce the power loss associated with small values of B, we set B = 499 in these experiments, which made them relatively expensive to perform.
[Figure 10 appears here: rejection frequencies plotted against n, in four panels (a = 2, ρ = 0.1; a = 8, ρ = 0.1; a = 2, ρ = 0.9; a = 8, ρ = 0.9), for the parametric and semiparametric RE bootstrap versions of the Wald, K and LR tests, together with the semiparametric CLR and CLRb tests.]
Figure 10. Rejection frequencies for RE bootstrap tests as a function of n for l − k = 9.
Figure 11 deals with the weak-instrument case in which a = 2. The three panels correspond to ρ = 0.1, 0.5 and 0.9. No test has good power properties. When ρ = 0.1, CLRb generally has the most power, but AR is close behind and actually seems to be a little bit more powerful when |β| is large. Both these tests are often much more powerful than K, the only other test that is reliable under the null, and CLRb is never less powerful. The Wald test appears to be the most powerful test for certain values of β, but it performs poorly when |β| is large. When ρ = 0.5, all the tests have strange-looking power functions. They tend to have more power against negative values of β than against positive ones. There is a small region in which Wald dominates, but it is severely lacking in power when |β| is large. Once again, CLRb and AR are generally quite close.
Figure 11. Power of tests when instruments are very weak for l − k = 9, n = 100. (Panels: ρ = 0.1, a = 2; ρ = 0.5, a = 2; ρ = 0.9, a = 2. Legend: AR (REC), K (REC), Wald (REC), CLRb (REC).)
AR seems to have a bit more power for β < −1, but CLRb dominates for most other values of β. The K test is severely lacking in power for β < 0, but much less so for β > 0. When ρ = 0.9, the power functions look stranger still, with far less power against positive values of β than against negative ones. The power function for K has a curious dip for some negative values of β, and in this region K can be much less powerful than AR. The dip is also evident in the simulation results of Stock et al. (2002), which are not directly comparable to ours, since they use the same parametrization as Andrews et al. (2006) and assume that the disturbance covariance matrix is known. See Poskitt and Skeels (2005) and Kleibergen (2007) for an explanation of this dip. The minimum of the power function for the Wald test occurs well to the left of β = 0.
Figure 12. Power of tests when instruments are fairly weak for l − k = 9, n = 100. (Panels: ρ = 0.1, a = 4; ρ = 0.5, a = 4; ρ = 0.9, a = 4. Legend: AR (REC), K (REC), Wald (REC), CLRb (REC).)
The CLRb test is more powerful than AR except in a small region near β = −1. Oddly, however, K is just a little bit more powerful than CLRb for most positive values of β. Figures 12 and 13 show what happens as the instruments become stronger. They deal with the cases in which a = 4 and 8, respectively. As instrument strength increases, all the tests perform very much better. Even K outperforms AR for many values of β, especially when ρ = 0.9. However, when ρ = 0.5 and 0.9, the power function for K once again has a curious dip for some negative values of β, and the power functions for Wald have their minima noticeably to the left of β = 0. The power functions for K when ρ = 0.1 are particularly strange, since power actually declines as the absolute value of β increases beyond a certain point. The CLRb test does not have any of these problems, and it is very reliable under the null.
Figure 13. Power of tests when instruments are strong for l − k = 9, n = 100. (Panels: ρ = 0.1, a = 8; ρ = 0.5, a = 8; ρ = 0.9, a = 8. Legend: AR (REC), K (REC), Wald (REC), CLRb (REC).)
Results for the RE bootstrap are not reported because, in most respects, they are quite similar to the ones in Figures 11–13. In several cases, the Wald test appears to have somewhat more power with RE than with REC, mainly because it is less prone to underreject, or even prone to overreject, under the null. Similarly, the RE version of the CLRb test tends to have very slightly more power than the REC version because it is very slightly more prone to overreject.
7. CONCLUDING REMARKS
We have provided a detailed analysis of the properties of several tests for the coefficient of a single right-hand-side endogenous variable in a linear structural equation estimated by
instrumental variables. First, we showed that the Student's t (or Wald) statistic, the K statistic, and the LR statistic can be written as functions of six random quantities. The AR statistic is also a function of two of these six quantities. Using these results, we obtained explicit expressions for the asymptotic distributions of all the test statistics under both conventional and weak-instrument asymptotics. Under weak-instrument asymptotics, we found that none of the test statistics can have any real asymptotic power against local alternatives. Even when the alternative is fixed, AR, K and LR are not consistent tests under weak-instrument asymptotics. The t statistic has very different properties, however. It is unbounded as β → ∞, so that it appears to be consistent. But it is also unbounded as certain parameters of the DGP tend to limiting values, so that it is not asymptotically pivotal, or even boundedly pivotal. Note that these results depend in an essential way on how the DGP is specified, in particular, the disturbance covariance matrix.

We then proposed some new procedures for bootstrapping the three test statistics. Our RE and REC procedures use more efficient estimates of the coefficients of the reduced-form equation than existing procedures and impose the restriction of the null hypothesis. In addition, the REC procedure corrects for the tendency of the reduced-form equation to fit too well. A semiparametric version of this procedure is quite easy to implement. In most cases, the REC bootstrap outperforms the RE bootstrap, which in turn outperforms previously proposed methods. The improvement can be quite dramatic. Even the Wald test performs quite well when bootstrapped using these procedures, although it sometimes underrejects fairly severely. Interestingly, however, as we show analytically, the RE and REC bootstrap versions of the Wald test are not consistent against fixed alternatives under weak-instrument asymptotics, and they can be much less powerful than the other tests when the instruments are weak and |β| is large.

We also proposed two new variants of the conditional bootstrap LR test, or CLRb, based on the RE and REC bootstraps. Like the K test when bootstrapped using either the RE or REC procedures, these new CLRb tests have excellent performance under the null, even when the sample size is small and the instruments are weak. Unlike the K test, their power functions have no strange features.

All of our theoretical analysis is conducted under the assumption that the disturbances are Gaussian, although some results do not in fact depend on this assumption. To our knowledge, little work has been done on the properties of tests in the presence of weak instruments and non-Gaussian disturbances. We conjecture that the qualitative features of the asymptotic and semiparametric bootstrap tests considered in this paper do not greatly depend on the assumption of Gaussianity.

In the light of our results, it is tempting to conclude that, when the number of overidentifying restrictions is large, so that the AR test may suffer significant power loss, the best method to use is the CLRb test based on the REC bootstrap. It has better performance under the null than every test except the K test bootstrapped using the same method (with which it is pretty much tied), and it can sometimes have substantially more power than the K test.
However, when the instruments are even moderately strong and the sample size is not small, all the tests perform quite well when bootstrapped using the new RE and REC bootstraps.
REFERENCES
Anderson, T. W. (1984). An Introduction to Multivariate Statistical Analysis (2nd ed.). New York: John Wiley.
Anderson, T. W. and H. Rubin (1949). Estimation of the parameters of a single equation in a complete set of stochastic equations. Annals of Mathematical Statistics 20, 46–63.
Andrews, D. W. K., M. J. Moreira and J. H. Stock (2006). Optimal two-sided invariant similar tests for instrumental variables regression. Econometrica 74, 715–52.
Beran, R. (1988). Prepivoting test statistics: a bootstrap view of asymptotic refinements. Journal of the American Statistical Association 83, 687–97.
Davidson, R. and J. G. MacKinnon (1999). The size distortion of bootstrap tests. Econometric Theory 15, 361–76.
Davidson, R. and J. G. MacKinnon (2006a). Bootstrap methods in econometrics. In T. C. Mills and K. Patterson (Eds.), Palgrave Handbook of Econometrics, Volume 1, 812–38. Basingstoke: Palgrave Macmillan.
Davidson, R. and J. G. MacKinnon (2006b). The power of bootstrap and asymptotic tests. Journal of Econometrics 133, 421–41.
Davidson, R. and J. G. MacKinnon (2008). Wild bootstrap tests for IV regression. Forthcoming in Journal of Business and Economic Statistics.
Dufour, J.-M. (1997). Some impossibility theorems in econometrics with applications to structural and dynamic models. Econometrica 65, 1365–87.
Freedman, D. A. (1984). On bootstrapping two-stage least-squares estimates in stationary linear models. Annals of Statistics 12, 827–42.
Hillier, G. (2006). Exact properties of the conditional likelihood ratio test in an IV regression model. CWP 23/06 (revised), Centre for Microdata Methods and Practice, Institute for Fiscal Studies and University College London.
Horowitz, J. L. and N. E. Savin (2000). Empirically relevant critical values for hypothesis tests. Journal of Econometrics 95, 375–89.
Kleibergen, F. (2002). Pivotal statistics for testing structural parameters in instrumental variables regression. Econometrica 70, 1781–803.
Kleibergen, F. (2007). Generalizing weak instrument robust IV statistics towards multiple parameters, unrestricted covariance matrices and identification statistics. Journal of Econometrics 139, 181–216.
Mariano, R. S. and T. Sawa (1972). The exact finite-sample distribution of the limited-information maximum likelihood estimator in the case of two included endogenous variables. Journal of the American Statistical Association 67, 159–63.
Moreira, M. J. (2001). Tests with correct size when instruments can be arbitrarily weak. Working paper, Center for Labor Economics, University of California, Berkeley (revised, Dec. 2006).
Moreira, M. J. (2003). A conditional likelihood ratio test for structural models. Econometrica 71, 1027–48.
Moreira, M. J., J. R. Porter and G. A. Suarez (2007). Bootstrap and higher-order expansion validity when instruments may be weak. NBER Working Paper 302, revised.
Phillips, P. C. B. (1983). Exact small sample theory in the simultaneous equations model. In Z. Griliches and M. D. Intriligator (Eds.), Handbook of Econometrics, Volume 1, 449–516. Amsterdam: North Holland.
Poskitt, S. S. and C. L. Skeels (2005). Small concentration asymptotics and instrumental variables inference. Working Paper 948, Department of Economics, University of Melbourne.
Staiger, D. and J. H. Stock (1997). Instrumental variables regression with weak instruments. Econometrica 65, 557–86.
Stock, J. H., J. H. Wright and M. Yogo (2002). A survey of weak instruments and weak identification in generalized method of moments. Journal of Business and Economic Statistics 20, 518–29.
Stock, J. H. and M. Yogo (2005). Testing for weak instruments in linear IV regression. In D. W. K. Andrews and J. H. Stock (Eds.), Identification and Inference for Econometric Models: Essays in Honor of Thomas Rothenberg, 80–108. Cambridge: Cambridge University Press.
Econometrics Journal (2008), volume 11, pp. 478–498. doi: 10.1111/j.1368-423X.2008.00248.x
Using semi-parametric methods in an analysis of earnings mobility
SHAWN W. ULRICK†
† U.S. Federal Trade Commission, 600 Pennsylvania Ave., NW, Washington, DC 20580
E-mail: [email protected]
First version received: June 2006; final version accepted: March 2008
Summary This paper describes a dynamic random effects econometric model from which inferences on earnings mobility may be made. It answers questions such as, given some initial level of observed earnings, what is the probability that an agent with certain characteristics will remain below a specified level of earnings (for example the poverty level) for a specified number of time periods? Existing research assumes that the distributions of the unobserved permanent and transitory shocks in the model are known up to finitely many parameters. However, predictions of earnings mobility are highly sensitive to assumptions about these distributions. The present paper estimates the distributions of the random effects non-parametrically. The results are used to predict the probabilities of remaining in a low state of earnings. The results from the non-parametric distributions are contrasted to those obtained under a normality assumption. Using the non-parametrically estimated distributions gives estimated probabilities that are smaller than those obtained under the normality assumption. Through a Monte Carlo experiment and by examining unconditional predicted earnings distributions, it is demonstrated that the non-parametric method is likely to be considerably more accurate, and that assuming normality may give quite misleading results. Keywords: Earnings mobility, Semi-parametric estimation, Deconvolution, Panel data with serial correlation.
1. INTRODUCTION
It is well documented that the wage gap between high and low earners has been growing in past years, and anyone living in America today witnesses the regularity with which politicians and the media reference this phenomenon. However, as pointed out by Moffitt and Gottschalk (1998), this phenomenon can be caused by changing trends in either permanent or temporary fluctuations in income. The dispersion of wages is less troubling if low earners are only temporarily poor or, said another way, '[I]f individuals are able to climb up the earnings ladder, then changes in the dispersion of annual earnings are less informative' (Daly and Valletta, 2003). Thus, an accurate analysis of earnings mobility can be useful in putting into perspective the implications of the growing wage gap between the 'rich and poor'. To this end, several papers have examined the extent of earnings mobility. These papers generally make predictions of mobility based on a regression model and some rigid assumption about the distributions of permanent and temporary fluctuations in earnings (i.e. usually normality). The purpose of this paper is to use a recent semi-parametric technique to relax the rigid assumptions about the distributions. As will be seen, relaxing these assumptions can have serious consequences for the estimated mobility.
This and many past papers examining mobility use the basic framework of Lillard and Willis’s (L&W) 1978 paper, in which expected earnings are estimated via a regression model, and the deviation from expected earnings is estimated using the distributions of temporary and permanent fluctuations in workers’ earnings. 1 Specifically, L&W use panel data to estimate a linear, random effects model of earnings. L&W regress logged annual earnings on several variables including education, experience in the labour force, race and time effects. The residual structure includes a random, permanent component and a random, serially correlated transitory component. Together, these components represent unmeasured variables affecting earnings. The permanent component allows the model to capture influences on earnings that are specific to an individual. The serially correlated transitory component allows the model to capture effects of time persistent shocks. With their estimated model, the assumed structure of the random effects, and a further assumption about the distribution of the random effects, L&W are able to estimate the probability that an individual’s earnings will fall into a specific income bracket continuously for some period of time. L&W and most past authors assumed that the distributions of the random error components are normal. This is a serious assumption which often proves dubious [see, e.g. Geweke and Keane, 2000 (‘G&K’), Horowitz and Markatou, 1996 (‘H&M’), and White and MacDonald, 1980]. If the assumption is incorrect, the predicted mobilities may be wrong. Using the wrong distribution in such an analysis can be compared to a quality control expert assuming the lifespan of an electronics component is normally distributed when in fact it follows the exponential. The assumption about the distributions of the errors is so central to the predictions, it seems wrong not to evaluate the correctness of it and whether it matters in application. To this end, the present paper uses a recently developed semi-parametric method to estimate the individual distributions of the error components and uses these estimates in predicting mobility. Although non-parametric methods for estimating distribution functions have been around for years (see, e.g. Silverman, 1985), most past authors have not had the luxury of separately estimating the distributions of the permanent and transitory error components via non- or semi-parametric techniques due to the fact that the errors are only observed when convoluted. The semiparametric method used in this paper (developed in H&M) uses the multiplicative relationship between characteristic functions of convoluted random variables to deconvolute the permanent and transitory components. 2 This paper contrasts the resulting transition probabilities to those obtained under a normality assumption. As will be seen, the differences in transition probabilities found by using the nonparametric method as opposed to assuming normality are, at times, substantial. Specifically, assuming normality gives probabilities of upward mobility that are, in some cases, twice as large as those obtained by using the non-parametrically estimated distributions. The result of a Monte Carlo experiment shows that the non-parametric method is likely to be considerably more accurate than is assuming normality. 
Further evidence that the non-parametric densities provide more accurate results is that they do a much better job of predicting the unconditional distribution of earnings.
1 Others, such as Geweke and Keane (2000), and Datcher-Laoury (1986), have implemented similar studies using the L&W techniques. 2 G&K use a mixture of normals, which shows a substantial improvement over the basic normality assumption. Robin and Bonhomme (2004) provide another approach in dealing with the unobservable permanent and transitory components, using a statistical technique based on copulas.
Table 1. Means and standard deviations of the variables.

Variable      Mean       S.D.      Description
Earn           8.7110    0.7673    Log Annual Earnings
Age(a)         0.4133    0.1125    Age
Race(b)        0.2729    0.4455    Race = 1 if Black
Educ          11.7047    3.3792    Years of Education
Married        0.8400    0.3666    Dummy Indicating Married
Fathed1(b)     0.0392    0.1940    Dummy Indicating Father's Education Missing
Mothed1(b)     0.0391    0.1939    Dummy Indicating Mother's Education Missing
Fathed2(b)     0.2704    0.4442    Dummy Indicating Father's Education H.S.
Mothed2(b)     0.4542    0.4979    Dummy Indicating Mother's Education H.S.
Fathed3(b)     0.0613    0.2399    Dummy Indicating Father's Education College
Mothed3(b)     0.0526    0.2233    Dummy Indicating Mother's Education College
Age2           0.1835    0.0978    Age^2
Age3           0.0868    0.0672    Age^3
Edage          4.7449    1.8117    Educ*age
Edage2         2.0670    1.2390    Educ*age^2
Edage3         0.9609    0.8041    Educ*age^3

(a) Measured as age/100. (b) Denotes time invariant variables.
The organization of the rest of this paper is as follows. Section 2 discusses the data used in the analysis. Section 3 presents the earnings model to be used and the details of its estimation. This section also discusses estimation of the transition probabilities. Section 4 presents the estimation results, including the estimates of the parameters of the earnings model, the densities, and the transition probabilities. This section also provides some informal tests suggesting that using the estimated densities rather than assuming normality provides a better fitting model. Section 5 presents the results of some Monte Carlo experiments aimed at investigating the properties of the estimators used. Concluding remarks appear in Section 6.
2. DATA
This analysis is based on the same subset of the Panel Study of Income Dynamics (PSID) that G&K used. G&K used 22 continuous years of the PSID from 1967 onward. They applied several screens to the data. First, they considered only men between the ages of 25 and 65 who are household heads. Second, they removed individuals for whom race and education data are missing. Third, they dropped the first observation for each household head, since often the first year's observation is only for a partial year's income. Finally, if a person has a missing observation of earnings or marital status for a given year, G&K dropped all observations for that person from that point onward. This leaves an unbalanced panel of 4766 male household heads in their prime earning years. Means and standard deviations of the data appear in Table 1.
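To make the four screening steps concrete, the following is a minimal pandas sketch of how such screens might be applied; it is not the author's code, and the column names ('person', 'year', 'male', 'head', 'age', 'race', 'educ', 'earn', 'married') are hypothetical stand-ins for PSID variables.

```python
import pandas as pd

def apply_screens(df: pd.DataFrame) -> pd.DataFrame:
    df = df.sort_values(["person", "year"])
    # 1. Male household heads between the ages of 25 and 65.
    df = df[(df["male"] == 1) & (df["head"] == 1) & df["age"].between(25, 65)]
    # 2. Drop individuals for whom race or education data are missing.
    df = df.groupby("person").filter(lambda g: g[["race", "educ"]].notna().all().all())
    # 3. Drop the first observation for each person (often a partial year's income).
    df = df[df.groupby("person").cumcount() > 0]
    # 4. Once earnings or marital status is missing, drop that person from then on.
    bad = (df["earn"].isna() | df["married"].isna()).astype(int)
    return df[bad.groupby(df["person"]).cummax() == 0]
```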
3. STATISTICAL MODEL
This section introduces the structure of the earnings model, develops a method of estimation and discusses estimating the earnings mobility probability.
3.1. Earnings model
The earnings model for individual i at time t is
$$ y_{it} = \alpha y_{i,t-1} + X_{it}\beta + V_i\gamma + \eta_i + \varepsilon_{it}, \qquad (3.1) $$
with
$$ \varepsilon_{it} = \rho \varepsilon_{i,t-1} + \xi_{it}. \qquad (3.2) $$
The variable y_it denotes log annual earnings, X_it is a vector of seven time varying independent variables, and V_i is a vector of eight time invariant independent variables, including an intercept. A complete listing of these variables may be found in Table 1. Let there be T_i > 3 observations on individual i. Let there be N individuals. The error components consist of η_i, an i.i.d. individual effect; ε_it, a serially correlated transitory component; and ξ_it, an i.i.d. error. The individual effect represents unmeasured time invariant variables that affect earnings and are unique to an individual. The serially correlated transitory effect captures unmeasured, serially correlated variables and time persistent shocks. This model was also used by G&K. Here, however, in contrast to G&K, the distributions of η, ξ and ε are not assumed to belong to known parametric families. Instead, their distributions are estimated non-parametrically.
3.2. Estimation
Hsiao (1986) and Chamberlain (1984) summarize methods for estimation of panel data models, including those with a lagged dependent variable (LDV) or a serially correlated transitory component. Blundell and Bond (1998) and Arellano and Bond (1991) discuss improved, more efficient GMM methods for estimating a model with a lagged dependent variable. However, there has been relatively little research on estimation of models like (3.1) and (3.2), which have both an LDV and a serially correlated transitory effect. OLS would not provide a consistent estimate of (3.1) and (3.2), due to the correlation of the LDV with the individual effect and the AR(1) transitory effect. Instead, I use an instrumental variables approach. To deal with the correlation between the LDV and the individual effect, remove the latter by subtracting the group mean, that is, the mean of individual i's observations. This yields
$$ y_{it} - \bar{y}_i = \alpha(y_{i,t-1} - \bar{y}_i) + (X_{it} - \bar{X}_i)\beta + \varepsilon_{it} - \bar{\varepsilon}_i, \qquad (3.3) $$
where
$$ \bar{y}_i = \frac{1}{T_i}\sum_{t=1}^{T_i} y_{it}; \qquad \bar{X}_i = \frac{1}{T_i}\sum_{t=1}^{T_i} X_{it}; \qquad \bar{\varepsilon}_i = \frac{1}{T_i}\sum_{t=1}^{T_i} \varepsilon_{it}. $$
(Note that ε̄_i is the unobservable within-group mean of the serially correlated error.) Equation (3.3) does not contain the time invariant variables, so γ cannot be estimated from (3.3). Methods for estimating γ are discussed at the end of this subsection. Note that the LDV term (y_{i,t−1} − ȳ_i)
is still correlated with the transitory effect, but an instrumental variables method will estimate (3.3) consistently. I use (X_{i,t−1} − X̄_{i,−1}) as instruments for (y_{i,t−1} − ȳ_i), where
$$ \bar{X}_{i,-1} = \frac{1}{T_i - 1}\sum_{t=1}^{T_i - 1} X_{it}. $$
The structure of (3.1) suggests that (X_{i,t−1} − X̄_{i,−1}) is correlated with (y_{i,t−1} − ȳ_i) but uncorrelated with the error. To estimate γ, substitute estimates of α and β into ȳ_i − α ȳ_{i,−1} − X̄_i β = V_i γ + η_i + ε̄_i and apply OLS (Hsiao, 1986, pp. 50–51). To estimate ρ, substitute the estimated residuals (ε_it − ε̄_i) and (ε_{i,t−1} − ε̄_i) (obtainable from the estimated equation (3.3)) into³
$$ \frac{1}{N}\sum_{i=1}^{N}\frac{1}{T_i}\sum_{t}\frac{(\varepsilon_{it} - \bar{\varepsilon}_i)(\varepsilon_{i,t-1} - \bar{\varepsilon}_i)}{(\varepsilon_{i,t-1} - \bar{\varepsilon}_i)^{2}}. $$
3 This is the same procedure as in Hsiao (1986), p. 55. However, because of the lagged dependent variable, for Hsiao's Step 2, I used the aforementioned instrumental variables estimator instead of OLS.
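To illustrate the within-group IV steps just described, the sketch below demeans a balanced panel and applies 2SLS, using (X_{i,t−1} − X̄_{i,−1}) as the instrument for the demeaned lagged dependent variable. It is a simplified stand-in under assumed inputs (a balanced panel stored as NumPy arrays, no γ or ρ step), not the author's code.

```python
import numpy as np

def within_iv(y, X):
    """y: (N, T) log earnings; X: (N, T, k) time-varying regressors."""
    N, T, k = X.shape
    ybar = y.mean(axis=1, keepdims=True)                  # individual means of y
    Xbar = X.mean(axis=1, keepdims=True)                  # individual means of X
    Xbar_lag = X[:, :-1, :].mean(axis=1, keepdims=True)   # mean of X over t = 1,...,T-1
    # Stack observations t = 2,...,T (one observation is lost to the lag).
    dy    = (y[:, 1:] - ybar).reshape(-1)                 # y_it - ybar_i
    dylag = (y[:, :-1] - ybar).reshape(-1)                # y_{i,t-1} - ybar_i (endogenous)
    dX    = (X[:, 1:, :] - Xbar).reshape(-1, k)           # X_it - Xbar_i
    zlag  = (X[:, :-1, :] - Xbar_lag).reshape(-1, k)      # X_{i,t-1} - Xbar_{i,-1} (instruments)
    W = np.column_stack([dylag, dX])                      # regressors of (3.3)
    Z = np.column_stack([zlag, dX])                       # instrument set (X instruments itself)
    PzW = Z @ np.linalg.lstsq(Z, W, rcond=None)[0]        # first-stage fitted values
    coef = np.linalg.lstsq(PzW, dy, rcond=None)[0]        # 2SLS second stage
    return coef[0], coef[1:]                              # alpha_hat, beta_hat
```

With arrays y (N × T) and X (N × T × k) in hand, within_iv(y, X) returns estimates of α and β; γ and ρ would then be recovered in the follow-up steps described above.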
3.3. Earnings mobility probability
This subsection discusses a method to estimate the probability that an individual will remain below a specified level of log earnings, y*, for θ time periods, given that he initially had log earnings less than or equal to a specified value, y_i0. This probability may be written
$$ p(Y_{i1} < y^*, \ldots, Y_{i\theta} < y^* \mid Y_{i0} \le y_{i0}). \qquad (3.4) $$
The expression for (3.4) is analytically very complicated and involves several iterated integrals. The integrals are evaluated by simulation. For several thousand individuals with initial log earnings less than or equal to y_i0 and specified characteristics (i.e. specified values of X_i0, V_i), generate earnings paths out to θ time periods according to an estimated version of (3.1) and (3.2). Calculate the proportion of those whose simulated log earnings lie below y* for all θ periods. This number is an estimate of the value of (3.4). To generate the earnings paths, one needs estimates of the parameters of the model (3.1) and (3.2), a method of sampling the initial conditions y_i0 and ε_i0, and the appropriate values of X_it for t > 1. (V_i does not change; ε_it is known for t ≥ 1, given ρ, ε_i0 and ξ_it.) Finally, to sample the initial conditions and subsequent shocks, one needs estimates of the distributions of η_i and ξ_it. (The distribution of ε_it is not needed in this exercise because, as shown in Appendix A, ε_it may be written as a function of anterior ξ_it's.) The method to estimate the parameters was described in Section 3.2 of this paper. Details on generating the initial conditions are discussed in Appendix A. As shown, this may be done using the estimated density functions of ξ_it and η_i, though care must be taken since the error components are correlated with each other and with y_0. To calculate X_it for t > 1, note that, with the exception of marriage, every element of X_it is a function of only age and education level. Education level is time invariant. If age is known at time t, age is known at every time period. If
one makes an assumption about marriage (I assume that it is unchanging), all the elements of X_it are known at every t. Lastly, the distributions of the error components are estimated non-parametrically. Standard kernel density estimators (such as those in, e.g. Silverman, 1985) are not applicable to this model, because we never observe estimates of η_i, ε_it or ξ_it. Instead we only observe estimates of the errors convoluted together in some form. For example, we can observe estimates of
$$ \hat{\varepsilon}_{it} + \hat{\eta}_i = y_{it} - [\hat{\alpha} y_{i,t-1} + X_{it}\hat{\beta} + V_i\hat{\gamma}] \qquad (3.5) $$
or
$$ \hat{\xi}_{it} - \hat{\xi}_{i,t-1} = (\Delta y_{it} - \hat{\rho}\,\Delta y_{i,t-1}) - (\Delta y_{i,t-1} - \hat{\rho}\,\Delta y_{i,t-2})\hat{\alpha} - (\Delta X_{it} - \hat{\rho}\,\Delta X_{i,t-1})\hat{\beta} - (1 - \hat{\rho})V_i\hat{\gamma}, \qquad (3.6) $$
where the 'hat' notation represents an estimated value and the delta means take the first difference. I hence use a non-parametric deconvolution estimator developed in Horowitz (1997) and H&M. In essence, H&M use the fact that the characteristic function for the sum of two independent random variables is the product of the characteristic functions of the individual variables. The authors use this relationship to write the characteristic functions of η, ε and ξ in terms of the characteristic functions of the convoluted residuals whose estimates are observable. The characteristic functions can then be estimated from the residuals and smoothed with a kernel function, and the inversion formula can be applied to the smoothed estimates to obtain estimates of the densities.⁴ The basic estimator presented in H&M assumes that η and ε are i.i.d. The authors present extensions that allow for asymmetry or serial correlation in ε. Horowitz (1997, pp. 125–127) combines the two extensions to allow for both asymmetry and serial correlation in ε. I allow for asymmetry and serial correlation and hence implement this version of the estimator.⁵ There has been little research in the best way to choose the tuning parameters of the estimator. See Appendix B for details on the choices made in this paper.
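To fix ideas on the simulation estimator of (3.4) described in Section 3.3, the sketch below generates earnings paths from (3.1)–(3.2) and reports the share that stays below y* for all θ periods. It is illustrative only: the shock samplers draw_eta, draw_xi and draw_eps0, the covariate path x_path and all parameter values are placeholders, and the dependence between the initial conditions that Appendix A handles is ignored here; in the paper the draws come from the estimated densities.

```python
import numpy as np

def mobility_prob(theta, y0, y_star, x_path, v, alpha, beta, gamma, rho,
                  draw_eta, draw_xi, draw_eps0, n_sim=10_000, seed=0):
    """Simulation estimate of (3.4) for one covariate profile.
    x_path: (theta, k) array of X_it values; v: (m,) array of V_i values."""
    rng = np.random.default_rng(seed)
    eta = draw_eta(n_sim, rng)                      # individual effects eta_i
    eps = draw_eps0(n_sim, rng)                     # initial transitory components
    y_lag = np.full(n_sim, y0)                      # initial (log) earnings
    below = np.ones(n_sim, dtype=bool)
    for t in range(theta):
        eps = rho * eps + draw_xi(n_sim, rng)       # update AR(1) component, eq. (3.2)
        y = alpha * y_lag + x_path[t] @ beta + v @ gamma + eta + eps   # eq. (3.1)
        below &= (y < y_star)                       # still below y* in every period so far
        y_lag = y
    return below.mean()                             # proportion of paths below y* throughout
```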
4. ESTIMATION RESULTS
This section presents the results of estimating the parameters of the earnings model, the densities of the random effects, and the transition probabilities. Parameter estimates are presented in Section 4.1. Section 4.2 presents the estimated density functions and the results of some informal specification tests which suggest that the error components are not normal. The estimated transition probabilities are given in Section 4.3.
4.1. Estimated parameters
Table 2 shows the estimates of the coefficients of the earnings model and their standard errors. Estimates of ρ, the variances of ξ and η, and the third through fifth moments of ξ are also given.⁶
4 Note that the estimator does not function well in the tails of the distributions. Thus, the tails are estimated by assuming they belong to a mixture of two normals having the same first three moments as the actual distribution in the data. Few observations are in the tails, so this should have minimal effect.
5 I do not implement the bias correction of H&M.
6 Methods for estimating the variances of ξ and η are in Hsiao (1986), pp. 55–56. The higher moments of ξ were estimated by a relatively straightforward extension (involving higher powers of ξ_it − ξ_{i,t−1}).
Table 2. Estimated parameters.

Variable                        Estimate       S.E.
Learn (Lagged Log Earnings)      −0.3047      0.2187
Married                           0.1199      0.0162
Age                             −66.1786     11.4648
Age2                            143.6497     24.5376
Age3                           −105.1305     17.3960
Edage                             5.2093      1.0103
Edage2                          −10.3406      2.0516
Edage3                            6.7819      1.3700
Race                             −0.3658      0.0073
Educ                             −0.7638      0.0010
Fathed1                          −0.0482      0.0159
Mothed1                          −0.0379      0.0175
Fathed2                           0.0518      0.0080
Mothed2                          −0.0191      0.0064
Fathed3                          −0.0100      0.0140
Mothed3                           0.0550      0.0141
Constant                         20.5671      0.0131

Moments of ξ: μ2 = 0.1542, μ3 = −0.2719, μ4 = 0.9061, μ5 = −2.7302. σ²_η = 0.5433. ρ = 0.43.
The negative third moment suggests the distribution of ξ is skewed to the left and therefore asymmetrical. Under normality, the value of μ4/μ2² − 3 is zero; the data give this value as 0.9061/0.1542² − 3 ≈ 35.1, which is evidence that the distribution is thick-tailed and not normal. The estimate of ρ is 0.43; the estimate of α, the coefficient on the LDV, is negative but insignificant. These values are similar to what has been found in the literature. Lillard and Willis found ρ̂ to be in the neighbourhood of 0.40. Geweke and Keane found, for the mixture model, ρ̂ to be 0.655 and α̂ to be −0.121—negative but small in magnitude. As expected, marriage has an upward influence on earnings, as does being white. In general, greater parental education predicts higher earnings, although not all variables pertaining to parents' education are significant. The effects of age and education on earnings are not obvious from the table, because they are formed from a polynomial. A simple plot of earnings versus age, holding education and all other variables at a constant value, such as their means, reveals that initially earnings increase with age and later, around the age of 45, decrease—a characteristic commonly found in the literature (Freeman, 1972). A year's increase in education level has a net effect of 5 to 9 percent higher earnings, depending on the specific age level. Again, this is a number similar to what has been found in the literature (Psacharopoulos, 1992). To help provide interpretation to the age/education coefficients, Table 3 lists the predicted percentage change in earnings for changes in age and education.
Table 3. Predicted percentage increase in annual earnings for increases in age/education.

Held fixed        Change (from → to)        Predicted increase in earnings
Education 12      25 → 35 years old           16%
Education 12      35 → 45 years old            5%
Education 12      45 → 55 years old          −20%
Age 25            12 → 16 years of education  17%
Age 35            12 → 16 years of education  51%
Age 45            12 → 16 years of education  60%
Age 55            12 → 16 years of education  58%
4.2. Estimated density functions
To motivate the need for estimating the densities, some informal graphical tests of normality were carried out on the data. These graphs reconfirm the non-normality revealed by the moments. Were ξ normally distributed, ξ_it − ξ_{i,t−1} would be normally distributed, and hence a normal probability plot of the estimated ξ_it − ξ_{i,t−1} would be a straight line, up to random sampling error. Figure 1 depicts this normal probability plot, where the estimated residuals are obtained from (3.6) above. The plot is S-shaped, suggesting that the distribution of ξ is not normal; its tails are too thick.⁷ The evidence that η is not normally distributed is less strong. Because we only observe η when convoluted with some form of ε or ξ, we cannot check for normality in the η's by creating a normal probability plot similar to that used above. However, we can perform another informal graphical test. If η is normally distributed, then its characteristic function is exp(−cτ²), where c is a constant. Hence a plot of ln[ĥ_η(τ)] against −τ² would be a straight line. Figure 2 depicts this plot. Any departure in the distribution of η from normality is small, since the curvature in Figure 2 is only slight.
7 A more formal normality test for time-dependent error terms in complicated estimators is in Bai (2003), though it is not implemented here for two reasons: First, it is applicable to ξ_it − ξ_{i,t−1} but not η, since η is not observable. Second, the primary goal is to see whether the normality assumption affects the final results, not proving whether these random variables are normal.
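The two informal checks can be reproduced along the following lines. This is only a sketch: the vector d is a stand-in for the estimated ξ_it − ξ_{i,t−1} from (3.6), and the characteristic-function check is applied directly to a sample rather than to the deconvolution estimate ĥ_η used for Figure 2.

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

d = np.random.default_rng(0).standard_t(df=4, size=5000)   # placeholder residual differences

# (i) Normal probability plot: a straight line indicates normality (cf. Figure 1).
stats.probplot(d, dist="norm", plot=plt)

# (ii) For a normal variate, ln|phi(tau)| = -c * tau**2, so the log of the empirical
#      characteristic function plotted against -tau**2 should be linear (cf. Figure 2).
tau = np.linspace(0.1, 3.0, 30)
ecf = np.array([np.exp(1j * t * d).mean() for t in tau])
plt.figure()
plt.plot(-tau**2, np.log(np.abs(ecf)))
plt.xlabel("-tau^2")
plt.ylabel("ln of empirical characteristic function")
plt.show()
```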
Figure 1. Normal probability plot of ξ_it − ξ_{i,t−1}.
Figure 2. Estimated empirical cf. of η_i versus square of its argument.
Figure 3. Distribution of earnings in sample (solid line) versus that predicted by normal (short dashes) and semi-parametric (long dashes) models.
Further evidence that the transitory component may not be normally distributed comes from plotting kernel density estimates of simulated earnings obtained both by assuming normality and by using the non-parametric method. These plots are overlaid with a kernel density estimate of the actual earnings found in the PSID sample. Figure 3 displays this plot. The simulated earnings were generated from the same distribution of covariates found in the data. The predicted distribution of earnings obtained by using the non-parametric densities closely mimics that of the sample. The predicted distribution obtained from the normality assumption fits poorly. Simulated quantiles of earnings are examined as well. Table 4 shows various quantiles of log earnings in the PSID sample. It also shows the percentage of individuals in the simulated data that lie below the given quantiles. As can be seen, the percentage of individuals who lie below each quantile in the simulated earnings data obtained from the non-parametric densities is quite close to the truth. This is not the case under the normality assumption. In fact, the simulated earnings data under the normality assumption place, for example, 19% of individuals in the bottom decile of earnings, and 36% in the bottom quartile.
4.3. Earnings mobility probability
The previous subsection presented evidence that the true distribution of the transitory error effect is not normal, and that assuming normality gives a poor prediction of the unconditional earnings distribution. This subsection shows the implications of assuming normality for the transition probabilities. Table 5 presents the estimated probabilities that a person has earnings in the bottom quintile of the sample continuously for θ time periods, conditional on being initially there. The 20th percentile in the PSID sample is log annual earnings of 8.2.
Table 4. Quantiles of earnings in the PSID sample versus the percentage that lie below each quantile in the simulated earnings.

Quantile    Quantile value    % below, normal    % below, non-parametric
5%           7.44              9%                 4%
10%          7.87             19%                10%
25%          8.33             36%                27%
50%          8.74             54%                53%
75%          9.08             68%                74%
90%          9.36             78%                87%
95%          9.53             84%                93%
99%         10.03             93%                99%

Notes: Quantile value: the value of the given quantile in the PSID sample. % below, normal: the percentage who lie below each quantile in the normal simulated data. % below, non-parametric: the percentage who lie below each quantile in the non-parametric simulated data.
Table 5. Estimated transition probabilities: the probability of remaining in the bottom quintile of the sample for θ time periods conditional on initially being there.

            Education 8          Education 12         Education 16
θ          normal   nonpar.     normal   nonpar.     normal   nonpar.
Black
1          0.665    0.71        0.584    0.627       0.506    0.533
3          0.447    0.53        0.344    0.425       0.251    0.32
5          0.349    0.436       0.24     0.321       0.152    0.226
8          0.273    0.351       0.158    0.236       0.084    0.151
11         0.227    0.299       0.118    0.186       0.054    0.113
White
1          0.573    0.618       0.491    0.515       0.421    0.451
3          0.341    0.433       0.246    0.327       0.185    0.252
5          0.243    0.337       0.157    0.244       0.103    0.177
8          0.178    0.267       0.094    0.18        0.054    0.107
11         0.14     0.222       0.066    0.145       0.033    0.066

Notes: nonpar.: the probabilities obtained by using the non-parametrically estimated densities. normal: the probabilities obtained by assuming normality.
The transition probabilities were estimated for black and white men who are married and initially 30 years old, with both parents possessing only a high school diploma. The probabilities were calculated both under the assumption of normality of the error components and by using the non-parametrically estimated densities. The table shows that those with more education tend to have a lower probability of remaining in the bottom quintile. Similarly, whites have a lower probability of staying in the low-earnings state than do blacks.
Figure 4. Non-parametrically estimated density function of ξ overlaid with normal (dashed line).
More importantly, the probabilities of remaining in the bottom quintile obtained by using the estimated densities are, in most cases, higher than those obtained under the assumption of normality—sometimes nearly twice as high. Monte Carlo experiments in Section 5 suggest that these differences are real features of the data and not artefacts of the estimation procedure. This evidence suggests that assuming normality usually overstates upward mobility. For low earnings categories, for example uneducated blacks, the non-parametric distribution gives less upward mobility than does assuming normality. Table 5 shows, for example, that under normality, blacks with 8 years of education have about a 23% chance of remaining in the bottom quintile for 11 years, but when using the empirical densities, this number is a larger 30%. Other groups show similar results. This difference is reasonable. Investigating the simulated data reveals that for all agent-types displayed, except college-educated whites, the conditional mean level of earnings (conditional on being initially poor) for the first few time periods is in the bottom quintile. To move out of this low-earnings state, an individual would need a large, positive transitory shock to his earnings. This event is more likely with the normal density than it is with the non-parametric one. Figure 4 presents a graph of the non-parametric density function overlaid with the normal of the same variance. As can be seen, the non-parametric density has much more mass near zero, and hence provides less probability for moderately large transitory shocks and thus less probability of upward mobility.⁸
8 The non-parametrically estimated density is negative for ξ ∈ [−0.5, 0.65]. Therefore, the figure was not plotted for these values of ξ . (As stated in the text, the tails of the density were estimated with a mixture of two normals when estimating the earnings mobility probabilities.)
Table 6. Mean of estimated parameters versus true value.

              Learn    Married    Age       Age2      Age3       Edage    Edage2    Edage3
Actual        −0.30    0.12       −66.18    143.65    −105.13    5.21     −10.34    6.78
Estimated     −0.31    0.12       −66.32    143.84    −105.19    5.22     −10.35    6.77
S.D.           0.09    0.01         9.57     22.77      17.67    0.79       1.88    1.46

Notes: Actual: true value used in the data generation process. Estimated: mean of the 1000 estimated parameters. S.D.: standard deviation of the 1000 estimated parameters.
5. MONTE CARLO EXPERIMENTS
This section presents results of Monte Carlo experiments aimed at providing information about the accuracy of the methods used in estimating the parameters of the model, the density f_ξ, and the mobility probabilities.
5.1. Estimates of the parameters
This subsection reports the results of a Monte Carlo experiment that was carried out to check the finite sample performance of the IV estimator of the coefficients described in Section 3. This experiment consisted of repeatedly applying the estimator to data generated by simulation from the estimated model. A balanced panel dataset, with the assumption that each individual has observations for 14 years, was used. The distribution for X_i1 and V_i found in the real data was used, excluding those who are initially over 51 years of age. After 14 years, these men would exceed 65 and would be at an age outside the typical working years. This yielded a sample size of 57 400, roughly the same as that of the real dataset, with approximately the same distribution of covariates. For each i, I generated ε_i0, η_i and y_i0 by sampling error terms from the non-parametrically estimated densities. The sampling was done by the method discussed in Appendix A, which was also used in estimating the earnings mobility probability, except that here I kept all y_i0's. Next, for each individual, for t = 1, ..., 14, I sampled ξ_it from its estimated distribution, used the V_i and the appropriate X_it, and generated the corresponding y_it. With the exception of the marriage variable, the appropriate values for the elements of X_it for each individual are predetermined, as discussed in Section 4. To deal with the uncertain path of marriage, I introduced an element of randomness: I changed the value of marriage with a probability equal to the proportion of observations in which marital status changed in the real data. The coefficients α and β were estimated by the methods discussed in Section 3. This process was repeated 1000 times. The means and standard deviations of the estimates for α and each element in β, as well as the actual values used in the data generation process, are reported in Table 6. All the means are very close to the true values, suggesting that the finite sample performance of the IV estimation method is satisfactory.
5.2. Estimates of the density functions
This subsection discusses the results of an experiment investigating the accuracy of the method used to estimate the transitory error component's distribution.
Figure 5. Estimated density (dashed line) of ξ versus its true density (solid line).
This experiment consisted of sampling error components from a known distribution, generating convoluted error terms, and then using the non-parametric deconvolution procedure to estimate f_ξ. To generate simulated error components, I sampled ξ from a mixture of two normals with the first five moments equal to those estimated from the actual data. Since the evidence suggests that η is not highly non-normal, I sampled it from a zero-mean normal distribution with standard deviation equal to that found in the data. I used the sampled ξ's and η's to simulate convoluted error components ε_it + η_i for 57 400 individuals and 14 time periods. (I simulated the initial condition ε_i0 as described in Appendix A and the remaining ε_it according to equation (3.2).) I implemented the non-parametric density estimation exactly as I did on the real data, including the methods of choosing the tuning parameters. (In this experiment, I did not estimate the regression parameters in model (3.1)–(3.2). The rate of convergence of the estimated parameters is much faster than that of the estimated distributions, so estimating only the distributions in this experiment should make little difference and simplifies computation.⁹) A plot of the estimated density of ξ superimposed on the real density used in the generation process appears in Figure 5. The plot shows that the non-parametric method works well for a model with an AR(1) and an asymmetrical transitory error component.
5.3. Estimates of the mobility probability
This subsection presents the results of a Monte Carlo experiment that was designed to check the finite sample properties of the method used in Section 4 to estimate the earnings mobility probabilities.
9 Moreover, the Monte Carlo experiment in Section 5.1 separately investigated the finite sample characteristics of the IV estimator used to estimate the parameters.
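For readers unfamiliar with deconvolution, the following stripped-down sketch shows the basic characteristic-function arithmetic in the simplest case, where w = ξ + η and η is normal with known variance. The paper's H&M estimator is richer (it handles the AR(1) transitory structure, asymmetry and estimated rather than known inputs), so this is illustrative only; all values shown are placeholders.

```python
import numpy as np

def deconvolve(w, s2_eta, grid, h=0.2):
    """Recover the density of xi from w = xi + eta, with eta ~ N(0, s2_eta)."""
    tau = np.linspace(-1.0 / h, 1.0 / h, 401)
    dtau = tau[1] - tau[0]
    ecf_w = np.array([np.exp(1j * t * w).mean() for t in tau])   # empirical cf of w
    cf_eta = np.exp(-0.5 * s2_eta * tau**2)                      # cf of eta (normal)
    cf_kernel = (1.0 - (h * tau) ** 2) ** 3                      # cf of the smoothing kernel
    cf_xi = cf_kernel * ecf_w / cf_eta                           # smoothed estimate of cf of xi
    dens = [np.real(np.sum(np.exp(-1j * tau * x) * cf_xi)) * dtau / (2 * np.pi) for x in grid]
    return np.clip(dens, 0.0, None)                              # inversion formula

rng = np.random.default_rng(0)
xi = rng.exponential(0.4, size=20_000) - 0.4      # skewed stand-in for the transitory component
eta = rng.normal(0.0, 0.5, size=20_000)           # stand-in for the individual effect
grid = np.linspace(-2.0, 2.0, 201)
f_xi_hat = deconvolve(xi + eta, s2_eta=0.25, grid=grid)
```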
Table 7. True mobility probabilities versus estimated probabilities: the probability of remaining in the bottom quintile of the sample for θ time periods conditional on being initially there.

            Education 8              Education 12             Education 16
θ       true   nonpar.  normal    true   nonpar.  normal    true   nonpar.  normal
Black
1       0.82   0.83     0.70      0.76   0.76     0.64      0.68   0.68     0.57
2       0.74   0.76     0.56      0.66   0.65     0.48      0.54   0.53     0.39
3       0.69   0.70     0.48      0.59   0.57     0.38      0.46   0.43     0.29
4       0.66   0.67     0.42      0.54   0.51     0.32      0.39   0.35     0.22
5       0.63   0.64     0.38      0.49   0.47     0.27      0.34   0.30     0.18
6       0.61   0.61     0.34      0.46   0.43     0.24      0.30   0.25     0.14
7       0.59   0.59     0.32      0.43   0.40     0.21      0.27   0.21     0.12
8       0.57   0.58     0.29      0.41   0.38     0.19      0.25   0.18     0.10
9       0.55   0.56     0.27      0.39   0.36     0.17      0.22   0.16     0.09
10      0.54   0.55     0.26      0.37   0.34     0.16      0.21   0.14     0.07
11      0.53   0.54     0.25      0.35   0.32     0.14      0.19   0.13     0.06
White
1       0.74   0.76     0.62      0.65   0.66     0.55      0.54   0.56     0.48
2       0.64   0.65     0.47      0.53   0.51     0.38      0.41   0.43     0.30
3       0.58   0.58     0.38      0.45   0.41     0.29      0.33   0.33     0.21
4       0.54   0.53     0.32      0.40   0.35     0.23      0.27   0.27     0.15
5       0.51   0.50     0.27      0.37   0.30     0.19      0.23   0.23     0.12
6       0.48   0.47     0.24      0.33   0.26     0.16      0.20   0.20     0.09
7       0.46   0.45     0.22      0.31   0.23     0.13      0.17   0.17     0.07
8       0.45   0.43     0.20      0.29   0.21     0.12      0.15   0.15     0.06
9       0.43   0.42     0.18      0.27   0.19     0.10      0.13   0.14     0.05
10      0.42   0.40     0.17      0.25   0.17     0.09      0.12   0.12     0.04
11      0.41   0.39     0.16      0.24   0.16     0.08      0.11   0.11     0.04

Notes: true: the 'true' probability in the Monte Carlo experiment. nonpar.: the estimated probabilities obtained from using the non-parametric distributions. normal: the estimated probabilities obtained from assuming normality.
distribution (specifically, the same distributions used in Section 5.2), earnings paths for black and white married individuals, initially 30 years of age, with 8, 12 and 16 years of education were generated. From the simulated earnings paths, the ‘true’ transition probabilities were calculated. Next, the densities of ξ and η were estimated and the methods of Section 4 were used to estimate the transition probabilities. In this experiment, the parameters of the model (α, β, γ and ρ) were held fixed and assumed known, for the same reason discussed in Section 5.2. The results of this experiment are in Table 7. The table presents the probability that the specified individual has log annual earnings of 8.2 or less for θ time periods, conditional on initially having log annual earnings of 8.2 or less. This level of earnings was chosen since it is the 20th percentile in the PSID sample used in this paper. The ‘true’ probabilities from the generated data are listed, as are the estimated probabilities obtained by assuming normality and
using the non-parametric densities. The probabilities obtained from non-parametric densities are considerably closer to the truth than the results obtained by assuming normality. The probabilities calculated from the non-parametric distributions are always within 10% of the truth. In some cases, these probabilities remain within 2% of the truth for 11 periods. The probabilities obtained from assuming normality are not even close to the truth. In the later years, the probabilities obtained from the normal distributions are less than half the true probabilities—in all cases displayed. Hence, this experiment suggests that using the non-parametric method provides more accurate results.
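The experiment just described can be summarised in a few lines of code. The sketch below is illustrative only: the parameter values, the mixture used for ξ and the normal used for η are placeholders rather than the estimates reported in the paper, and the covariate index is collapsed into a single constant.

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative placeholder values (not the paper's estimates).
alpha, rho = 0.6, 0.5
det = 3.3            # stand-in for the deterministic part x'beta + v'gamma
y_star = 8.2         # earnings threshold (20th percentile in the paper's sample)
n, T = 20000, 12

def draw_eta(size):
    return rng.normal(0.0, 0.3, size=size)

def draw_xi(size):
    # Mixture of two zero-mean normals: a simple asymmetric transitory innovation.
    comp = rng.random(size) < 0.8
    return np.where(comp, rng.normal(-0.05, 0.10, size), rng.normal(0.20, 0.25, size))

eta = draw_eta(n)
eps = sum(rho**k * draw_xi(n) for k in range(6))      # truncated start-up for eps_0, in the spirit of Appendix A
y = det / (1.0 - alpha) + eta / (1.0 - alpha) + eps   # crude steady-state start for y_0

# Condition on starting below the threshold, simulate forward and track who stays below.
below = y < y_star
still_below = below.copy()
stay_prob = []
for t in range(1, T + 1):
    eps = rho * eps + draw_xi(n)
    y = alpha * y + det + eta + eps
    still_below &= (y < y_star)
    stay_prob.append(still_below[below].mean())       # P(below for theta periods | initially below)

print(stay_prob[:3])
```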
6. CONCLUDING REMARKS

This paper uses a random-effects panel data model to estimate earnings mobility, employing a recently developed non-parametric method to estimate the distribution of the error components of the model. The results reveal that the transitory component is not normally distributed. Moreover, assuming normality of the error components leads to estimates of upward mobility that are often much higher than those obtained by using the estimated densities, and understates the effects of factors such as race and education on mobility. It was demonstrated through a Monte Carlo experiment that the results obtained by using the estimated densities are likely to be more accurate than those obtained by assuming normality.
ACKNOWLEDGEMENTS I gratefully acknowledge the support of Joel Horowitz. I also wish to thank George Deltas, Brendan Cunninghan, John Geweke, Douglas Herman and David Schmidt for their helpful suggestions. I also wish to thank two anonymous referees. This research was supported in part by NSF grant SBR-9617925. The views in this article are those of the author and do not necessarily reflect those of the Federal Trade Commission or any individual Commissioner.
REFERENCES

Arellano, M. and S. R. Bond (1991). Some tests of specification for panel data: Monte Carlo evidence and an application to employment equations. Review of Economic Studies 58, 277–97.
Bai, J. (2003). Testing parametric conditional distributions of dynamic models. The Review of Economics and Statistics 85, 531–48.
Bane, M. J. and D. T. Ellwood (1985). Slipping into and out of poverty: the dynamics of spells. The Journal of Human Resources 21, 2–23.
Blundell, R. and S. Bond (1998). Initial conditions and moment restrictions in dynamic panel data models. Journal of Econometrics 87, 115–43.
Chamberlain, G. (1984). Panel data. In Z. Griliches and M. D. Intriligator (Eds.), Handbook of Econometrics, Volume 2, 1248–318. Amsterdam: North-Holland.
Coe, R. D., G. J. Duncan and M. S. Hill (1978). Dependency and poverty in the short and long run. In G. Duncan and J. Morgan (Eds.), Five Thousand American Families: Patterns and Economic Progress, Volume 6, 323–46. Ann Arbor, MI: Institute for Social Research.
Daly, M. and R. Valletta (2003). Earnings inequality and earnings mobility in the U.S. FRBSF Economic Letter 28, Federal Reserve Bank of San Francisco.
Datcher-Loury, L. (1986). Racial differences in the stability of high earnings among young men. Journal of Labor Economics 4, 301–16. Duncan, G. J. (1984). Years of Poverty and Plenty. Ann Arbor, MI: Institute for Social Research. Freeman, R. B. (1972). Labor Economics. Englewood Cliffs, NJ: Prentice-Hall. Geweke, J. and M. Keane (2000). An empirical analysis of income dynamics among men in the PSID: 1968–1989. Journal of Econometrics 96, 293–356. Hill, M. S. (1981). Some dynamic aspects of poverty. In M. Hill, D. Hill and J. Morgan, (Eds.), Five Thousand American Families: Patterns and Economic Progress, Volume 9, 9–32. Ann Arbor, MI: Institute for Social Research. Horowitz, J. L. (1997). Semiparametric Methods in Econometrics. New York: Springer-Verlag New York, Inc. Horowitz, J. L. and M. Markatou (1996). Semiparametric estimation of regression models for panel data. Review of Economics Studies 63, 145–68. Hsiao, C. (1986). Analysis of Panel Data. Cambridge: Cambridge University Press. Levy, F. (1977). How big is the American underclass? Working paper 0090-1. The Urban Institute, Washington. Lillard, L. and R. J. Willis (1978). Dynamic aspects of earnings mobility. Econometrica 46, 985–1012. McCall, J. J. (1973). Income Mobility, Racial Discrimination, and Economic Growth. Lexington, MA: D.C. Heath and Company. Moffitt, R. and P. Gottschalk (1998). Trends in the variances of permanent and transitory earnings in the U.S. and their relation to earnings mobility. Working paper, Johns Hopkins University. Psacharopoulos, G. (1992). Returns to education: a further international update & implications. In M. Blaug, (Ed.), The Economic Value of Education, International Library of Critical Writings in Economics, Volume 17. Brookfield, VT: Ashgate. Rainwater, L. (1982). Persistent and transitory poverty: a new look. Working paper 70, Joint Studies for Urban Studies, Cambridge, MA. Robin, J. and S. Bonhomme (2004). Modeling individual earnings trajectories using copulas with an application to the study of earnings inequality: France, 1990-2002. Working paper, Universit´e de Paris 1-Panth´eon-Sorbonne (EUREQua), CREST-INSEE. Silverman, B. W. (1985). Density estimation for Statistics and Data Analysis. New York: Chapman and Hall. White, H. and G. MacDonald (1980). Some large sample tests for nonnormality of the linear regression model. Journal of the American Statistical Association 76, 419–33.
APPENDIX A: SAMPLING η AND THE INITIAL CONDITIONS ε_0 AND y_0

To motivate the process for sampling η and the initial conditions ε_0 and y_0, observe that ε_0 is a function of the independent lagged ξ's and that y_0 is a function of the same ξ's and η. Hence, write y_0 as a function of η and anterior ξ's; write ε_0 as a function of the same lagged ξ's. To write y_0 in terms of ξ's and η, note that (i subscript dropped)

$$Y_t = \alpha Y_{t-1} + X_t'\beta + V'\gamma + \eta + \varepsilon_t, \qquad (A.1)$$

and

$$\varepsilon_t = \rho\varepsilon_{t-1} + \xi_t. \qquad (A.2)$$
Repeatedly substituting (A.1) into itself yields

$$y_t = \sum_{j=0}^{\infty}\alpha^j\left(X_{t-j}'\beta + V'\gamma\right) + \frac{1}{1-\alpha}\eta + \varepsilon_t + \alpha\varepsilon_{t-1} + \alpha^2\varepsilon_{t-2} + \cdots, \qquad (A.3)$$

if |α| < 1. Repeatedly substituting (A.2) into itself gives

$$\varepsilon_t = \xi_t + \rho\xi_{t-1} + \rho^2\xi_{t-2} + \cdots. \qquad (A.4)$$

Substituting (A.4) into (A.3) results in

$$
\begin{aligned}
y_t &= \sum_{j=0}^{\infty}\alpha^j\left(X_{t-j}'\beta + V'\gamma\right) + \frac{1}{1-\alpha}\eta + \left(\xi_t + \rho\xi_{t-1} + \rho^2\xi_{t-2} + \cdots\right) + \alpha\left(\xi_{t-1} + \rho\xi_{t-2} + \rho^2\xi_{t-3} + \cdots\right) \\
&\quad + \alpha^2\left(\xi_{t-2} + \rho\xi_{t-3} + \rho^2\xi_{t-4} + \cdots\right) + \alpha^3\left(\xi_{t-3} + \rho\xi_{t-4} + \rho^2\xi_{t-5} + \cdots\right) + \cdots \\
&= \sum_{j=0}^{\infty}\alpha^j\left(X_{t-j}'\beta + V'\gamma\right) + \frac{1}{1-\alpha}\eta + \xi_t + (\alpha+\rho)\xi_{t-1} + (\rho^2+\alpha\rho+\alpha^2)\xi_{t-2} + \cdots.
\end{aligned}
$$

Note that, for sufficiently small α and ρ,

$$y_t \approx A_t + \frac{1}{1-\alpha}\eta + \xi_t + c_1\xi_{t-1} + c_2\xi_{t-2} + c_3\xi_{t-3} + c_4\xi_{t-4} + c_5\xi_{t-5}, \qquad (A.5)$$

with

$$A_t = \sum_{j=1}^{5}\alpha^j\left(X_{t,j-1}'\beta + V'\gamma\right)$$

and

c_1 = (ρ + α)
c_2 = (ρ² + αρ + α²)
c_3 = (ρ³ + ρ²α + ρα² + α³)
c_4 = (ρ⁴ + ρ³α + ρ²α² + ρα³ + α⁴)
c_5 = (ρ⁵ + ρ⁴α + ρ³α² + ρ²α³ + ρα⁴ + α⁵).

Thus, y_0 may be approximated by a function of η, ξ_0, ξ_{−1}, ξ_{−2}, ξ_{−3}, ξ_{−4}, ξ_{−5}, and exogenous variables. The ε_0 is also a function of these same variables, by equation (A.4). The initial conditions may be sampled by the following steps:

1. Generate y_0 according to equation (A.5) by sampling η and the five ξ's from their estimated empirical distributions.
2. When conditioning on y_0 < y*, proceed to step 3 only if the y_0 resulting from step 1 is less than y*. Otherwise repeat step 1.
3. Generate the corresponding ε_0's by using equation (A.4) and the same ξ's as in step 1.
4. Use the resulting data in the simulation used to calculate (3.4).

This algorithm quickly generates a set of y_0, η and ε_0.
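As a rough illustration of steps 1–4, the sketch below implements the accept/reject sampler for (y_0, ε_0) under equations (A.4)–(A.5). All numerical values (α, ρ, the deterministic part A_t and the draws for η and ξ) are placeholders; in the application they would come from the estimated parameters and the estimated empirical distributions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Placeholder values standing in for the estimated quantities in the text.
alpha, rho = 0.6, 0.5
A0 = 8.0          # the deterministic part A_t of (A.5), treated as known here
y_star = 8.2      # conditioning threshold (the 20th-percentile cut-off used in the paper)

def draw_eta(size):          # stand-in for draws from the estimated density of eta
    return rng.normal(0.0, 0.3, size=size)

def draw_xi(size):           # stand-in for draws from the estimated density of xi
    return rng.normal(0.0, 0.15, size=size)

# Coefficients c_0,...,c_5 of (A.5): c_k = sum_{m=0}^{k} rho^(k-m) * alpha^m, with c_0 = 1.
c = [sum(rho**(k - m) * alpha**m for m in range(k + 1)) for k in range(6)]

def sample_initial_conditions(n_draws):
    """Accept/reject sampler for (y_0, eps_0) conditional on y_0 < y_star (steps 1-3)."""
    out_y0, out_eps0 = [], []
    while len(out_y0) < n_draws:
        eta = draw_eta(1)[0]
        xi = draw_xi(6)                                   # xi_0, xi_{-1}, ..., xi_{-5}
        y0 = A0 + eta / (1.0 - alpha) + sum(c[k] * xi[k] for k in range(6))   # eq. (A.5)
        if y0 < y_star:                                   # step 2: keep only draws below the threshold
            eps0 = sum(rho**k * xi[k] for k in range(6))  # eq. (A.4), truncated at five lags
            out_y0.append(y0)
            out_eps0.append(eps0)
    return np.array(out_y0), np.array(out_eps0)

y0, eps0 = sample_initial_conditions(1000)
```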
APPENDIX B: IMPLEMENTING THE NON-PARAMETRIC DENSITY ESTIMATORS AND CHOOSING THE TUNING PARAMETERS

This appendix discusses the tuning parameters used to estimate the densities of ξ and η, since suggestions for many of these choices are not presented in H&M or Horowitz (1997). 10 It is necessary to introduce some notation to discuss the tuning parameters. The density of ξ may be estimated via the extensions in H&M (p. 162, section 6b) and in Horowitz (1997, pp. 135–136):

$$\hat{f}_\xi(z) = \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-jzt}\,\hat{\phi}_\xi(t)\,g(\lambda_\xi t)\,dt, \qquad (B.1)$$

where $\hat{f}_\xi$ is the density; $j = \sqrt{-1}$; $\hat{\phi}_\xi(t)$ is the estimated characteristic function (CF) of ξ; g(·), a smoothing function, is a CF with support [−1, 1]; and λ_ξ is a bandwidth. H&M show that $\hat{\phi}_\xi(t)$ may be estimated by

$$\hat{\phi}_\xi(t) = |\hat{\phi}_{\Delta\xi}(t)|^{1/2}\exp[j\hat{\omega}(t)],$$

where $\hat{\omega}(t)$ is an estimate of the argument or phase of the complex variable $\phi_\xi(t)$, and $\hat{\phi}_{\Delta\xi}(t)$ is the estimated CF of $\xi_{it} - \xi_{i,t-1}$. Evaluating $\hat{\phi}_{\Delta\xi}(t)$ is straightforward, given residuals $\hat{\xi}_{it} - \hat{\xi}_{i,t-1}$ (obtainable from equation (3.6)). Estimating $\hat{\omega}(t)$ is more complicated. H&M suggest obtaining it by estimating a power series approximation

$$\hat{\omega}(\tau) = \sum_{i=1}^{K_n} \hat{a}_i \tau^i,$$

for some $K_n \to \infty$ as $n \to \infty$. The $\hat{a}_i$'s are obtained as follows: Let $\hat{q}$ be the empirical CF of $(\xi_{i2}-\xi_{i1}), (\xi_{i3}-\xi_{i1}), \ldots, (\xi_{iT}-\xi_{i1})$, for some T. That is,

$$\hat{q}(\tau_2,\ldots,\tau_T) = \frac{1}{n}\sum_{i=1}^{n}\exp\{j[\tau_2(\hat{\xi}_{i2}-\hat{\xi}_{i1}) + \cdots + \tau_T(\hat{\xi}_{iT}-\hat{\xi}_{i1})]\}.$$

Then, the $\hat{a}_i$'s can be obtained by the OLS regression of

$$\mathrm{Im}\{\log[\hat{q}(\tau,\ldots,\tau)/|\hat{q}(\tau,\ldots,\tau)|]\} = \sum_{i=2}^{K_n} a_i\left[(T-1) - (T-1)^i\right]\tau^i. \qquad (B.2)$$
Once φˆ ξ (t) is obtained, it is relatively straightforward to obtain the empirical CF’s of ε and η and their corresponding densities. Methods for estimating these are presented in the H&M and Horowitz papers. The smoothing function g and several tuning parameters needed to be selected: λ ξ , T , K n , τ min and τ max , where τ min and τ max are the minimum and maximum values, respectively, of τ in (B.2). Finally, a bandwidth λ η needs to be selected to estimate the density of η. I now discuss each of these choices; none of them are obvious, and future work would clearly benefit by more rigorous selections. In this paper, the smoothing function g was chosen to be the same as in H&M (p. 156), that is, the fourfold convolution of the uniform distribution with itself (this CF has corresponding density of c[sin(x)/x]4 , where c is a normalization constant). The bandwidths were chosen in the informal graphical manner put forth by H&M (p. 156, last paragraph). Specifically, my estimated φˆ ξ (t) is nearly zero for |t| > 25, except for a few wiggles. Hence, λ ξ was chosen to set the integrand in (B.1) equal to zero for values of t beyond that point. Since g(t) = 0 for |t| > 1, λ ξ = 0.04. (The bandwidth used in estimating η was chosen in the same manner; λ η = .04.)
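The inversion step in (B.1) is straightforward to code once an estimate of the characteristic function is available. The sketch below takes direct draws of ξ as given (so it sidesteps the phase-estimation regression (B.2) and the |·|^{1/2} construction) and uses a simple polynomial taper in place of the fourfold-convolution smoothing function; both simplifications are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data playing the role of draws of xi (an asymmetric mixture, as in the experiments).
xi_sample = np.concatenate([rng.normal(-0.05, 0.10, 2000), rng.normal(0.20, 0.25, 500)])

def g(t):
    """Smoothing function: a simple CF-like taper supported on [-1, 1]
    (a stand-in for the fourfold-convolution choice used in the paper)."""
    return np.where(np.abs(t) <= 1.0, (1.0 - t**2) ** 3, 0.0)

def density_by_cf_inversion(z_grid, sample, lam):
    """Fourier inversion of the empirical characteristic function, as in (B.1),
    with the empirical CF of the sample standing in for phi_hat_xi."""
    t_grid = np.linspace(-1.0 / lam, 1.0 / lam, 1001)   # g(lam * t) = 0 outside this range
    dt = t_grid[1] - t_grid[0]
    phi_hat = np.exp(1j * np.outer(t_grid, sample)).mean(axis=1)          # empirical CF
    integrand = np.exp(-1j * np.outer(z_grid, t_grid)) * (phi_hat * g(lam * t_grid))
    return (integrand.sum(axis=1) * dt).real / (2.0 * np.pi)

z = np.linspace(-0.5, 0.65, 200)
f_xi = density_by_cf_inversion(z, xi_sample, lam=0.04)
```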
10 The density f_ε is not estimated, since, as shown in Appendix A, the initial conditions used in calculating the earnings mobility probabilities can be written as a function of ξ (rather than ε) and since, given these initial conditions, only ξ is required to simulate earnings for subsequent time periods.
Figure 6. Fitted regression of power series expansion in (B.2). (Horizontal axis: τ.)
Another tuning parameter is K_n, the number of terms in the power series expansion in (B.2). Experimentation in fitting known mixture distributions suggested that K_n has a limited effect on the ability to estimate the density, provided it is not too small; K_n of 10 or more usually provided a good fit. In estimating the PSID data, I set K_n = 25. Small changes in the parameter did not substantially affect results. Similarly, since the data contain unbalanced panels, T had to be chosen. I set T = 5, leaving the bulk of individuals in the sample. Slight variations in T did not seem to matter much. Less obvious parameter choices arise when estimating (B.2). Specifically, it is unclear how to choose τ_min and τ_max. Let D be the interval [τ_min, τ_max]. Experimentation using mixed distributions suggested that the choice of D has a great impact on the ability of the deconvolution estimator to capture asymmetry in ξ. A plot of the LHS of (B.2) against τ is smooth for τ near zero but chaotic for larger τ (see Figure 6, in which pluses show the actual values of the LHS of (B.2), i.e. Im{log[q̂(τ, …, τ)/|q̂(τ, …, τ)|]}, for each τ; the solid line represents fitted values). This phenomenon induced a tradeoff in selecting the width of D: In the experimentation, an insufficiently wide D captured too little information about the overall shape of (B.2) to give accurate estimates of the known density. However, for too wide a D, the chaotic pattern at large τ made it impossible for a regression to capture the shape of (B.2). Using larger K_n did not help, due to the extreme non-linearity of the function at large τ. In the experimentation, the best estimates of the known density were obtained when D was chosen to be as large as possible, while still giving the appearance of a good fit in the regression (B.2), in the region where the function is relatively smooth (e.g. roughly τ ∈ [−9, 9] in Figure 6). Hence, when estimating the PSID data, I chose D by repeatedly estimating (B.2), slightly widening D each time, and stopping when the predicted values of (B.2) appeared to become poor in the relatively smooth area. That is, I used the largest D that visually gave a good fit; D = [−16, 16]. (This interval is smaller than that over which the empirical CF of ξ was evaluated, as determined by the smoothing function g and the bandwidth λ_ξ.) Figure 6 plots the LHS of (B.2) and the corresponding fitted values against τ over this range. The tails of the estimated distributions of η and ξ are wiggly and sometimes negative, and therefore not useful in estimating probabilities. To cope with this problem, I assigned area to the tails of the estimated
distributions. Since the density of η did not appear non-normal, I assigned area to its tails (|η| > 3) by using the normal distribution having the same mean and variance as in the data. For the tails of ξ (i.e. ξ outside [−0.5, 0.65]), I used the mixture of two normals with the first 5 moments equal to those found in the actual data. To compensate for the fact that the resulting densities did not integrate to one, I normalized them. Relatively little area is found in the tails, so this solution should cause minimal distortion. (The density of ε was not used in the simulations.) Finally, estimating the densities can be computationally expensive. Therefore, in simulating the mobility probabilities, rather than estimating the corresponding density each time an error component needed to be drawn, I made look-up tables. I divided the domain over which I estimated the densities (i.e. η ∈ [−3, 3] and ξ ∈ [−0.5, 0.65]) into bins of width 0.012 and estimated the density at the midpoint of each bin. The value at the midpoint served as the value of the density over the entire bin. Similarly, I divided the region in which I assumed a mixture of normals into bins.
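A minimal version of the look-up-table device is sketched below; the placeholder density stands in for the deconvolution estimate, and the bin width matches the 0.012 used above.

```python
import numpy as np

def expensive_density_estimator(x):
    # Placeholder for the deconvolution estimate f_hat_xi; any density works for illustration.
    return np.exp(-0.5 * (x / 0.15) ** 2) / (0.15 * np.sqrt(2 * np.pi))

# Pre-tabulate the density on a fixed grid so that repeated evaluations in the
# mobility simulations only need a cheap bin look-up.
lo, hi, width = -0.5, 0.65, 0.012
edges = np.arange(lo, hi + width, width)
midpoints = 0.5 * (edges[:-1] + edges[1:])
table = expensive_density_estimator(midpoints)   # evaluated once, at bin midpoints

def density_lookup(x):
    """Return the tabulated density value for the bin containing x."""
    idx = np.clip(((np.asarray(x) - lo) // width).astype(int), 0, len(table) - 1)
    return table[idx]

print(density_lookup(0.1))
```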
Econometrics Journal (2008), volume 11, pp. 499–516. doi: 10.1111/j.1368-423X.2008.00256.x
Heterogeneity, state dependence and health
TIMOTHY J. HALLIDAY†,‡
†
‡
Department of Economics and John A. Burns School of Medicine, University of Hawaii at Mānoa, 2424 Maile Way, Saunders Hall 533, Honolulu, HI 96822, USA
Institute for the Study of Labor (IZA), Schaumburg-Lippe-str 5-9, D-53113 Bonn, Germany E-mail:
[email protected] First version received: November 2006; final version accepted: April 2008
Summary We investigate the evolution of health over the life-cycle. We allow for two sources of persistence: unobserved heterogeneity and state dependence. Estimation indicates that there is a large degree of heterogeneity. For half the population, there are modest degrees of state dependence. For the other half of the population, the degree of state dependence is near unity. However, this may be the result of a high frequency of people in our data who never exit healthy states, potentially resulting in a failure to pin down the state dependence parameter for this segment of the population. We conclude that individual characteristics that trace back to early adulthood and before can have far reaching effects on health. Keywords: Dynamic panel Data models, Gradient, Health.
1. INTRODUCTION We explore the dynamics of health and, in doing so, concern ourselves with two tasks. First, we aim to gain a better understanding of how to model the evolution of health. While many empirical studies have investigated the dynamics of both the level of earnings (Lillard and Willis, 1978, and Abowd and Card, 1989) and, more recently, the variance of earnings (Meghir and Pistaferri, 2004), few have investigated the dynamics of health. 1 As health status becomes a more common state variable in structural models, it is becoming increasingly more important that researchers arrive at a better understanding of its dynamics. 2 Second, we quantify the relative contributions of unobserved heterogeneity and state dependence in the determination of health. Doing so is important as this will have implications for health policy. Utilizing data on Self-Reported Health Status (SRHS) from the Panel Study of Income Dynamics (PSID), we observe that health is highly persistent. The first order auto-correlation of a dummy variable indicating bad health is 0.5661 and 0.5643 for men and women, respectively. While these correlations do indicate a high degree of persistence, they are not informative of the underlying stochastic properties of the health process. 1
Contoyannis et al. (2004a,b) are notable exceptions. For examples of structural models using health as a state variable, see Rust and Phelan (1997), French (2005) and Arcidiacono et al. (2007). 2
To gain additional insight, we model the evolution of health over the life-cycle as a first order Markov process which allows for two sources of persistence. The first is unobserved heterogeneity or the (unobserved) ability to cope with health shocks. The second is state dependence or the degree to which the ability to cope with a shock depends on health status. Estimation will shed light on the relative contributions of both of these sources of persistence. The balance of this paper is organized as follows. Section 2 describes the data. In Section 3, we set up our model. In Section 4, we describe our estimation procedure. Section 5 discusses our findings. Finally, in Section 6, we conclude and discuss the relevance of our findings for health policy.
2. DATA We use data from the PSID spanning the years 1984–1997. The variables that we employ are SRHS, age and gender. The SRHS question was only asked of heads of household and their spouses and, thus, our sample is restricted to these individuals. We do not employ data prior to 1984 since the SRHS question was not asked in these years. The PSID contains an over-sample of low-income families called the Survey of Economic Opportunity (SEO). Because the sample was chosen based on income, we follow Lillard and Willis (1978) and drop it due to endogenous selection. SRHS is a five-point categorical variable that measures the respondent’s assessment of their own health. One is excellent and five is poor. While these data are subjective measures, there is an extensive literature that has shown a strong link between SRHS and more objective health outcomes such as mortality and the prevalence of disease (Mossey and Shapiro, 1982, Kaplan and Camacho, 1983, Idler and Kasl, 1995 and Smith, 2003). 3 To lower the number of parameters that we estimate, we map reports of fair or poor health into unity and all others into zero. We restrict our sample to individuals between ages 22 and 60. We do not include people younger than age 22 because there are not that many household heads younger than this age. We do not include people older than age 60 to mitigate any possible bias resulting from attrition due to mortality. We drop individuals whose age declines or increases by more than two years across successive survey years. Finally, we restrict our sample to white men and women. Table 1 reports the descriptive statistics from the resulting sample.
3. THE EMPIRICAL MODEL

We let $h_{i,t} \in \{0, 1\}$ denote the health of individual i at age t. When $h_{i,t} = 1$ then the individual is ‘ill’ and when $h_{i,t} = 0$ she is ‘well’. Health evolves according to the following process:

$$h_{i,t} = 1\left(\alpha_i + \gamma_i h_{i,t-1} + \rho_i' T + \varepsilon_{i,t} \ge 0\right), \qquad (3.1)$$
3 Many objective health measures are not without their limitations. For example, self-reports of specific morbidities such as diabetes or cancer are often inaccurate since many people are unaware that they even have these conditions due to low consumption of medical services. In addition, these measures typically do not account for the severity of the condition.
Table 1. Descriptive statistics.

                        Mean     25% Quantile    75% Quantile    Standard deviation
Women
  SRHS (5-point)        2.22     1               3               0.99
  SRHS (2-point)        0.10     0               0               0.30
  Age                   39.10    31              46              9.82
  Panel duration*       8.21     4               14              4.45
  N** = 4186
Men
  SRHS (5-point)        2.10     1               3               0.98
  SRHS (2-point)        0.08     0               0               0.27
  Age                   39.34    32              46              9.56
  Panel duration*       8.44     4               14              4.46
  N** = 3923

Notes: * Panel duration refers to the length of time that the individual was in the panel. ** N is the number of individual observations, not individual-time observations.
Table 2. AIC for index selection.

                                                          A = 2                 A = 3
Model           Aging function   ρ Hetero?   γ Hetero?   Men       Women       Men       Women
Linear model    Linear           No          Yes         6163.8    7564.5      6084.9    7453.9
Homogeneous γ   Quad             No          No          6164.5    7564.4      6083.8    7451.9
Homo Quad       Quad             No          Yes         6163.1*   7563.2*     6083.5*   7452.9
Hetero Quad     Quad             Yes         Yes         6163.9    7564.7      6085.6    7450.9*

Note: * Denotes the model with the lowest AIC.
where $T = [t, t^2]'$. 4 The residual in the model represents idiosyncratic risk or ‘health shocks’ such as accident occurrence, disease onset or exposure to bacteria. We assume that $\varepsilon_{i,t}$ is independent of $(\alpha_i, \gamma_i, \rho_i, h_{i,0})$ and that it is distributed i.i.d. across time with a logistic distribution. These assumptions imply that

$$P(h_{i,t} = 1 \mid h_{i,t-1}, \ldots, h_{i,0}, \theta_i) = \frac{\exp(\theta_i' Z_{i,t-1})}{1 + \exp(\theta_i' Z_{i,t-1})}, \qquad (3.2)$$

where $\theta_i \equiv (\alpha_i, \gamma_i, \rho_i')'$ and $Z_{i,t-1} = (1, h_{i,t-1}, T')'$.

4 While we acknowledge that a thorough understanding of the linkages between income and health is of vital importance to policy makers, we do not incorporate income into the analysis as doing so would involve much more than simply including income as a strictly exogenous explanatory variable in equation (3.1). To include income in the analysis, we would have to model income as a predetermined or endogenous variable. This would have made the exercise substantially more complicated.
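For concreteness, the following sketch simulates health histories from (3.1)–(3.2). The parameter values, the initial-condition probability and the age scaling are illustrative assumptions, not the estimates reported later in the paper.

```python
import numpy as np

rng = np.random.default_rng(3)

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def simulate_health(alpha, gamma, rho1, rho2, p0, ages, n_people):
    """Simulate h_{i,t} from the logit transition (3.2).
    p0 = P(well at the first age); the age variable enters linearly and squared."""
    h = np.empty((n_people, len(ages)), dtype=int)
    h[:, 0] = (rng.random(n_people) > p0).astype(int)
    for s in range(1, len(ages)):
        t = ages[s]
        index = alpha + gamma * h[:, s - 1] + rho1 * t + rho2 * t**2
        h[:, s] = (rng.random(n_people) < logistic(index)).astype(int)
    return h

ages = np.arange(22, 61)
h = simulate_health(alpha=-4.0, gamma=1.0, rho1=0.01, rho2=0.0008,
                    p0=0.95, ages=ages, n_people=5000)
print(h.mean(axis=0)[:5])   # fraction ill at the first few ages
```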
Table 3. AIC for selection of the number of support points.

Points of support    Men         Women
A = 1                11,798.0    13,631.0
A = 2                6163.1      7563.2
A = 3                6083.5      7452.9
A = 4                6062.9      7422.3

Note: The homogeneous quadratic model was employed in the estimation.
Table 4. Parameter estimates for preferred model – men.

         Type 1      Type 2      Type 3      Type 4
α_a      −8.0789     −5.6632     −3.7868     −1.9916
         (0.7920)    (0.7537)    (0.7400)    (0.7238)
γ_a      9.9901      0.8776      0.9597      0.8335
         (0.9093)    (0.3874)    (0.1915)    (0.2075)
ρ_1      0.3598      0.3598      0.3598      0.3598
         (0.3657)    (0.3657)    (0.3657)    (0.3657)
ρ_2      3.9879      3.9879      3.9879      3.9879
         (4.3498)    (4.3498)    (4.3498)    (4.3498)
p_a      0.9907      0.9578      0.8757      0.9999
         (0.0055)    (0.0655)    (0.1387)    (0.0001)
π_a      0.5353      0.2721      0.1296      0.0630

Note: standard errors in parentheses.
The model has three other key aspects. First, ρ i models aging and, thus, allows the effects of health shocks to increase with age. Within the context of the Grossman model of health investment (Grossman, 1972), these coefficients can be interpreted as the rate at which the health capital stock depreciates. Second, γ i models state dependence or the notion that the ability to cope with a given shock will depend on health status. To give a concrete (albeit extreme) example, exposure to a flu virus is more likely to affect a person’s health if she is HIV positive than if she is HIV negative. 5 Third, the model allows for a large degree of heterogeneity by allowing all of the elements of θ i to vary across individuals. Unobserved heterogeneity models an individual’s ability to resist health shocks. Finally, it is important to point out that, while this discussion provides a motivation for our model that is rooted in epidemiology, there are economic motivations which we describe below.
5 It is important to contrast our model with an obvious alternative formulation in which health is determined by a continuous index given by H i,t which follows an AR(1) process and agents report ill health when H i,t is beyond some threshold, i.e. h i,t = 1(H i,t ≥ 0). While this alternative model does allow health shocks to have persistent effects, it does not allow for state dependence. In other words, in this model, the effects of a shock on future health outcomes are not conditioned by the agent’s current health status.
Table 5. Parameter estimates for preferred model – women.

         Type 1      Type 2      Type 3      Type 4
α_a      −6.8666     −5.5826     −3.1537     −1.4090
         (0.7049)    (0.6721)    (0.6584)    (0.6586)
γ_a      9.5925      0.7514      0.8067      0.8779
         (1.0865)    (0.5424)    (0.1340)    (0.2033)
ρ_1      0.2494      0.2494      0.2494      0.2494
         (0.3267)    (0.3267)    (0.3267)    (0.3267)
ρ_2      4.4630      4.4630      4.4630      4.4630
         (3.8702)    (3.8702)    (3.8702)    (3.8702)
p_a      0.9997      0.9587      0.8874      0.7432
         (0.0054)    (0.0285)    (0.0868)    (0.2245)
π_a      0.4093      0.3495      0.1804      0.0608

Note: standard errors in parentheses.
3.1. A reduced form model of health investment

Our model can be viewed as a reduced form model of health investment. Suppose that agents live until age T with certainty and derive utility from a consumption good denoted by $c_{i,t}$. Utility in a given period depends on the health state à la Viscusi and Evans (1990) and is denoted by $u(c_{i,t}, h_{i,t})$. The agent's expected lifetime utility is then $E_0\left(\sum_{t=0}^{T}\beta^t u(c_{i,t}, h_{i,t})\right)$. The health state is (partly) the consequence of an endogenous investment decision, $i_{i,t} \in \{0, 1\}$:

$$h_{i,t} = 1\left(g_{i,t}(i_{i,t-1}) + \varepsilon_{i,t} \ge 0\right), \qquad (3.3)$$

where $g_{i,t}(i_{i,t})$ is an individual-specific return to health investment with the property that $g_{i,t}(0) > g_{i,t}(1)$. Income is given by $y_{i,t}$ and investment imposes pecuniary costs of the form $\lambda_{i,1} h_{i,t} + \lambda_{i,0}(1 - h_{i,t})$ with $\lambda_{i,1} > \lambda_{i,0}$. Assuming no storage, the individual's budget constraint will be given by

$$c_{i,t} + i_{i,t}\left[\lambda_{i,1} h_{i,t} + \lambda_{i,0}(1 - h_{i,t})\right] \le y_{i,t}. \qquad (3.4)$$
In this simple set-up, health will be a dynamic process similar to equation (3.1) because investment in equation (3.3) will depend on health status in the previous period due to state-dependent utility and investment costs. Consequently, a positive degree of state dependence might indicate that health investment is less likely when people are ill.

3.2. An exogenous state variable

Our model can be viewed as an exogenous state variable in a life-cycle consumption model. Many recent investigations into life-cycle consumer behaviour such as Arcidiacono et al. (2007), French (2005) and Rust and Phelan (1997) have incorporated exogenous uncertainty over health states. Our investigation will provide additional insights into how this uncertainty should be modelled. Proper modelling is crucial for the conclusions of these models to be valid. Indeed, Deaton (1992) provides a discussion of how different income processes can lead to radically different
Figure 1. Health Transition Profiles - Type 1 Men.
consumption behaviours and, thus, demonstrates the sensitivity of the outcomes of economic models to their underlying assumptions.

3.3. An analogy to state dependence in labour market outcomes

It is important to point out the relationship between state dependence in health and labour market outcomes. As discussed by Hyslop (1999), many sources of state dependence in labour force participation have been cited including intertemporally non-separable preferences for leisure (Hotz et al., 1988) and search costs which depend on participation states (Eckstein and Wolpin, 1990). However, regardless of the underlying source, understanding the magnitude of state dependence in labour force participation will have policy implications since this tells us about the effectiveness of policies that alleviate short-term unemployment. Similarly, the magnitude of state dependence in health will be informative of the relative importance of unobserved individual characteristics vis-à-vis idiosyncratic health shocks. To the extent that the effects of these shocks can be mitigated by improvements in health care and its delivery, understanding the magnitude of state dependence in health will have implications for many health policy debates. In both the cases of labour and health economics, the statistical properties of the data will contain information that is pertinent to the conduct of policy.
Figure 2. Health Transition Profiles - Type 2 Men.
4. MAXIMUM LIKELIHOOD ESTIMATION

We estimate the model in equation (3.1) using a maximum likelihood estimation (MLE) procedure which has been discussed in Heckman (1981a, b). Individual i (i = 1, …, N) experiences $h_{i,t}$ at time $t \in \{0, \ldots, T_i\}$. However, the econometrician only observes $h_{i,t}$ for $t \in \{\tau_i, \ldots, T_i\}$ where $\tau_i \ge 0$. This causes an initial conditions problem. The procedure that we use accounts for this. We now construct the likelihood function. The likelihood of a sequence of health outcomes conditional on $(\theta_i, h_{i,\tau_i})$ for individual i for $t = \tau_i, \ldots, T_i$ is given by

$$P(h_{i,T_i}, \ldots, h_{i,\tau_i+1} \mid h_{i,\tau_i}, \theta_i) = \prod_{t=\tau_i+1}^{T_i} \Lambda\big(\theta_i' Z_{i,t-1}(2h_{i,t}-1)\big), \qquad (4.1)$$

where $\Lambda(\cdot)$ denotes the logistic c.d.f. implied by (3.2).
We assume that the heterogeneity vector has a discrete support where it can take on one of A values so that $\theta_i \in \{\theta_1, \ldots, \theta_A\}$. The probability weight that is associated with each point of support is $\pi_a$. Our approach is the same as Deb and Trivedi (1997) in that we assume that the population is drawn from a finite number of distinct classes corresponding to varying degrees
Figure 3. Health Transition Profiles - Type 3 Men.
of latent health. 6 Let $P_{\tau_i}(h_{i,\tau_i} \mid \theta_a)$ denote the probability of the first observation conditional on $\theta_i = \theta_a$. We can now obtain the unconditional likelihood via

$$P(h_{i,T_i}, \ldots, h_{i,\tau_i}) = \sum_{a=1}^{A} P(h_{i,T_i}, \ldots, h_{i,\tau_i} \mid \theta_a)\,\pi_a = \sum_{a=1}^{A} \left[\prod_{t=\tau_i+1}^{T_i} \Lambda\big(\theta_a' Z_{i,t-1}(2h_{i,t}-1)\big)\right] P_{\tau_i}(h_{i,\tau_i} \mid \theta_a)\,\pi_a. \qquad (4.2)$$
Summing over the heterogeneity addresses the incidental parameters problem (Neyman and Scott, 1948). Our model implies a recursive definition for Pτi (hi,τi |θa ). To compute this, we let the probability of being well in t = 0 conditional on θ a be given by p a ≡ P 0 (h i,0 = 0|θ a ). The
6 This approach is also similar to Heckman and Singer (1984) who use a discrete distribution to approximate the distribution of unobserved heterogeneity.
Figure 4. Health Transition Profiles - Type 4 Men.
probability of observing $h_{i,t}$ conditional on $\theta_a$ in any subsequent period is then given by

$$P_t(h_{i,t} \mid \theta_a) = \sum_{d_{t-1}=0}^{1} P_t(h_{i,t} \mid h_{i,t-1} = d_{t-1}, \theta_a)\,P_{t-1}(h_{i,t-1} = d_{t-1} \mid \theta_a) = \sum_{d_{t-1}=0}^{1} \Lambda\big((\alpha_a + \gamma_a d_{t-1} + \rho_a' T)(2h_{i,t}-1)\big)\,P_{t-1}(h_{i,t-1} = d_{t-1} \mid \theta_a). \qquad (4.3)$$
Substituting, we get

$$P_t(h_{i,t} \mid \theta_a) = \sum_{d_{t-1}=0}^{1} \Lambda\big((\alpha_a + \gamma_a d_{t-1} + \rho_a' T)(2h_{i,t}-1)\big) \times \sum_{d_{t-2}=0}^{1} \Lambda\big((\alpha_a + \gamma_a d_{t-2} + \rho_a'(T-1))(2d_{t-1}-1)\big)\,P_{t-2}(h_{i,t-2} = d_{t-2} \mid \theta_a). \qquad (4.4)$$
Using the above formulation, we can calculate $P_{\tau_i}(h_{i,\tau_i} \mid \theta_a)$. 7 Of course, this is a burdensome task if $\tau_i$ is large since computation will involve calculating the sum of the probabilities of 7 Heckman (1981a) proposes using this method which involves using the underlying statistical model to calculate $P_{\tau_i}(h_{i,\tau_i} \mid \theta_a)$ which can in turn be used to calculate $P(h_{i,T_i}, \ldots, h_{i,\tau_i})$. This procedure addresses the initial condition
Figure 5. State Dependence - Type 1 Men.
all possible sequences of health outcomes that could have led to $h_{i,\tau_i}$. Fortunately, the above recursive definition simplifies matters greatly. We now obtain the likelihood function:

$$L(\beta) = \sum_{i=1}^{N} \log\left(\sum_{a=1}^{A} \left[\prod_{t=\tau_i+1}^{T_i} \Lambda\big(\theta_a' Z_{i,t-1}(2h_{i,t}-1)\big)\right] P_{\tau_i}(h_{i,\tau_i} \mid \theta_a)\,\pi_a\right), \qquad (4.5)$$

where $\beta \equiv (\theta_1', \ldots, \theta_A', \pi_1, \ldots, \pi_{A-1}, p_1, \ldots, p_A)'$ and has dimension 7A − 1. The likelihood function was maximized using the Fletcher–Powell algorithm, a variant of Newton's method,
problem that occurs when the stochastic process has been running prior to τ_i. Since our underlying statistical model does not have any time varying regressors, we do not need to concern ourselves with the distribution of the time varying regressors for t < τ_i. However, in the presence of time varying regressors, auxiliary distributional assumptions must be made. In addition, the computations become rather involved. An alternative to this is provided by Wooldridge (2005) who proposes modelling the distribution of the heterogeneity conditional on $h_{i,\tau_i}$ and any time varying regressors that may be present. Doing this does not require internal consistency with the underlying statistical model nor does it require computations that are as involved as the previous method, but it does require additional distributional assumptions. A third solution to the initial conditions problem assumes that the process has been running sufficiently long prior to the sampling period and that the process is in equilibrium. It then uses the stationary distribution for the process as the probability of the first observation. However, this will not work in our case as health is a non-stationary process.
Figure 6. State Dependence - Type 2 Men.
which only requires the computation of the gradient vector. To save on computation time, we calculated analytical gradients. 8 When the number of support points for the mixing distribution exceeds two, estimating $\pi_a$ directly will often result in some trivial probabilities so that the number of support points effectively collapses to two or (sometimes) three. To avoid this, we follow Arcidiacono and Jones (2003) and note that the MLE of $\pi_a$ is given by

$$\pi_a = \frac{1}{N}\sum_{i=1}^{N}\frac{f_{i,a}}{q_i}, \qquad (4.6)$$

where $f_{i,a} \equiv \prod_{t=\tau_i+1}^{T_i} \Lambda\big(\theta_a' Z_{i,t-1}(2h_{i,t}-1)\big)\,P_{\tau_i}(h_{i,\tau_i} \mid \theta_a)\,\pi_a$ and $q_i \equiv \sum_{a=1}^{A} f_{i,a}$. This insight suggests the following iterative strategy. First, choose a set of values for the mixing distribution probabilities and call these values $\pi_a^1$. Similarly, choose initial values for the remaining parameters, say $\Theta \equiv (\theta_1, \ldots, \theta_A, p_1, \ldots, p_A)'$, and call these values $\Theta^1$. Next, calculate the gradient with respect to $\Theta$ using the
8 All computer programs and data used are available upon request from the author.
Figure 7. State Dependence - Type 3 Men.
probabilities $\pi_a^1$ and $\Theta^1$ and iterate to get $\Theta^2$. Then, evaluate equation (4.1) using $\pi_a^1$ and $\Theta^1$ to obtain $\pi_a^2$. Repeat the process. 9
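The recursion (4.3) and the mixture likelihood (4.5) are easy to compute in practice. The sketch below evaluates the log-likelihood for a given set of type-specific parameters; the age index and the toy data are illustrative assumptions, and maximisation (e.g. by Fletcher–Powell or any quasi-Newton routine) is left to a standard optimiser.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def age_index(theta, h_prev, t):
    """Linear index alpha + gamma*h_{t-1} + rho' T with T = (t, t^2); scaling is illustrative."""
    alpha, gamma, rho1, rho2 = theta
    return alpha + gamma * h_prev + rho1 * t + rho2 * t**2

def p_first_obs(theta, p0, t_first):
    """Recursion (4.3): marginal P(h_t = 1 | theta) built up from the t = 0 probability p0 of being well."""
    p_ill = 1.0 - p0
    for t in range(1, t_first + 1):
        p_ill = (logistic(age_index(theta, 1, t)) * p_ill
                 + logistic(age_index(theta, 0, t)) * (1.0 - p_ill))
    return p_ill

def log_likelihood(histories, first_periods, thetas, p0s, pis):
    """Finite-mixture log-likelihood (4.5); each history is the observed 0/1 path from tau onward."""
    ll = 0.0
    for h, tau in zip(histories, first_periods):
        mix = 0.0
        for theta, p0, pi in zip(thetas, p0s, pis):
            p_tau = p_first_obs(theta, p0, tau)
            contrib = p_tau if h[0] == 1 else 1.0 - p_tau   # probability of the first observation
            for s in range(1, len(h)):
                index = age_index(theta, h[s - 1], tau + s)
                contrib *= logistic(index * (2 * h[s] - 1))  # Lambda(theta'Z(2h-1)), as in (4.1)
            mix += pi * contrib
        ll += np.log(mix)
    return ll

# Tiny illustrative call with made-up data and two types.
histories = [np.array([0, 0, 1, 1]), np.array([0, 0, 0])]
first_periods = [3, 5]
thetas = [(-4.0, 1.0, 0.02, 0.0), (-2.0, 0.5, 0.02, 0.0)]
print(log_likelihood(histories, first_periods, thetas, p0s=[0.95, 0.8], pis=[0.6, 0.4]))
```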
5. ESTIMATION RESULTS 5.1. Model selection We investigate model selection along two dimensions. The first is the specification of the index inside equation (3.1) and the second is the number of support points. 10 Our model selection criterion is the Akaike Selection Criterion (AIC) which is proportional to the absolute value of
9 To verify that this procedure does, in fact, work, using two support points, we calculated the MLE using this method and using the alternative method in which the probabilities π a were estimated directly (i.e. we differentiated the likelihood function with respect to π a as well). Both procedures yielded the same estimates. 10 When testing for the number of support points, likelihood-based test statistics are inappropriate because, under the null hypothesis, one of the probabilities π a must be set to zero. This places the parameter vector at the edge of a compact set and, thus, violates the regularity conditions of MLE. Consequently, the resulting test statistic will not be χ 2 . However, as pointed out by Leroux (1992), model selection criteria do not require that the true parameter lie in the interior of a compact set and, thus, they are an appropriate means of testing for the number of support points.
Figure 8. State Dependence - Type 4 Men.
the likelihood function plus the number of estimated parameters (Amemiya, 1985). The preferred model has the lowest AIC. 5.1.1. Index. In Table 2, we report the AIC for four indices with two and three support points. The indices are defined in the table. When we only have two points of support, we see that the AIC slightly favours the homogeneous quadratic model for both men and women. When we move to three points, the AIC still favours the homogeneous quadratic model for men, but now favours the heterogeneous quadratic model for women. What is important to note, however, is that our choice of index does not alter the AIC by a large margin. 5.1.2. Support points. Table 3 reports the AIC results for the number of support points. We consider up to four points of support. We did not venture beyond four points due to computational limitations. For each value of A that we considered, we estimated the model with a homogeneous quadratic function of age. 11 We see that the AIC falls as the number of support points increases, but at a decreasing rate. In contrast to altering the index, adding support points has a dramatic effect on the AIC. The preferred model has four points of support which suggests that there is
11 When choosing the number of support points, we did not concern ourselves with the index selection for two reasons. The first is that, as indicated by Table 2, changing the index did not alter the AIC tremendously. The second is that the computations in this exercise were quite intensive. Utilizing more complicated indices, such as the heterogeneous quadratic model, would only have made it worse.
Table 6. Health sequence frequencies – men.

(h_{i,t−3}, h_{i,t−2}, h_{i,t−1}, h_{i,t})    t = 33    t = 43    t = 53
(1, 1, 1, 1)                                  14        19        42
(1, 1, 1, 0)                                  20        35        44
(1, 1, 0, 1)                                  3         7         15
(1, 0, 1, 1)                                  6         9         8
(0, 1, 1, 1)                                  20        35        44
(1, 1, 0, 0)                                  30        63        58
(1, 0, 1, 0)                                  8         15        11
(0, 1, 1, 0)                                  13        35        29
(1, 0, 0, 1)                                  4         9         9
(0, 1, 0, 1)                                  11        17        4
(0, 0, 1, 1)                                  27        61        65
(1, 0, 0, 0)                                  123       137       109
(0, 1, 0, 0)                                  97        83        59
(0, 0, 1, 0)                                  100       85        52
(0, 0, 0, 1)                                  123       137       108
(0, 0, 0, 0)                                  3649      3153      1327
a tremendous amount of heterogeneity in health. 12 The results of this table stand in contrast to results in Deb and Trivedi (1997) who find that only two points of support were necessary when estimating a model for the demand of medical care. 5.2. Health dynamics Tables 4 and 5 report the parameter estimates and their standard errors for the homogeneous quadratic model with four support points for men and women. 13 This model had the lowest AIC of all the models that we considered. 14 Each column of the tables corresponds to a separate support point which we call a ‘type’. We have defined each according to the magnitude of α a . The lowest value (i.e. most negative) of α a is defined to be ‘Type 1’ and the highest is ‘Type 4’. 15
12 Presumably, if we had continued to add support points, we would have found evidence of even more heterogeneity. However, because the improvement in the AIC shrinks with each additional support point, we conjecture that eventually adding further support points would have stopped improving the selection criterion. 13 Standard errors were calculated using the ‘sandwich’ standard errors. The gradient vector from the likelihood function was used to calculate the average of its outer product. To calculate the Hessian, we numerically differentiated the gradient vector. 14 We also calculated the AIC for a homogeneous quadratic model with a homogeneous state dependence parameter for A = 4. The model with the heterogeneous state dependence parameter was preferred. 15 It is important to emphasize that the probability of being a certain type is independent of age in our analysis. However, if we were to have modelled mortality as well, then the probability of being a particular type would depend on age since the unhealthy types would have higher probabilities of dying.
Table 7. Health sequence frequencies – women.

(h_{i,t−3}, h_{i,t−2}, h_{i,t−1}, h_{i,t})    t = 33    t = 43    t = 53
(1, 1, 1, 1)                                  10        29        44
(1, 1, 1, 0)                                  22        49        39
(1, 1, 0, 1)                                  5         13        12
(1, 0, 1, 1)                                  6         19        12
(0, 1, 1, 1)                                  22        49        39
(1, 1, 0, 0)                                  42        84        54
(1, 0, 1, 0)                                  12        18        15
(0, 1, 1, 0)                                  25        48        27
(1, 0, 0, 1)                                  10        12        17
(0, 1, 0, 1)                                  13        24        15
(0, 0, 1, 1)                                  41        78        55
(1, 0, 0, 0)                                  142       168       134
(0, 1, 0, 0)                                  110       96        97
(0, 0, 1, 0)                                  111       102       97
(0, 0, 0, 1)                                  142       168       135
(0, 0, 0, 0)                                  3736      2853      1258
In Figures 1–4, we take the parameter estimates for men and map them into health transition probabilities. 16 Each figure corresponds to a separate type and plots two profiles. The first is the probability of being ill today conditional on having been ill yesterday. We call this profile the persistence of illness. The second is the probability of being ill today conditional on having been well yesterday. We call this profile the onset of illness. We plot 95% confidence bands around each profile. 17 It is important to realize that because many of the standard errors in Tables 4 and 5 are quite large, some of these confidence bands include zero or unity. The figures show a large degree of heterogeneity in health. Figure 1, which corresponds to Type 1 men, shows that the persistence of illness is close to unity and that the onset of illness is close to zero at all ages. Taken at face value, this suggests that Type 1 men exhibit a tremendous degree of state dependence. Figures 2–4 correspond to Types 2–4. Type 2 men are the healthiest and Type 4 men are the unhealthiest. These figures show far more muted degrees of state dependence than Figure 1. Figures 5–8 display the degree of state dependence, which is defined to be the difference between the persistence and onset profiles for men. Each figure corresponds to a separate type and includes a 95% confidence band. The degree of state dependence is close to unity for Type 1 men and women. The degree of state dependence for Type 2 men and women is very low—below 10% for most ages. For Types 3 and 4, we see a more intermediate degree of state dependence that is somewhere between 10 and 20%. At this point, we subject the reader to a caveat concerning the high degree of state dependence that we uncovered for Type 1 people. We conjecture that this is a consequence of the fact that these individuals have an initial probability of being ill that is below 1% and an extremely low 16 17
The figures for women, which we do not report, were similar. The δ-method was used to calculate the standard errors.
probability of falling ill from that point onward. Consequently, the data do not contain a wealth of information on the persistence of illness for these types. This makes it difficult to pin down γ 1 . Thus, we believe that our estimations are telling us that these types have very low propensities of falling ill, but are not terribly informative of their degree of state dependence. Also, it is worth mentioning that Halliday (2007a) used the same data, but alternative semi-parametric tests, and did not find strong evidence of state dependence. To better see this, in Tables 6 and 7, we report the frequencies of 4 year sequences of health, for men and women, starting from ages 30, 40 and 50. Both tables show that the sequence (0, 0, 0, 0) is, by far, the most frequent and so, the healthy state is highly persistent for the vast majority of the individuals in our data. In contrast, if it were the case that the degree of state dependence actually was unity for half of the population, we would also expect to see a large frequency for the sequence (1, 1, 1, 1) , but we do not.
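The persistence and onset profiles in Figures 1–8 are simple transformations of the estimated parameters: persistence is Λ(α_a + γ_a + ρ_a'T) and onset is Λ(α_a + ρ_a'T). The sketch below computes the two profiles and their difference (the degree of state dependence) for an arbitrary parameter vector; the parameter values and the age scaling are placeholders, not the Table 4 or Table 5 estimates.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def profiles(alpha, gamma, rho1, rho2, ages):
    """Persistence P(ill | ill), onset P(ill | well) and their difference over the age range."""
    idx_well = alpha + rho1 * ages + rho2 * ages**2
    persistence = logistic(idx_well + gamma)   # P(ill today | ill yesterday)
    onset = logistic(idx_well)                 # P(ill today | well yesterday)
    return persistence, onset, persistence - onset

ages = np.arange(22, 61, dtype=float)
pers, onset, state_dep = profiles(alpha=-4.5, gamma=1.0, rho1=0.04, rho2=0.0003, ages=ages)
print(state_dep[[0, 18, 38]])   # state dependence at ages 22, 40 and 60
```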
6. CONCLUSION

This paper investigated the evolution of health over the life-course by estimating several specifications of a flexible model of health dynamics which allowed for two sources of persistence: unobserved heterogeneity and state dependence. Our analysis suggested that altering the linear index of our model did little to improve its fit. In contrast, adding support points to the mixing distribution led to large improvements in fit. We found that at least four support points were necessary, indicating a large degree of heterogeneity in our data. This suggests that much of what determines health in adulthood can be traced back to childhood and is consistent with recent work by Case et al. (2002). We found modest degrees of state dependence for approximately half of the population. For the other half, we found that it was near unity. However, because the likelihood of falling ill was so low for this part of the population, we do not believe that we can say anything conclusive about their degree of state dependence. Can the estimates in this paper inform us about health policy? While this paper can be criticized as being too ‘reduced form’, we believe that our approach, which is focused on deepening our understanding of the statistical properties of the data while making parametric restrictions that are as weak as possible, can be informative of policy. In fact, because measuring health is so difficult and incorporating it into life-cycle consumption models often results in models that are very hard to estimate and potentially fragile in the face of mis-specified distributional and modelling assumptions, many authors such as Adams et al. (2003), Adda et al. (2006) and Halliday (2007b) have also adopted less structural approaches in health applications. To this end, we contend that the results of this paper shed light on the gradient: the much-studied but little-understood statistical correlation between health and socioeconomic status (Adams et al., 2003). If it is the case that the gradient is largely determined by the causal impact of health status on earnings and wealth—as suggested by Smith (1999)—then the relevant policy prescription is to directly target health via improvements in health care and its delivery (Deaton, 2002). The argument for health policies is further strengthened if health exhibits a high degree of state dependence since this implies that interventions will have large dynamic effects. Our reading of the results leads us to conclude that, while improvements in medical care will lead to modest improvements in health, there may be larger potential gains to identifying and then targeting factors that influence individual heterogeneity. Our reasoning for this is that we uncover relatively modest degrees of state dependence for most people. For the rest of the
population, we do uncover an enormous degree of state dependence, but we have good reasons, which we outlined above, for thinking that this is a result of the fact that this segment of the population almost never gets sick prior to age 60. On the other hand, we do uncover a large amount of heterogeneity which indicates that much of the persistence that we observe in the aggregate is driven by individual characteristics which can be traced back to early adulthood and before.
ACKNOWLEDGMENTS

This paper was the first chapter from my dissertation at Princeton University. I would like to thank the editor of this journal and several anonymous referees for excellent comments. In addition, I would like to extend my gratitude to my advisors Chris Paxson and Bo Honoré for their encouragement and guidance.
REFERENCES Abowd, J. and D. Card (1989). On the covariance structure of earnings and hours changes. Econometrica 57, 411–45. Adams, H. P., M. D. Hurd, D. McFadden, A. Merrill and T. Ribeiro (2003). Healthy, Wealthy and wise? Tests for direct causal pathways between health and socioeconomic status. Journal of Econometrics 112, 3–56. Adda, J., J. Banks and H. M. von Gaudecker (2006). The impact of income shocks on health: evidence from cohort data.Working paper, University College London. Amemiya, T. (1985). Advanced Econometrics. Cambridge, MA: Harvard University Press. Arcidiacono, P. and J. B. Jones (2003). Finite mixture distributions, sequential likelihood and the em algorithm. Econometrica 71, 933–46. Arcidiacono, P., H. Sieg and F. Sloan (2007). Living Rationally under the volcano? Heavy drinking and smoking among the elderly. International Economic Review 48, 37–65. Case, A., D. Lubotsky and C. Paxson (2002). Economic status and health in childhood: the origins of the gradient. American Economic Review 92, 1308–34. Contoyannis, P., A. M. Jones and R. Leon-Gonzalez (2004). Using simulation-based inference with panel data in health economics. Health Economics 13, 101–22. Contoyannis, P., A. M. Jones and N. Rice (2004). Simulation-based inference in dynamic panel probit models: an application to health. Empirical Economics 29, 49–77. Deaton, A. (1992). Understanding Consumption. Oxford: Oxford University Press. Deaton, A. (2002). Policy implications of the gradient of health and wealth. Health Affairs 21, 13–30. Deb, P. and P. Trivedi (1997). Demand for medical care by the elderly: a finite mixture approach. Journal of Applied Econometrics 12, 313–36. Eckstein, Z. and K. Wolpin (1990). Dynamic labor force participation of married women and endogenous work experience. Review of Economic Studies 56, 375–90. Grossman, M. (1972). On the concept of health capital and the demand for health. Journal of Political Economy 80, 223–55. French, E. (2005). The effects of health, wealth and wages on labour supply and retirement behavior. Review of Economic Studies 72, 395–427.
Halliday, T. (2007a). Testing for state dependence with time variant transition probabilities. Econometric Reviews 26, 1–19. Halliday, T. (2007b). Income volatility and health. IZA Discussion Paper No. 3234, Institute for the Study of Labor (IZA). Heckman, J. J. (1981a). Statistical models for discrete panel data. In C. Manski and D. McFadden (Eds.), Structural Analysis of Discrete Data, Cambridge, MA: MIT Press. Heckman, J. J. (1981b). Heterogeneity and state dependence. In C. Manski and D. McFadden (Eds.),Structural Analysis of Discrete Data, Cambridge, MA: MIT Press. Heckman, J. J. and B. Singer (1984). A method for minimizing the impact of distributional assumptions in econometric models for duration data. Econometrica 52, 271–320. Hotz, J. V., Kydland, F. E. and G. L. Sedlacek (1988). Intertemporal Preferences and labor supply. Econometrica 56, 335–60. Hyslop, D. (1999). State dependence, serial correlation and heterogeneity in intertemporal labor force participation of married women. Econometrica 67, 1255–94. Idler, E. L. and S. V. Kasl (1995). Self-ratings of health: do they also predict changes in functional ability? Journal of Gerontology 50, S344–53. Kaplan, G. A. and T. Camacho (1983). Perceived health and mortality: a 9 year follow-up of the human population laboratory cohort. American Journal of Epidemiology 177, 292–304. Leroux, B. G. (1992). Consistent estimation of a mixing distribution. Annals of Statistics 20, 1350–60. Lillard, E. L. and R. Willis (1978). Dynamic aspects of earnings mobility. Econometrica 46, 985–1012. Meghir, C. and L. Pistaferri (2004). Income variance dynamics and heterogeneity. Econometrica 72, 1–32. Mossey, J. M. and E. Shapiro (1982). Self-rated health: a predictor of mortality among the elderly. American Journal of Public Health 71, 100. Neyman, J. and E. Scott (1948). Consistent estimates based on partially consistent observations. Econometrica 16, 1–32. Rust, J. and C. Phelan (1997). How Social security and medicare affect retirement behavior in a world of incomplete markets. Econometrica 65, 781–831. Smith, J. P. (1999). Healthy bodies and thick wallets: the dual relation between health and economic status. Journal of Economic Perspectives 13, 145–66. Smith, J. P. (2003). Health and SES Across the Life Course. Working paper, Rand. Viscusi, K. and W. N. Evans (1990). Utility functions that depend of health status: estimates and economic implications. American Economic Review 80, 353–74. Wooldrdige, J. (2005). Simple solutions to the initial conditions problem for dynamic, nonlinear panel data models with unobserved heterogeneity. Journal of Applied Econometrics 20, 39–54.
Econometrics Journal (2008), volume 11, pp. 517–537. doi: 10.1111/j.1368-423X.2008.00255.x
Semiparametric estimation of the Box–Cox transformation model
YOUNGKI SHIN†
†
Department of Economics, University of Western Ontario, London, ON N6A 5C2, Canada E-mail:
[email protected] First version received: November 2007; final version accepted: June 2008
Summary In this paper, I propose a semiparametric estimation procedure for the Box–Cox transformation model. I show a global identification result under mild conditions that allow conditional heteroskedastic error terms. The proposed estimator minimizes a second order U-process and does not require any user-chosen values such as a smoothing parameter that sometimes induces unstable inference result. With a slight modification, it can also be applied to random censoring which depends on covariates in an arbitrary way. The estimator converges to an asymptotic normal distribution at the rate of √n and Monte Carlo experiments show adequate finite sample performance. Keywords: Box–Cox transformation, Covariate dependent censoring, Minimum distance estimation.
1. INTRODUCTION

Box and Cox (1964) proposed a transformed regression model that contains the linear and log-linear regression model as its special case. Specifically, they adopted the class of power transformations indexed by λ:

$$y^{(\lambda)} = \frac{y^{\lambda} - 1}{\lambda}, \quad \text{if } \lambda \neq 0, \qquad (1.1)$$

$$y^{(\lambda)} = \log y, \quad \text{if } \lambda = 0. \qquad (1.2)$$

They assumed that the distribution of $y^{(\lambda)}$ conditional on a k-dimensional covariate x follows a normal distribution $N(x'\beta, \sigma^2)$ and suggested the maximum likelihood estimation (MLE) procedure for parameters λ, β and σ². Using this model, the logarithmic transformation that is frequently used in econometric models is testable. Furthermore, this model provides a better fit of data among the class of power transformations. It is well known that there is a specification problem in the Box–Cox model. Specifically, the support of $y^{(\lambda)}$ is truncated by a point that depends on the parameter λ unless λ = 0. For example, if λ is positive, then $y^{(\lambda)}$ is always greater than $-1/\lambda$. Thus, the normality assumption does not hold and the Box–Cox MLE would induce an inconsistent result. The degree of inconsistency increases as the distribution of $y^{(\lambda)}$ is more skewed, and the normal approximation of the Box–Cox model is valid only when λ is near to zero.
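For reference, the transformation in (1.1)–(1.2) is trivial to compute; the snippet below is a direct implementation and also illustrates the truncation point −1/λ mentioned above.

```python
import numpy as np

def box_cox(y, lam):
    """Power transformation (1.1)-(1.2); y must be positive."""
    y = np.asarray(y, dtype=float)
    if lam == 0:
        return np.log(y)
    return (y**lam - 1.0) / lam

y = np.array([0.1, 0.5, 1.0, 2.0, 5.0])
print(box_cox(y, 0.5))   # for lam > 0 every value exceeds -1/lam = -2
print(box_cox(y, 0.0))   # the log-linear special case
```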
Oxford OX4 2DQ, UK and 350 Main Street, Malden, MA, 02148, USA.
518
Y. Shin
To resolve the problem, modifications of the transformation family have been suggested in one way. Bickel and Doksum (1981) proposed a generalized class of power transformations that allows y to have the support of the whole real line. Carroll and Ruppert (1984) considered a transformation of the regression function as well as the dependent variable to obtain normality. These models, however, still assume that error distributions are known up to certain parameters. On the other hand, a semiparametric approach that does not assume any error distribution has been applied. Amemiya and Powell (1981) proposed the non-linear two-stage least square (NL2S) estimation method that is consistent under the condition of E(ε|x) = 0. Han (1987b) considered a monotone class of transformation that is more general than the Box–Cox model and suggested a three-step estimation procedure exploiting rank correlation. √ He showed that the estimator is consistent, and Asparouhova et al. (2002) established n-consistency and asymptotic normality. The purpose of this paper is to provide a simple semiparametric estimation method. The proposed method has the following advantages compared to the existing ones. The model can be identified globally under mild regularity conditions. The identification condition in NL2S that requires at least as many equations as the number of parameters only tells us about the local property. As noted in Dominguez and Lobato (2004), this is a general problem in the GMM estimator using only a finite number of moment conditions. Note that a regression model defined as conditional moment conditions corresponds to an infinite number of unconditional moment conditions. Thus, the model can be misspecified in the GMM estimation if there are no additional high dimensional restrictions which are not always tractable. The main contribution of this paper is to avoid this kind of problem and show the global identification with a simple support condition of x. Second, the proposed estimators are easy to compute, for all interesting parameters can be estimated in one step by O(n2 ) computations. For instance, the maximum rank correlation (MRC) estimator in Han (1987b) consists of three steps, and the first and second steps require O(n2 ) and O(n4 ) computations respectively. Therefore, the whole computation would be quite burdensome even with a reasonable size of observations. Third, the proposed estimator is robust to conditional heteroskedasticity since it does not need i.i.d. conditions on error terms. Finally, it is free from the user-chosen values such as smoothing parameters. I provide three estimators of which each corresponds to different restrictions on error terms but with unknown distributions. First, I consider the conditional mean restriction as above. Next, I suggest a similar method under the conditional median restriction of med(ε|x) = 0, which may easily extend to any quantile regression models without much difficulty. Finally, I combine the Box–Cox regression model with random censoring. Recently, Khan and Tamer (2008) provide a consistent estimator for the accelerated failure time (AFT) model with general forms of random censoring, in which censoring variables may depend on covariates in an arbitrary way. In the Box–Cox model, the transformation of a dependent variable is not fixed as in the AFT model but the class of the power transformations. All estimators are consistent and converge to the normal √ distribution at the rate of n. The rest of this paper is organized as follows. 
In Section 2, I describe the Box–Cox regression model with a conditional mean restriction and propose an estimation procedure. The asymptotic properties are also established under mild regularity conditions. In Section 3, I consider the model with a conditional median restriction and combine it with random censoring. I investigate the finite sample properties by the Monte Carlo simulations in Section 4. Section 5 concludes and suggests some area of future research. Technical proofs are presented in the Appendix.
C The Author(s). Journal compilation C Royal Economic Society 2008.
Box–Cox transformation model
519
2. MODEL AND ESTIMATION PROCEDURE In this section, I propose the estimation method of the Box–Cox regression model with the mean restriction. I rewrite the model in regression form, but do not impose any functional restriction on the error distribution. Let (y i , x i ) be a sequence of random variables with y i > 0. The Box–Cox regression model is specified as: yi(λ0 ) = xi β0 + εi ,
(2.1)
where yi(λ0 ) is the power transformation defined as above and ε i is an unobserved error term. I am interested in estimating the (k + 1)-dimensional parameter vector θ 0 = (λ 0 , β 0 ). First, I construct a conditional moment condition that identifies the correct model. To do that, I need the following regularity conditions: A SSUMPTION I1. E(ε i |x i ) = 0 x i − a.s. A SSUMPTION I2. x i does not lie in any proper linear subspace of Rk x i − a.s. A SSUMPTION I3. Let x i = (1, x i,−1 ) and β = (β 1 , β −1 ) to denote the intercept term explicitly. (a) The support of x i,−1 is large enough for x i,−1 β −1 to be continuous on the real line for all β −1 = 0. (b) The true parameter value β 0,−1 = 0. Assumption I1 is the usual conditional mean restriction on error terms. Assumption I2 is also the standard rank condition of the linear regression model. Assumption I3 is crucial for identification. First, Assumption I3 (a) requires that index x i,−1 β −1 is continuous on the real line. Note that this condition is a bit stronger than the support condition used in the semiparametric literature such as Manski (1985) and Han (1987a). Basically, they require that, for the true parameter value β 0 , the index x i β 0 has everywhere positive Lebesgue density in R. Although Assumption I3 (a) involves all possible parameter values, it enables us to identify the parameters globally. A weaker condition might be imposed if one is more interested in local identification. One trivial sufficient condition for Assumption I3 (a) is that x i,−1 has positive Lebesgue density on the whole space of Rk−1 . Second, if β 0,−1 = 0, then λ is not identified from the coefficient parameter β 1 . 1 Now, I suggest a conditional moment condition, and show that it has desirable property for identification. The following lemma is important L EMMA 2.1. Under Assumptions I1–I3, PX E yi(λ) − xi β|xi = 0 < 1 for all (λ, β ) = (λ0 , β0 ),
(2.2)
where P X is a probability measure on x i . It is clear from Assumption I1 that E(yi(λ) − xi β|xi ) = 0 x i − a.s. for the true parameter value (λ 0 , β 0 ). I will use this as a conditional moment condition that specifies the correct model. The proof depends only on the monotone property of the power transformation class, so I may extend the result to any monotone transformation family by assuming that transformations are non-linear with respect to the parameter λ as in Han (1987b). Now, I am ready to derive the estimator of θ 0 based on the conditional moment condition in Lemma 2.1. Recently, Dominguez and Lobato (2004) suggested an estimation method which is 1
I would like to thank a referee for pointing out issues related to the intercept identification.
C The Author(s). Journal compilation C Royal Economic Society 2008.
520
Y. Shin
directly based on a conditional moment condition and does not need any user-chosen values such as smoothing parameters. I will take the similar approach in this paper. To derive the estimator, they used Theorem 16.10 in Billingsley (1995) and addressed that a conditional moment condition is equivalent to unconditional moment condition multiplied by an indicator function on the whole support of regressors. To see this in detail, I first define the following functions: g(xi , yi , θ ) = yi(λ) − xi β
(2.4)
H (xl , θ ) = E(g(xi , yi , θ ) · I (xi ≤ xl )|xl ),
(2.5)
where {x i } are i.i.d. and i = l. Then, the theorem implies that E(g(xi , yi , θ0 )|xi ) = 0xi − a.s. ⇔ H (xl , θ0 ) = 0 for almost all xl ∈ Rk .
(2.6)
Based on this equivalence, I define an objective function Q(θ ) as follows: Q(θ ) = E(H (xl , θ )2 ).
(2.7)
In the next lemma, I establish identification by showing that the distance function Q(θ ) has the unique minimizer θ 0 . Note that this result is global identification. Thus, it is more general than the one suggested by Amemiya and Powell (1981). L EMMA 2.2. Under Assumptions I1–I3, Q(θ ) is uniquely minimized at θ = θ 0 . (xl , θ ) Now, I propose my estimator based on the sample analogue principle. First, I define H which is a sample analogue of H (x l , θ ): (xl , θ ) = 1 g(xi , yi , θ ) · I (xi ≤ xl ). H n i=1 n
(2.8)
Then Q n (θ ) is define in similar way and I suggest my estimator that minimizes the sample objective function: θM = arg min Qn (θ ) θ∈
1 H (xl , θ )2 . n l=1
(2.9)
n
= arg min θ∈
(2.10)
Note that the proposed estimator minimizes a second-order U-process, and only requires O(n2 ) computations. On the other hand, Han’s (1987b) semiparametric estimator maximizes a fourthorder U-process, and requires O(n4 ) computations in the second step only. Asparouhova et al. (2002) proposed simplified but equivalent rank estimation procedure, but it still requires O(n2 log n) computations. The computational efficiency of the proposed estimator makes it more attractive in empirical setting. C The Author(s). Journal compilation C Royal Economic Society 2008.
Box–Cox transformation model
521
Next, I show that the proposed estimator is consistent. First, I need the following regularity conditions: A SSUMPTION C1. The parameter space is a compact subset of Rk+1 A SSUMPTION C2. The sample vector z i = (y i , x i ) is i.i.d. A SSUMPTION C3. There exists a measurable envelope of functions {g(·, ·, θ )}. Assumption C3 is a bit stronger than what I need√for consistency result, but Euclidean property derived from it will be very useful when I prove n-consistency and asymptotic normality later on. Consistency is established in the following theorem. T HEOREM 2.1. Under Assumptions I1–I3 and C1–C3, p θM → θ0.
(2.11) √ In the remaining of this section, I show that the proposed estimator satisfies n-consistency and asymptotic normality. Following Sherman (1994b), I define a primitive function γ (z, θ ) as γ (z, θ ) = h (z, P , P , θ ) + h (P , z, P , θ ) + h (P , P , z, θ ) ,
(2.12)
where h(z j , z k , z l , θ ) = g(x j , y j , θ ) g(x k , y k , θ ) I (x j ≤ x l )I (x k ≤ x l ) and P is a probability measure for the random vector z. I use the notation h(z, P , P , θ ) for conditional expectation given the first argument and the remaining two terms in γ (z, θ ) are defined in similar way. The function h comes from the process of simplifying the sample objective function, which I explain in detail in the Appendix. I need the following regularity conditions for asymptotic normality: A SSUMPTION D1. θ 0 is an interior point of the parameter space . A SSUMPTION D2. E(||∇ 1 γ (z i , θ 0 )||2 ) and E(||∇2 γ (zi , θ0 )||) are finite, where ∇ i denotes the ith derivative operator with respect to θ . A SSUMPTION D3. E(∇2 γ (zi , θ0 )) is non-singular. Note that I do not impose any √smoothness conditions since the objective function is already differentiable. Now, I establish n -consistency and asymptotic normality in the following theorem. T HEOREM 2.2. Under Assumptions I1–I3, C1–C3 and D1–D3, √ d M − β) → n(β N(0, V −1 V −1 ),
(2.13)
where V = 13 E(∇2 γ (zi , θ0 )) and = E(∇ 1 γ (z i , θ 0 )∇ 1 γ (z i , θ 0 ) ). For inference, one may estimate the variance matrix consistently using the sample analogue of it. Different from the median restriction case below, the function g(·, ·, θ ) is differentiable with respect to θ , so it can be easily computed without involving numerical derivatives. 2 First, note that ∇1 g (x, y, θ ) = y λ−1 , −β1 , −β2 , · · · , −βk (2.14)
2
I would like to thank a referee for pointing this out.
C The Author(s). Journal compilation C Royal Economic Society 2008.
522
Y. Shin
∇2 g (x, y, θ ) = diag (λ − 1) y λ−2 , −1, −1, · · · , −1 ,
(2.15)
where ∇ 2 g(x, y, θ ) is a (k + 1) × (k + 1) diagonal matrix with given elements. Then, I can write the derivatives of h(·, ·, ·, θ ) in terms of g as follows: ∇1 h(zi , zj , zk , θ ) = {∇1 g(xi , yi , θ )g(xj , yj , θ ) + g(xi , yi , θ )∇1 g(xj , yj , θ )} ×I (xi ≤ xl )I (xj ≤ xl )
(2.16)
∇2 h(zi , zj , zk , θ ) = {∇2 g(xi , yi , θ )g(xj , yj , θ ) + 2∇1 g(xi , yi , θ )∇1 g(xj , yj , θ ) +g(xi , yi , θ )∇2 g(xj , yj , θ )}I (xi ≤ xl )I (xj ≤ xl ).
(2.17)
The sample analogues of ∇ 1 γ (z, θ ) and ∇ 2 γ (z, θ ) are 1 ∇i h z, zj , zk , θ + ∇i h zj , z, zk , θ + ∇i h zj , zk , z, θ ∇i γ (z, θ ) = n (n − 1) j =k (2.18)
for i = 1, 2. The separate components of the variance matrix can be consistently estimated by
1 = ∇ 1 γ z i , θ · ∇1 γ z i , θ n i = 1 V γ zi , θ . ∇2 3n i
(2.19)
(2.20)
The consistency follows from the existing uniform law of large numbers for U-statistics. −1 . −1 V Therefore, one can estimate the variance matrix by V
3. EXTENSION: MEDIAN RESTRICTION AND RANDOM CENSORING In this section, I extend the proposed estimator to the random censoring case. To do that, I first consider the model with a conditional median restriction. I can apply most of the results in the previous section with a slight modification. The main point is how to find a conditional moment condition that assures global identification. I use the same regression model in (2.1) but now ε i follows a conditional median restriction such that med(εi |xi ) = 0. This condition still allows very general forms of heteroskedasticity and is weaker than the assumption ε i ⊥ x i . First, I define a function that will be used for constructing a conditional moment condition. Let τ 1 (x i , θ 0 ) be defined as: 1 (3.1) τ1 (xi , θ ) = E I yi(λ) ≥ xi β |xi − . 2 For identification I need only to change Assumption I1 to I1 . A SSUMPTION I1 . med(εi |xi ) = 0. Then, I have the following result. C The Author(s). Journal compilation C Royal Economic Society 2008.
Box–Cox transformation model
523
L EMMA 3.1. Under Assumptions I1 and I2–I3, the following holds: (1) τ 1 (x i , θ 0 ) = 0 x i − a.s. (2) P X (τ 1 (x i , θ ) = 0) > 0 for all θ = θ 0 . Now I have another conditional moment condition that is appropriate to the median restriction. I define the function H (x l , θ ) and the objective function Q(θ ) in the same way as above:
(3.2) H (xl , θ ) = E I yi(λ) ≥ xi β − 12 · I (xi ≤ xl ) |xl Q (θ ) = E H (xl , θ )2 .
(3.3)
Then, the identification result immediately follows in the next lemma whose proof is exactly same as Lemma 2.2. L EMMA 3.2. Under Assumptions I1 and I2–I3, Q(θ ) is uniquely minimized at θ = θ 0 . All arguments in the previous section can be applied including the global identification property. Now I define new estimator by using the sample analogue principle: θD = arg min Qn (θ ) θ∈
1 Hn (xl , θ )2 , n l=1
(3.4)
n
= arg min θ∈
(3.5)
where H n (x l , θ ) is defined as Hn (xl , θ ) =
n 1 1 I yi(λ) ≥ xi β − · I (xi ≤ xl ) . n i=1 2
(3.6)
Again, the sample objective function Q n (θ ) is a second order U-process, so its computation is much easier than the current semiparametric estimators. Now I derive the consistency and asymptotic normality. I modify one regularity condition: A SSUMPTION C3 . Q(θ ) is continuous at θ = θ 0 . The following theorem establishes that θ is consistent. T HEOREM 3.1. Under Assumptions I1 , I2–I3 and C1–C2, C3 , p θˆD → θ0 .
(3.7) √ Next, I show that the proposed estimator satisfies n-consistency and asymptotic normality under the additional regularity conditions. These conditions assure the second order Taylor expansion of the primitive function. A SSUMPTION D2 . Let N be a neighbourhood of θ 0 . Then the following holds: (a) For each z, all mixed second partial derivatives of γ (z, ·) exist on N . (b) There exists an integrable function (z) such that for all z and θ ∈ N , || 2 γ (z, θ ) − 2 γ (z, θ0 )|| ≤ (z) |θ − θ0 |2 . C The Author(s). Journal compilation C Royal Economic Society 2008.
524
Y. Shin
(c) E|| 1 γ (·, θ )||2 < ∞. (d) E|| 2 γ (·, θ )|| < ∞. (e) The (k + 1) × (k + 1) dimensional matrix E[ 2 γ (·, θ 0 )] is negative definite. √ I establish n-consistency and asymptotic normality in the following theorem. Now the sample objective function is not continuous in θ , I may adopt either resampling or numerical derivative methods for inference. T HEOREM 3.2. Under Assumptions I1 , I2–I3, C1–C2, C3 and D1, D2 , d √ n θˆD − θ0 → N 0, V −1 V −1 ,
(3.8)
where V and are defined in similar way above. Next, I combine the model with general forms of random censoring. Without loss of generality, I consider a Box–Cox transformation model with right censoring. In this model, the (latent) dependent variable is observed only when it is less than the value of a random censoring variable. So, the model I am interested in can be expressed as: (3.9) vi(λ0 ) = min xi β0 + εi , ci di = I xi β0 + εi ≤ ci ,
(3.10)
where c i is a censoring variable that may depend on x i in an arbitrary way and θ = (λ, β ) is an interesting parameter vector. The error term ε i satisfies the median restriction as before. Recently, Khan and Tamer (2008) proposed an estimation method for the accelerated failure time model with general forms of random censoring. Following their approach, I extend the suggested estimator to the random censoring case. I first define the following notation: 1 τ1 (xi , θ ) = E I vi(λ) ≥ xi β0 |xi − 2 1 τ0 (xi , θ ) = E (1 − di ) + di I vi(λ) ≥ xi β |xi − 2 =
1 − E di I vi(λ) ≤ xi β |xi 2
g(xi , θ ) = τ1 (xi , θ )I (τ1 (xi , θ ) ≥ 0) − τ0 (xi , θ ) I (τ0 (xi , θ ) ≤ 0) .
(3.11)
(3.12)
(3.13)
(3.14)
To show the identification result, I need additional regularity conditions that come directly from Khan and Tamer (2008): A SSUMPTION I4. The subset C = xi ∈ SX : P ci ≥ xi β0 |xi = 1 .
(3.15)
does not lie in a proper linear subspace of Rk where S X is the support of x i . C The Author(s). Journal compilation C Royal Economic Society 2008.
Box–Cox transformation model
525
Note that I do not impose any statistical independence among the censoring variable and the latent error term. 3 The model also allows a censoring variable to depend on covariates in an arbitrary way. The proposed estimator is still robust to the conditional heteroskedasticity as before. Assumption I4 is a support restriction which only requires that there are some uncensored observations. Now, I am ready to show the main result of identification which is summarized in the next lemma: L EMMA 3.3. Under Assumptions I1 , I2–I4, the following holds: (1) g(x i , θ 0 ) = 0 x i − a.s. (2) P X (g(x i , θ ) > 0) > 0 for all θ = θ 0 . I finish this section by providing the exact form of the estimator. The result of consistency and asymptotic normality directly follow from Khan and Tamer (2008) where their proofs can also be found: θC = arg min Qn (θ ) θ∈
= arg min θ∈
1 {H1n (xj , xk , θ )I (H1n (xj , xk , θ ) ≥ 0) n(n − 1) j ,k
−H0n (xj , xk , θ )I (H0n (xj , xk , θ ) ≤ 0)}, where H 1n (x j , x k , θ ) and H 0n (x j , x k , θ ) are defined as n 1 1 (λ) I vi ≥ xi β − · I xj ≤ xi ≤ xk H1n xj , xk , θ = n i=1 2 H0n
n 1 1 (λ) − di I vi ≤ xi β xj , xk , θ = · I xj ≤ xi ≤ xk . n i=1 2
(3.16)
(3.17)
(3.18)
(3.19)
4. MONTE CARLO SIMULATION In this section, I investigate finite sample properties of the proposed estimators by a Monte Carlo simulation study. The baseline model is a simple regression of a Box–Cox transformed value on a constant and a covariate: yi(λ) = β1 + β2 xi + εi ,
(4.1)
where true parameter values of β 1 and β 2 are set to be 0 and 1, respectively. To escape from complicated truncation conditions on covariate and error distribution, I adopt the transformation function family in Bickel and Doksum (1981): yi(λ) = 3
|yi |λ sgn(yi ) − 1 for λ > 0. λ
See Khan and Tamer (2008) for detail.
C The Author(s). Journal compilation C Royal Economic Society 2008.
(4.2)
526
Y. Shin Table 1. Simulation results: conditional mean restriction with λ = 0.1. λ = 0.1 β1 = 0 β2 = 1 Mean Bias
RMSE
Mean Bias
RMSE
Mean Bias
RMSE
MD NL2S MRC
−0.0021 0.0009 −0.0178
0.0405 0.0373 0.0422
0.0010 0.0056 −0.0424
0.1368 0.1375 0.1264
0.0108 0.0070 0.0217
0.0641 0.0605 0.0777
PMLE
−0.0203
0.0339
−0.1152
0.1495
0.0153
0.0531
MD NL2S
−0.0029 0.0012
0.0261 0.0210
0.0007 0.0106
0.1026 0.1025
0.0113 0.0075
0.0425 0.0318
MRC PMLE
−0.0059 −0.0172
0.0244 0.0236
−0.0121 −0.1052
0.0983 0.1270
0.0106 0.0128
0.0404 0.0322
MD
−0.0026
0.0201
−0.0079
0.0696
0.0013
0.0282
NL2S MRC PMLE
−0.0008 −0.0010 −0.0138
0.0153 0.0100 0.0185
−0.0026 −0.0028 −0.1012
0.0618 0.0546 0.1143
0.0010 0.0009 0.0072
0.0221 0.0215 0.0209
Homoskedasticity 50obs
100obs
200obs
Heteroskedasticity 50obs
100obs
200obs
MD
−0.0039
0.0494
−0.0281
0.1412
−0.0038
0.0788
NL2S MRC
0.0124 −0.0406
0.0609 0.0668
0.0153 −0.1265
0.1823 0.1904
0.0002 0.0140
0.0778 0.0941
PMLE
−0.0559
0.0668
−0.2225
0.2412
−0.0092
0.0794
MD
−0.0049
0.0297
0.0033
0.0962
0.0084
0.0508
NL2S MRC
0.0013 −0.0564
0.0396 0.0751
0.0166 −0.1540
0.1175 0.1969
0.0032 0.0347
0.0555 0.0828
PMLE MD NL2S
−0.0561 −0.0001 0.0073
0.0632 0.0219 0.0337
−0.2062 0.0092 0.0241
0.2215 0.0721 0.0885
−0.0032 0.0014 −0.0051
0.0508 0.0362 0.0482
MRC PMLE
−0.0525 −0.0550
0.0724 0.0609
−0.1489 −0.2150
0.1897 0.2208
0.0196 −0.0097
0.0636 0.0452
Note that the function in (4.2) can be defined for all real values of y, and that it would be same with the original function in (1.1) whenever y i > 0. I considered four parameter values of λ = 0.1, 0.5, 1 and 2. The covariate x i follows N(0, 5) for all designs. For different values of λ, I considered homoskedastic error terms with ε i ∼ χ 2 (1) normalized to mean zero and variance of 0.5, and heteroskedastic error terms with ε i = η i × exp (x i /4) where η i follows the same χ 2 (1) as above. Also I considered median restriction models with random censoring. In these models, the error term ε i follows N(0, 0.5), and I set the covariate dependent censoring as c i = ηi × (5x 2i ) where ηi follows χ 2 (1). Therefore, I used 12 different simulation designs in total. C The Author(s). Journal compilation C Royal Economic Society 2008.
527
Box–Cox transformation model Table 2. Simulation results: conditional mean restriction with λ = 0.5. λ = 0.5 β1 = 0 β2 = 1 Mean Bias
RMSE
Mean Bias
RMSE
Mean Bias
RMSE
MD NL2S MRC
−0.0047 −0.0024 −0.0158
0.0470 0.0475 0.0422
−0.0233 −0.0194 −0.0574
0.1685 0.1849 0.1401
0.0113 0.0130 0.0170
0.0629 0.0547 0.0632
PMLE
−0.0155
0.0326
−0.1074
0.1583
0.0138
0.0487
MD NL2S MRC
0.0056 0.0043 −0.0030
0.0342 0.0365 0.0172
0.0035 −0.0003 −0.0143
0.1010 0.1195 0.0776
−0.0079 −0.0050 −0.0041
0.0469 0.0374 0.0351
PMLE
−0.0113
0.0240
−0.0972
0.1282
0.0012
0.0320
MD NL2S
−0.0006 0.0029
0.0261 0.0217
−0.0036 0.0056
0.0800 0.0786
0.0028 0.0016
0.0315 0.0230
MRC PMLE
−0.0020 −0.0154
0.0141 0.0217
−0.0060 −0.1061
0.0630 0.1221
0.0030 0.0082
0.0246 0.0225
MD NL2S
−0.0081 0.0322
0.0544 0.1182
−0.0203 0.0568
0.1471 0.2330
0.0131 0.0079
0.0905 0.1053
MRC PMLE
−0.0475 −0.0639
0.0821 0.0797
−0.1384 −0.2363
0.2704 0.2630
0.0415 0.0139
0.1323 0.1008
MD NL2S
0.0001 0.0214
0.0410 0.0748
−0.0114 0.0256
0.1082 0.1803
0.0022 −0.0066
0.0634 0.0780
MRC PMLE
−0.0396 −0.0619
0.0629 0.0722
−0.1268 −0.2499
0.1803 0.2632
0.0219 0.0082
0.0777 0.0707
MD NL2S
0.0030 0.0161
0.0287 0.0562
−0.0014 0.0217
0.0743 0.1353
−0.0048 −0.0100
0.0447 0.0521
MRC PMLE
−0.0307 −0.0621
0.0554 0.0669
−0.1006 −0.2564
0.1590 0.2645
0.0116 −0.0005
0.0563 0.0454
Homoskedasticity 50obs
100obs
200obs
Heteroskedasticity 50obs
100obs
200obs
In the mean restriction designs, I also calculated the non-linear two-state least square (NL2S) estimator in Amemiya and Powell (1981), the maximum rank correlation (MRC) estimator in Han (1987b), and the pseudo maximum likelihood estimator (PMLE) in Bickel and Doksum (1981) besides the minimum distance (MD) estimator introduced in this paper. I used the simulated annealing method for the MD, NL2S and PMLE estimations and the grid search method for the second step of the MRC estimation. Tables 1–4 summarize the result of the 8 mean restriction designs. The first panel of each table corresponds to the homoskedastic error terms, and the second to the heteroskedastic error terms. I conducted 101 replications with the sample size of 50, 100 and 200. Mean bias and root mean square error are reported for each parameter. C The Author(s). Journal compilation C Royal Economic Society 2008.
528
Y. Shin Table 3. Simulation results: conditional mean restriction with λ = 1. λ=1 β1 = 0
β2 = 1
Mean Bias
RMSE
Mean Bias
RMSE
Mean Bias
RMSE
MD NL2S MRC
−0.0096 0.0378 −0.0178
0.1085 0.2350 0.0890
0.0112 0.0099 −0.0056
0.1704 0.5937 0.1583
0.0104 0.0643 0.0053
0.0521 0.2219 0.0448
PMLE
−0.0119
0.0750
−0.0606
0.1530
0.0012
0.0323
MD NL2S
−0.0009 0.0042
0.0758 0.1620
−0.0043 −0.0141
0.1257 0.2562
0.0058 0.0204
0.0413 0.0675
MRC PMLE
−0.0168 −0.0199
0.0589 0.0533
−0.0278 −0.0926
0.0943 0.1322
−0.0020 −0.0026
0.0339 0.0257
MD
−0.0062
0.0593
−0.0157
0.0934
−0.0004
0.0254
NL2S MRC
−0.0023 −0.0267
0.0973 0.0517
−0.0146 −0.0430
0.1342 0.0831
0.0048 −0.0068
0.0525 0.0226
PMLE
−0.0283
0.0420
−0.1051
0.1190
−0.0062
0.0165
MD
−0.0361
0.1130
−0.0310
0.1563
0.0060
0.0620
NL2S MRC
0.0830 −0.1149
0.3780 0.1653
0.0592 −0.1626
0.4818 0.2246
0.1010 −0.0224
0.4651 0.0695
PMLE
−0.1247
0.1553
−0.2259
0.2545
−0.0460
0.0717
MD NL2S
−0.0183 0.0869
0.0970 0.3588
−0.0329 0.0998
0.1309 0.5258
0.0046 0.0915
0.0451 0.2778
MRC PMLE
−0.1050 −0.1383
0.1393 0.1583
−0.1603 −0.2698
0.2034 0.2824
−0.0124 −0.0428
0.0565 0.0601
MD
−0.0204
0.0806
−0.0272
0.0959
0.0001
0.0273
NL2S MRC PMLE
0.0050 −0.1079 −0.1445
0.2314 0.1231 0.1543
0.0594 −0.1534 −0.2668
0.2936 0.1719 0.2749
0.0363 −0.0266 −0.0504
0.1653 0.0404 0.0558
Homoskedasticity 50obs
100obs
200obs
Heteroskedasticity 50obs
100obs
200obs
First, I look at the homoskedastic designs. All estimators including the PMLE work well. This result coincides with the previous simulation studies in Han (1987b) and Asparouhova et al. (2002). However, there are slight difference in their performance according to different λ values. For instance, the NL2S estimator has relatively large RMSEs when λ = 2, and the pseudo MLE shows large bias when λ = 0.1. The proposed MD estimator shows stable performance in √ most designs. The mean bias of it decreases quickly and the RMSE shows n convergence rate. In the heteroskedastic designs, the performance of the MD estimator looks better as predicted by theory. The size of mean bias and RMSE of the MD estimator is much smaller than those of the other estimators. In addition, the MD√estimator seems to be the only estimator whose mean bias decreases and the RMSE shrinks as n-rate in all designs. C The Author(s). Journal compilation C Royal Economic Society 2008.
529
Box–Cox transformation model Table 4. Simulation results: conditional mean restriction with λ = 2. λ=2 β1 = 0
β2 = 1
Mean Bias
RMSE
Mean Bias
RMSE
Mean Bias
RMSE
MD NL2S MRC
−0.1228 −0.0803 −0.0441
0.2605 0.7096 0.1571
−0.0275 −0.0627 −0.0050
0.1385 0.5971 0.1214
−0.0176 0.1219 −0.0132
0.1051 0.4415 0.0729
PMLE
−0.0104
0.1356
−0.0462
0.1222
0.0012
0.0646
MD NL2S
−0.0713 −0.1125
0.1914 0.6321
−0.0341 −0.1431
0.1004 0.7062
−0.0131 0.0910
0.0659 0.3948
MRC PMLE
−0.0515 −0.0444
0.1062 0.1109
−0.0325 −0.0913
0.0802 0.1179
−0.0197 −0.0162
0.0438 0.0437
Homoskedasticity 50obs
100obs
200obs
MD
−0.0294
0.1401
−0.0103
0.0845
−0.0038
0.0583
NL2S MRC
−0.1075 −0.0396
0.5525 0.0901
−0.1091 −0.0203
0.4885 0.0633
0.0465 −0.0151
0.2648 0.0380
PMLE
−0.0346
0.0778
−0.0819
0.0991
−0.0132
0.0304
MD
−0.2078
0.3104
−0.0933
0.1692
−0.0499
0.0904
NL2S MRC
−0.0988 −0.2040
0.9057 0.3276
−0.2074 −0.1949
1.2103 0.8853
0.2057 −0.0486
0.6511 0.2827
PMLE
−0.2237
0.3035
−0.1949
0.2232
−0.1128
0.1233
MD NL2S
−0.1445 −0.0616
0.2217 0.9049
−0.0586 −0.0705
0.1186 0.7318
−0.0342 0.2111
0.0664 0.6283
MRC PMLE
−0.2218 −0.2770
0.2867 0.3262
−0.1207 −0.2156
0.1534 0.2275
−0.0834 −0.1271
0.1002 0.1335
MD
−0.0740
0.1458
−0.0355
0.0904
−0.0226
0.0536
NL2S MRC PMLE
0.0237 −0.1871 −0.2707
0.7844 0.2155 0.2927
0.0163 −0.1036 −0.2128
0.5127 0.1217 0.2197
0.1867 −0.0787 −0.1319
0.5777 0.0878 0.1343
Heteroskedasticity 50obs
100obs
200obs
Finally, Table 5 summarizes the result of median restriction designs with random censoring. Note that the censoring variable depends on the covariate. For each design, the censoring ratio varies from 25 to 40%. I used the Nelder–Mead method for numerical minimization. √ The result again shows good performance of the proposed MD estimator in finite samples and nconvergence rate for various λ values. Consequently, the result from this simulation study shows that the MD estimators introduced in this paper perform well in various cases. They show appropriate finite sample performance with heteroskedasticity and covariate dependent censoring. This property will help a researcher to derive robust conclusions in empirical applications. C The Author(s). Journal compilation C Royal Economic Society 2008.
530
Y. Shin Table 5. Simulation results: conditional median restriction with random censoring. Mean Bias RMSE Mean Bias RMSE Mean Bias RMSE λ = 0.1 β1 = 0 β2 = 1
50obs 100obs 200obs
−0.0668 −0.0346 −0.0203
0.1894 0.1323 0.0591
0.0638 0.0144 −0.0059
100obs 200obs
0.1524 0.0850 0.0354
β1 = 0
λ = 0:5 50obs
0.3289 0.3818 0.1745
0.4208 0.2633 0.1090 β2 = 1
0.0246
0.1158
0.1228
0.3021
0.0270
0.1455
−0.0127 −0.0002
0.0905 0.0382
0.0257 0.0042
0.1562 0.1079
0.0346 0.0108
0.1130 0.0620
λ=1
β1 = 0
β2 = 1
50obs
0.1640
0.4162
0.1362
0.4376
0.0322
0.1539
100obs 200obs
0.1144 0.0502
0.2380 0.1582
0.1406 0.0491
0.2708 0.2430
0.0200 0.0093
0.0876 0.0725
β1 = 0
λ=2
β2 = 1
50obs 100obs
0.2505 0.1874
0.5164 0.4282
0.1164 0.1105
0.3183 0.3196
0.0813 0.0905
0.2396 0.2349
200obs
0.1132
0.2839
0.0719
0.1948
0.0489
0.1298
5. CONCLUSIONS This paper proposed an estimation method for the Box–Cox transformation model. I showed the global identification result under mild conditions that allow conditional heteroskedasticity. This procedure minimizes a second order U-process and does not require any user-chosen smoothing parameters. Consistency and asymptotic normality were established under the standard regularity √ conditions, and it converges at the rate of n. I also extended the estimation method to the Box– Cox model with the median restriction and general forms of random censoring. Monte Carlo experiments show that all methods satisfy adequate finite sample properties. I may consider several extensions of results in this paper. First, similar estimation procedures can be suggested for other types of data transformation models. Especially, I can extend the result to any monotone transformation models since the main property I used in identification is monotonicity. Secondly, I may consider partial identification of the Box–Cox transformation model. If all covariates are discrete or some of them are observed as an interval, which is quite often in microeconomic data, point identification in this paper does not hold anymore. Combining with the result in Manski and Tamer (2002), I can achieve a partial identification that still has useful meaning in application. I leave these extensions for future research.
ACKNOWLEDGMENTS This paper is based on my thesis chapter. I am greatly indebted to my advisor, Shakeeb Khan for his encouragement and invaluable guidance. I am also grateful to the co-editor, Oliver Linton, C The Author(s). Journal compilation C Royal Economic Society 2008.
Box–Cox transformation model
531
and two anonymous referees for helpful comments on earlier versions which led to significant improvements. All errors are mine.
REFERENCES Amemiya, T. and J. Powell (1981). A comparison of the Box-Cox maximum likelihood estimator and the non-linear two-stage least squares estimator. Journal of Econometrics 17, 351–81. Asparouhova, E., R. Golanski, K. Kasprzyk, R. P. Sherman and T. Asparouhov (2002). Rank estimators for a transformation model. Econometric Theory 18, 1099–120. Bickel, P. J. and K. A. Doksum (1981). An analysis of transformations revisited. Journal of the American Statistical Association 76, 296–311. Billingsley, P. (1995). Probability and Measure. New York: Wiley and Sons. Box, G. and D. Cox (1964). An analysis of transformations. Journal of the Royal Statistical Society, Series B 26, 211–52. Carroll, R. and D. Ruppert (1984). Power transformations when fitting theoretical models to data. Journal of the American Statistical Association 79, 321–8. Dominguez, M. and I. Lobato (2004). Consistent estimation of models defined by conditional moment restrictions. Econometrica 72, 1601–15. Han, A. (1987a). Non-parametric analysis of a generalized regression model. Journal of Econometrics 35, 303–16. Han, A. (1987b). A non-parametric analysis of transformations. Journal of Econometrics 35, 191–209. Khan, S. and E. Tamer (2008). Inference on endogenously censored regression models using conditional moment inequalities. Forthcoming in Journal of Econometrics. Manski, C. (1985). Semiparametric analysis of discrete response: asymptotic properties of maximum score estimation. Journal of Econometrics 27, 313–34. Manski, C. F. and E. Tamer (2002). Inference on regressions with interval data on a regressor or outcome. Econometrica 70, 519–46. Newey, W. K. and D. McFadden (1994). Large sample estimation and hypothesis testing. In R. F. Engle and D. McFadden (Eds.), Handbook of Econometrics, Volume 4, 2111–245. Amsterdam: Elsevier. Pakes, A. and D. Pollard (1989). Simulation and the asymptotics of optimization estimators. Econometrica 57, 1027–57. Serfling, R. (1980). Approximation Theorems of Mathematical Statistics. New York: Wiley. Sherman, R. (1994a). Maximal inequalities for degenerate u-processes with applications to optimization estimators. Annals of Statistics 22, 439–59. Sherman, R. (1994b). U-processes in the analysis of generalized semiparametric regression estimator. Econometric Theory 10, 372–95.
APPENDIX: PROOFS OF RESULTS Proof of Lemma 2.1: Note that, for any (λ, β) = (λ 0 , β 0 ), I have (λ ) PX E yi(λ) − xi β|xi = 0 = PX E yi(λ) − xi β|xi = E yi 0 − xi β0 |xi (λ ) = PX E yi(λ) − yi 0 |xi = xi (β − β0 ) . C The Author(s). Journal compilation C Royal Economic Society 2008.
(2.3)
532
Y. Shin
Thus it is enough to show that the equality inside (2.3) does not hold in some support of x i . If λ = λ 0 , the result follows immediately from the Assumption I2 as in a linear regression model. I consider the following two cases when λ = λ 0 : (λ )
(i) β −1 = β 0,−1. First, I assume that λ > λ 0 . By the monotone property, E(yi(λ) − yi 0 |xi ) ≥ 0 x i − a.s. However, Assumption I3 (a) implies that I can find some x i such that x i (β − β 0 ) < 0. I can prove it in similar way for the case of λ < λ 0 . (λ ) (ii) β −1 = β 0,−1. Then, the equality condition becomes E(yi(λ) − yi 0 |xi ) = (β1 − β0,1 ). Note that the class of transformation {y i(λ) } is non-linear with respect to λ and that (β 1 − β 0,1 ) is a constant. Considering the shape of transformations, it is sufficient for the result that y i has at least three different conditional mean values, which comes directly from the Assumption I3. Proof of Lemma 2.2: From the Lemma 2.1 and (2.6) it is clear that Q(θ 0 ) = 0 and Q(θ) > 0 for all θ = θ 0. Proof of Theorem 2.1: I verify the conditions of Theorem 2.1 in Newey and McFadden (1994). Compactness and Identification follows from Assumption C1 and Lemma 2.2, respectively. It is easy to check that Q(θ ) is continuous since g(·, ·, θ) is continuous in θ and the indicator function inside H (·, θ ) does not contain θ as its arguments. It remains to show uniform convergence of Q n (θ ) to Q(θ ). To simplify the proof, I assume that regressor values lie in a compact set. Clearly Assumption C3 implies that (A.1) E sup |g(xi , yi , θ )| < ∞. θ∈
Thus, I can apply Lemma 2.4 in Newey and McFadden (1994) to conclude that sup
xl ∈SX ,θ∈
p
|Hn (xl , θ ) − H (xl , θ )| → 0.
(A.2)
Finally, I get the desired result by applying existing uniform law of large numbers for U-statistics. Since H (x l , θ) has finite variation on a compact space, it satisfies a Lipschitz condition in θ. Thus Lemma 2.13 in Pakes and Pollard (1989) implies that the functional space H (·, θ) is Euclidean for some envelope which is also in L2 space. Combining this result with Corollary 7 in Sherman (1994a), the following is established: sup |Qn (θ ) − Q(θ)| = Op (n−1/2 ),
(A.3)
θ∈
which is more than enough to show the uniform convergence.
Proof of Theorem 2.2: I prove the asymptotic normality by providing a locally quadratic approximation function √ of the objective function. Specifically, I follow the approach in Sherman (1994b). First, I need to show n-consistency. Let n (θ ) be defined as: n (θ ) = Qn (θ ) − Qn (θ0 ).
(A.4)
I may √ also define (θ ) in similar way. Then, Theorem 1 in Sherman (1994b) provides sufficient conditions for n-consistency as follows: (i) (ii) (iii)
θM − θ0 = op (1) There exists a neighbourhood N of θ 0 and a constant κ > 0 such that (θ ) ≥ κ||θ − θ 0 ||2 for ∀θ ∈ N . Uniformly over o p (1) neighbourhoods of θ 0 , √ n (θ ) = (θ ) + Op (||θ − θ0 || / n) + op (||θ − θ0 ||2 ) + Op (1/n). (A.5) C The Author(s). Journal compilation C Royal Economic Society 2008.
Box–Cox transformation model
533
√ If I have shown n-consistency, I can apply Theorem 2√in Sherman (1994a) to derive asymptotic normality. A sufficient condition is that uniformly over Op (1/ n) neighbourhoods of θ 0 , n (θ ) =
1 1 (θ − θ0 ) V (θ − θ0 ) + √ (θ − θ0 ) Wn + op (1/n), 2 n
(A.6)
where V is a positive definite matrix, and W n converges in distribution to a N (0, ). It is easy to check that Theorem 2.1 and Lemma 2.2 imply conditions (i) and (ii) respectively. Therefore, √ for n-consistency and asymptotic normality, it is enough to show that n (θ ) =
1 1 (θ − θ0 ) V (θ − θ0 ) + √ (θ − θ0 ) Wn + op (||θ − θ0 ||2 ) + op (1/n) 2 n
(A.7)
uniformly in o p (1) neighbourhoods of θ 0 . (xl , θ )2 into the double summation To show the result, I first expand H n (xl , θ )2 = 1 g(xi , yi , θ )2 I (xi ≤ xl ) H 2 n i=1 1 g(xj , yj , θ)g(xk , yk , θ )I (xj ≤ xl )I (xk ≤ xl ). + 2 n j =k
(A.8)
Since g(·, ·, θ) is square integrable for all θ ∈ , the first term is o p (1) and can be ignored in asymptotic theory. Thus, I can use the following 3rd order U-process instead of the sample objective function Q n (θ ): n 1 1 g(xj , yj , θ )g(xk , yk , θ )I (xj ≤ xl )I (xk ≤ xl ). n l=1 n2 j =k
(A.9)
Now I can see where the function h(z j , z k , z l , θ) in (2.12) comes from. Furthermore, I define f (z j , z k , z l , θ ) for later use: f (zj , zk , zl , θ ) = h(zj , zk , zl , θ ) − h(zj , zk , zl , θ0 ). Next, I apply U-processes decomposition to n (θ ) as in Serfling (1980): 1 1 f1 (zi , θ ) + f2 (zi , zj , θ ) n (θ ) = (θ ) + n i n(n − 1) i=j 1 f3 (zi , zj , zk , θ ), + n(n − 1)(n − 2) i=j =k
(A.10)
where f i (·, θ ) is a degenerate U-process of order i. For example, f1 (z, θ ) = f (z, P , P , θ) + f (P , z, P , θ) + f (P , P , z, θ) − 3(θ ). I next establish the result (A.7) by evaluating each term of the equation (A.10). First, I do the Taylor expansion of γ (z, θ ) around θ 0 and take expectation on both sides. From E[γ (z, θ ) − γ (z, θ 0 )] = 3(θ) and the first order condition, I get the following result: (θ) =
1 (θ − θ0 ) V (θ − θ0 ) + o(||θ − θ0 ||2 ) 2
(A.11)
uniformly over o p (1) neighbourhoods of θ 0 . I now turn my attention to the 2nd term and expand it around θ 0 . Then, I get the following immediately: 1 1 f1 (zi , θ ) = √ (θ − θ0 ) Wn + op (||θ − θ0 ||2 ), (A.12) n i n where Wn =
√1 n
i
∇1 γ (zi , θ0 ) ⇒ N (0, ).
C The Author(s). Journal compilation C Royal Economic Society 2008.
534
Y. Shin
Finally it follows from the Euclidean property of the class of the functions in the objective function and Corollary 8 in Sherman (1994a) that the last two terms in (A.10) are negligible: 1 1 f2 (zi , zj , θ ) + f3 (zi , zj , zk , θ ) = op (1/n). n(n − 1) i=j n(n − 1)(n − 2) i=j =k Therefore, the desired result (A.7) is established by combining (A.11), (A.12) and (A.13).
(A.13)
Proof of Lemma 3.1: Consider the first case θ = θ 0 . Then, τ1 (xi , θ0 ) = P (εi ≥ 0|xi ) −
1 =0 2
xi − a.s.
Next, I consider the case θ = θ 0 . Let x i δ 1 = x i β − x i β 0 and δ2 (yi , λ) = yi
(λ0 )
1 τ1 (xi , θ ) = P yi(λ) ≥ xi β|xi − 2
(A.14) − yi(λ) . Then, (A.15)
1 = P yi(λ) ≥ xi δ1 + xi β0 |xi − 2
(A.16)
1 (λ ) = P yi(λ) ≥ xi δ1 + yi 0 − εi |xi − 2
(A.17)
1 = P εi ≥ xi δ1 + δ2 (yi , λ) |xi − . 2
(A.18)
For θ = θ 0 , I know that either x i δ 1 = 0 or δ 2 (y i , λ) = 0. Noting that δ 2 (y i , λ) > 0 for all y i if and only if λ 0 > λ, I can easily find some support of xi such that x i δ 1 + δ 2 (y i , λ) = 0, which implies that τ 1 (x i , θ ) = 0 with positive probability. Proof of Theorem 3.1: It is enough to show that the conditions of Theorem 2.1 in Newey and McFadden (1994) are satisfied. Compactness and continuity follows from the Assumption C1 and C3, respectively. Identification holds from Lemma 3.2. It only remains to show uniform convergence of Q n (θ ) to Q(θ): p
sup |Qn (θ) − Q (θ )| → 0.
(A.19)
θ∈
To simplify the proof, I assume that this regressor values lie in a compact set. Note that H n (x l , θ ) is bounded since it is multiplied by an indicator function. Thus, I can apply Lemma 2.4 in Newey and McFadden (1994) to conclude that sup
xl ∈SX ,θ∈
p
|Hn (xl , θ ) − H (xl , θ )| → 0.
(A.20)
Therefore, I can apply existing uniform law of large numbers for U-statistics. The functional space indexed by θ is Euclidean for a constant envelop; see Example 2.11 in Pakes and Pollard (1989). Combining this result with Corollary 7 in Sherman (1994a), the following is established: (A.21) sup |Qn (θ ) − Q (θ )| = Op n−1/2 , θ∈
which is stronger result than the uniform convergence.
C The Author(s). Journal compilation C Royal Economic Society 2008.
Box–Cox transformation model
535
n (θ ) = Qn (θ ) − Qn (θ0 ) .
(A.22)
Proof of Theorem 3.2: Note that
I need to show that n (θ ) =
1 1 (θ − θ0 ) V (θ − θ0 ) + √ (θ − θ0 ) Wn + op ||θ − θ0 ||2 + op (1/n) . 2 n (A.23)
Expand H n (x l , θ )2 into the double summation Hn (xl , θ )2 =
n 1 2 1 (λ) I y ≥ x β − · I (xi ≤ xl ) i i n2 i=1 2 1 1 1 I yk(λ) ≥ xk β − I yj(λ) ≥ xj β − + 2 n j =k 2 2 ×I xj ≤ xl I (xk ≤ xl ) .
(A.24)
The first term converges to 0 and can be ignored asymptotically. The objective function Q n (θ ) will be the third order U-process n 1 1 1 1 (λ) (λ) I y I y ≥ x β − ≥ x β − j k j k n l=1 n2 j =k 2 2 ×I xj ≤ xl I (xk ≤ xl ) . 1 1 h (z1 , z2 , z3 , θ ) = I yj(λ) ≥ xj β − I yk(λ) ≥ xk β − 2 2
(A.25)
(A.26)
× I xj ≤ xl I (xk ≤ xl ) γ (z, θ ) = h (z, P , P , θ) + h (P , z, P , θ) + h (P , P , z, θ) ,
(A.27)
where h(z, P , P , θ) denotes the conditional expectation given the first argument and the remaining two terms are defined in similar way. f (z1 , z2 , z3 , θ ) = h (z1 , z2 , z3 , θ ) − h (z1 , z2 , z3 , θ0 ) .
(A.28)
n (θ ) = (θ ) + Pn f1 (·, θ) + Un2 f2 (·, θ) + Un3 f3 (·, θ ) ,
(A.29)
f1 (z, θ ) = f (z, P , P , θ) + f (P , z, P , θ) + f (P , P , z, θ) − 3 (θ ) .
(A.30)
Decompose n (θ )
where
Then the first term is (θ) =
1 (θ − θ0 ) V (θ − θ0 ) + o ||θ − θ0 ||2 , 2
C The Author(s). Journal compilation C Royal Economic Society 2008.
(A.31)
536
Y. Shin
where 3V = E[∇ 2 γ (z, θ 0 )] = 3∇ 2 Q(θ 0 ). This comes from the Taylor expansion of γ (z, θ ) around θ 0 and taking expectation γ (z, θ ) = γ (z, θ0 ) + (θ − θ0 ) ∇1 γ (z, θ0 ) 1 + (θ − θ0 ) ∇2 γ (z, θ0 ) (θ − θ0 ) 2 + o ||θ − θ0 ||2 E [γ (z, θ ) − γ (z, θ0 )] =
1 (θ − θ0 ) E [∇2 γ (z, θ0 )] (θ − θ0 ) 2 +o ||θ − θ0 ||2
(A.32)
(A.33)
3 (θ ) =
1 (θ − θ0 ) 3V (θ − θ0 ) + o ||θ − θ0 ||2 2
(A.34)
(θ ) =
1 (θ − θ0 ) V (θ − θ0 ) + o ||θ − θ0 ||2 . 2
(A.35)
The second term 1 Pn f1 (z, θ) = √ (θ − θ0 ) Wn + o ||θ − θ0 ||2 , n √ where Wn = nPn ∇1 γ (z, θ0 ) =⇒ N (0, ) and = E[∇ 1 γ (z, θ 0 )∇ 1 γ (z, θ 0 ) ]. The last two terms are negligible as before. Proof of Lemma 3.3: Consider the first case θ = θ 0 . It is enough to show that τ 1 (x i , θ 0 ) is non-positive and τ 0 (x i , θ ) is non-negative for all x i . First look at τ 1 (x i , θ 0 ): 1 (λ ) τ1 (xi , θ0 ) = P vi 0 ≥ xi β0 |xi − 2
(A.36)
1 = P min xi β0 + εi , ci ≥ xi β0 |xi − 2
(A.37)
1 = P εi ≥ 0, ci ≥ xi β0 |xi − 2
(A.38)
≤ P (εi ≥ 0|xi ) −
1 = 0. 2
(A.39)
Thus, τ 1 (x i , θ 0 ) ≤ 0 for all x i . To see τ 0 (x i , θ ) is non-negative, 1 (λ ) − P di = 1, vi 0 ≤ xi β0 |xi 2
(A.40)
=
1 − P (di = 1, εi ≤ 0|xi ) 2
(A.41)
≥
1 − P (εi ≤ 0|xi ) = 0. 2
(A.42)
τ0 (xi , θ ) =
C The Author(s). Journal compilation C Royal Economic Society 2008.
Box–Cox transformation model
537
Now, I consider the case θ = θ 0 . It is enough to show that τ 1 (x i , θ ) > 0 or τ 0 (x i , θ ) < 0 on (x i , c i ) ∈ S xc . Thus, I restrict my attention on (x i , c i ) in the subset of S xc . Let x i δ 1 = x i β − x i β 0 and δ2 (vi , λ) = (λ ) vi 0 − vi(λ) . Then, θ = θ 0 can be divided into the following two cases. (i)
Case 1: x i δ 1 + δ 2 (v i , λ) ≤ 0. Then, 1 τ1 (xi , θ ) = P viλ ≥ xi β|C − 2 1 (λ ) = P vi 0 − δ2 (vi , λ) ≥ xi δ1 + xi β0 |C − 2 = P εi ≥ xi δ1 + δ2 (vi , λ) , ci ≥ xi β0 + xi δ1 + δ2 (vi , λ) |C 1 − 2 1 = P εi ≥ xi δ1 + δ2 (vi , λ) |C − > 0. 2
(ii)
(A.43)
(A.44)
(A.45)
(A.46)
Case 2: x i δ 1 + δ 2 (v i , λ) > 0. Then 1 − P di = 1, vi(λ) ≤ xi β|C 2
(A.47)
=
1 (λ ) − P di = 1, vi 0 − δ2 (vi , λ) ≤ xi δ1 + xi β0 |C 2
(A.48)
=
1 − P εi ≤ min ci − xi β0 , xi δ1 + δ2 (vi , λ) |C 2
(A.49)
τ0 (xi , θ ) =
< 0.
(A.50)
C The Author(s). Journal compilation C Royal Economic Society 2008.
Econometrics Journal (2008), volume 11, pp. 538–553. doi: 10.1111/j.1368-423X.2008.00252.x
A semiparametric derivative estimator in log transformation models C HUNRONG A I † ‡ AND E DWARD C. N ORTON § ,
†
University of Florida, Gainesville, FL E-mail:
[email protected]
‡
Shanghai University of Finance and Economics (SUFE), China § The University of North Carolina at Chapel Hill, Chapel Hill, NC E-mail:
[email protected] First version received: September 2005; final version accepted: April 2008
Summary This paper considers a regression model with a log-transformed dependent variable. The log transformed model is estimated by simple least squares, but computing the conditional mean of the dependent variable on the original scale given the explanatory variables requires knowing the conditional distribution of the error term in the transformed model. We show how to obtain a consistent estimator for the conditional mean and its derivatives without specifying the conditional distribution of the error term. The asymptotic distribution of the estimator is derived. The proposed procedure is then illustrated via a simulation study. Keywords: Asymptotic distribution, Conditional mean, Derivative estimator, Log transformation, Series estimator.
1. INTRODUCTION Applied economists often estimate models with a log-transformed dependent variable. Common justifications for using the logarithmic transformation include to deal with a dependent variable that is badly skewed to the right and to compute elasticities (Manning, 1998). The log transformation can also deal with heteroskedasticity. The conditional mean of the logtransformed dependent variable given the explanatory variables is usually estimated by simple least squares. The conditional mean of the dependent variable on the original scale, however, depends on the conditional distribution of the error term in the log transformed model. Consequently, the derivatives of the conditional mean on the original scale with respect to the explanatory variables also depend on the conditional distribution of the error term. Therefore, any estimates of the conditional mean and its derivatives must adjust for the error term distribution. Failure to account for the conditional distribution may lead to substantially biased estimates. There are three approaches to account for the error term distribution. The first is the parametric approach, which specifies the conditional distribution of the error term parametrically and then computes the conditional mean and its derivatives either analytically (Manning, 1998, Mullahy, 1998, Manning and Mullahy, 2001), or numerically (Abrevaya, 2002). Such an approach, however, may yield misleading results if the functional form of the conditional C The Author(s). Journal compilation C Royal Economic Society 2008. Published by Blackwell Publishing Ltd, 9600 Garsington Road,
Oxford OX4 2DQ, UK and 350 Main Street, Malden, MA, 02148, USA.
A semiparametric derivative estimator
539
distribution is misspecified. In practice, the functional form of the conditional distribution is rarely known. Hence, this approach is not robust. The second approach decomposes the error term into a standard error multiplied by a standardized residual term. The standardized residual term is assumed to be independent of the explanatory variables and has an unknown distribution. The standard error is fully parametrized (Ai and Norton, 2000, Abrevaya, 2002). This approach clearly imposes fewer restrictions on the conditional distribution of the error term than the first approach and hence is more robust. Still, this approach can yield biased estimates if either the independence condition on the standardized residual term is not satisfied or if the parametrization of the standard error is misspecified. The third approach, which is adopted in this paper, assumes that the conditional distribution of the error term given the explanatory variables is completely unknown. This approach is semiparametric and therefore most robust. In particular, this approach allows for heteroskedasticity of any form, a problem that is deemed to be particularly difficult to deal with in practice (see Manning, 1998). Under this specification, we show how to obtain consistent estimates for the conditional mean and its derivatives in Section 2. We also derive the asymptotic distribution of the proposed estimators and provide consistent estimates for the asymptotic variance in Section 3. A potential criticism of our model is why it is necessary to have the first step log regression because the conditional mean of the dependent variable (on the original scale) given the explanatory variables and its derivatives can always be estimated consistently by non-parametric regression of the dependent variable (on the original scale) on the explanatory variables. There are at least two reasons. First, in some applications, researchers are not just interested in the marginal and interaction effects of the dependent variable (on the original scale) but also are interested in the semi-elasticity coefficients. The log regression equation provides a consistent estimate for the elasticity coefficients. Second, researchers may have some prior knowledge about the conditional distribution of the error term given the explanatory variables. For instance, the conditional distribution of the error term in the log regression may depend on a subset of explanatory variables. The log regression equation allows us to exploit such prior knowledge, while the non-parametric regression of the dependent variable (on the original scale) on all explanatory variables does not. Finally, we evaluate the finite sample performance of our estimators via a small scale simulation study in Section 4. Technical proofs are relegated to the Appendix.
2. ESTIMATION Before we introduce the model, we adopt the following notation conventions: bold lowercase letters denote random variables; standard lowercase letters denote the realizations of the random variables; and E{·} denotes the expectation taken with respect to the distribution of the bold letters in the bracket. Assume that the original dependent variable y is possibly a non-linear function of K explanatory variables x, excluding a constant and a random error term u. The model is given by ln(y) = h(x, β0 ) + u, C The Author(s). Journal compilation C Royal Economic Society 2008.
540
C. Ai and C. Norton
where h(.,.) is a known measurable function, β 0 is a vector of unknown parameters (including the constant), and the error term u satisfies the conditional mean restriction E{u|x = x} = 0 for almost all x. In most applications, the function h(x, β 0 ) is linear in parameters β 0 . Here we allow h(x, β 0 ) to be non-linear in both x and β 0 . We assume that no element of x can be expressed as a function of the other elements, for example, there are no higher-order or interaction terms in x. This assumption is not as restrictive as it appears because it can always be satisfied by redefining the function h. For example, if x includes income and income squared, then the function h(x, β 0 ) can be redefined as a function of income. Inverting the log function, we obtain the dependent variable on the original scale y = exp[h(x, β0 ) + u] = exp[h(x, β0 )] × exp(u). Let z denote a subset of or equal to x and suppose that the conditional density of u given x is assumed to depend on z only. Let D(z) be Duan’s (1983) smearing estimator, the mean of the exponentiated residuals conditional on z. The conditional mean of y given the explanatory variables, denoted F (x, β0 ), is F (x, β0 ) = E{y|x = x} = exp[h(x, β0 )] × E{exp(u)|z = z} = exp[h(x, β0 )] × D(z). The marginal effects of the explanatory variables are found by taking derivatives (or differences) of the conditional mean with respect to the continuous (discrete) regressors. Let m = (m 1 , m 2 , . . . , m K ) denote a vector of non-negative integers and define |m| = m 1 + m 2 + · · · + m K . Define μm (x) = ∂ m F (x, β0 ) ≡
|m| F (x, β0 ) , x1m1 x2m2 ...xKmK
where denotes either the derivative or the difference operator depending on whether x is continuous or discrete. For example, with x = (x 1 , x 2 ), x 1 continuous, x 2 a 0–1 dummy, and m = (1, 1) , we have μm (x) = ∂ m F (x, β0 ) =
∂F (x1 , 1, β0 ) ∂F (x1 , 0, β0 ) − . ∂x1 ∂x1
For the same example with m = (1, 0) , we have the marginal effect of a continuous variable x 1 μm (x) = ∂ m F (x, β0 ) =
∂F (x, β0 ) ; ∂x1
and with m = (0, 1) , we have the incremental effect for a discrete change in a dichotomous variable x 2 μm (x) = ∂ m F (x, β0 ) = F (x1 , 1, β0 ) − F (x1 , 0, β0 ) . For convenience, we define μ m (x) = F (x, β 0 ) when m = 0. Thus, μ m (x) encompasses the estimands of interest, such as the conditional mean, and the marginal and interaction effects of the explanatory variables (Ai and Norton, 2003, Norton et al., 2004). C The Author(s). Journal compilation C Royal Economic Society 2008.
A semiparametric derivative estimator
541
The focus of this paper is to present a consistent estimator for the derivative μ m (x) and for the average derivative μ m = E{μ m (x)} and to derive the asymptotic distributions of these estimators. To estimate the derivative μ m (x), we need to estimate the conditional mean function F (x, β 0 ) which depends on the unknown parameter β 0 and the unknown function D(z). The unknown parameter β 0 can be estimated by standard regression techniques. Given a sample {y i , x i , i = denote the regression estimator of β 0 . We shall not concern ourselves with 1, 2, . . . , n}, let β , which are well established in the literature. the derivation of the asymptotic properties of β is Instead we will assume that log regression equation satisfies standard conditions so that β √ n consistent. For instance, the simple least-squares estimator β has the following influence representation − β0 ) = n (β0 )−1 (β
n ∂h(xi , β0 )
∂β
i=1
with n (β) =
n ∂h(xi , β) ∂h(xi , β)
∂β
∂β
i=1
ui + op (n−1/2 ),
.
with w(z) as the weighting function, the influence For the weighted least-squares estimator β representation is − β0 ) = n (β0 )−1 (β
n i=1
with n (β) =
1 ∂h(xi , β0 ) ui + op (n−1/2 ), w(zi ) ∂β
n ∂h(xi , β) ∂h(xi , β) i=1
∂β
∂β
1 . w 2 (zi )
In general we assume:
ASSUMPTION 2.1. β̂ is √n consistent and has the following influence representation:
(β̂ − β0) = Σ_{i=1}^n δ(xi) ui + op(n^{−1/2}).
Denote u(y, x, β) = ln(y) − h(x, β) and ûi = u(yi, xi, β̂).
The unknown function D(z) depends on the unknown conditional distribution of the error term u given the explanatory variables z and thus cannot be estimated by a simple parametric regression. We propose to use a parametric approximation and then estimate that parametric approximation by least squares. Specifically, for some integer J, let pJ(z) = [p1(z), . . . , pJ(z)]′ denote the approximating functions, so that there is a π such that the parametric function pJ(z)′π approximates D(z) well. Examples of the approximating functions include polynomials, splines and Fourier series. Let {zi, i = 1, 2, . . . , n} denote a sample of observations on z. Denote
P = [pJ(z1), pJ(z2), . . . , pJ(zn)]′;
Q(β) = {exp[u(y1, x1, β)], . . . , exp[u(yn, xn, β)]}′;
Q̂ = Q(β̂);
D(z, β) = E{exp[u(y, x, β)] | z}.
Then D(z) is estimated by regressing exp(ûi) on pJ(zi):
D̂(z) = pJ(z)′(P′P)^{−1} P′Q̂.
The derivative of the conditional mean is now estimated by
μ̂m(x) = ∂^m{exp[h(x, β̂)] × pJ(z)′}(P′P)^{−1} P′Q̂,
and the average derivative is estimated by
μ̂m = (1/n) Σ_{i=1}^n ∂^m{exp[h(xi, β̂)] × pJ(zi)′}(P′P)^{−1} P′Q̂.
In the following sections, we derive the asymptotic distributions of μ̂m(x) and μ̂m.
3. ASYMPTOTIC RESULTS
We first derive the asymptotic properties of μ̂m(x) and then derive the asymptotic properties of μ̂m. The derivation of the asymptotic properties of μ̂m(x) draws heavily from Newey (1997). We begin by introducing some regularity conditions.
ASSUMPTION 3.1. {(yi, xi), i = 1, 2, . . . , n} are drawn independently from the joint distribution of (y, x). The conditional density of u given x is the same as the conditional density of u given z.
The first part of this condition rules out dependent data and hence is restrictive. The main result, however, can be generalized to dependent data using the results of Ai and Sun (2005). The second part of the condition is part of the model specification. It can be tested by employing some of the non-parametric techniques proposed in the literature. Let ‖B‖ = √trace(B′B) be the Euclidean norm of a matrix B. Also, let X denote the support of x and Z denote the support of z.
ASSUMPTION 3.2. For every J there is a non-singular constant matrix B such that: (i) the smallest eigenvalue of E{B × pJ(z)pJ(z)′ × B′} is bounded away from zero uniformly in J; and (ii) there is a sequence of constants ζ0(J) satisfying sup_{z∈Z} ‖pJ(z)‖ ≤ ζ0(J) and J = J(n) such that ζ0(J)²J/n → 0 as n → ∞.
This condition imposes restrictions on the approximating functions. Conditions of this sort are common in the literature on series estimation; see Newey (1997) and Andrews (1991). When the density of z is bounded away from zero, the constant ζ0(J) is computed for splines and power
series as c√J and cJ, respectively, for some constant c; and the restriction in part (ii) is satisfied by J²/n → 0 and J³/n → 0, respectively. For any vector λ = (λ1, . . . , λn)′, denote
ζ|λ|(J) = max_{|δ|≤|λ|} sup_{z∈Z} ‖∂^δ[pJ(z)]‖.
Denote ε = exp(u) − D(z). Because we approximate D(z) by pJ(z)′π, the approximation error will cause bias in the proposed estimator. To control the bias, the approximation error must shrink to zero as more terms are added to the approximating functions. We now specify a rate of approximation for the approximating functions.
ASSUMPTION 3.3. (i) Z is compact; (ii) there are α and π such that
sup_{z∈Z} |∂^λ(D(z) − pJ(z)′π)| = O(J^{−(α−|λ|)/K}) as J → ∞
for any λ ≤ m; (iii) √n J^{−(α−|λ|)/K} → 0; and (iv) ζ|λ|(J)√J/√n → 0.
ASSUMPTION 3.4. E{ε⁴ | z} is bounded, and σε²(z) = E{ε² | z} is bounded and bounded away from zero for all z ∈ Z.
ASSUMPTION 3.5. (i) For every x, h(x, β) is twice continuously differentiable with respect to β in a neighbourhood of β0; (ii) exp(u(y, x, β)), ∂exp(u(y, x, β))/∂β and ∂²exp(u(y, x, β))/∂β∂β′ satisfy the stochastic dominance condition in a neighbourhood of β0; and (iii) for each element βr of β,
Var{∂exp[u(y, x, β0)]/∂βr | x}
is bounded, and
|∂^λ(∂D(z, β0)/∂βr) − ∂^λ pJ(z)′πr| = O(J^{−(αr−|λ|)/K})
for some πr and αr.
Assumption 3.3(i) requires that the explanatory variables have bounded support. This condition can always be satisfied by discarding observations with large values. Assumption 3.3(ii) requires that the approximation error shrinks at a polynomial rate. This condition is satisfied by spline and power series approximations, with α the degree of smoothness of D(z). Assumption 3.4 requires that the fourth conditional moment is bounded and that the conditional variance is bounded both from above and below. This condition is common in the regression literature. Assumption 3.5 is needed so that we can replace the estimate β̂ by the true value. Assumption 3.5(ii) is a stochastic dominance condition that is commonly imposed in the non-linear econometrics literature.
Denote the regression residuals ε̂i = exp(ûi) − D̂(zi). Denote
σε²(z) = E{ε² | z};
Vnm(x) = ∂^m{exp[h(x, β0)] × pJ(z)′}(P′P)^{−1} [Σ_{i=1}^n σε²(zi) pJ(zi)pJ(zi)′] (P′P)^{−1} ∂^m{exp[h(x, β0)] × pJ(z)};
V̂nm(x) = ∂^m{exp[h(x, β̂)] × pJ(z)′}(P′P)^{−1} [Σ_{i=1}^n ε̂i² pJ(zi)pJ(zi)′] (P′P)^{−1} ∂^m{exp[h(x, β̂)] × pJ(z)}.
The following theorem is proved in the Appendix.
THEOREM 3.1. Under Assumptions 2.1 and 3.1–3.5, we show: (1) Vnm(x)^{−1/2}[μ̂m(x) − μm(x)] has an asymptotically standard normal distribution; and (2) Vnm(x)^{−1/2}V̂nm(x)^{1/2} → 1 in probability.
Part (1) of the theorem shows that the proposed estimator is consistent and asymptotically normally distributed. Part (2) provides a consistent estimator for the variance. These two results allow us to conduct statistical inference on the estimands. For example, the t-ratio μ̂m(x)/√V̂nm(x) can be used for significance tests of the derivative μm(x): if |μ̂m(x)/√V̂nm(x)| > 1.96, then the pointwise derivative μm(x) is statistically significant at the 5% level. The theorem also reveals that the estimate of the finite dimensional parameter has no effect on the asymptotic distribution of the estimated derivative. This result is not surprising because the estimator of the finite dimensional parameter converges to the true value at a faster rate.
It is worth pointing out that the asymptotic variance Vnm(x) shrinks as n goes to infinity, but at a rate slower than n^{−1}. To see this, note that, by Assumption 3.4, σε²(z) is bounded below and above, and hence for some constants c1, c2 > 0,
c1/(n λmax) ≤ Vnm(x) / [∂^m{exp[h(x, β0)] × pJ(z)′} ∂^m{exp[h(x, β0)] × pJ(z)}] ≤ c2/(n λmin),
where λmax and λmin are the largest and the smallest eigenvalues of P′P/n. Notice that the term ∂^m{exp[h(x, β0)] × pJ(z)′} ∂^m{exp[h(x, β0)] × pJ(z)} generally explodes as J goes to infinity. Hence, the estimator μ̂m(x) generally converges at a rate slower than √n.
The derivative estimator μ̂m(x) and its asymptotic variance can be computed as follows:
(i) Estimate the log-transformed model by least squares; save the predicted values on the log scale, h(x, β̂), and the regression residuals ûi = ln(yi) − h(xi, β̂).
(ii) Regress exp(ûi) on pJ(zi) using robust standard errors; save the regression coefficients in π̂ and their heteroskedasticity-consistent covariance matrix.
(iii) Compute μ̂m(x) = ∂^m{exp[h(x, β̂)] × pJ(z)′} π̂ and V̂nm(x) = ∂^m{exp[h(x, β̂)] × pJ(z)′} × [the estimated covariance matrix of π̂] × ∂^m{exp[h(x, β̂)] × pJ(z)}.
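To make the three steps concrete, the following sketch (our own illustration in Python, not code from the paper) carries them out for the simple linear case h(x, β) = x′β with a polynomial basis pJ(z); the function names and the basis choice are assumptions made only for this example.

```python
import numpy as np

def fit_log_model(y, X, z, J=2):
    """Sketch of steps (i)-(ii): least squares on the log scale, then a series
    regression of exp(residual) on a power basis p_J(z) = (1, z, ..., z^J)."""
    lny = np.log(y)
    beta, *_ = np.linalg.lstsq(X, lny, rcond=None)      # step (i)
    u_hat = lny - X @ beta

    P = np.vander(z, J + 1, increasing=True)            # step (ii)
    pi, *_ = np.linalg.lstsq(P, np.exp(u_hat), rcond=None)
    eps = np.exp(u_hat) - P @ pi                        # residuals of the series step
    PtP_inv = np.linalg.inv(P.T @ P)
    Sigma = PtP_inv @ (P.T * eps**2) @ P @ PtP_inv      # White covariance of pi
    return beta, pi, Sigma

def cond_mean_and_derivative(x_row, z0, beta, pi, k=1):
    """Step (iii) for one observation: the conditional mean exp(x'beta) D(z) and,
    when z = x_k, the derivative exp(x'beta)[beta_k D(z) + D'(z)]."""
    powers = np.arange(len(pi))
    D = np.sum(pi * z0**powers)
    dD = np.sum(pi[1:] * powers[1:] * z0**(powers[1:] - 1))
    mean = np.exp(x_row @ beta) * D
    deriv = np.exp(x_row @ beta) * (beta[k] * D + dD)
    return mean, deriv
```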
Next, we derive the asymptotic distribution of the estimated average derivative μ̂m. Denote
Γ1n = Σ_{i=1}^n ∂^m{exp[h(xi, β0)] × pJ(zi)′}(P′P)^{−1},
Γ2n = Σ_{i=1}^n ∂^m{∂[exp[h(xi, β0)] D(zi, β0)]/∂β′},
Ωn(x) = Γ1n pJ(z)pJ(z)′Γ′1n σε²(z) + Γ2n δ(x)δ(x)′Γ′2n σu²(z) + 2Γ1n pJ(z)δ(x)′Γ′2n ρ(z),
vnm = (1/n²) Σ_{i=1}^n [(∂^m{exp[h(xi, β0)] × D(zi)} − μm)² + Ωn(xi)],
where σu²(z) = E{u² | z} and ρ(z) = E{u × ε | z}. Denote
v̂nm = (1/n²) Σ_{i=1}^n [(∂^m{exp[h(xi, β̂)] × D̂(zi)} − μ̂m)² + Ω̂n(xi)],
with
Ω̂n(xi) = [Γ̂1n pJ(zi) ε̂i + Γ̂2n δ̂(xi) ûi]²,
Γ̂1n = Σ_{i=1}^n ∂^m{exp[h(xi, β̂)] × pJ(zi)′}(P′P)^{−1},
Γ̂2n = Σ_{i=1}^n ∂^m{∂[exp[h(xi, β̂)] D(zi, β̂)]/∂β′},
and with δ̂(xi) an estimator of δ(xi) obtained by replacing β0 with β̂. The following theorem is also proved in the Appendix.
THEOREM 3.2. Under Assumptions 2.1 and 3.1–3.5, we show that: (1) vnm^{−1/2}(μ̂m − μm) has an asymptotically standard normal distribution; and (2) vnm^{−1/2}v̂nm^{1/2} → 1 in probability.
Theorem 3.2 shows that the average derivative estimator is asymptotically normally distributed and provides a consistent estimator for the asymptotic variance. These results can be used for statistical inference on the average derivatives. For instance, if |μ̂m/√v̂nm| > 1.96, then the average derivative μm is statistically significant at the 5% level. It is interesting to note that, in the special case where z = x, we have ∂{exp[h(xi, β0)]D(zi, β0)}/∂β = 0, and in this case the estimated finite dimensional parameter has no effect on the asymptotic distribution of the average derivative estimator. This seems counter-intuitive because both the parameter estimator β̂ and the
average derivative estimator converge at the same rate and, in a sequential estimation like ours, the estimator from the first step generally affects the asymptotic distribution of the estimator in the second step. The estimated parameter β̂ affects the asymptotic distribution of the average derivative estimator through two channels: one through h(xi, β̂) and the other through the residual ûi. It so happens in this case that these two effects offset each other. In other cases, where z ⊂ x, ∂{exp[h(xi, β0)]D(zi, β0)}/∂β ≠ 0 and the estimated finite dimensional parameter β̂ does affect the asymptotic distribution of the average derivative estimator.
One potential criticism of our model specification is that the functional form of h(x, β) may not be the true conditional mean of the log-transformed dependent variable given the explanatory variables. The question then is whether our estimator is biased in this case. It is interesting to note that our estimator is still consistent if the true conditional mean has the form h(x, β) + q(z), where q(z) is some unknown function. This is because the term q(z) will be absorbed into u, and hence any bias resulting from misspecifying the conditional mean will be corrected through the regression residuals û. However, if h(x, β) is correctly specified, then our estimator utilizes more information than the simple average derivative estimator proposed by Powell et al. (1989).
Another potential criticism of our approach is that the number of approximating terms, J, is not uniquely determined by the sufficient conditions of Assumptions 2.1 and 3.1–3.5. In practice, these sufficient conditions are not very useful for choosing J. A feasible and practical way to determine J is to apply the cross-validation approach, which chooses J to minimize
Σ_{i=1}^n [exp(ûi) − pJ(zi)′(P′_{−i}P_{−i})^{−1}P′_{−i}Q̂_{−i}]²,
where P_{−i} and Q̂_{−i} denote P and Q̂ with the ith row deleted. The choice of J trades off bias of the conditional mean against variance.
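The leave-one-out criterion can be evaluated directly over a grid of candidate J. The sketch below is an assumption-laden illustration (power-series basis, plain least squares), not the authors' implementation.

```python
import numpy as np

def choose_J_by_cv(u_hat, z, J_max=6):
    """Pick J minimizing the leave-one-out criterion described above."""
    n = len(z)
    best_J, best_score = None, np.inf
    for J in range(1, J_max + 1):
        P = np.vander(z, J + 1, increasing=True)     # p_J(z) = (1, z, ..., z^J)
        score = 0.0
        for i in range(n):                           # delete the i-th observation
            mask = np.arange(n) != i
            pi_i, *_ = np.linalg.lstsq(P[mask], np.exp(u_hat[mask]), rcond=None)
            score += (np.exp(u_hat[i]) - P[i] @ pi_i) ** 2
        if score < best_score:
            best_J, best_score = J, score
    return best_J
```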
4. SIMULATION
4.1. Design
The asymptotic distributions of the proposed estimators are derived under the condition that the sample size is sufficiently large. In applications, sample sizes are often moderate or small. The question then is whether the large sample distributions still provide good approximations when sample sizes are moderate or small. To address this question, we conduct a small scale simulation study.
Our simulation study is based on a design with two continuous regressors x1 and x2. Both regressors are independent of each other and have a normal distribution with mean zero and variance 2, truncated so that |x1| ≤ 1 and |x2| ≤ 1. The dependent variable is generated by
y = exp(1 + x1 + x2 + u), so that ln(y) = 1 + x1 + x2 + u,
where the error term u is normally distributed with mean zero and variance 1 + x1 + x1². Hence, we obtain D(z) = exp[(1 + x1 + x1²)/2]. Corresponding to our notation, we have
x = (1, x1, x2)′, β = (β1, β2, β3)′ = (1, 1, 1)′, h(x, β) = x′β, z = x1.
We use this design to generate a sample {(yi, x1i, x2i), i = 1, 2, . . . , 300}, which is then used to estimate the parameters of interest.
4.2. Cross validation
Before we carry out the simulation study, we must determine the value of J. For an initial sample of observations, we estimate β by generalized least squares regression with 1 + x1 + x1² as the weighting function, and compute ûi = ln(yi) − xi′β̂. We use the power functions
pJ(z) = (1, z, z², . . . , z^J)′
to approximate D(z). For each value of J ∈ {1, 2, 3, . . .}, we compute the following N regressions: for each value of k ∈ {1, 2, . . . , N}, delete the kth observation, regress exp(ûi) on pJ(zi), and save the predicted value for the kth observation in D̂Jk. We then choose J to minimize
Σ_{k=1}^N (exp(ûk) − D̂Jk)².
For the initial sample, we generally find J = 2.
4.3. Monte Carlo
For the chosen J, we now conduct the simulation study. Using the above design, we generate R = 800 samples indexed by r ∈ {1, 2, . . . , 800}. For each generated sample (say, the rth sample), we compute the generalized least squares estimator β̂r and save the regression residuals ûri; we then regress exp(ûri) on pJ(zi) and denote the regression coefficients by π̂r. For each value z ∈ {−0.9, −0.8, . . . , 0.8, 0.9}, we compute the predicted values and the derivatives
D̂r(z) = pJ(z)′π̂r, dD̂r(z)/dz = dpJ(z)′π̂r/dz.
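A minimal sketch of this design, assuming simple rejection sampling for the truncated normal regressors, is:

```python
import numpy as np

rng = np.random.default_rng(0)

def truncated_normal(n, var=2.0, bound=1.0):
    """Draw from N(0, var) truncated to |x| <= bound by rejection sampling."""
    out = np.empty(n)
    filled = 0
    while filled < n:
        draw = rng.normal(0.0, np.sqrt(var), size=2 * n)
        draw = draw[np.abs(draw) <= bound]
        take = min(len(draw), n - filled)
        out[filled:filled + take] = draw[:take]
        filled += take
    return out

def generate_sample(n=300):
    """One replication of the design in Section 4.1 (our own sketch of it)."""
    x1 = truncated_normal(n)
    x2 = truncated_normal(n)
    u = rng.normal(0.0, np.sqrt(1.0 + x1 + x1**2))   # heteroskedastic error
    y = np.exp(1.0 + x1 + x2 + u)
    return y, x1, x2
```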
Using the sample observations, we also compute the following average derivatives (with respect to x1 and x2):
μ̂1r = (1/300) Σ_{i=1}^{300} [exp(xi′β̂r) β̂2r D̂r(zi) + exp(xi′β̂r) dpJ(x1i)′π̂r/dz],
μ̂2r = (1/300) Σ_{i=1}^{300} exp(xi′β̂r) β̂3r D̂r(zi).
Using the true parameter values, we also compute
D(z) = exp[(1 + z + z²)/2], dD(z)/dz = (0.5 + z) × exp[(1 + z + z²)/2],
μ̃1r = (1/300) Σ_{i=1}^{300} [exp(xi′β) β2 D(x1i) + exp(xi′β) dD(x1i)/dz],
μ̃2r = (1/300) Σ_{i=1}^{300} exp(xi′β) β3 D(x1i).
Then μ̃1r and μ̃2r are unbiased estimates of the average derivatives μ1 and μ2. We shall compare our estimator (μ̂1r, μ̂2r) to (μ̃1r, μ̃2r).
4.4. Results
First, consider the smearing function D(z). For each value of z = ±0.9, ±0.8, . . . , 0, we compute the means and the variances
D̄(z) = (1/800) Σ_{r=1}^{800} D̂r(z),
VD(z) = (1/800) Σ_{r=1}^{800} [D̂r(z) − D̄(z)]²,
dD̄(z)/dz = (1/800) Σ_{r=1}^{800} dD̂r(z)/dz,
VdD(z) = (1/800) Σ_{r=1}^{800} [dD̂r(z)/dz − dD̄(z)/dz]².
The true function D(z), the mean D̄(z) and the 95% confidence interval (plus and minus 1.96 times the standard deviation) are graphed in Figure 1, while Figure 2 does the same for the derivative function. These two graphs show that the estimated functions do extremely well at predicting the true functions D(z) and dD(z)/dz. The true functions lie entirely within the 95% confidence intervals. The estimated functions are quite close to the true functions, with the slight exception of the high end of the derivative function. Using J = 3 could improve the fit.
Figure 1. Estimated smearing function D(z) compared to actual (mean estimate with upper and lower bands, plotted against z).
Figure 2. Estimated dD(z)/dz compared to actual (mean estimate with upper and lower bands, plotted against z).
Next, consider the average derivatives. Table 1 reports the mean, the 25, 50 and 75% quantiles, and the standard deviation of (μ̂1r, μ̃1r, μ̂2r, μ̃2r), r = 1, 2, . . . , 800. The mean value of μ̂1r is close to that of μ̃1r, and the mean value of μ̂2r is extremely close to that of μ̃2r. Due to estimation error, the standard deviations of μ̂1r and μ̂2r are higher than those of their true counterparts.
Table 1. Monte Carlo comparison of μ̂ to μ̃.
        Mean     25%      50%      75%      SD
μ̂1     15.82    10.80    13.96    18.30    9.59
μ̃1     17.13    16.15    17.05    18.12    1.47
μ̂2      8.84     6.92     8.31    10.08    2.97
μ̃2      8.78     8.35     8.74     9.22    0.61
5. CONCLUSION
The log transformation is commonly used to deal with skewed data, and the conditional mean of the original dependent variable and the marginal and interaction effects of explanatory variables on the original dependent variable are often the quantities of interest in applied econometrics. In this paper, we present estimators of these quantities for log-transformed dependent variable models in which the error term is possibly heteroskedastic and has an unknown distribution. We show that the estimators are consistent and asymptotically normally distributed, and we provide consistent estimators of the asymptotic variances. The ratio of the estimate to its estimated standard error is asymptotically standard normal and can be used for statistical inference. To evaluate the finite sample performance of the proposed estimators, we conduct a small scale simulation study, which shows that the proposed estimators have practical value.
ACKNOWLEDGMENTS SUFE provided partial funding through Project 211 Phase III and Shanghai Leading Academic Discipline Project, Project Number: B803. The authors are grateful to participants in the 15th European Workshop on Econometrics and Health Economics, and to Timothy Gunning for specific suggestions. [Correction added after online publication, 24 October 2008: In the first equation on page 540, u had been incorrectly omitted from exp[h(x, β 0 ) + u]. The equation now visible is the corrected version.]
REFERENCES
Abrevaya, J. (2002). Computing marginal effects in the Box–Cox model. Econometric Reviews 21, 383–94.
Ai, C. and E. C. Norton (2000). Standard errors for the retransformation problem with heteroscedasticity. Journal of Health Economics 19, 697–718.
Ai, C. and E. C. Norton (2003). Interaction terms in logit and probit models. Economics Letters 80, 123–9.
Ai, C. and Y. Sun (2005). Asymptotic normality of functionals of series estimators with dependent data. Working paper, University of Florida.
Andrews, D. W. K. (1991). Asymptotic normality of series estimators for non-parametric and semiparametric regression models. Econometrica 59, 307–45.
Duan, N. (1983). Smearing estimate: a nonparametric retransformation method. Journal of the American Statistical Association 78, 605–10.
Manning, W. G. (1998). The logged dependent variable, heteroscedasticity, and the retransformation problem. Journal of Health Economics 17, 283–95.
Manning, W. G. and J. Mullahy (2001). Estimating log models: to transform or not to transform? Journal of Health Economics 20, 461–94.
Mullahy, J. (1998). Much ado about two: reconsidering retransformation and the two-part model in health econometrics. Journal of Health Economics 17, 247–81.
Newey, W. K. (1997). Convergence rates and asymptotic normality for series estimators. Journal of Econometrics 79, 147–68.
Norton, E. C., H. Wang and C. Ai (2004). Computing interaction effects and standard errors in logit and probit models. The Stata Journal 4, 154–67.
Powell, J. L., J. H. Stock and T. M. Stoker (1989). Semiparametric estimation of index coefficients. Econometrica 57, 1403–30.
APPENDIX: PROOF OF RESULTS Proof of Theorem 3.1. Denote R = (D(z1 ) − p J (z1 ) π, . . . , D(zn ) − p J (zn ) π ) ; E = (ε1 , ε2 , . . . , εn ) ; = p J (z) × (P P )−1 P Q(β0 ). D(z) For any vector λ ≤ m − ∂ λ D(z) = ∂ λ p J (z) × (P P )−1 P ∂ λ D(z) +
∂Q(β0 ) − β0 ) × (β ∂β
dim(β ) dim(β ) ∂ 2 Q(β) 1 0 0 λ J ∂ p (z) × (P P )−1 P 2 r=1 s=1 ∂βr ∂βs
s − βs0 )(β r − βr0 ) × (β = A1 + A2, . By theorem 1 of Newey (1997), where β is between β 0 and β ∂Q(β0 ) λ ∂D(z, β0 ) → ∂ ∂ λ p J (z) × (P P )−1 P ∂β ∂β in probability. Assumption 2.1 and 3.5 imply A1 = O p (n−1/2 ). Note that
2
2
λ J −1 ∂ Q(β)
∂ p (z) × (P P ) P
∂βr ∂βs
2 n ∂ 2 exp(ln(yn ) − h(xn , β)) ≤ ∂ λ p J (z) × (P P )−1 × ∂ λ p J (z) × ∂βr ∂βs i=1 2 n 1 ∂ 2 exp(ln(yn ) − h(xn , β)) = ζ|λ| (J )2 = Op (ζ|λ| (J )2 ), n i=1 ∂βr ∂βs where the last equality follows from Assumption 3.5(ii). By Assumption 3.5(iii), A2 = o p (n−1/2 ). Hence, − ∂ λ D(z) = Op (n−1/2 ). ∂ λ D(z) C The Author(s). Journal compilation C Royal Economic Society 2008.
Denote
Vλn (z) = ∂ p (z) × (P P ) λ
J
−1
n
2
J
J
σ (zi ) p (zi )p (zi )
(P P )−1 × ∂ λ p J (z).
i=1
− ∂ λ D(z)) → N (0, 1) in distribution. Furthermore, it is By theorem 2 of Newey (1997), Vλn (z)−1/2 (∂ λ D(z) straightforward to show Vλn (z) = Op (ζ|λ| (J )2 /n). − ∂ λ D(z)) → N (0, 1) in distribution. Hence, Vλn (z)−1/2 (∂ λ D(z) Note that μm (x) − μm (x) = − ∂ λ D(z)) ∂ m−λ exp(h(x, β0 )) × (∂ λ D(z) λ≤m
+
− ∂ λ D(z)) )) − ∂ m−λ exp(h(x, β0 ))) × (∂ λ D(z) (∂ m−λ exp(h(x, β λ≤m
(∂
m−λ
λ≤m
=
)) − ∂ m−λ exp(h(x, β0 ))) × ∂ λ D(z) exp(h(x, β
− ∂ λ D(z)) + Op (n−1/2 ) ∂ m−λ exp(h(x, β0 )) × (∂ λ D(z)
λ≤m
=
− ∂ λ D(z)) + Op (n−1/2 ). ∂ m−λ exp(h(x, β0 )) × (∂ λ D(z)
λ≤m
Denote n σ (zi )2 p J (zi )p J (zi ) Vn (x) = ∂ m exp(h(x, β0 )) × p J (z) (P P )−1 i=1
× (P P )−1 × ∂ m exp(h(x, β0 )) × p J (z) . By theorem 2 of Newey (1997), − ∂ λ D(z)) → N (0, 1) Vn (x)−1/2 ∂ m−λ exp(h(x, β0 )) × (∂ λ D(z) λ≤m
μm (x) − μm (x)) → N (0, 1) in distribution. in distribution. This proves Vn (x)−1/2 ( Denote εi = exp(ui ) − p J (zi ) (P P )−1 P Q(β0 )and n n (x) = ∂ m exp(h(x, β0 )) × p J (z) (P P )−1 εi2 p J (zi )p J (zi ) (P P )−1 V
× ∂ m exp(h(x, β0 )) × p J (z) .
i=1
n (x)1/2 Vn (x)−1/2 → 1 in probability. Note that the difference between By theorem 2 of Newey (1997), V √ . It is easy to show that Vn (x) and V n (x) is that β 0 is replaced by a n consistent estimator β n (x)1/2 Vn (x)−1/2 → 1 in probability. This completes the proof of the theorem. V Proof of Theorem 3.2. The proof is similar to the proof of Theorem 3.1 except that the O p (n−1/2 ) terms are no longer ignored. First, from the proof of Theorem 3.1, we immediately have − ∂ λ D(z) = ∂ λ p J (z) (P P )−1 P ∂ λ D(z)
∂Q(β0 ) − β0 ) + op (n−1/2 ) (β ∂β
C The Author(s). Journal compilation C Royal Economic Society 2008.
holds uniformly for all z and all λ ≤ m. Denote
Write
n i=1
μm (x) = ∂ m [exp(h(x, β0 )) × D(z)]
n
μm (xi ) − i=1 μm (xi ) = i ) − ∂ λ D(z i )) ∂ m−λ exp(h(xi , β0 )) × (∂ λ D(z λ≤m
+
i ) − ∂ λ D(z i )) )) − ∂ m−λ exp(h(xi , β0 )))(∂ λ D(z (∂ m−λ exp(h(xi , β λ≤m
i) )) − ∂ m−λ exp(h(xi , β0 ))) × ∂ λ D(z (∂ m−λ exp(h(xi , β λ≤m
=
n
∂ m [exp(h(xi , β0 )) × p J (zi ) ] × (P P )−1 P
i=1
∂Q(β0 ) − β0 ) × (β ∂β
n ∂ exp(h(xi , β0 )) − β0 ) × ∂ m (β + × p J (zi ) (P P )−1 P Q(β0 ) ∂β i=1 + op (n1/2 ) =
n
∂m
i=1
∂[exp(h(xi , β0 )) × D(zi , β0 )] − β0 ) (β ∂β
+ op (n1/2 ), i ) − ∂ λ D(z i ) and ∂ λ D(z i ) and linearizing where the second equality follows from substituting for ∂ λ D(z )), the third equality follows from applying Theorem 1 of Newey to obtain ∂ m−λ exp(h(xi , β ∂Q(β0 ) ∂ m [exp(h(x, β0 )) × p J (z) ] × (P P )−1 P ∂β ) ∂D(z, β 0 ; → ∂ m exp(h(x, β0 )) ∂β ∂ exp(h(xi , β0 )) ∂m × p J (zi ) × (P P )−1 P Q(β0 ) ∂β ∂ exp(h(xi , β0 )) → ∂m D(z, β0 ) ∂β n μm (xi ) − μm ) in probability uniformly over x. Hence, i=1 ( =
n
2n δ(xi )ui +
i=1
=
n ( μm (xi ) − μm ) + op (n1/2 ) i=1
n
(∂ m [exp(h(xi , β0 )) × D(zi )] − μm )
i=1
+
n
1n p J (zi )εi + 2n δ(xi )ui + op (n1/2 ).
i=1
Part (1) of Theorem 3.2 now follows from applying a central limit theorem. Part (2) can be easily proved by using consistency of various parts.
Econometrics Journal (2008), volume 11, pp. 554–572. doi: 10.1111/j.1368-423X.2008.00254.x
Asymptotic properties of estimators for the linear panel regression model with random individual effects and serially correlated errors: the case of stationary and non-stationary regressors and residuals
BADI H. BALTAGI†, CHIHWA KAO† AND LONG LIU‡
†Center for Policy Research, 426 Eggers Hall, Syracuse University, Syracuse, NY 13244-1020, USA
E-mails: [email protected], [email protected]
‡Department of Economics, University of Texas at San Antonio, San Antonio, TX 78249-0633, USA
E-mail: [email protected]
First version received: April 2007; final version accepted: May 2008
Summary This paper studies the asymptotic properties of standard panel data estimators in a simple panel regression model with random error component disturbances. Both the regressor and the remainder disturbance term are assumed to be autoregressive and possibly non-stationary. Asymptotic distributions are derived for the standard panel data estimators including ordinary least squares (OLS), fixed effects (FE), first-difference (FD) and generalized least squares (GLS) estimators when both T and n are large. We show that all the estimators have asymptotic normal distributions and have different convergence rates dependent on the non-stationarity of the regressors and the remainder disturbances. We show using Monte Carlo experiments that the loss in efficiency of the OLS, FE and FD estimators relative to true GLS can be substantial. Keywords: Fixed-effects, First-difference, GLS, OLS, Panel data.
1. INTRODUCTION
Econometricians have long been concerned with conditions under which the ordinary least squares (OLS) estimator is asymptotically efficient. The standard textbook result is that, under a general variance–covariance structure on the disturbances, the OLS estimator is less efficient than generalized least squares (GLS). This is well documented for the case of stationary autoregressive disturbances and stationary regressors. However, Phillips and Park (1988) showed that in a regression with integrated regressors, OLS and GLS are asymptotically equivalent. Recently, Choi (1999) studied the limiting distributions of the fixed effects (FE), GLS and within-GLS estimators for a panel data regression model with autoregressive disturbances, while Choi (2002) extended this work to instrumental variables (IV) estimation. Phillips and Moon (1999) presented a fundamental framework for studying sequential and joint limit theories in non-stationary panel data analysis, while Kao (1999) studied the asymptotic properties of the FE
estimator of a spurious regression and proposed residual-based tests for panel co-integration. See Baltagi and Kao (2000), Choi (2006) and Breitung and Pesaran (2008) for recent surveys of this rapidly growing subject. In an early finding, Baltagi and Krämer (1997) showed the equivalence of the GLS and FE estimators in a simple panel data regression with a time trend as a regressor. Kao and Emerson (2004a,b) extended Baltagi and Krämer to a model with serially correlated remainder errors, and showed that the FE estimator is asymptotically equivalent to GLS when the error term is I(0), but that GLS is more efficient than FE when the error term is I(1). It is known that the panel time trend can be seen as a special case of the panel regression with a non-zero drift I(1) regressor. This paper extends the literature by studying the asymptotic properties of OLS, FE, first difference (FD) and GLS in the random effects error components regression model with an autocorrelated regressor and an autocorrelated remainder error (both of which can be stationary or non-stationary). We show that when the error term is I(0) and the regressor is I(1), the FE estimator is asymptotically equivalent to the GLS estimator and OLS is less efficient than GLS (due to a slower convergence speed). However, when the error term and the regressor are both I(1), GLS is more efficient than the FE estimator since GLS is √nT consistent, while FE is √n consistent. This implies that GLS is the preferred estimator under both cases (i.e. the regression error is either I(0) or I(1)).
All asymptotic results in this paper assume that T → ∞ followed by n → ∞; we use (n, T) →seq ∞ to denote this sequential limit. We write the integral ∫₀¹ W(s)ds as ∫W, and W̃ for W − ∫W, when there is no ambiguity over limits. We use ⇒ to denote weak convergence, ≡ to denote equivalence in distribution, →p to denote convergence in probability, [x] to denote the largest integer ≤ x, I(0) and I(1) to signify a time series that is integrated of order zero and one, respectively, and BM(Ω) to denote Brownian motion with covariance matrix Ω. All proofs are collected in an appendix available upon request from the authors.
2. THE MODEL AND ASSUMPTIONS
Consider the following panel regression:
yit = α + xit β + uit, i = 1, . . . , n, t = 1, . . . , T,   (2.1)
where uit = μi + νit, and α and β are scalars. For simplicity, we consider the case of one regressor, but our results can be extended to the multiple regressor case. We assume that the individual effects μi are random with
ASSUMPTION 2.1. μi ∼ i.i.d.(0, σμ²).
Let {νit} be an AR(1) process:
νit = ρ νit−1 + eit, |ρ| ≤ 1,   (2.2)
where eit is a white noise process with variance σe². The μi's are independent of the νit's for all i and t. This is the random effects error component model with serial correlation; see Baltagi and Li (1991) for the estimation of this model in the case of stationary regressors and stationary remainder disturbances. Let xit also be an AR(1) process:
xit = λ xit−1 + εit, |λ| ≤ 1,   (2.3)
where εit is a white noise process with variance σε². In this paper, we also assume
ASSUMPTION 2.2. E(μi | xit) = 0 for all i and t.
The initialization of this system is yi1 = xi1 = Op(1) for all i. Our interest is in the estimation of the common slope β. This paper shows that the asymptotic properties of the OLS, FE, FD and GLS estimators depend crucially on the serial correlation properties of xit and νit. When yit and xit are both I(1) but νit is I(0), equation (2.1) is a panel co-integrated model. On the other hand, when νit is I(1) and yit and xit are both I(1), equation (2.1) is a panel spurious model. FE estimators for panel co-integrated and panel spurious models have been discussed in Phillips and Moon (1999) and Kao (1999). The case of a panel time trend model, xit = t, has been studied by Baltagi and Krämer (1997) and Kao and Emerson (2004a,b).
Next, we characterize the innovation vector wit = (eit, εit)′. We assume that wit is a linear process that satisfies the following assumption:
ASSUMPTION 2.3. For each i, we assume:
1. wit = Π(L)ηit = Σ_{j=0}^∞ Πj ηit−j, with Σ_{j=0}^∞ j^a ‖Πj‖ < ∞ and |Π(1)| ≠ 0, for some a > 1.
2. For a given i, ηit is i.i.d. with zero mean, finite variance–covariance matrix E(ηit η′it), and finite fourth order cumulants.
ASSUMPTION 2.4. ηit and ηjt are independent for i ≠ j. That is, we assume cross-sectional independence for our model.
Assumption 2.3 implies that the partial sum process T^{−1/2} Σ_{t=1}^{[Tr]} wit satisfies the following multivariate invariance principle:
T^{−1/2} Σ_{t=1}^{[Tr]} wit ⇒ Bi(r) = BMi(Ω) as T → ∞ for all i,   (2.4)
where Bi = (Bei, Bεi)′. The long-run 2 × 2 covariance matrix of {wit} is given by
Ω = Σ_{j=−∞}^∞ E(wij w′i0) = Π(1) E(ηit η′it) Π(1)′ = [ωe², ωeε; ωeε, ωε²].   (2.5)
The long-run covariance matrix can be decomposed into Ω = Σ + Γ + Γ′, where
Γ = Σ_{j=1}^∞ E(wij w′i0) = [γe², γeε; γeε, γε²]
and
Σ = E(wi0 w′i0) = [σe², σeε; σeε, σε²].   (2.6)
Assuming ωε² is non-zero, we define
ωe.ε = ωe² − ωeε²/ωε².   (2.7)
Then, Bi can be rewritten as
Bi = (Bei, Bεi)′ = [√ωe.ε, ωeε/ωε; 0, ωε] (Vi, Wi)′,   (2.8)
where (Vi, Wi)′ = BM(I) is a standardized Brownian motion. Define the one-sided long-run covariance
Δ = Σ + Γ = Σ_{j=0}^∞ E(wij w′i0)
with
Δ = [δe², δeε; δeε, δε²].
The assumption of constant variances/covariances across i, such as in Ω, Σ, Γ and Δ, is used to simplify the notation. It can be extended to the case where different variances are allowed for different i at the expense of more complicated notation.
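For concreteness, one way to simulate the model (2.1)–(2.3) is sketched below (our own illustration, not the authors' code; α is set to zero and a long burn-in handles initialization, in the spirit of the discarded start-up draws used in the Monte Carlo study of Section 9).

```python
import numpy as np

def generate_panel(n, T, beta, rho, lam, sig_mu2, sig_e2, sig_eps2, burn=1000, seed=0):
    """Simulate y_it = beta*x_it + mu_i + nu_it with AR(1) nu_it and AR(1) x_it."""
    rng = np.random.default_rng(seed)
    mu = rng.normal(0.0, np.sqrt(sig_mu2), size=n)
    e = rng.normal(0.0, np.sqrt(sig_e2), size=(n, T + burn))
    eps = rng.normal(0.0, np.sqrt(sig_eps2), size=(n, T + burn))
    nu = np.zeros((n, T + burn))
    x = np.zeros((n, T + burn))
    for t in range(1, T + burn):
        nu[:, t] = rho * nu[:, t - 1] + e[:, t]
        x[:, t] = lam * x[:, t - 1] + eps[:, t]
    nu, x = nu[:, burn:], x[:, burn:]        # keep the last T periods
    y = beta * x + mu[:, None] + nu          # alpha = 0 for simplicity
    return y, x
```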
3. OLS ESTIMATOR
The OLS estimator of β is given by
β̂OLS = [Σ_{i=1}^n Σ_{t=1}^T (xit − x̄)(yit − ȳ)] / [Σ_{i=1}^n Σ_{t=1}^T (xit − x̄)²],
where x̄ = (1/nT) Σ_{i=1}^n Σ_{t=1}^T xit and ȳ = (1/nT) Σ_{i=1}^n Σ_{t=1}^T yit.
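In code, the pooled OLS slope is just a ratio of overall-demeaned cross-products; a minimal sketch, assuming y and x are stored as n × T arrays:

```python
import numpy as np

def beta_ols(y, x):
    """Pooled OLS slope: deviations from the overall means of y and x."""
    xd = x - x.mean()
    yd = y - y.mean()
    return np.sum(xd * yd) / np.sum(xd**2)
```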
T HEOREM 3.1. Under Assumptions 2.1–2.4, we obtain the following results: 1 If |ρ| < 1 and |λ| < 1, n T p 1−λ2 OLS − β → [lim nT1 (a) β i=1 t=1 E(xit νit )], σε2 √ OLS OLS − β − τ1NT ) ⇒ N(0, κ1OLS ), (b) nT (β where 1 n n OLS = τ1NT
nT 1 nT
E(xit νit )
i=1
t=1
i=1
t=1 (xit
n n
C The Author(s). Journal compilation C Royal Economic Society 2008.
− x)2
,
(3.1)
∞ 2r (1−λ2 )2 1 [σ 2 2 + (1−ρλ) 2 [ψ00 + r=1 λ ψ0r σε4 μ ε 2 2 2 2 E(εit ei(t−rc) ), and ψ 00 = E(εit eit ).
κ1OLS =
+
∞ r=1
2 ρ 2r ψr0 ]], ψ0r = E(εi(t−r) eit2 ), ψr0 =
If ρ = 1 and |λ| < 1, p (1+λ)(− 12 eε +δeε ) OLS − β → (a) β , σε2
√
OLS n βOLS − β − τ2nT ⇒ N 0, κ2OLS , (b)
2
OLS = where τ2nT
( n1
n
1 i=1 T
(1−λ) nT1
T
eε t=1 νi(t−1) eit ) 2 +δeε e T 2 i=1 t=1 (xit −x)
n
, κ2OLS =
(1+λ)2 ε.e e2 . 2σε4
3 If |ρ| < 1 and λ = 1, √
p OLS − β → (a) T β 0, √
(b) nT βOLS − β ⇒ N 0, κ3OLS , where κ3OLS =
4σμ2 . 3ε2
4 If ρ = 1 and λ = 1, p 2δεe OLS − β → , (a) β ε2
√
OLS ⇒ N 0, κ4OLS , n βOLS − β − τ4NT (b) where
OLS τ4nT
=
( n1
n
1 i=1 T
1 n
n
T
xi(t−1) εit ) εe2 +δεe ε T , κ4OLS 2 t=1 (xit −x i )
t=1
1 i=1 T 2
=
2e.ε . 3ε2
It is important to note that e.ε / 2ε can be seen as the long-run signal-to-noise ratio. The OLS estimator ignores the individual effects in the disturbance term. Thus, the variance of μ i , i.e. OLS depending on the case considered. σ 2μ might appear in the variance–covariance matrix of β In case 1, both μ i and ν it affect the variance of βOLS . In cases √ 2 and 4, ν it dominates μ √i . In case 3, μ i dominates ν it and hence the convergence speed is nT , which differs from the nT asymptotics in the panel co-integration literature. Also the asymptotic normality of the OLS estimator comes naturally. When summing across i, the non-standard asymptotic distribution due to unit root in the time dimension, such as for cases 2–4, is smoothed out. C OROLLARY 3.1. When E(e it ε i(t+k) ) = 0 for all i and k, under the assumptions in Theorem 3.1, then 1
If |ρ| < 1 and |λ| < 1, 2 √ OLS − βt) ⇒ N(0, σμ2 + nT (β σ When ε it and e it are
2
∞ 2r 2r (1−λ2 )2 [ψ00 + ∞ r=1 λ ψ0r + r=1 ρ ψr0 ] ). 2 4 (1−ρλ) σε ε √
2 2 σμ2 OLS − β ⇒ N (0, 2 + (1+ρλ)(1−λ2 )σe2 ). independent, nT β σε (1−ρλ)(1−ρ )σε
If ρ = 1 and |λ| < 1, 2 2 √
OLS − β ⇒ N (0, (1−λ)2 σe ), n β 2σ ε
3
If |ρ| < 1 and λ = 1, 2 √
OLS − β ⇒ N (0, 4σμ2 ), nT β 3σ
4
If ρ = 1 and λ = 1, 2 √
OLS − β ⇒ N (0, 2σe2 ). n β 3σ
ε
ε
Corollary 1 follows directly from Theorem 3.1. C The Author(s). Journal compilation C Royal Economic Society 2008.
4. FE ESTIMATOR
The FE estimator of β is given by
β̂FE = [Σ_{i=1}^n Σ_{t=1}^T (xit − x̄i)(yit − ȳi)] / [Σ_{i=1}^n Σ_{t=1}^T (xit − x̄i)²],   (4.1)
where x̄i = (1/T) Σ_{t=1}^T xit and ȳi = (1/T) Σ_{t=1}^T yit.
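A minimal sketch of the within estimator, assuming the same n × T array layout as before:

```python
import numpy as np

def beta_fe(y, x):
    """Within (FE) slope from (4.1): demean y and x by individual, then pool."""
    xd = x - x.mean(axis=1, keepdims=True)
    yd = y - y.mean(axis=1, keepdims=True)
    return np.sum(xd * yd) / np.sum(xd**2)
```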
T HEOREM 4.1. Under Assumptions 2.1–2.4, we have the following results: 1 If |ρ| < 1 and |λ| < 1, n T p 1−λ2 F E − β → (a) β [lim nT1 E (xit νit )], i=1 σε2 √
t=1 FE FE (b) nT βF E − β − τ1nT ⇒ N 0, κ1 , FE τ1nT =
where
1 n T i=1 t=1 E(xit νit ) nT 2 1 n 1 T i=1 T t=1 (xit −x i ) n
∞
∞ 2r 2r r=1 λ ψ0r + r=1 ρ ψr0 ] , (1−ρλ)2 σε4
2 1−λ2 ) [ψ00 + , κ1F E = (
2 2 ψ0r = E(εi(t−r) eit2 ), ψr0 = E(εit2 ei(t−r) ), and ψ 00 = E(ε2it e2it ).
If ρ = 1 and |λ| < 1, p (1+λ)(− 12 eε +δeε ) F E − β → (a) β , σε2
√ FE F E − β − τ2nT ⇒ N 0, κ2F E , n β (b)
2
FE where τ2nT =
( n1
n
1 i=1 T
(1−λ) n1
T
νi )eit ) eε2 +δeε t=1 (νit −¯ e T 2 1 i=1 T 2 t=1 (xit −x i )
n
and κ2F E =
(1−λ)2 ε.e e2 . 6σε4
3 If |ρ| < 1 and λ = 1,
p −3εe +6δεe F E − β → (a) T β 2 , (1−ρ)
√ Fε E √
F E − β − nτ3nT ⇒ N 0, κ3F E , nT β (b) FE = where τ3nT
( n1
n
1 i=1 T
(1−ρ) n1
T
εe t=1 (xit −x i )εit ) 2 +δεe ε T 2 1 i=1 T 2 t=1 (xit −x i )
and κ3F E =
6e.ε . (1−ρ)2 ε2
and κ4F E =
2e.ε . 5ε2
n
4 If ρ = 1 and λ = 1, p εe +6δεe F E − β → (a) β , ε2
√
FE n βF E − β − τ4nT ⇒ N 0, κ4F E , (b) where
FE τ4nT
=
( n1
n
1 T 2 εe i=1 T 2 t=1 (xit −x i ) ) 2 ε 2 1 n 1 T i=1 T 2 t=1 (xit −x i ) n
+δεe
Note εe is due to the endogeneity of the regressor x it , and δ εe is due to serial correlation. Because uit − ui = νit − ν i , the individual effect μ i is eliminated for each individual. C OROLLARY 4.1. When E(e it ε i(t+k) ) = 0 for all i and k, under the same conditions as for Theorem 4.1, then 1
If |ρ| < 1 and |λ| < 1, ∞ 2r ∞ 2r √ 2 2 F E − β) ⇒ N(0, (1−λ ) [ψ00 + r=1 λ 2ψ0r4 + r=1 ρ ψr0 ] ). nT (β (1−ρλ) σε √ 2 2 F E − β) ⇒ N(0, (1+ρλ)(1−λ2 )σe2 ). If ε it and e it are independent, nT (β (1−ρλ)(1−ρ )σ ε
2
If ρ = 1 and |λ| < 1, 2 2 √ F E − β) ⇒ N(0, (1−λ)2 σe ). n(β 6σ ε
C The Author(s). Journal compilation C Royal Economic Society 2008.
3
If |ρ| < 1 and λ = 1, 2 √ F E − β) ⇒ N(0, 6σe2 2 ). nT (β (1−ρ) σ
4
If ρ = 1 and λ = 1, 2 √ F E − β) ⇒ N(0, 2σe2 ). n(β 5σ
ε
ε
Corollary 2 follows directly from Theorem 4.1. Note that case 1 is the textbook result under the assumptions of stationarity of the regressor and the disturbance term. Case 2 is new. Case 3 is discussed by Phillips and Moon (1999) and Kao and Chiang (2000). Case 4 is discussed in Kao (1999).
5. FD ESTIMATOR
The FD estimator of β is given by
β̂FD = [Σ_{i=1}^n Σ_{t=1}^T (xit − xit−1)(yit − yit−1)] / [Σ_{i=1}^n Σ_{t=1}^T (xit − xit−1)²].   (5.1)
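And a matching sketch of the first-difference estimator, with differences taken within each individual:

```python
import numpy as np

def beta_fd(y, x):
    """First-difference (FD) slope from (5.1)."""
    dx = np.diff(x, axis=1)
    dy = np.diff(y, axis=1)
    return np.sum(dx * dy) / np.sum(dx**2)
```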
T HEOREM 5.1. Under Assumptions 2.1–2.4, we obtain the following results: If |ρ| < 1 and |λ| < 1, n T p lim nT1 i=1 t=1 E[(xit −xit−1 )(νit −νit−1 )] F D − β → β , 2σε2 2(−2λ2 +2λ−1)γε2 1+λ + 1−λ √
FD F D − β − τ1nT (b) ⇒ N(0, κ1F D ), where nT β n T lim nT1 i=1 t=1 E[(xit − xit−1 )(νit − νit−1 )] FD , τ1nT = 1 n T 2 i=1 t=1 (xit − xit−1 ) nT
1
(a)
and κ1F D = r−1 r−1 + 2ρ r − ρ r+1 )2 ψ0r + ∞ + 2λr − λr+1 )2 ψr0 (2 − λ − ρ)2 ψ00 + ∞ r=1 (−ρ r=1 (−λ 2σ 2
(1 − ρλ)2 ( 1+λε +
2(−2λ2 +2λ−1)γε2 2 ) 1−λ
.
If ρ = 1 and |λ| < 1, n T p lim nT1 i=1 t=1 E [(εit +(λ−1)xit−1 )eit ] F D − β → β , 2 2(−2λ2 +2λ−1)γε2 2σε + 1+λ 1−λ √
FD F D − β − τ2nT ⇒ N 0, κ2F D , where nT β (b)
n T lim nT1 i=1 t=1 E (εit + (λ − 1) xit−1 ) eit FD , τ2nT = 2 1 n T i=1 t=1 (xit − xit−1 ) nT
2
(a)
and κ2F D =
(1 + λ)
2ψ00 2σε2 1+λ
2 −2λ2 +2λ−1 γ 2 + ( 1−λ ) ε
2 .
C The Author(s). Journal compilation C Royal Economic Society 2008.
3 If |ρ| < 1 and λ = 1, n T p 1 F D − β → (a) β lim nT1 E[εit (νit − νit−1 )], σε2 √
i=1 t=1 F D F D F D − β − τ3nT ⇒ N 0, 3 , (b) nT β n T lim nT1 E [εit (νit −νit−1 )] i=1 t=1 where τ F D = and κ F D = 3nT
1 nT
n i=1
T 2 t=1 (xit −xit−1 )
4 If ρ = 1 and λ = 1, p σεe F D − β → (a) β , σε2 √
FD F D − β − τ4nT (b) ⇒ N 0, κ4F D , nT β FD where τ4nT =
1 nT
n i=1
σ T εe
2 t=1 (xit −xit−1 )
and κ4F D =
3
2ψ00 . (1+ρ)σε4
e.ε ε2 . σε4
Similar to the FE estimator, the individual effect μ i is also eliminated by the FD estimator because u it − u it−1 = ν it − ν it−1 . In cases 2 and 4, ρ = 1, and the FD estimator is asymptotically equivalent to the GLS estimator because both methods transform the disturbance from I(1) into I(0). Actually, the FD estimator is mathematically the same as the GLS estimator except for the omission of the first observation for each individual. C OROLLARY 5.1. When E(e it ε i(t+k) ) = 0 for all i and k, under the same conditions as for Theorem 5.1, then 1
If |ρ| < 1 and |λ| < 1, ∞ ∞ √ 2 2 r−1 r r+1 2 r−1 r r+1 2 F D − β) ⇒ N(0, (1+λ) [(2−λ−ρ) ψ00 + r=1 (−ρ +2ρ −ρ 2 4) ψ0r + r=1 (−λ +2λ −λ ) ψr0 ] ). nT (β 4(1−ρλ) σε (1−λ)3 2 2 (1−ρ)3 2 √ F D − β) ⇒ N(0, (1+λ) [(2−ρ−λ) + 21+ρ2 + 1+λ ]σe ). If ε it and e it are independent, nT (β 4(1−ρλ) σ ε
2
If ρ = 1 and |λ| < 1, √
00 F D − β ⇒ N 0, (1+λ)ψ . nT β 4 2σε √
2 e F E − β ⇒ N 0, (1+λ)σ If ε it and e it are independent, nT β . 2 2σ ε
3
If |ρ| < 1 and λ = 1, √
F D − β ⇒ N 0, 2ψ00 4 . nT β (1+ρ)σε √
2 F D − β ⇒ N 0, 2σe 2 . If ε it and e it are independent, nT β (1+ρ)σ
4
If ρ = 1 and λ = 1, √
2 F D − β ⇒ N 0, σe2 . nT β σ
ε
ε
Corollary 5.1 follows directly from Theorem 5.1.
6. GLS ESTIMATOR
Let us rewrite equation (2.1) in vector form:
y = α ιnT + xβ + u,
where y is nT × 1, x is the nT × 1 vector of the xit, ιnT is a vector of ones of dimension nT, and u is nT × 1. As shown in the Appendix,
β̂GLS = [x′Ω⁻¹x − x′Ω⁻¹ιnT(ι′nT Ω⁻¹ιnT)⁻¹ι′nT Ω⁻¹x]⁻¹ [x′Ω⁻¹y − x′Ω⁻¹ιnT(ι′nT Ω⁻¹ιnT)⁻¹ι′nT Ω⁻¹y]   (6.1)
and
β̂GLS − β = [x′Ω⁻¹x − x′Ω⁻¹ιnT(ι′nT Ω⁻¹ιnT)⁻¹ι′nT Ω⁻¹x]⁻¹ [x′Ω⁻¹u − x′Ω⁻¹ιnT(ι′nT Ω⁻¹ιnT)⁻¹ι′nT Ω⁻¹u],
where Ω = E(uu′). One can decompose this variance–covariance matrix into
Ω = E(uu′) = σμ²(In ⊗ ιT ι′T) + σe²(In ⊗ A),
where ιT is a vector of ones of dimension T and A is the variance–covariance matrix of νit, with typical element
A(t, s) = ρ^|t−s| when |ρ| < 1, and A(t, s) = min(t, s) when ρ = 1.
Thus, it can be shown that
Ω⁻¹ = (1/σe²) In ⊗ [A⁻¹ − (σμ²/(σe² + θσμ²)) A⁻¹ιT ι′T A⁻¹],
where θ = ι′T A⁻¹ιT.
When |ρ| < 1, this estimation is equivalent to the Prais–Winsten (PW) transformation method suggested by Baltagi and Li (1991). One can easily verify that A⁻¹ = C′C, where
C = [√(1−ρ²)  0   0  ···  0   0;
       −ρ     1   0  ···  0   0;
        0    −ρ   1  ···  0   0;
        ⋮     ⋮   ⋮   ⋱   ⋮   ⋮;
        0     0   0  ···  1   0;
        0     0   0  ··· −ρ   1]
is the PW transformation matrix as in Baltagi and Li (1991). Thus, we have the following theorem: T HEOREM 6.1. Under Assumptions 2.1–2.4, we obtain the following results: If |ρ| < 1 and |λ| < 1, p lim T1 Tt=1 E [(xit −ρxit−1 )eit ]
GLS − β → (a) β (1−2ρλ+ρ 2 )σε2 + 2(λ−2ρλ2 +ρ 2 λ−ρ )γε2 1−λ2 √
1−λ GLS GLS − β − τ1nT ⇒ N 0, κ1GLS , (b) nT β lim T1 Tt=1 E [(xit −ρxit−1 )eit ] GLS where τ1nT = , κ1GLS = σ 2 1 X −1 X
1
e nT
(1−2ρλ+ρ 2 )ψ00 2 1−2ρλ+ρ 2 σ 2 2 λ−2ρλ2 +ρ 2 λ−ρ )γε2 (1−λ2 ) ( 1−λ2 ) ε + ( 1−λ
If ρ = 1 and |λ| < 1, p lim n1 T1 ni=1 Tt=1 E [(xit −xit−1 )eit ]
GLS − β → , (a) β 2(−2λ2 +2λ−1)γε2 2σε2 1+λ + 1−λ √
GLS GLS − β − τ2nT (b) ⇒ N 0, κ2GLS , nT β lim n1 ni=1 T1 Tt=1 E [(xit −xit−1 )eit ] GLS where τ2nT = , κ2GLS = σ 2 1 X −1 X
2
e nT
If |ρ| < 1 and λ = 1, 6
p −3εe +6γεe + 1−ρ σεe GLS − β → (a) T β , 2 (1−ρ)ε
√ GLS √ GLS − β − nτ3nT ⇒ N 0, κ3GLS , nT β (b)
(1+λ)
2ψ00 2 −2λ2 +2λ−1 γε2 2σε2 1+λ + 1−λ
(
)
2
3
GLS where τ3nT =
1 n
n
i=1 (1−ρ)
1 n n
1 i=1 T
T
t=1 (xit −x¯ i )εit
σe2 nT1 X −1 X
εe ε2
σεe +γεe + 1−ρ
, κ3GLS =
6e.ε . (1−ρ)2 ε2
4 If ρ = 1 and λ = 1, √
p σεe GLS − β → (a) T β , σε2 √
√
GLS GLS − β − nτ4nT (b) nT β ⇒ N 0, κ4GLS , GLS where τ4nT =
√ nσεe , κ4GLS σe2 nT1 X −1 X
=
e2 ε2 . σε4
It is worth pointing out that when ρ = 1, the GLS transformation given by Baltagi and Li (1991) is identical to the first-difference transformation. In fact, it omits the first observation for each individual and the Cochrane–Orcutt (CO) transformation from period 2 up to T becomes the first difference transformation. Hence, the GLS estimator will be the same as the FD estimator and the conditional expectation of μ i given x it need not be zero, when ρ = 1. When ρ < 1, C The Author(s). Journal compilation C Royal Economic Society 2008.
E(μ i |x it ) = 0 is required, otherwise βˆGLS would be biased and inconsistent. This is the case for the Mundlak (1978) model where the μ i ’s are explicitly formulated as a function of the means of all the regressors, in this case (x i .). The result is that under this Mundlak model, OLS and GLS suffer from omitted variable bias, i.e. the omission of x i ., while FD and FE wipe out this source of endogeneity and remain consistent. In this case, one may use the within or first-difference transformation to wipe out μ i and then run GLS estimation in case of serial correlation in the remainder error. However, this is beyond the scope of this paper and can be left as a further extension. The following corollary follows directly from Theorem 6.1. C OROLLARY 6.1. When E(e it ε i(t+k) ) = 0 for all i and k, under the same conditions as for Theorem 6.1, then 1
2
If |ρ| < 1 and |λ| < 1, √
GLS − β ⇒ N 0, nT β
(1−λ2 )ψ00 . (1−2ρλ+ρ 2 )σε4 √
GLS − β ⇒ N 0, If ε it and e it are independent, nT β
(1−λ2 )σe2 . (1−2ρλ+ρ 2 )σε2
If ρ = 1 and |λ| < 1, √
00 GLS − β ⇒ N 0, (1+λ)ψ . nT β 4 2σε √
2 e GLS − β ⇒ N 0, (1+λ)σ . If ε it and e it are independent, nT β 2 2σ ε
3
If |ρ| < 1 and λ = 1, 2 √
GLS − β ⇒ N 0, 6σe2 2 . nT β (1−ρ) σ
4
If ρ = 1 and λ = 1, √
2 GLS − β ⇒ N 0, σe2 . nT β σ
ε
ε
Case 1 is the textbook result. Case 3 is discussed in Choi (1999). Cases 2 and 4 are new.
7. FEASIBLE GLS ESTIMATOR
It is clear that the GLS estimator in Section 6 is not feasible. In this section, we discuss feasible GLS estimation. Assuming E(eit εi(t+k)) = 0 for all i and k and that εit and eit are independent, a feasible GLS estimator can be calculated by estimating the autocorrelation coefficient ρ and the variance components σμ² and σe². To estimate these parameters, we take the following steps. First, retrieve the residual estimator ν̂it from the FE regression in (2.1). Now ρ can be estimated as the correlation between ν̂it and ν̂it−1, i.e.
ρ̂ = [Σ_{i=1}^n Σ_{t=2}^T (ν̂it − ν̄̂)(ν̂it−1 − ν̄̂)] / √[Σ_{i=1}^n Σ_{t=2}^T (ν̂it − ν̄̂)² Σ_{i=1}^n Σ_{t=2}^T (ν̂it−1 − ν̄̂)²],   (7.1)
where ν̄̂ is the sample average of ν̂it. Alternatively, as suggested by Baltagi and Li (1991), one can estimate ρ by
ρ̂ = Σ_{i=1}^n Σ_{t=2}^T ν̂it ν̂it−1 / Σ_{i=1}^n Σ_{t=2}^T (ν̂it−1)².
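The first steps of this procedure are easy to sketch in code (an illustration only; estimating the variance components and applying the full transformation of Baltagi and Li, 1991, are omitted here).

```python
import numpy as np

def rho_hat_from_fe(y, x):
    """Estimate rho as the correlation between successive FE residuals, in the
    spirit of (7.1); y and x are n x T arrays. A sketch, not the authors' code."""
    xd = x - x.mean(axis=1, keepdims=True)
    yd = y - y.mean(axis=1, keepdims=True)
    b_fe = np.sum(xd * yd) / np.sum(xd**2)       # within estimator
    nu = yd - b_fe * xd                          # FE residuals
    return np.corrcoef(nu[:, 1:].ravel(), nu[:, :-1].ravel())[0, 1]

def prais_winsten_matrix(T, rho):
    """The T x T Prais-Winsten matrix C of Section 6, satisfying A^{-1} = C'C."""
    C = np.eye(T) + np.diag(np.full(T - 1, -rho), k=-1)
    C[0, 0] = np.sqrt(1.0 - rho**2)
    return C
```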
˜1 − Q ˜2 / Q ˜1 , ˜0 − Q Baltagi and Li (1997) suggests another consistent estimator ρ = Q T ˜s = n where Q uit uit−s /n (T − s). We choose the correlation coefficient estimator i=1 t=s+1 because it ensures that ρ is always between 0 and 1. It can be shown that ρ in (7.1) is a consistent p estimator of ρ by using the Theorem 4.1, i.e. ρ → ρ if |ρ| < 1. autocorrelation coefficient ρ , we can Next, using the FE residuals νit and the estimate of Nthe T 2 2 e . Also, σ can be estimated get eit . Therefore σ 2e can be estimated by σˆ e2 = nT1 μ i=1 t=1 it
2 1 N T 2 2 using σˆ μ = nT i=1 t=1 uit − uit denote the OLS residuals from equation (3.1). νit , where p
p
σˆ e2 and σˆ μ2 are consistent estimators for σ 2e and σ 2μ , respectively, i.e. σˆ e2 → σe2 , σˆ μ2 → σμ2 if |ρ| < 1. These variance components can be estimated by using the variance decomposition and the PW transformation suggested by Baltagi and Li (1991). Alternatively, one can also use the CO procedure, which ignores the information contained in the first observation. As suggested by Maeshiro (1976), Beach and MacKinnon (1978) and Park and Mitchell (1980), estimation using the PW transformation is more efficient than using the CO procedure when the regressors are trended. When the assumptions of Corollary 6.1 hold, one can show that feasible GLS has the same asymptotic distribution as true GLS. Define φ = (ρ, σ 2μ , σ 2e ) and φˆ is its corresponding estimator. Then = (φ). Further define that G k (φ) = ∂−1 (φ)/∂φ k ,where k = 1, 2, 3. For example, in case 1, a Taylor’s series expansion as in Fuller and Battese (1973) gives √ nT (τˆGLS − τ ) =
−1 −1 Z −1 φˆ Z Z φˆ u √ nT nT
=
Z −1 (φ) Z nT
−1
Z −1 (φ) u √ nT
+
−1
3 −1 Z (φ ∗ ) Z Z Gk (φ ∗ ) u √ nT nT k=1
−1
−1 ∗ −1 −1 ∗
Z Gk (φ ∗ ) Z Z (φ ) u Z (φ ) Z Z −1 (φ ∗ ) Z φˆ − φ √ nT nT nT nT −1 −1 −1 Z (φ) Z Z (φ) u + op (1) , = √ nT nT
−
p where φ ∗ lies between φˆ and φ, hence φ ∗ → φ. The last equal sign holds if
∗
−1
∗
(φ k (φ )Z = Op (1) , Z √nT(φ )u = Op (1) and Z G√knT Op (1) , Z GnT similar arguments in the proofs of the theorems above.
∗
)u
Z −1 (φ)Z nT
=
= Op (1). This follows using
8. EFFICIENCY COMPARISONS This section summarizes the relative efficiency of OLS, FE, GLS and FD estimators. First, the speed of convergence for the different cases considered are summarized as follows: C The Author(s). Journal compilation C Royal Economic Society 2008.
                                  OLS     FE      FD      GLS
Case 1: |ρ| < 1 and |λ| < 1       √nT     √nT     √nT     √nT
Case 2: ρ = 1 and |λ| < 1         √n      √n      √nT     √nT
Case 3: |ρ| < 1 and λ = 1         √nT     √nT     √nT     √nT
Case 4: ρ = 1 and λ = 1           √n      √n      √nT     √nT
In case 1, the four estimators have the same convergence speed of √nT. The efficiency of the OLS estimator is hard to compare with the remaining estimators because OLS does not difference out μi, and as a result its variance still contains σμ². That GLS is more efficient than FE and FD is evident from the Gauss–Markov theorem. Since these estimators all converge at the same rate √nT, we plot the relative efficiency of the FE and FD estimators with respect to true GLS in Figures 1 and 2. The relative efficiency of the FE estimator with respect to true GLS is given by
var(β̂FE)/var(β̂GLS) = [(1 + ρλ)(1 − λ²)σe² / ((1 − ρλ)(1 − ρ²)σε²)] / [(1 − λ²)σe² / ((1 − 2ρλ + ρ²)σε²)]
= (1 + ρλ)(1 − 2ρλ + ρ²) / [(1 − ρλ)(1 − ρ²)].
The relative efficiency of the FD estimator with respect to true GLS is given by
var(β̂FD)/var(β̂GLS) = {(1 + λ)[(2 − ρ − λ)² + (1 − ρ)³/(1 + ρ) + (1 − λ)³/(1 + λ)]σe² / [4(1 − ρλ)²σε²]} / [(1 − λ²)σe² / ((1 − 2ρλ + ρ²)σε²)]
= (1 − 2ρλ + ρ²)[(2 − ρ − λ)² + (1 − ρ)³/(1 + ρ) + (1 − λ)³/(1 + λ)] / [4(1 − λ)(1 − ρλ)²].
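These two case-1 expressions are easy to evaluate numerically; the sketch below simply codes the final ratios derived above and can be evaluated over a grid of (ρ, λ) values to trace out the shape of Figures 1 and 2.

```python
def rel_eff_fe_gls(rho, lam):
    """Asymptotic var(FE)/var(GLS) in case 1, from the expression above."""
    return (1 + rho * lam) * (1 - 2 * rho * lam + rho**2) / ((1 - rho * lam) * (1 - rho**2))

def rel_eff_fd_gls(rho, lam):
    """Asymptotic var(FD)/var(GLS) in case 1, from the expression above."""
    num = (2 - rho - lam)**2 + (1 - rho)**3 / (1 + rho) + (1 - lam)**3 / (1 + lam)
    return (1 - 2 * rho * lam + rho**2) * num / (4 * (1 - lam) * (1 - rho * lam)**2)
```

For example, rel_eff_fe_gls(0, 0) and rel_eff_fd_gls(1 - 1e-9, 0) both return values close to 1, matching the discussion below that FE is efficient when ρ is small and FD is efficient when ρ is close to one.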
Figure 1. Relative efficiency of GLS to FE estimator (inverse relative efficiency plotted over ρ and λ).
Figure 2. Relative efficiency of GLS to FD estimator (inverse relative efficiency plotted over ρ and λ).
One can easily verify that both relative efficiencies are larger or equal to 1. Comparing the GLS estimator with the FE and FD estimators, the relative efficiency depends on the values of ρ and λ. As shown in Figures 1 and 2, when ρ is small, the FE estimator performs well in terms of relative efficiency with respect to true GLS. When ρ is large, the FD estimators performs well in terms of relative efficiency with respect to true GLS. In case 2, the disturbance is I(1) but the regressor is I(0). The noise is strong so that it dominates the signal. In the time series case, the OLS estimator is not consistent. After double smoothing √ using panel data, the asymptotic distribution becomes normal and the convergence speed is n. GLS √ estimation, however, transforms the disturbance into I(0). Therefore the estimation will be the convergence speed is nT . When the disturbance is I(1), first-difference √ same as GLS except for the first observation. Hence it is also nT consistent. In case 3, the disturbance is I(0) but the regressor is I(1). This is the co-integration case. The co-integration literature shows that the GLS√estimators is T consistent in time series models. In the panel data model, both GLS and FE are nT consistent. In case 4, both the disturbance and the regressor are I(1). This is the spurious regression case. √ As shown in Kao (1999), the FE estimator is n consistent. For the same reason given in case 2, first-differencing transforms the disturbance term √ from I(1) to I(0). Therefore, the convergence speed of both the GLS or the FD estimators is nT . To compare the FD estimator with the FE estimator, in case 3, the FE estimator is more efficient when v it are stationary, including the special case when v it are serially uncorrelated. In cases 2 and 4, the FD estimator is more efficient when v it follows a random walk. These results verify the conclusion in Wooldridge (2002). However, in case 1, when ρ is large, even though v it does not follow a random walk, the FD estimator is still more efficient than the FE estimator.
B. H. Baltagi, C. Kao and L. Liu Table 1. Relative efficiencies of standard panel data estimators (N = 40, T = 20). λ
OLS
FE
FD
GLS-PW
ρ
0
0.2
0.4
0.6
0.8
0.9
1
0 0.2
1.915 2.077
2.310 2.404
2.897 2.845
3.838 3.496
5.305 4.436
5.799 4.668
5.211 4.065
0.4 0.6
2.512 3.427
2.815 3.793
3.154 4.108
3.565 4.340
4.032 4.346
3.994 3.956
3.273 2.906
0.8 0.9 1
5.784 9.257 22.685
6.476 10.529 26.255
7.017 11.590 29.502
7.260 12.198 31.992
6.745 11.475 31.608
5.653 9.528 27.087
3.506 5.451 15.126
0 0.2
1.002 1.073
1.003 1.075
1.005 1.073
1.009 1.071
1.024 1.076
1.048 1.098
1.102 1.161
0.4 0.6 0.8
1.315 1.838 2.970
1.332 1.915 3.223
1.332 1.949 3.397
1.311 1.918 3.426
1.278 1.811 3.198
1.281 1.755 2.972
1.330 1.733 2.650
0.9 1
4.066 7.306
4.505 8.326
4.843 9.222
4.976 9.843
4.678 9.767
4.294 9.330
3.619 8.142
0 0.2 0.4
1.486 1.223 1.095
1.699 1.329 1.141
2.021 1.494 1.215
2.561 1.780 1.350
3.601 2.355 1.642
4.594 2.934 1.964
7.548 4.722 3.022
0.6 0.8
1.037 1.013
1.053 1.016
1.079 1.022
1.129 1.032
1.253 1.064
1.413 1.117
1.991 1.342
0.9 1 0
1.006 1.016 1.272
1.007 1.019 1.357
1.009 1.021 1.456
1.011 1.022 1.565
1.020 1.023 1.648
1.036 1.023 1.652
1.117 1.024 1.625
0.2 0.4
1.120 1.043
1.168 1.063
1.232 1.092
1.316 1.137
1.410 1.201
1.440 1.234
1.453 1.270
0.6 0.8 0.9
1.012 1.002 1.000
1.017 1.002 1.000
1.025 1.003 1.000
1.039 1.004 1.000
1.065 1.008 1.002
1.085 1.014 1.006
1.119 1.027 1.021
1
1.016
1.018
1.020
1.024
1.031
1.045
1.111
Note: (a) Relative mean square error with respect to the true GLS. (b) 10,000 replications.
σ 2μ
=
σ 2e
= 5.
9. MONTE CARLO SIMULATION
This section reports the results of Monte Carlo experiments designed to investigate the finite sample relative efficiency of the OLS, FE, FD, GLS-CO and GLS-PW estimators with respect to true GLS. The model is generated by
yit = xit β + μi + vit, i = 1, . . . , n, t = 1, . . . , T,   (9.1)
with β = 10, μi ∼ i.i.d. N(0, 5), and vit and xit following the AR(1) processes given in (2.2) and (2.3), respectively, with ρ and λ varying over the range (0, 0.2, 0.4, 0.6, 0.8, 0.9, 1) and σ²ε = σ²e = 5.
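A toy version of one cell of this experiment, reusing the generate_panel, beta_ols, beta_fe and beta_fd sketches given earlier in this document (and omitting the GLS variants for brevity), might look as follows; it reports plain MSEs rather than MSEs relative to true GLS.

```python
import numpy as np

def mc_mse(n, T, rho, lam, reps=1000, seed=0):
    """Toy Monte Carlo: MSE of OLS, FE and FD around the true beta = 10.
    Relies on the helper sketches defined earlier in this document."""
    err = {"OLS": [], "FE": [], "FD": []}
    for r in range(reps):
        y, x = generate_panel(n, T, beta=10.0, rho=rho, lam=lam,
                              sig_mu2=5.0, sig_e2=5.0, sig_eps2=5.0, seed=seed + r)
        err["OLS"].append(beta_ols(y, x) - 10.0)
        err["FE"].append(beta_fe(y, x) - 10.0)
        err["FD"].append(beta_fd(y, x) - 10.0)
    return {k: float(np.mean(np.square(v))) for k, v in err.items()}
```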
Asymptotic properties of estimators for the linear panel regression model Table 2. Relative efficiencies of standard panel data estimators (N = 60, T = 60). λ
OLS
FE
FD
GLS-PW
ρ
0
0.2
0.4
0.6
0.8
0 0.2
1.963 2.101
2.414 2.467
3.123 2.989
4.441 3.881
7.826 6.062
12.201 8.836
13.958 9.628
0.4 0.6
2.523 3.445
2.858 3.848
3.249 4.207
3.796 4.509
4.980 4.844
6.446 5.209
6.329 4.077
0.8 0.9 1
6.008 10.598 61.715
6.802 12.223 73.063
7.438 13.622 84.083
7.754 14.515 94.300
7.290 13.850 100.976
6.375 11.702 97.379
3.150 4.367 41.327
0 0.2
1.000 1.081
1.000 1.083
1.000 1.078
1.000 1.064
1.002 1.039
1.005 1.028
1.037 1.062
0.4 0.6 0.8
1.359 2.010 3.871
1.379 2.110 4.291
1.371 2.147 4.601
1.323 2.073 4.681
1.219 1.802 4.180
1.151 1.576 3.485
1.149 1.415 2.511
0.9 1
6.686 20.601
7.647 24.289
8.490 27.902
9.050 31.303
8.675 33.476
7.464 32.129
4.664 23.459
0 0.2 0.4
1.485 1.211 1.082
1.714 1.322 1.128
2.079 1.505 1.206
2.779 1.864 1.367
4.672 2.861 1.834
7.440 4.347 2.559
21.119 11.967 6.559
0.6 0.8
1.026 1.007
1.041 1.009
1.067 1.014
1.121 1.023
1.292 1.056
1.579 1.123
3.449 1.768
0.9 1 0
1.002 1.006 1.293
1.003 1.007 1.392
1.004 1.008 1.512
1.007 1.009 1.660
1.014 1.010 1.820
1.030 1.010 1.834
1.267 1.010 1.667
0.2 0.4
1.126 1.043
1.181 1.066
1.258 1.101
1.376 1.162
1.564 1.291
1.663 1.399
1.612 1.455
0.6 0.8 0.9
1.011 1.001 1.000
1.017 1.002 1.000
1.026 1.003 1.000
1.045 1.005 1.000
1.095 1.010 1.001
1.155 1.021 1.002
1.246 1.062 1.012
1
1.005
1.006
1.007
1.008
1.009
1.010
1.046
Note: (a) Relative mean square error with respect to the true GLS. (b) 10,000 replications.
0.9
σ 2μ
=
σ 2e
1
= 5.
The sample sizes n and T are varied over the range (20,40,60,120,240). For each experiment, we perform 10, 000 replications. For each replication we estimate the model using OLS, FE, FD, GLS-CO, GLS-PW and true GLS. Even with this modest design we had 1225 experiments. GAUSS for Windows 6.0 was used to perform the simulations. Random numbers for μ i and ε it were generated by the GAUSS procedure RNDNS. We generated n(T + 1000) random numbers and then split them into n series so that each series had the same mean and variance. The first 1000 observations were discarded for each series. Tables 1–3 give the relative mean square error (MSE) of each estimator of β with respect to true GLS for various values of ρ, λ, n and T. We only report 3 tables to give a flavour of the C The Author(s). Journal compilation C Royal Economic Society 2008.
B. H. Baltagi, C. Kao and L. Liu Table 3. Relative efficiencies of standard panel data estimators (N = 240, T = 60). λ
OLS
FE
FD
GLS-PW
ρ
0
0.2
0.4
0.6
0.8
0.9
1
0 0.2
1.903 2.042
2.330 2.390
3.022 2.895
4.305 3.771
7.364 5.770
11.040 8.095
12.103 8.331
0.4 0.6
2.462 3.385
2.783 3.783
3.152 4.110
3.690 4.392
4.802 4.731
6.021 5.003
5.490 3.598
0.8 0.9 1
5.877 10.283 66.063
6.679 11.912 78.409
7.269 13.242 89.876
7.545 14.094 100.477
7.169 13.687 108.863
6.308 11.809 107.078
2.932 4.236 43.764
0 0.2
1.000 1.085
1.001 1.088
1.001 1.082
1.002 1.073
1.006 1.065
1.011 1.060
1.024 1.054
0.4 0.6 0.8
1.369 2.036 3.850
1.395 2.152 4.301
1.384 2.182 4.598
1.344 2.113 4.683
1.274 1.912 4.352
1.219 1.718 3.789
1.147 1.433 2.601
0.9 1
6.426 19.538
7.390 23.082
8.173 26.375
8.699 29.349
8.595 31.405
7.716 30.910
4.808 26.383
0 0.2 0.4
1.471 1.210 1.085
1.687 1.312 1.125
2.051 1.495 1.203
2.733 1.853 1.368
4.389 2.748 1.803
6.646 3.973 2.409
18.377 10.465 5.791
0.6 0.8
1.030 1.009
1.041 1.011
1.065 1.014
1.123 1.024
1.289 1.059
1.536 1.120
3.103 1.675
0.9 1 0
1.004 0.999 1.290
1.004 0.998 1.384
1.005 0.997 1.508
1.007 0.996 1.647
1.015 0.998 1.714
1.030 0.998 1.640
1.241 1.000 1.527
0.2 0.4
1.128 1.047
1.179 1.066
1.258 1.102
1.374 1.165
1.504 1.268
1.523 1.321
1.473 1.336
0.6 0.8 0.9
1.014 1.003 1.001
1.018 1.003 1.001
1.027 1.004 1.001
1.047 1.006 1.001
1.091 1.013 1.003
1.125 1.019 1.004
1.163 1.033 1.008
1
0.999
0.998
0.997
0.997
0.999
1.002
1.045
Note: (a) Relative mean square error with respect to the true GLS. (b) 10,000 replications.
σ 2μ
=
σ 2e
= 5.
We only report three tables to give a flavour of the results; the rest are available upon request from the authors. Several conclusions emerge from these results. First, the true GLS estimator is the most efficient one in terms of mean squared error. Its efficiency gain over the OLS estimator is enormous, particularly when ρ and/or λ is large. Second, the FE estimator is less efficient than true GLS, but more efficient than the feasible GLS estimator when ρ = 0. However, when ρ increases, the feasible GLS estimator quickly becomes more efficient than the FE estimator. Third, the FD estimator is also less efficient than true GLS. When ρ increases, the FD estimator becomes as efficient as the GLS estimator. Interestingly, the FD estimator behaves poorly when λ is close to 1 but ρ is small. Fourth, the feasible GLS
estimator is slightly less efficient than the true GLS estimator and beats OLS, FE and FD as long as ρ > 0.2. In summary, our simulation results show that the feasible GLS estimator performs well, and is second best only to true GLS when ρ > 0.2.
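For readers who want to reproduce part of the comparison, minimal sketches of the within (FE) and first-difference (FD) slope estimators are given below. These are the textbook formulas, not the authors' GAUSS implementation, and they assume the y and x arrays produced by the simulation sketch earlier in this section.

```python
import numpy as np

def fe_slope(y, x):
    """Fixed-effects (within) estimator: demean y and x within each unit."""
    yd = y - y.mean(axis=1, keepdims=True)
    xd = x - x.mean(axis=1, keepdims=True)
    return (xd * yd).sum() / (xd ** 2).sum()

def fd_slope(y, x):
    """First-difference estimator: regress the differenced y on the differenced x."""
    dy, dx = np.diff(y, axis=1), np.diff(x, axis=1)
    return (dx * dy).sum() / (dx ** 2).sum()
```

Applied to many simulated panels, the mean squared errors of these slopes relative to true GLS are the quantities that Tables 1–3 report.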
10. CONCLUSION

In this paper, we compared the efficiency of OLS, FE, FD and GLS estimators in a random effects panel model with I(0) and I(1) regressor and remainder error. When the regression error is I(0) and the regressor is I(1), and hence the model is co-integrated, both the FE and GLS estimators are asymptotically efficient. When the regression error is I(1) and the regressor is I(1), and hence the model is spurious, the FE and GLS estimators are √n and √(nT) consistent, respectively. This implies that GLS is the preferred estimator as far as the regression error specification is concerned, since GLS converges at as good or better a rate in both cases (i.e. whether the regression error is I(0) or I(1)). Although there is a lot of research on dynamic autoregressive panel models (see the references in Baltagi, 2005), one should study more carefully the properties of the estimators considered in this paper, as well as those of GMM and MLE, under non-stationarity of the regressors and the remainder errors; see Choi (1999, 2002, 2006), Choi et al. (2004) and Han and Phillips (2007), to mention a few.
ACKNOWLEDGMENTS

We thank seminar participants at National Taiwan University and the Midwest Econometrics Group in St. Louis for helpful discussions.
REFERENCES
Baltagi, B. (2005). Econometric Analysis of Panel Data (3rd ed.). New York: John Wiley.
Baltagi, B. and C. Kao (2000). Nonstationary panels, co-integration in panels and dynamic panels: a survey. Advances in Econometrics 15, 7–51.
Baltagi, B. H. and W. Krämer (1997). A simple linear trend model with error components. Econometric Theory 13, 463–63.
Baltagi, B. H. and Q. Li (1991). A transformation that will circumvent the problem of autocorrelation in an error component model. Journal of Econometrics 52, 371–80.
Baltagi, B. H. and Q. Li (1997). Monte Carlo results on pure and pretest estimators of an error component model with autocorrelated disturbances. Annales d'Économie et de Statistique 48, 69–82.
Beach, C. M. and J. G. MacKinnon (1978). A maximum likelihood procedure for regression with autocorrelated errors. Econometrica 46, 51–8.
Breitung, J. and M. H. Pesaran (2008). Unit roots and cointegration in panels. In L. Matyas and P. Sevestre (Eds.), The Econometrics of Panel Data: Fundamentals and Recent Developments in Theory and Practice, 279–322. Heidelberg: Springer.
Choi, I. (1999). Asymptotic analysis of a nonstationary error component model. Working paper, Kookmin University, Korea.
Choi, I. (2002). Instrumental variables estimation of a nearly nonstationary, heterogeneous error component model. Journal of Econometrics 109, 1–32.
Choi, I. (2006). Nonstationary panels. Palgrave Handbooks of Econometrics, Volume 1, 511–39. New York: Palgrave Macmillan.
Choi, C., L. Hu and M. Ogaki (2004). A spurious regression approach to estimating structural parameters. Working paper, Ohio State University, US.
Fuller, W. and G. E. Battese (1973). Transformations for estimation of linear models with nested-error structure. Journal of the American Statistical Association 68, 626–32.
Han, C. and P. C. B. Phillips (2007). GMM estimation for dynamic panels with fixed effects and strong instruments at unity. Cowles Foundation Discussion Paper No. 1599, Yale University.
Hsiao, C. (1986). Analysis of Panel Data (2nd ed.). Cambridge: Cambridge University Press.
Kao, C. (1999). Spurious regression and residual-based tests for cointegration in panel data. Journal of Econometrics 90, 1–44.
Kao, C. and M.-H. Chiang (2000). On the estimation and inference of a cointegrated regression in panel data. Advances in Econometrics 15, 179–222.
Kao, C. and J. Emerson (2002a). Testing for structural change of a time trend regression in panel data. Part I. Journal of Propagations in Probability and Statistics 2, 57–75.
Kao, C. and J. Emerson (2002b). Testing for structural change of a time trend regression in panel data. Part II. Journal of Propagations in Probability and Statistics 2, 207–50.
Maeshiro, A. (1976). Autoregressive transformation, trended independent variables and autocorrelated disturbance terms. The Review of Economics and Statistics 58, 497–500.
Mundlak, Y. (1978). On the pooling of time series and cross-section data. Econometrica 46, 69–85.
Park, S. J. and B. M. Mitchell (1980). Estimating the autocorrelated error model with trended data. Journal of Econometrics 13, 185–201.
Phillips, P. C. B. (1986). Understanding spurious regressions in econometrics. Journal of Econometrics 33, 311–40.
Phillips, P. C. B. and H. Moon (1999). Linear regression limit theory for nonstationary panel data. Econometrica 67, 1057–111.
Phillips, P. C. B. and J. Y. Park (1988). Asymptotic equivalence of ordinary least squares and generalized least squares in regressions with integrated regressors. Journal of the American Statistical Association 83, 111–5.
Summers, R. and A. Heston (1991). The Penn World Table: an expanded set of international comparisons, 1950–1988. Quarterly Journal of Economics 106, 327–68.
Wooldridge, J. (2002). Econometric Analysis of Cross Section and Panel Data. Cambridge: MIT Press.
Econometrics Journal (2008), volume 11, pp. 573–592. doi: 10.1111/j.1368-423X.2008.00249.x
Asymptotic and qualitative performance of non-parametric density estimators: a comparative study
TERUKO TAKADA†
Graduate School of Business, Osaka City University, Osaka 558-8585, Japan E-mail:
[email protected] First version received: September 2006; final version accepted: April 2008
Summary Motivated by finance applications, we assessed the performance of several univariate density estimation methods, focusing on their ability to deal with heavy-tailed target densities. Four approaches, a fixed bandwidth kernel estimator, an adaptive bandwidth kernel estimator, the Hermite series (SNP) estimator of Gallant and Nychka, and the logspline estimator of Kooperberg and Stone, are compared. We conclude that the logspline and adaptive kernel methods provide superior performance, and the convergence rate of the SNP estimator is remarkably slow compared with the other methods. The Hellinger convergence rate of the SNP estimator is derived as a function of tail heaviness. These findings are confirmed in Monte Carlo experiments. Qualitative assessment reveals the possibility that side lobes in the tails of the fixed kernel and SNP estimates are artefacts of the fitting method. Keywords: Density estimation, Heavy tail, SNP, Kernel, Adaptive kernel, Logspline, Convergence rate, Hellinger.
1. INTRODUCTION

Non-parametric methods of density estimation play an increasingly important role in applied econometrics and finance; however, there has been little comparison of critical competing methods. Non-parametric density estimation enables data mining, and estimation or testing of the parameters of structural models with no assumptions about their true density. However, the choice of estimation method can lead to differing results, especially when it is difficult to fit the shape of the target density. In finance applications, financial data distributions are known to have heavy tails. Thus, it is particularly important to compare the performance of methods regarding heavy-tailed target densities. The primary purpose of this paper is to assess the performance of several univariate density estimators, focusing on target densities with heavy tails. The four approaches considered are the fixed bandwidth kernel (fixed kernel) estimator of Sheather and Jones (1991); the adaptive bandwidth kernel (adaptive kernel) estimator of Breiman et al. (1977) and Abramson (1982); the Hermite series (SNP) method of Gallant and Nychka (1987); and the logspline approach of Kooperberg and Stone (1991) and Stone et al. (1997). Although statistical analysis offers many competing non-parametric density estimation methods, economic and finance applications are largely limited to fixed kernel and SNP methods.
Typical comparative studies of non-parametric density estimation methods have been limited to comparisons of the algorithms within the fixed kernel estimator, and comparative studies among different approaches are very few. For instance, Scott and Factor (1981) compared two different fixed kernel methods and an orthogonal series method. Fenton and Gallant (1996b) compared the performance of the SNP and the fixed kernel method with Silverman's rule-of-thumb bandwidth in estimating mixtures of normal densities. Hwang et al. (1994) compared the fixed kernel, adaptive kernel, projection pursuit and radial basis function methods in a multivariate setting. Fadda et al. (1998) compared the adaptive kernel method, the maximum penalized likelihood method, and their own method based on the wavelet transform.

We first derived the rate of convergence of the SNP method as a function of tail heaviness, and compared it with that of the other methods.¹ For algebraically tailed densities such as the Student-t family, the Hellinger and L_1 error rate of the SNP estimator is derived as n^{−αγ/2}, where the number of parameters p = n^γ is assumed and α denotes a tail index reflecting algebraic tail heaviness. Given that the best SNP Hellinger rate is achieved at γ ∼ 1/(k + 1), where k is the order of L_2 differentiability, we can write the SNP Hellinger rate as a function of α only as n^{−α/(2α+4)}.² On the other hand, the theoretical error rates of the other three estimators are independent of the tail behaviour of the target density. The numerically computed error reduction rate of the Hermite (SNP) series approximating the true target density, and a Monte Carlo experiment designed to compare the four candidate estimators, confirm these findings. For exponentially tailed densities, the theoretical asymptotic rate of the SNP estimator is derived so as not to depend on the tail heaviness. In a small sample, however, analysis of the numerical Hermite approximation error reduction rate indicates that the rate does depend on tail heaviness, which is confirmed in a Monte Carlo experiment.

Our qualitative assessment of the four estimators in fitting heavy-tailed densities reveals that the SNP and fixed kernel methods tend to estimate bumps in the tails, and removing outliers effectively reduces these bumps. This suggests that side lobes in the SNP estimates, reported in previous literature, are artefacts of the method, supporting the claim of Bollerslev et al. (1992) that a few outliers unduly influence tail oscillations, and are not a feature of the entire sample period. Performance comparisons in fitting rough densities indicated that differences among the four methods are smaller than in the case of heavy-tailed target densities. For fitting heavy-tailed densities, we conclude that the logspline and adaptive kernel methods offer superior performance, and the performance of the SNP estimator, relative to the other methods, deteriorates as the density tail becomes heavier. In addition, if side lobes are found in the tails of the SNP and fixed kernel estimates, they are likely to be artefacts of the fitting method.

The paper is organized as follows. Section 2 summarizes the four methods and the error measurements. Section 3 derives the SNP convergence rates as a function of tail heaviness, which is confirmed in Section 4 by the numerical Hermite approximation error reduction rate of the true target density.
In Section 5, a Monte Carlo study determines the quantitative and qualitative effect of non-normality of the target densities, and compares the convergence rates. Section 6 argues
1 The SNP method is often used as an adjunct to efficient method of moments (EMM), rather than simply estimating densities, and the EMM approach is extensively applied in estimating stochastic volatility models involving asset return data, which is known to be fat-tailed. 2 This rate is slow, considering that the degree of freedom ν of the Student-t density matches α, and that α estimates by the conventional Hill estimator (for most exchange rates) lie between three and four.
the implications of our analysis in an empirical context using the daily pound/dollar exchange rate data. Finally, Section 7 concludes the study.
2. THE METHODS AND MEASURES OF DISCREPANCY

Given a sample {X_1, ..., X_n}, the classical kernel density estimator with a Gaussian kernel has the form
\[
\hat{f}_{FK}(x) = \frac{1}{nh}\sum_{i=1}^{n}\phi\!\left(\frac{x - X_i}{h}\right), \tag{2.1}
\]
where φ denotes the standard Gaussian density, and a fixed bandwidth h controls the degree of smoothing. Selection of h is by the two-stage direct plug-in bandwidth selection method of Sheather and Jones (1991), which has been shown to perform quite well for many density types by Park and Turlach (1992) and Wand and Jones (1995).

The adaptive kernel density estimator, first proposed by Breiman et al. (1977) and refined by Abramson (1982), is defined as
\[
\hat{f}_{AK}(x) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{h\lambda_i}\,\phi\!\left(\frac{x - X_i}{h\lambda_i}\right), \tag{2.2}
\]
where λ_i is a local bandwidth factor, which narrows the bandwidth hλ_i near modes and widens it in the tails. Based on the pilot estimate from the fixed kernel method, f̃(x), λ_i is determined as λ_i = {f̃(X_i)/g}^{−1/2}, where log g = n^{−1} Σ_i log f̃(X_i). For comparison purposes, the bandwidth of the pilot estimate is determined by the two-stage direct plug-in bandwidth selection method of Sheather and Jones (1991).

The SNP approach of Gallant and Nychka (1987) is a truncated Hermite series estimator which is spanned as
\[
\hat{f}_{SNP}(x, \theta) = \Bigl[\sum_{i=0}^{K}\theta_i w_i(x)\Bigr]^2 + \varepsilon_0\,\phi(x), \qquad \theta = (\theta_0, \ldots, \theta_K), \qquad \sum_{i=0}^{K}\theta_i^2 + \varepsilon_0 = 1, \tag{2.3}
\]
where w_i(x) is a normalized Hermite polynomial multiplied by the square root of the standard normal density, ε_0 is a small positive number, and K = 0, 1, .... The term ε_0 φ(x) works as a normalization factor. Estimation is by quasi-maximum likelihood:
\[
\hat{\theta} = \arg\max_{\theta}\,\frac{1}{n}\sum_{j=1}^{n}\log\Bigl[\frac{1}{\sigma}\,\hat{f}_{SNP}\Bigl(\frac{X_j - \mu}{\sigma}, \theta\Bigr)\Bigr], \tag{2.4}
\]
where μ and σ are location and scale, respectively. The truncation point K is chosen either as a function of sample size, such as K = Cn^{1/5} for some constant C, or according to some model selection criterion. The parametric dimension is p = K + 3 because of μ, σ and θ_0.

The logspline estimator of Kooperberg and Stone (1991) models a log-density function as a cubic spline:
\[
\hat{f}_{LS}(x, \theta) = \exp\Bigl[\sum_{i=1}^{p}\theta_i B_i(x) - c(\theta)\Bigr], \qquad x \in \mathbb{R}, \tag{2.5}
\]
where c(θ) is a normalization factor, and the functions {B_i(x)} denote the standard cubic B-splines. The coefficients θ are estimated by maximum likelihood, and knot selection determining the choice of the B-spline basis follows the BIC criterion.

Certain shapes of densities are hard to estimate, depending on the method employed. For example, in the case of ordinary kernel density estimation, Devroye (1987) discusses two factors that make estimation difficult: tail fatness and shape roughness. Accordingly, differing from regular parametric statistical models, the level of difficulty of density estimation depends on how performance is measured. We consider three measures of discrepancy: the mean integrated absolute error MIAE(f̂) = E[∫|f̂(x) − f(x)| dx], the mean integrated squared error MISE(f̂) = E[∫{f̂(x) − f(x)}² dx], and the Hellinger error H(f̂) = E[{∫(f̂(x)^{1/2} − f(x)^{1/2})² dx}^{1/2}]. The MIAE is a natural criterion, the MISE emphasizes large errors, as in mode fitting, and the Hellinger criterion accentuates the tail error.
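As a rough illustration of (2.1) and (2.2), the following Python sketch implements both kernel estimators with a Gaussian kernel. The bandwidth h is left as a user-supplied constant; the two-stage Sheather–Jones plug-in selector used in the paper is not reproduced here, so the value in the example is ad hoc.

```python
import numpy as np

def fixed_kernel(x_grid, data, h):
    """Gaussian fixed-bandwidth kernel estimate, as in (2.1)."""
    u = (x_grid[:, None] - data[None, :]) / h
    return np.exp(-0.5 * u ** 2).sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))

def adaptive_kernel(x_grid, data, h):
    """Abramson square-root-law adaptive kernel estimate, as in (2.2).
    The pilot uses the same fixed-bandwidth estimator."""
    pilot = fixed_kernel(data, data, h)               # pilot density at the data points
    g = np.exp(np.mean(np.log(pilot)))                # geometric-mean normalisation
    lam = (pilot / g) ** -0.5                         # local bandwidth factors lambda_i
    u = (x_grid[:, None] - data[None, :]) / (h * lam[None, :])
    k = np.exp(-0.5 * u ** 2) / (h * lam[None, :] * np.sqrt(2 * np.pi))
    return k.mean(axis=1)

# Example: estimate a t(3) density from 500 draws (h chosen ad hoc for illustration)
rng = np.random.default_rng(1)
data = rng.standard_t(df=3, size=500)
grid = np.linspace(-6, 6, 1025)
f_hat = adaptive_kernel(grid, data, h=0.4)
```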
3. SNP CONVERGENCE RATES FOR HEAVY-TAILED DENSITIES

In this section, we show the rate of convergence of the SNP estimator as a function of the dimension of the SNP model and the degrees of tail heaviness of the target densities.

3.1. Dependence of the SNP error rates on tail heaviness

Coppejans and Gallant (2002) have shown that for K = n^γ the best Hellinger error rate for the SNP density estimator is h ∼ n^{−kγ/2}, where k is determined by the conditions in the following lemma.³

LEMMA 3.1 (The rate for Hermite coefficients). Let f be a density such that f(x) = g²(x)e^{−x²/2} + ε_0 φ(x). If g is k-times differentiable with g^{(j)} ∈ L_2(e^{−x²/2}) for j = 0, 1, ..., k, then the Hermite coefficients of g satisfy Σ_{i=K}^{∞} θ_i² = o(K^{−k}).

Note that every g in L_2(e^{−x²/2}) has the expansion g(x) = Σ_{i=0}^{∞} θ_i w_i(x). Lemma 3.1 seems to suggest that the rate of convergence of the SNP approximation is good if f is sufficiently smooth, as determined by the order of differentiability, k. While the differentiability of g in the foregoing lemma may be considered quite innocuous, the requirement that g^{(j)} ∈ L_2(e^{−x²/2}) turns out to be surprisingly restrictive for the class of heavy-tailed f.

Let F(t) denote the distribution function evaluated at a fixed value t. Classically, we may distinguish densities with exponential tails, satisfying lim_{t→∞} −log(1 − F(t))/(γt^β) = 1 for some γ > 0 and β > 0, from densities with algebraic tails, satisfying lim_{t→∞} −log(1 − F(t))/(α log t) = 1 for some α > 0. He et al. (1990) referred to distributions with exponential tails as light-tailed, and distributions with algebraic tails as heavy-tailed. Light-tailed densities include the logistic and double exponential with β = 1, and normal mixtures with β = 2. Heavy-tailed densities include the Student-t family. The tail index α for much financial data is considered to be about three to five.
³ The L_1 rate of convergence was derived by Fenton and Gallant (1996a) and later replaced with a slower rate in an erratum (Fenton and Gallant, 1996c) as ∫_{−∞}^{∞} |f̂_SNP(x) − f(x)| dx = o_s(n^{−1/4+γ/4+δ}) + o(n^{−kγ/2}) for every small δ > 0, where k is determined by the conditions in Lemma 3.1.
For example, Huisman et al. (2001) reported estimates of the tail index α for five major weekly exchange rates against the U.S. dollar over the period 1979–1990; the estimates of α by the conventional Hill estimator lie between three and four, and the values of α estimated by their modified Hill estimator are mostly larger than four.

For the class of densities with algebraic tails, we can show that the condition g^{(j)} ∈ L_2(e^{−x²/2}) binds the SNP convergence rates to dependence on tail heaviness.

PROPOSITION 3.1. Let the true density f have the form f(x) = g²(x)e^{−x²/2} + ε_0 φ(x) with g^{(j)} ∈ L_2(e^{−x²/2}) for j = 0, 1, ..., k. If f is the density corresponding to the class of distribution functions with algebraic tails, F, of the form 1 − F(x) ∼ x^{−α} for x → ∞, then the maximum order of differentiability k is the largest integer satisfying k < α/2.
For example, for the family of Student-t densities, whose degrees of freedom is ν = α, the maximum order of L_2 differentiability is k = 0, 0, 1, 1, 2, 2, ... for ν = 1, 2, 3, 4, 5, 6, .... If we approximate k as
\[
k \sim \alpha/2 \quad \text{for both even and odd } \alpha, \tag{3.1}
\]
then Σ_{i=K}^{∞} θ_i² = o(K^{−k}) in Lemma 3.1 becomes
\[
\sum_{i=K}^{\infty}\theta_i^2 = o(K^{-\alpha/2}). \tag{3.2}
\]
This connection suggests that the convergence of the SNP estimator is highly sensitive to the tail behaviour of the target density, with poor performance at heavy-tailed densities such as the Student-t with low degrees of freedom.

For densities with exponential tails, 1 − F(x) ∼ e^{−γx^β} for x → ∞, and it is easily checked that the order of L_2 differentiability does not depend on the tail heaviness parameter β. The value of [g^{(j)}(x)]² e^{−x²/2} is of order e^{−γx^β} x^{2j} for β ≤ 2 and e^{−γx^β} x^{(−2+2β)j} for β ≥ 2. For any value of β, the exponential factor has a larger effect than the algebraic factor in the tails. Thus ∫_{−∞}^{∞} [g^{(j)}(x)]² e^{−x²/2} dx < ∞ is satisfied for all β.

3.2. SNP coefficients, Hellinger error and asymptotic Hellinger rate

We show that the Hellinger error has a nice connection with the coefficients of the Hermite polynomials, θ, and that the SNP estimator minimizes the Hellinger error. Under certain regularity conditions, a density f can be represented as an infinite sum of orthonormal Hermite polynomials:
\[
f(x, \theta) = \Bigl[\sum_{i=0}^{\infty}\theta_i w_i(x)\Bigr]^2, \qquad \sum_{i=0}^{\infty}\theta_i^2 = 1, \tag{3.3}
\]
where θ = (θ_0, θ_1, ...). The normalized Hermite polynomial w_i(x) is given by w_i(x) = (√(2π) i!)^{−1/2} 2^{−i/2} H_i(x/√2) e^{−x²/4}, where the standard Hermite polynomial H_i(x) is defined as H_i(x) = (−1)^i e^{x²} [d^i e^{−x²}/dx^i] for i ≥ 0. Let T_K(x, λ) be a squared partial sum of normalized Hermite polynomials, T_K(x, λ) = [Σ_{i=0}^{K} λ_i w_i(x)]², where the λ_i are arbitrary and K is the number of Hermite polynomial terms. The squared Hellinger error in approximating f(x) by T_K(x, λ) is H²(f, T_K) = ∫_{−∞}^{∞} [√f(x) − √T_K(x, λ)]² dx.
The coefficients θ_i minimizing the Hellinger error can be determined by calculating
\[
\theta_i = \int_{-\infty}^{\infty}\sqrt{f(x, \theta)}\,w_i(x)\,dx. \tag{3.4}
\]
Using the orthogonality of the w_i, we have H²(f, T_K) = 1 + Σ_{i=0}^{K} λ_i² − 2Σ_{i=0}^{K} λ_iθ_i = 1 + Σ_{i=0}^{K} (λ_i − θ_i)² − Σ_{i=0}^{K} θ_i². Thus, H(f, T_K) is minimized over the λ_i by choosing λ_i = θ_i for i = 1, ..., K. Let f_K(x, θ) be the squared partial sum of the Hermite series,
\[
f_K(x, \theta) = \Bigl[\sum_{i=0}^{K}\theta_i w_i(x)\Bigr]^2. \tag{3.5}
\]
Having chosen the λ_i in this way, we have
\[
H(f, T_K) = H(f, f_K) = \Bigl(\sum_{i=K+1}^{\infty}\theta_i^2\Bigr)^{1/2}; \tag{3.6}
\]
therefore, ε_0 in (2.3) may be interpreted as the squared Hellinger error of the SNP estimator with the coefficients determined as in (3.4). The relationship in (3.6) enables reducing errors coming from numerical integration when obtaining the Hellinger approximation error rate of the SNP (Hermite) series in Section 4.

Given (3.2) and (3.6), the Hellinger convergence rate for algebraically tailed densities is
\[
H(f, K, \alpha) \sim K^{-\alpha/4}. \tag{3.7}
\]
For K = n^γ, Coppejans and Gallant (2002) state that the best Hellinger rate is achieved when γ ∼ 1/(k + 1). Given relationship (3.1), the best Hellinger rate in fitting algebraically tailed densities can be written as a function of the sample size n and the tail heaviness α as
\[
H(f, n, \alpha) \sim n^{-\alpha/(2\alpha+4)}. \tag{3.8}
\]
For example, the best Hellinger rate is n^{−1/6} for the t(1) density, and n^{−3/10} for the t(3) density. The error rates for estimating exponentially tailed densities decrease exponentially.
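For concreteness, substituting representative tail indices into (3.8) gives the following rates; this is simple arithmetic shown only to illustrate the formula, with α = 5 added as an example of the upper end of the range quoted above for financial data:
\[
H(f, n, \alpha) \sim n^{-\alpha/(2\alpha+4)}: \qquad
\alpha = 1:\; n^{-1/6}, \qquad
\alpha = 3:\; n^{-3/10}, \qquad
\alpha = 5:\; n^{-5/14}.
\]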
4. EVALUATING THE APPROXIMATION ERROR OF THE SNP ESTIMATOR

In this section, we observe the Hellinger error reduction rate, as the number of parameters p increases, when the SNP (Hermite) series approximates the true target density with varying tail heaviness. This analysis shows that the nature of Hermite polynomial approximation causes the SNP estimator not to work well in heavy-tailed situations. In application, the number of parameters p increases as the sample size n increases; therefore, this section allows us to observe the rate of convergence in a small sample, in addition to confirming the asymptotic rate of convergence derived in (3.7). Note that the Hermite coefficients here are determined by orthogonalization given the PDF of the true target density, which is different from the Monte Carlo setting: no simulation is involved. In the Monte Carlo simulation setting, the SNP density function is modelled as in (2.3) and θ is determined by QMLE as in (2.4), given a random sample.
[Figure 1. The Hellinger approximation error of the SNP (Hermite) series as a function of the number of Hermite polynomial terms K: (a) densities with algebraic tails (t(1), t(2), t(3), t(5), t(11) and the bimodal normal mixture); (b) densities with exponential tails (β = 0.2, 0.5, 1, 2 (bimodal), 3, 4).]
To obtain the Hermite approximation error, we model the true target density as in (3.3) and determine θ by orthogonalization as in (3.4), given the PDF of the true target density. The Hermite approximation error is then obtained by evaluating the Hellinger distance between the infinite sum of the Hermite series f(x, θ) in (3.3) and the squared partial sum f_K(x, θ) in (3.5), which is computed via (3.6). For several densities f with varying tail behaviour, we computed the Hermite coefficients θ_0, ..., θ_K up to K = 200. The number of parameters is given by p = K + 3. All the computations were done in Mathematica version 5.0.0.0. The integrals were computed using the adaptive Gauss–Kronrod quadrature rule with precision 10^{−70}.

As densities with algebraic tails of differing tail heaviness, we consider Student-t distributions with degrees of freedom ν = 1, 2, 3, 5 and 11. A bimodal mixture of normal densities, (1/2)N[−1, 0.5²] + (1/2)N[1, 0.5²], from Marron and Wand (1992), is added as a benchmark of normal tails. Figure 1(a) shows the log–log plot of the Hellinger error as a function of K, each density in a different grey scale. Solid lines are the Hellinger approximation errors according to differing degrees of tail heaviness; for the t(ν) density, α = ν. Dotted lines in the left panel are the derived theoretical SNP Hellinger rate K^{−α/4} for each α. Smaller α or β indicates heavier tails. The plotted curves are asymptotically linear for the Student-t densities, but curve downward for the bimodal density. Moreover, as the degrees of freedom ν decreases, the plotted slope becomes flatter, indicating that the rate of approximation error reduction measured by the Hellinger distance slows as the tails of the t-density become heavier. The theoretically derived Hellinger rates K^{−α/4} are plotted as dotted lines such that they cross the solid lines of the corresponding numerical rates at K = 200. Observe that every dotted line overlaps the corresponding asymptotically linear slope. The validity of K^{−α/4} is also indicated by linear least squares regression on the asymptotically linear portions. See table 1.4 of Takada (2001) for further details.
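The orthogonalization step can be sketched in a few lines of Python. This is only a low-precision illustration of (3.4) and (3.6): the paper's computations use 70-digit Gauss–Kronrod quadrature in Mathematica, and double-precision quadrature of the oscillatory Hermite functions becomes unreliable well before K = 200, so the sketch stops at a small K.

```python
import numpy as np
from math import factorial, pi, sqrt
from scipy.integrate import quad
from scipy.special import eval_hermite
from scipy.stats import t as student_t

def w(i, x):
    """Normalised Hermite function w_i(x) used in (3.3)."""
    c = (sqrt(2 * pi) * factorial(i)) ** -0.5 * 2 ** (-i / 2)
    return c * eval_hermite(i, x / sqrt(2)) * np.exp(-x ** 2 / 4)

def hellinger_approx_error(pdf, K):
    """Hellinger error (3.6) of the K-term Hermite approximation:
    theta_i from (3.4) by quadrature, error = sqrt(1 - sum theta_i^2)."""
    theta2 = 0.0
    for i in range(K + 1):
        th, _ = quad(lambda x: np.sqrt(pdf(x)) * w(i, x), -np.inf, np.inf, limit=200)
        theta2 += th ** 2
    return sqrt(max(1.0 - theta2, 0.0))

# Example: error of a 20-term approximation to the t(3) density
print(hellinger_approx_error(lambda x: student_t.pdf(x, df=3), K=20))
```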
For densities with exponential tails, the SNP convergence rate is shown not to depend on the tail heaviness parameter β in Section 3.1. To confirm this result numerically we need an exponentially tailed density whose tail heaviness is controlled by parameter β, but many problems arise if we use known distributions with β < 2. 4 Thus we create a new density which continuously changes its tail heaviness by controlling β:
\[
f(x, \beta) = C \exp\Bigl\{-\tfrac{1}{2}\bigl[x^2 + (\beta/2 - 1)^2\bigr]^{\beta/2}\Bigr\}, \tag{4.1}
\]
where β > 0 and C is the normalization factor. The density is designed to be standard normal when β = 2. In Figure 1(b), the plotted grey solid lines show the Hellinger error as a function of K on a log–log scale for each of the densities formulated as in (4.1), with β = 0.2, 0.5, 1, 2, 3 and 4. To observe the error reduction rate, the density with β = 2 is replaced by the bimodal mixture of normals. All the lines in Figure 1(b) curve downward, indicating that the Hellinger error decreases at an exponential rate. This observation confirms the asymptotic result of Section 3.1: the Hellinger error decreases exponentially, and the asymptotic rate does not depend on the tail heaviness parameter β when the target density has exponential tails. For small K, however, the Hellinger error size clearly depends on the tail heaviness parameter β. In application, the number of Hermite polynomial terms K increases as the sample size increases, implying that the Hellinger error size in a small sample depends on tail heaviness.
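One can check the β = 2 normalization claim directly by evaluating (4.1) numerically; the following minimal sketch computes the constant C by quadrature (the function names are illustrative only):

```python
import numpy as np
from scipy.integrate import quad

def f_beta_unnormalised(x, beta):
    """Unnormalised version of the exponential-tail family in (4.1)."""
    return np.exp(-((x ** 2 + (beta / 2 - 1) ** 2) ** (beta / 2)) / 2.0)

def normalised_f_beta(beta):
    """Return the density of (4.1) with the constant C obtained numerically."""
    area, _ = quad(lambda x: f_beta_unnormalised(x, beta), -np.inf, np.inf)
    return lambda x: f_beta_unnormalised(x, beta) / area

# Sanity check: with beta = 2 the family reduces to the standard normal density
f2 = normalised_f_beta(2.0)
print(f2(0.0))   # ~ 0.3989 = 1/sqrt(2*pi)
```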
5. MONTE CARLO SIMULATION STUDY

This section compares the quantitative and qualitative performance of the four density estimators in fitting non-normal densities via a Monte Carlo study. The asymptotic rates of convergence in estimating t(3) and lognormal densities are compared with respect to the MIAE, MISE and Hellinger error measures, confirming the validity of the derived SNP Hellinger rate.

5.1. Experimental design

In addition to tail heaviness, we consider another type of non-normality: shape roughness. Figure 2 shows the densities used for comparative analysis in this section. In addition to the normal density, t(3) and t(1) densities are chosen to determine the effect of algebraic tail behaviour. The lognormal density is included as a density with a heavier exponential tail than the normal density. As densities with different degrees of roughness, the skewed bimodal and asymmetric claw densities are chosen from the Marron and Wand (1992) test suite for evaluating non-parametric density estimators, and all densities are rescaled so that 99.9% of the mass is included in the range [−3, 3].

We generated 500 replications for each combination of target density f(x) and sample size n. For each replication, the density estimates f̂(x) of the four methods are computed on an x-grid from −M to M with 1025 equally spaced points.

⁴ For example, the Laplace density with β = 1 is not continuous at x = 0. The exponential density cannot be approximated by the SNP method. The Weibull density changes shape depending on its parameters, some of which the SNP method cannot estimate.
[Figure 2. Densities for performance evaluation: normal, lognormal, t(1), t(3), skewed bimodal and asymmetric claw. Skewed bimodal = (3/4)N[0, 1] + (1/4)N[2/3, (1/3)²]; asymmetric claw = (1/2)N[0, 1] + Σ_{i=−2}^{2} (2^{1−i}/31) N[i + 1/2, (2^{−i}/10)²].]
We then computed ∫_{−M}^{M} |f̂ − f| dx, ∫_{−M}^{M} (f̂ − f)² dx and [∫_{−M}^{M} (f̂^{1/2} − f^{1/2})² dx]^{1/2} for each replication, and averaged the 500 values to get estimates of MIAE(f̂), MISE(f̂) and H(f̂), respectively. The integrals were computed by Romberg integration, using qromb from Press et al. (1992). The support is set as M = 3. All computations are conducted interfaced to R (version 1.9.1) with precision 10^{−15}. The program used for SNP density estimation, in C++, is version 2002, downloaded from ftp.econ.duke.edu, in directory pub/arg/npe and pub/arg/libcpp (Gallant and Nychka, 1987). The direct plug-in fixed kernel program in FORTRAN (version 2.22-14, August 2004; Wand and Jones, 1995) and the logspline density estimation program in C (version 2.0.1, March 2004; Stone et al., 1997) were downloaded from the Comprehensive R Archive Network (CRAN, http://lib.stat.cmu.edu/R/CRAN). The adaptive kernel estimator code, in C, was written by the author, based on Silverman (1986, pp. 101–102). In the Monte Carlo simulation study, the default settings are basically used for all algorithms. The logspline program includes knot deletion and addition procedures. For comparison, the number of parameters p for the SNP and logspline estimators is determined to minimize BIC. The upper bound of the Hermite polynomial order K of the SNP method is set at K = n^{1/2} to avoid non-convergence.

5.2. Effect of non-normality

For the densities shown in Figure 2, we conducted simulations for a sample size n = 500 to determine performance changes as the densities become heavier-tailed or rougher. The statistics are summarized in Table 1.

5.2.1. Effect of tail heaviness. The effect of tail heaviness is more severe with the SNP and fixed kernel methods than with the logspline and adaptive kernel methods, and there is a significant difference between the two groups. For the t density with degrees of freedom 2, 3, 5, 9 and 11, and the normal density, Figure 3 clearly shows this tendency by plotting the Hellinger error obtained by Monte Carlo simulation (500 samples) of the four methods, computed in the same setting as for the results listed in Table 1. The difference between the bold lines (SNP and fixed kernel methods) and thinner lines (logspline and adaptive kernel methods) becomes significantly larger for heavier-tailed target densities. The MIAE and MISE for the SNP method are also the worst among the heavy-tailed density estimates in Table 1: lognormal, t(1) and t(3) densities.
Table 1. Simulated errors of non-normal density estimates (sample size 500, average of 500 simulated samples). Entries are Mean (S.D.).

Estimated density   Method            MIAE             MISE              Hellinger        p
Lognormal           SNP               0.272 (0.064)    0.133 (0.068)     0.189 (0.032)    22.7 (3.0)
                    Fixed Kernel      0.148 (0.022)    0.028 (0.010)     0.153 (0.011)    –
                    Adaptive Kernel   0.128 (0.024)    0.024 (0.010)     0.106 (0.011)    –
                    Logspline         0.099 (0.033)    0.022 (0.015)     0.073 (0.019)    4.8 (1.2)
t(1)                SNP               0.983 (0.327)    17.199 (6.401)    0.591 (0.199)    25.5 (0.7)
                    Fixed Kernel      0.544 (0.116)    6.711 (3.174)     0.329 (0.038)    –
                    Adaptive Kernel   0.167 (0.052)    0.547 (0.636)     0.176 (0.034)    –
                    Logspline         0.133 (0.069)    0.520 (1.384)     0.105 (0.049)    7.6 (1.1)
t(3)                SNP               0.127 (0.046)    0.014 (0.013)     0.103 (0.024)    10.8 (3.9)
                    Fixed Kernel      0.104 (0.024)    0.008 (0.005)     0.105 (0.012)    –
                    Adaptive Kernel   0.092 (0.026)    0.009 (0.006)     0.066 (0.015)    –
                    Logspline         0.097 (0.037)    0.011 (0.011)     0.072 (0.021)    4.5 (1.1)
Normal              SNP               0.041 (0.020)    0.001 (0.001)     0.028 (0.014)    3.0 (0.0)
                    Fixed Kernel      0.079 (0.024)    0.002 (0.001)     0.056 (0.014)    –
                    Adaptive Kernel   0.080 (0.024)    0.002 (0.002)     0.056 (0.012)    –
                    Logspline         0.098 (0.033)    0.004 (0.003)     0.065 (0.018)    4.0 (1.1)
Skewed Bimodal      SNP               0.159 (0.033)    0.008 (0.003)     0.113 (0.020)    7.3 (1.4)
                    Fixed Kernel      0.149 (0.027)    0.007 (0.002)     0.115 (0.018)    –
                    Adaptive Kernel   0.130 (0.028)    0.005 (0.002)     0.097 (0.018)    –
                    Logspline         0.137 (0.033)    0.007 (0.004)     0.096 (0.020)    5.3 (0.8)
Asymmetric Claw     SNP               0.251 (0.045)    0.021 (0.006)     0.177 (0.019)    8.6 (4.2)
                    Fixed Kernel      0.240 (0.023)    0.019 (0.002)     0.174 (0.011)    –
                    Adaptive Kernel   0.214 (0.024)    0.017 (0.002)     0.168 (0.009)    –
                    Logspline         0.198 (0.033)    0.015 (0.005)     0.153 (0.016)    9.8 (1.8)

Note: The number of parameters is denoted as p. The statistics for the logspline estimates of the t(1) density are based on the 497 samples for which estimates are available.
SNP estimation requires the sample variance, as in (2.4), and estimation of densities without a variance, such as t(1) and t(2), is considered to be very difficult. The relative performance for heavy-tailed density estimation measured by the MIAE and the Hellinger error is (1) logspline, (2) adaptive kernel, (3) fixed kernel and (4) SNP. Table 1 shows that relatively more parameters are used for SNP estimation than for logspline estimation as the tail heaviness increases, except for the normal density.

To assess the qualitative performance in fitting heavy tails, the t(1), t(3) and lognormal densities are estimated using the four methods for one realization of a random sample. The representative results are shown in Figure 4. Solid curves are estimated densities and dotted curves are true densities. The data range of the generated random sample is [−2.99, 1.27] for the lognormal density, [−2.27, 1.27] for the t(1) density, and [−3.33, 3.92] for the t(3) density.
[Figure 3. Effect of algebraic-tail heaviness: Hellinger error of the four methods (SNP, fixed kernel, logspline, adaptive kernel) for t(2), t(3), t(5), t(9), t(11) and normal target densities.]
The fit of the SNP estimator tends to be too flat in the centre, while the logspline estimator sometimes fits too acutely at modes, worsening the MISE compared to the fixed kernel method. To emphasize tail fitting performance, the vertical axes of the panels in Figure 4(a) are transformed to log scale in Figure 4(b). The distance log f − log f̂ is similar to the Hellinger distance f^{1/2} − f̂^{1/2} in that the square root operation for emphasizing tail behaviour is replaced with the logarithm operation. In Figure 4(b), the SNP estimates around the tails of the three target densities involve high frequency oscillation. This is probably because the SNP estimator has no other way to fit heavy tails than mixing higher degrees of Hermite polynomials with the Gaussian tail.⁵ Prior to the work of Gallant and Nychka (1987), Good and Gaskins (1971) suggested modelling a density having a Hermite expansion, as in (3.3), with a roughness penalty for the truncation method, instead of using selection criteria such as BIC.⁶ Their roughness penalty depends on the first and second derivatives of the curvature of the estimated density, which might reduce the high frequency oscillation in the estimated shape of the heavy tails.

The fixed kernel density estimates in Figure 4(b) show that its heavy-tail fitting performance is poor for all three densities. The possible reason is that it cannot produce estimates where there are no data. In order to fit sparse data, or outliers in the tails, the fixed bandwidth becomes very narrow. This causes the generated estimates to include many small bumps, and the estimates can include discontinuities.

We do not see problems in fitting t(3) for the adaptive kernel method. For the lognormal density, the estimated shape is relatively good up to the largest data point, 1.27, but curves downward where there are no data.

⁵ The Gaussian leading term can be replaced with an ARCH or GARCH representation, which might improve the tail fitting performance. Techniques to remove the small bumps in the tail, as in Efromovich (1999, Section 3.1), might also improve the performance.
⁶ This point was suggested by Roger Koenker.
[Figure 4. Estimates of heavy-tailed densities (lognormal, t(1) and t(3)) by the four methods: (a) normal-scaled vertical axes; (b) log-scaled vertical axes.]
This does not happen if the bandwidth for the pilot estimation is determined by Silverman's (1986) rule-of-thumb h = 0.9An^{−1/5}, where A = min{standard deviation, interquartile range/1.34}. The direct plug-in method generally gives a narrower bandwidth than Silverman's rule-of-thumb in estimating heavy-tailed densities. Therefore, the direct plug-in offers slightly better quantitative performance, but the estimated density shape involves slightly more wavy lines than that generated by Silverman's rule-of-thumb. Around the tails, adaptive kernels are wider than fixed kernels, but the t(1) tails seem too heavy: the tail data must be too sparse even for adaptive kernels, generating estimate plots similar to those of the fixed kernel method. The logspline estimator fits well in the log-scaled graph, partly because it optimizes with respect to the log-density, and extrapolates linearly in log scale beyond the extreme knots.

Now we discuss the problems involved in t(1) estimation, which is useful in showing extreme instances of the many possible problems we face in fitting heavy-tailed densities. For t(1) estimation, the adaptive kernel method shows reasonably good performance with no non-convergence problem, if we choose a simple bandwidth selection method for the pilot estimation. For all methods, the error statistics of t(1) estimation in Table 1 are significantly larger than those for estimation of the other densities. The SNP is the worst method. For example, its MIAE is 0.983, implying that the SNP estimator almost completely misses the shape of the t(1) density. The theoretical maximum value for the MIAE is two, which occurs when there is no common range for the support of the two densities. The maximum Hellinger error varies according to the shape of the true density. The standard deviation of the error statistics for the SNP estimates is by far the largest among the four, indicating high variability of the SNP estimates.

The MIAE of the fixed kernel method for the t(1) density estimates is 0.544, more than double that of the adaptive kernel and logspline methods. The algorithm for direct plug-in bandwidth selection and fixed kernel density estimation, coded by Wand and Jones (1995), did not generate estimates for 170 samples out of 500, because the generated bandwidth was narrower than the minimum bandwidth their algorithm can accept. The statistics for the fixed kernel method in Table 1 are obtained by setting the minimum acceptable bandwidth h = 0.00147 for these 170 estimates. The degree of difficulty of t(1) density estimation differs depending on the bandwidth selection method or computational algorithm. If we use the R function "density" with the default bandwidth of Silverman's rule-of-thumb, no error occurs, and the obtained MIAE, MISE and Hellinger error are 0.214, 0.895 and 0.222, respectively.

The error statistics of the adaptive kernel method shown in Table 1 outperform those of the fixed kernel method. Because of the direct plug-in bandwidth selection used for the pilot estimates, the same problem arose as in the case of the fixed kernel method, and the identical remedy was applied. Simple pilot estimations, such as Silverman's rule-of-thumb, do not cause failure in t(1) estimation, and become good candidates for robust and efficient heavy-tailed density estimation.

While the error statistics of the logspline method for t(1) density estimation are reasonably good, it has some problems. The logspline program fails to produce appropriate estimates for 46 results out of 500.
The algorithm determines the number of initially placed knots in a data-driven manner, and knots are added to or deleted from the initial placement according to the BIC criterion. We modified the code so that it tries other numbers of initial knots, and 43 estimates were obtained out of the 46 failures. However, the modified logspline program still fails on three samples; accordingly, the statistics in Table 1 are based on 497 samples. The most extreme values of the three failing samples are −53.56, −34.99 and −295.15, respectively. The logspline method places knots at the first and last order statistics, and this extremeness appears to be beyond the assumptions of the knot placement rule, part of which is designed by experience.
[Figure 5. Estimates of rough densities (normal, skewed bimodal and asymmetric claw) by the four methods.]
5.2.2. Effect of roughness. We consider the performance in fitting rough densities with multiple modes and skewness. Statistics for the skewed bimodal and asymmetric claw densities indicate that the performance is better for the logspline and adaptive kernel methods than for the SNP and fixed kernel methods in all error measures, but the difference is smaller than in the case of fitting heavy-tailed densities. The qualitative performance, shown in Figure 5, where solid lines are estimated densities and dotted lines are true densities, illustrates that the SNP estimates are relatively flatter at the modes than those of the other estimators, in accordance with the worst MISE performance. The SNP method sometimes fits the complicated modes nicely, but sometimes totally fails to capture the shape. When estimating asymmetric claw densities, the standard deviation of the MIAE for the SNP estimates is 0.045, which is significantly larger than for the other methods. Kernel-based methods capture the existence of modes, but generally miss the sharp peaks; the middle-level MIAE (among the four methods, with the lowest standard deviation) indicates this tendency. However, the adaptive kernel's sensitivity to peaks is better than that of the fixed kernel method. The logspline method generally captures the sharp peaks and steep valleys better than the other three methods. For estimating samples from normal densities, the fixed kernel method cannot correctly capture the shape of the mode; the logspline and adaptive kernel methods tend to perform slightly worse than the fixed kernel, as they tend to react too sensitively to peak shapes. Pagan and Ullah (1999, p. 74) explained that this problem arises because the adaptive kernel uses the geometric mean as a normalizing factor.
[Figure 6. Rates of convergence to the t(3) density for the four methods, measured by MIAE, MISE and Hellinger error, as the sample size n increases.]
5.3. Rate of convergence

Through a Monte Carlo simulation study, we compared the rate of error reduction in fitting heavy-tailed densities as the sample size increases. We chose the t(3) density as the algebraically tailed density, and the lognormal density as the exponentially tailed density. We simulated 500 random samples for each density, for sample sizes n = 50, 100, 500, 1000, 2500 and 5000. The samples are contiguous, in that the observations in smaller samples are the leading observations of larger samples.

In t(3) density estimation, the SNP procedure did not converge for three results at n = 2500 and 11 results at n = 5000. The statistics are based on the available 497 results for n = 2500 and 489 results for n = 5000; the statistics of the other methods are averages of all 500 results. For sample sizes n = 100, 1000, 2500 and 5000, the number of parameters p for the SNP estimates is 6.2, 13.6, 18.5 and 23.2, while for the logspline estimates it is 3.5, 5.0, 5.8 and 6.5, respectively.

Each of the three panels in Figure 6 plots the rates of convergence in estimating the t(3) density, measured by the MIAE, MISE and Hellinger error, respectively. The SNP rates are significantly slowest when measured by MIAE and MISE. When measured by MISE, the fixed kernel method, which minimizes MISE, proves best in terms of the error level and convergence rates. When evaluated by the Hellinger error, which emphasizes tail fitting performance, however, the SNP rate is similar to that of the fixed kernel method, and the difference in level and rates with the other two methods is significant. The performance as ranked by the Hellinger error is: (1) adaptive kernel, (2) logspline, (3) fixed kernel and (4) SNP.

Figure 7 compares the simulated Hellinger error rates with the theoretical SNP rate for the t(3) density. The dotted line is the derived theoretical SNP Hellinger rate for the t(3) density, n^{−3/10}, which is plotted so that it intersects the simulated results at n = 1000. The bold grey line, denoted MC rate (convergent only), is the same as the grey bold line in Figure 6, which is the average of the 497 and 489 convergent results for n = 2500 and 5000, respectively. The bold grey line curves downward for n ≥ 2500, because non-convergent results are removed in obtaining the error statistics. The thinner solid black line in Figure 7, denoted MC rate (default), is the simulation result based on the default output from the SNP program, where non-convergent results are replaced with normal approximations. This is the reason the thin solid black line curves slightly upward for n ≥ 2500.
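Empirical rate exponents of the kind plotted in Figures 6–8 can be summarized by regressing the log error on the log sample size; a minimal sketch follows. The error values below are placeholders rather than the paper's results, and the fitted slope is only a crude summary of the rate exponent.

```python
import numpy as np

# Hypothetical Monte Carlo output: average Hellinger error by sample size
# (placeholder numbers, not the paper's results).
n = np.array([50, 100, 500, 1000, 2500, 5000])
hellinger = np.array([0.21, 0.18, 0.13, 0.11, 0.09, 0.08])

# If H(n) ~ c * n^(-r), the slope of log H on log n estimates -r.
slope, intercept = np.polyfit(np.log(n), np.log(hellinger), 1)
print(f"estimated rate exponent: {-slope:.3f}")   # compare with 3/10 for t(3), eq. (3.8)
```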
[Figure 7. Theoretical and simulated SNP convergence rates to the t(3) density (Hellinger error against sample size n, log scales).]

[Figure 8. Rates of convergence to the lognormal density for the four methods, measured by MIAE, MISE and Hellinger error.]
The true Monte Carlo result is considered to lie between the bold grey line and the thin solid black line, and our theoretical SNP Hellinger rate is in good accordance with the simulated SNP Hellinger rate.

The three panels in Figure 8 plot the convergence rate of the four methods to the standard lognormal density, where the notation is the same as in Figure 6. The statistics are based on the 498 convergent SNP results for n = 5000, while the other statistics are averages of all 500 results. Due to the restriction K ≤ n^{1/2}, imposed to avoid non-convergence of the algorithm, the maximum p is set as 13, 35, 53 and 74 for n = 100, 1000, 2500 and 5000, respectively. The corresponding p of the SNP estimates are 10.7, 31.3, 48.5 and 68.1, respectively, suggesting that the SNP method requires more parameters for more accurate estimation. The corresponding p of the logspline estimates are 4.2, 5.4, 6.3 and 7.2, respectively, much smaller than those of the SNP method. In the right panel, where the Hellinger convergence rates are plotted, the observed relative order of performance is: (1) logspline, (2) adaptive kernel, (3) fixed kernel and (4) SNP.
[Figure 9. Density of daily changes of the pound/dollar rate (1974–1983): estimates by the four methods, with all data used (normal and log-scaled vertical axes) and with the two outliers removed (log-scaled vertical axis).]
Similar to the t(3) density, the errors measured by MIAE and MISE are significantly worse for the SNP method. The bold grey curve for the MIAE and MISE rate of the SNP method curves slightly downward, compared with the linear slopes of the other methods. This confirms the conclusion derived in Section 3.1. Changing the SNP selection method from BIC to a cross-validation strategy does not change our conclusion; by significantly increasing the number of parameters, it provides some improvement in mode fitting, but tail fitting performance is affected very little.
6. EMPIRICAL EXAMPLE AND IMPLICATIONS

We compare unconditional density estimates for the 2510 observations of adjusted daily log price changes of the British pound/U.S. dollar exchange rate from 1974 to 1983, which are used in Gallant et al. (1991) and publicly available at ftp.econ.duke.edu, in directory pub/arg/data. We obtain 1025 points of density estimates corresponding to the range of the data set. All other settings are the same as in Section 5. The top row of panels in Figure 9 illustrates the density estimates of the four methods. There are 18 parameters for the SNP and five for the logspline method. The peak height is significantly lowest for the SNP estimates, which is inferred to be an underestimate, considering the tendency of the SNP method to fit the peaks of heavy-tailed densities too low. The estimates are also illustrated in the middle row of panels in the figure, with a log-scaled vertical axis to emphasize tail fitting performance. The same patterns of difference are seen in the tail shape as were exhibited for the t(3) density estimates in Figure 4, suggesting that the side lobes of the fixed kernel and SNP estimates are likely to be artefacts of the fitting method, rather than real features of the data.
With the log-scaled vertical axis, the bottom row of panels of Figure 9 illustrates the estimates of the same data set after removing the two outliers. Here, the large bumps in the left tail of the SNP and fixed kernel estimates disappear, implying that the bumps are generated by the outliers. This observation is relevant to the ongoing dispute over the existence of side lobes in the density tails of asset price changes. Using the same dataset, Gallant et al. (1991) concluded that side lobes existed in their conditional density estimates using an ARCH-type specification. Similar lobes are reported by Engle and Gonzalez-Rivera (1991) in the distribution of the standardized residuals from a GARCH(1,1) model. They suggested that the detected side lobes reflect residual non-linearity which remained unexplained by the ARCH/GARCH formulation. In contrast, studies by Bollerslev (1987) and Baillie and Bollerslev (1989), analysing data on the British pound for 1980–1985, found little evidence against GARCH(1,1) with t-distributed errors. Bollerslev et al. (1992) suggested that tail oscillations were the result of excessive influence by a few outliers, rather than a feature of the entire sample period. This conclusion seems consistent with the findings reported above.

The modes found in SNP bivariate density estimates are also likely to be artefacts. Based on logarithms of consumption growth and bill returns from 1959 to 1978, Gallant and Tauchen (1989) found modes in the tails of their SNP conditional density estimates with ARCH-type specifications, explaining them as coming either from sampling variation or from significant non-linearities. Gallant et al. (1992) reported the existence of modes in the tails of the SNP density estimates of 16,127 observations of adjusted daily changes in the S&P500 index and NYSE trading volume from 1928 to 1987, whereas Takada (2001) did not detect any modes in the tails of the adaptive kernel density estimates.

In SNP density estimation, the Gaussian leading term in the Hermite series expansion can be replaced with an ARCH-type representation, such as GARCH or EGARCH, and adding non-parametric SNP (Hermite) terms, by increasing the Hermite order K, is expected to explain the non-linear features that the ARCH-type representation cannot account for. In many applications of SNP as an adjunct to EMM estimation, however, the number of Hermite polynomial terms used is limited, kept very small or zero, and several problems in heavy-tailed density estimation have been reported (see, for example, Andersen et al., 1999). If the deviation from ARCH-type leading terms is mainly tail heaviness, our analysis indicates that adding SNP terms will contribute very little to efficiency.
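A rough way to reproduce the outlier experiment is sketched below. The file name is hypothetical, and scipy's gaussian_kde (with its default Scott bandwidth) stands in for the fixed kernel estimator with the Sheather–Jones bandwidth used in the paper, so the sketch illustrates the idea rather than the exact estimates in Figure 9.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical file holding the 2510 adjusted daily log price changes
# (the paper's data are distributed at ftp.econ.duke.edu, pub/arg/data).
returns = np.loadtxt("pound_dollar_7483.txt")

grid = np.linspace(returns.min(), returns.max(), 1025)

kde_all = gaussian_kde(returns)                 # all observations
keep = np.argsort(np.abs(returns))[:-2]         # drop the two most extreme points
kde_trimmed = gaussian_kde(returns[keep])

# Compare the log-densities in the tails, as in the bottom panels of Figure 9
log_all = np.log(kde_all(grid))
log_trimmed = np.log(kde_trimmed(grid))
```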
7. CONCLUSION

It is clear that, for heavy-tailed densities like the Student-t, the rate of convergence of the SNP estimator is remarkably slow, and its computation time is therefore considerably longer than that of the logspline and adaptive kernel methods, or even of conventional kernel methods. Since financial data are often characterized as heavy-tailed, use of the logspline or adaptive kernel method seems the better recommendation for financial applications. Extensions and further exploration of these methods for dependent data are promising lines of future research.
ACKNOWLEDGMENTS

The author is deeply grateful to Roger Koenker and Shoji Takada for many useful comments and for their encouragement to complete the work. The author also thanks the co-editor, two anonymous
referees, Yuichi Kitamura, and the seminar participants at the conference on non-parametric and semi-parametric statistics at the Institute of Statistical Mathematics for comments and suggestions. All errors are mine. Financial support from the Japan Society for the Promotion of Science and the Ishii Memorial Securities Research Promotion Foundation is gratefully acknowledged.
REFERENCES

Abramson, I. S. (1982). On bandwidth variation in kernel estimates—a square root law. Annals of Statistics 10, 1217–23.
Andersen, T. G., H. J. Chung and B. E. Sørensen (1999). Efficient method of moments estimation of a stochastic volatility model: a Monte Carlo study. Journal of Econometrics 91, 61–87.
Baillie, R. T. and T. Bollerslev (1989). The message in daily exchange rates: a conditional variance tale. Journal of Business and Economic Statistics 7, 297–305.
Bollerslev, T. (1987). A conditional heteroskedastic time series model for speculative prices and rates of return. Review of Economics and Statistics 69, 542–7.
Bollerslev, T., R. Y. Chou and K. F. Kroner (1992). ARCH modeling in finance: a review of the theory and empirical evidence. Journal of Econometrics 52, 5–59.
Breiman, L., W. Meisel and E. Purcell (1977). Variable kernel estimates of multivariate densities. Technometrics 19, 135–44.
Coppejans, M. and A. R. Gallant (2002). Cross-validated SNP density estimates. Journal of Econometrics 110, 27–65.
Devroye, L. (1987). A Course in Density Estimation. Boston: Birkhäuser.
Efromovich, S. (1999). Nonparametric Curve Estimation. New York: Springer-Verlag.
Engle, R. F. and G. Gonzalez-Rivera (1991). Semiparametric ARCH models. Journal of Business and Economic Statistics 9, 345–60.
Fadda, D., E. Slezak and A. Bijaoui (1998). Density estimation with non-parametric methods. Astronomy and Astrophysics Supplement Series 127, 335–52.
Fenton, V. M. and A. R. Gallant (1996a). Convergence rates of SNP density estimators. Econometrica 64, 719–27.
Fenton, V. M. and A. R. Gallant (1996b). Qualitative and asymptotic performance of SNP density estimators. Journal of Econometrics 74, 77–118.
Fenton, V. M. and A. R. Gallant (1996c). Erratum. Econometrica 64, 1493.
Gallant, A. R., D. A. Hsieh and G. E. Tauchen (1991). On fitting a recalcitrant series: the pound/dollar exchange rate 1974–1983. In W. A. Barnett, J. Powell and G. E. Tauchen (Eds.), Nonparametric and Semiparametric Methods in Econometrics and Statistics. Cambridge: Cambridge University Press.
Gallant, A. R. and D. W. Nychka (1987). Semi-nonparametric maximum likelihood estimation. Econometrica 55, 363–90.
Gallant, A. R., E. Rossi and G. E. Tauchen (1992). Stock prices and volume. The Review of Financial Studies 5, 199–242.
Gallant, A. R. and G. E. Tauchen (1989). Seminonparametric estimation of conditionally constrained heterogeneous processes: asset pricing applications. Econometrica 57, 1091–120.
Good, I. J. and R. A. Gaskins (1971). Nonparametric roughness penalties for probability densities. Biometrika 58, 255–77.
He, X., J. Jureckova, R. Koenker and S. Portnoy (1990). Estimators and their breakdown points. Econometrica 58, 1195–214.
Huisman, R., K. G. Koedijk, C. J. M. Kool and F. Palm (2001). Tail-index estimates in small samples. Journal of Business & Economic Statistics 19, 208–16.
Hwang, J. N., S. R. Lay and A. Lippman (1994). Nonparametric multivariate density estimation: a comparative study. IEEE Transactions on Signal Processing 42, 2795–810.
Kooperberg, C. and C. J. Stone (1991). A study of logspline density estimation. Computational Statistics & Data Analysis 12, 327–47.
Marron, J. S. and M. P. Wand (1992). Exact mean integrated squared error. Annals of Statistics 20, 712–36.
Pagan, A. and A. Ullah (1999). Nonparametric Econometrics. Cambridge: Cambridge University Press.
Park, B. U. and B. A. Turlach (1992). Practical performance of several data driven bandwidth selectors (with discussion). Computational Statistics 7, 251–85.
Press, W. H., S. A. Teukolsky, W. T. Vetterling and B. P. Flannery (1992). Numerical Recipes in C: The Art of Scientific Computing (2nd ed.). New York: Cambridge University Press.
Scott, D. W. and L. E. Factor (1981). Study of three data-based nonparametric probability density estimators. Journal of the American Statistical Association 76, 9–15.
Sheather, S. J. and M. C. Jones (1991). A reliable data-based bandwidth selection method for kernel density estimation. Journal of the Royal Statistical Society, Series B 53, 683–90.
Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. London: Chapman and Hall.
Stone, C. J., M. Hansen, C. Kooperberg and Y. K. Truong (1997). Polynomial splines and their tensor products in extended linear modeling. Annals of Statistics 25, 1371–470.
Takada, T. (2001). Density estimation for robust financial econometrics. Ph.D. dissertation, Department of Economics, University of Illinois at Urbana-Champaign.
Wand, M. P. and M. C. Jones (1995). Kernel Smoothing. London: Chapman and Hall.
APPENDIX

Proof 3.1: For simplicity, we focus on the tail shape of $f$. Recall from Proposition 2 that $g^{(j)} \in L_2(e^{-x^2/2})$ iff $\int_{-\infty}^{\infty} [g^{(j)}(x)]^2 e^{-x^2/2}\,dx < \infty$, and the finiteness depends on tail behaviour. As $x \to \infty$, a density with algebraic tails has the form $f(x) \sim x^{-(\alpha+1)}$, and the corresponding $g$ is $g(x) \sim x^{-(1+\alpha)/2} e^{x^2/4}$. It follows that $[g^{(j)}(x)]^2 e^{-x^2/2} \sim x^{2j-(1+\alpha)}$. The requirement $\int_{-\infty}^{\infty} [g^{(j)}(x)]^2 e^{-x^2/2}\,dx < \infty$ implies that $j$ must satisfy $2j - (1+\alpha) < -1$. Thus the maximum order of differentiability $k$ is obtained as the largest integer satisfying $k < \alpha/2$.
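A worked example of this bound, added here for illustration: for a Student-t density with 3 degrees of freedom (the t(3) case appearing in the simulations), the tail index is α = 3, so

```latex
\[
  f_{t(3)}(x) \sim x^{-(\alpha+1)} = x^{-4} \quad (x \to \infty),
  \qquad
  k_{\max} = \max\{\, j \in \mathbb{Z} : j < \alpha/2 = 1.5 \,\} = 1 .
\]
```

In other words, for t(3) the first derivative of $g$ is square-integrable against the Gaussian weight, but the second is not.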
Econometrics Journal (2008), volume 11, pp. 593–616. doi: 10.1111/j.1368-423X.2008.00250.x
Estimation of the stochastic conditional duration model via alternative methods

JOHN KNIGHT† AND CATHY Q. NING‡

†Department of Economics, University of Western Ontario, London, Ontario, Canada N6A 5C2
E-mail: [email protected]
‡Department of Economics, Ryerson University, Toronto, Ontario, Canada M5B 2K3
E-mail: [email protected]

First version received: November 2006; final version accepted: April 2008
Summary This paper examines the estimation of the Stochastic Conditional Duration model by the empirical characteristic function and the generalized method of moments when maximum likelihood is unavailable. The joint characteristic function for the durations along with general expressions for the moments are derived, leading naturally to estimation via the empirical characteristic function and generalized method of moments. In a Monte Carlo study as well as an empirical application, these alternative methods are compared with quasi maximum likelihood. These experiments reveal that the empirical characteristic function approach outperforms the quasi maximum likelihood and generalized method of moments in terms of both bias and root mean square error. Keywords: Empirical characteristic function, Irregularly spaced data, Duration, Latent variable model.
1. INTRODUCTION

The timing of transactions carries a wealth of information about the microstructure of the financial market. As a result, it is very important to model the time interval between transactions. This requires the use of high frequency data, as the microstructure features of the data will be lost if we use long intervals in which multiple transactions are averaged. Fortunately, due to the rapid development of electronic trading and computing power, ultra high frequency financial transaction data are now readily available. The analysis of intraday data has also led to the development of new models. This paper contributes to this growing literature by exploring the use of alternative estimation methodologies for a particular dynamic model for transaction durations.

An important feature of high frequency transaction data is that they are irregularly spaced. Traditional econometric models cannot deal with irregularly spaced data; for instance, both the stochastic volatility model and the GARCH model require the data to be evenly spaced, and therefore they may lose important information about returns and volatility. Engle and Russell (1998) proposed the Autoregressive Conditional Duration (ACD) model for the duration
(time interval) between two successive market events (a trade, a certain amount of volume, or a price change), in which the conditional mean of the durations is modeled as a conditionally deterministic function of past information. This is one of the first models to analyse irregularly spaced financial data. By taking into account the irregular spacing of the data, the model allows us to study more closely the price formation process and the way market variables behave within the trading day. An attractive property of the ACD model is that it can capture the clusters of data that are observed in financial durations. Moreover, the duration clustering forecast by the ACD model in transaction time corresponds to ARCH volatility clustering in calendar time.

In the literature, the market microstructure models of Easley and O'Hara (1992) and Diamond and Verrecchia (1987) provide theoretical justifications for developing time series models for durations. In markets where traders have different levels of information about the underlying value of traded assets, trade durations play a key role in leading markets to price discovery. In the model of Easley and O'Hara (1992), a long duration implies that no new information has been released and the underlying price of the asset has not changed. In contrast, in Diamond and Verrecchia (1987), long durations are in fact more likely to appear when informed traders wish to sell the asset but short-selling constraints prevent them from doing so; thus long durations induce a negative re-evaluation of the asset value. Both Easley and O'Hara (1992) and Diamond and Verrecchia (1987) indicate that market participants can learn from observing durations and adjust their investment decisions accordingly. Moreover, Gourieroux et al. (1999) find that trade and volume durations mirror features such as market liquidity and the information arrival rate. These findings imply that durations are not exogenous to the price formation process. Easley and O'Hara (1992) state that econometric models and empirical work that ignore the information content of durations will therefore be biased.

The ACD model of Engle and Russell (1998) provided a key starting point for the analysis of irregularly spaced duration data. Since then, various duration models have been developed, including latent variable models (see Bauwens et al., 2004, for a survey and comparison of such models). Ghysels et al. (2004) introduce the stochastic volatility duration (SVD) model to capture the dynamics of both the mean and the variance of the financial duration process; in their model, the volatility of the duration is assumed to be stochastic and the duration is driven by a mixture of gamma and exponential distributions. Recently, Bauwens and Veredas (2004) proposed the stochastic conditional duration (SCD) model, in which the evolution of the durations is assumed to be driven by a latent factor. The motivation for the use of the latent variable is that it captures general unobservable information flow in the market. While the flow of information is available to the financial agents in the market, this information is unfortunately not quantifiable and thus not available to the econometrician. The SCD model keeps an important characteristic of the ACD model, namely being able to capture the clusters of durations, by using an AR(1) process for the latent variable. Moreover, the SCD model is often preferred to the ACD model for several reasons.
First, the latent variable in the SCD model can capture the unobserved information flow that modifies over time the probability of a quote revision, as well as the inter-quote durations, the trading intensity and volume. Second, the SCD model is a doubly stochastic process, with one source of randomness for the observed duration and another for the latent variable; consequently, the conditional expected duration of the ACD model becomes a random variable in the SCD model, which offers a flexible structure for the dynamics of the duration process. Third, the SCD model can generate a wider range of shapes of hazard functions than the ACD model. Finally, according to Bauwens and Veredas (2004), the SCD model performs better than the ACD model in fitting some features of the duration data, especially in terms of the unconditional densities and the conditional hazard functions.
The contrast between the ACD and the SCD models for durations mimics the contrast between the GARCH and the stochastic volatility (SV) models for stock returns. Although the SCD is a preferred model, the unobserved latent variable makes estimation very challenging, since the likelihood has no closed form and exact maximum likelihood estimation of the model is thus unavailable. Because of the similarity between the SCD and the SV models, many of the estimation approaches for the SV model could be useful in the estimation of the SCD. In the literature, many approaches have been developed for the estimation of SV models, including quasi maximum likelihood (QML), the generalized method of moments (GMM), the empirical characteristic function (ECF), the efficient method of moments (EMM), Markov chain Monte Carlo (MCMC), Monte Carlo maximum likelihood (MCML) and simulated maximum likelihood (SML), to name just a few. Bauwens and Veredas (2004) use QML to estimate the model by applying the Kalman filter after transforming the model into a linear state space system. Strickland et al. (2006) propose an MCMC methodology. For the SCD-leverage model, Feng et al. (2004) adopt the MCML approach proposed by Durbin and Koopman (1997). As QML only approximates the true distribution, it may lose a substantial amount of information and is clearly not efficient; MCMC is more efficient but computationally expensive. Compared with the estimation of SV models, the estimation of the SCD is less developed. Consequently, it is of interest to develop other estimation methods and compare their performance in the estimation of the SCD model.

Since its introduction, the SCD model has attracted the interest of applied researchers. For instance, Davig (2001) applies the model to business cycle durations, while Zernov (2003) uses the model for trade and price durations of currency futures.1 This paper contributes to the literature by considering two alternative methodologies for the estimation of the SCD model, namely the empirical characteristic function (ECF) approach and the generalized method of moments (GMM) approach. These two estimators are compared with QML in both a Monte Carlo study and an empirical application. Our development of alternative estimation methods may encourage a broader application of the SCD model.

Although the closed-form likelihood is unavailable, it is shown in this paper that there is a closed-form joint characteristic function (CF) for the model. Since there is a one-to-one correspondence between the characteristic function and the distribution function, the empirical characteristic function should contain the same amount of information as the likelihood, and inference based on the characteristic function should perform as well as inference based on the likelihood. Therefore, estimation using the characteristic function offers a natural alternative. It has been shown by Knight and Yu (2002), Singleton (2001), Jiang and Knight (2002) and Knight et al. (2002) that the ECF can yield a relatively efficient estimator. The moments of the durations for the SCD model can be readily derived, so GMM is straightforward; moreover, with the use of the joint characteristic function, general moments of the log durations can also be readily found, suggesting an alternative set of GMM moment conditions. As a result, we construct two versions of GMM: GMM1 for the durations and GMM2 for the log durations.
The important issues with the use of GMM concern how many moments to use and which moments to match. These considerations will be discussed in more detail later in the paper. A Monte Carlo simulation study is undertaken to compare the performance of ECF, GMM and QML. The results show that the ECF estimator performs the best in terms of both bias and
1 We thank Luc Bauwens for these references.
root mean square error (RMSE). QML overall performs better than the GMM estimators, but there are some mixed results between them. GMM1 outperforms GMM2. In addition, the ECF estimated autocorrelation functions (ACF) provide the best fit to the true ACF. Finally, the ECF, GMM and QML approaches are compared in an empirical study using Boeing transaction data. We find the usual clustering and overdispersion in the transaction durations. From a comparison of the dispersion ratios, moments and densities estimated from the different approaches and from the true data, we find that the ECF estimator fits the data best, especially in terms of the dispersion ratio and the density.

The paper is organized as follows. Section 2 introduces the SCD model and explains why the model is difficult to estimate. Section 3 presents a discussion of the ECF estimation for the SCD model, while Section 4 considers GMM. Section 5 contains the results of the simulation study comparing the four estimators: ECF, GMM1, GMM2 and QML. Section 6 discusses the empirical application and Section 7 concludes the paper. All proofs are collected in the Appendices.
2. SCD MODEL

The SCD model was proposed recently by Bauwens and Veredas (2004) as a model for sequential durations. It assumes that there exists a stochastic latent variable that generates the durations. The observed duration $d_t$ is modeled as the product of a latent variable $H_t$ and a positive random variable $\varepsilon_t$. If we denote the time of a transaction by $\tau_t$, then $d_t$ is the time difference between the trades at times $\tau_{t-1}$ and $\tau_t$, i.e. $d_t = \tau_t - \tau_{t-1}$. The SCD model is given by
$$
d_t = H_t \cdot \varepsilon_t, \quad \text{where} \quad H_t = \exp(h_t), \qquad h_t = \alpha + \beta h_{t-1} + u_t. \tag{2.1}
$$
$\varepsilon_t | I_{t-1}$ is assumed to be an identically and independently distributed (i.i.d.) random variable with positive support, where $I_{t-1}$ denotes the information set at the end of duration $d_{t-1}$. Bauwens and Veredas (2004) propose Weibull or Gamma distributions for $\varepsilon_t | I_{t-1}$:
$$
\varepsilon_t | I_{t-1} \sim \mathrm{Gamma}(v, 1) \ \text{or} \ \mathrm{Weibull}(\gamma, 1).
$$
The density functions for Gamma$(v, 1)$ and Weibull$(\gamma, 1)$ are given in Appendix A. When $v$ or $\gamma$ equals 1, both collapse to an exponential distribution. We will call the model with the Gamma or Weibull distribution the Gamma SCD or Weibull SCD, respectively; to simplify the notation, we will use G and W for the corresponding models. The error term in the AR(1) process is assumed to be i.i.d. normal,2
$$
u_t | I_{t-1} \sim \text{i.i.d. } N(0, \sigma^2),
$$
and $\varepsilon_t | I_{t-1}$ and $u_t | I_{t-1}$ are assumed to be independent. It is well known that we require $|\beta| < 1$ to guarantee that the process is stationary. High $\beta$ implies strong clustering and persistence of the duration data.

2 The original model developed by Bauwens and Veredas (2004) specified the Normal distribution for the error term $u_t$; however, as pointed out by a referee, this could easily be extended to the Stable $S_a(\delta, b, 0)$ family of distributions. In this case the ECF approach would be the only available technique.
The log-transformed model is
$$
y_t = h_t + \zeta_t, \qquad h_t = \alpha + \beta h_{t-1} + u_t, \tag{2.2}
$$
where $y_t = \ln d_t$ and $\zeta_t = \ln \varepsilon_t$; $\zeta_t$ has the log Gamma LogG$(v, 1)$ or log Weibull LogW$(\gamma, 1)$ distribution. The estimation of the parameters in this type of model with a latent variable is difficult because the likelihood function cannot be derived in closed form; the problems encountered here are similar to those of SV models. Bauwens and Veredas (2004) estimate the log-transformed model (2.2) by approximating the log Gamma (or log Weibull) with a Gaussian distribution and applying the Kalman filter. An advantage of the Kalman filter QML estimation is that it generates, as a by-product, an estimate of the latent variable $h_t$; estimates of $h_t$ are not readily available via other methods.
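To make the data-generating process in (2.1)–(2.2) concrete, the following minimal simulation sketch may help. It is an illustration added here rather than part of the original exposition; the NumPy implementation and the parameter values are assumptions.

```python
import numpy as np

def simulate_gamma_scd(T, alpha=-0.001, beta=0.9, sigma=0.1, v=1.5,
                       burn=500, seed=0):
    """Simulate T durations from the Gamma SCD: d_t = exp(h_t) * eps_t with
    h_t = alpha + beta * h_{t-1} + u_t (equation 2.1); a minimal sketch."""
    rng = np.random.default_rng(seed)
    n = T + burn
    u = rng.normal(0.0, sigma, size=n)            # u_t | I_{t-1} ~ N(0, sigma^2)
    h = np.empty(n)
    h[0] = alpha / (1.0 - beta)                   # start at the unconditional mean
    for t in range(1, n):
        h[t] = alpha + beta * h[t - 1] + u[t]
    eps = rng.gamma(shape=v, scale=1.0, size=n)   # eps_t | I_{t-1} ~ Gamma(v, 1)
    d = np.exp(h) * eps                           # durations, equation (2.1)
    return d[burn:], np.log(d[burn:])             # (d_t, y_t = ln d_t)

durations, log_durations = simulate_gamma_scd(10_000)
```

Replacing the Gamma draw with `rng.weibull(gamma_shape, size=n)` gives the corresponding Weibull SCD analogue.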
3. ECF ESTIMATION

The ECF approach was initiated by Parzen (1962), and has been used to deal with the i.i.d. case, as in Paulson et al. (1975), Heathcote (1977), Feuerverger and Mureika (1977), Feuerverger and McDunnough (1981a, 1981b), Tran (1998) and Carrasco and Florens (2002). The approach has also been developed to deal with dependent stationary stochastic processes via the joint CF in Feuerverger (1990), Knight et al. (2002), Yu (1998), Knight and Yu (2002), Carrasco et al. (2002) and Jiang and Knight (2002).

The idea of the ECF approach is to match the ECF with the CF by minimizing some suitable distance measure. For i.i.d. data, the estimation procedure is often implemented by the discrete ECF, which matches the ECF with the CF over a grid of finitely many points. It is well known that the discrete ECF method depends on the choice of this grid; generally, the number of discrete points should be sufficiently large and the grid sufficiently fine and extended. However, when the grid is too fine, the covariance matrix becomes singular and the ECF estimator cannot be computed. Further discussion can be found in Feuerverger and McDunnough (1981b) and Carrasco and Florens (2002).

For dependent data one needs to match the joint characteristic functions associated with overlapping blocks of the original data. Thus, letting $y_1, y_2, \ldots, y_T$ be our observed log durations as in (2.2), we define the overlapping blocks as $z_j = (y_j, y_{j+1}, \ldots, y_{j+p})'$, with block size $p+1$ and sample size $T$, where $j = 1, \ldots, T-p$. As the $z_j$ are moving blocks, they are dependent. The CF for each block is $c(r, \theta) = E[\exp(i r' z_j)]$, where $r = (r_1, r_2, \ldots, r_{p+1})'$ is a vector of transformation variables. The joint ECF is
$$
c_n(r) = \frac{1}{n}\sum_{j=1}^{n}\exp(i r' z_j), \qquad n = T - p.
$$
For the joint CF, the discrete ECF requires the choice of the vectors (blocks) instead of discrete points. This makes the discrete ECF very difficult and problematic, and there is little guidance in the literature on this. To avoid these problems we use the continuous ECF approach, which
estimates parameters by matching the joint CF with the ECF via the minimization of the following integral:
$$
\min_{\theta}\int\cdots\int \big|c(r, \theta) - c_n(r)\big|^2\, w(r)\, dr_1 \cdots dr_{p+1}. \tag{3.1}
$$
Under standard regularity conditions, Knight and Yu (2002) prove that ECF estimators with a general weighting function are strongly consistent and asymptotically normal. The asymptotic covariance matrix of the estimators and its proof can be found in Knight and Yu (2002).

For the SCD model in (2.2), although a closed form of the likelihood function is not available, there is a closed-form expression for the corresponding characteristic function. Therefore we can estimate the SCD model by the ECF. As noted earlier (see footnote 2), the SCD model can be extended to incorporate Stable errors in the latent variable. While our primary concern is to estimate the original model, we give in Proposition 3.2 the appropriate joint CF for the Stable case and leave the estimation of this extended model for future research. Next the CF for the LogG$(v, 1)$ or LogW$(\gamma, 1)$ distribution is given in Proposition 3.1. Then the joint CF of $y_t, \ldots, y_{t+k-1}$ ($k = p + 1$) is obtained in Proposition 3.2 for the Stable $S_a(\delta, b, 0)$ case, with the result for the Normal case given in Corollary 3.1. Proposition 3.3 gives the autocorrelation function of $\{y_t\}_{t=1}^{T}$ in the Normal case.

PROPOSITION 3.1. Suppose $\zeta_t$ has the LogG$(v, 1)$ or LogW$(\gamma, 1)$ distribution. The characteristic functions of $\zeta_t$, denoted $C^G(r)$ and $C^W(r)$ respectively, are
$$
C^G(r) = \frac{\Gamma(ir + v)}{\Gamma(v)}, \quad \text{if } \zeta_t \sim \mathrm{LogG}(v, 1), \tag{3.2}
$$
$$
C^W(r) = \Gamma\!\left(\frac{ir}{\gamma} + 1\right), \quad \text{if } \zeta_t \sim \mathrm{LogW}(\gamma, 1), \tag{3.3}
$$
where $i$ is the imaginary unit, defined as $\sqrt{-1}$.
PROPOSITION 3.2. If $u_t | I_{t-1} \sim$ i.i.d. $S_a(\delta, b, 0)$, then the joint CF of $y_t, \ldots, y_{t+k-1}$ is
$$
\begin{aligned}
C(r_1, \ldots, r_k, \theta) = \exp\Bigg\{ & \frac{i\alpha}{1-\beta}\sum_{j=1}^{k} r_j
- \frac{\delta^a}{1-|\beta|^a}\Bigg[\sum_{l=1}^{k}\Big|\sum_{j=l}^{k}\beta^{j-l}r_j\Big|^a - |\beta|^a \sum_{l=2}^{k}\Big|\sum_{j=l}^{k}\beta^{j-l}r_j\Big|^a\Bigg] \\
& + \frac{i\delta^a b \tan(\pi a/2)}{1-|\beta|^a \operatorname{sign}(\beta)}\Bigg[\sum_{l=1}^{k}\Big|\sum_{j=l}^{k}\beta^{j-l}r_j\Big|^a \operatorname{sign}\Big(\sum_{j=l}^{k}\beta^{j-l}r_j\Big) \\
& \qquad - |\beta|^a \operatorname{sign}(\beta)\sum_{l=2}^{k}\Big|\sum_{j=l}^{k}\beta^{j-l}r_j\Big|^a \operatorname{sign}\Big(\sum_{j=l}^{k}\beta^{j-l}r_j\Big)\Bigg]\Bigg\}\prod_{j=1}^{k} C(r_j),
\end{aligned}
$$
where $C(r_j) = C^G(r_j)$ or $C^W(r_j)$ as given in Proposition 3.1.

We note that when $a = 2$, $b = 0$ and $\delta = \sigma/\sqrt{2}$, the $S_a(\delta, b, 0)$ distribution reduces to a Normal $N(0, \sigma^2)$ distribution. The joint CF of $y_t, \ldots, y_{t+k-1}$ for this case is given in the following corollary.
COROLLARY 3.1. Specializing the result in Proposition 3.2 by substituting $a = 2$, $b = 0$ and $\delta = \sigma/\sqrt{2}$, the joint CF of $y_t, \ldots, y_{t+k-1}$ when $u_t | I_{t-1} \sim$ i.i.d. $N(0, \sigma^2)$ is
$$
C(r_1, \ldots, r_k, \theta) = \exp\Bigg\{\frac{i\alpha}{1-\beta}\sum_{j=1}^{k} r_j - \frac{\sigma^2}{2(1-\beta^2)}\Bigg[\sum_{j=1}^{k} r_j^2 + 2\beta\sum_{l=1}^{k}\sum_{j=l+1}^{k}\beta^{j-l-1} r_l r_j\Bigg]\Bigg\}\prod_{j=1}^{k} C(r_j), \tag{3.4}
$$
where $C(r_j) = C^G(r_j)$ or $C^W(r_j)$ for the Gamma SCD or Weibull SCD, respectively.

Having obtained the joint CF, we can easily obtain the joint cumulant generating function under normality and hence the autocorrelation function of $\{y_t\}_{t=1}^{T}$. This is given in Proposition 3.3.

PROPOSITION 3.3. The autocorrelation function of $\{y_t\}_{t=1}^{T}$ defined by (2.2) is given by
$$
\rho_k^G = \frac{\dfrac{\sigma^2\beta^k}{1-\beta^2}}{\dfrac{\sigma^2}{1-\beta^2} + \dfrac{\Gamma''(v)}{\Gamma(v)} - \left(\dfrac{\Gamma'(v)}{\Gamma(v)}\right)^{2}}, \tag{3.5}
$$
$$
\rho_k^W = \frac{\dfrac{\sigma^2\beta^k}{1-\beta^2}}{\dfrac{\sigma^2}{1-\beta^2} + \dfrac{1}{\gamma^2}\left[\dfrac{\Gamma''(1)}{\Gamma(1)} - \left(\dfrac{\Gamma'(1)}{\Gamma(1)}\right)^{2}\right]}. \tag{3.6}
$$
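For illustration (a sketch added here, not part of the paper), the ACF in (3.5) can be evaluated directly, since the denominator term Γ″(v)/Γ(v) − (Γ′(v)/Γ(v))² is the trigamma function ψ⁽¹⁾(v). SciPy's polygamma is assumed, and the parameter values are those of simulation set 1 in Section 5.

```python
import numpy as np
from scipy.special import polygamma

def gamma_scd_acf(k, beta=0.9, sigma=0.1, v=1.5):
    """Theoretical ACF of y_t = ln d_t for the Gamma SCD, equation (3.5)."""
    var_h = sigma**2 / (1.0 - beta**2)   # variance of the latent AR(1) process h_t
    var_zeta = polygamma(1, v)           # Var(ln eps_t) = psi'(v) for the log-Gamma
    return var_h * beta**np.asarray(k, dtype=float) / (var_h + var_zeta)

print(gamma_scd_acf(np.arange(1, 6)))    # rho_1, ..., rho_5
```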
In the implementation of the ECF technique, $k$ in the characteristic function is chosen to be 2. Theoretically, larger blocks contain at least as much information, so the resulting estimators should be asymptotically more efficient as $k$ increases. However, increasing $k$ also reduces the effective sample size (i.e. the total number of blocks), and the calculations associated with a large block size are usually more computationally intensive. Therefore, in practice, there is always a trade-off between the block size and the sample size, and thus between asymptotic efficiency and computational efficiency. A small $k$ often works well; for instance, the ECF estimation of an ARMA(1,1) process works well with a block size of 2 in Yu (1998). Further discussion of the ECF estimation method is in Knight and Yu (2002). When the block size is 2, the characteristic function in (3.4) can be written as
$$
C(r_1, r_2, \theta) = \exp\left\{ i\alpha\,\frac{r_1 + r_2}{1-\beta} - \frac{\sigma^2}{2(1-\beta^2)}\left(r_1^2 + r_2^2 + 2\beta r_1 r_2\right)\right\}\cdot C(r_1)\cdot C(r_2), \tag{3.7}
$$
where $C(r_j) = C^G(r_j)$ or $C^W(r_j)$ for the Gamma SCD or Weibull SCD, respectively, and $j = 1$ or $2$. The corresponding ECF is
$$
C_n(r_1, r_2) = \frac{1}{n}\sum_{j=1}^{n}\exp(i r_1 y_j + i r_2 y_{j+1}),
$$
where $n = T - 1$. Let $\mathrm{Re}\,x$ and $\mathrm{Im}\,x$ denote the real and imaginary parts of $x$, respectively; then
$$
\mathrm{Re}\,C_n(r_1, r_2) = \frac{1}{n}\sum_{j=1}^{n}\cos(r_1 y_j + r_2 y_{j+1}), \qquad
\mathrm{Im}\,C_n(r_1, r_2) = \frac{1}{n}\sum_{j=1}^{n}\sin(r_1 y_j + r_2 y_{j+1}). \tag{3.8}
$$

For the reasons discussed previously, and because the continuous ECF method performs better than the discrete ECF method (see Yu, 1998), in this study we use the continuous ECF method, given by (3.1), to estimate the SCD model. The exponential function is chosen as the weight function, thus giving most weight around the origin; this is consistent with the fact that the CF contains most of its information around the origin. Note that the weighting function is in essence the density function of a bivariate normal distribution. Specializing (3.1), the ECF estimation technique minimizes the following objective function with respect to the parameters $\theta$:
$$
\min_{\theta}\int\!\!\int \Big\{\big[\mathrm{Re}\,C(r_1, r_2, \theta) - \mathrm{Re}\,C_n(r_1, r_2)\big]^2 + \big[\mathrm{Im}\,C(r_1, r_2, \theta) - \mathrm{Im}\,C_n(r_1, r_2)\big]^2\Big\}\exp\!\left(-A r_1^2 - A r_2^2\right) dr_1\, dr_2, \tag{3.9}
$$
where A is an arbitrary positive constant. There are two ways to implement the integration: quadratic integration and Monte Carlo integration. The quadratic integration is chosen for convenience, accuracy, and computational efficiency.
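The following sketch spells out one way the block-size-2 objective (3.9) could be evaluated for the Gamma SCD. It is an illustration rather than the authors' code: the rectangular grid, the weight constant A and the use of SciPy's complex-capable loggamma for the log-Gamma CF are assumptions, while the paper itself evaluates the integral by quadratic integration.

```python
import numpy as np
from scipy.special import loggamma

def joint_cf(r1, r2, alpha, beta, sigma, v):
    """Joint CF of (y_t, y_{t+1}) for the Gamma SCD, equation (3.7)."""
    cf_eps = np.exp(loggamma(1j * r1 + v) - loggamma(v)
                    + loggamma(1j * r2 + v) - loggamma(v))
    cf_latent = np.exp(1j * alpha * (r1 + r2) / (1.0 - beta)
                       - sigma**2 / (2.0 * (1.0 - beta**2))
                       * (r1**2 + r2**2 + 2.0 * beta * r1 * r2))
    return cf_latent * cf_eps

def ecf_objective(theta, y, A=1.0, n_nodes=21, r_max=3.0):
    """Grid approximation of the continuous ECF objective (3.9)."""
    alpha, beta, sigma, v = theta
    r = np.linspace(-r_max, r_max, n_nodes)
    R1, R2 = np.meshgrid(r, r)
    # empirical CF of the overlapping pairs (y_j, y_{j+1}), j = 1, ..., T-1
    phase = 1j * (R1[..., None] * y[:-1] + R2[..., None] * y[1:])
    cf_emp = np.exp(phase).mean(axis=-1)
    diff = joint_cf(R1, R2, alpha, beta, sigma, v) - cf_emp
    integrand = np.abs(diff)**2 * np.exp(-A * (R1**2 + R2**2))
    step = r[1] - r[0]
    return integrand.sum() * step**2

# The ECF estimate minimizes this objective over theta, e.g. with
# scipy.optimize.minimize(ecf_objective, theta0, args=(log_durations,)).
```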
4. GMM ESTIMATION

While GMM has not been used to estimate the SCD model, Bauwens and Veredas (2004) did suggest it may be a suitable estimator. It has, however, been used to estimate the stochastic volatility (SV) model by Ruiz (1994), Andersen (1994), Andersen and Sorensen (1996) and Jiang et al. (2005). Ruiz claims that QML performs better than GMM in the estimation of the SV model; it is our intention to examine whether this claim carries over to the SCD model.

The main problems for GMM are which moments to match and how many moments to include in the estimation. Andersen and Sorensen (1996) document that the inclusion of an excessive number of moments results in more pronounced biases and larger RMSE; thus, the use of additional information can be harmful. They find that there is a fundamental trade-off for GMM: the inclusion of more information in the form of additional moment restrictions improves estimation performance for a given degree of precision in the estimate of the weighting matrix, but increasing the number of moments reduces the precision of the estimate of the weighting matrix.

Since moments for both the durations $d_t$ and the log durations $\ln d_t$ are readily calculated, as detailed in the following propositions, we consider GMM using both. We denote the GMM estimator associated with model 1 (defined by the equation system (2.1)) by GMM1, and that associated with model 2 (defined by the equation system (2.2)) by GMM2. The following proposition gives the corresponding moments for the durations $d_t$.

PROPOSITION 4.1. For model 1, the moments of the durations are
$$
E\big[d_t^m d_{t-r}^n\big] = \exp\left\{\frac{\alpha(m+n)}{1-\beta} + \sigma^2\,\frac{m^2 + n^2 + 2mn\beta^r}{2(1-\beta^2)}\right\} f_m\, f_n, \tag{4.1}
$$
where $f_m = \Gamma(v+m)/\Gamma(v) = (v)_m$ and $f_n = \Gamma(v+n)/\Gamma(v) = (v)_n$ for the Gamma SCD, and $f_m = \Gamma(m/\gamma + 1)$, $f_n = \Gamma(n/\gamma + 1)$ for the Weibull SCD.

By changing the values of $m$, $n$ and $r$, we can easily obtain moments of different orders. For the log durations, all the moments can be readily derived via the joint moment generating function $C(-ir_1, -ir_2, \ldots, -ir_k, \theta)$. The results are given in Proposition 4.2.

PROPOSITION 4.2. Let $m_n$ denote the $n$th moment of the log duration. The marginal moments are
$$
m_1 = \Phi + \frac{\alpha}{1-\beta}, \tag{4.2}
$$
$$
m_2 = \Phi^{(1)} + \frac{\sigma^2}{1-\beta^2} + m_1^2, \tag{4.3}
$$
$$
m_3 = \Phi^{(2)} + 3 m_1 m_2 - 2 m_1^3, \tag{4.4}
$$
$$
m_4 = \Phi^{(3)} + 6 m_1^4 - 12 m_1^2 m_2 + 3 m_2^2 + 4 m_1 m_3, \tag{4.5}
$$
where, for the Gamma SCD, $\Phi = \Gamma'(v)/\Gamma(v) = \psi(v)$ and $\Phi^{(n)} = \psi^{(n)}(v)$, the $n$th derivative of the digamma function, while for the Weibull SCD, $\Phi = \frac{1}{\gamma}\,\Gamma'(1)/\Gamma(1) = \frac{1}{\gamma}\psi(1)$ and $\Phi^{(n)} = \frac{1}{\gamma^n}\psi^{(n)}(1)$. The cross moments of the log durations are
$$
E[y_t y_{t+k}] = \left(\Phi + \frac{\alpha}{1-\beta}\right)^{2} + \frac{\sigma^2\beta^k}{1-\beta^2}, \tag{4.6}
$$
$$
E\big[y_t y_{t+k}^2\big] = \left(\Phi + \frac{\alpha}{1-\beta}\right)\left[\Phi^{(1)} + \frac{\sigma^2}{1-\beta^2} + \frac{2\sigma^2\beta^k}{1-\beta^2} + \left(\Phi + \frac{\alpha}{1-\beta}\right)^{2}\right], \tag{4.7}
$$
$$
E\big[y_t^2 y_{t+k}^2\big] = \left(\Phi + \frac{\alpha}{1-\beta}\right)^{4} + \left[\frac{4\sigma^2\beta^k}{1-\beta^2} + 2\left(\Phi^{(1)} + \frac{\sigma^2}{1-\beta^2}\right)\right]\left(\Phi + \frac{\alpha}{1-\beta}\right)^{2} + 2\left(\frac{\sigma^2\beta^k}{1-\beta^2}\right)^{2} + \left(\Phi^{(1)} + \frac{\sigma^2}{1-\beta^2}\right)^{2}. \tag{4.8}
$$
Under standard regularity conditions, GMM estimators are consistent and asymptotically normal (see Hansen, 1982).
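As a sketch of how the GMM1 moment conditions can be built from Proposition 4.1 (this code is illustrative and not from the paper; the particular moment selection mirrors the choice described in Section 5):

```python
import numpy as np
from scipy.special import gammaln

def duration_moment(m, n, r, alpha, beta, sigma, v):
    """E[d_t^m d_{t-r}^n] for the Gamma SCD, equation (4.1), with
    f_p = Gamma(v + p) / Gamma(v)."""
    f = lambda p: np.exp(gammaln(v + p) - gammaln(v))
    latent = np.exp(alpha * (m + n) / (1.0 - beta)
                    + sigma**2 * (m**2 + n**2 + 2.0 * m * n * beta**r)
                    / (2.0 * (1.0 - beta**2)))
    return latent * f(m) * f(n)

def gmm1_conditions(theta, d):
    """Sample minus theoretical moments: E[d_t^m], m = 1,...,4, and
    E[d_t d_{t-r}], r = 1,...,10 (the GMM1 moment set of Section 5)."""
    alpha, beta, sigma, v = theta
    g = [np.mean(d**m) - duration_moment(m, 0, 0, alpha, beta, sigma, v)
         for m in range(1, 5)]
    g += [np.mean(d[r:] * d[:-r]) - duration_moment(1, 1, r, alpha, beta, sigma, v)
          for r in range(1, 11)]
    return np.asarray(g)

# The GMM criterion is then g(theta)' W g(theta); the paper estimates the
# weighting matrix W with a Parzen kernel.
```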
5. SIMULATION STUDY

To compare the performance of the different estimation methodologies, we employ Monte Carlo experiments with a sample size of 10 000 and 100 replications. A sample of 10 000 observations is considered representative of the typically large sample sizes that are
associated with transaction data. 10 500 observations are generated and the first 500 observations are discarded in order to avoid initialisation effects. Parameters are chosen to be consistent with empirical findings in the literature and with our empirical application. Duration data have the property of high persistence, which implies a value of β close to unity; in our study, β is set at either 0.9 or 0.98. α is usually a small negative number and σ is small but different from zero (see the empirical findings in Bauwens and Veredas (2004) and the empirical application in Section 6). The shape parameter v (or γ) is set greater than 1, as found in the empirical applications. We note that a shape parameter greater than one produces a wider range of hazard functions, with a small hump near the origin and then a decreasing pattern, which is consistent with empirical findings. In summary, the parameters are set as follows: set 1: θ = {α, β, σ, v} = {−0.001, 0.9, 0.1, 1.5} and set 2: θ = {α, β, σ, v} = {−0.003, 0.98, 0.05, 1.3}.

For our GMM estimation, we select moments with the following considerations. First, in determining the number of moments, we keep in mind the trade-off found in Andersen and Sorensen (1996): more moments improve estimation performance but cause a deterioration in the estimation of the weighting matrix. Second, since the autocorrelation varies over time, we use cross moments to capture the dependence. Third, the first four moments should be included to capture the mean, variance, skewness and kurtosis. Consequently we choose the first four univariate moments and the first ten cross moments, namely E[d_t^m] and E[d_t d_{t−r}] for the durations and E[y_t^m] and E[y_t y_{t−r}] for the log durations, for m = 1, 2, 3, 4 and r = 1, 2, . . . , 10 (see footnote 3). We also examine the comparison between the theoretical and sample moments for higher orders; we find that the difference between the theoretical and sample moments is smallest for the first two univariate moments and the first ten cross moments, which supports our selection of moments. The Parzen kernel is used for the estimation of the weighting matrix.

The Monte Carlo results are reported in Table 1. In the upper part of Table 1, the three methods are compared for the Gamma SCD model. In seven of eight cases, the ECF estimates have the smallest bias. The ECF estimates also have the lowest RMSE in all cases except for the estimates of the shape parameter. For α, β and σ, the ECF is clearly the best approach, while for v there are some mixed results. QML generally performs better than both GMM1 and GMM2. For GMM1, when β is 0.98, GMM1 converges in only 64 of the 100 replications, and for this reason the results are not presented. GMM2 performs better than GMM1, with lower bias (except for α) and RMSE (except for β), for the convergent Gamma SCD model.

The results for the Weibull SCD are summarized in the lower part of Table 1. For parameter set 1, GMM1 again has convergence problems and the results are not presented. The ECF estimates provide the smallest bias in α, β and σ, and the QML estimates give the lowest bias for the shape parameter γ. The ECF also achieves lower RMSE than all the other methods in the estimation of β and σ, while QML gives slightly smaller RMSE than the ECF and GMM2 for the α and γ estimates. For parameter set 2, the smallest biases are distributed among the three estimation methods: the ECF for γ, QML for β and σ, and the ECF, QML and GMM1 for α.
But if we consider both the bias and the standard deviation, that is, the RMSE, again the ECF performs the best in all cases except for γ, for which QML gives slightly lower RMSE than the ECF. QML estimates generally have lower bias and RMSE than the GMM estimates. For this case, GMM1 performs better than GMM2 based on the RMSE.
3 The inclusion of the third and fourth moments for GMM2 introduces convergence problems; consequently we do not include these two moments in GMM2.
Table 1. Monte Carlo experiments: comparison of bias and RMSE.

Parameter | True value | Bias (ECF) | Bias (QML) | Bias (GMM1) | Bias (GMM2) | RMSE/ECF (QML) | RMSE/ECF (GMM1) | RMSE/ECF (GMM2)

Gamma SCD model
α | −0.001 | 0.0005 | 0.0007 | 0.0008 | 0.0010 | 1.12 | 1.32 | 1.29
β | 0.9 | 0.0007 | 0.0060 | 0.0104 | 0.0044 | 9.49 | 11.10 | 11.52
σ | 0.1 | 0.0020 | 0.0041 | 0.0084 | 0.0029 | 1.77 | 2.39 | 2.201
ν | 1.5 | 0.0040 | 0.0037 | 0.0077 | 0.0064 | 0.96 | 1.23 | 1.05
α | −0.003 | 0.0001 | 0.0002 | – | 0.0020 | 1.61 | – | 5.93
β | 0.98 | 0.0001 | 0.0018 | – | 0.0125 | 4.80 | – | 19.92
σ | 0.05 | 0.0002 | 0.0015 | – | 0.0115 | 1.24 | – | 4.57
ν | 1.3 | 0.0003 | 0.0014 | – | 0.0051 | 0.91 | – | 1.001

Weibull SCD model
α | −0.001 | 0.0000 | 0.0001 | – | 0.0000 | 0.96 | – | 1.01
β | 0.9 | 0.0007 | 0.0009 | – | 0.0262 | 6.15 | – | 9.79
σ | 0.1 | 0.0001 | 0.0002 | – | 0.0159 | 1.17 | – | 1.83
γ | 1.5 | 0.0041 | 0.0031 | – | 0.0045 | 0.85 | – | 0.93
α | −0.003 | 0.0001 | 0.0001 | 0.0001 | 0.0002 | 1.73 | 4.65 | 6.59
β | 0.98 | 0.0010 | 0.0005 | 0.0011 | 0.0009 | 1.87 | 6.02 | 9.11
σ | 0.05 | 0.0019 | 0.0002 | 0.0056 | 0.0074 | 1.04 | 4.03 | 5.84
γ | 1.3 | 0.0017 | 0.0019 | 0.0063 | 0.0042 | 0.94 | 1.04 | 1.02
It is interesting to note that for the Gamma SCD, GMM1 has convergence problems with parameter set 1, while for the Weibull SCD it has convergence problems with parameter set 2. It seems that the convergence of GMM1 is not governed by any one particular parameter value but by the combination of parameter values.

To see the performance of all the methods visually, we compare the autocorrelation functions (ACF) in Figures 1 and 2 for the Gamma SCD model and Figures 3 and 4 for the Weibull SCD model. In Figure 1, for parameter set 1, the ACFs from the ECF and QML look similar; both are close to the true ACF, with QML matching better when the order is smaller than 7 and the ECF fitting better for orders greater than 7. GMM2 fits the ACF almost as well as QML, with GMM2 being slightly closer to the true ACF after order 10. GMM1 gives the worst fit to the true ACF among all methods. In Figure 2, for parameter set 2, the ECF estimated ACF matches the true ACF almost perfectly; the ACF from QML is slightly below the true ACF, while the ACF from GMM2 is below that from QML, further away from the true ACF. Figures 3 and 4 for the Weibull SCD model basically confirm that the ECF fits the ACF the best, while QML gives a better fit of the true ACF than GMM.

To summarize, the ECF estimator has the lowest bias and RMSE and the best fit of the true ACF, and hence performs the best. In terms of bias, RMSE and the fit of the true ACF, the QML estimator performs better than the GMM estimators, but there are some mixed results between them. GMM1 seems to perform better than GMM2 for the Gamma SCD and vice versa for the Weibull SCD.
Figure 1. Gamma SCD: ACF for parameter set 1.
Figure 2. Gamma SCD: ACF for parameter set 2.
Figure 3. Weibull SCD: ACF for parameter set 1.
Figure 4. Weibull SCD: ACF for parameter set 2.
Table 2. Descriptive statistics of durations
Variable | Obs | Mean | S.D. | Min | Max
Raw duration | 90 136 | 10.07 | 11.55 | 1.00 | 209
Adjusted duration | 90 136 | 1.00 | 1.11 | 0.07 | 17.82
6. EMPIRICAL APPLICATION

In order to see the performance of the different estimation methods with real financial data, we apply the model to Boeing data, extracted from the Trades and Quotes (TAQ) database of the New York Stock Exchange (NYSE).

6.1. The Data

The data contain Boeing trades from September 1, 2000 to October 31, 2000. The data set includes the transaction date, time in seconds, transaction price, and volume. There are 119 595 observations in the two months. Since the initial durations of each day are influenced by the opening auction, a model of the trading process would be contaminated by including opening trades. We therefore delete trades that occurred before 9:50 a.m. or after 4:00 p.m. to eliminate irregularities during the opening and closing periods. We calculate durations following the procedure in Engle and Russell (1998). A duration is defined as the time difference between two consecutive trades; the first duration of each day is computed as the average duration of the 10 minutes before 10:00 a.m. There are some null durations. We assume that all such trades are from the same trader, who has split a big block into smaller ones but has sent them to the market at the same time, and we therefore delete these null durations. After this processing, there are 90 136 durations. Durations are rounded to seconds.

The descriptive statistics are in Table 2. The longest duration is 209 seconds and the shortest is 1 second. The mean of the durations is 10 seconds and the standard deviation is 11.5 seconds, showing the overdispersion phenomenon. To examine the dependence, we calculate the autocorrelations (ACF) and partial autocorrelations (PACF) of the raw durations, presented in Table 3. Both the ACF and the PACF are positive and decay very slowly. Therefore, the data show a strong pattern of clustering and an ARMA structure.

6.2. Seasonal Adjustment

We adjust the durations for seasonal effects. Bauwens and Veredas (2004) argue that the durations can be thought of as consisting of two parts: a dynamic stochastic part to be explained by the SCD model, and a deterministic part, namely the seasonal intradaily pattern. The seasonal effect comes from the systematic variation of market activity during each trading day and should be removed in order to study the dynamic part captured by the SCD model. We consider two seasonal effects, namely day-of-week effects and time-of-day effects. We typically observe fewer transactions during the early days of a week, from Monday to Wednesday; from then on trading becomes progressively more active towards the end of the week, especially on Friday. This pattern is reflected in the duration data: durations remain high during the early part of the week, then decrease, and become shortest on Friday. This is called the day-of-week effect.
Table 3. Dynamic properties of the trading durations
Lag | Raw duration AC | Raw duration PAC | Seasonally adjusted AC | Seasonally adjusted PAC
1 | 0.1360 | 0.1360 | 0.1149 | 0.1149
2 | 0.1375 | 0.1212 | 0.1172 | 0.1054
3 | 0.1210 | 0.0911 | 0.0999 | 0.0776
4 | 0.1102 | 0.0725 | 0.0897 | 0.0618
5 | 0.1135 | 0.0719 | 0.0948 | 0.0642
6 | 0.1170 | 0.0711 | 0.0937 | 0.0595
7 | 0.1044 | 0.0528 | 0.0819 | 0.0437
8 | 0.1027 | 0.0489 | 0.0831 | 0.0439
9 | 0.1034 | 0.0481 | 0.0847 | 0.0444
10 | 0.1098 | 0.0532 | 0.0884 | 0.0464
11 | 0.1007 | 0.0403 | 0.0812 | 0.0364
12 | 0.1043 | 0.0428 | 0.0836 | 0.0380
13 | 0.1003 | 0.0372 | 0.0784 | 0.0314
14 | 0.0993 | 0.0351 | 0.0771 | 0.0294
15 | 0.0970 | 0.0315 | 0.0754 | 0.0271
During a day, more trades happen during the early morning; trading then decreases around noon and increases toward the close of the market. This is known as the time-of-day effect.

To remove the day-of-week effect, we calculate the average sample duration for each weekday, denoted by F_w, with w = 1, 2, 3, 4, 5. The duration after removing the day-of-week effect is given as d_w = d_t / F_w. To eliminate the time-of-day effect, we first choose 13 knots over each trading day, with the first at 10:00 a.m., the last at 4:00 p.m., and the remaining knots every 30 minutes apart. We then compute the duration at each knot by averaging the durations within 15 minutes either side of the knot. Finally, we take the average of the duration at each knot over all trading days in the two months, which is regarded as the daily seasonal factor, denoted by F_d. The adjusted duration data are then calculated as d_i = d_w / F_d.

The descriptive statistics of the adjusted duration are in Table 2. The mean of the adjusted duration is 1 and the standard deviation is greater than 1, again showing the characteristic of overdispersion. As shown by the autocorrelation (AC) and partial autocorrelation (PAC) coefficients in Table 3, the seasonally adjusted duration process remains highly persistent which, again, provides evidence of an ARMA type of structure in the data-generating process. We use this seasonally adjusted duration data to estimate the model parameters in the next section.

6.3. Empirical Estimation Results

We estimate the model by the ECF, GMM1, GMM2 and QML methods, respectively, and present the results in Table 4. We discuss the results first and then compare the goodness of fit.

6.3.1. Discussion of Estimation Results. All estimates of β are close to one, showing the high persistence of the duration process, but they are all less than one, ensuring the stationarity of the process.
Table 4. Empirical estimates of ECF, QML and GMM
Model type | Parameter | ECF | QML | GMM1 | GMM2
Gamma SCD | α | −0.0042 (0.00012) | −0.0004 (0.00014) | −0.0021 (0.00034) | −0.0028 (0.00035)
Gamma SCD | β | 0.9902 (0.00068) | 0.9935 (0.00070) | 0.9698 (0.00456) | 0.9611 (0.00364)
Gamma SCD | σ | 0.0587 (0.00292) | 0.0388 (0.00197) | 0.0885 (0.00756) | 0.1058 (0.00575)
Gamma SCD | ν | 1.3701 (0.00508) | 1.4660 (0.00545) | 1.0467 (0.01530) | 1.5043 (0.00623)
Gamma SCD | σ̂_d / μ̂_d | 1.03 | 0.94 | 0.98 | 0.96
Weibull SCD | α | −0.0046 (0.00093) | −0.0041 (0.00083) | −0.0033 (0.01420) | −0.0080 (0.00152)
Weibull SCD | β | 0.9751 (0.00073) | 0.9935 (0.00070) | 0.9628 (0.00549) | 0.9611 (0.00364)
Weibull SCD | σ | 0.1280 (0.00151) | 0.0385 (0.00198) | 0.1018 (0.00853) | 0.1058 (0.00575)
Weibull SCD | γ | 1.3259 (0.00343) | 1.3064 (0.00324) | 1.0537 (0.00582) | 1.3290 (0.00366)
Weibull SCD | σ̂_d / μ̂_d | 1.31 | 0.89 | 1.09 | 0.91
Note: Numbers in brackets are the asymptotic standard errors.
The estimates of the parameter v of the Gamma distribution and γ of the Weibull distribution are all greater than one, implying misspecification of an exponential distribution. The estimates of σ are all significantly different from zero. Therefore modelling the conditional mean of the duration as a deterministic process of past information is rejected, and it is necessary to model it as a stochastic process. This favors the SCD model over the ACD model.

Following Bauwens and Veredas (2004), the dispersion ratio is defined as $\sigma_d/\mu_d$, the ratio of the standard deviation to the mean of the duration, which equals
$$
\left[\left(1 + \frac{1}{v}\right)\exp\!\left(\frac{\sigma^2}{1-\beta^2}\right) - 1\right]^{1/2}
$$
for the Gamma SCD and
$$
\left[\frac{\Gamma(1 + 2/\gamma)}{\big(\Gamma(1 + 1/\gamma)\big)^{2}}\exp\!\left(\frac{\sigma^2}{1-\beta^2}\right) - 1\right]^{1/2}
$$
for the Weibull SCD. By substituting the estimated parameters into these formulas, we obtain the estimated dispersion ratios presented in Table 4. The dispersion ratio of the adjusted duration data is around 1.11. The estimated ratio from the ECF is about 1.03 for the Gamma SCD and 1.31 for the Weibull SCD, reflecting the overdispersion of the durations; it seems the ECF approach is preferred for matching the data dispersion ratio. The other estimated ratios are all less than 1, too small (except for the case of GMM1 for the Weibull SCD) compared to the data ratio.

Overall, our results are comparable to the trade-duration results of Bauwens and Veredas (2004). Their estimate of α is negative, their estimate of β is close to one but less than one, and their estimates of the shape parameters v and γ are all greater than one. All these results reflect typical properties of trade durations: high clustering, persistence and overdispersion.

6.3.2. Comparison of Goodness of Fit. To compare the goodness of fit, we first compare the moments computed from the estimated parameters with the moments from the observed durations.
Table 5. Relative difference between the estimated moments and the observed moments
Moments | Gamma ECF | Gamma QML | Gamma GMM1 | Gamma GMM2 | Weibull ECF | Weibull QML | Weibull GMM1 | Weibull GMM2
E(d_t) | −0.0278 | −0.0674 | −0.0004 | −0.0629 | −0.0837 | −0.0834 | −0.0068 | −0.0791
E(d_t^2) | −0.1261 | −0.2637 | −0.0013 | −0.2412 | −0.1705 | −0.3258 | −0.0312 | −0.3057
E(d_t^3) | −0.2236 | −0.4651 | −0.0123 | −0.4195 | −0.1884 | −0.5667 | −0.0715 | −0.5309
E(d_t^4) | −0.2807 | −0.6209 | −0.0021 | −0.5542 | −0.0817 | −0.7402 | −0.0934 | −0.6959
E(d_t d_{t−1}) | −0.0144 | −0.1451 | −0.0057 | −0.1143 | 0.0177 | −0.1751 | −0.0094 | −0.1446
E(d_t d_{t−2}) | −0.0185 | −0.1479 | −0.0120 | −0.1213 | 0.0071 | −0.1777 | −0.0168 | −0.1514
E(d_t d_{t−3}) | −0.0016 | −0.1323 | 0.0030 | −0.1094 | 0.0181 | −0.1627 | −0.0030 | −0.1398
E(d_t d_{t−4}) | 0.0080 | −0.1232 | 0.0106 | −0.1039 | 0.0217 | −0.1539 | 0.0035 | −0.1345
E(d_t d_{t−5}) | 0.0007 | −0.1288 | 0.0015 | −0.1132 | 0.0084 | −0.1593 | −0.0067 | −0.1436
E(d_t d_{t−6}) | 0.0002 | −0.1283 | −0.0007 | −0.1163 | 0.0023 | −0.1589 | −0.0098 | −0.1465
E(d_t d_{t−7}) | 0.0118 | −0.1175 | 0.0091 | −0.1086 | 0.0083 | −0.1484 | −0.0009 | −0.1391
E(d_t d_{t−8}) | 0.0087 | −0.1194 | 0.0044 | −0.1138 | −0.0001 | −0.1502 | −0.0064 | −0.1441
E(d_t d_{t−9}) | 0.0053 | −0.1215 | −0.0004 | −0.1190 | −0.0087 | −0.1523 | −0.0120 | −0.1491
E(d_t d_{t−10}) | −0.0004 | −0.1258 | −0.0075 | −0.1261 | −0.0192 | −0.1563 | −0.0197 | −0.1560
The comparison is given in Table 5. The moments compared are the first four marginal moments and the first ten cross moments. The numbers in the table are the relative differences between the observed and the estimated moments, computed as (estimated moment − observed moment)/observed moment. The estimated moments are generally close to the data moments, with the relative moment differences being less than 1 in absolute value for the ECF and GMM1 estimations in most cases. However, as the order of the moments increases, the estimated moments are less close to their observed counterparts, as can generally be expected. If we compare the moments from the different estimation methods, the GMM1 estimated moments are closest to the data moments in most cases, as this method is designed to match duration moments. Still, in eleven of twenty-eight cases the ECF estimated moments match the data moments best, which is also intuitive since the ECF is supposed to match all the moments theoretically. Next closest are the GMM2 estimated moments, as this method is designed to match log-duration moments. The furthest are those from the QML estimation, which is consistent with the method using only an approximation of the distribution.

We then simulate four sequences using the ECF, GMM and QML estimated parameters, respectively. Using kernel density estimation, we plot the density of the data and the densities of the four simulated data sets. The density graphs are shown in Figures 5 and 6. Basically, the estimated density fits the data density better where the duration is larger. In Figure 5, for the Gamma SCD model, the ECF estimates apparently give a better fit than either the GMM2 or QML estimates; the ECF performs similarly to or slightly better than GMM1 in matching the true density, and the density from the QML estimates is close to that from the GMM2 estimates. In Figure 6, for the Weibull SCD model, the ECF estimated density fits the data density best, with the GMM1 estimated density a slightly worse fit than that from the ECF. The QML and GMM2 estimated density curves are further away from the data density than those from the other two methods,
Figure 5. Gamma SCD: density comparison.
Figure 6. Weibull SCD: density comparison.
indicating a worse fit of QML and GMM2 than of ECF and GMM1. This result is very intuitive, as theoretically the ECF method matches essentially all the moments and hence the density.

In summary, by comparing dispersion ratios, moments and densities, the ECF estimator performs better than either the GMM or the QML estimators. For matching the dispersion ratios of
the data and the moments of the data, the performance of both GMM1 and GMM2 is better than that of QML. For matching the density, GMM1 performs better than QML.
7. CONCLUSIONS

In this paper the ECF and GMM methods are developed for the estimation of the SCD model. As there is no closed-form likelihood for the SCD model, the maximum likelihood method cannot be implemented. However, it is shown in the paper that analytical forms of the characteristic function and the moments can be derived; therefore the ECF and GMM are viable alternative approaches, and the ECF is theoretically comparable to maximum likelihood. The ECF and GMM are compared with QML using Monte Carlo experiments. It is found that the ECF approach outperforms GMM and QML in terms of both bias and efficiency. This finding is consistent with the results found in the estimation of SV models (Yu, 1998). The empirical application also shows that the ECF estimates fit the data best. The results in this paper provide further evidence that the ECF approach is a suitable alternative when maximum likelihood is unavailable.
ACKNOWLEDGEMENTS

We wish to thank the editor Siem Jan Koopman, two anonymous referees, Christian Gourieroux, John Galbraith, George Jiang, Adrian Pagan, Richard Roll, and Stephen Sapp for their helpful comments and suggestions. Thanks are also due to participants in workshops and the 2004 Alumni Conference at the University of Western Ontario, the NFA 2004 Conference and the CESG 2004 Conference.
REFERENCES

Andersen, T. G. (1994). Stochastic autoregressive volatility: a framework for volatility modeling. Mathematical Finance 4, 75–102.
Andersen, T. G. and B. E. Sorensen (1996). GMM estimation of a stochastic volatility model: a Monte Carlo study. Journal of Business and Economic Statistics 14, 328–52.
Bauwens, L., P. Giot, J. Grammig and D. Veredas (2004). A comparison of financial duration models via density forecasts. International Journal of Forecasting 20, 589–604.
Bauwens, L. and D. Veredas (2004). The stochastic conditional duration model: a latent variable model for the analysis of financial durations. Journal of Econometrics 119, 381–482.
Carrasco, M. and J. Florens (2002). Efficient GMM estimation using the empirical characteristic function. Working paper, Department of Economics, University of Rochester.
Carrasco, M., M. Chernov, J. Florens and E. Ghysels (2002). Efficient estimation of jump diffusions and general dynamic models with a continuum of moment conditions. Working paper, Department of Economics, University of Rochester.
Davig, T. (2001). A structural approach to modeling business cycle durations. Working paper, Department of Economics, Indiana University.
Diamond, D. W. and R. E. Verrecchia (1987). Constraints on short-selling and asset price adjustments to private information. Journal of Financial Economics 18, 277–311.
Durbin, J. and S. J. Koopman (1997). Monte Carlo maximum likelihood estimation for non-Gaussian state space models. Biometrika 84, 669–84.
Easley, D. and M. O'Hara (1992). Time and the process of security price adjustment. The Journal of Finance 47, 577–605.
Engle, R. F. and J. R. Russell (1998). Autoregressive conditional duration: a new approach for irregularly spaced transaction data. Econometrica 66, 1127–62.
Feng, D., G. J. Jiang and P. X. K. Song (2004). Stochastic conditional duration models with "leverage effect" for financial transaction data. Journal of Financial Econometrics 2, 390–421.
Feuerverger, A. (1990). An efficiency result for the empirical characteristic function in stationary time-series models. The Canadian Journal of Statistics 18, 155–61.
Feuerverger, A. and P. McDunnough (1981a). On some Fourier methods for inference. Journal of the American Statistical Association 76, 379–87.
Feuerverger, A. and P. McDunnough (1981b). On the efficiency of empirical characteristic function procedures. Journal of the Royal Statistical Society, Series B 43, 20–27.
Feuerverger, A. and R. A. Mureika (1977). The empirical characteristic function and its applications. The Annals of Statistics 5, 88–97.
Ghysels, E., C. Gourieroux and J. Jasiak (2004). Stochastic volatility duration models. Journal of Econometrics 119, 413–33.
Gourieroux, C., J. Jasiak and G. Le Fol (1999). Intraday market activity. Journal of Financial Markets 2, 193–226.
Hansen, L. P. (1982). Large sample properties of generalized method of moments estimators. Econometrica 50, 1029–54.
Heathcote, C. R. (1977). The integrated squared error estimation of parameters. Biometrika 64, 255–64.
Jiang, G. J. and J. L. Knight (2002). Estimation of continuous time processes via the empirical characteristic function. Journal of Business and Economic Statistics 20, 198–212.
Jiang, G. J., J. L. Knight and G. Q. Wang (2005). Alternative specifications of stochastic volatility asset return models: theoretical and empirical comparisons. Working paper, University of Arizona.
Knight, J. L. and J. Yu (2002). The empirical characteristic function in time series estimation. Econometric Theory 18, 691–721.
Knight, J. L., S. E. Satchell and J. Yu (2002). Estimation of the stochastic volatility model by the empirical characteristic function method. Australian and New Zealand Journal of Statistics 44, 319–35.
Parzen, E. (1962). On estimation of a probability density function and mode. The Annals of Mathematical Statistics 33, 1065–76.
Paulson, A. S., E. W. Holcomb and R. A. Leitch (1975). The estimation of the parameters of the stable laws. Biometrika 62, 163–170.
Ruiz, E. (1994). Quasi-maximum likelihood estimation of stochastic volatility models. Journal of Econometrics 63, 289–306.
Singleton, K. J. (2001). Estimation of affine asset pricing models using the empirical characteristic function. Journal of Econometrics 102, 111–41.
Strickland, C. M., C. S. Forbes and G. M. Martin (2006). Bayesian analysis of the stochastic conditional duration model. Computational Statistics and Data Analysis 50, 2247–67.
Tran, K. C. (1998). Estimating mixtures of normal distributions via empirical characteristic function. Econometric Reviews 17, 167–83.
Yu, J. (1998). Empirical characteristic function in time series estimation. Ph.D. dissertation, The University of Western Ontario.
C The Author(s). Journal compilation C Royal Economic Society 2008.
Estimation of SCD model via alternative methods
613
Zernov, S. (2003). Dynamics of trade and price durations of currency futures. Working paper, Department of Economics, Mcgill University.
APPENDIX A: THE CHARACTERISTIC FUNCTION Proof of Proposition 3.1:
The original SCD model is as follows:
\[
d_t = \exp(h_t)\,\varepsilon_t, \qquad \varepsilon_t \mid I_{t-1} \sim \mathrm{Gamma}(v,1)\ \text{or}\ \mathrm{Weibull}(\gamma,1),
\]
\[
h_t = \alpha + \beta h_{t-1} + u_t, \qquad u_t \mid I_{t-1} \sim \text{i.i.d. } N(0,\sigma^2).
\]
Transforming the model by taking the logarithm of both sides, we obtain $y_t = \ln(d_t) = h_t + \ln\varepsilon_t$. Given that the pdfs of the Gamma and Weibull distributions are, respectively,
\[
\frac{1}{\Gamma(v)}\,\varepsilon_t^{v-1}\exp(-\varepsilon_t) \qquad \text{and} \qquad \gamma x^{\gamma-1}\exp(-x^{\gamma}),
\]
the characteristic function of $\ln\varepsilon_t$ is
\[
C(r) = E\big[(\exp(\ln\varepsilon_t))^{ir}\big] = E\big[\varepsilon_t^{ir}\big]
= \begin{cases}
\dfrac{1}{\Gamma(v)}\displaystyle\int_0^{\infty}\varepsilon_t^{ir+v-1}\exp(-\varepsilon_t)\,d\varepsilon_t = \dfrac{\Gamma(ir+v)}{\Gamma(v)} & \text{if } \varepsilon_t\mid I_{t-1}\sim \mathrm{Gamma}(v,1),\\[2ex]
\displaystyle\int_0^{\infty}\varepsilon_t^{ir}\,\gamma\varepsilon_t^{\gamma-1}\exp(-\varepsilon_t^{\gamma})\,d\varepsilon_t = \Gamma\!\left(\dfrac{ir}{\gamma}+1\right) & \text{if } \varepsilon_t\mid I_{t-1}\sim \mathrm{Weibull}(\gamma,1).
\end{cases}
\]
APPENDIX B: THE JOINT CHARACTERISTIC FUNCTION OF DURATIONS Proof of Proposition 3.2: If $u_t\mid I_{t-1}\sim$ i.i.d. $S_a(\delta,b,0)$, then the CF of $u_t$ is:
\[
C^{u_t}(r) = \begin{cases}
\exp\{-|\delta r|^a[1 - ib\,\mathrm{sign}(r)\tan(\pi a/2)]\} & \text{if } a \neq 1,\\[1ex]
\exp\{-|\delta r|^a[1 - ib\frac{2}{\pi}\,\mathrm{sign}(r)\ln|r|]\} & \text{if } a = 1.
\end{cases}
\]
Given that $h_t$ is an AR(1) process, we can rewrite the latent variable $h_t$ as
\[
h_t = \frac{\alpha}{1-\beta} + \sum_{k=0}^{\infty}\beta^k u_{t-k}.
\]
Thus the CF of $h_t$ is
\begin{align*}
C^{h_t}(r) = E[\exp(irh_t)] &= \exp\!\left(\frac{i\alpha r}{1-\beta}\right)\prod_{k=0}^{\infty} E\big[\exp(ir\beta^k u_{t-k})\big]
= \exp\!\left(\frac{i\alpha r}{1-\beta}\right)\prod_{k=0}^{\infty} C^{u_t}(r\beta^k)\\
&= \exp\!\left(\frac{i\alpha r}{1-\beta}\right)\exp\!\left(-\delta^a|r|^a\sum_{k=0}^{\infty}|\beta|^{ak}\right)\exp\!\left(\sum_{k=0}^{\infty} i\delta^a b|r|^a|\beta|^{ak}\,\mathrm{sign}(r\beta^k)\tan(\pi a/2)\right)\\
&= \exp\!\left(\frac{i\alpha r}{1-\beta} - \frac{\delta^a|r|^a}{1-|\beta|^a} + \frac{i\delta^a b|r|^a\,\mathrm{sign}(r)\tan(\pi a/2)}{1-|\beta|^a\,\mathrm{sign}(\beta)}\right).
\end{align*}
Then the joint CF of $y_t, y_{t+1},\ldots,y_{t+k-1}$ is
\begin{align*}
C(r_1,r_2,\ldots,r_k,\theta) &= E[\exp(ir_1y_t + ir_2y_{t+1}+\cdots+ir_ky_{t+k-1})]\\
&= E[\exp(ir_1h_t + ir_1\ln\varepsilon_t + ir_2h_{t+1} + ir_2\ln\varepsilon_{t+1}+\cdots+ir_kh_{t+k-1}+ir_k\ln\varepsilon_{t+k-1})]\\
&= E[\exp(ir_1h_t + ir_2h_{t+1}+\cdots+ir_kh_{t+k-1})]\prod_{j=1}^{k} E[\exp(ir_j\ln\varepsilon_{t+j-1})]\\
&= \exp\!\left(i\alpha\sum_{j=2}^{k} r_j\frac{1-\beta^{j-1}}{1-\beta}\right)\times E\!\left[\exp\!\left(ih_t\sum_{j=1}^{k}\beta^{j-1}r_j\right)\right]\times\prod_{l=2}^{k} E\!\left[\exp\!\left(iu_{t+l-1}\sum_{j=l}^{k}\beta^{j-l}r_j\right)\right]\times\prod_{j=1}^{k} E\big[\exp(ir_j\ln\varepsilon_{t+j-1})\big]\\
&= \exp\!\left(i\alpha\sum_{j=2}^{k} r_j\frac{1-\beta^{j-1}}{1-\beta}\right)\times C^{h_t}\!\left(\sum_{j=1}^{k}\beta^{j-1}r_j\right)\times\prod_{l=2}^{k} C^{u_{t+l-1}}\!\left(\sum_{j=l}^{k}\beta^{j-l}r_j\right)\times\prod_{j=1}^{k} C^{\ln\varepsilon_{t+j-1}}(r_j)\\
&= \exp\!\left(i\alpha\sum_{j=2}^{k} r_j\frac{1-\beta^{j-1}}{1-\beta}\right)\times\exp\!\left(\frac{i\alpha\sum_{j=1}^{k}\beta^{j-1}r_j}{1-\beta} - \frac{\delta^a\big|\sum_{j=1}^{k}\beta^{j-1}r_j\big|^a}{1-|\beta|^a} + \frac{i\delta^a b\big|\sum_{j=1}^{k}\beta^{j-1}r_j\big|^a\mathrm{sign}\big(\sum_{j=1}^{k}\beta^{j-1}r_j\big)\tan(\pi a/2)}{1-|\beta|^a\,\mathrm{sign}(\beta)}\right)\\
&\quad\times\prod_{l=2}^{k}\exp\!\left\{-\Big|\delta\sum_{j=l}^{k}\beta^{j-l}r_j\Big|^a\Big[1-ib\,\mathrm{sign}\Big(\sum_{j=l}^{k}\beta^{j-l}r_j\Big)\tan(\pi a/2)\Big]\right\}\times\prod_{j=1}^{k} C(r_j)\\
&= \exp\Bigg\{\frac{i\alpha}{1-\beta}\sum_{j=1}^{k} r_j - \frac{\delta^a}{1-|\beta|^a}\Bigg[\sum_{l=1}^{k}\Big|\sum_{j=l}^{k}\beta^{j-l}r_j\Big|^a - |\beta|^a\sum_{l=2}^{k}\Big|\sum_{j=l}^{k}\beta^{j-l}r_j\Big|^a\Bigg]\\
&\qquad\quad + \frac{i\delta^a b\tan(\pi a/2)}{1-|\beta|^a\,\mathrm{sign}(\beta)}\Bigg[\sum_{l=1}^{k}\Big|\sum_{j=l}^{k}\beta^{j-l}r_j\Big|^a\mathrm{sign}\Big(\sum_{j=l}^{k}\beta^{j-l}r_j\Big) - |\beta|^a\,\mathrm{sign}(\beta)\sum_{l=2}^{k}\Big|\sum_{j=l}^{k}\beta^{j-l}r_j\Big|^a\mathrm{sign}\Big(\sum_{j=l}^{k}\beta^{j-l}r_j\Big)\Bigg]\Bigg\}\prod_{j=1}^{k} C(r_j),
\end{align*}
where $C(r_j)$ is the characteristic function of the log Gamma or log Weibull distribution given in Appendix A. When $a=2$, $b=0$ and $\delta=\sigma/\sqrt{2}$, the $S_a(\delta,b,0)$ distribution reduces to a Normal $N(0,\sigma^2)$ distribution. Substituting $a=2$, $b=0$ and $\delta=\sigma/\sqrt{2}$ into the above equation, we obtain the joint CF of $y_t,\ldots,y_{t+k-1}$ when $u_t\mid I_{t-1}\sim$ i.i.d. $N(0,\sigma^2)$, as given in Corollary 3.1:
\[
C(r_1,r_2,\ldots,r_k,\theta) = \exp\!\left\{\frac{i\alpha}{1-\beta}\sum_{j=1}^{k} r_j - \frac{\sigma^2}{2(1-\beta^2)}\Bigg[\sum_{j=1}^{k} r_j^2 + 2\beta\sum_{l=1}^{k}\sum_{j=l+1}^{k}\beta^{j-l-1}r_lr_j\Bigg]\right\}\times\prod_{j=1}^{k} C(r_j),
\]
where $C(r_j) = C^G(r_j)$ or $C^W(r_j)$ for the Gamma SCD or Weibull SCD model, respectively.
APPENDIX C. AUTOCORRELATION FUNCTIONS Proof of Proposition 3.3: By definition, the cumulant generating function for $y_t, y_{t+1},\ldots,y_{t+k-1}$ is:
\[
\phi(r_1,r_2,\ldots,r_k,\theta) = \log[C(-ir_1,-ir_2,\ldots,-ir_k,\theta)]
= \frac{\alpha}{1-\beta}\sum_{j=1}^{k} r_j + \frac{\sigma^2}{2(1-\beta^2)}\Bigg[\sum_{j=1}^{k} r_j^2 + 2\beta\sum_{l=1}^{k}\sum_{j=l+1}^{k}\beta^{j-l-1}r_lr_j\Bigg] + \sum_{j=1}^{k}\log C(-ir_j).
\]
Thus the autocorrelation functions of $\{y_t\}_{t=1}^{T}$ are as follows:
\[
\rho_k = \frac{\mathrm{Cov}(y_t,y_{t+k})}{\mathrm{Var}(y_t)}
= \frac{\dfrac{\partial^2\phi(r_1,r_2,\ldots,r_k,\theta)}{\partial r_1\partial r_k}\bigg|_{r_1=r_2=\cdots=r_k=0}}{\dfrac{\partial^2\phi(r_1,r_2,\ldots,r_k,\theta)}{\partial r_1^2}\bigg|_{r_1=r_2=\cdots=r_k=0}}
= \begin{cases}
\dfrac{\sigma^2\beta^k/(1-\beta^2)}{\sigma^2/(1-\beta^2) + \dfrac{\Gamma''(v)}{\Gamma(v)} - \Big[\dfrac{\Gamma'(v)}{\Gamma(v)}\Big]^2} & \text{if } \varepsilon_t\mid I_{t-1}\sim\mathrm{Gamma}(v,1),\\[3ex]
\dfrac{\sigma^2\beta^k/(1-\beta^2)}{\sigma^2/(1-\beta^2) + \dfrac{1}{\gamma^2}\big(\Gamma''(1) - [\Gamma'(1)]^2\big)} & \text{if } \varepsilon_t\mid I_{t-1}\sim\mathrm{Weibull}(\gamma,1),
\end{cases}
\]
where $k = 1, 2, \ldots$.
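As a numerical check of this closed form (an illustration added here, not part of the original proof), one can simulate the Gamma SCD model and compare the sample autocorrelations of $y_t = h_t + \ln\varepsilon_t$ with the expression above, using the identity $\Gamma''(v)/\Gamma(v) - [\Gamma'(v)/\Gamma(v)]^2 = \psi'(v)$, the trigamma function. The parameter values in the sketch below are arbitrary choices, not values from the paper.

```python
import numpy as np
from scipy.special import polygamma

# Arbitrary illustrative parameter values (not taken from the paper)
alpha, beta, sigma, v = 0.1, 0.9, 0.3, 1.5
T, K = 200_000, 5
rng = np.random.default_rng(0)

# Simulate the latent AR(1) log-scale process and the Gamma SCD durations
h = np.empty(T)
h[0] = alpha / (1 - beta)
u = rng.normal(0.0, sigma, T)
for t in range(1, T):
    h[t] = alpha + beta * h[t - 1] + u[t]
eps = rng.gamma(shape=v, scale=1.0, size=T)
y = h + np.log(eps)              # y_t = ln d_t

# Sample autocorrelations of y_t
y_c = y - y.mean()
denom = np.dot(y_c, y_c)
acf_hat = [np.dot(y_c[k:], y_c[:-k]) / denom for k in range(1, K + 1)]

# Theoretical values: rho_k = (sigma^2 beta^k/(1-beta^2)) / (sigma^2/(1-beta^2) + psi'(v))
var_h = sigma**2 / (1 - beta**2)
rho = [var_h * beta**k / (var_h + polygamma(1, v)) for k in range(1, K + 1)]

for k, (a, b) in enumerate(zip(acf_hat, rho), start=1):
    print(f"lag {k}: sample {a:.4f}  theory {b:.4f}")
```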
APPENDIX D: MOMENTS OF DURATIONS Proof of Proposition 4.1: From the SCD model, we can derive that $h_t\sim N\!\left(\frac{\alpha}{1-\beta},\frac{\sigma^2}{1-\beta^2}\right)$ and
\[
h_t = \frac{\alpha(1-\beta^r)}{1-\beta} + \beta^r h_{t-r} + \sum_{j=0}^{r-1}\beta^j u_{t-j}.
\]
Substituting these results into the moments for durations, we have
\begin{align*}
E\big[d_t^m d_{t-r}^n\big] &= E\big[e^{mh_t+nh_{t-r}}\varepsilon_t^m\varepsilon_{t-r}^n\big]\\
&= E\Bigg[\exp\Bigg(\frac{m\alpha(1-\beta^r)}{1-\beta} + m\sum_{j=0}^{r-1}\beta^ju_{t-j} + (m\beta^r+n)h_{t-r}\Bigg)\Bigg]\cdot E\big[\varepsilon_t^m\varepsilon_{t-r}^n\big]\\
&= \exp\!\left(\frac{m\alpha(1-\beta^r)}{1-\beta}\right)\cdot\prod_{j=0}^{r-1}E\big[\exp(m\beta^ju_{t-j})\big]\cdot E\big[\exp((m\beta^r+n)h_{t-r})\big]\cdot E\big[\varepsilon_t^m\big]E\big[\varepsilon_{t-r}^n\big],
\end{align*}
where $E[\exp(m\beta^ju_{t-j})]$ and $E[\exp((m\beta^r+n)h_{t-r})]$ are the moment generating functions (mgf) of $u_{t-j}$ and $h_{t-r}$, respectively. Substituting these mgfs into the corresponding parts of the above equation, we get
\begin{align*}
E\big[d_t^m d_{t-r}^n\big] &= \exp\!\left(\frac{m\alpha(1-\beta^r)}{1-\beta}\right)\times\prod_{j=0}^{r-1}\exp\!\left(\frac{m^2\sigma^2\beta^{2j}}{2}\right)\times\exp\!\left((m\beta^r+n)\frac{\alpha}{1-\beta} + \frac{(m\beta^r+n)^2}{2}\times\frac{\sigma^2}{1-\beta^2}\right)\times E\big[\varepsilon_t^m\big]E\big[\varepsilon_{t-r}^n\big]\\
&= \exp\!\left(\frac{\alpha(m+n)}{1-\beta} + \frac{m^2+n^2+2mn\beta^r}{2(1-\beta^2)}\sigma^2\right)E\big[\varepsilon_t^m\big]E\big[\varepsilon_{t-r}^n\big].
\end{align*}
If $\varepsilon_t\sim$ i.i.d. Gamma$(v,1)$, then $E[\varepsilon_t^m] = \int_0^{\infty}\varepsilon_t^m\,\mathrm{pdf}(\varepsilon_t)\,d\varepsilon_t = \frac{\Gamma(v+m)}{\Gamma(v)} = (v)_m = v(v+1)\cdots(v+m-1)$. Likewise, $E[\varepsilon_t^n] = (v)_n$. If $\varepsilon_t\sim$ i.i.d. Weibull$(\gamma,1)$, then $E[\varepsilon_t^m] = \Gamma\!\left(\frac{m}{\gamma}+1\right)$. By changing the values of $m$, $n$ and $r$, we can easily obtain all orders of marginal moments and cross moments.
APPENDIX E: MOMENTS OF LOG DURATIONS Proof of Proposition 4.2: Let $k_n$ be the $n$th cumulant of $y_t$. By definition,
\[
k_n = \frac{\partial^n\phi(r_1,r_2,\ldots,r_k,\theta)}{\partial r_1^n}\bigg|_{r_1=r_2=\cdots=r_k=0}.
\]
Let $m_n$ be the $n$th moment; then we have
\begin{align*}
m_1 &= k_1 = \frac{\partial\phi(r_1,r_2,\ldots,r_k,\theta)}{\partial r_1}\bigg|_{r_1=r_2=\cdots=r_k=0} = \frac{\alpha}{1-\beta} + \Psi,\\
m_2 &= k_2 + m_1^2 = \sigma^2/(1-\beta^2) + \Psi^{(1)} + m_1^2,\\
m_3 &= k_3 + 3m_1m_2 - 2m_1^3 = \Psi^{(2)} + 3m_1m_2 - 2m_1^3,\\
m_4 &= k_4 + 6m_1^4 - 12m_1^2m_2 + 3m_2^2 + 4m_1m_3 = \Psi^{(3)} + 6m_1^4 - 12m_1^2m_2 + 3m_2^2 + 4m_1m_3,
\end{align*}
where $\Psi = \frac{\Gamma'(v)}{\Gamma(v)}$ and $\Psi^{(n)} = \big(\frac{\Gamma'(v)}{\Gamma(v)}\big)^{(n)}$, the $n$th derivative of the digamma function, for the Gamma SCD, and $\Psi = \frac{1}{\gamma}\frac{\Gamma'(1)}{\Gamma(1)}$ and $\Psi^{(n)} = \frac{1}{\gamma^{n+1}}\big(\frac{\Gamma'(1)}{\Gamma(1)}\big)^{(n)}$ for the Weibull SCD. To obtain the cross moments, we use the moment generating function of the log durations,
\[
M(r_1,r_2) = C(-ir_1,-ir_2) = \exp\!\left(\frac{\alpha(r_1+r_2)}{1-\beta} + \sigma^2\,\frac{r_1^2+r_2^2+2r_1r_2\beta^k}{2(1-\beta^2)}\right)\cdot C(-ir_1)\,C(-ir_2).
\]
Then the cross moments can be derived by taking derivatives of the mgf as follows:
\begin{align*}
E[y_ty_{t+k}] &= \frac{\partial^2M(r_1,r_2)}{\partial r_1\partial r_2}\bigg|_{r_1=r_2=0} = \frac{\sigma^2}{1-\beta^2}\beta^k + \left(\frac{\alpha}{1-\beta}+\Psi\right)^2,\\
E\big[y_ty_{t+k}^2\big] &= \frac{\partial^3M(r_1,r_2)}{\partial r_1\partial r_2^2}\bigg|_{r_1=r_2=0} = \beta^k\frac{2\sigma^2}{1-\beta^2}\left(\frac{\alpha}{1-\beta}+\Psi\right) + \left(\frac{\sigma^2}{1-\beta^2}+\Psi^{(1)}+\left(\frac{\alpha}{1-\beta}+\Psi\right)^2\right)\left(\frac{\alpha}{1-\beta}+\Psi\right),\\
E\big[y_t^2y_{t+k}^2\big] &= \frac{\partial^4M(r_1,r_2)}{\partial r_1^2\partial r_2^2}\bigg|_{r_1=r_2=0} = \frac{4\sigma^2\beta^k}{1-\beta^2}\left(\frac{\alpha}{1-\beta}+\Psi\right)^2 + 2\left(\frac{\sigma^2\beta^k}{1-\beta^2}\right)^2 + \left(\frac{\sigma^2}{1-\beta^2}+\Psi^{(1)}+\left(\frac{\alpha}{1-\beta}+\Psi\right)^2\right)^2.
\end{align*}
Econometrics Journal (2008), volume 11, pp. 617–637. doi: 10.1111/j.1368-423X.2008.00251.x
Distinguishing short and long memory volatility specifications Shiuyan Pong†, Mark B. Shackleton† and Stephen J. Taylor†
Department of Accounting and Finance, Lancaster University, LA1 4YX, UK E-mail:
[email protected],
[email protected],
[email protected] First version received: January 2006; final version accepted: March 2008
Summary Asset price volatility appears to be more persistent than can be captured by individual, short memory, autoregressive or moving average components. Fractional integration offers a very parsimonious and tempting formulation of this long memory property of volatility but other explanations such as structural models (aggregates of several autoregressive components) are possible. Given the ability of the latter to mimic the former, we investigate the extent to which it is possible to distinguish short from long memory volatility specifications. For a likelihood ratio test in the spectral domain, we investigate size and power characteristics by Monte Carlo simulation. Finally applying the same test to Sterling/Dollar returns, we draw conclusions about the minimum number of structural factors that must be present to mimic the long memory volatility properties that are empirically observed. Keywords: Long Memory, Power, Size, Spectral Test, Volatility.
1. INTRODUCTION The persistent nature of asset price volatility and power transforms of asset returns has been well documented in numerous studies, commencing with Taylor (1986). Ding et al. (1993) observed hyperbolic decay in the autocorrelations of powers of daily absolute returns obtained from U.S. stock indices, while Andersen et al. (2001a, b) found the same phenomenon in realized volatilities. Fractionally integrated, long memory models have thus received considerable interest because of their ability to capture the slowly decaying autocorrelations of volatility. Long memory models are characterized by a hyperbolic decay rate in autocorrelation, which is consistent with findings from empirical data. Baillie et al. (1996) and Bollerslev and Mikkelsen (1996) introduced long memory processes in the context of conditional variance by extending the GARCH models of Bollerslev (1986) and the exponential ARCH models of Nelson (1991). Breidt et al. (1998) proposed a long memory stochastic volatility model by incorporating a fractionally integrated process in a standard volatility scheme. Andersen et al. (2001a, b) suggest that this ARFIMA model is well suited for realized volatility that is constructed from high-frequency intraday returns. Although the evidence for long memory effects in volatility is at first sight compelling, there is an alternative explanation for these effects. Gallant et al. (1999) show that the sum of two (short memory) AR(1) processes will appear to have long memory features when the parameters are selected appropriately. A similar observation is made by Barndorff-Nielsen and Shephard (2001). Alizadeh et al. (2002) show that the sum of two AR(1) processes describes FX volatility
better than one and Pong et al. (2004), find that the sum of two AR(1) processes performs as well as a fractionally integrated process for forecasting the realization of exchange rate volatility up to three months ahead. Given the interest in volatility persistence, it would seem important to resolve the properties of its memory. This issue is particularly important for option traders when they value options that expire several months into the future. Taylor (2005, p. 394) compares the implied volatility term structures for short and long memory ARCH models, and shows that the valuation of S&P 100 options that expire after one year can be very different for short and long memory assumptions. Ohanissian et al. (2008) obtain the same conclusion for S&P 500 options when volatility is modeled by a diffusion process. Thus popular pricing models, based upon the affine jump-diffusion framework of Duffie et al. (2000), may provide inappropriate theoretical prices. Granger (1980) showed that under certain conditions a linear combination of an infinite set of AR(1) processes is a fractionally integrated process, which may therefore explain why the sum of two AR(1) processes can mimic long memory if the parameters are chosen judiciously; however, mimicking behaviour will break down as the sample length increases. Given the close relation between the sum of two AR(1) processes and a fractionally integrated process, it is natural to ask a fundamental question: given a certain data set, can these two alternative data generating processes be distinguished statistically? Likewise, for a given source of data, how much more is needed to distinguish between a true long memory process and one that merely mimics long memory within that data length? To answer these questions, in this paper we design and investigate a statistical test operating in the frequency domain based on likelihood ratios. Through a Monte Carlo study, we make use of our proposed statistical test to find out the probability of identifying the correct process given that the data was generated either by a fractionally integrated process or by a sum of two AR(1) processes designed to mimic long memory. By generating simulated series of different lengths we examine the effect of sample size on the possibility of distinguishing these two processes. We find that the probability of identifying the correct process can be relatively low, even for series of 2000 observations. It therefore appears that traders of long-lived options will often have to cope with uncertainty about the memory characteristics of volatility when they price options. We also document recommended series lengths for differentiating between a fractionally integrated process and a sum of AR(1) processes. Another test procedure has recently been developed by Ohanissian et al. (2008). They test the null hypothesis that a process has a long memory by comparing estimates of the long memory parameter d across data frequencies. Section 2 describes the definitions and features of long memory and short memory processes. Long memory and short memory processes (fractionally integrated and aggregated factor processes, respectively) are introduced and it is shown how the spectral densities of these two models can be very similar when the parameters are chosen appropriately. Section 3 describes our proposed statistical test to distinguish long memory from short memory and also the Monte Carlo design. We then present our results based on simulated series of four different lengths in Section 4. 
The findings clearly highlight the effect of series length on the probability of identifying the correct type of memory. Section 5 provides estimation results for Sterling/Dollar (GBP/USD) exchange rate volatility. The Monte Carlo simulation evidence is then used to interpret the empirical test results. With fewer than 3000 observations, it cannot be concluded that long memory or short memory characterizes the realized volatility series. However, with the number of observations multiplied by six,
Table 1. Limit properties of the autocorrelation function and the spectral density.

Limiting case                  Short memory                                 Long memory
Autocorrelation, τ → ∞         φ^{-τ} |ρ_τ| → C_1 > 0,  0 < φ < 1           τ^{1-2d} ρ_τ → D_1 > 0,  d > 0
Frequency, ω → 0               f(ω) → C_2 > 0                               ω^{2d} f(ω) → D_2 > 0,  d > 0
we show that the mimicking process can be rejected. Further conclusions are provided in Section 6.
2. LONG, SHORT AND MIMICKING MEMORY PROCESSES There are several definitions that categorize stochastic processes as having either a long memory or a short memory. Relevant discussions are provided by McLeod and Hipel (1978), Brockwell and Davis (1991), Baillie (1996) and Granger and Ding (1996). Consider a discrete-time stochastic process $\{y_t\}$, with autocorrelation function $\rho_\tau$ and spectral density $f(\omega)$. A covariance stationary process possesses a short memory if $\sum_{\tau=1}^{n}\rho_\tau$ converges as $n$ tends to infinity, otherwise it is said to have a long memory. The autocorrelation structure $\rho_\tau$ of a short memory process is geometrically bounded, while its spectral density $f(\omega)$ is bounded for all frequencies. In contrast, the autocorrelations of a long memory process have a hyperbolic decay and a spectral density that is unbounded at low frequencies. Representing the degree of fractional integration by $d$ (for a value less than 0.5) and dividing the autocorrelations $\rho_\tau$ and spectral density $f(\omega)$ by $\tau^{2d-1}$ and $\omega^{-2d}$, respectively, the resulting values will converge to positive constants when $\tau\to\infty$ and $\omega\to 0$. Table 1 summarizes the limiting properties of the autocorrelations and the spectral density for both short and long memory processes. 2.1. Fractionally integrated models Granger and Joyeux (1980) and Hosking (1981) introduced a flexible class of long memory processes, called autoregressive, fractionally integrated, moving average or ARFIMA models. For an excellent survey of long memory processes, including applications to financial data, see Baillie (1996). The ARFIMA(p, d, q) model for a process $\{y_t\}$ is given, using the lag operator $L$, by:
\[
\phi(L)(1-L)^d(y_t - \mu) = \theta(L)\epsilon_t, \qquad (2.1)
\]
where $d$ again represents the order of fractional integration, $\phi(L) = 1 - \phi_1L - \cdots - \phi_pL^p$ represents the lag polynomial for the autoregressive component, $\theta(L) = 1 - \theta_1L - \cdots - \theta_qL^q$ represents the lag polynomial for the moving average component and $\mu$ is the expectation of $y_t$. The respective degrees of the polynomials $\phi(L)$ and $\theta(L)$ are $p$ and $q$. The roots of $\phi(L)$ and $\theta(L)$ lie outside the unit circle and $\{\epsilon_t\}$ is a zero-mean white noise process with variance $\xi^2$. When $0 < d < 1$, the process has a long memory property and it is covariance stationary when $d < 0.5$. The autocorrelation function of the ARFIMA(0, d, 0) process is written as:
\[
\rho_\tau = \frac{\Gamma(1-d)\,\Gamma(\tau+d)}{\Gamma(d)\,\Gamma(\tau+1-d)}, \qquad (2.2)
\]
where $\Gamma(\cdot)$ represents the gamma function. The computation of the autocorrelations of the general ARFIMA(p, d, q) model is more complicated and detailed discussions about the calculations have been provided by Sowell (1992) and Chung (1994). Using complex notation, the spectral density function of an ARFIMA(p, d, q) model is given (see Baillie, 1996) by:
\[
f_{\mathrm{arfima}}(\omega) = \frac{\xi^2}{2\pi}\,|\theta(e^{-i\omega})|^2\,|\phi(e^{-i\omega})|^{-2}\,|1-e^{-i\omega}|^{-2d}. \qquad (2.3)
\]
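As an illustration (added here, not part of the original text), equation (2.2) can be evaluated directly; working with log-gamma values avoids overflow at large lags and makes the hyperbolic decay easy to inspect, for example with d = 0.4.

```python
import numpy as np
from scipy.special import gammaln

def arfima0d0_acf(d, max_lag):
    """Autocorrelations of an ARFIMA(0, d, 0) process, equation (2.2)."""
    tau = np.arange(1, max_lag + 1)
    log_rho = (gammaln(1 - d) + gammaln(tau + d)
               - gammaln(d) - gammaln(tau + 1 - d))
    return np.exp(log_rho)

rho = arfima0d0_acf(0.4, 1000)
print(rho[[0, 9, 99, 999]])   # decays hyperbolically, like tau**(2*d - 1) = tau**(-0.2)
```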
2.2. Structural AR-factor models Granger and Newbold (1977) showed that when p distinct and independent AR(1) processes are aggregated, an ARMA(p, p − 1) model is obtained which is categorized as a short memory structural model; the cumulative action of many factors on a system then determines its structure. Applications of structural factor models in finance can be found in Gallant et al. (1999), Alizadeh et al. (2002) and Pong et al. (2004), who all discuss and use the sum of two AR(1) processes, having an ARMA(2, 1) characterization, to capture the dynamics of asset price volatility. The ARMA(p, p − 1) model for a process $\{y_t\}$ is given by a restricted version of the more general ARFIMA model:
\[
\phi(L)(y_t - \mu) = \theta(L)\epsilon_t \qquad (2.4)
\]
with the respective degrees of the polynomials $\phi(L)$ and $\theta(L)$ equal to p and p − 1. For the special case when p is 2, an ARMA(2,1) process is the sum of two AR(1) processes if and only if the roots of the quadratic equation $z^2\phi(z^{-1}) = 0$ are both real; when these roots are real they equal the autoregressive parameters of the component AR(1) processes. It is straightforward in theory, but algebraically time consuming, to derive the autocorrelation function of an ARMA(p, q) process. However, if the process is an ARMA(p, p − 1) specification that can be rewritten as the sum of p independent AR(1) processes, then the calculations are made simpler. As the autocovariance function of the sum of independent processes is the sum of the autocovariance functions of the individual processes, the autocorrelations of the ARMA(p, p − 1) model can be calculated as:
\[
\rho_\tau = \frac{\displaystyle\sum_{k=1}^{p}\nu_k^2\,\psi_k^{\tau}}{\displaystyle\sum_{k=1}^{p}\nu_k^2}, \qquad (2.5)
\]
where $\psi_k$ and $\nu_k^2$ are, respectively, the autoregressive and variance parameters of the kth AR(1) process. Furthermore, the spectral density function can be represented as the sum of the spectral densities of p AR(1) processes:
\[
f_{\mathrm{sum}}(\omega) = \frac{1}{2\pi}\sum_{k=1}^{p}\nu_k^2\big(1-\psi_k^2\big)\,|1-\psi_ke^{-i\omega}|^{-2}. \qquad (2.6)
\]
Figure 1. Spectral densities of ARMA(2,1) and ARFIMA(0, d, 0) processes.
2.3. Comparison of ARFIMA and ARMA spectral densities Gallant et al. (1999) point out that the sum of two AR(1) processes can mimic the appearance of long memory in volatility when an appropriate choice of parameters is made. For the commonly found value of d = 0.4, they used a least squares fit to find the two-factor process that best fits the spectral density of a fractionally integrated process. Specifically, an approximation to ARFIMA(0, d, 0) can be made by an ARMA(2, 1) obtained via the following spectral minimization:
\[
\min_{\psi_1,\psi_2,\nu_1,\nu_2}\ \sum_{j=1}^{300}\Bigg[\xi^2\,|1-e^{-i\omega_j}|^{-2d} - \sum_{k=1}^{2}\nu_k^2\big(1-\psi_k^2\big)\,|1-\psi_ke^{-i\omega_j}|^{-2}\Bigg]^2, \qquad (2.7)
\]
where $\omega_j = \frac{j\pi}{300}$. With d chosen as 0.4 and ξ as 0.2, (ψ1, ν1, ψ2, ν2) = (0.978, 0.136, 0.447, 0.216) minimizes the sum in equation (7). The values of the autoregressive parameters indicate that the first AR(1) process is highly persistent (ψ1 = 0.978) while the second has a transient nature (ψ2 = 0.447). The two variances are similar. Figure 1 shows the theoretical spectral densities of the ARFIMA(0, d, 0) and ARMA(2, 1) models. It can be seen that the spectral density of the long memory model is closely approximated by that of the short memory model over much of the frequency domain. The match is strikingly good and it may not be possible to distinguish the spectral densities with the naked eye. For the realized volatility of USD/GBP, USD/DEM and USD/JPY exchange rates, spanning seven years, Pong et al. (2004) use maximum likelihood estimation in the spectral domain to
estimate ARFIMA and ARMA(2, 1) models. They find that the likelihood function values of both models are very close, for each of the three exchange rates, and neither of the models possesses forecasting abilities that are statistically superior to the other. Their findings highlight the proximity of the fractionally integrated process and the two-factor AR process in terms of their persistence. These results lead to two fundamental and linked questions. First, can two-factor and fractionally integrated processes be distinguished statistically? Second, what length of a financial time series is necessary to have a “good” chance of identifying the correct type of model? A Monte Carlo analysis can provide some answers to these questions. By generating simulated series from a specific process repeatedly, we can calculate the percentage probability of recovering the same process based on the Monte Carlo evidence. In addition, we can investigate this probability as a function of the length of the financial time series to determine the ability of a particular statistical test to distinguish long and short memory series.
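To make the spectral matching of equation (2.7) concrete, it can be reproduced numerically. The sketch below is an illustration only: the optimizer, bounds and starting values are our own choices, not those used by Gallant et al. (1999) or in this paper, so the fitted values should only be expected to lie close to (0.978, 0.136, 0.447, 0.216).

```python
import numpy as np
from scipy.optimize import minimize

d, xi = 0.4, 0.2
omega = np.arange(1, 301) * np.pi / 300.0
f_long = xi**2 * np.abs(1 - np.exp(-1j * omega)) ** (-2 * d)   # ARFIMA(0, d, 0) side of (2.7)

def f_sum(params, w):
    """Two-factor spectrum: sum of two AR(1) spectral densities (the common 2*pi factor is omitted, as in (2.7))."""
    psi1, nu1, psi2, nu2 = params
    out = 0.0
    for psi, nu in ((psi1, nu1), (psi2, nu2)):
        out = out + nu**2 * (1 - psi**2) * np.abs(1 - psi * np.exp(-1j * w)) ** (-2)
    return out

def objective(params):
    return np.sum((f_long - f_sum(params, omega)) ** 2)

start = np.array([0.9, 0.1, 0.5, 0.2])                       # assumed starting values
bounds = [(0.0, 0.999), (1e-4, 1.0), (0.0, 0.999), (1e-4, 1.0)]
res = minimize(objective, start, bounds=bounds, method="L-BFGS-B")
print(res.x)   # should lie close to (0.978, 0.136, 0.447, 0.216)
```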
3. MONTE CARLO SIMULATION A number of simulation studies have addressed the performances of different tests designed to distinguish long memory and short memory processes. Breidt et al. (1998) investigate the abilities of a spectral regression test, proposed by Geweke and Porter-Hudak (1983) and further developed by Robinson (1994), as well as the so-called R/S statistic test (see Beran, 1994) used to detect long memory in stock return volatility. They find that the tests are able to distinguish between ARFIMA(0, d, 0) and ARFIMA(1, 0, 0) specifications incorporated in a stochastic volatility model. Smith et al. (1997) use semi-parametric and maximum likelihood estimation methods to investigate the probability of selecting the correct specification given a simulated fractionally integrated process. They find that it is not always possible to identify the existence of long memory statistically for short simulated series with 256 observations. Our research is similar in spirit but is different in four aspects. First, we place emphasis specifically on two types of process in our simulation analysis, the fractionally integrated process and the two-factor process, because of their significance and popularity in volatility modeling and in particular the ability of the latter to mimic the former. Second, the typical time span of financial data employed in recent volatility modeling research is in the neighbourhood of ten years (see Gallant, 1999, Andersen et al., 2001a, b, and Areal and Taylor, 2002). Therefore, the length of simulated series in our Monte Carlo study is chosen to match these features. Thirdly, previous studies concern the ability of statistical tests to identify a long memory feature when long memory actually exists in the data. Our paper provides a new dimension by looking at the possibility of mistaking a mimicking two-factor process for a long memory process. Finally, we investigate how the length of the time series affects the probability of identifying the correct type of process. 3.1. Simulation design We design a statistical test to detect the correct type of memory in a Gaussian process which is composed of a pair of likelihood ratio tests. The Gaussian assumption is appropriate for the process of primary interest to us, namely the logarithm of realized volatility, because the empirical evidence shows that the unconditional distribution of log volatility is almost normal; C The Author(s). Journal compilation C Royal Economic Society 2008
Figure 2. Design of the combined likelihood ratio test to detect the correct specification in simulated series.
this evidence commences with Andersen et al. (2001a, b) and has been confirmed by several subsequent studies, such as Areal and Taylor (2002) and Pong et al. (2004). We first consider a series simulated by an ARFIMA(0, d, 0) model. The objective of the test is to choose between ARFIMA(0, d, 0) and ARMA(2,1) models. Since these models are not nested, they cannot be compared directly. However, they can be tested indirectly by using the encompassing ARFIMA(2, d, 1) specification as a bridge. A likelihood ratio test can be used to choose between ARFIMA(0, d, 0) and ARFIMA(2, d, 1) by testing the null hypothesis that (φ 1 , φ 2 , θ 1 ) = (0, 0, 0). The test should (asymptotically) nearly always accept the null hypothesis if the true process is ARFIMA(0, d, 0) and the significance level is small. We can simultaneously use another likelihood ratio test to choose between the ARMA(2,1) and ARFIMA(2, d, 1) models by testing the null hypothesis that d = 0, when the null hypothesis should be rejected (asymptotically). There are four combinations of results from the pair of LR tests. First, the combined LR test can accept ARFIMA(0, d, 0) and reject ARMA(2,1). Second, it can accept ARMA(2,1) and reject ARFIMA(0, d, 0). Thirdly, it can accept both the ARFIMA(0, d, 0) and ARMA(2,1) models. Finally, it can reject both models and accept ARFIMA(2, d, 1). The left panel of Figure 2 provides a graphic presentation of the test design. All our estimates of ARMA(2,1) models do not restrict the AR and MA parameters, so we do not explicitly assume that these models are the sum of two AR(1) processes. However, we always find that these estimated models do correspond to AR(1) sums, i.e. the quadratic equation z2 φ(z−1 ) = 0 always has two real roots. By replicating the experiment and documenting the percentage of times each of these four conclusions is obtained we find the probability of identifying the existence of long memory using the combined LR test. Asymptotically the tests should nearly always accept ARFIMA(0, d, 0) and reject ARMA(2,1); however, bearing in mind their proximity it is expected that the test results will be less than perfect given a series of finite length. Our Monte Carlo results, which are shown in Section 4, do indeed show that long and short memories are not always distinguishable. On the other hand, if the series is truly generated by an ARMA(2,1) model for our simulation experiment, the percentage of detections of short memory can be obtained using the same procedure, except that one modification has been made. One leg of the LR tests is the same as before, in which the ARMA(2,1) and ARFIMA(2, d, 1) models are compared, while the design C The Author(s). Journal compilation C Royal Economic Society 2008
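A compact way to express the decision rule just described is sketched below. This is our own illustrative code, not from the paper: the two likelihood-ratio statistics are compared with chi-squared critical values (three restrictions for the first test here, one for the second), and an estimated ARMA(2,1) is checked for real roots of $z^2\phi(z^{-1}) = 0$ so that it can be interpreted as a sum of two AR(1) components.

```python
import numpy as np
from scipy.stats import chi2

def classify(lr_restricted_vs_full, lr_arma_vs_full, df_first=3, df_second=1, level=0.05):
    """Combine the two LR tests into one of the four possible conclusions."""
    reject_restricted = lr_restricted_vs_full > chi2.ppf(1 - level, df_first)   # e.g. 7.81 for df = 3
    reject_arma = lr_arma_vs_full > chi2.ppf(1 - level, df_second)              # e.g. 3.84 for df = 1
    if not reject_restricted and reject_arma:
        return "accept restricted model (e.g. ARFIMA(0,d,0)), reject ARMA(2,1)"
    if reject_restricted and not reject_arma:
        return "reject restricted model, accept ARMA(2,1)"
    if not reject_restricted and not reject_arma:
        return "cannot distinguish: both models accepted"
    return "reject both: accept ARFIMA(2,d,1)"

def ar1_components(phi1, phi2):
    """Roots of z^2*phi(z^{-1}) = z^2 - phi1*z - phi2 = 0; if real, they are the AR(1) parameters."""
    roots = np.roots([1.0, -phi1, -phi2])
    return roots if np.all(np.isreal(roots)) else None

print(classify(1.2, 9.5))
print(ar1_components(1.2008, -0.2215))   # approximately [0.973, 0.228]
```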
Table 2. Averages of parameter estimates for simulated series.

Simulated process   Estimated model    Parameter   Length = 2000   Length = 4000   Length = 8000   Length = 12,000   True value
ARFIMA(0, d, 0)     ARFIMA(0, d, 0)    d_MLE       0.4022          0.4020          0.4016          0.4010            0.4000
                                       d_GPH       0.4027          0.4019          0.4018          0.4015            0.4000
                    ARMA(2,1)          ψ1          0.9465          0.9541          0.9605          0.9632
                                       ν1          0.1806          0.1836          0.1870          0.1884
                                       ψ2          0.2428          0.2527          0.2614          0.2643
                                       ν2          0.1857          0.1870          0.1889          0.1895
ARMA(2,1)           ARMA(2,1)          ψ1          0.9742          0.9763          0.9767          0.9774            0.9780
                                       ν1          0.1387          0.1378          0.1368          0.1369            0.1360
                                       ψ2          0.4419          0.4447          0.4466          0.4472            0.4470
                                       ν2          0.2146          0.2150          0.2156          0.2156            0.2160
                    ARFIMA(0, d, 0)    d_MLE       0.3030          0.3047          0.3029          0.3040
                                       d_GPH       0.3317          0.3289          0.3245          0.3212

Note: The degree of fractional integration is estimated using the maximum likelihood estimation method and the GPH (1983) method modified by Robinson (1994). The parameters of the ARMA(2,1) model for simulation are chosen to mimic an ARFIMA(0, d, 0) process with d = 0.4.
of the second leg is changed by comparing ARFIMA(1, d, 1) instead of ARFIMA(0, d, 0) with ARFIMA(2, d, 1) in order to keep the same degrees of freedom for the two LR tests. The test design is shown in the right panel of Figure 2. To carry out the LR test, we perform the maximum likelihood estimation in the frequency domain. This estimation methodology is computationally efficient, particularly for long time series, because the Fast Fourier Transform is applicable. Whittle (1951) provides the expression for the log-likelihood function of a process in the frequency domain, under the assumption of normality of the residuals. The log-likelihood function for N observations is log L = −
N−1 N−1 1 1 log[2π f (ωj )] − [f (ωj )/f (ωj )]. 2 j =1 2 j =1
(3.1)
The theoretical spectral density, $f(\omega_j)$, at frequency $\omega_j = 2\pi j/N$, is always computed from equation (2.3), while $\hat f(\omega_j)$ is the value of the periodogram at frequency $\omega_j$. Discussion of maximum likelihood estimation in the frequency domain has been provided by Fox and Taqqu (1986), Cheung and Diebold (1994) and Baillie (1996) in the context of long memory. The maximum likelihood estimation can also be performed by full maximum likelihood estimation (MLE) in the time domain; however, it is computationally intensive especially for long time series. Cheung and Diebold (1994) find that spectral-likelihood estimation is preferable to full MLE when the mean of the process is not required. Breidt et al. (1998) show that strong consistency can be obtained by MLE in the frequency domain. These findings explain our choice of MLE in the frequency domain for our simulation exercise. Davies and Harte (1987) suggest an algorithm for simulating ARFIMA models which makes use of a two-step Fourier transform to simulate a series which matches the autocovariance structure of the fractionally-integrated process. Beran (1994) provides a review of standard
approaches for simulating ARFIMA models, including the Cholesky decomposition of the covariance matrix of the time series and the Davies and Harte algorithm. Our Monte Carlo analysis relies on the Davies and Harte algorithm to simulate the ARFIMA(0, d, 0) process. The simulation of the ARMA(2,1) process is relatively simple, which can be done by summing two independent first-order autoregressive Gaussian processes. We set the minimum length of simulated series as 2000. Assuming 250 trading days in a year, such a length is equivalent to a time span of eight years for a daily sampling frequency and it represents the typical length of financial time series employed in volatility modeling. In order to investigate the effect of series length on identification of the true processes, we also include simulated series with lengths of 4000, 8000 and 12,000 in our study. While it is predicted that the power of the test increases when the simulated series is lengthened, we aim to identify a length which gives us a sufficient level of confidence to identify the correct type of memory in a process. Finally, our analysis is based on 1000 replications of each combined LR test.
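The following sketch illustrates the building blocks just described, with simplified choices of our own: the ARFIMA(0, d, 0) series is simulated via the Cholesky factor of its autocovariance matrix built from equation (2.2), rather than the Davies and Harte algorithm used in the paper, the two-factor series is the sum of two AR(1) processes, and the Whittle log-likelihood (3.1) is evaluated for a candidate spectral density. It is an illustration of the procedure, not the authors' code.

```python
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(1)

def simulate_arfima0d0(N, d, xi):
    """ARFIMA(0, d, 0) via the Cholesky factor of its autocovariance matrix (O(N^3); fine for illustration)."""
    tau = np.arange(1, N)
    rho = np.exp(gammaln(1 - d) + gammaln(tau + d) - gammaln(d) - gammaln(tau + 1 - d))
    acf = np.concatenate(([1.0], rho))
    gamma0 = xi**2 * np.exp(gammaln(1 - 2 * d) - 2 * gammaln(1 - d))   # process variance
    cov = gamma0 * acf[np.abs(np.subtract.outer(np.arange(N), np.arange(N)))]
    return np.linalg.cholesky(cov) @ rng.standard_normal(N)

def simulate_two_factor(N, psi, nu, burn=500):
    """Sum of two independent AR(1) processes; psi are AR parameters, nu are process standard deviations."""
    psi = np.asarray(psi, dtype=float)
    innov_sd = np.asarray(nu, dtype=float) * np.sqrt(1.0 - psi**2)
    x, out = np.zeros(2), np.empty(N)
    for t in range(-burn, N):
        x = psi * x + innov_sd * rng.standard_normal(2)
        if t >= 0:
            out[t] = x.sum()
    return out

def whittle_loglik(y, spec_fun):
    """Whittle log-likelihood (3.1); spec_fun maps frequencies omega_j to a theoretical spectral density."""
    N = len(y)
    omega = 2.0 * np.pi * np.arange(1, N) / N
    periodogram = np.abs(np.fft.fft(y - np.mean(y))[1:N]) ** 2 / (2.0 * np.pi * N)
    f = spec_fun(omega)
    return -0.5 * np.sum(np.log(2.0 * np.pi * f)) - 0.5 * np.sum(periodogram / f)

# Example: a length-2000 series from the mimicking two-factor process of Section 2.3,
# scored against the ARFIMA(0, 0.4, 0) spectral density of equation (2.3).
y = simulate_two_factor(2000, psi=(0.978, 0.447), nu=(0.136, 0.216))
f_arfima = lambda w: (0.2**2 / (2.0 * np.pi)) * np.abs(1.0 - np.exp(-1j * w)) ** (-2 * 0.4)
print(whittle_loglik(y, f_arfima))
```

In the paper the LR statistics are obtained by maximizing this log-likelihood separately under each of the nested spectral specifications; the sketch only evaluates it at fixed parameter values.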
4. RESULTS The upper part of Table 2 shows the results of parameter estimation when the simulated series is generated from an ARFIMA(0,d,0) model with d = 0.4 and ξ = 0.2. This value of d represents the typical degree of fractional integration found in recent volatility studies (e.g. Andersen et al., 2001a, b and Areal and Taylor, 2002). The first two rows of Table 2 show the mean value of the estimated d for different simulated lengths when we estimate the correct specification. The values of d are estimated using the maximum likelihood estimation method as well as the semi-parametric method proposed by Geweke and Porter-Hudak (1983). It is found that the mean estimates are close to the true value of d. However, what deserves more attention are the results found in the next four rows which show the mean estimates of the parameters when we mistakenly suppose the two-factor model is the correct model. The values of the estimates indicate that the typical mistaken process is composed of a highly persistent process and a transient process. The lower part of Table 2 shows the estimation results when the simulated series is generated by the two-factor model with (ψ 1 , ν 1 , ψ 2 , ν 2 ) = (0.978, 0.136, 0.447, 0.216). These input values are obtained using the least squares approximation stated in equation (7), to match a fractionally integrated process with d = 0.4. The values of the estimates for the two-factor model are found to be close to the input values as shown in the table. The figures in the next two rows show the average value of d when we mistake the correct specification as ARFIMA(0, d, 0). It is found that the average values of d are significantly greater than zero and are in the neighbourhood of 0.3. The above results carry important implications. They suggest that it is often easy to claim a series has a long memory by performing some of the popular statistical tests for detecting long memory but without considering the short memory alternative specifications (and also vice versa). 4.1. Identifying long memory Using the Monte Carlo design suggested in Section 3.1, we investigate the probability of identifying long memory given a simulated series generated from an ARFIMA(0, d, 0) model.
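For reference, the semi-parametric estimator reported as d_GPH in Table 2 is the log-periodogram regression of Geweke and Porter-Hudak (1983). A minimal sketch is given below; it is an illustration we add here, and the bandwidth choice m = N^0.5 and other details are assumptions rather than the settings used in the paper.

```python
import numpy as np

def gph_estimate(y, power=0.5):
    """Log-periodogram (GPH) estimate of d: regress log I(omega_j) on -log(4 sin^2(omega_j / 2))."""
    N = len(y)
    m = int(N ** power)                                  # number of low-frequency ordinates used
    j = np.arange(1, m + 1)
    omega = 2 * np.pi * j / N
    periodogram = np.abs(np.fft.fft(y - np.mean(y))[1:m + 1]) ** 2 / (2 * np.pi * N)
    x = -np.log(4 * np.sin(omega / 2) ** 2)
    x_c = x - x.mean()
    return np.sum(x_c * np.log(periodogram)) / np.sum(x_c ** 2)

# e.g. gph_estimate(simulated_series) should be close to the true d on average
```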
Figure 3. The log-likelihood ratios from combined likelihood ratio tests when simulated series are generated from the ARFIMA(0, d, 0) process.
We define λ1 as the likelihood ratio test statistic obtained from the LR test that compares ARFIMA(0, d, 0) with ARFIMA(2, d, 1) and λ2 as that obtained from comparing ARMA(2,1) with ARFIMA(2, d, 1). Figures 3(a–d) show plots of λ2 against λ1, with both variables plotted on a logarithmic scale. The dotted lines on the figures represent the cut-off boundaries of both tests at the 5% significance level. For this significance level, a test statistic greater than 7.81 leads to a rejection of the ARFIMA(0, d, 0) model in the first LR test and a statistic greater than 3.84 leads to a rejection of the ARMA(2,1) model in the second LR test. We first look at the results when the simulated series consists of 2000 observations, as shown in Figure 3(a). It can be easily noticed that there are many points located in the region for accepting the ARMA(2,1) model. This indicates that there is a significant probability of failing to reject the wrong specification. Figures 3(b–d) show the plots when the lengths of simulated series are 4000, 8000 and 12,000, respectively. The values of λ2 generally shift upwards as the
simulated length increases and as a result fewer points are found within the acceptance region of the second leg of the test. It shows that, as expected, the ability of the test to detect long memory improves as the simulated length increases. The chance of obtaining each of the four different conclusions is also displayed on Figure 3. The numbers in the upper left-hand quadrants represent the fraction of simulations for which the correct specification is obtained, i.e. the ARFIMA(0, d, 0) model is accepted while the ARMA(2,1) model is rejected. This value equals 72% when the simulated length is 2000. The probability of identifying the ARFIMA(0, d, 0) model increases substantially to 91% when the length of the simulated series is extended to 4000; the probability approaches 95% (the asymptotic value of the test) when the length reaches 8000. In addition to having a snapshot of the results at the 5% significance level, it is interesting to see how the performance of the LR tests changes with the significance level. The plot of the probability of rejecting the ARMA(2,1) model against the probability of rejecting the ARFIMA(0, d, 0) model for different length series is given by Figure 4(a). The former and latter probabilities are labeled and can be regarded as the power and the size of the test, respectively. As expected, the power increases with the size of the test. However, the convex shapes of the curves show that the rate of change in power decreases when the size becomes greater. Given any size, the longer the simulated series the higher is the power. When the length reaches 8000, the power is close to 100% even for a very small test size. 4.2. Identifying short memory We now investigate the ability of the combined LR test to identify memory correctly given that the underlying series is generated from an ARMA(2,1) model. The first LR test compares the ARFIMA(1, d, 1) and the ARFIMA(2,d,1) models and the second compares the ARMA(2,1) and the ARFIMA(2, d, 1) models. The parameter values of the ARMA(2,1) model are chosen as shown in Section 2.3. The scatter diagrams of the likelihood-ratio test statistics λ 1 and λ 2 can be found in Figures 5(a–d), where a similar pattern to that of Figure 3 is seen. The boundaries of the acceptance region for both tests (shown as dotted lines) are 3.84 for a 5% significance level. Several points can be found in the acceptance region of the first test when the length is 2000 and it implies a significant probability of failing to reject the wrong specification. As the length of the simulated series increases, the whole pattern shifts towards the right and it indicates that it is easier to identify the true ARMA(2,1) process. The probabilities of obtaining the various conclusions for different simulated lengths at the 5% significance level are also shown on the same figure. Given that the simulated series are generated from an ARMA(2,1) model we find that the probability of failing to reject the ARFIMA(1, d, 1) specification is only substantial for a series length of 2000. On 43% of occasions the long memory hypothesis is not rejected. This probability can be further broken down into two components: for 38% of the series the ARFIMA(1, d, 1) and ARMA(2,1) model cannot be differentiated and 5% of the time accepting ARFIMA(1, d, 1) is the sole conclusion. The probability of choosing the correct ARMA(2,1) specification is 55%, which can be found in the lower-right quadrant. The performance of the test improves as the length of the series increases. 
The probability of identifying short memory rises to 82% when the length is 4000 and the probability of choosing the correct specification is close to 95% when the length of the simulated series is either 8000 or 12,000.
Figure 4. Comparisons of test size and power.
Figure 4(b) shows the size-power curve for the combined likelihood test given the generating process is ARMA(2,1). The value of size represents the probability of rejecting ARMA(2,1), which is the correct specification, while that of power represents the probability of rejecting ARFIMA(1, d, 1), which is the incorrect specification. The pattern of the graph is similar to that of Figure 4(a). In particular, it is again found that the power of the test increases as the length of the simulated series becomes longer given any test size. In addition, the rate of change of power
Figure 5. The log-likelihood ratios from combined likelihood ratio tests when simulated series are generated from the ARMA(2,1) process.
is higher when the test size is smaller. Finally, the power is close to 100% even for a small size when the length of simulation reaches 8000. 4.3. Sensitivity analysis Sensitivity analysis is now performed with regards to changes in the parameter values of ARFIMA and ARMA models. We investigate the effect of a reduction in the persistence level of a process on the probability of identifying the correct specification. In particular, to simulate the long memory series, we suppose d is reduced from 0.4 to one of 0.3, 0.2 and 0.1. To generate short memory series, we choose the three following parameter sets: (ψ 1 , ν 1 , ψ 2 , ν 2 ) = (0.967, 0.095, 0.304, 0.204), (0.942, 0.069, 0.165, 0.196), (0.894, 0.048, 0.063, 0.196), which correspond to solving equation (7) while using d = 0.3, 0.2 and 0.1 and ξ = 0.2 as inputs. Tables 3A and 3B display the results of the sensitivity analysis when the series are generated by ARFIMA(0, d, 0) and ARMA(2,1) models, respectively and all tests have a 5% significance level. We find that the performance of the combined LR test generally improves as the simulated
Table 3. Probabilities of correct memory choices as the level of persistence changes.

A. Simulated series generated from the ARFIMA(0, d, 0) process
Parameter/length        2000     4000     8000     12,000
d = 0.4                 72.4%    90.6%    93.7%    94.6%
d = 0.3                 61.2%    87.9%    91.9%    94.8%
d = 0.2                 37.8%    66.8%    90.1%    94.3%
d = 0.1                 7.1%     19.5%    46.5%    64.3%

B. Simulated series generated from the two-factor process ARMA(2,1)
Parameter/length                                  2000     4000     8000     12,000
(ψ1, ν1, ψ2, ν2) = (0.97, 0.14, 0.44, 0.21)       55.4%    81.9%    93.9%    95.7%
(ψ1, ν1, ψ2, ν2) = (0.97, 0.10, 0.30, 0.20)       46.3%    65.7%    91.9%    94.9%
(ψ1, ν1, ψ2, ν2) = (0.94, 0.07, 0.17, 0.20)       25.9%    48.1%    77.5%    90.1%
(ψ1, ν1, ψ2, ν2) = (0.90, 0.05, 0.06, 0.20)       9.6%     16.6%    30.4%    46.0%

Note: Panels A and B show the percentage of simulations for which the correct process is identified using a significance level of 5%. The results in Panels A and B are obtained using simulated series generated by the ARFIMA(0, d, 0) and ARMA(2,1) processes, respectively. The parameter values of the ARMA(2,1) process are obtained by approximating the spectral density of an ARFIMA(0, d, 0) process with d = 0.4, 0.3, 0.2 and 0.1, respectively.
series is lengthened for all persistence levels. However, the probability of identifying the correct type of memory deteriorates as the persistence level decreases for both cases. This finding is expected since a long memory process behaves more like its short memory counterpart as the degree of fractional integration diminishes. The probability of identifying the correct type of memory is found to be relatively low for small values of d or its matching AR parameters even for a very long time series. For example, we can see from Table 3A that when the length is 12,000 the probability of correctly identifying the ARFIMA(0, d, 0) specification is 64% when d = 0.1 and such a probability is much less than the 95% obtained when d = 0.4. Similar findings are obtained when the underlying process is ARMA(2,1), as shown in Table 3B. They indicate that when the level of persistence is low, a very large number of observations is required in order to have acceptable power to identify the correct specification. 4.4 Implications for volatility modeling Our simulation study illustrates that it is possible to mistake a fractionally integrated series for a two-factor process, irrespective of the amount of data used (and vice versa). This result carries significant implications for volatility modeling. Claiming that a volatility series has a fractionally integrated structure while not considering the factor specification represents a bold assertion. It is therefore wise to consider both specifications before any conclusion on the type of memory
present in the volatility series is made. Our proposed test procedures serve this purpose and they identify the probability of incorrect conclusions. With limited data, our study shows that it is not always possible to distinguish a long memory process and a short memory process statistically. In volatility modeling and forecasting research, the typical time span of financial data employed is in the neighbourhood of eight years, which is equivalent to 2000 daily observations. Our results illustrate that based on this kind of history it is difficult to tell whether the long or short memory model describes the volatility process better. However, the encouraging news is that the ability to distinguish long memory and short memory processes improves if a more substantial data history is investigated.
5. EMPIRICAL EVIDENCE In this section, we show how the Monte Carlo results can be applied in an empirical study to distinguish between long and short memory in a volatility process. The data used in our empirical study are GBP/USD spot exchange rates provided by Olsen and Associates. The raw data is in the form of five-minute exchange rate differences computed using mid-quote exchange rates. The sample period is from July 2, 1987 to December 31, 1998. Although the foreign exchange market operates twenty-four hours a day, seven days a week, trading volume and volatility of the market diminish significantly during weekends and holidays, as depicted for example by Bollerslev and Domowitz (1993). To eliminate the weekend effect, we exclude the period from 21:00 GMT Friday to 21:00 GMT Sunday from our study. We also exclude several holidays: bank holidays in May and August for the U.K. and Memorial Day, July Fourth, Labour Day and Thanksgiving and the day after for the U.S. In addition, the returns during Christmas (24/12 to 26/12), Good Friday, Easter Monday and New Year (31/12 to 01/01) are also removed. We are left with a volatility series spanning 2867 days. Our analysis is carried out based on the realized volatility series of the GBP/USD exchange rate. Realized volatility is constructed by summing the squared intraday returns sampled at a particular frequency. Andersen et al. (2001a, b) have shown that when certain regularity conditions apply, as sampling becomes more frequent, the realized volatility is an increasingly accurate measure of integrated return volatility. The realized variance for day t is defined by the following equation:
\[
\sigma_t^2 = \sum_{j=1}^{n} r_{t,j}^2, \qquad (5.1)
\]
where $r_{t,j}$ is the return in interval j on day t and n represents the total number of intervals in a day. To construct the realized volatility series we adopt a five minute sampling frequency in order to lessen the effects of market microstructure. As a result, 288 five-minute returns are used to construct one daily observation of realized volatility. Following Andersen et al. (2001a, b), we model $\log(\sigma_t^2)$ because its distribution is closer to Gaussian. The upper panel of Table 4A shows the results of the parameter estimation from MLE in the frequency domain. The value of d in the ARFIMA(1, d, 1) model is 0.348 while $\hat\phi_1$ and $\hat\theta_1$ are 0.943 and 0.899, respectively. The estimated values of the parameters of the ARMA(2,1) model are $(\hat\phi_1, \hat\phi_2, \hat\theta_1) = (1.201, -0.221, 0.797)$. The autoregressive parameters in the two AR(1) components that aggregate to the ARMA(2,1) model are equal to 0.973 and 0.228. These
632
S. Pong, M. B. Shackleton and S. J. Taylor
results are consistent with the estimates in Pong et al (2004), who use a shorter time span for the same exchange rate. Not knowing the true volatility process, we now use LR tests to choose the most suitable model for the log realized volatility. Two test values and the corresponding significance levels are shown in the lower panel of Table 4A. We first consider the LR test statistic for comparing the ARFIMA(1, d, 1) and ARFIMA(2, d, 1) models. The results show that the null hypothesis φ 2 = 0 cannot be rejected at the 5% significance level, as the p-value of this test is 23%; consequently this test can accept ARFIMA(1, d, 1) as the correct specification. We then compare the loglikelihoods of the ARMA(2,1) and ARFIMA(2, d, 1) specifications. The null hypothesis d = 0 cannot be rejected at a 5% significance level, the p-value being 11%. The combined result of the two tests indicates that we cannot statistically conclude whether the ARFIMA(1, d, 1) or the ARMA(2,1) model is the better model for the log volatility when the length of our volatility series is ‘only’ 2867 daily observations. Our Monte Carlo evidence indicates that a longer series is required to improve the statistical performance of the combined LR test. There are two ways to increase the amount of data for our estimation. The first method is to lengthen the sample period so that more daily observations of realized volatilities are obtained. However this method depends on data availability. The second method is to shorten the sampling interval so that more observations are obtained given a fixed time span. We will adopt the latter approach for our empirical analysis and estimate 4-hour realized volatilities by summing the squares of 48, five-minute returns. As a result, 17,202 observations are produced, which is six times the number of observations obtained on a daily basis. We may suppose that 4-hour realized volatilities (RVs) are the product of latent volatility, a cyclical term and a residual factor that averages unity. The logarithms of the 4-hour RVs will have the same memory characteristics (long or short) as the log of the latent 4-hour volatility process. Furthermore, as the 24-hour RV is the sum of six 4-hour RV terms we expect the latent 4-hour and the latent 24-hour series to have identical memory characteristics (Andersen and Bollerslev, 1997); either both have a short memory or both have a long memory with the same value of d. Figure 6 shows the logarithm of the periodogram of the 4-hour realized volatility series from January 1, 1993 to December 31, 1994. Periodicity can be found at the daily frequency (and its harmonics) which can be attributed to intraday volatility patterns that reflect the opening and closing of various financial centers in the world, their attendant macroeconomic news schedules, etc. The shape at the left end of the spectrum (near ω = 0) indicates a high volatility persistence. In order not to bias the estimation of the long memory spectrum when performing maximum likelihood estimation, we need to get rid of the periodic intraday component in the volatility series. We apply a relatively simple but efficient frequency-domain method that removes the periodic effects, as recommended in Shackleton (1998) which is similar to the method applied in Deo et al. (2006). By omitting observations at the daily frequency and its harmonics (and also the ten adjacent ordinates) on the periodogram, we obtain a deseasonalised periodogram. 
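The two data-handling steps used in this section, building realized variances from five-minute returns as in equation (5.1) and deleting the daily-frequency ordinates (with their harmonics and the adjacent ordinates) from the periodogram, can be sketched as follows. This is our own illustration: the synthetic input data, the variable names and the use of ten adjacent ordinates around each harmonic are assumptions based on the description above, not code or data from the paper.

```python
import numpy as np

def realized_variance(intraday_returns, per_block=288):
    """Equation (5.1): sum of squared intraday returns per block.
    per_block=288 five-minute returns gives daily RV; per_block=48 gives 4-hour RV."""
    r = np.asarray(intraday_returns)
    n_blocks = r.size // per_block
    return (r[: n_blocks * per_block].reshape(n_blocks, per_block) ** 2).sum(axis=1)

def deseasonalised_periodogram(log_rv, obs_per_day=6, n_adjacent=10):
    """Periodogram of the demeaned log RV series with the daily frequency, its harmonics
    and the adjacent ordinates deleted, as described in the text."""
    x = np.asarray(log_rv) - np.mean(log_rv)
    N = x.size
    j = np.arange(1, N // 2)
    I = np.abs(np.fft.fft(x)[1 : N // 2]) ** 2 / (2 * np.pi * N)
    keep = np.ones_like(j, dtype=bool)
    for k in range(1, obs_per_day // 2 + 1):          # daily frequency and its harmonics
        keep &= np.abs(j - k * N / obs_per_day) > n_adjacent
    return 2 * np.pi * j[keep] / N, I[keep]

# Synthetic five-minute returns, purely for illustration (not the Olsen data)
rng = np.random.default_rng(2)
five_min_returns = rng.normal(0.0, 1e-3, size=288 * 500)
rv_4h = realized_variance(five_min_returns, per_block=48)
freqs, I_clean = deseasonalised_periodogram(np.log(rv_4h))
print(len(rv_4h), len(freqs))
```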
The omission of the ordinates is not expected to distort the pure stochastic components of the volatility process, since only a small fraction of ordinates in the periodogram are deleted (the percentage deleted is less than 1%). Table 4B shows the results of maximum likelihood estimation for the ARFIMA(1, d, 1), ARMA(2,1) and ARFIMA(2, d, 1) models using all 17,202 observations. When modeled as ARFIMA(1, d, 1), the value of d equals 0.470; when modeled as ARMA(2,1), φ1, φ2 and θ1 equal 1.195, −0.204 and 0.922, respectively. The autoregressive parameters in the two AR(1) components that aggregate to the ARMA(2,1) model are equal to 0.990 and 0.206, respectively. Finally, when modeled as
Table 4. Maximum likelihood estimation for log realized volatilities, for the £/$ exchange rate.

A. N = 2867
I. Parameter estimates
ARFIMA model       (1, d, 1)           (2, 0, 1)            (2, d, 1)
d                  0.3479 (0.0476)                          0.2614 (0.0742)
AR(1)              0.9430 (0.0355)     1.2008 (0.0397)      1.0142 (0.0307)
AR(2)                                  −0.2215 (0.0365)     −0.0621 (0.0313)
MA(1)              0.8988 (0.0595)     0.7966 (0.0306)      0.8756 (0.0513)
log likelihood     3027.72             3027.15              3028.45
II. Likelihood ratio tests
Model comparison             (1, d, 1) v (2, d, 1)    (2, 0, 1) v (2, d, 1)
Test statistic               1.47 (0.23)              2.60 (0.11)

B. N = 17,202
I. Parameter estimates
ARFIMA model       (1, d, 1)           (2, 0, 1)            (2, d, 1)
d                  0.4701 (0.0322)                          0.4279 (0.0390)
AR(1)              0.5075 (0.0313)     1.1951 (0.0123)      0.4384 (0.0302)
AR(2)                                  −0.2035 (0.0119)     −0.0495 (0.0272)
MA(1)              0.7190 (0.0278)     0.9223 (0.0063)      0.5983 (0.0248)
log likelihood     12197.18            12171.80             12206.00
II. Likelihood ratio tests
Model comparison             (1, d, 1) v (2, d, 1)    (2, 0, 1) v (2, d, 1)
Test statistic               17.64 (0.00)             68.39 (0.00)

Note: The estimation in Panel A is based on the logarithm of the daily realized volatility series from July 1, 1987 to December 31, 1998 while that in Panel B is based on the logarithm of the 4-hour realized volatility series for the same period. Sub-panel I shows the estimates of the parameters of three different models using Maximum Likelihood Estimation. The values in parentheses represent the standard errors of the estimates. Sub-panel II shows the Likelihood Ratio Test Statistics when comparing two models. The values in parentheses are the levels of significance of the test statistics.
Figure 6. Periodogram of the 4-hour log realized volatility for the USD/GBP exchange rate for the period from January 1, 1993 to December 31, 1994.
ARFIMA(2, d, 1), the value of d equals 0.428 while φ1, φ2 and θ1 equal 0.438, −0.050 and 0.598, respectively. Table 4B also shows the log-likelihood values for the different models which are used to perform the likelihood-ratio tests. When we compare ARFIMA(1, d, 1) and (2, d, 1), the LR test results in the lower panel of Table 4B show that the ARFIMA(1, d, 1) hypothesis is rejected at significance levels below 1%. When ARMA(2,1) and ARFIMA(2, d, 1) are compared, it is found that the ARMA(2,1) hypothesis is also rejected at levels below 1%. The statistical results conclude that the ARFIMA(2, d, 1) model is the best of those evaluated for the logarithm of 4-hour realized volatility. The above results only apply to the situation when the number of parameters is small. Models such as ARFIMA(3, d, 2) and ARMA(3,2) are not included in our study because it is difficult to estimate the required number of parameters. Andersen and Bollerslev (1997, 1998) suggest that the log volatility process can be expressed as the sum of many AR(1) processes. Following Granger (1980) they show that under certain assumptions the autocorrelation structure of the log volatility process approaches that of a fractionally integrated process as the number of AR(1) processes becomes large. Consequently the fractionally integrated dynamics can be considered as a reduced form of a high-order multi-factor volatility process. Our empirical results show that the two-factor model is as good statistically as the fractionally integrated model when the history of the volatility process is relatively short. However, as the number of observations increases, the long memory representation is better than the two-factor structure in explaining the volatility process, but potentially still inferior to structural component models of yet higher order. Consequently, our results cannot exclude the possibility that the real volatility process is defined by the sum of many AR(1) components. However, the number of components must be more than two.
6. CONCLUSIONS Evidence of long memory in the volatility process has been well documented in many studies for both the equity and foreign exchange markets. However this should more precisely be explained as evidence for long memory effects since there are short memory processes that can have similar abilities to capture and mimic the persistent nature of asset price volatility (the sum of two AR(1) processes is an example). In this paper, we show that it is easy to claim that a series has a long memory by performing popular statistical tests for detecting long memory but without considering short memory alternative specifications. However, it is not always possible to identify the correct memory specification when tests are evaluated for both types. Option traders, in particular, must therefore often face an important degree of uncertainty about the correct specification of volatility processes. We find that the length of the financial series plays an important role in deciding if long or short memory is present. Through a Monte Carlo study, we examine if the long or short memory of a simulated series can be revealed statistically. Our results first show that it is indeed easy to mistake a long memory process as a short memory mimicking process and vice versa. We propose a test, which is composed of two likelihood ratio tests, to identify the correct specification using simulated series generated from ARFIMA(0, d, 0) and ARMA(2,1) models with the parameters appropriately chosen. By generating simulated series of different length, we document the effect of series length on the ability to identify the correct type of memory. We find that it is often not possible to identify the correct type of memory when the simulated series is not long enough (for example 2000 observations). However, the chance of such a mistake can be reduced by including more observations. For example, considering a persistence level equivalent to d equaling 0.4 and when the simulated series contains 12,000 observations, there is a 95% chance of identifying the correct nature of memory when the tests employ the 5% significance level. Our Monte Carlo results are useful in identifying the correct type of memory in GBP/USD exchange rate volatility. Using approximately 3000 observations of daily realized volatility, we cannot decide if a long or short memory exists in the volatility series. However, the Monte Carlo results show that by lengthening the series there is a good chance to identify the correct specification. This is done by increasing the number of observations by increasing the volatility sampling frequency from daily to four hours. It is then found that a long memory process is preferable to the sum of two AR processes in capturing the volatility dynamics although this could still be a parsimonious treatment of a yet higher-order, factor AR model.
ACKNOWLEDGEMENTS The authors thank Stephane Gregoir (the editor), two referees, Nick Taylor and Granville Tunnicliffe Wilson for helpful comments and suggestions, and Jinsha Zhao for assistance with LaTeX.
REFERENCES
Alizadeh, S., M. W. Brandt and F. X. Diebold (2002). Range-based estimation of stochastic volatility models. Journal of Finance 57, 1047–91.
Andersen, T. G. and T. Bollerslev (1997). Heterogeneous information arrivals and return volatility dynamics: uncovering the long-run in high frequency returns. Journal of Finance 52, 975–1005.
Andersen, T. G. and T. Bollerslev (1998). Deutsche Mark-Dollar volatility: intraday activity patterns, macroeconomic announcements, and longer run dependencies. Journal of Finance 53, 219–65.
Andersen, T. G., T. Bollerslev, F. X. Diebold and H. Ebens (2001a). The distribution of realized stock return volatility. Journal of Financial Economics 61, 43–76.
Andersen, T. G., T. Bollerslev, F. X. Diebold and P. Labys (2001b). The distribution of realized exchange rate volatility. Journal of the American Statistical Association 96, 42–55.
Areal, N. M. P. C. and S. J. Taylor (2002). The realized volatility of FTSE-100 futures prices. Journal of Futures Markets 22, 627–48.
Baillie, R. T. (1996). Long memory processes and fractional integration in econometrics. Journal of Econometrics 73, 5–59.
Baillie, R. T., T. Bollerslev and H. O. Mikkelsen (1996). Fractionally integrated generalized autoregressive conditional heteroskedasticity. Journal of Econometrics 74, 3–30.
Barndorff-Nielsen, O. E. and N. Shephard (2001). Non-Gaussian Ornstein-Uhlenbeck based models and some of their uses in financial economics. Journal of the Royal Statistical Society B 63, 167–241.
Beran, J. (1994). Statistics for Long-Memory Processes. New York: Chapman and Hall.
Bollerslev, T. (1986). Generalized autoregressive conditional heteroscedasticity. Journal of Econometrics 31, 307–27.
Bollerslev, T. and I. Domowitz (1993). Trading patterns and prices in the interbank foreign exchange market. Journal of Finance 48, 1421–43.
Bollerslev, T. and H. O. Mikkelsen (1996). Modeling and pricing long memory in stock market volatility. Journal of Econometrics 73, 151–84.
Breidt, F. J., N. Crato and P. de Lima (1998). The detection and estimation of long memory in stochastic volatility. Journal of Econometrics 83, 325–48.
Brockwell, P. J. and R. A. Davis (1991). Time Series: Theory and Methods (2nd ed.). New York: Springer.
Cheung, Y. W. and F. X. Diebold (1994). On maximum likelihood estimation of the differencing parameter of fractionally-integrated noise with unknown mean. Journal of Econometrics 62, 301–16.
Chung, C.-F. (1994). A note on calculating the autocovariances of fractionally integrated ARMA models. Economics Letters 45, 293–97.
Davies, R. B. and D. S. Harte (1987). Tests for Hurst effect. Biometrika 74, 95–102.
Deo, R., C. Hurvich and Y. Lu (2006). Forecasting realized volatility using a long-memory stochastic volatility model: estimation, prediction and seasonal adjustment. Journal of Econometrics 131, 29–58.
Ding, Z., C. W. J. Granger and R. F. Engle (1993). A long memory property of stock market returns and a new model. Journal of Empirical Finance 1, 83–106.
Duffie, D., J. Pan and K. J. Singleton (2000). Transform analysis and asset pricing for affine jump-diffusions. Econometrica 68, 1343–76.
Fox, R. and M. S. Taqqu (1986). Large sample properties of parameter estimates for strongly dependent stationary Gaussian time series. Annals of Statistics 14, 517–32.
Gallant, A. R., C.-T. Hsu and G. E. Tauchen (1999). Using daily range data to calibrate volatility diffusions and extract the forward integrated variance. Review of Economics and Statistics 81, 617–31.
Geweke, J. and S. Porter-Hudak (1983). The estimation and application of long memory time series models. Journal of Time Series Analysis 4, 221–38.
Granger, C. W. J. (1980). Long memory relationships and the aggregation of dynamic models. Journal of Econometrics 14, 227–38.
Granger, C. W. J. and Z. Ding (1996). Varieties of long memory models. Journal of Econometrics 73, 61–77.
Granger, C. W. J. and R. Joyeux (1980). An introduction to long memory time series models and fractional differencing. Journal of Time Series Analysis 1, 15–39.
Granger, C. W. J. and P. Newbold (1977). Forecasting Economic Time Series. New York: Academic Press.
Hosking, J. R. M. (1981). Fractional differencing. Biometrika 68, 165–76.
McLeod, A. I. and K. W. Hipel (1978). Preservation of the rescaled adjusted range 1. A reassessment of the Hurst phenomenon. Water Resources Research 14, 491–508.
Nelson, D. B. (1991). Conditional heteroskedasticity in asset returns: a new approach. Econometrica 59, 347–70.
Ohanissian, A., J. R. Russell and R. S. Tsay (2008). True or spurious long memory? A new test. Journal of Business and Economic Statistics 26, 161–75.
Pong, S. Y., M. B. Shackleton, S. J. Taylor and X. Xu (2004). Forecasting currency volatility: a comparison of implied volatilities and AR(FI)MA models. Journal of Banking and Finance 28, 2541–63.
Robinson, P. M. (1994). Semiparametric analysis of long-memory time series. Annals of Statistics 22, 515–39.
Shackleton, M. B. (1998). Frequency domain and stochastic control theory applied to volatility and pricing in intraday financial data. Ph.D. thesis, London Business School, University of London.
Smith, J., N. Taylor and S. Yadav (1997). Comparing the bias and misspecification in ARFIMA models. Journal of Time Series Analysis 18, 507–27.
Sowell, F. (1992). Maximum likelihood estimation of stationary univariate fractionally integrated time series models. Journal of Econometrics 53, 165–88.
Taylor, S. J. (1986). Modelling Financial Time Series. Chichester, UK: John Wiley.
Taylor, S. J. (2005). Asset Price Dynamics, Volatility, and Prediction. Princeton: Princeton University Press.
Whittle, P. (1951). Hypothesis Testing in Time Series Analysis. Uppsala: Almqvist and Wiksells.
Econometrics Journal (2008), volume 11, pp. 638–647. doi: 10.1111/j.1368-423X.2008.00257.x
Critical values for linearity tests in time-varying smooth transition autoregressive models when data are highly persistent
RICKARD SANDBERG†
†Department of Economic Statistics, Stockholm School of Economics, P. O. Box 6501, SE-113 83 Stockholm, Sweden
E-mail: [email protected]
First version received: October 2006; final version accepted: April 2008
Summary In this paper, we derive asymptotic distributions for linearity tests in time-varying smooth transition autoregressive models in the presence of a unit root. The limiting distributions are non-standard because of the unit root assumption, and it is shown that the linearity hypothesis is rejected far too often (up to 30.9% of the time at a 5% significance level) when critical values from a chi-square distribution are used.
Keywords: Linearity tests, Random walk, Smooth transition models, Wald test.
1. INTRODUCTION
Testing the linearity hypothesis in non-linear models is exceedingly important and has received considerable attention in the economic and econometric literature over the last two decades or so. For instance, Luukkonen et al. (1988), Granger and Teräsvirta (1993), Teräsvirta (1994), Lin and Teräsvirta (1994), Jansen and Teräsvirta (1996), van Dijk and Franses (1999), van Dijk et al. (2002) and Lundbergh et al. (2003) provide a framework for linearity tests in various smooth transition autoregressive (STAR) models which is based on Taylor expansions and conducted by simple OLS regressions. These tests have been widely applied to financial and macroeconomic data, and evidence of STAR non-linearities is frequently found. A maintained assumption in the above literature is that stationarity holds under the null hypothesis of linearity, and as a result the applied linearity tests have standard F or chi-square limiting distributions. However, many financial and macroeconomic time series appear to be very persistent, with near unit root or unit root behaviour (see e.g. Stock and Watson, 1999, who report that a random walk is rejected in only 13.5% of the cases for 215 U.S. macroeconomic and financial time series of monthly data ranging from 1959:1 to 1996:12), and we should therefore take great care when applying the linearity tests without first verifying the stationarity assumption. If the stationarity assumption is not fulfilled, the standard asymptotic distributions and critical values for the linearity tests are no longer valid, which can lead to spurious inference and conclusions.
Testing linearity under the unit root assumption in non-linear models has, to the best of our knowledge, only been studied by Kiliç (2004). In more detail, he assumes that the data generating process (DGP) is a random walk when testing the null hypothesis of linearity in logistic STAR (LSTAR) and exponential STAR (ESTAR) models with a lagged dependent variable as transition variable.
It is shown that the null hypothesis of linearity is rejected in up to 23% of the cases at a 5% significance level when critical values from a chi-square distribution are used erroneously. Accordingly, non-standard limiting distributions and critical values are presented for his linearity tests in the case of a unit root.
Inspired by Kiliç (2004), we study the consequences of applying linearity tests when the DGP is a random walk in LSTAR and ESTAR models where the transition variable is time, and also in a time-varying STAR (TV-STAR) model where both time and a lagged dependent variable are employed as transition variables. The LSTAR and ESTAR models with time as the transition variable are typically used to model structural instability in, for example, industrial production, income and unemployment rate time series; see Lin and Teräsvirta (1994), Jansen and Teräsvirta (1996) and Skalin and Teräsvirta (2002), amongst others, for examples. Applications of the TV-STAR model include, for instance, Lundbergh et al. (2003), where structural instability as well as non-linearities are considered for 215 U.S. macroeconomic and financial time series (the same data set as in Stock and Watson, 1999). Because of the unit root assumption, the limiting distributions of the linearity tests derived in this paper are non-standard and are functions of Brownian motions on the unit interval. Even though the limiting distributions in Kiliç (2004) are also non-standard, they are fundamentally different from those in this paper for the simple reason that the tests derived in Kiliç (2004) stem from STAR models with an endogenous transition variable. Furthermore, for comparison and to gain a better understanding of testing linearity in non-linear models when the DGP is a random walk, we also report the simulation results of Kiliç (2004).
The rest of the paper is organized as follows. In Section 2, we define the non-linear models and derive the asymptotic distributions of the linearity tests. Simulation studies yielding percentiles of the limiting distributions are conducted in Section 3. Section 4 contains a conclusion and a discussion of related issues. Finally, the proofs are found in the Appendix. A few words on the notation in this paper: $\Rightarrow$ signifies weak convergence with respect to the Skorohod metric (as defined in Billingsley, 1968), $\xrightarrow{d}$ and $\xrightarrow{p}$ denote convergence in distribution and in probability, respectively, and $W(r)$ denotes a standard Wiener process on the unit interval, with the short-hand notation $W$.
2. DISTRIBUTION OF WALD-TYPE LINEARITY TESTS IN THE PRESENCE OF A UNIT ROOT
In this section, the asymptotic distributions of Wald test statistics are derived when testing linearity in the STAR models in the presence of a unit root. A first-order LSTAR model with time as the transition variable is defined as (see e.g. Lin and Teräsvirta, 1994)

$$y_t = \phi_1 y_{t-1} [1 - F(t)] + \phi_2 y_{t-1} F(t) + \varepsilon_t, \quad t = 1, \dots, T, \qquad (2.1)$$

where $(\phi_1, \phi_2) \in \mathbb{R}^2$, $\varepsilon_t$ is a disturbance term, the properties of which are discussed below, and $F(t)$ is a smooth transition function given by

$$F(t; \gamma, a) = \left[1 + \exp\{-\gamma (t - a)\}\right]^{-1}, \quad \gamma \in \mathbb{R}^+, \; a \in [0, T]. \qquad (2.2)$$

The function in (2.2) is a logistic function and implies a continuum of regimes for the dynamic root in (2.1), and because $F(t)$ is a monotone function in $t$ with range $[0, 1]$, the dynamic root changes monotonically from $\phi_1$ to $\phi_2$ (approximately) as time passes.
The changes are symmetric around the location parameter $a$, and the speed of transition between regimes is determined by the parameter $\gamma$: a low (high) value of $\gamma$ implies a slow (fast) transition between regimes. In particular, letting $\gamma \to 0$ implies that $F = 0.5$, and the resultant model is the linear AR(1) model $y_t = \rho_1 y_{t-1} + \varepsilon_t$, where $\rho_1 = (\phi_1 + \phi_2)/2$.
A first-order ESTAR model with exogenously driven non-linearities can be defined as in (2.1), but with the transition function (2.2) replaced by (see e.g. Jansen and Teräsvirta, 1996)

$$F(t; \gamma, a) = 1 - \exp\{-\gamma (t - a)^2\}, \quad \gamma \in \mathbb{R}^+, \; a \in [0, T]. \qquad (2.3)$$
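A brief numerical illustration of the two transition functions may be helpful; the parameter values below are illustrative only and are not taken from the paper. A small γ produces a slow transition and a large γ an abrupt one.

```python
import numpy as np

def F_logistic(t, gamma, a):
    # Logistic transition function (2.2): monotone in t with range (0, 1).
    return 1.0 / (1.0 + np.exp(-gamma * (t - a)))

def F_exponential(t, gamma, a):
    # Exponential transition function (2.3): u-shaped in t, centred at a.
    return 1.0 - np.exp(-gamma * (t - a) ** 2)

T = 200
t = np.arange(1, T + 1)
a = T / 2
for gamma in (0.01, 0.05, 0.5):  # slow versus fast transitions
    print(gamma,
          np.round(F_logistic(t, gamma, a)[[0, 99, 199]], 3),
          np.round(F_exponential(t, gamma, a)[[0, 99, 199]], 3))
```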
The function in (2.3) also implies a continuum of regimes for the autoregressive parameter, but it differs from the specification in (2.2) in that it is symmetrically u-shaped in $t$ with its centre at the location parameter $a$. The parameter $\gamma$ has the same interpretation as in (2.2). With (2.3) in (2.1), the dynamic root changes gradually from $\phi_2$ to $\phi_1$ (at $t = a$) and thereafter back to $\phi_2$ as time passes. Furthermore, letting $\gamma \to 0$ implies that $F = 0$, and the ESTAR model reduces to the linear AR(1) process $y_t = \rho_2 y_{t-1} + \varepsilon_t$, where $\rho_2 = \phi_1$.
Another commonly used non-linear model is the first-order TV-STAR model, which employs both an endogenous and an exogenous transition variable and can be defined as (see e.g. Lundbergh et al., 2003)

$$y_t = h_1(t) y_{t-1} [1 - G(y_{t-1})] + h_2(t) y_{t-1} G(y_{t-1}) + \varepsilon_t, \quad t = 1, \dots, T, \qquad (2.4)$$

where $h_1(t) = \varphi_1 [1 - F(t)] + \varphi_3 F(t)$ and $h_2(t) = \varphi_2 [1 - F(t)] + \varphi_4 F(t)$, $(\varphi_1, \dots, \varphi_4) \in \mathbb{R}^4$, and $\varepsilon_t$ and $F(t)$ are defined as in (2.1) and (2.2). Furthermore, $G(y_{t-1})$ is another logistic smooth transition function given by

$$G(y_{t-1}; \vartheta, c) = \left[1 + \exp\{-\vartheta (y_{t-1} - c)\}\right]^{-1}, \quad \vartheta \in \mathbb{R}^+, \; c \in \mathbb{R}. \qquad (2.5)$$

The function in (2.5) is the same type of transition function as in (2.2), but it is endowed with an endogenous argument. The TV-STAR model is preferably interpreted as describing $y_t$ as a STAR model, with transition variable $y_{t-1}$, at all times. Specifically, for any fixed $t = t_0$, (2.4) accommodates a continuum of regimes for the dynamic root, and since $G(y_{t-1})$ is bounded between 0 and 1 and monotone in $y_{t-1}$, it follows that the dynamic root increases from $\varphi_1 + F(t_0)(\varphi_3 - \varphi_1)$ to $\varphi_2 + F(t_0)(\varphi_4 - \varphi_2)$ with $y_{t-1}$. The smooth changes are symmetric around the location parameter $c$. Moreover, a linear model emerges as a special case of the TV-STAR model by letting both $\gamma \to 0$ and $\vartheta \to 0$, which implies that $F = G = 0.5$, and the AR(1) model $y_t = \rho_3 y_{t-1} + \varepsilon_t$, where $\rho_3 = (\varphi_1 + \varphi_2 + \varphi_3 + \varphi_4)/4$, is obtained.
As indicated above, linearity in the LSTAR, ESTAR and TV-STAR models can be tested via the hypotheses $H_{01}: \gamma = 0$, $H_{02}: \gamma = 0$ and $H_{03}: \gamma = \vartheta = 0$, respectively (indices 1, 2 and 3 will hereafter refer to the LSTAR, ESTAR and TV-STAR model, respectively). These models are identified only under the alternative hypothesis because the parameter vectors $(\phi_1, \phi_2)$ and $(\varphi_1, \dots, \varphi_4)$ are unidentified under $H_{01}$ and $H_{02}$, and $H_{03}$, respectively. We circumvent this problem by replacing the transition functions with their first-order Taylor expansions around $\gamma = 0$ and $\vartheta = 0$; see Luukkonen et al. (1988) for details. Applying the corresponding first-order approximations to the models above yields the regression equations

$$y_t = \beta_i' x_{it} + \varepsilon_{it}^*, \quad i = 1, 2, 3, \qquad (2.6)$$

where $\beta_1 = (\beta_{11}, \beta_{12})'$, $\beta_2 = (\beta_{21}, \beta_{22}, \beta_{23})'$, $\beta_3 = (\beta_{31}, \dots, \beta_{34})'$, $x_{1t} = (y_{t-1}, t y_{t-1})'$, $x_{2t} = (y_{t-1}, t y_{t-1}, t^2 y_{t-1})'$ and $x_{3t} = (y_{t-1}, t y_{t-1}, y_{t-1}^2, t y_{t-1}^2)'$. Furthermore, $\varepsilon_{1t}^* = \varepsilon_t + r_1(\gamma)$, $\varepsilon_{2t}^* = \varepsilon_t + r_2(\gamma)$ and $\varepsilon_{3t}^* = \varepsilon_t + r_3(\gamma, \vartheta)$ are error terms adjusted with respect to the Taylor expansion, where $r_1(\gamma)$, $r_2(\gamma)$ and $r_3(\gamma, \vartheta)$ are remainder terms such that $r_1(0) = r_2(0) = r_3(0, 0) = 0$ holds.
In (2.6) we now notice that the relationships $\beta_{11} = \rho_1$, $\beta_{21} = \rho_2$, $\beta_{31} = \rho_3$ and $\beta_{ij} = 0$ for all $i$ and $j > 1$ hold if the null hypothesis $H_{0i}$ holds. The originally stated null hypotheses of linearity in the LSTAR, ESTAR and TV-STAR models are therefore transformed, because of the Taylor approximations, to $H_{01}^{aux}: \beta_{12} = 0$, $H_{02}^{aux}: \beta_{22} = \beta_{23} = 0$ and $H_{03}^{aux}: \beta_{32} = \beta_{33} = \beta_{34} = 0$, respectively, which can be expressed as

$$H_{0i}^{aux}: R_i \beta_i = r_i, \quad i = 1, 2, 3, \qquad (2.7)$$

where $R_1 = [0 \;\; 1]$, $r_1 = 0$, $R_2 = [0_{2 \times 1} \;\; I_2]$, $r_2 = 0_{2 \times 1}$, $R_3 = [0_{3 \times 1} \;\; I_3]$ and $r_3 = 0_{3 \times 1}$, with $I_n$ denoting an $n \times n$ identity matrix. The alternative hypotheses simply state that (2.7) does not hold. The linearity hypotheses in (2.7) will be tested by the following Wald-type test statistic:

$$W_i = [R_i(\hat{\beta}_i - \beta_i)]' \times \Big[ S_i^2 R_i \Big( \sum_{t=1}^{T} x_{it} x_{it}' \Big)^{-1} R_i' \Big]^{-1} \times [R_i(\hat{\beta}_i - \beta_i)], \quad i = 1, 2, 3, \qquad (2.8)$$

where $\hat{\beta}_i$ are the OLS estimators of $\beta_i$ and $S_i^2 = \sum_{t=1}^{T} (\hat{\varepsilon}_{it}^*)^2 / (T - (i + 1))$ is the OLS estimator of the error variance, with $\hat{\varepsilon}_{it}^* = y_t - \hat{\beta}_i' x_{it}$.
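For concreteness, the following is a minimal sketch (written for this exposition, not code from the paper) of how $W_i$ can be computed for a given series via the auxiliary regressions (2.6) and the restrictions (2.7).

```python
import numpy as np

def wald_stat(y, model):
    """Minimal sketch of the Wald statistic in (2.8).

    model = 1 (LSTAR), 2 (ESTAR) or 3 (TV-STAR); the auxiliary regressors x_it
    and the restriction matrices R_i follow (2.6) and (2.7), with r_i = 0."""
    y = np.asarray(y, dtype=float)
    T = len(y) - 1
    t = np.arange(1.0, T + 1)
    ylag = y[:-1]
    if model == 1:
        X = np.column_stack([ylag, t * ylag])
    elif model == 2:
        X = np.column_stack([ylag, t * ylag, t**2 * ylag])
    else:
        X = np.column_stack([ylag, t * ylag, ylag**2, t * ylag**2])
    R = np.hstack([np.zeros((model, 1)), np.eye(model)])  # R_i = [0  I_i]
    b, *_ = np.linalg.lstsq(X, y[1:], rcond=None)          # OLS estimate of beta_i
    e = y[1:] - X @ b
    s2 = e @ e / (T - (model + 1))                         # S_i^2
    Rb = R @ b                                             # R_i beta_hat - r_i, with r_i = 0
    mid = np.linalg.inv(s2 * R @ np.linalg.inv(X.T @ X) @ R.T)
    return float(Rb @ mid @ Rb)

# Example: a pure random walk satisfies the null of linearity with a unit root.
rng = np.random.default_rng(1)
y = np.cumsum(rng.standard_normal(501))
print([round(wald_stat(y, m), 3) for m in (1, 2, 3)])
```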
If stationarity is assumed to hold under $H_{0i}$, i.e. if $\rho_i \in (-1, 1)$ holds, then $\beta_{i1} \in (-1, 1)$ holds under $H_{0i}^{aux}$. As a result, the limiting distribution of $W_i$ under $H_{0i}^{aux}$ is (under some suitable conditions on $\varepsilon_t$) $W_1 \xrightarrow{d} \chi^2(1)$, $W_2 \xrightarrow{d} \chi^2(2)$ and $W_3 \xrightarrow{d} \chi^2(3)$ in the LSTAR, ESTAR and TV-STAR case, respectively. However, as argued above, many economic time series are highly persistent with close to unit root or unit root behaviour, and instead assuming that the DGP is a random walk under the null hypothesis, i.e. assuming that $\rho_i = 1$ holds under $H_{0i}$, implies that $\beta_{i1} = 1$ holds under $H_{0i}^{aux}$, and the resultant limiting distributions of the Wald tests are non-standard. These asymptotic distributions are given in Theorem 2.1 below under the following assumption on the error term $\varepsilon_t$.

ASSUMPTION 2.1. Let $(\varepsilon_t)$ be a sequence of independent and identically distributed (i.i.d.) random variables defined on the probability triple $(\Omega, \mathcal{F}, P)$ such that $E\varepsilon_t = 0$ and $E\varepsilon_t^2 = \sigma^2$ hold. In addition, assume that $E|\varepsilon_t|^{4+\delta} < \infty$ for some $\delta > 0$.²

THEOREM 2.1. If $y_t = y_{t-1} + \varepsilon_t$ with $y_0 = 0$ and if Assumption 2.1 holds, then

$$W_i \Rightarrow Q_i' P_i^{-1} Q_i, \quad i = 1, 2, 3, \qquad (2.9)$$

where $Q_i$ and $P_i$ are matrix functions of $W$ and $\sigma^2$ with $\dim Q_i = i \times 1$ and $\dim P_i = i \times i$; they are defined in the Appendix.
² The i.i.d. assumption on the error terms is merely imposed as a matter of convenience to prove the results in Theorem 2.1, and also to allow a straightforward comparison with the results in Kiliç (2004). For instance, the i.i.d. assumption can be relaxed so that the weak convergence results in Theorem 2.1 still hold when the error term accommodates generalized autoregressive conditional heteroscedasticity (GARCH); see e.g. Ling et al. (2003) for related results on weak convergence to stochastic integrals for unit root processes with GARCH innovations. It is also possible to derive results similar to those in Theorem 2.1 when the error terms are weakly dependent, as in the case of linear stationary autoregressive or moving average processes, but then nuisance parameters (due to the induced serial correlation) enter the asymptotic distributions; see Hansen (1992, Theorems 4.1 and 4.2) and Sandberg (2006b, Theorem 1) for details on convergence to stochastic integrals for dependent heterogeneous processes.
Even though the matrices $Q_i$ and $P_i$ in Theorem 2.1 depend upon the nuisance parameter $\sigma^2$, we notice that the product $Q_i' P_i^{-1} Q_i$ is nuisance parameter free. In addition, the asymptotic null distribution in the case of $W_1$ takes a particularly simple form, namely
$$W_1 \Rightarrow \frac{\left[ \int_0^1 W^2\,dr \left( W(1)^2 - \int_0^1 W^2\,dr - 1/2 \right) - \int_0^1 r W^2\,dr \left( W(1)^2 - 1 \right) \right]^2}{4 \int_0^1 W^2\,dr \left[ \int_0^1 W^2\,dr \int_0^1 r^2 W^2\,dr - \left( \int_0^1 r W^2\,dr \right)^2 \right]}. \qquad (2.10)$$
Here, the absence of nuisance parameters is manifest, and it is also clear that $W_1 \nRightarrow \chi^2(1)$. The derivation of complete expressions for the asymptotic distributions corresponding to $W_2$ and $W_3$ is also straightforward but yields rather lengthy results and is therefore not reported here.³
It is evident that the Taylor series approximations result in linearity tests that are very simple to conduct and that give rise to the tractable, nuisance parameter free asymptotic null distributions in Theorem 2.1. However, this simplicity comes at a price: the linearization of the STAR models implies that information about the non-linear structure under the alternative is lost, which may affect the power of the tests adversely. As such, on the assumption of stationarity under the null hypothesis, there exist alternative methods which are often more powerful (but also often very computationally demanding) when testing for STAR-type non-linearities. For instance, the optimal tests by Andrews (1993) and Andrews and Ploberger (1994) might be considered if we use the simulation-based technique developed by Hansen (1996) to obtain critical values. Notably, tests for STAR non-linearities based on Taylor series approximations have also been shown not to be consistent; see Hill (2005) for a discussion. In the present case the setting is somewhat different because we consider a random walk under the null hypothesis, but the inconsistency problem carries over to our framework as well. To remedy this problem, and as a referee pointed out, it seems possible to derive a consistent conditional moment type of linearity test in the presence of a unit root in STAR models in the spirit of, e.g., Bierens (1990), de Jong (1996), Bierens and Ploberger (1997), Stinchcombe and White (1998) and Hill (2005), with bootstrapped p-values in the tradition of Hansen (1996). This is a topic left for further research.
3. SIMULATIONS
Asymptotic percentiles for the distributions of the $W_1$, $W_2$ and $W_3$ tests, in the presence of a unit root with increments $\varepsilon_t \sim \mathrm{nid}(0, 1)$, are obtained by simulation. We let $T = 100{,}000$, and the number of replications was set to 1,000,000. The asymptotic percentiles are presented in Table 1 together with the percentiles of the $\chi^2(1)$, $\chi^2(2)$ and $\chi^2(3)$ distributions. To enable a comparison with the study in Kiliç (2004), we also report the percentiles for the linearity tests in LSTAR and ESTAR models when the transition variable is a lagged dependent variable; these tests are denoted $LM_L$ and $LM_E$, respectively.
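A scaled-down illustration of this simulation design (far smaller than the paper's settings of T = 100,000 and 1,000,000 replications, so the percentiles are only rough) generates random walks, computes W_1 for each draw, and reads off empirical quantiles.

```python
import numpy as np

rng = np.random.default_rng(2024)

def w1_stat(y):
    # Wald statistic (2.8) for i = 1: regress y_t on (y_{t-1}, t*y_{t-1}) and test beta_12 = 0.
    T = len(y) - 1
    t = np.arange(1.0, T + 1)
    X = np.column_stack([y[:-1], t * y[:-1]])
    b, *_ = np.linalg.lstsq(X, y[1:], rcond=None)
    e = y[1:] - X @ b
    s2 = e @ e / (T - 2)
    return b[1] ** 2 / (s2 * np.linalg.inv(X.T @ X)[1, 1])

T, reps = 1000, 5000  # deliberately small; the paper uses far larger values
draws = np.array([w1_stat(np.cumsum(rng.standard_normal(T + 1))) for _ in range(reps)])
print(np.percentile(draws, [90, 95, 97.5, 99]))  # compare with the W_1 column of Table 1
```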
³ Complete expressions for these distributions can be obtained upon request from the author.
Table 1. Percentiles of the χ²(1), χ²(2) and χ²(3) distributions and of the limiting distributions of the linearity tests W_1, W_2 and W_3.

Percentile   LM_L     W_1     χ²(1)    LM_E     W_2     χ²(2)     W_3     χ²(3)
0.50          0.000   0.000   0.000    0.091    0.013   0.010     0.107   0.072
1.00          0.000   0.000   0.000    0.172    0.026   0.020     0.181   0.115
2.50          0.001   0.001   0.001    0.367    0.059   0.051     0.365   0.216
5.00          0.006   0.005   0.004    0.616    0.123   0.103     0.637   0.352
10.0          0.024   0.019   0.016    1.023    0.261   0.211     1.207   0.584
25.0          0.164   0.112   0.101    2.000    0.684   0.575     2.900   1.212
50.0          0.837   0.533   0.455    3.597    1.617   1.390     5.545   2.366
75.0          2.750   1.512   1.323    5.750    3.168   2.770     8.605   4.108
90.0          5.185   3.027   2.706    8.191    5.038   4.605    11.806   6.251
95.0          6.879   4.278   3.841   10.008    6.607   5.991    13.985   7.815
97.5          8.493   5.513   5.024   11.701    8.186   7.378    16.025   9.348
99.0         10.562   7.308   6.635   13.750    9.999   9.210    18.455  11.345
99.5         11.957   8.514   7.880   15.415   11.326  10.597    20.474  12.838

Notes: The results are based on T = 100,000 and 1,000,000 replications. LM_L and LM_E denote the linearity tests in Kiliç (2004) based on LSTAR and ESTAR models with a lagged dependent variable as transition variable. W_1, W_2 and W_3 abbreviate the linearity tests based on the LSTAR, ESTAR and TV-STAR models in this paper.
Moreover, we are also interested in the actual rejection frequencies of a true null hypothesis of linearity in the presence of a unit root when the critical values from a chi-square distribution are (wrongly) used. These rejection frequencies are reported in Table 2 and are abbreviated $pW_1$, $pW_2$ and $pW_3$ for the tests in this paper, and $pLM_L$ and $pLM_E$ for the tests in Kiliç (2004).
From Table 1 it is clear that the limiting distributions of the Wald test statistics in this paper (assuming a unit root) have thicker right tails than the chi-square distributions (the case of stationarity). The discrepancies in percentiles between the distribution of $W_3$ and $\chi^2(3)$ are particularly alarming. It is evident that erroneously assuming stationarity when testing linearity in the presence of a unit root means that incorrect critical values are adopted for all tests. To take an example, if the significance level of the linearity test is 5%, then one would use the critical values 3.841, 5.991 and 7.815 to test linearity in the LSTAR, ESTAR and TV-STAR models, respectively, when in fact the values 4.278, 6.607 and 13.985 should be used. Moving on to the comparison with the tests in Kiliç (2004), Table 1 shows that using a lagged dependent variable rather than time as the transition variable when testing linearity in LSTAR and ESTAR models yields asymptotic distributions with even more probability mass in the right tail. This shows that linearity tests involving a lagged dependent variable as transition variable are more sensitive to a violation of the stationarity assumption.
In Table 2, the consequences of erroneously assuming stationarity when testing linearity are demonstrated, and actual rejection frequencies when testing linearity in the presence of a unit root are presented. For instance, at a 5% significance level, the $W_1$, $W_2$ and $W_3$ tests reject the null hypothesis of linearity in 6.9, 7.5 and 30.9% of the cases, whereas the tests $LM_L$ and $LM_E$ in Kiliç (2004) reject the null hypothesis in 17.1 and 23.0% of the cases.
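In applied work the practical implication is simply to replace the chi-square critical value with the simulated one. A small illustration follows; the 5% critical values are read from Table 1, and the decision rule itself is a hypothetical convenience wrapper, not part of the paper.

```python
# 5% critical values under a unit root (from Table 1) versus the chi-square values
# that would be used under the (erroneous) stationarity assumption.
UNIT_ROOT_CV_5PCT = {"W1": 4.278, "W2": 6.607, "W3": 13.985}
CHI_SQUARE_CV_5PCT = {"W1": 3.841, "W2": 5.991, "W3": 7.815}

def reject_linearity(stat, test, unit_root=True):
    cv = (UNIT_ROOT_CV_5PCT if unit_root else CHI_SQUARE_CV_5PCT)[test]
    return stat > cv

# A W_3 statistic of, say, 9.0 rejects with the chi-square value but not with the
# unit root critical value.
print(reject_linearity(9.0, "W3", unit_root=False), reject_linearity(9.0, "W3"))
```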
Table 2. Rejection frequencies of a true null hypothesis of linearity in the presence of a unit root.

Percentile   pLM_L   pLM_E    pW_1    pW_2    pW_3
0.50         0.997   1.000    0.995   0.998   0.998
1.00         0.996   0.999    0.991   0.993   0.996
2.50         0.979   0.997    0.978   0.981   0.988
5.00         0.961   0.996    0.955   0.957   0.978
10.0         0.921   0.989    0.910   0.917   0.959
25.0         0.800   0.957    0.771   0.785   0.903
50.0         0.610   0.851    0.540   0.564   0.799
75.0         0.416   0.633    0.292   0.308   0.640
90.0         0.255   0.373    0.128   0.120   0.438
95.0         0.171   0.230    0.069   0.075   0.309
97.5         0.109   0.138    0.037   0.040   0.211
99.0         0.056   0.068    0.016   0.017   0.118
99.5         0.034   0.041    0.009   0.001   0.077

Notes: The results are based on T = 100,000 and 1,000,000 replications. pLM_L and pLM_E denote the rejection frequencies for the linearity tests in Kiliç (2004) based on LSTAR and ESTAR models with a lagged dependent variable as transition variable. pW_1, pW_2 and pW_3 signify the rejection frequencies for the linearity tests based on the LSTAR, ESTAR and TV-STAR models in this paper.
4. CONCLUSIONS AND DISCUSSIONS
Bearing in mind that many economic time series seem to be highly persistent, we derive asymptotic distributions for linearity tests in the presence of a unit root in LSTAR, ESTAR and TV-STAR models utilizing time and a lagged dependent variable as transition variables. These distributions are non-standard, and simulated critical values are therefore reported. We conclude that the null hypothesis of linearity in the presence of a unit root is rejected far too often if critical values from a standard chi-square distribution are used. In fact, the rejection rate is as high as 30.9% at a 5% significance level in the TV-STAR case. It is also shown that linearity tests in models using an endogenous transition variable rather than time are more sensitive to a violation of the stationarity assumption.
Our simulation results emphasize the importance of verifying the stationarity assumption when testing linearity in STAR-type models. A natural strategy in our case is therefore to consult some of the unit root tests in the presence of STAR non-linearities proposed by Leybourne et al. (1998), Harvey and Mills (2002), Kapetanios et al. (2003), He and Sandberg (2006) and Sandberg (2006a), among others, in order to examine the stationarity assumption. The outcome of such a unit root pre-test will dictate whether critical values from a standard or a non-standard distribution should be used when testing linearity. Another obvious approach is to work with first differences instead of levels in order to ensure stationarity under the null hypothesis, and to conduct a test for STAR non-linearities based on the differences. However, one argument against using first differences is that differencing may weaken or remove (possible) non-linear features in the data. Applying a linearity test based on first differences may therefore fail to reject the null hypothesis of linearity, whereas a test based on levels may yield evidence in favour of non-linearities.
ACKNOWLEDGMENTS I would like to thank one anonymous referee for constructive and helpful remarks that improved this paper. Any remaining errors are mine. This research has been supported by the Jan Wallander's and Tom Hedelius' Foundation, Grant No. W2005–0103:1.
REFERENCES
Andrews, D. W. (1993). Tests for parameter instability and structural change with unknown change point. Econometrica 61, 821–56.
Andrews, D. W. and W. Ploberger (1994). Optimal tests when a nuisance parameter is present only under the alternative. Econometrica 62, 1383–414.
Bierens, H. J. (1990). A consistent conditional moment test of functional form. Econometrica 58, 1443–58.
Bierens, H. J. and W. Ploberger (1997). Asymptotic theory of integrated conditional moment tests. Econometrica 65, 1129–51.
Billingsley, P. (1968). Convergence of Probability Measures. New York: John Wiley.
de Jong, R. (1996). The Bierens test under data dependence. Journal of Econometrics 72, 1–32.
Granger, C. W. J. and T. Teräsvirta (1993). Modelling Nonlinear Economic Relationships. Oxford, UK: Oxford University Press.
Hamilton, J. D. (1994). Time Series Analysis. Princeton, NJ: Princeton University Press.
Hansen, B. E. (1992). Convergence to stochastic integrals for dependent heterogeneous processes. Econometric Theory 8, 489–501.
Hansen, B. E. (1996). Inference when a nuisance parameter is not identified under the null hypothesis. Econometrica 64, 413–30.
Harvey, D. I. and T. C. Mills (2002). Unit roots and double smooth transitions. Journal of Applied Statistics 29, 675–83.
He, C. and R. Sandberg (2006). Dickey-Fuller type of tests against non-linear dynamic models. Oxford Bulletin of Economics and Statistics 68, 835–61.
Hill, J. B. (2005). Consistent and non-degenerate model specification tests against smooth transition alternatives. Working paper, Florida International University.
Jansen, E. S. and T. Teräsvirta (1996). Testing parameter constancy and super exogeneity in econometric equations. Oxford Bulletin of Economics and Statistics 58, 735–63.
Kapetanios, G., Y. Shin and A. Snell (2003). Testing for a unit root in the nonlinear STAR framework. Journal of Econometrics 112, 359–79.
Kiliç, R. (2004). Linearity tests and stationarity. Econometrics Journal 7, 55–62.
Leybourne, S., P. Newbold and D. Vougas (1998). Unit roots and smooth transitions. Journal of Time Series Analysis 19, 83–97.
Lin, C. F. J. and T. Teräsvirta (1994). Testing the constancy of regression parameters against continuous structural change. Journal of Econometrics 62, 211–28.
Ling, S., W. K. Li and M. McAleer (2003). Estimation and testing for unit root processes with GARCH(1,1) errors: theory and Monte Carlo evidence. Econometric Reviews 22, 179–202.
Lundbergh, S., T. Teräsvirta and D. van Dijk (2003). Time-varying smooth transition autoregressive models. Journal of Business and Economic Statistics 21, 104–21.
Luukkonen, R., P. Saikkonen and T. Teräsvirta (1988). Testing linearity against smooth transition autoregressive models. Biometrika 75, 491–99.
Sandberg, R. (2006a). A unified theoretical framework when testing the unit root hypothesis in multiple-regime STAR type of models. SSE/EFI Working Paper Series in Economics and Finance, Stockholm School of Economics.
Sandberg, R. (2006b). Convergence to stochastic power integrals for dependent heterogeneous processes. SSE/EFI Working Paper Series in Economics and Finance, Stockholm School of Economics.
Skalin, J. and T. Teräsvirta (2002). Modeling asymmetries and moving equilibria in unemployment rates. Macroeconomic Dynamics 6, 202–41.
Stinchcombe, M. and H. White (1998). Consistent specification testing with nuisance parameters present only under the alternative. Econometric Theory 14, 295–325.
Stock, J. H. and M. W. Watson (1999). A comparison of linear and nonlinear univariate models for forecasting macroeconomic time series. In R. F. Engle and H. White (Eds.), Cointegration, Causality and Forecasting: A Festschrift in Honour of Clive W. J. Granger. Oxford, UK: Oxford University Press.
Teräsvirta, T. (1994). Specification, estimation and evaluation of smooth transition autoregressive models. Journal of the American Statistical Association 89, 208–18.
van Dijk, D. and P. Franses (1999). Modeling multiple regimes in the business cycle. Macroeconomic Dynamics 3, 311–40.
van Dijk, D., T. Teräsvirta and P. Franses (2002). Smooth transition autoregressive models – a survey of recent developments. Econometric Reviews 21, 1–47.
APPENDIX

Proof of Theorem 2.1: The Wald test statistic in (2.8) can be written as (see e.g. Hamilton, 1994, p. 525)

$$W_i = [R_i \gamma_{iT} (\hat{\beta}_i - \beta_i)]' \times \Big[ S_i^2 R_i \gamma_{iT} \Big( \sum_t x_{it} x_{it}' \Big)^{-1} \gamma_{iT} R_i' \Big]^{-1} \times [R_i \gamma_{iT} (\hat{\beta}_i - \beta_i)], \quad i = 1, 2, 3,$$

where $\gamma_{1T} = \mathrm{diag}(T, T^2)$, $\gamma_{2T} = \mathrm{diag}(T, T^2, T^3)$ and $\gamma_{3T} = \mathrm{diag}(T, T^2, T^{3/2}, T^{5/2})$ are scaling matrices. In order to derive the limiting distribution of $W_i$ we first notice that

$$R_i \gamma_{iT} (\hat{\beta}_i - \beta_i) = R_i \Big( \gamma_{iT}^{-1} \sum_t x_{it} x_{it}' \gamma_{iT}^{-1} \Big)^{-1} \gamma_{iT}^{-1} \sum_t x_{it} \varepsilon_{it}^*.$$

Moreover, since it is assumed that $E|\varepsilon_t|^{4+\delta} < \infty$ for some $\delta > 0$, it follows that we can use the results of Hansen (1992, Theorems 4.1 and 4.2) and Sandberg (2006b, Theorem 1) to obtain the joint weak convergence under $H_{0i}^{aux}$ (noticing that $S_i^2 \xrightarrow{p} \sigma^2$ and $\varepsilon_{it}^* = \varepsilon_t$ hold for all $i$ under $H_{0i}^{aux}$)

$$\Big( R_i \gamma_{iT} (\hat{\beta}_i - \beta_i), \; S_i^2 R_i \gamma_{iT} \Big( \sum_t x_{it} x_{it}' \Big)^{-1} \gamma_{iT} R_i' \Big) \Rightarrow (Q_i, P_i), \quad i = 1, 2, 3,$$
where $Q_i = \sigma R_i \Lambda_i^{-1} \mathbf{Q}_i^{-1} \mathbf{P}_i$ and $P_i = \sigma^2 R_i \Lambda_i^{-1} \mathbf{Q}_i^{-1} \Lambda_i^{-1} R_i'$, with submatrices defined by $\Lambda_1 = \mathrm{diag}(\sigma, \sigma)$, $\Lambda_2 = \mathrm{diag}(\sigma, \sigma, \sigma)$, $\Lambda_3 = \mathrm{diag}(\sigma, \sigma, \sigma^2, \sigma^2)$,

$$\mathbf{Q}_1 = \int_0^1 (W, rW)'(W, rW)\,dr, \quad \mathbf{Q}_2 = \int_0^1 (W, rW, r^2 W)'(W, rW, r^2 W)\,dr, \quad \mathbf{Q}_3 = \int_0^1 (W, rW, W^2, rW^2)'(W, rW, W^2, rW^2)\,dr,$$

and

$$\mathbf{P}_1 = \int_0^1 (W, rW)'\,dW, \quad \mathbf{P}_2 = \int_0^1 (W, rW, r^2 W)'\,dW, \quad \mathbf{P}_3 = \int_0^1 (W, rW, W^2, rW^2)'\,dW.$$

Now, applying the continuous mapping theorem it follows that $W_i \Rightarrow Q_i' P_i^{-1} Q_i$. Furthermore, by straightforward matrix manipulations we notice that $Q_i' P_i^{-1} Q_i = (R_i \mathbf{Q}_i^{-1} \mathbf{P}_i)' \times (R_i \mathbf{Q}_i^{-1} R_i')^{-1} \times (R_i \mathbf{Q}_i^{-1} \mathbf{P}_i)$, so the asymptotic null distribution of $W_i$ is nuisance parameter free. Finally, alternative expressions for the stochastic integrals in $\mathbf{P}_i$, such as those used to derive the result in (2.10), can be obtained from Corollary 2 in Sandberg (2006b).