The Econometrics Journal (2009), volume 12, pp. 1–25. doi: 10.1111/j.1368-423X.2008.00273.x
Identification and estimation of local average derivatives in non-separable models without monotonicity

Stefan Hoderlein† and Enno Mammen‡

†Department of Economics, Brown University, Robinson Hall #302C, Providence, RI 02912, USA. E-mail: [email protected]
‡Department of Economics, University of Mannheim, L 7, 3-5, 68131 Mannheim, Germany. E-mail: [email protected]

First version received: May 2008; final version accepted: October 2008
Summary  In many structural economic models there are no good arguments for additive separability of the error. Recently, this motivated intensive research on non-separable structures. For instance, in Hoderlein and Mammen (2007) a non-separable model in the single equation case was considered, and it was established that in the absence of the frequently employed monotonicity assumption local average structural derivatives (LASD) are still identified. In this paper, we introduce an estimator for the LASD. The estimator we propose is based on local polynomial fitting of conditional quantiles. We derive its large sample distribution through a Bahadur representation, and give some related results, e.g. about the asymptotic behaviour of the quantile process. Moreover, we generalize the concept of LASD to include endogeneity of regressors and discuss the case of a multivariate dependent variable. We also consider identification of structured non-separable models, including single index and additive models. We discuss specification testing, as well as testing for endogeneity and for the impact of unobserved heterogeneity. We also show that fixed censoring can easily be addressed in this framework. Finally, we apply some of the concepts to demand analysis using British consumer data.

Keywords: Non-parametric IV, Non-parametric identification, Non-separable model, Partial identification, Quantile regression, Weak axiom.
1. INTRODUCTION

1.1. The non-separable model

Models without additively separable error terms have become increasingly popular because they are well suited to modelling general economic relationships empirically. The non-separable model takes the form

  Y = φ(X, A),   (1.1)

where Y is a scalar response variable and X is an observable real-valued random d-vector, while A is an unobservable random variable. Often, the relationship between Y and some or all of
the regressors X is of key economic interest, whereas A is meant to capture omitted factors and all types of unobserved heterogeneity. There are instances in which economic theory suggests certain identifying restrictions on the functional dependence of φ on A. For instance, in the famous returns-to-schooling example, where Y denotes log wage and X individual covariates including years of schooling, one may be willing to assume that the only important omitted factor is ability, and that this factor ceteris paribus drives up wages. Therefore, in this example, monotonicity of φ in a scalar A is a reasonable assumption.¹ However, often economists think of A as capturing all kinds of unobservables and are reluctant to place any structure on its influence. A particularly important example is unobserved heterogeneity in preferences or technologies: in the textbook model of consumer choice, rationality places strong restrictions on the compensated price effects for any given utility function, but no restrictions on the impact of changes in utility functions (cf. Mas-Colell et al., 1997). In particular, φ can neither be assumed to be monotonic in utility functions, nor are utility functions scalars—in fact they are elements of an infinite-dimensional function space.²

¹ See Matzkin (2003) for further examples, mainly from the duration literature.
² Of course, as Chesher (2003) points out, there are also purely econometric examples of high-dimensional vectors of unobservables, including certain measurement error and mixture models, as well as models with nested errors.

1.2. Identification of LASD

This paper discusses estimation, application and extensions of the concept of local average structural derivatives (LASD). It makes use of the key identification result in Hoderlein and Mammen (2007), which establishes what can be learned from the quantiles about the marginal effect of one regressor, say x_1, on the dependent variable y in non-separable models of type (1.1) when no monotonicity assumption on the unobservables A is made. For fixed values x* ∈ R^d and 0 < α < 1, Hoderlein and Mammen (2007) show the following relationship between the derivative of the conditional quantile and the marginal effect of the non-separable function φ:

  E[∂_{x1}φ(X, A) | X = x*, Y = k_α(x*)] = ∂_{x1}k_α(x*),   (1.2)

where k_α(x) denotes the conditional α-quantile of Y given X = x, i.e. for 0 < α < 1 the quantity k_α(x) is defined by P(Y ≤ k_α(x) | X = x) = α. Furthermore, ∂_{x1} denotes the partial derivative with respect to the first component of x. The result (1.2) holds under the technical Assumptions A2–A5 (stated in Section A.1.1), and the essential assumption that the random variables A and X_1 are conditionally independent, given X_2, ..., X_d. For the result it is not necessary that A is a scalar. It is allowed to take values in a Borel space A, i.e. a set that is homeomorphic to a Borel subset of the unit interval endowed with the Borel σ-field. This includes the case that A is a random element of a Polish space, e.g. a random piecewise continuous (utility) function. The result states that we can identify an average over the marginal effects ∂_{x1}φ from the data. The derivative of the quantile is the best approximation to the underlying marginal effect, given all our information. To give an example, suppose we were given data on the expenditure for food by individuals, and some covariates, say, income, age (in decades) and gender. Then we may identify the average income effect for all women, age
40–50, at a given income and a given value of expenditure for food. But this will in general still be a heterogeneous group. Since the conditional expectation is a projection that minimizes the L_2-distance, this means that LASD are the best approximation of the true marginal effects given all the information at our disposal. It is also instructive to compare this with the mean regression E[Y|X = x*] = g(x*). In this case, ∂_{x1}g(x*) = E[∂_{x1}φ(X, A)|X = x*] follows straightforwardly (see Altonji and Matzkin, 2005, as well as Hoderlein, 2002, 2008, for discussions). This result parallels (1.2) closely. However, the derivative of the quantile is a conditional average that also includes the information about the dependent variable in our information set.

1.3. Related literature

In econometrics, Roehrig (1988), extending earlier work of Brown (1983), was the first to consider identification in non-separable models formally. He considers identification in a system of equations, i.e. with multivariate Y, and gives conditions for global identification if there is continuous variation. His work was extended by Matzkin (2003). She examines the scenario in which Y is a continuously distributed random scalar, X is continuously distributed and exogenous, and A is a scalar, w.l.o.g. U[0, 1] distributed. Moreover, she assumes that φ is monotonic in A and, as a consequence, achieves full identification of φ. Another closely related extension of Roehrig (1988) is Brown and Matzkin (1996), who discuss an extremum estimator for his system of equations. Chesher (2003, 2005), as well as Imbens and Newey (2003), consider identification in triangular systems of non-separable equations with, at least at some stage, monotonically entering errors and endogenous regressors. Whereas Imbens and Newey (2003) aim at global identification of φ with continuous regressors, Chesher aims at local identification (i.e. at a fixed position of the regressors) with continuous covariates (2003) and discrete covariates (2005), respectively. Moreover, Chesher (2003, 2005) gives more emphasis to identification, while Imbens and Newey (2003) consider estimation in detail. For the identification of average marginal effects, Imbens and Newey (2003), Altonji and Matzkin (2005) and Hoderlein (2002, 2008) give results without assuming monotonicity in unobservables, using derivatives of mean regressions. We will explore the relationship between this line of research and our approach when discussing endogeneity. In the case of endogenous regressors, Florens et al. (2003) and Newey and Powell (2003) consider non-parametric IV estimation, provided the errors enter additively. Another subclass of models with a more specific structure is considered in Florens et al. (2005), who also treat the case of endogenous regressors. Related work is Chernozhukov et al. (2007) and Chernozhukov and Hansen (2005), who assume marginal independence conditions. Finally, the main theme of the paper also shares similarities with the philosophy of partial identification (Manski, 2003), the work on heterogeneity in economic theory (Hildenbrand, 1993), and, last but not least, the work of Heckman and Vytlacil (1999, 2001), where potential outcomes are non-additive in unobservables.

1.4. Structure of the paper

This paper is structured as follows. In the next section, we discuss estimation of the LASD by local polynomial quantile fitting and provide large sample theory for the estimator. In
Section 3, we provide a brief application of the core concepts and the estimator to British consumer data. In the fourth section, we provide several extensions. We discuss the estimation of a class of semi- and non-parametric submodels of (1.1) in the case of exogenous regressors, containing the weakly separable (WS) single index model as well as the WS additive model. Moreover, we show how testing for unobserved heterogeneity, as well as testing for specification, may be accomplished. Finally, we address the issues of censoring and endogeneity, and conclude this paper with an outlook.
2. ESTIMATION PROCEDURES FOR LASD AND THEIR LARGE SAMPLE BEHAVIOUR

We consider model (1.1) with scalar response Y and vector X taking values in a compact subset I of R^d and propose estimates of m_α and m, defined as

  m_α(x) = E[∂_{x1}φ(X, A) | X = x, Y = k_α(x)],   (2.1)

  m(x, y) = E[∂_{x1}φ(X, A) | X = x, Y = y].   (2.2)

Specifically, we propose to estimate the function m_α by a local polynomial smoother m̂_α. The function m is estimated by using that m(x, y) = m_{α(x,y)}(x), where α(x, y) = P(Y ≤ y | X = x). Employing a kernel smoothing estimate α̂ of α, we propose the following estimate of m:

  m̂(x, y) = m̂_{α̂(x,y)}(x).   (2.3)

We suppose that i.i.d. data (Y_i, X_i), 1 ≤ i ≤ n, are given. The estimator m̂_α is defined as a local quadratic regression quantile estimator. For its calculation one has to minimize

  Σ_{i=1}^n τ_α[ Y_i − μ_0 − μ_1^T(X_i − x) − (X_i − x)^T μ_2 (X_i − x) ] K[h^{-1}(X_i − x)]   (2.4)

over scalars μ_0, d-vectors μ_1 and d × d matrices μ_2. Here K is a product kernel function, h = (h_1, ..., h_d) is a bandwidth vector and τ_α(u) = u[α − I(u < 0)]. The diagonal matrix with diagonal elements h_1^{-1}, ..., h_d^{-1} is denoted by h^{-1}. The minimizers of (2.4) are denoted by μ̂_0, μ̂_1, μ̂_2. We define m̂_α(x) = μ̂_{1,1}, where μ̂_{1,1} denotes the first element of the vector μ̂_1. We now develop an asymptotic theory for m̂_α(x). It is based on a Bahadur representation of the estimator, which, along with the assumptions, can be found in the Appendix.
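For concreteness, here is a minimal sketch (in Python, for a scalar regressor, d = 1) of how the minimization in (2.4) can be carried out: since τ_α(cu) = c·τ_α(u) for c ≥ 0, the kernel weights can be absorbed by rescaling each observation, so that an off-the-shelf quantile regression routine can be used. The Epanechnikov kernel, the function names and the use of statsmodels' QuantReg are our implementation choices, not prescriptions of the paper.

```python
import numpy as np
import statsmodels.api as sm

def epanechnikov(u):
    """Kernel K for d = 1: Epanechnikov density with support [-1, 1]."""
    return 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)

def lasd_hat(x0, alpha, X, Y, h):
    """Local quadratic quantile fit of (2.4) for scalar X.

    Minimizes sum_i tau_alpha(Y_i - mu0 - mu1*(X_i-x0) - mu2*(X_i-x0)^2) * K((X_i-x0)/h)
    and returns (mu0_hat, mu1_hat); mu1_hat estimates m_alpha(x0) = d/dx k_alpha(x0).
    """
    d = X - x0
    w = epanechnikov(d / h)
    keep = w > 0
    d, w, y = d[keep], w[keep], Y[keep]
    # tau_alpha(c*u) = c*tau_alpha(u) for c >= 0: scaling response and design row of each
    # observation by its kernel weight turns the weighted problem into an ordinary
    # quantile regression.
    design = np.column_stack([np.ones_like(d), d, d**2]) * w[:, None]
    res = sm.QuantReg(w * y, design).fit(q=alpha)
    mu0, mu1, _ = res.params
    return mu0, mu1

# Illustration on simulated data from Y = X*A + noise, with A independent of X and
# P(A=1) = P(A=2) = 1/2: the quantile derivative is close to 1 below the median level
# and close to 2 above it.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 5000
    X = rng.uniform(1.0, 3.0, n)
    A = rng.choice([1.0, 2.0], n)
    Y = X * A + 0.05 * rng.standard_normal(n)
    print(lasd_hat(x0=2.0, alpha=0.25, X=X, Y=Y, h=0.4))  # mu1 roughly 1
    print(lasd_hat(x0=2.0, alpha=0.75, X=X, Y=Y, h=0.4))  # mu1 roughly 2
```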
THEOREM 2.1. Under Assumptions C1–C5 (stated in Section A.1.1) it holds for x in the interior of I and α ∈ (0, 1) that

  ( f_{Y|X}(k_α(x)|x)² f_X(x) n h_prod h_1² κ_{2,1}² / [ α(1 − α) ∫ u_1² K_1²(u) du ] )^{1/2}
      × ( m̂_α(x) − m_α(x) − (1/(6κ_{2,1})) [ κ_{4,1} k_α^{111}(x) h_1² + 3 κ_{2,1} Σ_{l=2}^d κ_{2,l} k_α^{1ll}(x) h_l² ] ) → N(0, 1)

in distribution, where k_α^{ijl}(x) = ∂_{x_i}∂_{x_j}∂_{x_l} k_α(x) for 1 ≤ i, j, l ≤ d, and where κ_{4,1} = ∫ u⁴ K_1(u) du and κ_{2,l} = ∫ u² K_l(u) du for 1 ≤ l ≤ d. Furthermore, it holds that m̂_α(x) and m̂_β(u) are asymptotically independent if x ≠ u, but asymptotically dependent if x = u. For x fixed, the process

  α ↦ ( f_{Y|X}(k_α(x)|x)² f_X(x) n h_prod h_1² κ_{2,1}² / ∫ u_1² K_1²(u) du )^{1/2}
      × ( m̂_α(x) − m_α(x) − (1/(6κ_{2,1})) [ κ_{4,1} k_α^{111}(x) h_1² + 3 κ_{2,1} Σ_{l=2}^d κ_{2,l} k_α^{1ll}(x) h_l² ] )

converges in distribution to a Brownian bridge B(α) for α in a closed subinterval of [0, 1].

Theorem 2.1 shows that m(x, y) = E[Y′ | X = x, Y = y] can be estimated with the same rate of convergence as E[Y′ | X = x], where we write, in abuse of notation, Y′ = ∂_{x1}φ(X, A). Note that ∂_{x1}E[Y|X] = ∂_{x1}E[φ(X, A)|X] = E[∂_{x1}φ(X, A)|X] = E[Y′|X] under mild regularity assumptions. Thus E[Y′|X = x] can be estimated with the same rate as a first-order partial derivative of a regression function from R^d to R. Because m maps from R^{d+1} to R, we get one additional dimension without losing speed of convergence of the estimator. This is interesting from a theoretical point of view and it is quite natural, too. Unconditional distribution functions and quantile functions can be estimated with parametric rate. For conditional quantile and distribution functions the (non-parametric) rate is determined by the dimension of the conditioning variables.

The expansions of Theorem 2.1 can be used for obtaining asymptotic expressions for integrated squared errors. This gives formulas for asymptotically optimal bandwidths. Data-adaptive optimal bandwidths can be calculated by plugging in consistent estimators for the corresponding unknown terms. This is in line with plug-in approaches in classical non-parametric regression. The bandwidths depend on the criterion used, i.e. whether the integrated squared error is minimized over all α and x, for fixed α, or for fixed x, respectively.

We now discuss the estimator m̂. This estimator is defined by (2.3), where

  α̂(x, y) = (1/(n g_prod)) Σ_{i=1}^n 1[Y_i ≤ y] L( g^{-1}(X_i − x) ),

with a product kernel function L and bandwidth vector g = (g_1, ..., g_d). As earlier, g^{-1} is the diagonal matrix with diagonal elements g_1^{-1}, ..., g_d^{-1}, g_prod = g_1 · ... · g_d and g_max = max_{1≤l≤d} g_l.
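As a sketch of the plug-in step (2.3), one can estimate α(x, y) = P(Y ≤ y | X = x) by kernel smoothing of the indicator 1[Y_i ≤ y] and then evaluate the quantile-derivative estimator at the estimated level. In the code below the smoothed indicator is normalized by a kernel density estimate of f_X (a Nadaraya–Watson ratio) so that α̂ targets the conditional probability; this normalization, the kernel and the function names are our implementation choices.

```python
import numpy as np

def epanechnikov(u):
    return 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)

def alpha_hat(x0, y0, X, Y, g):
    """Kernel estimate of alpha(x, y) = P(Y <= y | X = x) for scalar X.

    This sketch divides the smoothed indicator by the kernel density estimate of f_X
    (Nadaraya-Watson form) so that the estimate targets the conditional probability.
    """
    w = epanechnikov((X - x0) / g)
    return np.sum(w * (Y <= y0)) / np.sum(w)

# The estimator (2.3) then plugs the estimated level into the quantile-derivative fit:
#   m_hat(x0, y0) = lasd_hat(x0, alpha_hat(x0, y0, X, Y, g), X, Y, h)[1]
# where lasd_hat is the local quadratic quantile fit sketched after (2.4).
```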
The Bahadur expansion in the Appendix also implies asymptotic normality of m̂(x, y) if the difference between m̂(x, y) and m̂_{α(x,y)}(x) is of second order. Because by definition m̂(x, y) = m̂_{α̂(x,y)}(x), this holds if α̂(x, y) converges to α(x, y) fast enough. This is guaranteed by Assumptions C6 and C7, stated in Section A.1.1.

THEOREM 2.2. Under Assumptions C1–C7 (stated in Section A.1.1) it holds for x in the interior of I and y ∈ R that

  ( f_{Y|X}(y|x)² f_X(x) n h_prod h_1² κ_{2,1}² / [ α(x, y)(1 − α(x, y)) ∫ u_1² K_1²(u) du ] )^{1/2}
      × ( m̂(x, y) − m(x, y) − (1/(6κ_{2,1})) [ κ_{4,1} k_{α(x,y)}^{111}(x) h_1² + 3 κ_{2,1} Σ_{l=2}^d κ_{2,l} k_{α(x,y)}^{1ll}(x) h_l² ] ) → N(0, 1)

in distribution. Furthermore, it holds that m̂(x, y) and m̂(u, v) are asymptotically independent if x ≠ u, but asymptotically dependent if x = u. For x fixed, the process

  y ↦ ( f_{Y|X}(y|x)² f_X(x) n h_prod h_1² κ_{2,1}² / ∫ u_1² K_1²(u) du )^{1/2}
      × ( m̂(x, y) − m(x, y) − (1/(6κ_{2,1})) [ κ_{4,1} k_{α(x,y)}^{111}(x) h_1² + 3 κ_{2,1} Σ_{l=2}^d κ_{2,l} k_{α(x,y)}^{1ll}(x) h_l² ] )

converges in distribution to B(α(x, y)) for y in a compact set. Here B is a Brownian bridge.

Theorem 2.1 or 2.2 can be applied for the construction of confidence intervals for m_α(x) or m(x, y), respectively. This application requires consistent estimates of f_{Y|X}, f_X, k_α and k_α^{1ll}(x) for l = 1, ..., d. The densities f_{Y|X} and f_X can be consistently estimated by kernel density estimates. A consistent estimate of k_α is given by μ̂_0, defined earlier as the minimizer of (2.4). The estimation of k_α^{1ll}(x) (l = 1, ..., d) causes the usual problems. It can be consistently estimated because we assume that this quantity is continuous in x. This can be done by smooth differentiation of k̂_α or by using local cubic polynomials with an undersmoothing bandwidth. Smooth differentiation is defined by s_prod^{-1} ∫ ∂_{x_1}∂_{x_l}² K(s^{-1}(x − u)) k̂_α(u) du with an undersmoothing bandwidth vector s. We do not pursue this discussion here because it is in line with the usual approaches to bias estimation in non-parametric regression.
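To illustrate how Theorem 2.1 can be used for pointwise confidence intervals, the following sketch (scalar regressor, undersmoothing bandwidth so that the bias term is ignored) plugs kernel density estimates of f_X and f_{Y|X} into the asymptotic variance expression. The kernel constants are those of the Epanechnikov kernel used in the earlier sketch, and gaussian_kde is simply a convenient plug-in choice of ours, not the paper's recommendation.

```python
import numpy as np
from scipy import stats

def ci_lasd(x0, alpha, X, Y, h, k_hat, m_hat, level=0.90):
    """Pointwise CI for m_alpha(x0) from the normal limit in Theorem 2.1 (d = 1),
    ignoring the bias term (assumes an undersmoothing bandwidth h).

    k_hat, m_hat: the local quadratic estimates mu0_hat, mu1_hat at (x0, alpha).
    """
    n = len(X)
    kappa21 = 0.2          # int u^2 K(u) du   for the Epanechnikov kernel
    int_u2K2 = 3.0 / 35.0  # int u^2 K(u)^2 du for the Epanechnikov kernel
    # Plug-in kernel density estimates of f_X(x0) and f_{Y|X}(k_hat | x0).
    fX = stats.gaussian_kde(X)(np.array([x0]))[0]
    fXY = stats.gaussian_kde(np.vstack([X, Y]))(np.array([[x0], [k_hat]]))[0]
    fY_given_X = fXY / fX
    # Asymptotic variance from Theorem 2.1 with d = 1 (so h_prod * h_1^2 = h^3).
    var = alpha * (1 - alpha) * int_u2K2 / (fY_given_X**2 * fX * n * h**3 * kappa21**2)
    z = stats.norm.ppf(0.5 + level / 2)
    return m_hat - z * np.sqrt(var), m_hat + z * np.sqrt(var)
```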
3. AN EMPIRICAL APPLICATION: DEMAND ANALYSIS IN A HETEROGENEOUS POPULATION

In this section we put some of the concepts and methods to work. We take our application from consumer demand, and start by giving an overview of the data, the methods of data cleaning and the definitions of the variables involved.
3.1. The data: FES

Every year, the FES (the British Family Expenditure Survey) reports the income, expenditures, demographic composition and other characteristics of about 7000 households. The sample surveyed represents about 0.05% of all households in the United Kingdom. The information is collected partly by interview and partly by records. Records are kept by each household member and include an itemized list of expenditures during 14 consecutive days. The periods of data collection are evenly spread out over the year. The information is then compiled and provides a repeated series of yearly cross sections.

3.1.1. Grouping of goods, income definition and data cleaning.

We consider demand for a single category of goods that is related to food consumption and consists of the subcategories food bought, food out (catering) and tobacco, which are self-explanatory. For brevity, we call this category food. 'Income' in demand analysis is total expenditure, under an additive separability assumption of preferences over time and decisions. It is obtained by adding up all expenditures, with a few exceptions which are known to be measured with error. This is done to define nominal income; real income is then obtained by dividing by the retail price indices.
3.2. Issues in estimation

3.2.1. The issue of household characteristics.

The role household covariates play in our approach is largely that of control variables that capture the observable part of preference heterogeneity. It is exactly for such a situation that the conditional independence assumption is ideal: for our purposes, the household covariates are nuisance directions and we are not interested in their derivatives. Hence, they are allowed to be arbitrarily correlated with the unobservables. Conditioning on their information can be done in a variety of ways. The route that we take in this paper is to stratify the population to obtain more homogeneous subpopulations. More specifically, like much of the demand literature we focus on one subpopulation, namely two-person households in which both members are adults, at least one is working and the head of household is a white-collar worker. This focus is also justified because other subpopulations are much more prone to measurement problems.
3.3. Empirical results

The discussion of the empirical results will concentrate on the implementation of the main identification result. In particular, we illustrate the concepts using Figures 1 and 2, shown below. Nevertheless, we will be able to address a number of issues discussed earlier. In the figures we show the semi-elasticities of food demand with respect to income. Consider Figure 1 first. The solid line shows the semi-elasticity of demand of a household in the aforementioned group which is at the 10th percentile of the budget share distribution. We call this group the 'eat very little'. The x-axis displays log weekly income, while the y-axis shows the semi-elasticity. Obviously, the budget share of food of the 'eat very little' decreases strongly between 2 and
Figure 1. Marginal effect of income on the food budget share across quantiles of food budget share distribution.
3 units of log income. From there on the budget share continues to decrease, but at a lower rate, so that the decrease at an income level of around 4 is only half as strong. This result makes a lot of sense: it reflects the fact that food is a necessity whose relative importance diminishes. The upswing for high incomes is due to the fact that those individuals substitute 'food bought' by 'food out', which is more expensive. In spite of this, the importance of food diminishes even further, so that the food budget share falls by 0.25 over the income range displayed. The dotted lines around the solid line are 90% bootstrap confidence bands. The bootstrap resample is chosen as (X_i*, Y_i*) with X_i* = X_i and Y_i* = k̂_{U_i}(X_i). Here U_1, ..., U_n is an i.i.d. sample of random variables with uniform distribution on [0, 1], and k̂_α(x) is a local quadratic quantile estimator with oversmoothed bandwidth (a sketch of this resampling scheme is given at the end of this section). The bootstrap confidence bands show that the effect is significantly smaller than zero over the entire range, and among other things we see that the marginal effects for the 'eat very little' at low income and at high income are significantly different as well. Moreover, a straight line through, e.g., y = −0.16 would be inside the dotted lines for incomes between 2 and 3, but it is outside for incomes bigger than 3, indicating that a single index specification would not be appropriate. Observe in particular the dashed line, which gives the semi-elasticities of the 30th percentile group, the 'eat little'. Qualitatively, this group shows the same behaviour as the 'eat very little'. Recall that in our approach an individual household could be 'eat little' at low income and 'eat very little' at high income. Hence, the only comparison we should make between different quantiles is local, for each value of x. For instance, from the fact that the 'eat very little' and the
Figure 2. Marginal effect of income on the food budget share across quantiles of food budget share distribution.
'eat little' (almost) intersect at income 3, and the fact that the dashed line is within the confidence bounds at the same level, we cannot reject the null that all individuals within these two subgroups have the same preferences. The same is actually true at any point on the income scale. Nowhere in the income range displayed are we able to reject the hypothesis that, within the subgroup defined by the characteristics, households in the 'eat very little' and the 'eat little' subgroups have the same preferences. Contrast this with Figure 2: it shows exactly the same graphs as Figure 1, save for one difference: the dashed line represents not the 30th percentile 'eat little' group, but rather the 90th percentile 'eat a lot' group. This group is characterized by a stronger decline in their budget shares, i.e. a more negative semi-elasticity. Since this means that the conditional budget shares are getting ever closer, the result admits the following interpretation: at low income levels there are pronounced differences in the basic requirements. However, these differences in necessities disappear as one gets to higher income ranges, where the relative differences in the demand behaviour for other, more luxurious, categories than food become large. More important for the theory presented in this paper is that the 'eat a lot' group shows a significant difference: it is outside the dotted confidence bands for most of the income range. That means that, at a certain income level (say 3), we may be able to reject the hypothesis that households within the 'eat very little' group and the 'eat a lot' group share the same preferences. However, note that at very high income levels we may still not be able to reject the hypothesis that the households in these groups share the same preferences. A closer discussion of these points needs the development of formal tests along the lines of (4.5) or (4.6). This issue, along with a more elaborate analysis, will be addressed elsewhere.
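To make the resampling scheme behind the confidence bands concrete, here is a minimal sketch under our own implementation choices: the oversmoothed conditional quantile is approximated by a kernel-weighted sample quantile, and the derivative estimator is passed in as a callable (e.g. a wrapper around the local quadratic fit sketched in Section 2).

```python
import numpy as np

def epanechnikov(u):
    return 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)

def cond_quantile_oversmoothed(x0, alpha, X, Y, h_big):
    """Crude oversmoothed k_alpha(x0): kernel-weighted sample quantile of Y near x0."""
    w = epanechnikov((X - x0) / h_big)
    idx = np.argsort(Y)
    cw = np.cumsum(w[idx]) / np.sum(w)
    return Y[idx][np.searchsorted(cw, alpha)]

def quantile_bootstrap_band(x0, alpha, X, Y, h, h_big, estimator, B=500, level=0.90, seed=0):
    """Bootstrap band for m_alpha(x0): resample Y_i* = k_{U_i}(X_i) with U_i ~ U[0, 1],
    keeping X_i* = X_i, and re-estimate on each resample.

    `estimator(x0, alpha, X, Y, h)` should return the quantile-derivative estimate,
    e.g. the mu1 component of the local quadratic fit sketched in Section 2.
    Note: the loop below is O(B * n^2) and is meant only as an illustration.
    """
    rng = np.random.default_rng(seed)
    stats_boot = np.empty(B)
    for b in range(B):
        U = rng.uniform(size=len(X))
        Y_star = np.array([cond_quantile_oversmoothed(x, u, X, Y, h_big)
                           for x, u in zip(X, U)])
        stats_boot[b] = estimator(x0, alpha, X, Y_star, h)
    lo, hi = np.quantile(stats_boot, [(1 - level) / 2, (1 + level) / 2])
    return lo, hi
```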
4. EXTENSIONS OF THIS MODELLING APPROACH

Thus far we have discussed the estimation of marginal effects of general form in the single equation case with exogenous regressors, and illustrated that this concept is straightforward to apply. However, in most microeconometric settings, additional complications arise. Some of them shall be discussed in this section. The first extension of our approach we discuss is towards systems of equations. Indeed, in many applications of microeconomic importance, individuals decide about more than just one issue, e.g. consumers purchase more than a single good. Unfortunately, we can show that the extension to the multivariate case is not straightforward. Since our analysis thus far is entirely non-parametric, another important issue we consider is that of semiparametric specifications. We propose both estimators with a semiparametric structure, as well as specification tests, including tests for the way heterogeneity enters the model. Finally, endogeneity is a major issue in most economic applications. We discuss what happens in our framework when the key identifying independence assumption does not hold, and show how endogeneity can be accommodated in this framework.

4.1. The multivariate case

We now discuss whether the identification theorem of Hoderlein and Mammen (2007) can be generalized to the case of a multivariate Y. As in the literature on non-separable models reviewed above, the case of a scalar dependent variable is less problematic, whereas the case of multivariate Y presents additional difficulties. Even with the monotonicity assumption, additional assumptions like triangularity have to be invoked; see Chesher (2003) as well as Imbens and Newey (2003), and references therein. Unsurprisingly, when using our more general approach, we also encounter difficulties in the multivariate case. We show by a counterexample that a generalization of the Hoderlein and Mammen (2007) theorem to the multivariate case is not possible without additional substantial assumptions. Before considering this example, we briefly mention a result that is an analogue of this theorem in the multivariate case. This result illustrates what may be learned from data in the multivariate case. For simplicity, we only consider bivariate responses Y = (Y_1, Y_2)′ and a scalar valued observable X. We make the following assumption on the relationship between these random variables: there exists a measurable function φ = (φ_1, φ_2)′ from R × A to R² with Y = φ(X, A). Our result in the multivariate case is the following:

  ∂_x f_{Y1,Y2|X}(y_1, y_2 | x*) = ∂_{y1} ∫ y_1′ f_{Y1, ∂_x φ1, Y2 | X}(y_1, y_1′, y_2 | x*) dy_1′ + ∂_{y2} ∫ y_2′ f_{Y1, Y2, ∂_x φ2 | X}(y_1, y_2, y_2′ | x*) dy_2′.   (4.1)

For a proof of (4.1), see the Appendix.

REMARK 4.1. The right-hand side of (4.1) could also be written as trace[H], where H is a matrix with elements H_ij = ∂_{yj} h_i and h_i = E[∂_x φ_i | Y_1, Y_2, X] · f_{Y1,Y2|X}. It is important to note that the objects of interest, e.g. E[∂_x φ_i | Y_1, Y_2, X], are not identified.
This result suggests that the Hoderlein and Mammen (2007) theorem may not be generalized to higher dimensions. We now show this formally by giving a counterexample. This example makes a similar point in our more general setting as Benkard and Berry (2006) do for the case of a function monotonic in a scalar A.

EXAMPLE 4.1. For independent random variables U, V and X, where U and V have a standard normal distribution N(0, 1), we define

  Y_1 = G_1{ [cos(ρ(X))U + sin(ρ(X))V] X },
  Y_2 = G_2{ [−sin(ρ(X))U + cos(ρ(X))V] X },

where ρ is an arbitrary unknown function. Then the joint distribution of (X, Y_1, Y_2) does not depend on ρ! And in accordance with (1.2), both E[∂_x φ_1 | Y_1, X] and E[∂_x φ_2 | Y_2, X] do not depend on ρ. This follows from

  ∂_x φ_1 = ( G_1^{-1}(Y_1)/X + G_2^{-1}(Y_2) ρ′(X) ) G_1′{G_1^{-1}(Y_1)},
  ∂_x φ_2 = ( G_2^{-1}(Y_2)/X − G_1^{-1}(Y_1) ρ′(X) ) G_2′{G_2^{-1}(Y_2)}.

These representations immediately imply

  E[∂_x φ_1 | Y_1, X] = ( G_1^{-1}(Y_1)/X ) G_1′{G_1^{-1}(Y_1)},
  E[∂_x φ_2 | Y_2, X] = ( G_2^{-1}(Y_2)/X ) G_2′{G_2^{-1}(Y_2)}.

Thus both expressions are independent of ρ. But E[∂_x φ_1 | Y_1, Y_2, X] depends on ρ according to

  E[∂_x φ_1 | Y_1, Y_2, X] = ( G_1^{-1}(Y_1)/X + G_2^{-1}(Y_2) ρ′(X) ) G_1′{G_1^{-1}(Y_1)}.

Thus, because ρ is assumed to be unknown, E[∂_x φ_1 | Y_1, Y_2, X] is not identifiable. This argument could be extended to the case where ρ(X) is replaced by ρ(X, R) with a random variable R that is independent of (U, V, X). This example suggests that the multivariate case is rather hopeless without invoking strong additional assumptions like triangularity, which may be hard to justify on economic grounds. However, as we show in the following example, we may circumvent this problem by considering implications of hypotheses in systems of equations in one-dimensional subspaces without additional assumptions.

4.2. Application to models and test procedures

In this section, we provide some extensions to more specific structures of the general form (1.1). First, we consider identification of several semi- and non-parametric model specifications that may lead to estimation procedures not discussed in this paper. Since it is a common problem in econometrics, and easily tractable within this framework, we also treat fixed censoring here.
Second, we discuss specification analysis in this very general framework: we provide a test for the presence of unobserved heterogeneity, and we consider tests for index-type specifications of model (1.1).

4.2.1. Identification of semi- and non-parametric models.

As is common in non-parametric analysis, the curse of dimensionality makes it imperative to place some structure on the function φ so that the LASD is estimable with data sets commonly encountered in practice. Arguably the most popular semiparametric model with additive scalar errors is the single index model. By analogy, define the weakly separable single index model (WS-SIM) as follows:

  Y = φ(X, A) = ψ(X′β, A).

From the main identification theorem (1.2), we get that

  ∇_x k_α(x) = β E[∂_z ψ(X′β, A) | X = x, Y = k_α(x)],   (4.2)

where ∂_z denotes the derivative w.r.t. the index and ∇_x denotes the gradient. In particular, with a weight function w,

  ∫ ∇_x k_α(ξ) w(ξ) dξ   (4.3)

identifies β up to scale, for all α, and β could be estimated by an average quantile derivative estimator, as in Chaudhuri et al. (1997), imposing e.g. ‖β‖ = 1. Our approach allows us to integrate (4.2) over α and x with a weighting function depending on α and x. Consequently, β would also be identified by ∫∫ ∇_x k_α(ξ) v(ξ, α) dξ dα, where v is a weighting function. This class of estimators includes the choice v(ξ, α) = v(ξ), which yields the weighted average mean regression derivative estimator, since ∫ E[∂_x φ(X, A)|X = ξ] v(ξ) dξ = ∫∫ E[∂_x φ(X, A)|X = ξ, Y = k_α(ξ)] v(ξ) dα dξ. This may be seen as an important advantage of our approach, as it allows us to increase efficiency relative to weighted average (mean regression or quantile) derivative estimators for β.
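To make the averaging in (4.3) concrete, the following sketch computes a simple direct average of local linear quantile-gradient estimates over sample points and normalizes it to unit length. This is one crude variant of an average quantile derivative estimator in the spirit of Chaudhuri et al. (1997), not the estimator analysed there; the weighted-rescaling device and all names are ours.

```python
import numpy as np
import statsmodels.api as sm

def product_epanechnikov(U):
    """Product Epanechnikov kernel for a (n, d) array of scaled deviations."""
    K = 0.75 * (1.0 - U**2) * (np.abs(U) <= 1.0)
    return np.prod(K, axis=1)

def local_linear_quantile_grad(x0, alpha, X, Y, h):
    """Gradient of the conditional alpha-quantile at x0 via a local linear quantile fit.

    X : (n, d) array of regressors; h : scalar or length-d bandwidths.
    """
    D = X - x0
    w = product_epanechnikov(D / h)
    keep = w > 0
    D, w, y = D[keep], w[keep], Y[keep]
    design = np.column_stack([np.ones(len(y)), D]) * w[:, None]
    res = sm.QuantReg(w * y, design).fit(q=alpha)
    return np.asarray(res.params)[1:]          # estimate of grad_x k_alpha(x0)

def average_quantile_derivative(alpha, X, Y, h, n_eval=200, seed=0):
    """beta estimated up to scale by averaging grad_x k_alpha over sample points,
    with the empirical distribution of X playing the role of the weight w in (4.3)."""
    rng = np.random.default_rng(seed)
    pts = X[rng.choice(len(X), size=min(n_eval, len(X)), replace=False)]
    grads = np.array([local_linear_quantile_grad(x0, alpha, X, Y, h) for x0 in pts])
    beta = grads.mean(axis=0)
    return beta / np.linalg.norm(beta)         # scale normalization ||beta|| = 1
```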
Another very popular class of non-parametric models is the class of additive models, which also nests the partially linear model. Define the weakly separable additive model (WS-AM) as follows:

  Y = φ(X, A) = ψ( Σ_j γ_j(X_j), A ),   (4.4)

where the subscript j denotes the jth component. For related mean and quantile regression models with additive error terms, see also Christopeit and Hoderlein (2006), Horowitz (2001) and Horowitz and Mammen (2007a,b). Using again the main identification theorem (1.2), we obtain that

  ∇_x k_α(x) = β(x) E[ ∂_z ψ( Σ_j γ_j(X_j), A ) | X = x, Y = k_α(x) ] = β(x) c(x, α),

where β(x) = (γ_1′(x_1), ..., γ_d′(x_d))′. After imposing the scale normalization ‖β(x)‖ = 1 for all x, we may identify c(x, α) for all x. This suggests a normalized marginal quantile integration estimator, i.e.

  ∫ ∂_{x_1} k_α(x_1, ξ_{-1}) ( c(x_1, ξ_{-1}, α) )^{-1} w(ξ_{-1}) dξ_{-1}   for all α,

as well as

  ∫∫ ∂_{x_1} k_α(x_1, ξ_{-1}) ( c(x_1, ξ_{-1}, α) )^{-1} v(ξ_{-1}, α) dξ_{-1} dα,

to estimate β_1(x_1) = γ_1′(x_1). Obviously, the same arguments regarding efficiency can be made.

4.2.2. Specification testing in this framework

A test for the influence of unobserved heterogeneity. Recall our major result, E[∂_{x1}φ(X, A)|X = x, Y = k_α(x)] = m_α(x), where the function m_α may change with α. If unobserved heterogeneity is not allowed to have an effect on the derivative, then m_α(x) = ∂_{x1}λ(x), where the right-hand side is a function that is independent of α. The fact that the marginal effect does not depend on α could result from a model of the type Y = λ(X) + ϕ(X_{-1}, A). Hence, this hypothesis also has an interpretation as postulating that the model is (partially) additively separable in the error. Rewriting the null hypothesis,

  ∫ ( m_α(x) − ∫ m_β(x) g(β) dβ )² g(α) dα = 0,   (4.5)

where g denotes a weighting function. This hypothesis is for one fixed value x only. Averaging the test statistic over a range of values of x may determine whether heterogeneity has an effect for some values of x. An integrated version of this hypothesis is given by

  ∫∫ ( m_α(x) − ∫ m_β(x) g(β) dβ )² g(α) dα w(x) dx = 0,   (4.6)

where w is a weight function. Sample counterparts have the form

  ∫ ( m̂_α(x) − ∫ m̂_β(x) g(β) dβ )² g(α) dα

and
  ∫∫ ( m̂_α(x) − ∫ m̂_β(x) g(β) dβ )² g(α) dα w(x) dx.

An alternative test could be based on the sup norm, using

  T_n = sup_{α ∈ J} g(α) | m̂_α(x) − ∫_J m̂_β(x) g(β) dβ |.

The asymptotic behaviour of this test statistic follows immediately from Theorem 2.1. Theorem 2.2 suggests using g(β) equal to a consistent estimate of f_{Y|X}(k_β(x)|x) [ ∫_J f_{Y|X}(k_γ(x)|x) dγ ]^{-1}. With such a choice,

  ( n h_prod h_1² )^{1/2} [ ∫ u_1² K_1²(u) du ]^{-1/2} f_X(x)^{1/2} [ ∫_J f_{Y|X}(k_γ(x)|x) dγ ]^{-1} κ_{2,1} T_n

converges in distribution to

  (1/λ(J)) sup_{α ∈ J} | B(α) − ∫_J B(β) dβ |,

where B is a Brownian bridge and λ is the Lebesgue measure. In the limit J → [0, 1], this coincides with the asymptotic distribution of a two-sided Kolmogorov–Smirnov test statistic.
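As a numerical illustration of the sample counterparts above, the following sketch evaluates the integrated statistic (the analogue of (4.6)) on finite grids of α and x, taking g and w to be uniform weights over the grids; the input array of estimates m̂_α(x_j) would come from the estimator of Section 2, and all names are ours.

```python
import numpy as np

def heterogeneity_test_stat(m_hat, g=None, w=None):
    """Sample counterpart of the integrated null (4.6) on grids.

    m_hat : (A, J) array with m_hat[a, j] = estimated m_{alpha_a}(x_j)
    g, w  : weights over the alpha-grid and the x-grid (uniform if None)
    Returns sum_j w_j * sum_a g_a * (m_hat[a, j] - sum_b g_b m_hat[b, j])^2.
    """
    A, J = m_hat.shape
    g = np.full(A, 1.0 / A) if g is None else g / g.sum()
    w = np.full(J, 1.0 / J) if w is None else w / w.sum()
    centered = m_hat - (g[:, None] * m_hat).sum(axis=0, keepdims=True)
    return float(w @ (g @ centered**2))

# Under the null of no effect of unobserved heterogeneity the statistic should be small;
# critical values could be obtained, for instance, from the bootstrap used in Section 3.
```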
Specification tests and model choice. After determining the impact of unobserved heterogeneity, we may ask whether the function φ is completely amorphous, or whether it has some structure. Paralleling the section on estimation, we also provide a test for the index structure. As was already noted earlier, under the null of an index structure the ratio of two derivatives is constant and does not depend on α. This opens up the way to non-parametric specification tests, e.g. by comparing the ratio of any two derivatives with a constant. Since this has to hold for any α, we either get a large number of hypotheses, or we may increase the power of tests by using a weighted average of the tests that arise from these hypotheses. Finally, we may test for additivity as in (4.4), but we may also be able to test for another generalization of the linear index restriction, i.e.

  Y = φ(X, A) = ψ( ζ(Z), P, A ),   (4.7)

where X = (Z′, P′)′. In this case, ∂_{z1}k_α(x)/∂_{z2}k_α(x) = ∂_{z1}ζ(z)/∂_{z2}ζ(z), implying that the ratio is only a function of z. Again, the ratio should also be invariant across α, and both facts open up the way for specification tests.

4.3. Censored data

In a number of applications there is exogenous censoring of the data. More formally, consider a model with fixed censoring:

  Y* = φ(X, A) and Y = 1{Y* > 0} Y*,   (4.8)

where X and Y are observed. In this case, (1.2) yields E[∂_{x1}φ | X = x*, Y = k_α(x*)] = m_α(x*), as long as k_α(x*) > 0. For k_α(x*) > 0, the conditional α-quantile of Y and the conditional α-quantile of Y* coincide. Similar arguments can be applied to more complicated settings of censoring.

4.4. Endogeneity

Thus far all of our analysis requires that the error be conditionally independent of the regressors. In economics it is common to believe that this assumption is violated. In the non-parametric mean regression case with additively separable errors, endogeneity of regressors has proven to be a difficult problem. Although estimators do exist, e.g. Newey and Powell (2003), Darolles et al. (2003) and Hall and Horowitz (2003), their speed of convergence might be very slow in some cases.
In this section, we will establish the following results. First, we show precisely what goes wrong when we do not invoke the conditional independence assumption required for identification. Second, we show that the LASD is well identified under a control function assumption, and estimable under general conditions with standard speed of convergence. Third, we propose a test for endogeneity. Fourth, we show the relationship to Altonji and Matzkin (2005) as well as Imbens and Newey (2003) as far as the estimation of derivatives is concerned. Finally, we give a counterexample which establishes that the independence of errors and instruments alone is not sufficient to identify the LASD.

4.5. The role of the conditional independence assumption

For the identification result for the LASD in Hoderlein and Mammen (2007), the essential assumption was that the random variables A and X_1 are conditionally independent, given X_2, ..., X_d. In this section, we discuss the importance of the conditional independence assumption by highlighting what happens if this assumption does not hold. In the case of dependent X and A, the following theorem may be used to obtain bounds on the marginal effects.

THEOREM 4.1. For fixed values x* ∈ R^d and 0 < α < 1 assume that Assumptions A1–A5 (stated in Section A.1.1) hold. Then,

  E[∂_{x1}φ(X, A) | X = x*, Y = k_α(x*)] = m_α(x*) + l_α(x*),

where

  l_α(x*) = E[ 1[Y ≤ k_α(x*)] ∂_{x1} ln f_{A|X}(A|x*) | X = x* ] / f_{Y|X}(k_α(x*)|x*).

REMARK 4.2. This result illustrates the importance of the conditional independence assumption. In its absence, the derivative of the conditional quantile function contains both the best projection of the underlying marginal effect ∂_{x1}φ and a distributional effect that indicates how the composition of the unobservables changes as we vary the level of the covariates. The conditional expectation E[1[Y ≤ k_α(x*)] ∂_{x1} ln f_{A|X}(A|x*) | X = x*] is not identified without additional assumptions. In this paper, we will not discuss such assumptions, but Theorem 4.1 could be used as a starting point for weakening the independence Assumption A1.

If the conditional independence assumption breaks down, we adopt the standard terminology and call X_1 endogenous. We now discuss possible solutions for this problem.

4.5.1. A control function approach for the unrestricted case.

Let the model be given by Y = φ(X, A), but A is now not independent of X_1 conditionally on X_2, ..., X_d. However, we have instruments Z with the following property: define U as the unobservables in the mapping X_1 = μ(Z, X_2, ..., X_d, U), with U independent of Z conditionally on X_2, ..., X_d. Then, assume that A is independent of Z, conditional on U and X_2, ..., X_d. Call this assumption (ACF). This implies that A is independent of X, conditional on U and X_2, ..., X_d. Then, due to (1.2),

  m_α(x, u) = E[∂_{x1}φ(X, A) | X = x, U = u, Y = k_α(x, u)],

for any α. Note that no monotonicity in U is required for this argument. However, since U has to be used as a regressor, it must be pre-estimated and therefore additional assumptions have to
be imposed: for instance, that μ is monotone in U and that U is uniform on [0, 1] (conditionally on Z and X_2, ..., X_d). This can be seen as a control function approach (CF-IV) to the endogeneity problem.
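A minimal sketch of the control function (CF-IV) route for a scalar endogenous regressor X_1 and scalar instrument Z (no further exogenous covariates): under the monotonicity and uniformity of U just mentioned, U_i can be estimated by the conditional distribution function of X_1 given Z evaluated at the data, and the LASD estimator of Section 2 is then applied with (X_1, U) as regressors. The kernel-smoothed conditional CDF and the function names are our implementation choices.

```python
import numpy as np

def epanechnikov(u):
    return 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)

def control_variable(X1, Z, g):
    """First stage: U_i = F_{X1|Z}(X1_i | Z_i), estimated by a kernel-smoothed
    conditional distribution function (valid if X1 = mu(Z, U) with U ~ U[0, 1],
    mu monotone in U and U independent of Z). O(n^2) loop, for illustration only."""
    n = len(X1)
    U = np.empty(n)
    for i in range(n):
        w = epanechnikov((Z - Z[i]) / g)
        U[i] = np.sum(w * (X1 <= X1[i])) / np.sum(w)
    return U

# Second stage (schematic): run the local quadratic quantile fit of Section 2 with the
# regressor vector (X1_i, U_i) and read off the derivative with respect to X1, which
# estimates m_alpha(x, u) = E[d/dx1 phi(X, A) | X = x, U = u, Y = k_alpha(x, u)].
```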
4.5.2. Testing endogeneity.

A test for endogeneity under (ACF) can be built on the following observation. In the following section, we assume throughout that A is independent of X, conditional on U and X_2, ..., X_d. The assumption to be tested is whether A is independent of X, conditional on X_2, ..., X_d. Under the null of conditional independence, we have

  m_α(x) = E[ E[∂_{x1}φ(X, A) | X, U, Y] | X = x, Y = k_α(x) ],

but this equals

  m_α(x) = ∫ M_α(x, u) f_{U|X,Y}(u | x, y_α) du,

where M_α(x, u) denotes the derivative of the α-quantile of Y conditional on X and U. This suggests again using the L_2-distance between both sides of the equality to test for the validity of the exogeneity assumption. An empirical test statistic for the global validity of the exogeneity assumption is the following:

  ∫∫∫ ( M̂_α(x, u) − m̂_α(x) )² w(α, x, u) dα dx du,

where the hats denote standard non-parametric estimators as in Section 2, and w is a weighting function.

4.5.3. The relationship to Imbens and Newey (2003) and Altonji and Matzkin (2005).

As far as the estimation of average structural marginal effects is concerned, both Altonji and Matzkin (2005) and Imbens and Newey (2003) use the fact that

  E[∂_{x1}φ(X, A) | X] = E[ E[∂_{x1}φ(X, A) | X, U] | X ] = E[ ∂_{x1}M(X, U) | X ],   (4.9)

where M(x, u) = E[Y | X = x, U = u], to obtain an estimator for the LASD. Similar arguments were frequently used in Hoderlein (2002). In (4.9), conditioning was done with respect to σ(X). In the approach put forward in this paper we use more information, i.e.

  E[∂_{x1}φ(X, A) | X, Y = y_α] = E[ E[∂_{x1}φ(X, A) | X, U, Y] | X, Y = y_α ] = E[ ∂_{x1}k_α(X, U) | X, Y = y_α ]   for all α,   (4.10)

where y_α is shorthand for the conditional α-quantile. If we are interested in obtaining average derivatives over the entire population, then from both quantities, i.e. E[∂_{x1}φ(X, A)|X] and E[∂_{x1}φ(X, A)|X, Y], overall averages may be obtained. However, if we also consider weighted average derivatives, our approach allows us to consider a larger class of averages, as the weights may also depend on Y. This may be important for policy considerations. Last, but by no means least, note again that an estimator that uses (4.10) gives a closer approximation to the true underlying ∂_{x1}φ than one based on the right-hand side of (4.9).

4.5.4. The limitations of traditional IV.

In general, the LASD is not identified in the traditional IV setting. Assume that in the model Y = φ(X, A) the error variable A is not
independent of X_1 (conditionally on X_2, ..., X_d). We then show that for identification of the LASD it does not suffice to have an instrument Z that is independent of A. We demonstrate this by a class of counterexamples. Suppose that three scalar random variables Z, A and B are given. Moreover, assume that Z is independent of A, that Z is independent of B, but that Z is not independent of the tuple (A, B). In particular, we assume that Z is not independent of ρ(A, B), where ρ(A, B) is a function that is strictly monotone in a for fixed b. For a function φ: R² → R we suppose that we observe X, Y and Z with X = ρ(A, B) and Y = φ(X, A). Because ρ is monotone in a, there exists a function λ: R² → R with A = λ(X, B). Then Y = ψ(X, B) with ψ(x, b) = φ[x, λ(x, b)]. Now we have two representations of our data: Y = φ(X, A) and Y = ψ(X, B), with error variables A or B, respectively. By construction, Z is an instrument in both specifications. The LASD is a conditional expectation of ∂_x φ(X, A) or of ∂_x ψ(X, B) = ∂_x φ(X, A) + ∂_a φ(X, A) ∂_x λ(X, B). It is clear that in general (conditional) expectations of these two expressions differ. This shows that for identifiability of the LASD it does not suffice that the instrument is independent of the error variable. More structural assumptions are needed. It does not help to assume that A is a scalar and that φ is monotone in A. This can be seen by a slight modification of our example. Assume additionally that φ is monotone in A and that ρ(a, b) is monotone in b for fixed a. Then ψ is monotone in B and we have two representations of our data in which the error variable enters the model monotonically. We conjecture that any additional assumption has to exclude a complicated non-parametric notion of weakness of the instrument.
5. CONCLUSION AND OUTLOOK

In this paper, we were concerned with the non-separable model

  Y = φ(X, A),   (5.1)

where Y and X are observable real-valued random p- and d-vectors, while A is an element of a Borel space. The key innovation is that we do not place any restrictions on A, or on its influence. Nevertheless, we were able to show identification of local average structural derivatives (LASD), a conditional expectation of the marginal effects that exhausts all the information given in the entire data. More specifically, our main result links derivatives of conditional quantiles to conditional average structural functions. From this perspective, the quantile regression and the mean regression are not mutually exclusive competitors, but different projections from model (5.1), which characterizes a heterogeneous population. Since the derivative of the conditional quantile can be seen as a conditional expectation w.r.t. a larger σ-algebra, it is closer in the L_2-distance sense, and should therefore be preferred in the single equation case. However, the mean regression works well in the multiple equation case, whereas we established that this case is not easily tractable using quantiles.

Another area of application for the principles developed in this paper is other operators acting on the function φ. For instance, in applications where risk-taking behaviour is analysed, the second derivative of φ may be of interest. Also, differentials, integrals or other objects like the Slutsky matrix could be considered in a similar way as in this paper.

This paper has furthermore established that many econometric methods may be generalized to this very general class of models. We provided the large sample theory necessary to handle
the asymptotic distribution of all estimators, and showed how the bootstrap may be performed. Moreover, we discussed the identification of structured models, specifically weakly separable single index and additive models. In addition, we established how testing for specification may be performed in model (5.1). It is possible to test for the influence of unobserved heterogeneity, and for many semiparametric specifications. Starting from the asymptotic results given in this paper, the large sample theory of all test statistics may be developed, but we leave this to a companion paper. Further specification analysis may also be performed in a similar fashion as in this paper. More specifically, this issue may be combined with relaxing the exogeneity assumption (conditional independence of X_1 and A).

The issue of relaxing the exogeneity assumption played an important role in this paper. We have established that a control function approach works neatly and provides a natural extension of the exogenous case. Moreover, we were able to provide quantile analogues to the results on estimation of average marginal effects under endogeneity in Imbens and Newey (2003) and Altonji and Matzkin (2005) without assuming monotonicity. In addition, we also suggested a test for endogeneity. Finally, we established that simple independence between instruments and unobservables is not sufficient to identify the LASD. In this paper, we argued in favour of the control function assumption, because it yields identification of the LASD in the absence of any major additional assumption, and hence provides a robust and convenient route. Hoderlein (2008) discusses the economic content of the exogeneity assumption for the case of consumer demand and the mean regression. However, it remains to be established how tractable this assumption is in general. There may well be applications in which researchers are only willing to assume the marginal independence condition as in Chernozhukov et al. (2007). It is clear from our analysis that additional assumptions, e.g. on the functional form or on the dependence structure between all variables, are needed for identification. What type of additional assumption suits a specific type of application remains to be determined, and is a challenging question for future research.
ACKNOWLEDGMENTS

The authors are indebted to Andrew Chesher, Joel Horowitz, Oliver Linton, Rosa Matzkin, Whitney Newey, Jim Powell and seminar participants at the ESWC, EMS Oslo, Bergen, Berlin, Göttingen, Frankfurt, Madrid, Mannheim, Northwestern, Strasbourg, Tübingen, and UCL/IFS for helpful comments. The usual disclaimer applies. Financial support by the Landesstiftung Baden-Württemberg 'Eliteförderungsprogramm' is gratefully acknowledged.
REFERENCES

Altonji, J. and R. Matzkin (2005). Cross section and panel data estimators for nonseparable models with endogenous regressors. Econometrica 73, 1053–103.
Benkard, L. and S. Berry (2006). Nonparametric identification of nonlinear simultaneous equation models. Econometrica 74, 1429–40.
Billingsley, P. (1968). Convergence of Probability Measures. New York: John Wiley.
Brown, D. and R. Matzkin (1996). Estimation of nonparametric functions in simultaneous equation models, with an application to consumer demand. Working paper, Northwestern University.
Chaudhuri, P., K. Doksum and A. Samarov (1997). On average derivative quantile estimation. Annals of Statistics 25, 715–44.
Chernozhukov, V. and C. Hansen (2005). An IV model of quantile treatment effects. Econometrica 73, 245–61.
Chernozhukov, V., G. Imbens and W. Newey (2007). Instrumental variables identification and estimation via quantile conditions. Journal of Econometrics 139, 4–14.
Chesher, A. (2003). Identification in nonseparable models. Econometrica 71, 1405–43.
Chesher, A. (2005). Nonparametric identification under discrete variation. Econometrica 73, 1525–50.
Christopeit, N. and S. Hoderlein (2006). Local partitioned regression. Econometrica 74, 787–817.
Darolles, S., J. P. Florens and E. Renault (2003). Nonparametric instrumental regression. Working paper, IDEI, Université de Toulouse.
Fan, J., T. Hu and Y. Truong (1994). Robust non-parametric function estimation. Scandinavian Journal of Statistics 21, 433–46.
Florens, J. P., J. Heckman, C. Meghir and E. Vytlacil (2003). Instrumental variables, local instrumental variables and control functions. Working paper, IDEI, Université de Toulouse.
Hall, P. and J. Horowitz (2003). Nonparametric methods for inference in the presence of instrumental variables. CWP 02/03, Centre for Microdata Methods and Practice, IFS and UCL.
Heckman, J. and E. Vytlacil (1999). Local instrumental variables and latent variable models for identifying and bounding treatment effects. Proceedings of the National Academy of Sciences 96, 4730–34.
Heckman, J. and E. Vytlacil (2001). Local instrumental variables. In C. Hsiao and K. Morimune (Eds.), Nonlinear Statistical Inference: Essays in Honour of Takeshi Amemiya, 1–46. Cambridge: Cambridge University Press.
Hildenbrand, W. (1993). Market Demand: Theory and Empirical Evidence. Princeton, NJ: Princeton University Press.
Hoderlein, S. (2002). Econometric modelling of heterogeneous consumer behaviour: theory, empirical evidence and aggregate implications. Ph.D. thesis, LSE.
Hoderlein, S. (2008). How many consumers are rational? Working paper, Brown University.
Hoderlein, S. and E. Mammen (2007). Identification of marginal effects in nonseparable models without monotonicity. Econometrica 75, 1513–18.
Horowitz, J. (2001). Nonparametric estimation of a generalized additive model with an unknown link function. Econometrica 69, 499–514.
Horowitz, J. and E. Mammen (2007a). Oracle-efficient nonparametric estimation of an additive model with an unknown link function. Working paper, University of Mannheim.
Horowitz, J. and E. Mammen (2007b). Rate-optimal estimation for a general class of nonparametric regression models with unknown link functions. Annals of Statistics 35, 2589–619.
Imbens, G. and W. Newey (2003). Identification and estimation of triangular simultaneous equations models without additivity. Working paper, MIT.
Manski, C. F. (2003). Partial Identification of Probability Distributions. New York: Springer.
Mas-Colell, A., M. Whinston and J. Green (1997). Microeconomic Theory. Oxford: Oxford University Press.
Matzkin, R. (2003). Nonparametric estimation of nonadditive random functions. Econometrica 71, 1339–76.
Newey, W. and J. Powell (2003). Instrumental variable estimation of nonparametric models. Econometrica 71, 1565–78.
Roehrig, C. (1988). Conditions for identification in nonparametric and parametric models. Econometrica 56, 433–47.
APPENDIX

A.1. Assumptions and Proofs

A.1.1. Assumptions.

The following assumptions are needed to show a Bahadur representation for m̂_α(x); see Theorem A.1 below. Theorem A.1 will be used to prove Theorems 2.1 and 2.2.

ASSUMPTION C1. The random tuples (X_i, Y_i) are i.i.d. The random variables X_i take values in a compact subset I ⊂ R^d.

ASSUMPTION C2. The conditional density f_{Y|X}(y|x) of Y given X is uniformly continuous in x and y and bounded from below and from above on R × I. The density f_X of X is bounded from below and continuous on I.

ASSUMPTION C3. For 0 < α < 1 all partial derivatives of k_α of order 3 exist in I and are bounded.

ASSUMPTION C4. There exists a β > 0 with n^β h_l → 0 and n^{1−β} h_prod → ∞ for l = 1, ..., d, where h_prod = h_1 · ... · h_d. The kernel K is a product kernel K(u) = K_1(u_1) · ... · K_d(u_d). The functions K_1, ..., K_d are symmetric probability density functions with bounded support.

ASSUMPTION C5. With h_max = max_{1≤l≤d} h_l it holds that n h_prod h_1² h_max^4 = O_P(1).

These assumptions are rather standard. In Assumption C3 we assume the existence of three derivatives to obtain an asymptotic expansion of the bias for our estimate of the derivative of k_α. Assumption C5 assumes that no oversmoothing is used. As can be seen from Theorem A.1, if the variance part and the bias part of m̂_α(x) are of the same order, then (n h_prod h_1²)^{-1} and h_max^4 are also of the same order. Assumption C5 is used to avoid a higher order expansion of the bias, with its additional smoothness assumptions.

For the result of Theorem 2.2 we need that α̂(x, y) converges to α(x, y) fast enough. This is guaranteed by the following assumptions.

ASSUMPTION C6. For the bandwidth vectors h and g it holds that h_prod h_1² / g_prod → 0 and n g_max^4 h_prod h_1² → 0. The kernel L is a product kernel L(u) = L_1(u_1) · ... · L_d(u_d). The functions L_1, ..., L_d are symmetric probability density functions with bounded support.

ASSUMPTION C7. All partial derivatives of α(x, y) of order two with respect to x exist and are bounded.

For the general identification result, as well as the discussions in Section 4.5, we make use of the following assumptions.

ASSUMPTION A1. (a) For fixed a ∈ A the function φ(x_1, x_{-1}*, a) is continuous in x_1 at x_1 = x_1*.
(b) P[φ(x_1, x_{-1}*, A) = k_α(x*) | X = x*] = 0 for x_1 in a neighbourhood of x_1*.
(c) The conditional distribution of A given X = (x_1, x_{-1}*) is absolutely continuous w.r.t. the conditional distribution of A given X = x* for x_1 in a neighbourhood of x_1*. It holds that

  | f_{A|X}(a | x_1* + δ, x_{-1}*) / f_{A|X}(a | x*) − 1 | ≤ |δ| g(a)

for |δ| small enough, for a measurable function g fulfilling E[g(A)|X = x*] < +∞. The function x_1 ↦ f_{A|X}(a | x_1, x_{-1}*) is differentiable at x_1 = x_1* for all a ∈ A.
ASSUMPTION A2. The conditional distribution of Y given X is absolutely continuous w.r.t. the Lebesgue measure for x_1 in a neighbourhood of x_1* and for x_{-1} = x_{-1}*. Here we use the notation x_{-1} = (x_2, ..., x_d)′. The density f_{Y|X}(y | x_1, x_{-1}*) of Y given X is continuous in (y, x_1) at the point (y, x_1) = (k_α(x*), x_1*). The conditional density f_{Y|X}(y|x*) of Y given X = x* is bounded in y ∈ R.

ASSUMPTION A3. k_α(x) is partially differentiable with respect to the first component at x = x*.

ASSUMPTION A4. There exists a measurable function Δ satisfying

  P( |φ(x_1* + δ, x_{-1}*, A) − φ(x*, A) − δΔ(A)| ≥ εδ | X = x* ) = o(δ)

for δ → 0 and fixed ε > 0. We also write ∂_{x1}φ(x*, a) for Δ(a) and ∂_{x1}φ or ∂_{x1}φ(x*, A) for Δ(A).

ASSUMPTION A5. The conditional distribution of (Y, ∂_{x1}φ), given X, is absolutely continuous w.r.t. the Lebesgue measure for x = x*. For the conditional density f_{Y,∂_{x1}φ|X} of (Y, ∂_{x1}φ) given X, the following inequality holds with a constant C and a positive density g on R with finite mean (i.e. ∫ |y′| g(y′) dy′ < ∞):

  f_{Y,∂_{x1}φ|X}(y, y′ | x*) ≤ C g(y′).

Assumptions A2–A5 are the same as the ones used for the theorem in Hoderlein and Mammen (2007). There, instead of Assumption A1, it was assumed that the random variables A and X_1 are conditionally independent, given X_2, ..., X_d.
A.1.2. Proofs for Theorems 2.1 and 2.2. The following result states a Bahadur representation for m α (x). The theorem is the basic tool for the proofs of Theorems 2.1 and 2.2. T HEOREM A.1. (Bahadur representation). Under Assumptions C1–C4 the following property holds uniformly for α in a closed subset of (0, 1) and for x in a closed subset of I that does not contain boundary points m α (x) − mα (x) 1 1 1 −1 κ2,1 =− fY |X (kα (x)|x) fX (x) nhprod h21 ×
n
Xi,1 − x1 K h−1 (Xi − x) i=1
× I [Yi ≤ kα (Xi )] − α + FY |X kα (x) + Dx kα (x)T (Xi − x) 1 + (Xi − x)T Dxx kα (x)(Xi − x)|Xi − FY |X [kα (Xi )|Xi ] 2
−1/2 , + oP nhprod h21 where κ2,1 = u2 K1 (u)du and where Dx k α (x) (or Dxx k α (x)) is the vector (or matrix) of first (second) order partial derivatives. Proof. For simplification of notation we give the proof only for d = 1. We start the proof similarly as the proof of Theorem 2 in Fan et al. (1994). Put √ μ1 − kα (x)), h2 ( μ2 − kα (x)) . μ0 − kα (x), h( θ = θ (α, x) = nh The vector θ minimizes Gn (θ ) = Gn,α,x (θ ) =
n i=1
τα
Yi∗
! "X − x # √ i ∗ , − θ Zi / nh − τα (Yi ) K h T
C The Author(s). Journal compilation C Royal Economic Society 2009.
22
S. Hoderlein and E. Mammen
where Z i = [1, (X i − x)/h, (X i − x)2 /h2 ] and 1 Yi∗ = Yi∗ (α, x) = Yi − kα (x) − kα (x)(Xi − x) − kα (x)(Xi − x)2 . 2 Put 1 T θ Zi τα (Yi∗ )K Wn (θ ) = Wn,α,x (θ ) = √ nh i=1 n
"
Xi − x h
# .
For the proof of the theorem we will make use of the following two lemmas.
L EMMA A.1. For all η > 0 it holds for γ > 0 small enough and for a closed interval J ⊂ (0, 1) that sup
θ ≤nγ ,α∈J ,x∈I
|Gn (θ ) + Wn (θ ) − E[Gn (θ ) + Wn (θ)|X1 , . . . , Xn ]| = oP (1).
L EMMA A.2. For all η > 0 it holds for γ > 0 small enough and for a closed interval J ⊂ (0, 1) that sup
θ ≤nγ ,α∈J ,x∈I
−
|E[Gn (θ ) + Wn (θ )|X1 , . . . , Xn ]
" # n 1 1 1 fY |X kα (x) + kα (x)(Xi − x) + kα (x)(Xi − x)2 |Xi 2 nh i=1 2 # " Xi − x | = oP (1). (θ T Zi )2 K h
Lemma A.1 follows by application of Bernstein’s inequality. Note that Gn (θ ) + Wn (θ ) = −
n ∗ Yi − (nh)−1/2 θ T Zi 1 Yi∗ − (nh)−1/2 θ T Zi < 0 i=1
− 1 Yi∗ < 0 K[(Xi − x)/h] is a sum of independent random variables that are absolutely bounded by C(nh)−1/2 θ with a positive constant C. For a proof of Lemma A.2 one uses a Taylor expansion of E[Gn (θ) + Wn (θ)|X1 , . . . , Xn ]. Lemma A.2 immediately implies that sup
θ ≤nγ ,α∈J ,x∈I
|E[Gn (θ ) + Wn (θ )|X1 , . . . , Xn ] ⎛
1
0
κ2,1
⎞
⎜ ⎟ 1 κ2,1 0 ⎟ − fY |X (kα (x)|x)fX (x)θ T ⎜ ⎝0 ⎠ θ| = oP (1). 2 κ4,1 κ2,1 0 We now use the fact that G n is a convex function and that it is approximated in the last equation by another convex function. This shows that the location of the minimum G n is approximated by the location of the minimum of the approximating function. This implies that uniformly for α ∈ J and x ∈ I 1 −1 1 κ2,1 (Xi − x)K[(Xi − x)/h] 3 fY |X (kα (x)|x)fX (x) nh i=1 × 1[Yi∗ < 0] − α + oP ((nh3 )−1/2 ). n
m α (x) − mα (x) = −
C The Author(s). Journal compilation C Royal Economic Society 2009.
Non-separable models without monotonicity
23
The theorem now follows from n 1 (Xi − x)K[(Xi − x)/h] nh3 i=1
× 1[Yi∗ < 0] − 1[Yi < kα (Xi )] − E 1[Yi∗ < 0] − 1[Yi < kα (Xi )] |Xi
= oP ((nh3 )−1/2 ). Proof of Theorem 2.1: The theorem follows by application of Theorem A.1. The bias term can be calculated by using Taylor expansions and standard smoothing theory to
nhprod h31
n
−1 (Xi,1 − x1 )K(h−1 (Xi − x)) i=1
1 FY |X kα (x) + Dx kα (x)T (Xi − x) + (Xi − x)T Dxx kα (x)(Xi − x)|Xi − FY |X [kα (Xi )|Xi ] . 2 Convergence of the process Bn (α) =
2 fY |X (kα (x)|x)2 fX (x)nhprod h21 κ2,1 2 u1 K 2 (u) du
1/2 { mα (x) − mα (x)
d 1 1ll 2 2 111 κ2,1 κ2,l kα (x)hl − κ4,1 kα (x)h1 + 3 6κ2,1 l=2 follows by application of a tightness criterion (e.g. Theorem 15.6 in Billingsley, 1968) to its Bahadur approximation
−1/2 2 2 n (α) = −fX (x)−1/2 (nhprod )−1/2 h−2 K (u) du u B 1 1 1 ×
n
(Xi,1 − x1 )K(h−1 (Xi − x)){I [Yi ≤ kα (Xi )] − α}.
i=1
Proof of Theorem 2.2: Note that α (x, y) − α(x, y) is of order OP ((ng prod )−1/2 + g 2max ). This expansion holds for fixed x and uniformly for y in a compact interval J. We will make use of the following fact. For all sequences c n → 0 and constants δ > 0 it holds that sup
n (β) − B n (α)| = oP (1). |B
δ<α<β<1−δ,β−α
n (α) (see characterizations of tightness in theorem 15.3 This follows easily from tightness of the process B in Billingsley, 1968) and monotonicity of the indicator function I [Y i ≤ k α (X i )] in α. We apply this result with multiples of ((ng prod )−1/2 + g 2max ). This shows that n ( n (α(x, y))| = oP (1). sup |B α (x, y)) − B y∈J
This shows that in the difference of m (x, y) and m(x, y) all terms are of smaller order than mα(x,y) (x) − mα(x,y) (x). Under our smoothness assumptions the function m α (x) has a bounded derivative with respect to α. This implies that
2 sup |mα(x,y) (x) − mα(x,y) (x)| = OP (ngprod )−1/2 + gmax . y∈J
C The Author(s). Journal compilation C Royal Economic Society 2009.
24
S. Hoderlein and E. Mammen
Because of Assumption C6 the right-hand side of the last equation is of order oP ((nh prod h21 )−1/2 ). This shows the statement of the theorem. Proof of Theorem 4.1: For simplicity of exposition, we concentrate on the scalar x case. W.l.o.g. we also assume that A is uniformly distributed on [0, 1]. By definition of k α (x) for δ > 0 0 = P[Y ≤ kα (x ∗ + δ)|X = x ∗ + δ] − P[Y ≤ kα (x ∗ )|X = x ∗ ] = A1 + A2 + A3 ,
(A.1)
where A1 = P[φ(x ∗ + δ, A) ≤ kα (x ∗ + δ)|X = x ∗ + δ] − P[φ(x ∗ + δ, A) ≤ kα (x ∗ )|X = x ∗ + δ] A2 = P[φ(x ∗ + δ, A) ≤ kα (x ∗ )|X = x ∗ + δ] − P[φ(x ∗ + δ, A) ≤ kα (x ∗ )|X = x ∗ ] A3 = P[φ(x ∗ + δ, A) ≤ kα (x ∗ )|X = x ∗ ] − P[φ(x ∗ , A) ≤ kα (x ∗ )|X = x ∗ ]. As in the proof of the theorem in Hoderlein and Mammen (2007) one gets A1 = δ∂x kα (x ∗ )fY |X (kα (x ∗ )|x ∗ ) + o(δ),
(A.2)
A3 = −δE[∂x φ|Y = k(x ∗ ), X = x ∗ ]fY |X (kα (x ∗ )|x ∗ ) + o(δ).
(A.3)
For the term A 2 one gets, by application of Assumption A1 (a)–(c) #
" fA|X (A|x ∗ + δ) X = x ∗ − 1 A2 = E 1[φ(x ∗ + δ, A) ≤ kα (x ∗ )] fA|X (A|x ∗ ) ! = δE 1[φ(x ∗ , A) ≤ kα (x ∗ )]∂x ln fA|X (A|x ∗ )|X = x ∗ + o(δ). From (A.2)–(A.4) we get the statement of the theorem.
(A.4)
Proof of (4.1): The equation follows by essentially the same arguments as (1.2). We only give a sketch of the proof. It holds for δ → 0 δ3
∂ fY ,Y |X (y1 , y2 |x ∗ ) + o(δ 3 ) ∂x 1 2 = P (y1 ≤ φ1 (x ∗ + δ, A) ≤ y1 + δ, y2 ≤ φ2 (x ∗ + δ, A) ≤ y2 + δ|X = x ∗ + δ) −P (y1 ≤ φ1 (x ∗ , A) ≤ y1 + δ, y2 ≤ φ2 (x ∗ , A) ≤ y2 + δ|X = x ∗ ) = 1 + 2 + 3
with 1 = P (y1 ≤ φ1 (x ∗ + δ, A) ≤ y1 + δ, y2 ≤ φ2 (x ∗ + δ, A) ≤ y2 + δ|X = x ∗ + δ) −P (y1 ≤ φ1 (x ∗ + δ, A) ≤ y1 + δ, y2 ≤ φ2 (x ∗ + δ, A) ≤ y2 + δ|X = x ∗ ), 2 = P (y1 ≤ φ1 (x ∗ + δ, A) ≤ y1 + δ, y2 ≤ φ2 (x ∗ + δ, A) ≤ y2 + δ|X = x ∗ ) −P (y1 ≤ φ1 (x ∗ + δ, A) ≤ y1 + δ, y2 ≤ φ2 (x ∗ , A) ≤ y2 + δ|X = x ∗ ), 3 = P (y1 ≤ φ1 (x ∗ + δ, A) ≤ y1 + δ, y2 ≤ φ2 (x ∗ , A) ≤ y2 + δ|X = x ∗ ) −P (y1 ≤ φ(x ∗ , A) ≤ y1 + δ, y2 ≤ φ2 (x ∗ , A) ≤ y2 + δ|X = x ∗ ). C The Author(s). Journal compilation C Royal Economic Society 2009.
Non-separable models without monotonicity
25
Now we have by assumption that 1 = 0. For 3 , we obtain 3 = P (y1 ≤ φ1 + δ∂x φ1 ≤ y1 + δ, y2 ≤ φ2 ≤ y2 + δ|X = x ∗ ) −P (y1 ≤ φ1 ≤ y1 + δ, y2 ≤ φ2 ≤ y2 + δ|X = x ∗ ) y2 +δ y1 +δ−δy1 ∗ fφ1 |∂x φ1 ,φ2 ,X (u1 |y1 , u2 , x ∗ ) du1 fφ2 ,∂x φ1 |X (u2 , y1 |x ) = −∞
y2
− − + =
y1 −δy1
−∞ y1 +δ −∞ y1
=δ
fφ1 |∂x φ1 ,φ2 ,X (u1 |y1 , u2 , x ∗ ) du1 fφ1 |∂x φ1 ,φ2 ,X (u1 |y1 , u2 , x ∗ ) du1
−∞ y2 +δ
y2 3
∂ ∂y1
fφ1 |∂x φ1 ,φ2 ,X (u1 |y1 , u2 , x ∗ ) du1
fφ2 ,∂x φ1 |X (u2 , y1 |x ∗ )δ 2 y1
dy1 du2
∂ fφ |∂ φ ,φ ,X (y1 | y1 , u2 , x ∗ )dy1 du2 ∂y1 1 x 1 2
y1 fφ1 ,∂x φ1 ,φ2 |X (y1 , y1 , y2 |x ∗ )dy1 ) + o(δ 3 ).
Showing a similar expansion for 2 completes the statement.
C The Author(s). Journal compilation C Royal Economic Society 2009.
The
Econometrics Journal Econometrics Journal (2009), volume 12, pp. 26–44. doi: 10.1111/j.1368-423X.2008.00268.x
Assessing the magnitude of the concentration parameter in a simultaneous equations model D. S. P OSKITT † AND C. L. S KEELS ‡ †
Department of Econometrics and Business Statistics, Monash University, Victoria 3800, Australia E-mail:
[email protected] ‡
Department of Economics, The University of Melbourne, Victoria 3010, Australia E-mail:
[email protected] First version received: September 2006; final version accepted: September 2008
Summary This paper provides the practitioner with a method of ascertaining when the concentration parameter in a simultaneous equations model is small. We provide some exact distribution theory for a proposed statistic and show that the statistic possesses the minimal desirable characteristics of a test statistic when used to test that the concentration parameter is zero. The discussion is then extended to consider how to test for weak instruments using this statistic as a basis for inference. We also discuss the statistic’s relationship to various other procedures that have appeared in the literature. Keywords: Admissibility, Concentration parameter, Exact distribution theory, Local-to-zero asymptotics, Weak identification, Weak instruments.
1. INTRODUCTION The analysis of single structural equations typically requires additional information in the form of knowledge of a corresponding reduced form. This knowledge is used to identify and estimate the parameters of interest and such analyses are known to perform poorly when this additional information is weak; see, inter alia, Stock et al. (2002) and Hahn and Hausman (2003) for surveys of the recent extensive literature on weakly identified models. The literature concerned with determining the strength of identifying information falls largely into two categories. The first evaluates the weakness of identification by conducting hypothesis tests on the reduced form, with the null hypothesis implying no relationship, i.e. conducting tests for a lack of identification. In the context of a single endogenous regressor, this amounts to the usual F test. An advantage of the hypothesis testing approach is that it readily generalizes to the case of many endogenous regressors in the structural equation of interest; in particular, the test procedures of Cragg and Donald (1993) are widely used in this context. The first contribution of this paper is to explore, under classical assumptions, the properties of the likelihood ratio test of the null hypothesis that the model is totally unidentified. We present the sampling properties of a simple transformation of the likelihood ratio statistic, designated A2 hereafter. We also C The Author(s). Journal compilation C Royal Economic Society 2009. Published by Blackwell Publishing Ltd, 9600 Garsington Road,
Oxford OX4 2DQ, UK and 350 Main Street, Malden, MA, 02148, USA.
Assessing the concentration parameter
27
demonstrate that critical regions based on A2 yield an admissible test that has a power curve which is monotonically increasing in each of the maximal invariants. In addition we discuss the relationship between A2 and various other statistics that have appeared in the literature. The second category explores goodness-of-fit measures from the reduced form model. These measures examine the explanatory power of the reduced form equation for the endogenous regressor in the equation of interest. Such considerations typically lead to some notion of a partial R2 measure such as those found in Bound et al. (1995) and Shea (1997) (see also Godfrey, 1999). As demonstrated by Buse (1992) and Bound et al. (1995) such measures are closely related to the F test described above. A difficulty with such measures is that if one needs to account for more than one endogenous regressor then it is not clear how these measures should be generalized. Consequently, this approach has been largely restricted to the case of a single endogenous regressor. The second contribution of this paper is to discuss the relative merits of two simple measures that are applicable in the multiple endogenous regressor case. The first of these is A2 , which, following Hotelling (1936), can be interpreted as a measure of vector alienation. The second, denoted R2 , can be interpreted as a measure of vector correlation and is seen to be one possible generalization of the partial R2 of Bound et al. (1995). Following the developments leading to the sampling distribution of A2 we also provide the corresponding result for R2 . As described above, it has been commonplace to check for weak identification by implicitly testing the simple null hypothesis that a particular parameter of interest is equal to zero. (At the risk of getting ahead of ourselves, the parameter in question is the so-called concentration parameter, which we define in Section 2.) Rather than representing weak identification, however, the hypothesis that the concentration parameter equals zero corresponds to a complete lack of identification, and so it can be argued that such tests are inappropriate as they are focussed at a slightly different problem. As observed by Stock and Yogo (2005), weak identification might best be thought of as being characterized by the concentration parameter being non-zero but in close proximity to zero. The construction of a measure of the relative strength of identification, based on an evaluation of the magnitude of the concentration parameter, therefore seems more germane to the problem of assessing the presence of weak identification. In a weakly identified model the concentration parameter is in a neighbourhood of zero, rather than being identically equal to zero, and so the hypotheses of interest are composite rather than simple. In the scalar case of a single endogenous regressor acceptance and rejection regions are difficult to formulate because it is not obvious how to define the relevant neighbourhood. With multiple endogenous regressors the concentration parameter is a matrix and so the problem is further complicated by the need for a metric with which one might measure the proximity to the origin of a non-zero matrix. In this context, Stock and Yogo (2005) use an ingenious argument in which they replace the multidimensional hypothesis of interest by a different hypothesis that is couched in terms of a bound on the inferential implications of the concentration parameter being small. 
This new hypothesis is then mapped into an equivalent hypothesis about a scalar parameter, the smallest eigenvalue of the concentration parameter. They examine the magnitude of this root using a modified version of a Cragg and Donald (1993) test. The third contribution of this paper is to propose a test procedure, in a similar spirit, which exploits the information contained in all of the roots of the matrix. The structure of the paper is as follows. Section 2 describes the model of interest, our initial assumptions, and notational conventions. In Section 3, we focus on the totally unidentified model. Consideration of the likelihood ratio statistic leads to the introduction of A2 and discussion of C The Author(s). Journal compilation C Royal Economic Society 2009.
28
D. S. Poskitt and C. L. Skeels
its sampling properties. We also consider R2 as a basis for comparison. Section 4 is devoted to establishing an analytical justification for the proposition that A2 is a more useful measure of the magnitude of the concentration parameter than are some others currently in use. We appeal to some existing simulation results to provide further support for this proposition. The focus of Section 5 is weak identification. Here we construct a test of weak instruments based on the statistic A2 and provide a local-to-zero asymptotic approximation to the distribution of the test statistic. We provide some simulation results that present a comparison of the behaviour of the test based on A2 to the procedure based on the modified Cragg and Donald (1993) test as originally proposed by Stock and Yogo (2005). The final section summarizes the paper’s contributions. Proofs are assembled in the Appendix.
2. THE MODEL, NOTATION AND ASSUMPTIONS Consider the classical structural equation model y = Yβ + Xγ + u , with corresponding reduced form
π 1 1 [y Y] = [X Z] π 2 2
(2.1)
+ [v V] ,
(2.2)
where the endogenous variables y and Y are T × 1 and T × n, respectively, [v V] is partitioned conformably with [y Y], and the predetermined variables X and Z are T × k and T × ν, respectively, with K = k + ν. We suppose, without loss of generality, that [X Z] has full column rank almost surely, ρ{[X Z]} = K, so that X and Z are essentially disjoint. The probabilistic structure of the model is encapsulated in the following assumption specifying a Gaussian reduced form (GRF) although, as we shall see below, normality is not essential to our needs. A SSUMPTION GRF: The T × n matrix [v V] is distributed as N (0, ⊗ I T ) given [X Z], denoted [v V]|[X Z] ∼ N(0, ⊗ I T ), so that the rows of [v V] are independent with common (n + 1) × (n + 1) covariance matrix ω11 ω12 . = ω21 22 The usual compatibility conditions hold, namely π 1 = 1 β + γ , π 2 = 2 β and u = v − Vβ ∼ N(0, σ 2u I T ), with σ 2u = [1, −β ] [1, −β ] . Adopting the notational convention R A = I T − P A where P A = A(A A)−1 A , define P = R X Z (Z R X Z)−1 Z R X , with rank ρ{P} = ν and PX = 0. Now, let = (1/2 ) 1/2 , where 0 ω 0 ω−1 1/2 −1/2 = = , so that (2.3) −1/2 1/2 −1/2 , −−1 22 ω21 22 22 ω21 /ω 22 with ω2 = ω 11 − ω 12 −1 22 ω 21 and 22 is the symmetric square root of 22 . Then 1/2
S = [y Y] P[y Y]|[X Z] ∼ Wn+1 (ν, , ) ,
(2.4)
C The Author(s). Journal compilation C Royal Economic Society 2009.
Assessing the concentration parameter
29
where = (−1/2 ) [β I n ] [β I n ]−1/2 and = 2 Z PZ 2 = 2 Z R X Z 2 . 1 The matrix variate S plays a central role in the theory of structural equation estimators and the sampling properties of functions of S are critically dependent upon the value of . 2 For example, the two stage least squares (TSLS) estimator of β, (Y PY)−1 Y Py, is only a function of the elements of S, as is the limited information maximum likelihood (LIML) estimator. It is that is commonly referred to as the concentration parameter, and it is the proximity of to zero that is the focus of our attention because it is well known by now that small values of have serious implications for inference. 3
3. A LIKELIHOOD RATIO STATISTIC Consider the reduced form equation for Y obtained from (2.2), Y = X1 + Z2 + V ,
(3.1) †
and its associated assumptions. The likelihood ratio statistic for testing the null hypothesis H0 : † 2 = 0 against the alternative that H1 : 2 = 0 in (3.1) is det[Y (RX − P)Y] T /2 det[Y R[X Z] Y] T /2 = . LR = det[Y RX Y] det[Y RX Y] Noting that Y PY can be thought of as a sample analogue of 2 Z R X Z 2 , in as much as plim T →∞ T −1 Y PY − 2 Z R X Z 2 = 0, suggests that the proportion of the generalized variance of R X Y explained by R X Z will be small whenever is small. This, in turn, suggests that LR can be employed to test the statistical significance of . Indeed, since we have assumed that ρ{[X Z]} = K, it follows that Z R X Z > 0 and ρ{R X Z} = ρ{Z} = ν almost surely. 4 Therefore, † † † testing H0 against H1 using LR, where small values of LR constitute evidence against H0 , is equivalent to testing H0 : = 0 against the alternative H1 : 0. Rather than work with LR directly a monotonic transformation of it will prove to be more convenient for our purposes, in particular (3.2) A2 = [LR]2/T = det In − (Y RX Y)−1/2 (Y PY)(Y RX Y)−1/2 . The statistic A2 is a partial version of the vector alienation coefficient introduced by Hotelling (1936), hence our notation. We see that A2 = 1 when R X Y and R X Z are orthogonal and A2 = 0 if there exists a matrix D of full column rank such that R X Y = R X ZD. Thus A2 can be viewed as a measure of the perpendicularity between Y and Z having adjusted for the effects of X, it represents the proportion of the generalized variance of R X Y that remains once the regression mean square in the multivariate regression of R X Y on R X Z has been accounted for. Interestingly 1 The notation S ∼ W n+1 (ν, , ) should be read as follows: The (n + 1) × (n + 1) symmetric matrix S has a non-central Wishart distribution with ν degrees of freedom, covariance matrix , and non-centrality parameter . 2 See, inter alia, Phillips (1983, 465–6), where S is variously denoted A(P ) and W. Z 3 Treatments of some of the issues associated with ‘weak instruments’ appeared in the literature as early as the 1970s; see Phillips (2006) for an interesting historical perspective. We shall defer discussion of weak instruments until Section 5. 4 We adopt here the notational convention that A > 0 denotes a positive definite matrix. We shall also use the notation A 0 to denote a non-zero positive semi-definite matrix and note that {A: A > 0} ⊂ {A : A 0}. C The Author(s). Journal compilation C Royal Economic Society 2009.
30
D. S. Poskitt and C. L. Skeels
enough, whereas the more conventional approaches to assessing the magnitude of are based upon an examination of correlations—see, for example, the partial R2 statistics of Bound et al. (1995) and Shea (1997) (see also Godfrey, 1999), and the asymptotic test procedures considered by Cragg and Donald (1993) and Hall et al. (1996)—the likelihood ratio statistic focuses its attention on alienation. Before considering some other statistics, we have the following distributional result for A2 which provides a basis for inference: L EMMA 3.1. Assume that equation (2.2) and its accompanying assumptions obtain. Then, under H0 : = 0, the statistic A2 possesses Wilks’- distribution, with parameters n, T − K and ν, written A2 ∼ (n, T − K, ν). If we adopt an hypothesis testing perspective, then it follows that the set CR{A2 , α} = {A2 : A2 < λα (n, T − K, ν)} , where λ α (n, T − K, ν) denotes the α · 100% percentile point of the (n, T − K, ν) distribution, defines a size α critical region for testing H0 against H1 . Equally, we might use p-values of the form p0 = P((n, T − K, ν) < A2 ) to provide a probability scale indicative of the magnitude of . 5 We imagine that in many situations such a scale may be of greater practical relevance than is a test that is identically equal to zero. For models with more than one endogenous regressor, a natural generalization of the univariate partial R2 of Bound et al. (1995) is a partial version of Hotelling’s coefficient of vector correlation det[Y PY] = ri2 , det[Y RX Y] i=1 n
R2 =
(3.3)
where r 21 ≥ . . . ≥ r 2n lists in descending order the partial canonical correlations between Y and Z having adjusted for the effects of X. Similarly, from (3.2), A2 =
n
1 − ri2 .
(3.4)
i=1
Following developments which parallel those leading to Lemma 3.1, it is straight-forward to show that, when = 0, R2 ∼ (n, ν, T − K). It is important to recognize, however, that in general R2 = 1 − A2 and so probability calculations based on A2 and R2 will not necessarily 5 Wilks’- distribution is the same as that of a product of independent beta random variables, see Wilks (1962, section 18.5.1): n T −K+1−i ν
, 2 ν ≥ n, i=1 B 2 (n, T − K, ν) ∼ ν
T −K−n+i B , n2 otherwise. i=1 2
We note that this distribution is difficult to calculate directly and that a number of approximations to it are available. For a survey of some computational aspects of Wilks’- distribution the interested reader is referred to Poskitt and Skeels (2004, section 4.2); see also Anderson (2003, section 8.5.2). C The Author(s). Journal compilation C Royal Economic Society 2009.
Assessing the concentration parameter
31
be equivalent. This raises the question of which measure is most appropriate for our current needs. From expressions (3.3) and (3.4) it follows that it is only necessary for the largest (smallest) partial canonical correlation to deviate substantially from zero (one) for A2 (R2 ) to deviate significantly from unity. Thus, whereas A2 will be sensitive to departures from orthogonality R2 is designed to detect exact correlation. Now recall that the use of Wilks’- distribution is contingent on being equal to zero, which we have already observed is equivalent to the hypothesis that R X Y and R X Z are uncorrelated. Therefore, A2 is more in accord with the basic assumption underlying the application of Wilks’- distribution than is R2 . Hence, A2 appears to be more suited to detecting departures of from 0.
4. OPTIMALITY PROPERTIES Given that, for the purposes of testing, A2 is equivalent to LR, we can anticipate that A2 will inherit any desirable properties of the likelihood ratio test. In order to show this, we need to set up our model slightly differently so that our hypotheses of interest are couched in terms of the perpendicularity, or alienation, between R X Y and R X Z. To formalize this idea let us assume, in a natural extension of equation (3.1), that Z = X 3 + U 2 and [Y Z] = X[(1 + 3 2 ) 3 ] + [U1 U2 ] ,
(4.1)
where the conditional distribution of U = [U 1 U 2 ] = [(V + U 2 2 ) U 2 ] given X is Gaussian with mean zero and variance–covariance ⊗ I T , U|X ∼ N (0, ⊗ I T ). If we regard (4.1) as a specification for the joint distribution of [Y Z] conditional on X then we can contemplate testing that the instruments are orthogonal to the endogenous regressors by testing H0 : 12 = 0 against the alternative that H1 : 12 = 0, where 12 denotes the covariance between U 1 and U 2 . The likelihood ratio statistic for testing H0 against H1 is, again, LR. This is to be expected, of course, because if H0 is true in (4.1) then Z does not appear in the conditional mean of Y given Z and † X, implying that 2 = 0, so that H0 and H0 are also true. Under this formulation of the model we can establish the following result. T HEOREM 4.1. Assume that equation (4.1) and its accompanying assumptions hold, and let ρ 21 ≥ . . . ≥ ρ 2n , denote the (population) canonical correlations, the eigenvalues of −1/2 −1/2 2 11 12 −1 22 21 11 . Then the statistic A yields an admissible test, invariant under the group of linear transformations (Y, Z) → (YG, ZH) where G and H are arbitrary non-singular matrices of dimension n × n and ν × ν, respectively. Moreover, the power function of A2 is monotonically increasing in each of the maximal invariants ρ 2i , i = 1, . . . , n. Theorem 4.1 indicates: First, that within the class of tests invariant under the group of transformations (Y, Z) → (YG, ZH) there is no test which dominates A2 in the sense of having better power for at least one point in the parameter space and no less power elsewhere. Second, that the power function of A2 depends upon all of the maximal invariants and will, therefore, be sensitive to any departures from the null. These are not especially strong properties but even these are not possessed by other test procedures that have been proposed in the literature. Another statistic of interest, seemingly related to A2 , is that version of the Cragg and Donald (1993) procedure for testing the rank of 2 that is ‘concerned with whether X 2 (Z) can serve as instruments for Y 2 (Y) in the sense that there is enough correlation’ is given by (in the notation of this paper) the smallest eigenvalue of Y PY in the metric of Y R [X Z] Y; see hypothesis H 0I C The Author(s). Journal compilation C Royal Economic Society 2009.
32
D. S. Poskitt and C. L. Skeels
and Theorem 3 of Cragg and Donald (1993). That is, the Cragg and Donald (1993) statistic is n where 1 ≥ . . . ≥ n > 0 are the roots of the equation det[Y PY − Y R[X Z] Y] = 0. Note that det[Y PY − Y R[X Z] Y] =
det[(1 + )Y RX Y] × det (Y RX Y)−1/2 Y PY(Y RX Y)−1/2 −
In . 1+
Writing /(1 + ) = r 2 we see that the roots r 21 ≥ . . . ≥ r 2n of det[(Y RX Y)−1/2 Y PY(Y RX Y)−1/2 − r 2 In ] = 0 are the sample analogues of ρ 21 ≥ . . . ≥ ρ 2n . Hence we can conclude that this version of Cragg and Donald’s statistic is equivalent to testing the significance of the smallest canonical correlation. Hall et al. (1996) have also advocated using the smallest canonical correlations between R X Y and R X Z to assess the relevance of the instruments for the estimation of β; see also Bowden and Turkington (1984, Section 2.3). If we are interested in looking for evidence that is small then our previous discussion suggests that we should examine those linear combinations R X Y and R X Z that yield evidence in favour of the hypothesis that R X Y and R X Z are uncorrelated. The Union-Intersection principle of Roy (1957) indicates that this ultimately leads to a procedure akin to those considered by Cragg and Donald (1993) and Hall et al. (1996), except that we should examine the size of the largest rather than the smallest canonical correlation. It is of interest to note that Theorem 8.10.4 of Anderson (2003) implies that whereas Roy’s maximum root test with an acceptance region of the form {r 21 : r 21 ≤ κ α } is admissible, in the sense that it cannot be improved upon by reducing the probability of Type I and/or Type II errors, the minimum root test with acceptance region {r 2n : r 2n ≤ κ α } is not admissible. That this result might be so is easy to see heuristically if one considers the diagonal matrix diag [106 , 1, 10−6 ], say. Consideration of the smallest eigenvalue of this matrix might lead one to conclude that it is close to zero whereas consideration of the largest root (or all of the roots) would probably lead to a different assessment of its magnitude. Roy’s maximum root test is invariant under the group of non-singular linear transformations prompting comparison with CR{A2 , α}, which is also invariant and admissible under this group of transformations. Simulation results presented in Schatzoff (1966), based on an experimental design structured in terms of the maximal invariants, indicate that although Roy’s maximum root test will have good power in alternative directions where ρ 21 > 0 and ρ 22 = . . . = ρ 2n = 0, its performance will be inferior to that of A2 in more general directions. Indeed, Schatzoff goes so far as to suggest that “. . . the largest root should not be used except to test specifically against an alternative of rank one” Schatzoff (1966, p. 429). He concludes that the test based on Hotelling’s alienation coefficient is to be preferred. An obvious advantage of using A2 as a basis for calibrating the magnitude of the concentration parameter is that it does not focus on a particular canonical correlation but summarizes the simultaneous impact of all ρ 2i , i = 1, . . . , n, suggesting that A2 will be sensitive to deviations of from zero in all possible directions.
5. WEAK INSTRUMENTS When testing H0 : = 0 against H1 : 0 one is formally seeking to distinguish between a completely unidentified model and an identified, or possibly partially identified, model, along the lines of Cragg and Donald (1993). However, as pointed out by Stock and Yogo (2005) for example, the weak instrument problem is typically construed as being symptomatic of a situation C The Author(s). Journal compilation C Royal Economic Society 2009.
Assessing the concentration parameter
33
where the model is identified, but with a small, non-zero, concentration parameter. Consequently, even if one treats the test procedures of the previous developments as pure significance tests, they are not appropriately focussed if one is concerned with the weak instruments problem since it is clear that can be deemed statistically significantly different from zero, and yet the instruments may still be weak. The hypothesis of greater practical interest in the context of weak instruments is that > 0, but is possibly, in some sense, close to zero. Thus, the problem of interest can be characterized 1/2 as one of testing the null hypothesis H 0 : ≥ 0 > 0, 0 ≥ η > 0, against the alternative 6 H 1 : ≺ 0 . In accord with the analysis of Stock and Yogo (2005) we shall consider this problem using the local-to-zero (LTZ) asymptotics of Staiger and Stock (1997). We adopt the following assumptions from Stock and Yogo (2005, p. 85): √ A SSUMPTION LTZ1: 2 ≡ 2T = C/ T , where C is a fixed ν × n matrix. A SSUMPTION LTZ2: The following limits hold jointly for fixed ν:
p (a) (T −1 u u, T −1 V u, T −1 V V) → σu2 , Vu , VV , with 1 −β 1 0 σu2
uV
= = ;
Vu VV 0 In −β In respectively; (b) Letting [Xt Zt ] denote the tth row of [X Z], T
−1
[X Z]
p
[X Z] → E[Xt
Zt ] [Xt
Zt ]
QXX QXZ = = Q; QZX QZZ
D
(c) T −1/2 [X Z] [u V] → ∼ N(0, ⊗ Q). Standard asymptotic results are driven by the fact that as T → ∞, the non-centrality −1/2 −1/2 parameter 22 = 22 22 diverges. 7 The point of Assumptions LTZ1 and LTZ2 is that, as T → ∞, 22 = O(1) which, in turn, drives the non-standard results obtained under local to zero asymptotics. Under these assumptions, Stock and Yogo (2005) show that the statistic GT = (T − K)(Y R[X Z] Y)−1/2 Y PY(Y R[X Z] Y)−1/2 D
→ Wn (ν, In , plim22 ) , −1/2
(5.1) −1/2
where plim 22 = 22 C [Q ZZ − Q ZX Q−1 XX Q XZ ]C22 , and they suggest the use of the smallest eigenvalue of ν −1 G T , namely
r2 T −K T −K n , gmin =
n = (5.2) ν ν 1 − rn2 a re-scaled version of the Cragg and Donald statistic, as a basis for inference. 6 It may be argued that the restriction > 0 is unnecessarily restrictive as it precludes partially identified models from 0 consideration. A less restrictive null hypothesis might therefore be H 0 : ≥ 0 0, although we shall not pursue this further. 7 See the discussion of Rothenberg (1984). We observe that is a sub-matrix of the parameter appearing in (2.4). 22 C The Author(s). Journal compilation C Royal Economic Society 2009.
34
D. S. Poskitt and C. L. Skeels
Stock and Yogo (2005) observe that the local-to-zero asymptotic distribution of g min will depend upon all of the eigenvalues of plim 22 or, equivalently, the eigenvalues of plim = C [Q ZZ − Q ZX Q−1 XX Q XZ ]C in the metric of 22 , δ 1 ≥ δ 2 ≥ · · · ≥ δ n . They argue that this dependence so complicates the process of obtaining critical values as to ‘produce an infeasible test’ (Stock and Yogo 2005, p. 98). To address this problem they propose conservative critical values x which satisfy the relationship P(gmin ≥ x) ≤ P(χ 2 (ν, δmin ) ≥ νx),
(5.3)
where δ min , in theory, equals δ n and χ 2 (ν, δ min ) denotes a random variable with a noncentral chi-squared distribution with ν degrees of freedom and non-centrality parameter δ min . At first glance δ min is a nuisance parameter and so the test proposed by Stock and Yogo (2005) is non-similar. An ingenious aspect of their proposal is that they provide alternate characterisations of the minimum eigenvalue in terms of asymptotic bias and size distortion (Stock and Yogo, 2005, section 3). Thus, in order to construct the test, δ min is assigned by the practitioner by specifying either the maximum asymptotic estimation bias, or the maximum Wald test size distortion, that they are prepared to live with. The test procedure is then implemented by rejecting the null hypothesis that the instruments lie in the so-called ‘weak instrument set’ if ν × g min exceeds the 100(1 − α)% percentile of χ 2 (ν, δ min ). Let us now consider how we might use A2 to detect the presence of weak instruments. First, we have the following result: D
L EMMA 5.1. Under Assumption LTZ2, Y PY → Wn (ν, 22 , 22 ) and, as T → ∞, plim(T − K)−1 Y R [X Z] Y = 22 and plimT −1 Y PY − 2 Z R X Z 2 = 0. The limiting distribution given in (5.1) follows as an immediate consequence of Lemma 5.1 of course. A second implication of Lemma 5.1 is that plim| i − λ i /T | = 0, where λ 1 ≥ n 1/2 λi ≤ λ 2 ≥ · · · ≥ λ n , are the eigenvalues of 22 . Now, from the inequality 22 = i=1 −1/2 1/2 be bounded above and small if < η; we can 22 · 1/2 we can see that the λ i will therefore anticipate that in this case A2 = ni=1 (1 + i )−1 will take on a value towards the upper 1/2 end of its range. This suggests that a test of H 0 : ≥ 0 > 0, 0 ≥ η > 0, against H 1 : ≺ 0 can be obtained by constructing a critical region of the form {A2 : A2 > aα } where a α is chosen so as to produce a given level of significance α. Another corollary of Lemma 5.1 is that the asymptotic behaviour of Y PY and (T − −1 K) Y R [X Z] Y under Assumption LTZ2 coincides with the large sample properties obtained under Gaussian assumptions. This follows by observing that, under Assumption GRF, Y PY|[X Z] ∼ Wn (ν, 22 , 22 ) for all T and Y R[X Z] Y ∼ Wn (T − K, 22 ). The latter implies that E[Y R[X Z] Y] = (T − K)22
and
Var[Y R[X Z] Y] = (T − K)(Iν 2 + Kνν )(22 ⊗ 22 ) where K νν is the ν 2 × ν 2 commutation matrix. We can therefore conclude, by Chebychev’s inequality, that plim (T − K)−1 Y R [X Z] Y = 22 as T → ∞, and hence the stated coincidence. This corollary leads ultimately to the following result. C The Author(s). Journal compilation C Royal Economic Society 2009.
Assessing the concentration parameter
35
T HEOREM 5.1. Under Assumption LTZ2 the distribution function of −τ log A2 can be approximated by
1 (n + ν + 1)μ1 P χ 2 (f + 2, μ1 ) ≤ x PT −τ log A2 ≤ x = P χ 2 (f , μ1 ) ≤ x + 4τ
− [(n + ν + 1)μ1 − μ2 ] P χ 2 (f + 4, μ1 ) ≤ x 2
(5.4) − μ2 P χ (f + 6, μ1 ) ≤ x , where the Bartlett correction factor τ = T − (n+ ν + 1)/2, the degrees of freedom parameter f = nν, the non-centrality parameter μ1 = ni=1 λi and μ2 = ni=1 λ2i . Furthermore, |P(−τ log A2 ≤ x) − PT (−τ log A2 ≤ x)| = O(T −2 ) uniformly in x. The null hypothesis of interest has been couched in terms of the norm of which, under H 0 , is bounded below by η. To construct an appropriate test and calculate the level of significance, however, we need to use Theorem 5.1, wherein the distribution is expressed in terms of the eigenvalues λ 1 , . . . , λ n . In order to link these quantities together, note from the inequality 1/2 1/2 1/2 n λi that η corresponds to a set of equivalent λ = 1/2 ≤ 22 · 22 = 22 i=1 (λ 1 , . . . , λ n ) and, conversely, a given λ 0 = (λ 10 , . . . , λ n0 ) implies a given η. Now set CRWI {A2 , α, λ} = {A2 : −τ log A2 ≤ aα (λ)} where a α (λ) denotes the 100α% percentile of the asymptotic distribution of −τ log A2 calculated from (5.4), i.e. PT (−τ log A2 ≤ aα (λ)) = α. P ROPOSITION 5.1. If Assumption LTZ2 holds, then for any given λ0 , CRWI {A2 , α, λ0 } defines an asymptotic critical region of size α that yields an unbiased test of H 0 : ≥ 0 > 0, 1/2 1/2 n λi0 . Moreover, no matter 0 ≥ η > 0, against H 1 : ≺ 0 , where η = 22 i=1 how small the λ i0 , i = 1, . . . , n, the test based on CRWI {A2 , α, λ0 } is consistent against weak instruments characterized by Assumption LTZ1. In order to implement Proposition 5.1 it only remains for the practitioner to designate the null value λ 0 . This choice is similar in some ways to the requirement to specify δ min by assigning maximum values to either asymptotic bias or size distortion when applying Stock and Yogo’s procedure. We would argue, however, that the requirements for implementing the test based on A2 are less than those implicit in the aforementioned relationships leading to δ min . Both of the latter are complicated approximations and it is difficult to imagine anybody developing a strong intuition as to the appropriate parameter choices. The difficulties in doing so increase dramatically as n increases because of the number of parameters involved. In contrast, the elements of λ 0 can be readily characterized via the magnitudes of the maximal invariants, namely the canonical correlations, according to λ i0 = T ρ 2i0 /(1 − ρ 2i0 ), i = 1, . . . , n. Although the choice may be somewhat arbitrary, we believe that most would understand what it means to specify a ‘small’ correlation. It is easy, for example, to envisage a practitioner choosing λ 0 by simply setting ρ i0 = c, i = 1, . . . , n, for some suitably small value of c. Proposition 5.1 tells us that the test will be consistent against weak instruments no matter how small is c. Of course the finite sample properties of the test will vary with the choice of λ 0 but this is an issue that simulation studies can shed some light on as the maximal invariants are portable from problem to problem, application to application. In Figures 1 and 2 we illustrate the results presented above by graphing the simulated power of A2 and g min for n = 2 and ν = 5. The power curves in Figure 1 and power surfaces in C The Author(s). Journal compilation C Royal Economic Society 2009.
36
D. S. Poskitt and C. L. Skeels 1
Relative Frequency of Rejection
0.9 0.8
A2
gmin
0.7 0.6 0.5 0.4 0.3 0.2 0.1 0
0
0.05
0.1
0.15
0.2
0.25
ρ2
Figure 1. Simulated power curves.
Figure 2 were constructed using the canonical representation, as employed in the proof of Theorem 4.1, and are expressed as functions of the canonical correlations. The outcomes are based on 10,000 replications, with a sample size T = 200, and both tests were implemented using a nominal level of significance of 0.05. When calculating critical values for A2 we used the null values λ i0 = T ρ 2i0 /(1 − ρ 2i0 ), i = 1, 2, where ρ 210 = ρ 220 = ρ 20 , and for g min we equated δ min to the theoretical minimum T ρ 20 /(1 − ρ 20 ). For the purposes of these experiments we have chosen ρ 20 = 0.1. √ The choice of ρ 20 deserves some elaboration. A correlation coefficient ρ0 = 0.1 ≈ (±)0. 3 is something that a practitioner might plausibly think of as ‘small’. In our particular example, where T = 200 and n = 2, values of ρ 20 ≥ 0.1 map into instrument sets where the relative bias of TSLS and LIML is less than 5%, and the size distortion of both estimators is less than 10%, again something a practitioner may view as acceptable. Less stringent bounds would have followed had a value for ρ 20 < 0.1 been chosen. Such a choice might be guided by appeal to the boundary eigenvalue functions presented by Stock and Yogo (2005, Figures 5.1 and 5.2). The profiles of these functions indicate, however, that for combinations of T ≥ 200 and n ≤ 20, the choice ρ 20 = 0.1 will generate a null value for the smallest eigenvalue that exceeds the supremum of the boundary eigenvalues graphed. This suggests that a simple and intuitively appealing choice such as ρ 20 = 0.1 may yield an acceptably stringent test, in the sense of imposing tight limits on estimation bias and Wald test size distortion, in a wide variety of circumstances. We first consider in Figure 1 the special case ρ 21 = ρ 22 = ρ 2 (say). Here ρ 2 = ρ 20 = 0.1 designates the boundary between the test statistics’ null and alternative parameter sets. For A2 the null set is {ρ 2 : ρ 2 ≥ ρ 20 = 0.1} and the power curve for A2 clearly exhibits the theoretical properties presented in Proposition 5.1. For the statistic g min the null set is {ρ 2 : ρ 2 ≤ ρ 20 = 0.1} and it is apparent that the probability bound given in (5.3) is extremely conservative and C The Author(s). Journal compilation C Royal Economic Society 2009.
Assessing the concentration parameter
37
Figure 2. Contour plots of simulated power surfaces.
results in the test being significantly undersized and biased. Having said that, it is also clear that the rejection rate for g min increases steadily to unity as the true value of ρ 2 increases and the eigenvalues λ 1 = λ 2 = T ρ 2 /(1 − ρ 2 ) pass through the ‘weak instrument set’ and beyond into the alternative parameter set. C The Author(s). Journal compilation C Royal Economic Society 2009.
38
D. S. Poskitt and C. L. Skeels
Since we are considering the special case in which the canonical correlations are equal, this means that, bar the common boundary value ρ 20 = 0.1, the null and alternative parameter sets of A2 and g min are complements of each other. This raises the question of exactly how we might compare their performance. One possibility, following the suggestion of Stock and Yogo (2005, Section 5), is to think of each weak instrument test as a decision rule and calculate the probability of making a correct decision. This amounts to calculating the power under the alternative and the operating characteristic function under the null (Wilks 1962, Section 13.1). For A2 , for example,
n n P A2 ∈ CRWI {A2 , α, λ0 } , λi,0 , i=1 λi < 2
n ni=1 P CA2 (λ) = 2 1 − P A ∈ CRWI {A , α, λ0 } , i=1 λi ≥ i=1 λi,0 . Defining P Cgmin (λ) in a corresponding fashion we can then calculate ζ (ξ ) = P CA2 (λ) − P Cgmin (λ) dλ . ξ
Evaluating this for Figure 1 when ξ = {λ : λ 1 = λ 2 = λ, 0 ≤ λ ≤ 85.7143} gives ζ (ξ ) = 0.2059, indicating that the weak instrument test based on A2 is about 20% more likely to lead to the correct action than is the test based on g min . 8 The structure of the data generating mechanisms underlying the results presented in Figure 1 might be considered as providing the most favourable case for the statistic g min because the concentration parameter is uniquely determined by the single-repeated eigenvalue. Figure 2 presents simulation results that extend the structure of the data generating mechanisms considered previously by allowing the canonical correlations, and hence the eigenvalues, to be different. 9 In this case the null and alternative hypotheses of A2 and g min are no longer complements of each other. For the canonical variates upon which the current simulations are based, the null set of A2 corresponds to the set
2 2
ρ12 ρ22 η2 ≥ ρ1 , ρ2 : + T 1 − ρ12 1 − ρ22 where the lower bound
ρ02 λ10 + λ20 η2 = 0.2222 . = =2 T T 1 − ρ02
The ‘weak instrument set’ of the test based on g min corresponds to {(ρ 21 , ρ 22 ) : 0 < ρ 21 ≤ ρ 20 = 0.1} ∪ {(ρ 21 , ρ 22 ) : 0 < ρ 22 ≤ ρ 20 = 0.1}. The null set of each test is represented in Figure 2 by the light grey shaded region. Examination of the rejection frequencies of A2 in Figure 2(a) indicates that, as in Figure 1, the test clearly exhibits the theoretical properties outlined in Proposition 5.1. Its power increases rapidly in value as the instruments become weaker, with contour lines roughly parallel to ρ12 + ρ22 = const. as ρ 21 and ρ 22 both approach the origin. On the other hand, the values of the power of 8 We have set the upper bound λ = T ρ 2 /(1 − ρ 2 ) = 200(0.3/0.7) ≈ 85.7143 because, roughly speaking, for any ρ 2 > 0.3 the values of P CA2 (λ) and P Cgmin (λ) are virtually indistinguishable. 9 The rejection frequencies of Figure 1 are the cross-sections obtained by traversing the 45◦ -lines in the respective panels of Figure 2.
C The Author(s). Journal compilation C Royal Economic Society 2009.
Assessing the concentration parameter
39
A2 for those (ρ 21 , ρ 22 ) pairs that lie inside the null set indicate that the operating characteristic of the test is at least 0.95. From the power surface of g min in Figure 2(b) we can see that the test is largely behaving as might be anticipated. Recall from Figure 1 that g min was both undersized and biased. This manifests itself here in the fact that the 0.05-contour lies outside of the ‘weak instrument set’: that the point (0.1,0.1) lies to the south-west of the 0.05-contour indicates that the test is undersized; all points in the region bounded from above by the 0.05-contour and to the left and from below by the lines ρ 21 = 0.1 and ρ 22 = 0.1, respectively, are points in the parameter space where the power of g min is less than its nominal size. The observed size accords more closely with the nominal significance level as one moves along the lines ρ 21 = 0.1 and ρ 22 = 0.1 away from the origin, suggesting that the approximation in (5.3) improves as the larger eigenvalue increases and the eigenvalues become more separated. The statistic g min has significant power when both ρ 21 and ρ 22 are large and a long way away from the ‘weak instrument set’. In Figure 2(a) the contours are essentially parallel and perpendicular to the 45◦ -line, whereas those in Figure 2(b) are more parallel to the coordinate axes, and this difference merits further comment. The contours in Figure 2(a) show that the power of A2 is roughly constant along paths in the parameter space where is constant. By way of contrast, the top-left and bottomright regions of Figure 2(b) represent parameter combinations where the test based on g min has virtually no power at all. This is because the test only examines the smallest eigenvalue and is unable to reject the null of weak identification regardless of the magnitude of . Consider, for example, the points (ρ 21 , ρ 22 ) = (0.05, 0.25) and (ρ 21 , ρ 22 ) = (0.25, 0.05). At both of these points λ 2 = 200(0.05/(1 − 0.05)) = 10.5263. 10 The test statistic g min is therefore correctly identifying these as points in the parameter space that have a minimum eigenvalue smaller than the upper bound specified in the statistic’s null set, λ 20 = 22.2222. However, at the points (0.05, 0.25) and (0.25, 0.05)
1/2
=
0.05 0.25 + 200 = 8.786 , 1 − 0.05 1 − 0.25
1/2
and so 1/2 > 0 = η = 6.6666. Moreover, the magnitude of the concentration parameter at these two points is also larger than that obtained at (ρ 21 , ρ 22 ) = (0.125, 0.125)—where 1/2 = 7.5593—a point outside the test’s own null set! To assign these two points to a ‘weak instrument set’, as g min does, therefore seems inappropriate. This particular feature, to our mind, vividly illustrates the limitations of constructing a test procedure based on a single eigenvalue rather than a procedure, such as the test based on A2 , that depends upon all of the maximal invariants. Finally, when the parameter space is partitioned into null and alternative sets along the 1/2 boundary {(ρ 21 , ρ 22 ) : = 0 = 6.6666}, and ζ (ξ ) is calculated using ξ = {λ : 0 ≤ λ 1 , λ 2 ≤ 85.7143}, the value obtained is 0.5818. This suggests that the weak instrument test based on A2 is about 58% more likely to lead to the correct action than is the test based on g min , providing further compelling evidence in support of the test procedure based on A2 .
10
Here λ 2 is the smallest eigenvalue of 22 which, in this case, is a 2 × 2 matrix.
C The Author(s). Journal compilation C Royal Economic Society 2009.
40
D. S. Poskitt and C. L. Skeels
6. CONCLUSION This paper seeks to address three distinct but related questions about the concentration parameter, , associated with a linear simultaneous equations system. What is an appropriate measure with which to assess the magnitude of the concentration parameter? Is it possible to devise an ‘optimal’ test of whether or not this matrix is significantly different from zero? Can one readily distinguish between a small non-zero concentration parameter and one that is, in some sense, bigger? In all cases we find that the answer is yes. Our starting point is a monotonic transformation of a likelihood ratio statistic, namely A2 . The statistic A2 has a long history in multivariate analysis and we are able to exploit existing results in developing the distribution theory required to use A2 as a basis for inference. In the context of assessing the magnitude of the we note that A2 admits an interpretation as a coefficient of vector alienation, which differs from the coefficient of vector correlation that is the more natural generalization of the R2 measures used in the scalar case. As A2 is a monotonic transformation of the likelihood ratio test statistic for testing the null hypothesis H 0 : = 0 it inherits a number desirable properties. In particular, we find that critical regions based upon A2 are admissible, in contrast to critical regions designed to test that the model is totally unidentified constructed from a popular version of the Cragg and Donald (1993) statistics. This result is driven by the fact that the Cragg-Donald test is based on only the smallest eigenvalue of whereas A2 uses information on all of the roots. Existing simulation results illustrate that the A2 -based test dominates that based on only the smallest root of . Finally, we examine the problem of testing the null hypothesis that is small, so that the model is weakly identified. This is a problem of distinct interest, related to issues at the heart of the weak instrument literature. The crux of the problem is to find a characterization of what it means for the concentration parameter to be small. In a similar spirit to Stock and Yogo (2005), who define small in terms of different consequences for inference, we base a test on A2 where we define small directly in terms of the magnitude of , via a one–to–one mapping between its eigenvalues and the canonical correlations—the maximal invariants. In particular, we show how knowledge of the structure of the statistic A2 can be exploited to construct a test of weak instruments, and appealing to the local-to-zero asymptotics of Staiger and Stock (1997) we demonstrate that distributional results based on an assumption of normality port across, virtually unchanged, leading to a local-to-zero asymptotic approximation to the distribution of the test statistic. We compare the test based on the statistic A2 with the modified Cragg–Donald procedure proposed by Stock and Yogo (2005), both in terms of the reasoning underlying the two approaches and via a simulation study. Both considerations point towards the use of A2 in preference to the modified Cragg–Donald procedure.
ACKNOWLEDGMENTS The authors would like to thank the editor, Frank Windmeijer, and three anonymous referees for their helpful and constructive comments. Both authors wish to acknowledge the financial support of the Australian Research Council under grant DP0771445.
REFERENCES Anderson, T. W. (2003). An Introduction to Multivariate Statistical Analysis (3rd ed.). New York: John Wiley. C The Author(s). Journal compilation C Royal Economic Society 2009.
Assessing the concentration parameter
41
Bound, J., D. A. Jaeger and R. M. Baker (1995). Problems with instrumental variables estimation when the correlation between the instruments and the endogenous explanatory variable is weak. Journal of the American Statistical Association 90, 443–50. Bowden, R. J. and D. A. Turkington (1984). Instrumental Variables. Econometric Society Monographs No. 8. Cambridge: Cambridge University Press. Buse, A. (1992). The bias of instrumental variables estimators. Econometrica 60, 173–80. Cragg, J. G. and S. G. Donald (1993). Testing identifiability and specification in instrument variable models. Econometric Theory 9, 222–40. Das Gupta, S. and M. D. Perlman (1974). Power of the noncentral F-test: effect of additional variates on Hotelling’s T 2 -test. Journal of the American Statistical Association 69, 174–80. Godfrey, L. G. (1999). Instrument relevance in multivariate linear models. Review of Economics and Statistics 81, 550–2. Hahn, J. and J. Hausman (2003). Weak instruments: diagnosis and cures in empirical econometrics. American Economic Review 93, 118–25. Hall, A. R., G. D. Rudebusch and D. W. Wilcox (1996). Judging instrument relevance in instrumental variables estimation. International Economic Review 37, 283–98. Hotelling, H. (1936). Relations between two sets of variables. Biometrika 28, 321–77. Muirhead, R. J. (1982). Aspects of Multivariate Statistical Theory. New York: John Wiley. Phillips, P. C. B. (1983). Exact small sample theory in the simultaneous equations model. In Z. Griliches and M. D. Intriligator (Eds.), Handbook of Econometrics, Volume 1, 449–516. Amsterdam: North Holland. Phillips, P. C. B. (2006). A remark on bimodality and weak instrumentation in structural equation models. Econometric Theory 22, 947–60. Poskitt, D. S. and C. L. Skeels (2004). Assessing the magnitude of the concentration parameter in a simultaneous equations model. Working Paper 29/04, Department of Econometrics and Business Statistics, Monash University, available at: http://www.buseco.monash.edu.au/depts/ebs/pubs/ wpapers/2004/29-04.php. Rao, C. R. (1973). Linear Statistical Inference and Its Applications (2nd ed.). New York: John Wiley. Rothenberg, T. J. (1984). Approximating the distributions of econometric estimators and test statistics. In Z. Griliches and M. D. Intriligator (Eds.), Handbook of Econometrics, Volume 2, 881–935. Amsterdam: North Holland. Roy, S. N. (1957). Some Aspects of Multivariate Analysis. New York: John Wiley. Schatzoff, M. (1966). Sensitivity comparisons among tests of the general linear hypothesis. Journal of the American Statistical Association 61, 415–35. Shea, J. (1997). Instrument relevance in multivariate linear models: a simple measure. Review of Economics and Statistics 79, 348–52. Staiger, D. and J. H. Stock (1997). Instrumental variables regression with weak instruments. Econometrica 65, 557–86. Stock, J. H., J. Wright and M. Yogo (2002). A survey of weak instruments and weak identification in generalized method of moments. Journal of Business & Economic Statistics 20, 518– 529. Stock, J. H. and M. Yogo (2005). Testing for weak instruments in linear IV regression. In D. W. K. Andrews and J. H. Stock (Eds.), Identification and Inference for Econometric Models: Essays in Honor of Thomas Rothenberg, 80–108. Cambridge: Cambridge University Press. Wilks, S. S. (1962). Mathematical Statistics. New York: John Wiley.
C The Author(s). Journal compilation C Royal Economic Society 2009.
42
D. S. Poskitt and C. L. Skeels
APPENDIX: PROOFS Proof of Lemma 3.1: Let τ = (τ 1 , . . . , τ n ) = 0 be an arbitrary constant vector. First partialling out X from (3.1) by pre-multiplying by R X and then post-multiplying equation by τ yields RX Yτ = RX Zφ + η ,
(A.1)
with φ = 2 τ , η = R X Vτ , and η|[X Z] ∼ N (0, σ 2τ R X ), where σ τ2 = τ 22 τ , by Assumption GRF. The operators P and R [X Z] are idempotent with ranks ν and T − K, respectively, where K = k + ν and ν is the number of instruments used in addition to X, supposing that X is employed as its own instrument. We can therefore conclude that, for given Z and X, the quadratic forms τ Y PYτ and τ Y R [X Z] Yτ are independently distributed as σ τ2 · χ 2 (ν, μ), μ = φ Z R X Zφ/σ τ2 , and σ 2τ · χ 2 (T − K) random variables, respectively. Since τ is arbitrary we have from Rao (1973, Section 8b.2 (ii) and (iii)) that the matrices Y PY and Y R [X Z] Y will have independent Wishart distributions: Y PY|[X Z] ∼ Wn (ν, 22 , 22 )
and
Y R[X Z] Y ∼ Wn (T − K, 22 ).
When = 0 the non-centrality parameter μ = τ τ /σ 2τ = 0 and both Y PY and Y R [X Z] Y will have central Wishart distributions. Writing A2 as the ratio of det[Y R[X Z] Y] to det[Y (R[X Z] + P)Y] it follows that when = 0 the statistic A2 will possess Wilks’- distribution (Wilks 1962, section 18.5.1), as required. Proof of Theorem 4.1: First note that the problem of testing the hypothesis that 12 = 0 or, equivalently, ρ 2i = 0, i = 1, . . . , n, is invariant under the group of non-singular linear transformations. It is well known that the canonical correlations are the maximal invariants under this group of transformations and so, from (3.4), we see that A2 is an invariant test statistic. Admissibility follows by writing the acceptance region of the test as n n −1 2 2 2 2 1 − ri ≤ κα = A : (1 + i ) ≤ κα , AR{A , α} = A : (A.2) i=1
i=1
where the i = r 2i /(1 − r 2i ) coincide with the non-zero eigenvalues of the matrix (Y R[X Z] Y)−1/2 Y PY(Y R[X Z] Y)−1/2 , and applying Corollary 8.10.2 of Anderson (2003). Since R X is idempotent of rank T − k there exists a T × (T − k) column orthonormal matrix Q X , QX Q X = I T −k , such that R X = Q X QX and QX [Y Z] = QX [U 1 U 2 ]. There also exists two non-singular matrices A and G that map U 1 and U 2 , respectively, to the canonical variates so that [V 1 V 2 ] = QX [U 1 A U 2 G] is distributed In [ 0] N [0 0], ⊗ IT −k , [ 0] Iν where = diag[ρ 1 , . . . , ρ n ]. Given the instruments, the conditional distribution of V 1 is N (V 2 [ 0] , 1/2 (I n − 2 ) ⊗ I T −k ) and W 1 = V 1 (I n − 2 )−1/2 ∼ N (V 2 M , I n ⊗ I T −k ), where M = [diag[λ1 , . . . , 1/2 2 2 λn ] 0], λ i = ρ i /(1 − ρ i ). Define O as the (T − k) × (T − k) orthogonal matrix ⎤ ⎡ (V2 V2 )−1/2 V2 ⎥ ⎢ O = ⎣(W1 RV2 W1 )−1/2 W1 RV2 ⎦ , O3 C The Author(s). Journal compilation C Royal Economic Society 2009.
43
Assessing the concentration parameter where O 3 is a (T − k − ν − n) × (T − k) matrix that makes O orthogonal. Then ⎤ ⎤ ⎡ ⎡ W11 (V2 V2 )−1/2 V2 W1 ⎥ ⎥ ⎢ ⎢ OW1 = ⎣ (W1 RV2 W1 )1/2 ⎦ = ⎣W12 ⎦ O3 W1 0
and W 11 ∼ N ((V2 V 2 )1/2 M , I n ⊗ I ν ) is distributed independently of W 12 ∼ N (0, I n ⊗ I n ). Moreover, by construction, the i , i = 1, . . . , n, of expression (A.2) are the non-zero eigenvalues of W 11 (W12 W 12 )−1 W11 . Applying the same argument as that used by Anderson (2003, pp. 368–9) it follows that AR{A2 , α} is convex in each row of W 11 given W 12 and the other rows of W 11 and hence, by Theorem 8.10.6 of Anderson (2003), the conditional power of A2 , given the instruments, is monotonically increasing in the eigenvalues of M V2 V 2 M . But the eigenvalues of MV2 V 2 M are all monotonically increasing in λ i , i = 1, . . . , n. Taking the unconditional power, recognizing that the marginal distributions of W12 W12 ∼ Wn (n, In ) and V2 V2 ∼ Wν (T − k, Iν ) do not depend on the λ i , i = 1, . . . , n, gives us the result that for all possible sets of the instruments the power of A2 is monotonically increasing in each λ i , i = 1, . . . , n, and hence in each ρ i . Proof of Lemma 5.1: Echoing the proof of Lemma 3.1, pre-multiply the reduced form equation (3.1) by P and post-multiply by τ = (τ 1 , . . . , τ n ) = 0 to give PYτ = PZφ + η ,
(A.3)
where φ = 2 τ , as before, but now η = PVτ . Then τ (Y − Z2 ) P(Y − Z2 )τ = η η, where
η η = τ V PVτ =
τ V RX Z T 1/2
Z RX Z T
−1
Z RX Vτ T 1/2
.
By LTZ2(b) T −1 (Z RX Z) → QZZ − QZX Q−1 XX QXZ in probability and from LTZ2(c) it follows that " # D T −1/2 (τ V RX Z) → N 0, στ2 QZZ − QZX Q−1 , XX QXZ D
where σ τ2 = τ 22 τ . Using the continuous mapping theorem we can conclude that η η → στ2 · χ 2 (ν) and hence that D
τ (Y − Z2 ) P(Y − Z2 )τ → στ2 · χ 2 (ν, μ) where μ=
τ 2 Z RX Z2 τ φ Z RX Zφ = . 2 στ στ2
Since τ is arbitrary it follows from the Cram´er–Wold device and Rao (1973, Section 8b.2 (ii) and (iii)) that D
Y PY → Wn (ν, 22 , 22 ) , as stated in the first part of the lemma. C The Author(s). Journal compilation C Royal Economic Society 2009.
44
D. S. Poskitt and C. L. Skeels Noting that R [X Z] Y = R [X Z] V we have (T − K)−1 Y R[X Z] Y = (T − K)−1 V R[X Z] V = (1 − K/T )−1 (T −1 V V − T −1 V P[X Z] V) . D
By an argument that parallels that just employed V P[X Z] V → Wn (K, 22 ), and by LTZ2(a) plimT −1 V V =
VV = 22 . Thus (T − K)−1 Y R [X Z] Y = 22 + op (1) + Op (T −1 ). The last part of the lemma is establish in similar manner, the details, which are relatively straightforward, are omitted. D
Proof of Theorem 5.1: Under Assumption LTZ2 GT → Wn (ν, In , 22 ), which is identical to the large sample distribution of G T obtained under Assumption GRF. This implies that Assumption LTZ2 leads to the same the asymptotic distribution for the eigenvalues i , i = 1, . . . , n, and, therefore, A2 = ni=1 (1 + i )−1 , as that obtained under Assumption GRF. Applying Theorem 10.5.7 of Muirhead (1982) we can conclude that under Assumption GRF P(−τ log A2 ≤ x) can be approximated for large τ by the expansion on the right-hand side of (5.4), with an approximation error O(τ −2 ). Since τ /T = 1 − (n + ν + 1)/2T → 1 as T → ∞ the result follows. Proof of Proposition 5.1: By noting that P(A2 ≥ a) is monotonically increasing in the maximal invariants ρ i , i = 1, . . . , n,—see the proof of Theorem 4.1, or by applying standard results on non-central chi-squared random variables, as in Das Gupta and Perlman (1974, remark 4.1), directly to PT (−τ log A2 ≤ x)—we find that P(A2 < a) is monotonically decreasing in the λ i = ρ 2i /(1 − ρ 2i ), i = 1, . . . , n. This implies that
sup P A2 ∈ CRWI {A2 , α, λ0 } = α λ≥λ0
and that, for any λ < λ 0 ,
P A2 ∈ CRWI {A2 , α, λ0 } > α,
where λ ≥ λ 0 denotes λ i ≥ λ 0i , and λ < λ 0 denotes λ i < λ 0i , i = 1, . . . , n. This establishes that the test has appropriate size and is unbiased. Now suppose that Assumption LTZ1 holds, i.e. that the instruments are weak. It is readily established that this leads to the conclusion that plim|T λ i − σ i | = 0, implying that A2 → 1 and hence that −τ log A2 will be less than a α (λ 0 ) for any given α > 0 and λ 0 > 0.
C The Author(s). Journal compilation C Royal Economic Society 2009.
The
Econometrics Journal Econometrics Journal (2009), volume 12, pp. 45–61. doi: 10.1111/j.1368-423X.2008.00259.x
Determining the number of factors in a multivariate error correction–volatility factor model Q IAOLING L I † AND J IAZHU P AN ‡ †
‡
School of Mathematical Sciences, Peking University, Beijing 100871, China E-mail:
[email protected]
Department of Statistics and Modelling Science, University of Strathclyde, Livingstone Tower, Richmond Street, Glasgow G1 1XH, UK E-mail:
[email protected] First version received: March 2007; final version accepted: August 2008
Summary In order to describe the co-movements in both conditional mean and conditional variance of high dimensional non-stationary time series by dimension reduction, we introduce the conditional heteroscedasticity with factor structure to the error correction model (ECM). The new model is called the error correction–volatility factor model (EC–VF). Some specification and estimation approaches are developed. In particular, the determination of the number of factors is discussed. Our setting is general in the sense that we impose neither i.i.d. assumption on idiosyncratic components in the factor structure nor independence between factors and idiosyncratic errors. We illustrate the proposed approach with a Monte Carlo simulation and a real data example. Keywords: Co-integration, Dimension reduction, Error correction–volatility factor model, Model selection, Penalized goodness-of-fit criteria.
1. INTRODUCTION The concept of co-integration (Granger, 1981, Granger and Weiss, 1983, and Engle and Granger, 1987) has been successfully applied to modelling multivariate non-stationary time series. The literature on co-integration is extensive. The most frequently used representations for a cointegrated system are the ECM of Engle and Granger (1987), the common trends form of Stock and Watson (1998) and the triangular model of Phillips (1991). The error correction model has been applied in various practical problems, such as determining exchange rates, capturing the relationship between expenditure and income, modelling and forecasting inflation, etc. From the equilibrium point of view, the term ‘error correction’ reflects the correction on the long-run relationship by the short-run dynamics. However, the ECM ignores the characteristics of time-varying volatility, which plays an important role in various financial areas such as portfolio selection, option evaluation and risk management. Kroner and Sultan (1993) argued that the neglect of either co-integration or timevarying volatility would affect the hedging performance of existing models in the literature for the futures market. Similar conclusion has been given by Ghost (1993) and Lien (1996) through C The Author(s). Journal compilation C Royal Economic Society 2008. Published by Blackwell Publishing Ltd, 9600 Garsington Road,
Oxford OX4 2DQ, UK and 350 Main Street, Malden, MA, 02148, USA.
46
Q. Li and J. Pan
empirical calculation and theoretical analysis, respectively. Therefore, the traditional ECM needs to be generalized to have conditional heteroscedasticity for capturing both co-integration and time-varying volatility. Univariate volatility models have been extended to multivariate cases. Extensions of the generalized autoregressive heteroscedastic (GARCH) model (Bollerslev, 1986) include, e.g. vectorized GARCH (VEC-GARCH) model of Bollerslev et al. (1988), the BEKK model of Engle and Kroner (1995), a dynamic conditional correlation (DCC) model of Engle (2002) and Engle and Sheppard (2001), a generalized orthogonal GARCH model of van der Weide (2002); see a survey of multivariate GARCH models by Bauwens et al. (2006). 1 These models assume that a vector transformation of the covariance matrix can be written as a linear combination of its lagged values and the innovations. Andersen et al. (1999) showed that these models perform well relatively to competing alternatives. But the curse of dimensionality becomes a major obstacle in application. A useful approach to simplifying the dynamic structure of a multivariate volatility process is to use factor models. As is well known, factor models have been used for performance evaluation and risk measurement in finance. Moreover, it is now widely accepted that the financial volatilities move together over time across assets and markets (Anderson et al., 2006). These make it reasonable that we impose a factor structure on the residual term of a multivariate error correction model. In this sense, an error correction–volatility factor (EC–VF) model can capture the features of co-movements in both conditional mean (co-integration) and conditional variance (volatility factors) of a high dimensional time series. The contribution of this paper is to estimate the EC–VF model. The set of parameters is divided into three subsets: structural parameter set including lag order and all autoregressive coefficient vector and matrices, co-integration parameter set including the co-integration vectors and the rank, and factor parameter set including the factor loading matrix and the number of factors. We conduct a two-step procedure to estimate relevant parameters. First, assuming that the structural and co-integration parameters are known, we give the estimation of factor loading matrix in the volatility factor model, and then give a method to determine the number of factors consistently. Our model specification and estimation approaches are general, because we impose neither i.i.d. assumption on the idiosyncratic components in the factor structure nor independence between factors and idiosyncratic errors. In contrast to the innovation expansion method in Pan and Yao (2008) and Pan et al. (2007), where they can not prove that their algorithm for the number of factors is consistent, our method in this paper is based on a penalized goodness-of-fit criterion. We prove our estimator of the number of factors is consistent. Secondly, the structural and co-integration parameters will be consistently estimated without knowing the true factor structure. The main distinction between Bai and Ng (2002) and this paper is that their factor model concerned the unconditional mean of economic variables while our factor structure is imposed on the conditional variance to reduce the dimension of volatilities. The rest of the paper is organized as follows. Section 2 defines the EC–VF model and mentions some practical backgrounds of the model. 
Section 3 presents an information criterion for determining the number of factors and the consistency of our estimator. In Section 4, a simple Monte Carlo simulation is conducted to check the accuracy of the proposed estimation for the factor loading matrix and the number of factors. In Section 5, an application to financial risk management is discussed to show the advantages of the EC–VF model to other traditional alternatives. All theoretical proofs are given in the Appendix. 1 The early version of Engle and Kroner (1995) was written by Baba, Engle, Kraft and Kroner, which led to the name BEKK of their model.
C The Author(s). Journal compilation C Royal Economic Society 2008.
Determining the number of factors in EC–VF model
47
2. MODEL 2.1. Definition Suppose that {Y t } is a d × 1 time series. The EC–VF model is of the form Yt = μ + 1 Yt−1 + 2 Yt−2 + · · · + k−1 Yt−k+1 + 0 Yt−1 + Zt , Zt = AFt + et ,
(2.1)
where Y t = Y t − Y t−1 , μ is a d × 1 vector, i , i = 1, . . . , k are d × d matrices. The rank of 0 , denoted by m, is called the co-integration rank. {Z t } is strictly stationary with E(Zt |Ft−1 ) = 0 and V ar(Zt |Ft−1 ) = z (t), where Ft = σ (Zt , Zt−1 , . . .). Ft is a r × 1 time series, r < d is unknown, A is a d × r unknown constant matrix. F t and e t are assumed to satisfy E(Ft |Ft−1 ) = 0, E(et |Ft−1 ) = 0, (2.2) E(Ft et |Ft−1 ) = 0, E(et et |Ft−1 ) = e , where e is a positive definite matrix independent on t. The components of F t are called ‘factors’, and r is the number of factors. Note that F t and e t are conditional uncorrelated. There is no loss of generality in assuming that E(Ft F t ) is a r × r positive definite matrix (otherwise, the above model may be expressed equivalently in terms of a smaller number of factors). R EMARK 2.1. The error term {Z t } in an EC–VF model is conditionally heteroscedastic and follows a factor structure, while the error term in the traditional ECM developed by Engle and Granger (1987) is covariance stationary with mean 0. Here, the factor structure is not the classical one because we assume neither that the idiosyncratic components e t are i.i.d. with a diagonal covariance matrix nor that the factor components F t is independent of e t . Model (2.1) assumes that the volatility dynamics of Y t is determined by a lower dimensional volatility dynamics of F t and the static variation of e t , as y (t) = z (t) = Af (t)A + e ,
(2.3)
where y (t) = V ar(Yt |Ft−1 ) and f (t) = V ar(Ft |Ft−1 ). Without loss of generality, we assume rank (A) = r. The lower dimensional volatility dynamics f (t) can be fitted by, e.g. the dynamic conditional correlation model of Engle (2002) or the conditionally uncorrelated components model of Fan et al. (2008). 2.2. Practical background Factor analysis is an effective way for dimension reduction, and then it is a useful statistical tool for modelling multivariate volatility. Because there might exist co-integration relationship among financial asset prices, the framework given by equation (2.1) applies to many cases of financial analysis. 2.2.1. Value-at-risk. Value-at-risk (VaR) defines the maximum expected loss on an investment over a specified horizon at a given confidence level, and is used by many financial institutions as a key measurement of market risk. The VaR of a portfolio of multiple assets can be obtained when the prices are described by an EC–VF model. The EC–VF model can be also used to determine C The Author(s). Journal compilation C Royal Economic Society 2008.
48
Q. Li and J. Pan
an optimal portfolio based on maximizing expected returns subject to a downside risk constraint measured by VaR. 2.2.2. Hedge ratio. The importance of incorporating the co-integration relationship into statistical modelling of spot and futures prices is well documented in the literature for futures market. It has been shown in Lien and Luo (1994) that although GARCH model may characterize the price behaviour, the co-integration relationship is the only indispensable component when comparing ex post performance of various hedge strategies. A hedger who omits the cointegration relationship will adopt a smaller than optimal futures position, which results in a relatively poor hedge performance; see Lien and Tse (2002) for a survey on hedging and references there. 2.2.3. Multi-factor option. A multi-factor option (or multi-asset option) is an option whose payoff depends upon the performance of two or more underlying assets. Basket and rainbow options belong to this category. Duan and Pliska (2004) investigated theoretical and practical aspects of such options when the multiple underlying assets are co-integrated. In particular, they proposed an ECM with stochastic volatilities that follow a multivariate GARCH process. To avoid introducing too many parameters, they give a parsimonious diagonal model for the volatilities, but it is rather restrictive for the cross-dynamics. In contrast, volatility factor models can be used for reducing dimension as well as for representing the dynamics of both variances and covariances. The EC–VF model, with some modification, is more suitable for valuating the multi-factor options.
3. ESTIMATION OF THE NUMBER OF FACTORS The parameter set of the EC–VF model (2.1) is {; 0 ; A}, in which = {μ, 1 , . . . , k−1 } is called the structural parameter, 0 the co-integration parameter and A the factor parameter. In first two subsections, {, 0 } is assumed known and its determination will be discussed later in subsection 3.3. 3.1. Determining A Note that the factor loading matrix A and the vector of factors F t in equation (2.1) are not separately identifiable. Our goal is to determine the rank of A and the space spanned by the columns of A. Without loss of generality, we may assume A A = I r , where I r denotes the r × r identity matrix. Let M(A) be the linear subspace of R d spanned by the columns of A, which is called the factor loading space. Then, we need to estimate M(A) or its orthogonal complement M(B), where B is a d × (d − r) matrix for which (A, B) forms a d × d orthogonal matrix, i.e. B A = 0 and B B = I d−r . Now it follows from equation (2.1) that B Zt = B et .
(3.1)
From equation (3.1) and the assumption that {e t } is a conditional homoscedastic sequence of martingale differences (see equation (2.2)), we have E(B Zt Zt B|Ft−1 ) = B e B = B z B, C The Author(s). Journal compilation C Royal Economic Society 2008.
49
Determining the number of factors in EC–VF model
where z = E(Zt Zt ). This implies that B E(Zt Zt − e )I (Zt−τ ∈ C)B = 0 for any τ ≥ 1 and C ∈ B,
(3.2)
or equivalently sup B E[(Zt Zt − e )I (Zt−τ ∈ C)]B = 0
for any τ ≥ 1 and C ∈ B,
(3.3)
C∈B
where B consists of some subsets in R d , and M = [tr(M M)]1/2 denotes the norm of matrix M. Hence, we may estimate B by minimizing n 1 ˆ n (B) = sup (Zt Zt − z )I (Zt−τ ∈ C)B B n − τ 0 t=τ +1 1≤τ ≤τ0 ,C∈B 0
(3.4)
ˆz = subject to the condition B B = I d−r , where τ 0 is a prescribed positive integer and 1 n Z Z . This is a high-dimensional optimization problem, but it does not explicitly t=τ0 +1 t t n−τ0 address the issue how to determine the number of factors r consistently. We first assume r is known and introduce some properties of the estimator of B derived by Pan et al. (2007) before we present a consistent estimator of r. Let Hr be the set of all d × (d − r) (d ≥ r) matrix B satisfying B B = I d−r . We partition Hr into equivalent classes such that B1 , B2 ∈ Hr belong to the same class if and only if M(B1 ) = M(B2 ), which is equivalent to (Id − B1 B1 )B2 = 0 and (Id − B2 B2 )B1 = 0.
(3.5)
Define D(B1 , B2 ) = (Id − B1 B1 )B2 . r = Hr /D defined The equivalent classes can be regarded as the elements of the quotient space HD r , and by D-distance. It can be shown that D is a well-defined metric distance on the space HD r thus (HD , D), which is our parametric space, is a metric space; see Pan and Yao (2008). r , i.e. Our estimator of B is the minimizer of n (·) in HD
Bˆ = arg minr n (B). B∈HD
Under the assumptions listed below, the estimator Bˆ is consistent with a convergence rate
√
n.
A SSUMPTION 3.1. {Z t } is a strictly stationary d-dimensional time series with EZ t 2p < ∞ for some p > 2. The β-mixing coefficients β(n) = E
0 sup P (B) − P B|F−∞
B∈Fn∞
C The Author(s). Journal compilation C Royal Economic Society 2008.
50
Q. Li and J. Pan
satisfy β n = O(n−b ) for some b > j }.
p , p−2
j
where Fi is the σ -algebra generated by {Z t , i ≤ t ≤
A SSUMPTION 3.2. Denote (B) = sup1≤τ ≤τ0 ,C∈B BE[(Zt Zt − e )I (Zt−τ ∈ C)]B. There exists a matrix B0 ∈ Hr which minimizes (B), and (B) reaches its minimum value at a matrix B ∈ Hr if and only if D(B, B 0 ) = 0. A SSUMPTION 3.3. There exists a positive constant a such that (B) − (B 0 ) ≥ aD(B, B 0 ) for any matrix B ∈ Hr . By the similar way to that in proof of Theorem 2 in Pan et al. (2007), we can prove the following result, which is useful in deriving a consistent estimator for the number of factors in next subsection. T HEOREM 3.1. If the collection B of subsets in R d is a VC-class, and Assumptions 3.1 and 3.2 hold, then √ (3.6) sup n|n (B) − (B)| = Op (1). B∈HD
If, in addition, Assumption 3.3 also holds, √ ˆ B0 ) = Op (1). nD(B,
(3.7)
˘ (VC) class can be found in van der Vaart R EMARK 3.1. The definition of Vapnik-Cervonenkis and Wellner (1996). 3.2. Determining r Let r 0 be the true number of factors and A 0 the true factor loading matrix with rank r 0 . We discuss ˆ derived how to estimate r 0 based on the estimated factor loading matrix Aˆ (or its counterpart B) in the previous subsection. The basic idea is to treat the number of factors as the ‘order’ of model (2.1) and to determine the order in terms of an appropriate information criterion. In the following, we always assume that Assumptions 3.1–3.3 hold. Let M l denote a matrix with rank d − l. In particular, B r00 and Bˆ r (0 ≤ r ≤ d) denote the matrices B 0 and Bˆ with ranks d − r 0 and d − r, respectively. Let ˆr ˆ n (r, Bˆ r ) = sup B Dn,τ (C)Bˆ r , 1≤τ ≤τ0 ,C∈B (3.8) r r r, B0 = sup B0 Dτ (C)B0r , 1≤τ ≤τ0 ,C∈B
where Dˆ n,τ (C) =
n 1 ˆ z )I (Zt−τ ∈ C), (Zt Zt − n − τ0 t=τ +1 0
Dτ (C) = E[(Zt Zt − e )I (Zt−τ ∈ C)], Bˆ r = arg minr n (r, B), B0r = arg minr (r, B). B∈HD
B∈HD
C The Author(s). Journal compilation C Royal Economic Society 2008.
Determining the number of factors in EC–VF model
51
Our penalized goodness-of-fit criterion is defined as P C(r) = n (r, Bˆ r ) + rg(n),
(3.9)
where g(n) is a penalty for ‘overfitting’. We may estimate r 0 by minimizing PC(r), i.e. rˆ = arg min P C(r). 0≤r≤d
We call equation (3.9) a penalized goodness-of-fit criterion because of Lemma A.1. R EMARK 3.2. n (·) can be regarded as fitting error, because a model with r + 1 factors can fit no worse than a model with r factors, while Lemma A.1 shows that n (·) is a non-increasing function of r. But the efficiency is lost as more factors are estimated. For example, there is neither error nor efficiency in the extreme case when r = d, n (d, Bˆ d ) = 0 with Bˆ d = 0. The following theorem shows that rˆ is a consistent estimator of r 0 provided that the penalty function g(n) satisfies some mild conditions. P
T HEOREM 3.2. Under Assumptions 3.1–3.3, as n → ∞, rˆ → r0 provided that g(n) → 0 and √ ng(n) → ∞. 3.3. Determining {, 0 } In this subsection, we give an estimation of the structural and co-integration parameter sets without knowledge of the true factor structure for Z t . By the Grange representation theorem, if there are exactly m co-integration relations among the components of Y t , and 0 admits the decomposition 0 = γ α , then α is a d × m matrix with linearly independent columns and α Y t is stationary. In this sense, α consists of m co-integration vectors. As α and γ are not separately identifiable, our goal is to determine the rank of α, i.e. the dimension of the space spanned by the columns of α. Besides Assumptions 3.1–3.3 on {Z t }, we need an additional assumption on {Y t } as follows. A SSUMPTION 3.4. The process Y t satisfies the basic assumptions of the Granger representation theorem given by Engle and Granger (1987), and Eα Y t−1 4 < ∞. Our estimation of co-integration vectors is the solution to the following optimization problem max tr(α S10 S01 α),
α S11 α=Im
(3.10)
where Sij = T −1 Tt=1 Rit Rj t , R0t = Yt − 1 Xt , R1t = Yt−1 − 2 Xt , Xt = (1, Yt−1 ,..., T T T T −1 −1 Yt−k+1 ) , 1 = t=1 Yt Xt ( t=1 Xt Xt ) , 2 = t=1 Yt−1 Xt ( t=1 Xt Xt ) . The solution of equation (3.10) is αˆ ≡ (αˆ 1 , . . . , αˆ m ), where αˆ 1 , . . . , αˆ m are the m generalized eigenvectors of S 10 S 01 with respect to S 11 corresponding to the m largest generalized eigenvalues. The estimated co-integration vectors are consistent with the standard root-n convergence rate. The corresponding estimator γˆ = S01 αˆ of the co-integration loading matrix and the estimator ˆ = 1 − γˆ αˆ 2 of the structural parameter are also consistent. These conclusions are obtained by Li et al. (2006), who also give a joint estimation for the co-integration rank and the lag order of the error correction model by a penalized goodness-of-fit measure M(m, k) = R(m, k, α) ˆ + nm,k g1 (n), C The Author(s). Journal compilation C Royal Economic Society 2008.
(3.11)
52
Q. Li and J. Pan
where R(m, k, α) ˆ = tr S00 − S01 α( ˆ αˆ S11 α) ˆ −1 αˆ S10 ,
(3.12)
g 1 (n) is the penalty for ‘overfitting’ and n m,k is the number of free parameters. Note that n m,k = d + d 2 (k − 1) + 2dm − m2 for model (2.1). We may estimate m 0 by minimizing M(m, k), i.e. ˆ = arg ˆ k) (m,
min
0≤m≤d,1≤k≤K
M(m, k),
where K is a prescribed positive integer. Let k 0 be the true lag order. The theorem below ensures ˆ is a consistent estimator for (m 0 , k 0 ). ˆ k) that (m, P ˆ → ˆ k) (r0 , k0 ) provided that g 1 (n) T HEOREM 3.3. Under Assumptions 3.1–3.4, as n → ∞, (m, → 0 and ng1 (n) → ∞. √ √ In practice, the choice of penalty function g(·) is flexible, e.g. ln(n)/ n or 2 ln(ln(n))/ n.
4. MONTE CARLO SIMULATION We present a simple Monte Carlo experiment to illustrate the proposed approach in this section. Particularly, we check the accuracy of our estimation for the factor loading matrix A and the number of factors r. Consider a simple EC–VF model with d = 6, m = 1, r = 1, ⎧ ⎪ ⎨ Yt = μ + γ α Yt−1 + Zt , Zt = AFt + et , ⎪ ⎩ Ft |Ft−1 ∼ N 0, σt2 , et |Ft−1 ∼ N (0, I6 ),
(4.1)
where σ 2t = β 0 + β 1 F 2t−1 + β 2 σ 2t−1 , e t is independent of F t , and the values of ,γ = parameters are given as follows: μ = (0.2028, 0.1987, 0.6038, 0.2722, 0.1988, 0.0153) √ √ √ √ √ √ √ 6 6 6 6 6 6 6 (0.1, 0.2, 0.3, 0.4, 0.5, 0.6) , α = (1, 2, −1, −1, −2, 3) , A = ( 6 , 6 , 6 , 6 , 6 , 6 , 6 ) and β = (β 0 , β 1 , β 2 ) = (0.02, 0.10, 0.76) . Note that A A = 1. We conduct 2000 replications, and for each replication, the sample sizes are n = 500 and 1000, respectively. We estimate the transformation matrix B by minimizing n (B) defined by equation (3.4), and measure the estimation error of the factor loading space M(A) by 1/2 ˆ ˆ = ([tr{Aˆ (Id − AA )A} ˆ + tr(Bˆ AA B)]/d) D1 (A, A) .
The coefficients β i , i = 0, 1, 2, are estimated by quasi-maximum likelihood estimation (MLE) based on a Gaussian likelihood. The resulting estimates are summarized in Table 1. ˆ is less than 0.06, while it decreases over 15% as the The mean of estimation errors D1 (A, A) sample size increases from 500 to 1000. The negative biases indicate a slight underestimation for the heteroscedastic coefficients. The relative frequencies for rˆ taking different values are listed in Table 2. It shows that when the sample size n increases, the estimation of r becomes more accurate. C The Author(s). Journal compilation C Royal Economic Society 2008.
53
Determining the number of factors in EC–VF model Table 1. Simulation results: summary statistics of estimation errors. ˆ A) D1 (A, βˆ0 βˆ1
n = 500
n = 1000
rˆ
βˆ2
Mean Median
0.0563 0.0438
0.0179 0.0183
0.0894 0.0827
0.7414 0.7521
STD Bias
0.0601 –
0.0022 −0.0021
0.0403 −0.0106
0.0935 −0.0186
RMSE
–
0.0029
0.0454
0.0958
Mean Median STD
0.0477 0.0390 0.0426
0.0193 0.0199 0.0010
0.0922 0.0897 0.0276
0.7481 0.7543 0.0724
Bias RMSE
– –
−0.0007 0.0013
−0.0078 0.0295
−0.0119 0.0766
Table 2. Relative frequencies for rˆ taking different values, when r = 1. 0 1 2 3 4 5
n = 500 n = 1000
0.0120 0.0090
0.8425 0.9765
0.1310 0.0100
0.0105 0.0045
0.0040 0
0 0
6 0 0
5. APPLICATION TO REAL DATA The VaR is widely adopted by banks and other financial institutions to measure and manage market risk, as it reflects downside risk of a given portfolio or investment. Specifically, at a given confidence level 1 − a, the VaR of a portfolio with weight ω t is defined as the solution to P (ωt Yt < V aRa |Ft−1 ) = a,
(5.1)
where Y t is a vector of log returns of assets in the portfolio. In the case when the conditional density f (Yt |Ft−1 ) is normal, equation (5.1) reduces to the well-known formula 1/2 za , (5.2) V aRa = ωt μy (t) + ωt y (t)ωt where z a is the ath quantile of the univariate standard normal distribution. In this section, we attempt to compare the VaR forecasting results by assuming three different models: AR-DCC, EC-DCC, EC-VF-DCC for the asset price series {Y t }. The DCC refers to dynamic conditional correlation, a volatility model proposed by Engle (2002). Focusing on the methodology, we only consider the case when the conditional multivariate density f (Yt |Ft−1 ) is normal, while the impact of other distributions (like Student-t and some non-parametric densities) on VaR computation is beyond our scope here. 5.1. Data set and estimation of the EC-VF-DCC model Our data set consists of 2263 daily log prices of CSCO, DELL, INTC, MSFT and ORCL, the five most active stocks in US market, from 19 June 1997 to 16 June 2006. The plots of C The Author(s). Journal compilation C Royal Economic Society 2008.
54
Q. Li and J. Pan (a) CSCO 20 0 0
500
1000
1500
2000
2500
1500
2000
2500
1500
2000
2500
(b) DELL 20 0
0
500
1000
0
500
1000
(c) INTC
20 0
(d) MSFT 20 0 0
500
1000 1500 (e) ORCL
2000
2500
0
500
1000
2000
2500
20 0 1500
Figure 1. Plots of daily log-returns in percentage.
log returns (in percentage) are presented in Figure 1 which shows significant time-varying volatilities. Descriptive statistics are listed in Table 3. All unconditional distributions of these series exhibit excessive kurtosis and non-zero skewness, indicating significant departure from the normal distribution. The estimation procedure for the EC-VF-DCC model is given step by step as follows. Step 1. Fit an ECM for Y t to determine the structural and co-integration parameters. Compute ˆ t + γˆ αˆ Yt−1 . the estimate of conditional mean vector μˆ y (t) = X Step 2. Conduct a multivariate portmanteau test for the squared residuals obtained from the previous step to detect conditional heteroscedasticity. If there exists serial dependence, fit ˆ a volatility factor model for the residual series {Z t } to determine the factor loading matrix A, otherwise switch to Step 3 with Aˆ = Ir and r = d. C The Author(s). Journal compilation C Royal Economic Society 2008.
55
Determining the number of factors in EC–VF model Table 3. Summary statistics of the log-returns. n = 2263 Mean Stdev
CSCO 0.000423 0.031847
DELL
INTC
MSFT
ORCL
0.000523 1.95 × 10−5 0.030270 0.030313
0.000200 0.023074
0.000418 0.036400
Min −0.145000 −0.20984 Max 0.218239 0.163532 Skewness 0.149215 −0.118260 Kurtosis
4.558020
3.690575
−0.248680 −0.169760 −0.346150 0.183319 0.178983 0.270416 −0.391560 −0.173470 −0.226370 5.631860
5.955046
8.519630
Denote B = (b 1 , b 2 , . . ., b d−r ), the objective function (3.4) can be modified to 2 τ0 n 1 ˆ B
n (B) = w(C) (Z Z − )I (Z ∈ C)B t t z t−τ n−τ 0 t=τ +1 τ =1 C∈B 0
where w(C) ≥ 0 are weights which ensure that the sum over C ∈ B converges. In numerical implementation, we simply take B as the collection of all the balls centred at the origin in R d and w(C) = {#(B)}−1 . An algorithm for estimating B and r is given as follows. Put ⎡ ⎤2 τ0 n 1 ˜ τ (b) = ˆ z )I (Zt−τ ∈ C)b⎦ , ˜ τ (b),
(b) = w(C) ⎣b (Zt Zt − n − τ 0 τ =1 t=τ0 +1 C∈B ⎫ ⎧ ⎡ ⎤2 ⎪ τ0 ⎪ n l−1 ⎬ ⎨ 1 ˆ z )I (Zt−τ ∈ C)b⎦ + ˜ τ (b) . w(C) ⎣bˆi (Zt Zt −
l (b) = ⎪ ⎪ n − τ0 t=τ +1 ⎭ τ =1 ⎩ i=1 C∈B 0 Compute bˆ1 by minimizing (b) subject to the constraint b b = 1. For l = 2, . . ., d, compute bˆl which minimizes l (b) subject to the constraint b b = 1, b bˆi = 0 for i = 1, 2, . . ., l − 1. Let rˆ = arg min0≤r≤d P C(r) with Bˆ r = (bˆ1 , bˆ2 , . . . , bˆr ), where PC(r) is defined by equation (3.9). Note that Bˆ r Bˆ r = Id−ˆr . Let Aˆ consist of the rˆ (orthogonal) unit eigenvectors, corresponding to the common eigenvalue 1, of matrix Id − Bˆ r Bˆ r . Step 3. Fit a DCC volatility model (Engle, 2002) for {Aˆ Zt } and compute its conditional ˜ z (t) = Dt1/2 Rt Dt1/2 . covariance To this end, we first fit each element of D t with a univariate GARCH(1,1) model using the ith component of Aˆ Zt only, and then model the conditional correlation matrix R t by ) + θ2 Rt−1 , Rt = S(1 − θ1 − θ2 ) + θ1 (εt−1 εt−1
where ε t is a rˆ × 1 vector of the standardized residuals obtained from the separate GARCH(1,1) fittings for the rˆ components of Aˆ Zt , and S is the sample correlation matrix of Aˆ Zt . ˆ y (t) of Y t is equal to ˜ z (t) and If Aˆ = Id , the estimate of conditional covariance matrix terminate the algorithm. Otherwise, proceed to Step 4. C The Author(s). Journal compilation C Royal Economic Society 2008.
56
Q. Li and J. Pan
2
M(m,k)
1.5
1
0.5 4 3 lag order k
4 3
2 1
1
2 cointegration rank m
0
Figure 2. Plot of M(m, k) against the co-integration rank m and the lag order k.
Step 4. The factor structure in equation (2.1) and the facts B A = 0, B e t = B Z t , AA + BB = I d lead to a dynamics for y (t) ≡ z (t) as follows ˆz = where
1 n−τ0
n
˜ z (t)Aˆ + Aˆ Aˆ ˆ z Bˆ Bˆ + Bˆ Bˆ ˆ z, ˆ y (t) = Aˆ
t=τ0 +1
(5.3)
Zt Zt .
We determine the co-integration rank by minimizing M(m, k) defined by equation (3.11). The surface of M(m, k) is plotted against m and k in Figure 2. The minimum point of the surface is attained at (m, k) = (1, 1), leading to an error correction model for this data set with lag order 1 and co-integration rank 1. Applying the Ljung-Box statistics to the squared residuals, we have Q 5 (1) = 63.2724, Q 5 (5) = 305.7613 and Q 5 (10) = 633.7103. Based on asymptotic χ 2 distributions with degrees of freedom 11, 111 and 236, the p-values of these Q statistics are all close to zero. 2 Consequently, the portmanteau test confirms the existence of conditional heteroscedasticity. The algorithm stated in Step 2 leads to an estimator for the number of factors, and PC(r) is plotted against r in Figure 3. Clearly, a two-factor structure (i.e. rˆ = 2) is determined for the residual series {Z t }. 5.2. Comparison of value-at-risk forecasting results The VaRs are computed at level 0.05 (denoted by VaR 0.05 ) for the last 1000 trading days of data span. We assume three models: AR-DCC, EC-DCC, EC-VF-DCC for the asset prices {Y t }, and 2 The Q (l) statistic has asymptotically a χ 2 distribution with degree of freedom d 2 l − n d m,k where n m,k = d + d 2 (k − 1) + 2dm − m2 is the number of free parameters in the ECM. C The Author(s). Journal compilation C Royal Economic Society 2008.
57
Determining the number of factors in EC–VF model 0.445 0.44 0.435
PCvalue
0.43 0.425 0.42 0.415 0.41 0.405 0.4
0
1
2 3 the number of factors r
4
5
Figure 3. Plot of PC(r) against the number of factors r.
Table 4. Comparison of VaR 0.05 . ω1 AR-DCC
ω2
ω3
ω4
t (Min)
0.067 (0.001) 0.071 (0.000) 0.065 (0.005) 0.062 (0.032)
287.3
EC-DCC 0.052 (0.659) 0.059 (0.061) 0.051 (0.713) 0.053 (0.268) EC-VF-DCC 0.049 (0.713) 0.056 (0.308) 0.053 (0.268) 0.055 (0.312)
294.7 41.5
Note: Figures in parentheses are p-values for the Kupiec likelihood ratio test used to compare the empirical failure rate with its theoretical value, see Kupiec (1995). The average computing time in minute for each model is recorded in the last column.
four time invariant portfolios with weights ω1 = (1, 1, 1, 1, 1) /5, ω2 = (1, 2, 3, 4, 5) /15, ω3 = (5, 4, 3, 2, 1) /15, ω4 = (1, 3, 5, 4, 2) /15. To compare the VaR forecasting performances, we calculate failure rates for the different specifications. The failure rate is defined as the proportion of r t = ωt Y t smaller than the VaRs. For a correctly specified model, the empirical failure rate is supposed to be close to the true level a. Table 4 displays the results for the 5% level. We observe from Table 4 that the EC-VF-DCC performs reasonably well, while AR-DCC has a difficulty in providing failure rates close to 0.05. The empirical failure rates for AR-DCC are high, which means that it underestimates the risk. The results for the EC-DCC and EC-VFDCC model are comparable, but the average computing time for EC-DCC is much longer, see C The Author(s). Journal compilation C Royal Economic Society 2008.
58
Q. Li and J. Pan
the last column of Table 4. This shows that the factor structure imposed on the residual term of an ECM can improve the computational velocity in high-dimensional problems. The above results show that the EC-VF model proposed in this paper is a promising tool for risk analysis. First, it incorporates the impact of co-integration which makes the VaR computation more accurate. Second, it deduces a high-dimensional optimization problem into a much lowerdimensional problem, thus accelerates the VaR computation to a great extent.
ACKNOWLEDGMENTS The authors are grateful to an anonymous referee and the co-editor for their insightful comments and valuable suggestions. Qiaoling Li was partially supported by the National Natural Science Foundation of China (grant no. 10571003). Jiazhu Pan was partially supported by the starter grant from University of Strathclyde (UK) and the National Basic Research Program of China (grant no. 2007CB814902).
REFERENCES Andersen, T. G., T. Bollerslev, F. X. Diebold and P. Labys (1999). (Understanding, optimizing, using and forecasting) realized volatility and correlation. Working paper, Northwestern University. Anderson, H. M., J. V. Issler and F. Vahid (2006). Common features. Journal of Econometrics 132, 1–5. Bai, J. S. and S. Ng (2002). Determining the number of factors in approximate factor models. Econometrica 70, 191–211. Bauwens, L., S. Laurent and J. V. K. Rombouts (2006). Multivariate GARCH models: a survey. Journal of Applied Econometrics 21, 79–109. Bollerslev, T. (1986). Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics 31, 307–27. Bollerslev, T., R. F. Engle and J. M. Wooldridge (1988). A capital asset pricing model with time varying covariances. Journal of Political Economy 96, 116–31. Duan, J. C. and S. R. Pliska (2004). Option valuation with co-integrated asset prices. Journal of Economic Dynamics and Control 28, 727–54. Engle, R. F. (2002). Dynamic conditional correlation – a simple class of multivariate GARCH models. Journal of Business and Economic Statistics 20, 339–50. Engle, R. F. and C. W. J. Granger (1987). Co-integration and error correction: representation, estimation and testing. Econometrica 55, 251–76. Engle, R. F. and K. F. Kroner (1995). Multivariate simultaneous generalized ARCH. Econometric Theory 11, 122–50. Engle, R. F. and K. Sheppard (2001). Theoretical and empirical properties of dynamic conditional correlation multivariate GARCH. Working paper 2001-15, Department of Economics, University of California, San Diego. Fan, J., M. Wang and Q. Yao (2008). Modelling multivariate volatilities via conditionally uncorrelated components. Journal of the Royal Statistical Society, Series B 70, 679–702. Ghost, A. (1993). Hedging with stock index futures: estimation and forecasting with error correction model. The Journal of Futures Markets 13, 743–52. Granger, C. W. J. (1981). Some properties of time series data and their use in econometric model specification. Journal of Econometrics 16, 121–30. C The Author(s). Journal compilation C Royal Economic Society 2008.
Determining the number of factors in EC–VF model
59
Granger, C. W. J. and A. A. Weiss (1983). Time series analysis of error correction models. In S. Karlin, T. Amemiya and L. A. Goodman (Eds.), Studies in Econometrics, Time Series and Multivariate Statistics, 255–78. New York: Academic Press. Kroner, K. and J. Sultan (1993). Time-varying distributions and dynamic hedging with foreign currency futures. Journal of Financial and Quantitative Analysis 28, 535–51. Kupiec, P. (1995). Techniques for verifying the accuracy of risk measurement models. Journal of Derivatives 2, 173–84. Li, Q., J. Pan and Q. Yao (2006). On determination of cointegration rank. Working paper, Peking University. Lien, D. (1996). The effect of the cointegration relationship on futures hedging: a note. The Journal of Futures Markets 16, 773–80. Lien, D. and X. Luo (1994). Multiperiod hedging in the presence of conditional heteroscedasticity. The Journal of Futures Markets 14, 927–55. Lien, D. and Y. K. Tse (2002). Some recent developments in futures hedging. Journal of Economic Surveys 16, 357–96. Pan, J., D. Pena, W. Polonik and Q. Yao (2007). Modelling multivariate volatilities by common factors: an innovation expansion method. Working paper, The London School of Economics and Political Science. Pan, J. and Q. Yao (2008). Modelling multiple time series via common. Biometrika 95, 365–79. Phillips, P. C. B. (1991). Optimal inference in cointegrated systems. Econometrica 59, 283–306. Stock, J. H. and M. Watson (1988). Testing for common trends. Journal of the American Statistical Association 83, 1097–107. van der Vaart, A. W. and J. A. Wellner (1996). Weak Convergence and Empirical Processes. New York: Springer. van der Weide, R. (2002). Go-GRACH: a multivariate generalized orthogonal GRACH model. Journal of Applied Econometrics 17, 549–64.
APPENDIX: PROOFS OF RESULTS The first lemma shows the n (r, Bˆ r ) defined in subsection 3.2 is a non-increasing function of the number of factors r. L EMMA A.1. If 0 ≤ r 1 < r 2 ≤ d, then n (r1 , Bˆ r1 ) ≥ n (r2 , Bˆ r2 ). Proof: For 0 ≤ r1 < r2 ≤ d, Bˆ r1 can be written as (B˜ r2 , B˜ d−(r2 −r1 ) ) where B˜ r2 consists of the first d − r 2 columns of the matrix Bˆ r1 . We have
n (r1 , Bˆ r1 ) = =
sup
(B˜ r2 , B˜ d−(r2 −r1 ) ) Dˆ n,τ (C)(B˜ r2 , B˜ d−(r2 −r1 ) )
sup
(B˜ r2 Dˆ n,τ (C) B˜ r2 B˜ r2 Dˆ n,τ (C)B˜ d−(r2 −r1 )
1≤τ ≤τ0 ,C∈B 1≤τ ≤τ0 ,C∈B
B˜ d−(r2 −r1 ) Dˆ n,τ (C)B˜ r2 B˜ d−(r2 −r1 ) Dˆ n,τ (C)B˜ d−(r2 −r1 ) )
≥
sup
1≤τ ≤τ0 ,C∈B
B˜ r2 Dˆ n,τ (C)
B˜ r2 = n (r2 , B˜ r2 )
≥ n (r2 , Bˆ r2 ).
r , D). The last inequality holds because Bˆ r is the minimizer of n (B) in the metric space (HD C The Author(s). Journal compilation C Royal Economic Society 2008.
60
Q. Li and J. Pan The proof of Theorem 3.2 needs the following two lemmas.
r such that (r, B) = 0. For 0 ≤ r < L EMMA A.2. For any fixed r with r 0 ≤ r ≤ d, there exists a B ∈ HD r . r 0 , (r, B) > 0 holds for all B ∈ HD
Proof: It is clear that B A 0 = 0 implies (r, B) = 0 from the relation between (r, B) and the factor model with true loading matrix A 0 . r For r = r 0 , there must be a matrix in HD0 , denoted by B r0 , such that B r0 A0 = r0 r r0 r0 0, thus (r0 , B ) = 0 and it reaches the minimum value. We have B = B0 in HD0 by Assumption 3.2. r For r 0 < r ≤ d, let B = B00 H , where H is an arbitrary (d − r 0 ) × (d − r) matrix such that H H = r r I d−r . Then, B ∈ HD and B A 0 = 0. In the other words, (r, B00 H ) = 0. r For any B ∈ HD with r < r 0 , B A 0 = 0. If (r, B) = 0, which means that for any 1 ≤ τ ≤ τ 0 and any C ∈ B, B Dτ (C)B = 0, by choosing C = Rd , we have B A 0 E(Ft Ft )A 0 B = 0. This is impossible because E(Ft Ft ) is a positive definite matrix. L EMMA A.3. For any 0 ≤ r < r 0 , there exists a κ r > 0 such that p lim n (r, Bˆ r ) − n (r0 , Bˆ r0 ) ≥ κr , n→∞
where p lim denotes the limit in probability. For any r 0 ≤ r < d, it holds that 1 . n (r, Bˆ r ) − n (r0 , Bˆ r0 ) = Op √ n Proof: It follows from the definition of Bˆ that r n (r, Bˆ r ) − n r0 , Bˆ r0 ≥ n (r, Bˆ r ) − n r0 , B00 . r
Recall that (r0 , B00 ) = 0 by Lemma A.2. Hence, r n (r, Bˆ r ) − n r0 , B00 r r = [n (r, Bˆ r ) − (r, Bˆ r )] − n r0 , B00 − r0 , B00 + (r, Bˆ r ) = Op √1n + (r, Bˆ r ) ≥ Op √1n + (r, B0r ).
(A.1)
The second equality holds by the similar way to equation (3.6) with a slight modification that Bˆ r is related to n. The last inequality is from the definition of B 0 . These imply that, for any 0 ≤ r < r 0 , p lim n (r, Bˆ r ) − n r0 , Bˆ r0 ≥ κr := r, B0r , n→∞
and from Lemma A.2, κ r > 0. For the second part, since n (r, Bˆ r ) − n r0 , Bˆ r0 ≤ n (r, Bˆ r ) − n r0 , B r0 + n r0 , B r0 − n r0 , Bˆ r0 0 0 r ≤ 2 maxr ≤r≤d n (r, Bˆ r ) − n r0 , B 0 , 0
0
it is sufficient to prove that for any r 0 ≤ r ≤ d, r n (r, Bˆ r ) − n r0 , B00 = Op
1 . √ n
C The Author(s). Journal compilation C Royal Economic Society 2008.
61
Determining the number of factors in EC–VF model
r Notice that, from equation (A.1), n (r, Bˆ r ) − n (r0 , B00 ) = Op ( √1n ) + (r, Bˆ r ). Thus, we need to prove (r, Bˆ r ) = Op ( √1 ) for any r 0 ≤ r ≤ d, where n
(r, Bˆ r ) =
sup
1≤τ ≤τ0 ,C∈B
ˆr B Dτ (C)Bˆ r .
For an arbitrary (d − r 0 ) × (d − r) matrix H such that H H = I d−r , we have r Bˆ Dτ (C)Bˆ r r r r r r r r r = (Bˆ r − B00 H H B00 Bˆ r + B00 H H B00 Bˆ r ) Dτ (C)(Bˆ r − B00 H H B00 Bˆ r + B00 H H B00 Bˆ r ) r0 r0 r0 r0 ˆ r r r0 ˆ r r0 r ˆ ˆ (I = B B − B H H B ) B D (C) B + B H H B D (C) I − B H H B d τ τ d 0 0 0 0 0 0 r
r
r
where the last equality holds because the relation B00 A0 = 0 implies that B00 Dτ (C)B00 = 0 for any τ ≥ 1 and C ∈ B. Hence, r r r r r Bˆ Dτ (C)Bˆ r ≤ Id − B00 H H B00 Bˆ r Dτ (C) Bˆ r + B00 H H B00 Bˆ r √ r r r = D Bˆ r , B00 H Dτ (C) d − r + B00 H H B00 Bˆ r √ r ≤ D Bˆ r , B00 H Dτ (C)( d − r(1 + d − r)). r
Note that (r, B00 H ) = 0 Op ( √1n ). It is easy to Op ( √1n ).
r r by Lemma A.2, i.e. D(B00 H , B0r ) = 0. Thus, D(Bˆ r , B00 H ) = see that sup1≤τ ≤τ0 ,C∈B Dτ (C) = Op (1). Therefore, (r, Bˆ r ) =
Proof of Theorem 3.2: The objective is to verify that lim n→∞ P (PC(r) − PC(r 0 ) < 0) = 0 for all 0 ≤ r ≤ d and r = r 0 , where P C(r) − P C(r0 ) = n (r, Bˆ r ) − n r0 , Bˆ r0 − (r0 − r)g(n). For r < r 0 , if g(n) → 0 as n → ∞,
P (P C(r) − P C(r0 ) < 0) = P n (r, Bˆ r ) − n r0 , Bˆ r0 < (r0 − r)g(n) → 0
because, by Lemma A.3, n (r, Bˆ r ) − n (r0 , Bˆ r0 ) has a positive limit in probability. √ For r > r 0 , Lemma A.3 implies that n (r, Bˆ r ) − n (r0 , Bˆ r0 ) = Op ( √1n ). Thus, if ng(n) → ∞ as n → ∞, we have P (P C(r) − P C(r0 ) < 0) = P n r0 , Bˆ r0 − n (r, Bˆ r ) > (r − r0 )g(n) √ √ n n r0 , Bˆ r0 − n (r, Bˆ r ) > (r − r0 ) ng(n) → 0. =P
C The Author(s). Journal compilation C Royal Economic Society 2008.
The
Econometrics Journal Econometrics Journal (2009), volume 12, pp. 62–81. doi: 10.1111/j.1368-423X.2008.00260.x
On the impact of error cross-sectional dependence in short dynamic panel estimation VASILIS S ARAFIDIS † AND D ONALD R OBERTSON ‡ †
‡
Discipline of Econometrics and Business Statistics, The University of Sydney, NSW, 2006, Australia E-mail:
[email protected]
Faculty of Economics and Politics, The University of Cambridge, Cambridge, CB3 9DD, UK E-mail:
[email protected] First version received: September 2006; final version accepted: July 2008
Summary This paper explores the impact of error cross-sectional dependence (modelled as a factor structure) on a number of widely used IV and generalized method of moments (GMM) estimators in the context of a linear dynamic panel data model. It is shown that, under such circumstances, the standard moment conditions used by these estimators are invalid – a result that holds for any lag length of the instruments used. Transforming the data in terms of deviations from time-specific averages helps to reduce the asymptotic bias of the estimators, unless the factor loadings have mean zero. The finite sample behaviour of IV and GMM estimators is investigated by means of Monte Carlo experiments. The results suggest that the bias of these estimators can be severe to the extent that the standard fixed effects estimator is not generally inferior anymore in terms of root median square error. Time-specific demeaning alleviates the problem, although the effectiveness of this transformation decreases when the variance of the factor loadings is large. Keywords: Asymptotic bias, Cross-sectional dependence, Dynamic panel data, Generalized method of moments, Instrumental variables, Time-specific demeaning.
1. INTRODUCTION In a panel regression model with lagged endogenous variables, the fixed effects estimator (FE) is inconsistent for small T (the number of time series observations in the panel), as shown by Nerlove (1967, 1971) using simulated data and formalized by Nickell (1981) for the case of a simple first-order autoregressive model. Since then, a standard estimation approach has been to transform the regression model in first-differences and use appropriate lagged values of the dependent variable in levels as instruments for the transformed endogenous regressor (see Anderson and Hsiao, 1981, Holtz-Eakin et al., 1988 and Arellano and Bond, 1991). However, the first-differenced generalized method of moments (GMM) estimator can have poor finite sample properties (bias and imprecision) when the series is highly persistent or when the variance of the individual time-invariant unobserved effects is large relative to the variance of the purely idiosyncratic error component (Blundell and Bond, 1998). To alleviate this problem, subsequent C The Author(s). Journal compilation C Royal Economic Society 2008. Published by Blackwell Publishing Ltd, 9600 Garsington Road,
Oxford OX4 2DQ, UK and 350 Main Street, Malden, MA, 02148, USA.
On the impact of error cross-sectional dependence in short dynamic panel estimation
63
research in the field has led to the development of GMM estimators that make use of additional moment conditions, based upon certain extra assumptions about the initial conditions (see e.g. Ahn and Schmidt, 1995, Arellano and Bover, 1995, and Blundell and Bond, 1998). These methods have proved popular in empirical work; among the many examples where GMM-type estimators have been used is the empirical growth literature (Bond et al., 2001, among many others), the literature of estimating wage equations and Phillips curves (Alonso-Borrego and Arellano, 1999 and others), the literature of estimating production functions (Blundell et al., 2000), and for the estimation of money demand functions (e.g. Bover and Watson, 2005). More recently, a vigorous literature has been developed on testing and dealing with error cross-sectional dependence in panel data models. Cross-sectional dependence is a situation that is often encountered in macroeconomic and financial applications, where the ever-increasing economic and financial integration of countries and financial entities implies substantial crosssectional interactions – but also in microeconomic panel data sets, where the propensity of micro-units to behave similarly may be explained by social norms, neighbourhood effects, herd behaviour and interdependent preferences. A particular form of cross-sectional dependence that has become popular is the factor structure approach. This has been used extensively in empirical work (see e.g. Barro and Sala-i-Martin, 1992) and it has been analysed in theoretical treatments at even greater length. 1 Consequently, in this paper we use the notions of error cross-sectional dependence and factor structure dependence interchangeably. The impact of error cross-sectional dependence on the dynamic fixed effects (FE) estimator has been studied by Phillips and Sul (2003, 2007), who showed that if there is sufficient dependence across cross-sectional units, the efficiency gains that one had hoped to achieve by pooling the data largely diminish – to the extent that the FE estimator provides little gain over estimating each individual-specific time series regression separately using OLS. 2 This paper investigates the impact of error cross-sectional dependence on a number of widely used IV and GMM dynamic panel estimators. We demonstrate that estimators relying on standard instruments with respect to lagged values of the dependent variable (either in levels or in firstdifferences) are inconsistent as N → ∞ for fixed T. We show that this result holds true for any lag length of the instruments used. This is an important outcome given that error cross-sectional dependence is a likely empirical situation – the econometrician may not have sufficient explanatory variables to remove all correlated behaviour – and the desirable N-asymptotic properties of estimators based on instrumental variables rely crucially upon the assumption that the errors are uncorrelated across individuals. We also show that the asymptotic bias of these estimators will most likely decrease if the data are transformed in terms of deviations from time-specific averages prior to estimation, which we phrase as time-specific demeaning, provided that the factor loadings do not have mean zero. In the latter case, while time-specific demeaning does not have a positive effect, it does not deteriorate the properties of the estimators either. 
Simulation results confirm these findings and provide a formal justification for the practice of including common time effects in the context of a short dynamic panel data model with large N. 1 The literature on factor models is growing rapidly. See e.g. Robertson and Symons (2000), Coakley et al. (2002), Bai (2006), Pesaran (2006), to mention only a few. 2 Intuitively, if all cross-sectional units behave similarly, there is little gain to be obtained from looking at more than one of them.
C The Author(s). Journal compilation C Royal Economic Society 2008.
64
V. Sarafidis and D. Robertson
The structure of the paper is as follows. Section 2 sets out the assumptions of our model and Section 3 provides the main results of the paper. Section 4 analyses the asymptotic bias reduction of the IV estimator achieved by time-specific demeaning of the data. Section 5 describes the Monte Carlo design and discusses the simulation results of the paper. A final section concludes.
2. ASYMPTOTIC BIAS OF INSTRUMENTAL VARIABLES AND GMM ESTIMATORS We consider the following first-order autoregressive panel data model 3 yit = λyit−1 + υit , i = 1, . . . , N and t = 2, . . . , T υit = αi + uit , uit =
M
φmi fmt + εit = φi ft + εit ,
(2.1)
m=1
where y it is the observation of the dependent variable of the ith cross-sectional unit at time t and λ is the unknown parameter of interest with 0 < λ < 1. α i denotes an individual-specific time-invariant effect with zero mean and constant, finite variance σ 2α . u it obeys a multi-factor structure, where f t = (f 1t , . . . , f Mt ) denotes an M × 1 vector of individual-invariant timespecific unobserved effects, φ i = (φ 1i , . . . , φ Mi ) is an M × 1 vector of factor loadings and ε it is a purely idiosyncratic component with zero mean and constant, finite variance σ 2ε . The factor structure approach is widely used to model error cross-sectional dependence because it can approximate a wide variety of dependence forms, provided that the number of factors allowed is sufficiently large. 4 We make the following assumptions: A SSUMPTION 2.1. E(α i ε it ) = 0 for all i, t. A SSUMPTION 2.2. E(ε it ε is ) = 0 for all i and t = s. A SSUMPTION 2.3. E(y i1 ε it ) = 0, for t = 2, 3, . . . T. IM for t = s A SSUMPTION 2.4. E(ft ) = 0, E(ft fs ) = 0 otherwise. A SSUMPTION 2.5. E(φ i ) = μ φ , E[(φ i − μ φ ) (φ i − μ φ ) ] = φ , where ||μ φ || < B 1 and φ is an M × M positive semi-definite matrix. A SSUMPTION 2.6. E(ε it φ i ) = 0, E(ε it f t ) = 0, E(α i φ i ) = 0, E(α i f t ) = 0, E(φ i ft ) = 0 for all i, t. Assumptions 2.1–2.3 are standard in the GMM literature. Assumption 2.2 can be easily relaxed by allowing ε it ∼ MA(k), where k is a small number. Assumption 2.3 means that the initial conditions are predetermined. This ensures that sufficiently lagged values of y it will be uncorrelated with ε it . Assumption 2.4 implies that the factors are serially and mutually 3 The main results of this paper naturally extend to panel autoregressive processes of higher order, as well as to autoregressive distributed lag panel models. See e.g. Sarafidis et al. (2008). 4 See footnote 1.
C The Author(s). Journal compilation C Royal Economic Society 2008.
On the impact of error cross-sectional dependence in short dynamic panel estimation
65
uncorrelated. Assumption 2.5 ensures that the initial observations are bounded. Assumption 2.6 implies that f t and φ i are mutually uncorrelated, as well as uncorrelated with α i and ε it for all i and t. Define the (T − 1) × M matrix F = [f 2 , f 3 , . . . f T ] and the N × M matrix = [φ 1 , φ 2 , . . . φ N ] . The initial model given in (2.1) can be written more compactly as Y = λY−1 + α + F + ε,
(2.2)
where Y = [Y 1 , . . . , Y N ] , a N × (T − 1) matrix with Y i = (y i2 , y i3 , . . . , y iT ) , Y −1 = [Y 1,−1 , . . . , Y N,−1 ] a N × (T − 1) matrix with Y i,−1 = (y i1 , y i2 , . . . , y iT −1 ) , α = [α 1 , α 2 , . . . , α N ] with α i = α i i T −1 and i T −1 being a (T − 1) × 1 column vector of ones and ε = [ε 1 , . . . , ε N ] with ε i = (ε i2 , ε i3 , . . . , ε iT ) . Given that F = AA−1 F , where A is an arbitrary M × M invertible matrix with M(M − 1)/2 free elements, identification of the effects between factors and factor loadings requires M(M − 1)/2 restrictions. 5 These can be obtained by requiring φ to be a diagonal matrix, which implies that factor loadings from different factors are mutually uncorrelated.
3. ASYMPTOTIC BIAS OF INSTRUMENTAL VARIABLES AND GMM ESTIMATORS Under error cross-sectional dependence, the standard IV and GMM estimators are inconsistent as N → ∞ for fixed T. This is an important result given that the applied econometrician may not have sufficient explanatory variables to remove all correlated behaviour and the desirable N-asymptotic properties of these estimators rely crucially upon this assumption. In order to illustrate this point, suppose first that φ i = 0 so that there is no error cross-sectional dependence, in which case (2.1) may be written in first-differences as yit = λ yit−1 + εit .
(3.1)
To overcome the induced endogeneity between the lagged dependent variable and the resulting error term, Anderson and Hsiao (1981) suggested a choice of single instruments for y it−1 , the most popular of which has been y it−2 . 6 Holtz-Eakin et al. (1988) and Arellano and Bond (1991) pointed out that under Assumptions 2.1–2.3 the autoregressive model in (2.1) implies that the following [(T − 1) (T − 2)/2] linear moment conditions are valid: E(yit−s εit ) = 0;
for t = 3, . . . , T
and 2 ≤ s ≤ t − 1.
(3.2)
These moment conditions give rise to a first-differenced GMM (DIF GMM) estimator, which is consistent for fixed T and asymptotically more efficient than the standard IV estimator. On the other hand, DIF GMM has been shown to be subject to a weak instruments problem when λ → 1 or σ 2α /σ 2ε → ∞. Hence, Blundell and Bond (1998) developed an approach outlined in Arellano and Bover (1995), which uses y it−1 for t = 3, 4, . . . , T, as additional instruments with respect to the equations in levels and results in a system GMM (SYS GMM) estimator. This approach is valid provided that the deviations of the initial observations from the long-run convergent 5 The total number of elements of A is M 2 but we have already imposed the standard orthonormalization, E(f ) = 0 t and var(f t ) = I M , which yields M(M + 1)/2 restrictions. 6 See Arellano (1989).
C The Author(s). Journal compilation C Royal Economic Society 2008.
66
V. Sarafidis and D. Robertson
αi values are uncorrelated with the individual effects – namely, E[αi (yi1 − 1−λ )] = 0. Under this mean-stationarity restriction on the initial conditions of the data generating process, the following T − 2 additional linear moment conditions are valid 7
E( yit−1 vit ) = 0;
for t = 3, 4, . . . , T .
(3.3)
However, with error cross-sectional dependence, the moment conditions outlined in (3.2) and (3.3) are violated as the following proposition demonstrates. P ROPOSITION 3.1. Under Assumptions 2.1–2.6 and model (2.1), the moment conditions used by the standard DIF GMM and SYS GMM estimators are violated as N → ∞ for fixed T. In particular, we have: Moment conditions used by DIF GMM: E yit−s uit | {fn }t−∞ = ft φ + μφ μφ wt−s = 0; for t = 3, . . . , T and 2 ≤ s ≤ t − 1, where wt−s =
∞
j =0
(3.4) λj ft−s−j .
Additional moment conditions used by SYS GMM: E yit−1 υit | {fn }t−∞ = ft φ + μφ μφ wt−1 = 0; for t = 3, 4, . . . , T , j where wt−1 = ∞ j =0 λ ft−1−j .
(3.5)
Notice that these moment conditions are violated for any lag length of the instruments used. This implies that standard estimation procedures that try to exploit orthogonality conditions between lags of the dependent variable and the error process will not be consistent for fixed T as N → ∞ regardless of the lag length of the instruments used. R EMARK 3.1. The unconditional expectation in (3.4) and (3.5) is equal to zero because f t has mean zero. The conditional expectation, of course, need not be zero. In our large N asymptotics, it is the conditional expectation that is relevant because we are never taking large T averages. It is useful to consider y it−2 as a single instrument for the first-differenced endogenous regressor in (3.1), which gives rise to the standard IV estimator (Anderson and Hsiao, 1981). In particular, under Assumptions 2.1–2.6 and model (2.1), the asymptotic bias of the IV estimator for λ has the following convenient representation 8 T
plimN→∞ N1 N (T − 2) 2 −1 i=1 t=3 yit−2 uit σ plimN→∞ λI V − λ = = ηNT ηDT − , T 1+λ ε plimN→∞ N1 N i=1 t=3 yit−2 yit−1 (3.6) where ηNT = wt−2 = 7 8
∞
j =0
T
T ft φ + μφ μφ wt−2 , ηDT = wt−1 φ + μφ μφ wt−2 ,
t=3
λj ft−2−j and wt−1 =
∞
j =0
t=3
λj ft−1−j .
Kiviet (2007) refers to this as a ‘stationary accumulated effects’ restriction. See Appendix A. C The Author(s). Journal compilation C Royal Economic Society 2008.
67
On the impact of error cross-sectional dependence in short dynamic panel estimation
For the single factor case, the asymptotic bias expression reduces to 2 σφ + μ2φ κ1 plimN→∞ λI V − λ = 2 , −2) 2 σε σφ + μ2φ κ2 − (T1+λ
(3.7)
T 2 2 where σφ2 + μ2φ = plimN→∞ N1 N and κ2 = i=1 φi = E(φi ), κ1 = t=3 wt−2 ft T T T 2 w w = w w − (w ) . t−1 t−2 t−2 t=3 t=3 t−1 t−2 t=3 For any fixed T, the magnitude of the asymptotic bias of λI V depends on the mean and the variance of the factor loadings and the variance of ε it . Clearly, if σ 2φ and μ φ are negligible, or the purely idiosyncratic component, ε it , dominates the error process, the asymptotic bias is relatively small. Similarly, for serially uncorrelated stochastic factors and large T, the asymptotic bias diminishes because T1 κ1 = op (1). However, in this case the bias of the fixed effects estimator disappears too so the IV estimator loses its relative merit anyway. 9 Notice that the sign of the asymptotic bias of λI V depends on the sign of κ 1 and κ 2 . For instance, if κ2 < [(T − 2)/(1 + λ)][σε2 /(σφ2 + μ2φ )], the asymptotic bias will be negative for κ 1 > 0 and positive for κ 1 < 0. Under Assumption 2.4 it can be shown that 10 E(κ1 ) = 0 E(κ2 ) = −(T − 2)/(1 − λ2 )
(3.8)
cov(κ1 , κ2 ) < 0. Hence, there are two cases that we need to consider: (i) for κ 2 < E(κ 2 ), κ 1 is more likely to be positive. But as in this case κ 2 is smaller than [(T − 2)/(1 + λ)][σε2 /(σφ2 + μ2φ )], the denominator in (3.7) is positive and therefore the asymptotic bias of λI V has a negative sign; (ii) similarly, for κ 2 > E(κ 2 ) κ 1 is more likely to be negative. Therefore, for κ2 > [(T − 2)/(1 + λ)][σε2 / λI V has a negative (σφ2 + μ2φ )] the denominator in (3.7) is positive and the asymptotic bias of sign again. The opposite holds for −(T − 2)/(1 − λ2 ) < κ2 < [(T − 2)/(1 + λ)][σε2 /(σφ2 + μ2φ )]. Notice also that for f t ∼ N(0, 1), the actual distribution of κ 2 is skewed to the left because it involves the sum of squares of T − 2 standard normal terms (with a minus sign in front). This implies some large negative values for κ 2 , which are naturally associated with large positive λI V . As a result, the asymptotic bias of λI V values for κ 1 and a large negative asymptotic bias for is likely to be negative – an outcome that has been confirmed in our simulation experiments.
4. REDUCING THE BIAS OF IV AND GMM IN SHORT DYNAMIC PANELS One way to reduce the amount of error cross-sectional dependence and therefore the bias of the IV and GMM estimators is to transform the data in terms of deviations from time-specific averages. This is an appealing procedure because it is easy to implement and it does not require
9 See Phillips and Sul (2007) for an analysis of the properties of the dynamic fixed effects estimator with error crosssectional dependence. 10 See Appendix A.
C The Author(s). Journal compilation C Royal Economic Society 2008.
68
V. Sarafidis and D. Robertson
projecting out the unobserved factors or estimating the factor loadings, both of which typically require large T. 11 Transforming the data in terms of deviations from time-specific averages is equivalent to including common time-specific effects in the regression model, which is standard practice in the estimation of short dynamic panels as a way of capturing common variations in the dependent variable. 12 In order to see the impact of this transformation it is instructive to reconsider the simple IV estimator. Specifically, averaging (2.1) over i and subtracting yields yit − y t = (αi − α) + λ yit−1 − y t−1 + φi − φ ft + (εit − ε t ) , (4.1) N N where y t = i=1 yit /N, φ = i=1 φi /N, and similarly for the remaining variables. As a result, the mean value of the factor loadings has been removed and the error term has now mean zero. Taking first-differences in the above equation yields yit − y t = λ yit−1 − y t−1 + φi − φ ft + (εit − ε t ) (4.2) = λ yit−1 − y t−1 + (uit − ut ) . Using (yit−2 − y t−2 ) as an instrument for (yit−1 − y t−1 ) gives rise to an IV estimator with the following probability limit 13 T yit−2 − y t−2 [ (uit − ut )] plimN→∞ N1 N i=1 t=3 λI V − λ = plimN→∞ T plimN→∞ N1 N i=1 t=3 yit−2 − y t−2 yit−1 − y t−1
(T − 2) 2 −1 (4.3) = ηNT ηDT − σε , 1+λ where ηNT =
T
ft φ wt−2 , ηDT =
t=3
T
wt−1 φ wt−2
t=3
and w t−2 , w t−1 are defined below (3.6). For M = 1, (4.3) reduces to λI V − λ = plimN→∞ where σφ2 = plimN→∞ N1 (3.7).
N
i=1 (φi
− φ)2 , φ =
1 N
σφ2 κ1
, (4.4) −2) 2 σφ2 κ2 − (T1+λ σε φi and κ 1 and κ 2 are defined below
11 Such methods have been proposed by Robertson and Symons (2000), Coakley et al. (2002), Phillips and Sul (2003), Moon and Perron (2004), Bai (2006) and Pesaran (2006). However, these methods are justified only in a set up where T is large. Ahn et al. (2001) developed a fixed T consistent GMM approach that controls for a single time-varying individual effect (which is similar to a single-factor structure with no individual-specific time-invariant effects) in a model with strictly exogenous regressors. We do not consider this approach here because our focus is on a dynamic panel with a multi-factor structure and the usual individual-specific time-invariant effects. 12 See, for example, Arellano and Bond (1991) and Blundell et al. (2000). 13 See Appendix A.
C The Author(s). Journal compilation C Royal Economic Society 2008.
On the impact of error cross-sectional dependence in short dynamic panel estimation
69
Comparing (3.7) and (4.4), it will follow that the IV estimator applied to time-specific demeaned data, λI V , will have a smaller asymptotic bias compared to the IV estimator in the basic model if one can show that
σφ2
−2) 2 asy.bias( σε λI V ) σφ2 ·κ2 − (T1+λ
= (4.5) RB λI V =
< 1, (σφ2 +μ2φ )
asy.bias λI V
(T −2) 2 2 2 (σφ +μφ )·κ2 − 1+λ σε where RB( λI V ) denotes the relative bias of λI V compared to λI V . Intuitively, as time-specific demeaning reduces the amount of error cross-sectional dependence (by removing the mean value of φ i ) and because the latter is responsible for the non-zero asymptotic bias of the IV estimator, it is natural to expect that RB( λI V ) < 1. This will hold true indeed, unless κ 2 takes an unusually large positive value, in which case it is possible for λI V . The following proposition provides the necessary the bias of λI V to be greater than that of condition for (4.5) to hold true. P ROPOSITION 4.1. Under Assumptions 2.1–2.6 and (2.1), the asymptotic bias of the IV estimator will be reduced when the estimator is applied to time-specific demeaned data so long as 2 2σφ + μ2φ (T − 2)σε2 κ2 < B = 2 > 0. (4.6) 2 σφ + μ2φ σφ2 (1 + λ) Using Assumption 2.4, κ 2 will most likely be smaller than this bound, unless the value of σ 2φ is unrealistically large. 14 In particular, as an indication, for f t standard normal we have simulated the probability that κ 2 ≥ B and we found that for – say – σ 2φ = 10, P r(κ 2 ≥ B) = 0.00087. To see what this practically means, if the factor loadings are uniformly distributed, such that φ i ∼ i.i.d.U [a, b], σ 2φ = 10 gives a difference between a and b of 10.95, which seems an unlike degree of heterogeneity to arise in most empirical applications; and even if it does, the probability that (4.6) is violated is still very small. Therefore, time-specific demeaning will most likely have favourable effects for practical purposes, at least in large samples. 15 Turning to (4.5), for a given value of κ 2 < B and T fixed, the magnitude of RB( λI V ) will depend on σ 2φ , μ φ and σ 2ε . For example, if μ φ is zero, the numerator and the denominator in (4.5) are exactly the same and therefore RB( λI V ) equals unity; there is no gain from timespecific demeaning of the data. On the other hand, if the factor loadings are the same across λI V ) is also zero; time-specific demeaning is all individuals, σ 2φ is equal to zero and hence RB( fully effective. λI V ), for a Figure 1 shows graphically the relative asymptotic bias of λI V , denoted as RB( range of values of σ 2φ and μ φ , while setting T = 6, λ = 0.4, σ 2ε = 1 and κ 2 = E(κ 2 ). We can see that RB( λI V ) is consistently less than 1 unless μ φ is zero, in which case time-specific demeaning λI V ) → 0 as σ 2φ becomes smaller. Also, has no effect. For a given non-zero value of μφ , RB( 2 for a fixed value of σφ RB( λI V ) decreases as μ φ becomes larger, although the rate of decrease 2 depends on the value of σ φ . Similar results hold for different values of λ.
14 This result is intuitive because the effect that time-specific demeaning has on reducing the asymptotic bias of the IV estimator is small in this case. 15 This is unless μ = 0, as we will see. φ C The Author(s). Journal compilation C Royal Economic Society 2008.
70
V. Sarafidis and D. Robertson
Figure 1. Relative asymptotic bias of λI V .
Figure 2. Asymptotic bias of λI V and λI V and relative bias of λI V .
Figure 2 illustrates the asymptotic bias of λI V and λI V , as well as RB( λI V ) for a range of values of μ φ and σ 2ε , while setting T = 6, λ = 0.4, σ 2φ = 1, κ 2 = E(κ 2 ). Observe that while the asymptotic bias of both λI V and RB( λI V ) falls (the surface goes up since it is negative) with 2 λI V than it is for RB( λI V ) (unless higher values of σ ε , the rate of decrease is much lower for λI V ) decreases as σ 2ε becomes larger. μ φ = 0). Thus, the net effect is that RB( C The Author(s). Journal compilation C Royal Economic Society 2008.
On the impact of error cross-sectional dependence in short dynamic panel estimation
71
5. SMALL SAMPLE PROPERTIES OF ESTIMATORS We investigate the finite sample properties of the IV, DIF GMM and SYS GMM estimators with error cross-sectional dependence. In the simulation experiments presented below, we have restricted ourselves regarding the generality of the Monte Carlo design by focusing on three specifications for the distribution of the factor loadings, chosen on the basis of existing literature. However, in conjuction with the analytical results above, we aim to make rather more general inferences about the properties of the estimators. Notice that we are only interested in the small-T, large-N case, i.e. samples where these estimators are routinely applied by practitioners to estimate dynamic panel data models. 5.1. Experimental design The underlying data generating process is given by yit = λyit−1 + αi + uit , uit = φi ft + εit ,
(5.1)
where α i , ε it and f t are drawn in each replication from i.i.d.N (0, σ 2α ), i.i.d.N (0, σ 2ε ) and i.i.d.N(0, 1), respectively. To control the degree and heterogeneity of error cross-sectional dependence, we follow closely Phillips and Sul (2003) and we consider three specifications for the distribution of the factor loadings – namely, φ i = 0, φ i ∼ i.i.d.U [0.5, 2.1] and φ i ∼ i.i.d.U [1, 4]. The last two are chosen as examples of medium and high cross-sectional dependence, corresponding to an average error correlation coefficient that is roughly equal to 0.55 and 0.80, respectively. 16 Following Kiviet (1995) and Bun and Kiviet (2006), we choose σ 2α such that the impact of the two error components, a i and u it , on var(y it ) is held constant across different experiments. This is because the performance of the GMM estimators depends on the ratio σ 2α /σ 2u and therefore as the level of cross-sectional dependence rises, the impact of α i on var(y it ) will tend to fall, making the comparisons across experiments with different levels of cross-sectional dependence invalid. Hence, noticing that ⎛ ⎞ ∞ 2 σα σu2 σ2 σα2 var(yit ) = + var ⎝ λj uit−j ⎠ = + = (ψ 2 + 1) u 2 , 2 2 2 (1 − λ) (1 − λ) 1−λ 1−λ j =0 where ψ 2 =
σα2 /(1−λ)2 , σu2 /(1−λ2 )
we choose σ 2α by setting 1−λ 1−λ ψ 2 σu2 = ψ 2 [var(φi ft ) + var(εit )] , σα2 = 1+λ 1+λ
with var(φi ft ) = [E(ft )]2 σφ2 + [E(φi )]2 σf2 + σφ2 σf2 .
16 Phillips and Sul (2003) set φ ∼ i.i.d.U [0, 0.2] and φ ∼ i.i.d.U[1, 4] as examples of low and high error crossi i sectional dependence, respectively. Therefore, the bounds we choose for medium cross-sectional dependence are the average values of the bounds of these two specifications.
C The Author(s). Journal compilation C Royal Economic Society 2008.
72
V. Sarafidis and D. Robertson
Normalizing σ 2ε to the value of one and given that E(f t ) = 0, the following result is obtained 1−λ 2 ψ 2 μ2φ + σφ2 + 1 . σα = 1+λ We consider N = 100, 400 and T = 6, 10, since the focus of the analysis is T fixed, N large. λ alternates between 0.4 and 0.8 and ψ is set equal to one. The initial value of y it is set equal to zero and the first 50 observations are discarded before choosing our sample, so as to ensure that the initial conditions do not have an impact on the results. 2000 replications are performed in each experiment. 5.2. Results Since the IV estimator has no finite moments, Table A1 in the appendix reports median bias and root median square error (denoted as RMedSE) for all estimators. The latter is defined as λr − λ)2 , RMedSE = median ( where λr is an estimator of λ in the rth draw. λF E , λI V , λDI F and λSY S denote the FE, IV, DIF GMM and SYS GMM estimators, λI V , λDI F and λSY S denote the corresponding estimators operated on respectively, while λF E , time-specific demeaned data. The GMM estimators are estimated in two steps and they use the second and third lag of the dependent variable (in levels) as instruments for the endogenous regressor in the first-differenced equations. 17 We can see that with zero error cross-sectional dependence, the median bias of λI V , λDI F and λSY S is small for λ = 0.4 and it increases for λ λF E = 0.8, especially for λI V and λDI F . Regardless of this, all estimators perform better than with respect to both bias and RMedSE. Also, λDI F and λSY S outperform λI V , and they perform similarly to each other for λ = 0.4 but not for λ = 0.8, in which case λSY S does somewhat better. 18 Notice that transforming the data in terms of deviations from time-specific averages, even when this is redundant, has no adverse effects on either median bias or RMedSE for all estimators. With error cross-sectional dependence, the situation changes considerably; first, the results suggest that all estimators – without exception – experience a large increase in bias and RMedSE. This is regardless of the size of N, T and the value of λ. The direction of the bias appears to be negative, which confirms the analysis provided below (3.8). Secondly, in plenty of circumstances λDI F can be so severe that these estimators are consistently the downward bias in λI V and outperformed by λF E with respect to RMedSE. The impact of time-specific demeaning on estimation differs, depending on the variance of the factor loadings. When σ 2φ is not large, the transformation is effective in reducing bias and RMedSE considerably for all estimators. Naturally, the performance of the estimators improves with the size of N and T, as expected. On the other hand as σ 2φ gets larger, the effectiveness of time-specific demeaning decreases, although it still reduces noticeably the bias and RMedSE for λDI F all estimators compared to the case where the data have not been transformed. λI V and 2 suffer from severe bias, especially when λ = 0.8 and σ φ is large, which tends to decrease slightly Furthermore, SYS GMM uses the optimal weighting matrix (when σ 2α = 0), as derived in Windmeijer (2000). However, notice that our Monte Carlo does not consider data series where the initial conditions deviate from mean stationarity. In this case, SYS GMM is not consistent while DIF GMM remains so. Our design is also specific in that it imposes ψ = 1. See Bun and Kiviet (2006) and Kiviet (2007). 17 18
C The Author(s). Journal compilation C Royal Economic Society 2008.
On the impact of error cross-sectional dependence in short dynamic panel estimation
73
Table 1. Comparison between RB( λI V ) and the finite sample results.
λDI F −λ λSY S −λ
T =6 λ = 0.4 RB( λI V ) λλI V −λ
λDI F −λ λSY S −λ I V −λ N = 100 φ i ∼ i.i.d.U [0.5, 2.1]
0.268
0.247
0.235
0.113
φ i ∼ i.i.d.U [1, 4] N = 400 φ i ∼ i.i.d.U [0.5, 2.1] φ i ∼ i.i.d.U [1, 4]
0.489 0.268 0.489
0.294 0.214 0.265
0.445 0.127 0.363
0.110 0.026 0.057
as N grows. λSY S performs comparatively better than the other estimators in all cases. Table 1 λI V compared to λI V as defined in (4.5), for evaluates RB( λI V ), the relative asymptotic bias of a subset of the parameters specified in our Monte Carlo design, and compares this with the finite sample results that we have obtained for all estimators. 19 As we can see, for IV and DIF GMM there is a common expected pattern, which is that as σ 2φ increases, the relative median bias of the estimators performed on time-specific demeaned data increases, although the actual results are somewhat better than what is implied by (4.5). 20 For the system GMM estimator, the change in relative bias is not as dramatic.
6. CONCLUDING REMARKS This paper has analysed the impact of error cross-sectional dependence on a number of widely used IV and GMM estimators in the context of a linear dynamic panel data model with fixed T. It has been demonstrated that the estimators relying on standard instruments with respect to lagged values of the dependent variable (either in levels or in first-differences) are inconsistent as N → ∞ for fixed T. This result persists for any lag length of the instruments used. This is an important outcome given that the error cross-sectional dependence is a likely empirical situation in many applications. Transforming the data in terms of deviations from time-specific averages is shown to have a favourable effect in bias and RMedSE when there is error cross-sectional dependence, whilst it has no adverse effect otherwise. This provides a formal justification for using common time effects when estimating short dynamic panels based on methods of moments estimators, even in those cases where dealing with cross-sectional dependence does not seem to be a priority or particularly relevant for the empirical researcher.
ACKNOWLEDGMENTS This paper has benefited substantially from the comments and suggestions of two anonymous referees and a Co-Editor, Frank Windmeijer. We would also like to thank Richard Gerlach, Jan Kiviet, Daniel Oron, Vladimir Smirnov and Takashi Yamagata for helpful discussions. All remaining errors are our own. The first author gratefully acknowledges full financial support from the ESRC during his Ph.D. at Cambridge University (PTA-030-2002-00328).
For instance, ( λI V − λ)/( λI V − λ) is the modulus of the ratio between the median bias of λI V and the median bias of λI V . Similarly for the other estimators. 20 Qualitatively similar conclusions can be made for λ = 0.8, although this time the results are somewhat worse than implied by (4.5). 19
C The Author(s). Journal compilation C Royal Economic Society 2008.
74
V. Sarafidis and D. Robertson
REFERENCES Ahn, S. C., Y. H. Lee, and P. Schmidt (2001). GMM estimation of linear panel data models with timevarying individual effects. Journal of Econometrics 101, 219–55. Ahn, S. C. and P. Schmidt (1995). Efficient estimation of models for dynamic panel data. Journal of Econometrics 68, 5–28. Alonso-Borrego, C. and M. Arellano (1999). Symmetrically normalized instrumental-variable estimation using panel data. Journal of Business and Economic Statistics 17, 36–49. Anderson, T. W. and C. Hsiao (1981). Estimation of dynamic models with error components. Journal of the American Statistical Association 76, 598–606. Arellano, M. (1989). A note on the Anderson-Hsiao estimator for panel data. Economic Letters 31, 337–41. Arellano, M. and S. Bond (1991). Some tests of specification for panel data: monte carlo evidence and an application to employment equations. Review of Economic Studies 58, 277–97. Arellano, M. and O. Bover (1995). Another look at the instrumental variable estimation of error-component models. Journal of Econometrics 68, 29–51. Bai, J. (2006). Panel data models with interactive fixed effects. Working paper, New York University. Barro, R. and X. Sala-i-Martin (1992). Convergence. Journal of Political Economy 100, 223–51. Blundell, R. and S. Bond (1998). Initial conditions and moment restrictions in dynamic panel data models. Journal of Econometrics 87, 115–43. Blundell, R., S. Bond, and F. Windmeijer (2000). Estimation in dynamic panel data models: improving on the performance of the standard GMM estimators. In B. Baltagi (Ed.), Nonstationary Panels, Panel Cointegration, and Dynamic Panels. Advances in Econometrics, Volume 15, 53–91. New York: JAI Press, Elsevier Science. Bond, S., A. Hoeffler and J. Temple (2001). GMM Estimation of empirical growth models. Oxford Economic Papers 2001-W21, Nuffield College, University of Oxford. Bover, O. and N. Watson (2005). Are there economies of scale in the demand for money by firms? Some panel data estimates. Journal of Monetary Economics 52, 1569–89. Bun, M. J. G. and J. Kiviet (2006). The effects of dynamic feedbacks on LS and MM estimator accuracy in panel data models. Journal of Econometrics 132, 409–44. Coakley, J., A. Fuertes, and R. Smith (2002). A principal components approach to cross-section dependence in panels. Working paper, Birkbeck College, University of London. Holtz-Eakin, D., W. Newey, and H. Rosen (1988). Estimating vector autoregressions with panel data. Econometrica 56, 1371–96. Kiviet, J. (1995). On bias, inconsistency, and efficiency of various estimators in dynamic panel data models. Journal of Econometrics 68, 53–78. Kiviet, J. (2007). Judging contending estimators by simulation: tournaments in dynamic panel data models. In G. D. A. Phillips and E. Tzavalis (Eds.), The Refinement of Econometric Estimation and Test Procedures; Finite Sample and Asymptotic Analysis, 282–318. Cambridge, UK: Cambridge University Press. Moon, R. G. and B. Perron (2004). Efficient estimation of the SUR cointegrating regression model and testing for purchasing power parity. Econometric Reviews 23, 293–23. Nerlove, M. (1967). Experimental evidence on the estimation of dynamic economic relations from a time series of cross-sections. Economic Studies Quarterly 18, 42–74. Nerlove, M. (1971). Further evidence on the estimation of dynamic economic relations from a time series of cross-sections. Econometrica 39, 359–87. Nickell, S. (1981). Biases in dynamic models with fixed effects. Econometrica 49, 1417–26.
C The Author(s). Journal compilation C Royal Economic Society 2008.
On the impact of error cross-sectional dependence in short dynamic panel estimation
75
Pesaran, H. (2006). Estimation and inference in large heterogeneous panels with a multifactor error structure. Econometrica 74, 967–1012. Phillips, P. and D. Sul (2003). Dynamic panel estimation and homogeneity testing under cross-sectional dependence. Econometrics Journal 6, 217–59. Phillips, P. and D. Sul (2007). Bias in dynamic panel estimation with fixed effects, incidental trends and cross-sectional dependence. Journal of Econometrics 137, 162–88. Robertson, D. and J. Symons (2000). Factor residuals in SUR regressions: estimating panels allowing for cross-sectionalal correlation. Working paper, Faculty of Economics and Politics, The University of Cambridge. Sarafidis, V., T. Yamagata, and D. Robertson (2008). A test of error cross section dependence for a linear dynamic panel model with regressors. Working paper, Faculty of Economics and Politics, The University of Cambridge. Windmeijer, F. (2000). Efficiency comparisons for a system GMM estimator in dynamic panel data models. In R. D. H. Heijmans, D. S. G. Pollock, and A. Satora (Eds.), Innovations in Multivariate Statistical Analysis. Dordrecht: Kluwer Academic Publishers.
APPENDIX A: PROOFS OF RESULTS Proof of Proposition 3.1: For the moment conditions used by DIF GMM, we have
E yit−s uit | {fn }t−∞
⎡⎛
⎞ ∞ ∞ α i + φi = E⎣ ⎝ λj ft−s−j + λj εit−s−j ⎠ 1−λ j =0 j =0 ⎤
φi ft + εit | {fn }t−∞ ⎦ ⎡
=E
⎣ ft φi φi
∞
⎤ λ
j
ft−s−j | {fn }t−∞ ⎦
= ft φ + μφ μφ wt−s = 0,
(A.1)
j =0
∞ j where φ + μ φ μφ = plimN→∞ N1 N i=1 φi φi and wt−s = j =0 λ ft−s−j . For the additional moment conditions of SYS GMM we have ⎡⎛ ⎞ ∞ ∞ λj ft−1−j + λj εit−1−j ⎠ E yit−1 υit | {fn }t−∞ = E ⎣ ⎝φi ⎤
j =0
j =0
αi + φi ft + εit | {fn }t−∞ ⎦ ⎡ = E ⎣ft φi φi
∞
⎤
λj ft−1−j | {fn }t−∞ ⎦ = ft φ + μφ μφ wt−1 = 0,
(A.2)
j =0
where wt−1 =
∞
j =0
λj ft−s−j .
Proof of (3.6): The derivation of η NT in (3.6) follows directly from (A.1) by replacing conditional expectations with plims and setting s = 2. C The Author(s). Journal compilation C Royal Economic Society 2008.
N = 100, T = 6
0.409
λI V
0.411
λI V
0.380
λDI F
λSY S
λI V
λF E
λF E
λI V
0.379
0.404
0.404
0.369
0.369
0.767
Error cross-sectional independence: φ i = 0
λDI F
0.775
λI V
0.732
λDI F
0.732
λDI F
0.785
λSY S
0.786
λSY S
0.234
0.404
0.400
0.388
0.388
0.408
0.408
0.556
0.556
0.793
0.784
0.764
0.764
0.799
0.799
0.098
0.402
0.402
0.393
0.393
0.399
0.399
0.372
0.372
0.803
0.806
0.781
0.781
0.798
0.798
0.402
0.402
0.397
0.397
0.401
0.401
0.557
0.557
0.807
0.807
0.790
0.790
0.798
0.798
0.092
0.242
0.439
0.170
0.346
0.249
0.417
0.309
0.361
0.124
0.562
0.274
0.608
0.553
0.795
(0.343) (0.308) (0.732) (0.279) (0.306) (0.115) (0.218) (0.094) (0.492) (0.439) (1.20) (0.869) (0.553) (0.236) (0.267) (0.091)
0.063
Medium cross-sectional dependence: φ i ∼ i.i.d.U [0.5, 2.1]
(0.166) (0.165) (0.053) (0.053) (0.019) (0.019) (0.018) (0.018) (0.243) (0.243) (0.171) (0.172) (0.027) (0.027) (0.020) (0.020)0
0.235
(0.302) (0.302) (0.086) (0.085) (0.034) (0.034) (0.028) (0.029) (0.428) (0.428) (0.275) (0.274) (0.055) (0.055) (0.030) (0.030)
0.098
(0.166) (0.166) (0.112) (0.114) (0.041) (0.041) (0.039) (0.039) (0.244) (0.244) (0.357) (0.366) (0.060) (0.060) (0.040) (0.040)
N = 400, T = 10 0.236
N = 400, T = 6
0.095
λF E
(0.304) (0.305) (0.175) (0.174) (0.069) (0.069) (0.058) (0.060) (0.431) (0.431) (0.523) (0.533) (0.116) (0.118) (0.063) (0.046)
0.096
N = 100, T = 10 0.237
N = 100, T = 6
λF E
Table A1. Median point estimates of λ and root median square error of FE, IV, DIF GMM and SYS GMM. λ = 0.4 λ = 0.8
76 V. Sarafidis and D. Robertson
C The Author(s). Journal compilation C Royal Economic Society 2008.
C The Author(s). Journal compilation C Royal Economic Society 2008.
0.233
0.364
0.303
0.422
0.502
0.546
0.148
0.703
0.450
0.699
0.624
0.808
0.094
0.269
0.428
0.203
0.375
0.240
0.396
0.310
0.360
0.176
0.712
0.314
0.687
0.517
0.778
0.227
0.339
0.418
0.231
0.380
0.279
0.397
0.502
0.545
0.174
0.772
0.456
0.733
0.602
0.786
0.077
0.152
0.327
0.054
0.246
0.154
0.427
0.273
0.338
0.013
High cross-sectional dependence: φ i ∼ i.i.d.U [1, 4] 0.181
0.185
0.334
0.449
0.797
0.207
0.450
0.146
0.280
0.250
0.450
0.475
0.525 -0.025 0.261
0.355
0.484
0.565
0.809
0.085
0.196
0.454
0.078
0.283
0.153
0.386
0.280
0.338
0.035
0.193
0.214
0.399
0.408
0.749
0.267
0.436
0.134
0.291
0.228
0.394
0.476
0.523 -0.007 0.307
0.352
0.508
0.533
0.774
(0.241) (0.189) (0.830) (0.402) (0.304) (0.148) (0.221) (0.099) (0.327) (0.277) (1.27) (1.05) (0.450) (0.297) (0.278) (0.088)
0.213
(0.364) (0.316) (1.01) (0.541) (0.398) (0.196) (0.310) (0.135) (0.526) (0.463) (1.38) (1.22) (0.616) (0.423) (0.409) (0.139)
0.059
(0.235) (0.186) (0.828) (0.409) (0.297) (0.164) (0.208) (0.112) (0.329) (0.275) (1.24) (1.02) (0.445) (0.318) (0.241) (0.085)
0.217
(0.380) (0.324) (0.990) (0.563) (0.417) (0.231) (0.309) (0.163) (0.536) (0.462) (1.38) (1.18) (0.641) (0.500) (0.375) (0.132)
0.048
N = 400, T = 10 0.185
N = 400, T = 6
0.420
(0.208) (0.173) (0.620) (0.166) (0.221) (0.055) (0.167) (0.044) (0.299) (0.255) (1.15) (0.547) (0.339) (0.082) (0.206) (0.048)
N = 100, T = 10 0.195
N = 100, T = 6
0.316
(0.333) (0.306) (0.753) (0.231) (0.287) (0.082) (0.222) (0.065) (0.490) (0.440) (1.26) (0.739) (0.512) (0.153) (0.300) (0.073)
0.074
N = 400, T = 10 0.200
N = 400, T = 6
0.228
(0.202) (0.172) (0.620) (0.197) (0.222) (0.071) (0.161) (0.060) (0.299) (0.254) (1.13) (0.062) (0.351) (0.120) (0.182) (0.055)
N = 100, T = 10 0.208
On the impact of error cross-sectional dependence in short dynamic panel estimation
77
78
V. Sarafidis and D. Robertson For η DT we have T N 1 yit−2 yit−1 N i=1 t=3 ⎧ ⎫ ∞ j ∞ j αi ⎪ T ⎪ N 1 ⎨ 1−λ + φi j =0 λ ft−2−j + j =0 λ εit−2−j ⎬ = plimN→∞ ∞ j ⎪ N i=1 t=3 ⎪ j ⎭ ⎩ · φi ∞ j =0 λ ft−1−j + j =0 λ εit−1−j ⎡⎛ ⎞ ⎛ ⎞⎤ N ∞ ∞ T 1 ⎣⎝ j = plimN→∞ λ ft−1−j ⎠ φi φi ⎝ λj ft−2−j ⎠⎦ N i=1 t=3 j =0 j =0 ⎡ ⎤ T ∞ N ∞ 1 ⎣ j λ εit−2−j λj εit−1−j ⎦ + plimN→∞ N i=1 t=3 j =0 j =0
plimN→∞
=
T t=3
(T − 2) 2 σ , wt−1 φ + μφ μφ wt−2 − 1+λ ε
(A.3)
j where wt−1 = ∞ j =0 λ ft−1−j . Then (3.6) follows directly by multiplying η NT with the inverse of η DT . Proof of (3.8): First, given Assumption 2.4, we have E(κ1 ) = E[ Tt=3 wt−2 (ft − ft−1 )] = 0 and " E(κ2 ) = E
T
#
"
wt−1 wt−2 − E
t=3
= (T − 2)
T
# (wt−2 )
2
t=3
1 1 λ < 0. − (T − 2) = −(T − 2) 1 − λ2 1 − λ2 1+λ
(A.4)
Therefore, the covariance between κ 1 and κ 2 equals " cov(κ1 , κ2 ) = E(κ1 κ2 ) = E −E
" T t=3
T
wt−2 ft
t=3
wt−2 ft−1
T t=3
T t=3
#
"
wt−1 wt−2 − E #
wt−1 wt−2 + E
"
T
wt−2 ft
t=3 T t=3
wt−2 ft−1
T
# (wt−2 )
t=3 T
(wt−2 )
2
# 2
.
(A.5)
t=3
It is straightforward to show that the individual elements of (A.5) are equal to the following terms: % # ⎧ λ $T −4 " T T 2j ⎨ λ (2T − 6 − 2j ) + (T − 3) for T ≥ 3 2 j =1 wt−2 ft wt−1 wt−2 = 1−λ E ⎩ 0 otherwise. t=3 t=3 % " T # ⎧ 2λ2 $T −5 T 2j ⎨− j =1 λ (T − 4 − j ) + (T − 4) for T ≥ 5 1−λ2 2 −E wt−2 ft (wt−2 ) = ⎩ 0 otherwise. t=3 t=3 $ % " T # ⎧ T ⎨ − 1 T −3 λ2j (2T − 4 − 2j ) + (T − 2) for T ≥ 3 j =1 1−λ2 −E wt−2 ft−1 wt−1 wt−2 = ⎩ 0 otherwise. t=3 t=3 % " T # ⎧ 2λ $T −4 T 2j ⎨ j =1 λ (T − 3 − j ) + (T − 3) for T ≥ 4 1−λ2 2 E wt−2 ft−1 (wt−2 ) = ⎩ 0 otherwise. (A.6) t=3 t=3 C The Author(s). Journal compilation C Royal Economic Society 2008.
On the impact of error cross-sectional dependence in short dynamic panel estimation
79
Hence, the covariance between κ 1 and κ 2 will be negative provided that F (T ) = λ
T −4
λ2j (2T − 6 − 2j ) − 2λ2
j =1
+ 2λ
T −5
λ2j (T − 4 − j ) −
j =1 T −4
T −3
λ2j (2T − 4 − 2j )
j =1
λ2j (T − 3 − j ) − 2λ2 (T − 4) + 3λ(T − 3) < (T − 2), for T ≥ 4.
(A.7)
j =1
(A.7) holds immediately for T = 3 and T = 4 because F (3) = 0 < (T − 2) = 1 and F (4) = −2λ2 + 3λ < 2 ⇒ λ/(1 + λ2 ) < 2/3, which is true for any ∀λ ∈ R. Below we demonstrate that this result holds for any T ≥ 5 using induction. In particular, let us assume that for some T ≥ 5 the following is true F (T ) < T − 2.
(A.8)
We need to prove that (A.8) holds for T + 1 as well. Notice first that (A.8) can also be written as F (T ) = λ2 F (T − 1) + λ3 (4T − 16) + λ(3T − 9) − λ3 (3T − 12) − λ2 (4T − 14) < T − 2. Therefore, we obtain F (T + 1) = λ2 F (T ) + λ3 (4T − 12) + λ(3T − 6) − λ3 (3T − 9) − λ2 (4T − 10) = = λ2 F (T ) + λ3 (T − 3) + λ(3T − 6) − λ2 (4T − 10) < T − 1.
(A.9)
Since (λ − 1)3 < 0 and (2 − λ) (1 − λ) > 0, we have
(λ − 1)3 (T − 3) < (2 − λ)(1 − λ) ⇒ 1 + (λ − 1)3 (T − 3) < T − 3 + (2 − λ)(1 − λ) ⇒
(λ3 − 3λ2 + 3λ)(T − 3) < T − 1 − 3λ + λ2 ⇒ λ3 − 3λ2 + 3λ (T − 3) + 3λ − λ2 < T − 1 ⇒
λ3 (T − 3) − λ2 (3T − 8) + λ(3T − 6) < T − 1 ⇒ λ2 (T − 2) + λ3 (T − 3) + λ(3T − 6) − λ2 (4T − 10) < T − 1.
(A.10)
The last inequality in (A.10) is very similar to (A.9), with the only difference being that (T − 2) replaces F(T) in the first term. But since F(T) < T − 2, (A.9) must also be true. This proves that for any T ≥ 5, cov(κ 1 , κ 2 ) < 0. Proof of (4.3): For ηNT in (4.3) we get the following expression T N % $ 1 yit−2 − y t−2 φi − φ ft + (εit − εt ) = N i=1 t=3 ⎧$ % ⎫ ∞ j ∞ j αi −α ⎪ T ⎪ N ⎨ ⎬ ε + φ − φ λ f + λ − ε i t−2−j it−2−j t−2−j j =0 j =0 1−λ 1 = plimN→∞ % $ ⎪ ⎪ N ⎩ ⎭ · φ − φ f + (ε − ε )
plimN→∞
i=1 t=3
=
T
i
ft φ wt−2 ,
t=3 C The Author(s). Journal compilation C Royal Economic Society 2008.
t
it
t
(A.11)
80
V. Sarafidis and D. Robertson
where φ = plimN→∞ N1
N
i=1 (φi
− φ)(φi − φ) and w t−2 has been defined above. For ηDT we have
N T 1 yit−2 − y t−2 yit−1 − y t−1 N i=1 t=3 ⎧$ % ⎫ ∞ j ∞ j ⎪ T ⎪ αi −α N ⎬ j =0 λ ft−2−j + j =0 λ εit−2−j − ε t−2−j 1 ⎨ 1−λ + φi − φ = plimN→∞ % $ ⎪ ∞ j ⎪ · φ − φ ∞ λj f N i=1 t=3 ⎩ ⎭ i t−1−j + j =0 j =0 λ εit−1−j − ε it−1−j ⎡ ⎤ T N ∞ ∞ 1 ⎣ j = plimN→∞ λ ft−1−j φi − φ φi − φ λj ft−2−j ⎦ N i=1 t=3 j =0 j =0 ⎤ ⎡ T N ∞ ∞ 1 ⎣ j λ εit−2−j − εt−2−j λj εit−1−j − εit−1−j ⎦ +plimN→∞ N i=1 t=3 j =0 j =0
plimN→∞
=
T
wt−1 φ wt−2 −
t=3
(T − 2) 2 σ . 1+λ ε
(A.12)
Then, (4.3) follows directly by multiplying (A.11) with the inverse of (A.12). Proof of Proposition 4.1: Essentially, what we need to show is that
σφ2 + μ2φ σφ2
<
.
2
σφ · κ2 − (T −2) σε2 (σφ2 + μ2φ ) · κ2 − (T −2) σε2 1+λ 1+λ
(A.13)
For this, we need to consider three cases for the value of κ 2 and we will naturally assume that T ≥ 3. Case I: κ2 <
[(T −2)/(1+λ)]σε2 $ % . σφ2 +μ2φ
In this case, the denominator of both ratios in (A.13) is negative, so the inequality above reduces to σφ2 2 − σφ · κ −
<
σφ2 + μ2φ
⇒ −2) 2 − (σφ2 + μ2φ ) · κ − (T1+λ σε
2 2 2 2 (T − 2) 2 2 (T − 2) 2 2 2 2 2 σ σ < − σφ + μφ σφ · κ − σ σ + μφ ⇒ − σφ + μφ σφ · κ − 1+λ ε φ 1+λ ε φ (T − 2) 2 2 (T − 2) 2 2 σ σ < σ σ + μ2φ , (A.14) 1+λ ε φ 1+λ ε φ
(T −2) 2 σ 1+λ ε
which implies that (A.13) will always be satisfied under Case I. This also means that (4.5) holds true for any κ 2 < 0. Case II: κ2 >
[(T −2)/(1+λ)]σε2 . σφ2
Notice here that the denominator of both ratios in (A.13) is positive. Hence, (A.13) becomes equal to σφ2 + μ2φ σφ2 2 < 2 ⇒ (T −2) 2 −2) 2 2 σε σφ · κ2 − 1+λ σε σφ + μφ · κ2 − (T1+λ
2 (T − 2) 2 2 (T − 2) 2 2 σε σφ < σφ2 + μ2φ σφ2 · κ2 − σε σφ + μ2φ ⇒ σφ + μ2φ σφ2 · κ2 − 1+λ 1+λ (T − 2) 2 2 (T − 2) 2 2 − σ σ <− σ σ + μ2φ , (A.15) 1+λ ε φ 1+λ ε φ C The Author(s). Journal compilation C Royal Economic Society 2008.
On the impact of error cross-sectional dependence in short dynamic panel estimation
which, however, cannot be true given that all parameters are positive. Therefore, for any κ2 >
81
[(T −2)/(1+λ)]σε2 , σφ2
time-specific demeaning of the data will not result in better asymptotic properties for the IV estimator. Case III:
[(T −2)/(1+λ)]σε2 σφ2
> κ2 >
[(T −2)/(1+λ)]σε2 $ % . σφ2 +μ2φ
In this case, (A.13) reduces to σφ2 + μ2φ < 2 ⇒ −2) 2 2 (σφ + μφ ) · κ2 − (T1+λ σε
2 (T − 2) 2 2 (T − 2) 2 2 σε σφ < − σφ2 + μ2φ σφ2 · κ2 − σε σφ + μ2φ ⇒ σφ + μ2φ σφ2 · κ2 − 1+λ 1+λ 2 (T − 2) 2 σ ⇒ 2 σφ + μ2φ σφ2 · κ2 < 2σφ2 + μ2φ 1+λ ε 2 −2) 2 2σφ + μ2φ (T1+λ σε (A.16) 2 κ2 < . 2 2 2 σφ + μφ σφ σφ2 2 − σφ · κ2 −
(T −2) 2 σ 1+λ ε
This provides the complete proof of Proposition 4.1.
C The Author(s). Journal compilation C Royal Economic Society 2008.
The
Econometrics Journal Econometrics Journal (2009), volume 12, pp. 82–104. doi: 10.1111/j.1368-423X.2008.00277.x
Value at Risk with time varying variance, skewness and kurtosis—the NIG-ACD model A NDERS W ILHELMSSON † †
Department of economics, Lund University, P.O. Box 7082, S-220 07 Lund, Sweden E-mail:
[email protected] First version received: October 2007; final version accepted: November 2008
Summary A new model for financial returns with time varying variance, skewness and kurtosis based on the Normal Inverse Gaussian (NIG) distribution is proposed. The new model and two previously suggested NIG models are evaluated by their Value at Risk (VaR) forecasts on a long series of daily Standard and Poor’s 500 returns. All three models perform very well compared with extant models and clearly outperform a Gaussian GARCH model. Moreover, the results show that only the new model cannot be rejected as providing correct conditional VaR forecasts. Keywords: GARCH, Normal inverse Gaussian distribution, Time varying kurtosis, Time varying skewness, Value at Risk.
1. INTRODUCTION Realistic modelling of financial time series is of utmost importance in asset pricing and risk management. Empirical ‘facts’ for equity returns that should be accounted for include skewed leptokurtic return distributions and dependence in second moments. The second moment dependence and, to some extent, the leptokurtosis are addressed in the seminal article of Engle (1982). Among the models that account for the excess kurtosis not captured by the Gaussian GARCH (GARCH-n) model is the model of Barndorff-Nielsen (1997) based on the Normal inverse Gaussian (NIG) distribution. This distribution, in addition to having nice analytical properties, can be theoretically motivated from the mixture of distribution hypothesis of Clark (1973). Extensions of Barndorff-Nielsen’s model that allow for complex dynamics in the variance equation have been proposed by Andersson (2001), Jensen and Lunde (2001), as well as Forsberg and Bollerslev (2002). Recent studies by, for example, Harvey and Siddique (1999, 2000) indicate that there is also dependence in the conditional skewness and possibly in the kurtosis of stock returns. Alternative models for conditional skewness and/or kurtosis are proposed by, among others, Hansen (1994), Harvey and Siddique (1999), Guermat and Harris (2002), Mittnik ˜ and Paolella (2003), Br¨ann¨as and Nordman (2003a,b), Niguez and Perote (2004), as well as Lanne and Saikkonen (2007). These contributions motivate my extension of the Jensen and Lunde (2001) model to comprise not only time dependence in the conditional variance but also a time-dependent conditional skewness and kurtosis. The model proposed in this study has several advantages over previous models. The parameters that govern the shape of the distribution need not be restricted as in Hansen (1994). C The Author(s). Journal compilation C Royal Economic Society 2009. Published by Blackwell Publishing Ltd, 9600 Garsington Road,
Oxford OX4 2DQ, UK and 350 Main Street, Malden, MA, 02148, USA.
The NIG-ACD model
83
Both the skewness and kurtosis are time varying, whereas in Harvey and Siddique (1999) and Lanne and Saikkonen (2007) only the skewness and in Guermat and Harris (2002) only the kurtosis is allowed to vary over time. The model has a closed form likelihood, making ˜ estimation easy, in contrast to Mittnik and Paolella (2003) and Niguez and Perote (2004), where the likelihood lacks an analytical expression. The attainable combinations of skewness and kurtosis are much more flexible than in Br¨ann¨as and Nordman (2003a,b). Furthermore, the NIG distribution can be motivated from economic theory and is in that sense not an ad hoc choice, such as the Student’s t distribution. In the initial estimation sample of daily Standard and Poor’s 500 returns ranging from July 3, 1962 to July 11, 1974, the new model shows a dramatic improvement in terms of insample fit. With the ongoing adoption of Basel II, which allows banks to use internal Value at Risk (VaR) models for the purpose of regulating capital requirements, there is much practical as well as academic interest in this measure. VaR is the maximum loss expected to incur over a certain time period (h) with a given probability α. The new model as well as the models of Jensen and Lunde (2001), Hansen (1994), and Forsberg and Bollerslev (2002) are applied to compute VaR forecasts each day from July 12, 1974 to September 20, 2005, giving 7879 forecasts, each based on the 3000 latest observations. Since the capital requirements for a bank are directly affected by the number of VaR exceptions, i.e. the number of occasions when the actual loss is larger than predicted by the VaR model, evaluating VaR models by their ability to produce a correct number of exceptions (correct unconditional coverage) seems natural. However, Christoffersen (1998) points out that the exceptions from a correctly specified model should also be independently distributed over time. Using the terminology of Christoffersen (1998), a VaR model that has the correct number of exceptions, which are also independent, is said to have correct conditional coverage. The VaR forecasts in this study are therefore evaluated both by their conditional and unconditional coverage. I find that the models based on the NIG-distribution perform very well with an unconditional coverage that cannot be rejected as incorrect for any of the six VaR levels evaluated. The NIGautoregressive conditional density (NIG-ACD) model proposed in this study as well as the models of Jensen and Lunde (2001) and Forsberg and Bollerslev (2002) are in the green zone as defined in BASEL (1996), whereas the ACD model of Hansen (1994) is in the yellow zone and a GARCH-n model is in the red zone. The green zone means that no additional capital requirements are necessary. The capital requirements given by a model in red zone have to be scaled upwards and measures to improve the model must be taken immediately. Further, the NIG-ACD model is the only model that cannot be rejected as providing correct conditional forecasts for any of the six VaR levels investigated. The rest of this article is structured as follows. Section 2 presents the theoretical and empirical motivation for modelling financial returns using the NIG distribution. Section 3 presents the model whereas Section 4 describes the data and estimation. Section 5 gives a brief introduction to VaR and backtesting of VaR models. The results are presented in Section 6. Section 7 summarizes and discusses the findings.
2. THEORETICAL AND EMPIRICAL MOTIVATIONS FOR THE NIG DISTRIBUTION To capture the characteristics of financial returns such as non-normality, conditional heteroscedasticity (Mandelbrot, 1963, and Fama, 1965) and leverage effects for stocks (Black C The Author(s). Journal compilation C Royal Economic Society 2009.
84
A. Wilhelmsson
1976), a vast number of models has been proposed. Amongst the most successful are the GARCH-type models, for a review see (1994) or the collection of articles in Engle (1995). Previous research show these models to capture the persistence in volatility well. They also capture some but not all of the excess kurtosis in the data. To remedy this problem, alternative error distributions have been proposed. Among these are the Student’s t (Bollerslev, 1987), generalized error distribution (Nelson, 1991) and the skewed Student’s t (Hansen, 1994). The effect of different error distributions for estimation efficiency has recently been investigated in a simulation setting by Venter and de Jongh (2004), and their results favoured the NIG distribution for most of the data generation processes used. Since the number of possible distributions to choose from is very large and since results are also dependent on the specification of the mean and variance equations, the number of possible combinations is daunting. This is true even if we restrict ourselves to the GARCH class models. For example, Hansen and Lunde (2005) examine 330 different model specifications; despite this impressive number of models, their study is far from being exhaustive. An alternative to an empirical or simulation based hunt for the best distribution is needed. The current study pursues the use of a distribution that, in addition to capturing the salient features of the data, can be motivated from economic theory. Consider the time t price of a financial asset, for example, a stock price, denoted by Pt , whose continuously compounded return over the unit time interval is given by rt = log (Pt /Pt−1 ) assuming possible dividends being added to the price. The mixture of distribution hypothesis (MDH) of Clark (1973) states that the conditional distribution of rt given a latent information arrival process (directing process) is normal. Traditionally the directing process has been assumed to follow a lognormal distribution resulting in a lognormal normal mixture distribution for the returns, which unfortunately cannot be written in closed form. 1 Instead, I follow Barndorff-Nielsen (1997) and assume the conditional mixing distribution, which is the distribution of the directing process, to be the inverse Gaussian. That is, σt2 |t−1 ∼ IG(δ, γ ) with t being the information set up to and including time t information. Forsberg (2002) tests this assumption empirically on realized variance calculated from 5 minute ECU/USD data and find that the inverse Gaussian distribution provides an even better fit for both the conditional and unconditional variance than the lognormal distribution. The density function √ of an IG-distributed variable x, is given by f (x; δ, γ ) = δx −3/2 exp[δγ − 12 (δ 2 x −1 + γ 2 x)]/ 2π . The results in Barndorff-Nielsen (1977, 1978) then give that the unconditional distribution of rt must be NIG. In contrast to the lognormal normal mixture distribution, the density of the NIG distribution can be expressed in closed form using the Bessel function. Ease of estimation is thus greatly enhanced and can be done by straightforward (numerical) maximum likelihood. The density function of the NIG distribution is given by x − μ −1 α f (x; α, β, μ, δ) = exp δ α 2 − β 2 − βμ q π δ (2.1) x−μ exp (βx) × K1 δαq δ √ with 0 ≤ |β| < α, δ > 0 and q(z) = 1 + z2 . K1 (·) is the modified Bessel function of third order and index one. α controls the kurtosis of the distribution and β the asymmetry. 
The location and 1 Clark (1973) assumed an iid lognormal distribution. Later Taylor (1986) relaxed this assumption and let the variance, which proxies for the information arrival, follow an auto-regression, resulting in the stochastic volatility model.
C The Author(s). Journal compilation C Royal Economic Society 2009.
The NIG-ACD model
85
Figure 1. Skewness-kurtosis bounds.
scale of the distribution are decided by μ and δ, respectively. The attractive features of the NIG distribution include the ability to fit leptokurtic and skewed data combined with nice analytical properties. In particular, the NIG distribution is closed under convolution, for fixed values of α and β, meaning that if, for example, daily returns are NIG distributed then weekly returns will also be NIG distributed. Figure 1 illustrates what levels and combinations of skewness and kurtosis are attainable by the NIG distribution. The results in Jondeau and Rockinger (2003) are used to show that the skewness (μ 3 ) is always bounded for a given level of kurtosis (μ 4 ) by μ23 < μ 4 − 1, assuming zero mean and unit variance. The results for the NIG distribution are compared with the generalized skewed t distribution of Hansen (1994). For the NIG-distribution the bounds are found by setting α = 0.9999β and computing the skewness and kurtosis for a fine grid of values that corresponds to levels of kurtosis ranging from 3.01 to 30. The bounds for the Generalized skewed t distribution are computed as in Jondeau and Rockinger (2003). As can be seen from Figure 1, the NIG-distribution is generally more flexible in accommodating varying combinations of skewness and kurtosis than the generalized skewed t distribution, making it a strong candidate distribution for financial modelling. The NIG distribution is a special case of the generalized hyperbolic distribution, which was introduced to the field of finance by Eberlein and Keller (1995) and Barndorff-Nielsen (1995). It has shown promising results for computing VaR in Eberlein, Keller, and Prause (1998), Bauer (2000), Forsberg and Bollerslev (2002), as well as Venter and de Jongh (2002). For more details about the distribution, including the moment generating function, see Barndorff-Nielsen (1997) and references therein.
3. PRESENTATION OF THE NIG-ACD MODEL The discussion in the previous section showed the unconditional return distribution to be NIG according to the MDH if the directing process is inverse Gaussian. However, above I have C The Author(s). Journal compilation C Royal Economic Society 2009.
86
A. Wilhelmsson
assumed that the mixing distribution is iid even though the conditional variance and possibly higher moments are time-varying. The contribution of the current study is to capture this effect by making the three parameters in the NIG distribution that govern the variance, skewness and kurtosis conditional on prior information. In previous models based on the NIG distribution (Andersson, 2001, Jensen and Lunde, 2001, and Forsberg and Bollerslev, 2002), only the variance is allowed to vary over time. To specify the dynamics of the variance, it is convenient to have it depend on a single parameter. This is done by using the location-scale invariant parametrization in Jensen and Lunde (2001) α¯ = αδ, β¯ = βδ, resulting in the density
x − μ −1 α¯ t (x − μ) 2 2 ¯ ¯ ¯ q exp α¯ t − βt + βt fJ L x; α¯ t , β t , μ, δ t = π δt δt δt (3.1) x−μ × K1 α¯ t q δt with 0 ≤ |β¯t | < α¯ t , δt > 0. Let γ¯t = α¯ t2 − β¯t2 ∈ R+ and ρt = (β¯t /α¯ t ) = (βt /αt ) ∈ [0, 1). Now specify the mean equation according to 1/2
rt = μ + γ¯t
δt ρt + δt ηt
(3.2) 3/2 γ¯t
1/2 with δ t η t = ε t and the distribution of η t is NIG(α¯ t , β¯ t , −γ¯t ρt , α¯ t ). The t subscripts on the parameters are added to indicate parameters that can vary over time. The purpose of the above specification used by Jensen and Lunde (2001) is that the mean and variance of η t will equal 0 and 1, respectively. Moreover, the conditional mean and variance of the returns will be given 1/2 by E(rt |t−1 ) = δt γ¯t ρt + μ and Var(rt |t−1 ) = δt2 . The return in (3.2) can be divided into 1/2 three parts—a constant mean μ, a compensation for time-varying volatility risk γ¯t δt ρt and a return innovation ε t . The sign of the risk compensation is given by the ρ t parameter, which as pointed out in Lanne and Saikkonen (2007) is a limitation, since a positive compensation for time-varying volatility risk is expected, but to model negative skewness, the ρ t parameter must be negative. However, the specification in (3.2) is necessary to be able to model the conditional variance within the GARCH framework. Moreover, there is no theoretical reason for why the risk compensation must be positive in an intertemporal setting as pointed out in, e.g. Abel (1988) and Glosten, Jagannathan and Runkle (1993). In the NIG-ACD model, as in the ACD model of Hansen (1994), the rescaled innovations η t = ε t /δ t are uncorrelated but not independent of each other since higher order dependence is present. The conditional standard deviation, δ t , is chosen to evolve according to the asymmetric power ARCH (APARCH) model of Ding, Granger, and Engle (1993)
υ υ + a |εt−1 | − τ εt−1 . (3.3) δtυ = c + bδt−1
So far the model is identical to the NIG-S & ARCH model of Jensen and Lunde (2001). The current study adds to the literature by adding time variation in skewness and kurtosis. Recent interest in conditional higher moments (see, inter alia, Harvey and Siddique, 1999, 2000, Dittmar, 2002, Guermat and Harris, 2002, as well as Christoffersen, Heston and Jacobs, 2006) motivates this extension. This is done by the steepness and asymmetry parameters, given in BarndorffNielsen and Prause (2001), ξ = (1 + γ¯ )−1/2 , which is closely related to the kurtosis and χ = ρξ , ¯ < α¯ makes the region for the attainable which is related to the skewness. The restriction 0 ≤ |β| steepness and asymmetry a triangle in R2 given by {(χ , ξ ): −1 < χ < 1, 0 < ξ < 1}, which is C The Author(s). Journal compilation C Royal Economic Society 2009.
87
The NIG-ACD model
called the NIG shape triangle. I make the steepness and asymmetry of the distribution conditional on the data set according to 2 γ¯t = exp λ0 +λ1 εt−1 + λ2 log (γ¯t−1 ) (3.4) ρ˜t = θ0 + θ1 εt−1
(3.5)
with ρ˜t = ). The exponential form in (3.4) is used to guarantee that γ¯t is positive without having to impose any restrictions on the estimated parameters λ 0 , λ 1 and λ 2 . Similarly, since t ) ∈ R, and so, (3.5) can be estimated without any restrictions on the ρt ∈ [0, 1), ρ˜t = log( 1+ρ 1−ρt parameters θ 0 and θ 1 . Equation (3.4) for the steepness can be seen similar to the EGARCH model Nelson (1991) proposed for the variance, but (3.4) does not allow for different responses in steepness to positive and negative return innovations of the same magnitude. A specification for (3.5) that allows for more persistence by adding ρ˜t−1 and ε3t−1 as explanatory variables was tried but turned out to be insignificant. It should be mentioned that these other specifications were only tried in the in sample estimation and not on the forecasting part of the sample. The parameters γ¯t and ρ˜t are closely related, but not identical, to the kurtosis and skewness. The conditional skewness depends on both ρ˜t and, to some degree, also on α¯ t . The conditional kurtosis depends on both γ¯t and on the skewness parameter ρ˜t . To investigate the effect of this, the conditional skewness is plotted for ρ˜t ∈ (−0.30, 0.30) × α¯ t ∈ {4, 11, 22}, a region that covers 98.3% of the empirical data values of ρ˜t and α¯ t . As can be seen from Figure 2, the skewness is an increasing and slightly concave function of ρ˜t . The effect of α¯ t on the skewness increases for small values of α¯ t but is still minor compared with the effect of ρ˜t . Figure 3 shows the conditional kurtosis for γ¯t ∈ (0.4, 40) × ρ˜t ∈ {−0.30, −0.15, 0} (99.8% of the sample values are in this region). The kurtosis is shown to be a decreasing and concave function of γ¯t and the influence of the skewness parameter ρ˜t on the kurtosis is very minor. In conclusion, γ¯t and ρ˜t do jointly determine kurtosis and skewness, but the influence of ρ˜t on t log( 1+ρ 1−ρt
Figure 2. Skewness as a function of ρ̃ and ᾱ (for ᾱ = 4, 11, 22).
Figure 3. Kurtosis as a function of γ̄ and ρ̃ (for ρ̃ = −0.30, −0.15, 0).
kurtosis and ᾱ_t on skewness is small, and in the remainder of the text γ̄_t will be referred to as the kurtosis parameter and ρ̃_t as the skewness parameter. One alternative would be to model the skewness and kurtosis directly. However, the combinations of skewness and kurtosis must then be restricted to lie within the allowed region in Figure 1. This can be achieved with numerical techniques during optimization, but there would be no guarantee that forecasted values of skewness and kurtosis would be in the allowed region, and this alternative is not pursued further. The full NIG-ACD model proposed in this study is given by equations (3.2)–(3.5).² The NIG-S&ARCH model of Jensen and Lunde (2001) is given by (3.2) and (3.3), meaning that the steepness and asymmetry of the NIG distribution are forced to be constant in their specification. The conditional moments of the NIG distribution are given in Appendix A. In addition to the NIG-ACD and NIG-S&ARCH models, the GARCH-NIG model of Forsberg and Bollerslev (2002), a GARCH model with Gaussian errors and the ACD model of Hansen (1994) are also used in the empirical part. These models are described in Appendix B.
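To make the higher-moment dynamics concrete, the following sketch (my own illustration, not code from the paper) performs one update of (3.4)-(3.5) and maps the result to the shape-triangle coordinates χ_t = ρ_t ξ_t and ξ_t = (1 + γ̄_t)^(−1/2); all parameter values are placeholders.

    import numpy as np

    def update_shape(eps_prev, gammabar_prev, lam0, lam1, lam2, theta0, theta1):
        """One step of eqs. (3.4)-(3.5)."""
        gammabar = np.exp(lam0 + lam1 * eps_prev ** 2 + lam2 * np.log(gammabar_prev))
        rhotilde = theta0 + theta1 * eps_prev
        return gammabar, rhotilde

    def shape_triangle(gammabar, rhotilde):
        """Steepness xi and asymmetry chi of the conditional NIG distribution."""
        rho = (np.exp(rhotilde) - 1.0) / (np.exp(rhotilde) + 1.0)   # invert log((1+rho)/(1-rho))
        xi = (1.0 + gammabar) ** -0.5
        return rho * xi, xi

    gbar, rtil = update_shape(eps_prev=-1.5, gammabar_prev=8.0,
                              lam0=0.9, lam1=-0.2, lam2=0.7, theta0=-0.06, theta1=0.22)
    chi, xi = shape_triangle(gbar, rtil)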
4. DATA AND ESTIMATION In this section, we take a look at the data and describe the estimation and forecasting procedures. Estimation results are presented and discussed, and the section ends with model diagnostics and density forecasting results.
2 The model can also be derived from the stochastic volatility literature (see Jensen and Lunde, 2001), which explains why the name NIG-S & ARCH (Stochastic volatility and ARCH) was chosen by Jensen and Lunde.
Table 1. Descriptive statistics of the Standard and Poor's 500 index.

Measure                          Initial estimation sample    Forecasting sample
No. of observations              3000                         7879
Daily mean (%)                   0.025                        0.046
Yearly standard deviation (%)    10.89                        16.08
Maximum (%)                      4.90                         8.50
Minimum (%)                      −3.20                        −21.70
Skewness                         0.081                        −1.340
Excess kurtosis                  3.233                        30.50
JB                               <0.001                       <0.001

Notes: This table shows the descriptive statistics for the daily Standard and Poor's 500 percentage returns for the initial estimation sample July 3, 1962 to July 11, 1974 as well as for the out-of-sample period July 12, 1974 to September 20, 2005. JB is the P value from the Jarque and Bera (1987) test with the null hypothesis of normally distributed returns.
4.1. Description of data

Examining VaR performance is an examination of rare events and hence requires a long time series of data. A long daily time series of high-quality data, which is also economically interesting, is the Standard and Poor's 500 index obtained from the CRSP record. The descriptive statistics for the dividend-adjusted continuously compounded returns from July 3, 1962 (t = 0) to September 20, 2005 (t = 11,879) in Table 1 show, as is usual for daily stock return data, that normality is overwhelmingly rejected, with P values of the Jarque and Bera (1987) statistic less than 0.001. The forecasting part of the sample is more volatile, with a yearly average standard deviation of 16.08% compared with 10.89% in the initial estimation sample, and with an excess kurtosis of 30.49 (4.85, excluding the crash of October 19, 1987) compared with 3.23 in the estimation sample. The sample skewness is slightly positive in the estimation sample (0.08) and negative in the forecast sample at −1.34 (−0.12, with the return of October 19, 1987 removed).

4.2. Estimation

The initial model estimation uses the first 3000 observations, which will be called the estimation sample. The rest of the data are saved for forecasting purposes. The models are estimated using a rolling window so that the parameters are estimated on observations t = 0 to 2999, and the complete density is forecasted for t = 3000 for each of the models. Then the estimation sample is rolled forward one day so that the parameters are estimated on observations t = 1 to 3000, and new forecasts are calculated. This procedure is repeated for the whole sample, resulting in 7879 density forecasts for each model. The models are estimated by numerical maximum likelihood using the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm. The Bessel function is numerically calculated using the polynomial approximation formula in Abramowitz and Stegun (1965, p. 378). Using Monte Carlo simulations, with the data generation process given by the NIG-ACD model and parameter values equal to the estimates in Table 2, I find that the parameters of the model can be well estimated from 3000 observations, but the parameters that govern the skewness and kurtosis cannot be well estimated with as few as 1000 observations.³
³ The Monte Carlo results are unreported for brevity but can be obtained from the author upon request.
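The rolling-window estimation described in Section 4.2 can be organized as in the sketch below; neg_loglik stands for a user-supplied negative log-likelihood of whichever model is being fitted (it is not given in the paper), and the window length and optimizer choice simply mirror the text.

    import numpy as np
    from scipy.optimize import minimize

    def rolling_parameters(returns, neg_loglik, theta0, window=3000):
        """Re-estimate the model on a window of `window` observations, roll the window
        forward one day at a time and store the parameters behind each density forecast."""
        params, theta = [], np.asarray(theta0, dtype=float)
        for start in range(len(returns) - window):
            sample = returns[start:start + window]
            res = minimize(neg_loglik, theta, args=(sample,), method="BFGS")
            theta = res.x                    # warm start for the next window
            params.append(theta.copy())      # parameters used to forecast day start + window
        return np.array(params)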
Table 2. Estimation results for daily percentage Standard and Poor's 500 returns.

Parameter      NIG-ACD              ACD                  NIG-S&ARCH           GARCH-NIG            GARCH-n
μ              0.0842* (0.0366)     0.0537** (0.0100)    0.1033** (0.0189)    0.0529** (0.0097)    0.0510** (0.0103)
a              0.0707** (0.0055)    0.1378** (0.0172)    0.0705** (0.0010)    0.1296** (0.0173)    0.1294** (0.0188)
b              0.9303** (0.0061)    0.8522 (0.0172)      0.9353** (0.0112)    0.8671** (0.0165)    0.8597** (0.0192)
c              0.0093** (0.0020)    0.0095** (0.0020)    0.0085** (0.0024)    0.0069** (0.0017)    0.0075** (0.0023)
τ              0.8965** (0.0796)    –                    0.7975** (0.0868)    –                    –
υ              1.0068** (0.0191)    –                    0.8615** (0.1328)    –                    –
ᾱ              –                    –                    6.3557** (1.4876)    4.0789** (0.9496)    –
β̄              –                    –                    −0.3877** (0.1052)   –                    –
λ0             0.8793** (0.1026)    −0.2649** (0.0620)   –                    –                    –
λ1             −0.1878** (0.0601)   −0.3062** (0.0768)   –                    –                    –
λ2             0.7085** (0.0911)    0.0115 (0.0508)      –                    –                    –
θ0             −0.0614 (0.0377)     −1.1187** (0.3828)   –                    –                    –
θ1             0.2187** (0.0471)    −1.9594** (0.7110)   –                    –                    –
θ2             –                    0.8230 (0.5287)      –                    –                    –
Stationarity   0.986                0.990                0.988                0.9967               0.9891
LL             −2541.45             −2666.35             −2621.94             −2685.58             −2705.36

Notes: This table shows the maximum likelihood parameter estimates, with standard errors from the final update of the Hessian matrix in parentheses, from the initial estimation of the models on Standard and Poor's 500 index returns from July 3, 1962 to July 11, 1974 (3000 observations). LL gives the log likelihood values (constant terms included) of the models. * Indicates significance at the 5% level. ** Indicates significance at the 1% level.
In case of parameter stability (i.e. constant unconditional variance), the rolling scheme is of course less efficient than extending the estimation sample to include more and more observations. However, it is not feasible to assume away shifts in the unconditional variance during the sample period in this study, which covers more than 40 years. We can clearly see from Figure 4, which shows the 7879 rolling estimates from the NIG-ACD model, that there is
Figure 4. Rolling estimates of the variance equation parameters a, b and c, July 12, 1974 to September 20, 2005.
considerable variation over time, especially in the variance intercept parameter c. Shifts in the unconditional variance will lead to spurious long memory in the GARCH variance equation, pushing the persistence towards one (see, e.g. Mikosch and Starica, 2004).

4.3. Estimation results

For the GARCH(1,1) model with Gaussian innovations, the conditional probability of the October 19, 1987 return of −21.7% (or lower) is 1.16 × 10⁻³², and for the NIG-ACD model, which has the highest probability, 0.0003. In practice, this means that none of the models can generate returns of this magnitude (possibly with the exception of the NIG-ACD model), and an additive dummy is used in the variance equations of all the models to capture the effect of the crash. The NIG-ACD model is allowed to have different asymmetry and steepness parameters for each observation. This is illustrated by plotting the shape of the distribution for each day in the NIG shape triangle. We can see from Figure 5 that the distribution is close to symmetric on most days, although there is a slight negative skewness; both the mean and the median of the asymmetry parameter χ_t are −0.007. The steepness is generally clustered around 0.25, but a few observations show considerably higher steepness. The daily shapes of the distribution can be compared to the normal distribution, which corresponds to the region near χ_t = 0, ξ_t = 0, and to the Cauchy distribution, which is the limiting case near χ_t = 0, ξ_t = 1. The parameter estimation results in Table 2 show that the shape parameter ᾱ in both the GARCH-NIG and the NIG-S&ARCH models implies thicker tails than the normal distribution, with parameter estimates of 4.08 and 6.36, implying conditional kurtoses of 3.74 and 3.48, respectively. Since all the models have a variance equation equal to or nested by the APARCH
Figure 5. Asymmetry χ_t = ρ_t ξ_t and steepness ξ_t = (1 + γ̄_t)^{−1/2} for the NIG-ACD model.
model, we can use the weak stationarity condition aE[(|η_t| − τη_t)^υ] + b < 1 from Giot and Laurent (2003). All models are found to be covariance stationary. Ling and McAleer (2002) show that the mυth moment E[|ε|^{mυ}], with υ as in (3.3), exists if and only if E[(a(|η_{t−1}| − τη_{t−1})^υ + b)^m] < 1 for the APARCH model. Numerical evaluation reveals that the first four unconditional moments exist for the NIG-S&ARCH model. For the NIG-GARCH model, which has τ = 0 and υ = 2, the condition reduces to b² + 2ab + a²(3 + 3/ᾱ) = 1.039 > 1; so, the unconditional kurtosis does not exist (is not finite). Further, using the result in Meitz and Saikkonen (2008) that the APARCH specification is strictly stationary if E[ln(a(|η_{t−1}| − τη_{t−1})^υ + b)] < 0, we find that the NIG-S&ARCH model is also strictly stationary. For the NIG-S&ARCH, ACD and NIG-ACD models the unconditional kurtosis is found, by Monte Carlo simulation, to be 7.71, 16.48 and 9.53, respectively, compared with the sample kurtosis of 6.37. The unconditional kurtosis of the NIG-ACD and ACD models must be treated with caution, since these models are outside the family of models considered in Ling and McAleer (2002), as well as in Meitz and Saikkonen (2008). The reason is that the models' standardized residuals η_t = ε_t/δ_t are uncorrelated but not iid. No sufficient conditions for strict stationarity or existence of unconditional moments yet exist for these models. The significant β̄ estimate of −0.388 for the NIG-S&ARCH indicates that setting β to zero is too restrictive. The λ_1 parameter of the NIG-ACD model shows that γ̄_t, which is a one-to-one transform of the steepness ξ_t = (1 + γ̄_t)^{−1/2} of the distribution, is significantly negatively related to previous periods' squared return innovations, meaning that a large squared return will give higher future kurtosis as well as variance; further, the λ_2 parameter of 0.7085 shows that there is much persistence in the steepness. The dependence in kurtosis is an additional risk factor that cannot be captured by models that assume constant higher moments. The positive and significant θ_1 parameter means that a negative return today will give lower (more negative) skewness tomorrow.
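The stationarity and moment conditions quoted above are easy to check numerically for given parameter values; the sketch below does so by simulation with standard normal rescaled innovations as a stand-in (the NIG innovations of the paper would replace the sampler), so the numbers it produces are only illustrative.

    import numpy as np

    def aparch_conditions(a, b, tau, upsilon, m=4, n_draws=1_000_000, seed=1):
        """Monte Carlo checks for the APARCH variance equation:
        weak stationarity   E[a(|eta| - tau*eta)^ups + b] < 1       (Giot and Laurent, 2003)
        m*ups-th moment     E[(a(|eta| - tau*eta)^ups + b)^m] < 1   (Ling and McAleer, 2002)
        strict stationarity E[log(a(|eta| - tau*eta)^ups + b)] < 0  (Meitz and Saikkonen, 2008)."""
        rng = np.random.default_rng(seed)
        eta = rng.standard_normal(n_draws)      # placeholder innovation distribution
        core = a * (np.abs(eta) - tau * eta) ** upsilon + b
        return {"weak": core.mean(), "moment_m": (core ** m).mean(), "strict": np.log(core).mean()}

    print(aparch_conditions(a=0.07, b=0.93, tau=0.9, upsilon=1.0))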
The NIG-S&ARCH model is obtained from the NIG-ACD model by setting λ_1 = λ_2 = θ_1 = 0; an LR test of the restriction gives a test statistic of 160.98, which is distributed χ²(3). The restriction is rejected with a P value less than 0.001, and hence the new model is clearly favored by the data.

4.4. Model diagnostics

As a diagnostic measure for the models the Q-test of Hong and Li (2005) is used. One of the advantages of this test is that both the in-sample fit and the density forecasts produced by the models can be evaluated with the same test. The test can be seen as a formalization of visually inspecting QQ plots and autocorrelation functions for different powers of the generalized residuals (given by the probability integral transform, PIT). It is of interest to assess the out-of-sample performance of the models, especially for the NIG-ACD model, which is similar in number of parameters to Hansen's (1994) conditional autoregressive density model. For example, Guermat and Harris (2002, p. 410) express concern that 'in Hansen's interest rate model, there are 15 estimated parameters and hence its effectiveness for out-of-sample forecasting is likely to be limited'. Therefore the estimated densities (the densities from t = 0 to 2999 given t = 2999 information), as well as the predicted densities (the densities from t = 3000 to 11,877 given t − 1 information), will be evaluated for all models. The Hong and Li (2005) Q-test is based on the result that the PIT of the correct density will be iid uniform on the unit interval. That is, ẑ_t = ∫_{−∞}^{r_t} p̂_t(u) du ∼ iid U(0, 1) if the probability density function (pdf) of the model (p̂_t) is equal to the true pdf of the returns. Using the PIT to evaluate density forecasts has previously been proposed by Diebold, Gunther, and Tay (1998) as well as by Kim, Shephard, and Chib (1998). Hong and Li (2005) build on these ideas and propose an omnibus test to jointly test for both the uniformity and the independence of {ẑ_t}_{t=1}^T by comparing a kernel estimate of the joint distribution of {ẑ_t, ẑ_{t−j}}_{t=j+1}^T with a bivariate uniform distribution. The test is given by
Q̂(j) = [(T − j) h M̂(j) − A_h^0] / V_0^{1/2},    (4.1)

which is asymptotically standard normal under the null hypothesis of a correctly specified density. The bandwidth parameter, h, is chosen to equal the sample standard deviation of the PIT series times the number of observations (T) raised to the power of −1/6, in accordance with Hong and Li (2005). M̂(j) is a kernel estimate of the distance between the estimated distribution and the product of two U(0,1) densities, A_h^0 is a centring constant and V_0 is an expression for the variance of the test statistic; see Hong and Li (2005) for further details. Since negative values of Q̂(j) can only occur under the null in large samples, one-sided critical values are used.⁴ Figure 6 shows the Q-statistics for the estimated densities produced by the models for values of j (lag lengths) ranging from 1 to 20. The solid line is the 5% one-sided critical value (1.64) for rejecting the null of the 'estimated' density being the true density. The GARCH-n, ACD, GARCH-NIG and NIG-S&ARCH models can clearly be rejected, whereas the NIG-ACD model has Q-statistics very close to the 5% critical value of 1.64. For the GARCH-n, ACD and GARCH-NIG models, the test also clearly rejects for j > 2. This is indicative of poor modelling of both dynamic and static properties. The NIG-S&ARCH performs much better for

⁴ The test is implemented by numerical integration in Gauss using computer code generously provided by Professor Yongmiao Hong, at Cornell University.
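The ingredients of the test that are fully specified above, the PIT series and the bandwidth rule, translate directly into the sketch below; model_cdf is a placeholder for the fitted conditional cdf, and the kernel estimate M̂(j) with its centring and variance constants is not reproduced here.

    import numpy as np

    def pit_series(returns, model_cdf):
        """z_t = F_t(r_t); under a correctly specified density the z_t are iid U(0,1).
        model_cdf(t, r) is assumed to return the model's conditional cdf for day t at r."""
        return np.array([model_cdf(t, r) for t, r in enumerate(returns)])

    def hong_li_bandwidth(z):
        """Bandwidth used in (4.1): sample standard deviation of the PITs times T^(-1/6)."""
        return z.std(ddof=1) * len(z) ** (-1.0 / 6.0)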
Figure 6. In-sample Q-tests (Q-statistics by lag, j = 1, . . . , 20, for the NIG-ACD, ACD, NIG-S&ARCH, GARCH-NIG and GARCH-n models, together with the 5% critical value).

Figure 7. Out-of-sample Q-tests (Q-statistics by lag, j = 1, . . . , 20, for the same models and critical value).
j > 1, indicating that only (unaccounted for) one-day dependence is the cause of rejection. The reason for this interpretation is the joint testing of uniformity and dependence. If the test statistic drops below the rejection level after a certain value of j, then the static properties (the marginal distribution) of the returns must be well modelled, and the high Q-statistic for lower values of j must be due to misspecification of the dynamics, since poor modelling of the static properties would show up for all values of j. The out-of-sample density forecast ability shown in Figure 7 is rather similar for the models based on the NIG distribution. The solid line is the 5% one-sided critical value (1.64) for rejecting the null of the 'predicted' density being the true density. The ACD and GARCH-n models clearly perform the worst, with Q(1) statistics of 18.08 and 23.93. The NIG-ACD is still the best model, but it can be rejected with a Q(1) statistic of 5.68. Actually, the NIG-ACD model is the only model that fares worse out-of-sample than in-sample, which hints at possible overfitting and/or difficulty in estimating the dynamics of the higher moments.
5. VALUE AT RISK

VaR is the maximum loss expected to be incurred over a certain time period (h) with a given probability α. Equivalently, it can be stated that the loss will be less than VaR(α, h) dollars (1 − α) × 100% of the time. Statistically, VaR_t(α, h) = F_{t+h}^{−1}(α)|Ω_t, where F_{t+h}^{−1} is the h-step conditional forecast of the inverse cumulative distribution function (cdf) of the return r_t = log(P_t/P_{t−1}), and Ω_t is the information set up to and including time t. With the ongoing adoption of Basel II, which allows banks to use internal VaR models for the purpose of regulating capital requirements, there is much interest in the measure. For a survey see, for example, Duffie and Pan (1997) or the textbook treatment in Jorion (2000).

5.1. Backtesting VaR models

The Basel committee on banking supervision states in their 2004 'International convergence of capital measurement and capital standards' (page 39) that 'Internal models will only be accepted when a bank can prove the quality of its model to the supervisor through the backtesting of its output using one year of historical data'. The exact method for backtesting is not prescribed. However, the number of exceptions, i.e. the number of occasions when the actual loss is larger than predicted by the VaR model, is used to determine a multiplier that directly affects the capital requirements. Following the terminology of Christoffersen (1998), the Basel committee is only concerned with the unconditional coverage of the models. However, a model might have the correct average coverage even though it is misspecified at a given point in time. Christoffersen (1998) derives a test for correct conditional coverage that will be presented below. Define the indicator variable I_t, with t being a time subscript, according to

I_t = 1 if r_t > F_t^{−1}(α)|Ω_{t−1}, and I_t = 0 otherwise,    (5.1)

where F_t^{−1}(α)|Ω_{t−1} is the conditional VaR forecast (the inverse of the cdf evaluated at α) from the particular model being evaluated. Testing whether the number of exceptions is correct is called testing for correct unconditional coverage. If we have correct unconditional coverage, a fraction α of the returns will be lower than the VaR forecast, so under the null we will have

E[(1/T) ∑_{t=1}^T I_t] = 1 − α.    (5.2)

The test for correct conditional coverage can be divided into two separate parts: one part tests for correct unconditional coverage and one part tests for independence in the sequence of exceptions. This is very useful, since it can then be investigated whether model rejection is due to unconditional coverage failure, clustering of the exceptions or both. The null hypothesis of correct unconditional coverage gives that I_t ∼ Bern(1 − α), which can be tested by a likelihood ratio test of the form

LR_UC = 2[log(π̂_1^{T_1}(1 − π̂_1)^{T−T_1}) − log((1 − α)^{T_1} α^{T−T_1})].    (5.3)

The number of observations is given by T, the number of ones is given by T_1 and π̂_1 = T_1/T.
To see if the exceptions tend to cluster together over time, Christoffersen (1998) suggests testing for independence with first-order Markov dependence as the alternative. The test statistic is given by

LR_IND = 2[log((1 − π̂_01)^{T_0−T_01} π̂_01^{T_01} (1 − π̂_11)^{T_1−T_11} π̂_11^{T_11}) − log(π̂_1^{T_1}(1 − π̂_1)^{T−T_1})].    (5.4)

T_ij is the number of observations valued i followed by an observation valued j. The maximum likelihood estimates are simply π̂_01 = T_01/T_0 and π̂_11 = T_11/T_1. The joint null of correct conditional coverage means that I_t ∼ iid Bern(1 − α) for all t. The test statistic is simply given as the sum of the two individual tests in equations (5.3) and (5.4):

LR_CC = LR_UC + LR_IND.    (5.5)
Christoffersen (1998) uses the asymptotic distribution results for the test statistics in equations (5.3)–(5.5). However, I will follow the recommendation in Christoffersen and Pelletier (2004) and simulate the distribution of the test statistics, since the effective sample sizes (the expected number of exceptions) are rather small in typical VaR settings.
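A compact sketch of the statistics (5.3)-(5.5) and of the simulated P values follows; the hit indicator is defined as in (5.1), i.e. equal to one when the return stays above the VaR forecast, and the restricted likelihood of the independence test is evaluated over the T − 1 transition pairs, a common implementation choice rather than a transcription of the paper's exact formula.

    import numpy as np

    def _bin_ll(k, n, p):
        """Bernoulli log-likelihood of k ones in n trials with P(one) = p (0*log 0 := 0)."""
        ll = 0.0
        if k > 0:
            ll += k * np.log(p)
        if n - k > 0:
            ll += (n - k) * np.log(1.0 - p)
        return ll

    def lr_uc(hits, alpha):
        """Unconditional coverage statistic (5.3)."""
        T, T1 = len(hits), int(np.sum(hits))
        return 2.0 * (_bin_ll(T1, T, T1 / T) - _bin_ll(T1, T, 1.0 - alpha))

    def lr_ind(hits):
        """Independence statistic in the spirit of (5.4), first-order Markov alternative."""
        h = np.asarray(hits, dtype=int)
        n = {(i, j): int(np.sum((h[:-1] == i) & (h[1:] == j))) for i in (0, 1) for j in (0, 1)}
        n0, n1 = n[0, 0] + n[0, 1], n[1, 0] + n[1, 1]
        if n0 == 0 or n1 == 0:
            return 0.0
        la = _bin_ll(n[0, 1], n0, n[0, 1] / n0) + _bin_ll(n[1, 1], n1, n[1, 1] / n1)
        l0 = _bin_ll(n[0, 1] + n[1, 1], n0 + n1, (n[0, 1] + n[1, 1]) / (n0 + n1))
        return 2.0 * (la - l0)

    def lr_cc(hits, alpha):
        """Conditional coverage statistic (5.5)."""
        return lr_uc(hits, alpha) + lr_ind(hits)

    def simulated_pvalue(stat, stat_fn, alpha, T, n_sim=10_000, seed=0):
        """Finite-sample P value: simulate iid Bernoulli(1 - alpha) hit sequences under the null."""
        rng = np.random.default_rng(seed)
        draws = np.array([stat_fn((rng.random(T) < 1.0 - alpha).astype(int)) for _ in range(n_sim)])
        return float(np.mean(draws >= stat))

    # illustrative use for the 1% VaR level on a simulated hit series
    hits = (np.random.default_rng(1).random(7879) < 0.99).astype(int)
    pval = simulated_pvalue(lr_cc(hits, 0.01), lambda h: lr_cc(h, 0.01), alpha=0.01, T=len(hits), n_sim=2000)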
6. VALUE AT RISK RESULTS

All models, with the exception of the GARCH-n and ACD models, show exceptionally good results for unconditional coverage at the six different VaR levels 0.5%, 1%, 2%, 3%, 4% and 5%. The P values reported in Table 3 are calculated by simulating the distribution of the test statistic under the null, as outlined by Christoffersen and Pelletier (2004). A sample size equal to the empirical sample size (7879 observations) with 100,000 replications is used. For 0.5% VaR, the effective sample size is very small, and there are large deviations between the simulated and the asymptotic (unreported) P values. For the 1% level and higher, the simulated and asymptotic P values are very similar. Since the Basel regulations only evaluate models based on the 1% VaR level, special emphasis will be given to the models' performance at this level. The empirical percentage of exceptions is 0.96% for the NIG-ACD model and 0.89% for the NIG-S&ARCH model. These numbers can clearly not be rejected as being different from 1%, with P values from the LR_UC test of 0.77 and 0.33, respectively. The unconditional coverage of the GARCH-NIG model at the 1% level is almost perfect, with 80 exceptions compared with the correct value of 78.79, whereas the GARCH-n clearly underestimates the VaR, with an empirical percentage of exceptions of 1.52%, which can be rejected as providing correct unconditional coverage with a P value of less than 0.01. The ACD model has an empirical percentage of exceptions of 1.31% for the 1% VaR, which is significantly higher than 1% with a P value of 0.01. The BASEL (1996) 'Supervisory Framework for the use of "Backtesting" in conjunction with the internal models approach to market risk capital requirements' is only concerned with unconditional coverage and divides models into three zones (green, yellow and red) depending on the number of violations. A model is in the green zone if a 95% confidence interval around the correct number of exceptions covers the realized number of exceptions, in the yellow zone if a 99.99% confidence interval covers it and, otherwise, in the red zone. The green zone has no additional capital requirements, the yellow zone can lead to additional capital
Table 3. Value at Risk results.

Panel A: Percentage of returns below VaR
VaR level      0.5%      1%        2%        3%        4%        5%        Basel
NIG-ACD        0.61%     0.96%     1.92%     2.96%     4.01%     4.91%     Green
ACD            0.80%     1.31%     2.39%     3.25%     4.11%     5.06%     Yellow
NIG-S&ARCH     0.56%     0.89%     2.02%     2.92%     3.82%     4.67%     Green
GARCH-NIG      0.55%     1.02%     2.15%     3.19%     3.95%     4.89%     Green
GARCH-n        0.93%     1.52%     2.54%     3.27%     4.00%     4.66%     Red

Panel B: LR statistics and simulated P values from the unconditional test for coverage
VaR level      0.5%            1%              2%              3%              4%              5%
NIG-ACD        1.77 (0.17)     0.10 (0.77)     0.28 (0.60)     0.05 (0.81)     <0.01 (>0.99)   0.13 (0.71)
ACD            12.02 (<0.01)   6.85 (0.01)     5.65 (0.02)     1.64 (0.19)     0.26 (0.60)     0.07 (0.77)
NIG-S&ARCH     0.52 (0.43)     1.02 (0.33)     0.01 (0.90)     0.18 (0.69)     0.67 (0.42)     1.83 (0.18)
GARCH-NIG      0.32 (0.52)     0.52 (0.86)     0.83 (0.35)     0.92 (0.31)     0.06 (0.82)     0.21 (0.66)
GARCH-n        23.0 (<0.01)    18.8 (<0.01)    10.8 (<0.01)    1.99 (0.15)     <0.01 (>0.99)   1.98 (0.16)

Panel C: LR statistics and simulated P values from the independence test
VaR level      0.5%            1%              2%              3%              4%              5%
NIG-ACD        1.06 (0.16)     4.06 (0.02)     2.67 (0.11)     2.22 (0.14)     3.00 (0.09)     2.58 (0.11)
ACD            0.39 (0.55)     16.01 (<0.01)   25.05 (<0.01)   19.22 (<0.01)   21.06 (<0.01)   20.49 (<0.01)
NIG-S&ARCH     1.32 (0.13)     0.20 (0.69)     12.3 (<0.01)    12.0 (<0.01)    9.82 (<0.01)    11.5 (<0.01)
GARCH-NIG      1.40 (0.12)     0.04 (0.83)     10.34 (<0.01)   10.05 (<0.01)   12.88 (<0.01)   15.55 (<0.01)
GARCH-n        0.14 (>0.99)    6.22 (0.01)     13.7 (<0.01)    8.88 (<0.01)    13.8 (<0.01)    14.6 (<0.01)
requirements according to the judgment of the supervisor, and the red zone, in addition to higher capital requirements, requires that measures to improve the model should be taken immediately. For the number of observations in this study, models that have exceptions below 1.18% will be in the green zone, 1.18–1.42% in the yellow zone and exceptions higher than 1.42% will be in the red zone. 5 The three models with NIG errors are clearly in the green zone, the ACD model is
⁵ The limits for the zones are given by 0.01 + 1.6449 √(0.01 · 0.99/7879) = 1.18% and 0.01 + 3.7190 √(0.01 · 0.99/7879) = 1.42%.
Table 3. Continued.

Panel D: LR statistics and simulated P values from the correct conditional coverage test
VaR level      0.5%            1%              2%              3%              4%              5%
NIG-ACD        2.83 (0.15)     4.16 (0.10)     2.95 (0.24)     2.27 (0.33)     3.01 (0.22)     2.71 (0.25)
ACD            12.41 (<0.01)   22.86 (<0.01)   30.7 (<0.01)    20.86 (<0.01)   21.32 (<0.01)   20.56 (<0.01)
NIG-S&ARCH     1.85 (0.29)     1.22 (0.75)     12.3 (<0.01)    12.2 (<0.01)    10.5 (0.01)     13.3 (<0.01)
GARCH-NIG      1.72 (0.35)     0.06 (0.97)     11.2 (<0.01)    11.0 (<0.01)    12.9 (<0.01)    15.7 (<0.01)
GARCH-n        23.4 (<0.01)    25.0 (<0.01)    24.5 (<0.01)    10.9 (<0.01)    13.8 (<0.01)    16.6 (<0.01)

Notes: This table reports the percentage of exceptions from the 7879 daily VaR forecasts calculated from July 12, 1974 to September 20, 2005. Also reported are the test statistics (simulated P values in parentheses) for the unconditional, independence and joint tests for the null of correct conditional coverage. Basel is the model adequacy as determined by the three zones in the Basel (1996) regulations, which are only relevant for the 1% VaR (see main text for details).
in the yellow zone, and the GARCH-n model is in the red zone and hence would not be accepted by the supervision agency. As can be seen from Panel B in Table 3, the unconditional coverage for 3%, 4% and 5% VaR is satisfactory for all the models, showing that the failures of the GARCH-n and ACD models are only apparent in the utmost parts of the tail. The three models based on the NIG distribution in this study compare favorably, in terms of unconditional coverage, to the mixture GARCH models with a stable Paretian error distribution proposed and evaluated by Haas et al. (2005) on a German DAX-30 sample of comparable size.⁶ In terms of independence (Panel C of Table 3), which ignores the unconditional coverage and only tests whether the exceptions tend to cluster together, the models do not perform as well. Independence can be rejected at the 1% confidence level for four different VaR levels for both the GARCH-NIG and the NIG-S&ARCH models; in the case of the GARCH-n and ACD models, independence can be rejected for five of the six VaR levels. The NIG-ACD model is the only model where independence of the exceptions cannot be rejected at the 1% confidence level for any of the VaR levels, but it can be rejected at the 5% level for 1% VaR. The NIG-ACD and NIG-S&ARCH models are identical apart from the time variation that is allowed in the skewness and kurtosis in the NIG-ACD model. This leads us to conclude that modelling dynamics in skewness and kurtosis is important to avoid clustering of the VaR exceptions, since the NIG-ACD model performs much better in terms of independence than the NIG-S&ARCH model. However, not all models with time-varying conditional skewness and kurtosis lead to improvements, as is evident from the poor performance of the ACD model. Panel D in Table 3 shows the results of the joint test (LR_CC) for a correct number of exceptions that are also independent. The NIG-ACD model is the only model that is not rejected by the joint test, LR_CC, for providing correct conditional coverage. The other two NIG based models are

⁶ Conditional coverage is not investigated in Haas et al. (2005), limiting the possible comparison to unconditional coverage.
rejected at four of the six VaR levels, and the GARCH-n and ACD models are rejected at all six VaR levels. These results can be compared with the recent study by Kuester, Mittnik and Paolella (2006), who evaluate the performance of 18 different conditional and 5 unconditional VaR models. Their study includes GARCH models with different error distributions and the CAViaR model of Engle and Manganelli (2004). Models are evaluated on 6681 one-step-ahead VaR forecasts for the NASDAQ index. A direct comparison between the results is of course only indicative, since different samples are used, but the NIG-ACD model in this study beats all the 23 models considered in Kuester et al. (2006) when comparing the LR_CC statistic for 5% VaR.
7. CONCLUSIONS

This paper proposes a new model based on the NIG distribution for the modelling of time-varying variance, skewness and kurtosis. The NIG distribution is very flexible in that it can fit the skewness and excess kurtosis of the data. Furthermore, it has attractive analytical properties, such as being closed under convolution and having a closed form density. In addition, the distribution arises naturally from Clark's (1973) mixture of distributions theory when the mixing variable is assumed to follow an inverse Gaussian distribution. The new NIG-ACD model, as well as the NIG-S&ARCH model of Jensen and Lunde (2001), the GARCH-NIG model of Forsberg and Bollerslev (2002) and the ACD model of Hansen (1994), are applied to VaR forecasting on the Standard and Poor's 500 from July 12, 1974 to September 20, 2005. The models perform very well compared with extant models evaluated in, for example, Haas et al. (2005) and Kuester et al. (2006) on the German DAX and NASDAQ indices, respectively. The NIG-ACD model proposed in this paper is the only model that cannot be rejected as providing a correct number of independent VaR exceptions for all six VaR levels evaluated. The GARCH-n and ACD models both significantly underestimate the 0.5% and 1% VaR and also fail to produce correct conditional coverage. As judged by the rules set out in BASEL (1996), the three NIG based models are in the green zone, having no additional capital requirements, whereas the GARCH-n model is in the red zone, meaning that measures to improve the model must be taken. The failure of the GARCH-n model is economically important, since this model is a somewhat more general case of the model used in JPMorgan's RiskMetrics, which has become an industry standard for VaR calculations. Since the NIG-ACD model gave a better in-sample fit and performed better on out-of-sample VaR and density forecasts than the NIG-S&ARCH model, it appears that there are gains from modelling not only the conditional variance but also the conditional skewness and kurtosis as time-varying. This is most apparent when evaluating the independence of the VaR exceptions, where the NIG-S&ARCH and GARCH-NIG models fail.
ACKNOWLEDGMENTS

Financial support from 'Bankforskningsinstitutet' and the Research and Training Network 'Microstructure of Financial Markets in Europe' is gratefully acknowledged. The author would like to thank two anonymous referees, Ole Barndorff-Nielsen, Asger Lunde, Markku Lanne, Hossein Asgharian, Peter Nyberg, Anette Björkman, participants at the Centre for Analytical Finance (CAF) members meeting 2006 and the 'Symposium för anvendt statistik' Copenhagen 2006, and seminar participants at Lund University, for helpful comments and suggestions. Part of this research was completed during a one-year visit at the department of marketing and statistics at the Aarhus School of Business.
REFERENCES Abel, A. (1988). Stock prices under time-varying dividend risk: an exact solution in an infinite horizon general equilibrium model. Journal of Monetary Economics 22, 375–95. Abramowitz, M. and I. Stegun (1965). Handbook of Mathematical Functions. New York: Dover. Andersson, J. (2001). On the normal inverse Gaussian stochastic volatility model. Journal of Business and Economic Statistics 19, 44–54. Barndorff-Nielsen, O. E. (1977). Exponentially decreasing distributions for the logarithm of particle size. Proceedings of the Royal Society of London. Series A. Mathematical and Physical Sciences 353, 401–19. Barndorff-Nielsen, O. E. (1978). Hyperbolic distributions and distributions on hyperbolae. Scandinavian Journal of Statistics 5, 151–7. Barndorff-Nielsen, O. E. (1995). Normal inverse Gaussian process and the modelling of stock returns. Research report 300, Department of Theoretical Statistics, University of Aarhus. Barndorff-Nielsen, O. E. (1997). Normal inverse Gaussian distributions and stochastic volatility modelling. Scandinavian Journal of Statistics 24, 1–14. Barndorff-Nielsen, O. E. and K. Prause (2001). Apparent scaling. Finance and Stochastics 5, 103–13. BASEL (1996). Supervisory framework for the use of “backtesting” in conjunction with the internal models approach to market risk capital requirements. Basel Committee publication 22, Bank for International Settlement, Basel, Switzerland. BASEL (2004). International convergence of capital measurement and captial standards. A revised framework. Updated November 2005. Basel Committee on Banking Supervision, Bank for International Settlement, Basel, Switzerland. Bauer, C. (2000). Value at risk using hyperbolic distributions. Journal of Economics and Business 52, 455– 67. Black, F. (1976). Studies of stock price volatility changes. Proceedings of the American Statistical Association Business and Economic Statistics Section, 177–81. Bollerslev, T. (1986). Generalised autoregressive conditional heteroskedasticity. Journal of Econometrics 51, 307–27. Bollerslev, T. (1987). A conditional heteroskedastic time series model for speculative prices and rates of return. Review of Economics and Statistics 69, 542–7. Bollerslev, T., R. F. Engle and D. B. Nelson (1994). ARCH models. In R. F. Engle and D. McFadden (Eds.). The Handbook of Econometrics, Volume 4, 2959–3038. Amsterdam: North-Holland. Br¨ann¨as, K. and N. Nordman (2003a). An alternative conditional asymmetry specification for stock returns. Applied Financial Economics 13, 537–41. Br¨ann¨as, K. and N. Nordman (2003b). Conditional modeling for stock returns. Applied Economics Letters 10, 725–8. Christoffersen, P. (1998). Evaluating interval forecasts. International Economic Review 39, 841–62. Christoffersen, P., S. Heston and K. Jacobs (2006). Option valuation with conditional skewness. Journal of Econometrics 131, 253–84. C The Author(s). Journal compilation C Royal Economic Society 2009.
Christoffersen, P. and D. Pelletier (2004). Backtesting Value-at-Risk: a duration-based approach. Journal of Financial Econometrics 2, 84–108. Clark, P. K. (1973). A subordinated stochastic process model with fixed variance for speculative prices. Econometrica 41, 135–56. Diebold, F. X., T. A. Gunther and T. S. Tay (1998). Evaluating density forecasts with applications to financial risk management. International Economic Review 39, 863–83. Ding, Z., C. W. J. Granger and R. F. Engle (1993). A long memory property of stock market returns and a new model. Journal of Empirical Finance 1, 83–106. Dittmar, R. (2002). Nonlinear pricing kernels, kurtosis preference, and the cross-section of equity returns. Journal of Finance 57, 369–403. Duffie, D. and J. Pan (1997). An overview of value at risk. The Journal of Derivatives 4, 7–49. Eberlein, E. and U. Keller (1995). Hyperbolic distributions in finance. Bernoulli 1, 281–99. Eberlein, E., U. Keller and K. Prause (1998). New insights into smile, mispricing and value at risk: the hyperbolic model. Journal of Business 71, 371–405. Engle, R. and S. Manganelli (2004). CAViaR: conditional autoregressive value at risk by regression quantiles. Journal of Business and Economic Statistics 22, 367–81. Engle, R. F. (1982). Autoregressive conditional heteroskedasticity with estimates of the variance of the United Kingdom inflation. Econometrica 50, 987–1007. Engle, R. F. (1995). ARCH: Selected Readings. Oxford: Oxford University Press. Fama, E. F. (1965). The behaviour of stock market prices. Journal of Business 38, 34–105. Forsberg, L. (2002). On the normal inverse Gaussian distribution in modeling volatilty in the financial markets. Ph. D. thesis, Uppsala University. Forsberg, L. and T. Bollerslev (2002). Bridging the gap between the distribution of realized (ecu) volatility and arch modeling (of the euro): the GARCH-NIG model. Journal of Applied Econometrics 17, 535– 48. Giot, P. and S. Laurent (2003). Value-at-Risk for long and short trading positions. Journal of Applied Econometrics 18, 641–64. Glosten, L. R., R. Jagannathan and D. Runkle (1993). Relationship between the expected value and the volatility of the nominal excess return on stocks. Journal of Finance 48, 1779–802. Guermat, C. and R. Harris (2002). Forecasting Value-at-Risk allowing for time variation in the variance and kurtosis of portfolio returns. International Journal of Forecasting 18, 409–19. Haas, M., S. Mittnik, M. Paolella and S. Steude (2005). Stable mixture GARCH models. Working paper, NCCR FINRISK No. 257, University of Zurich. Hansen, B. E. (1994). Autoregressive conditional density estimation. International Economic Review 35, 705–30. Hansen, P. R. and A. Lunde (2005). A forecast comparison of volatility models: does anything beat a GARCH(1,1)? Journal of Applied Econometrics 20, 873–89. Harvey, C. and A. Siddique (1999). Autoregressive conditional skewness. Journal of Financial and Quantitative Analysis 34, 465–88. Harvey, C. and A. Siddique (2000). Conditonal skewness in asset pricing tests. Journal of Finance 55, 1263–95. Hong, Y. and H. Li (2005). Nonparametric specification testing for continuous-time models with applications to term structure of interest rates. Review of Financial Studies 18, 37–84. Jarque, C. and A. Bera (1987). A test for normality of observations and regression residuals. International Statistical Review 55, 163–72. Jensen, M. B. and A. Lunde (2001). 
The NIG-S&ARCH model: a fat tailed, stochastic, and autoregressive conditional heteroskedastic volatility model. Econometrics Journal 4, 319–42.
Jondeau, E. and M. Rockinger (2003). Conditional volatility, skewness, and kurtosis: existence, persistence, and comovements. Journal of Economic Dynamics and Control 27, 1699–737. Jorion, P. (2000). Value at Risk. The New Benchmark for Managing Financial Risk. New York: McGrawHill. Kim, S., N. Shephard and S. Chib (1998). Stochastic volatility: likelihood inference and comparison with ARCH models. Review of Economic Studies 65, 361–93. Kuester, K., S. Mittnik and M. Paolella (2006). Value-at-Risk prediction: a comparison of alternative strategies. Journal of Financial Econometrics 4, 53–89. Lanne, M. and P. Saikkonen (2007). Modeling conditional skewness in stock returns. European Journal of Finance 13, 691–704. Ling, S. and M. McAleer (2002). Stationarity and the existence of moments of a family of GARCH processes. Journal of Econometrics 106, 109–17. Mandelbrot, B. (1963). The variation of certain speculative prices. Journal of Business 36, 394–419. Meitz, M. and P. Saikkonen (2008). Ergodicity, mixing, and existence of moments of a class of markov models with applications to GARCH and ACD models. Econometric Theory 24, 1291–320. Mikosch, T. and C. Starica (2004). Non-stationarities in financial time series, the long range dependence and the IGARCH effects. Review of Economics and Statistics 86, 378–90. Mittnik, S. and M. Paolella (2003). Prediction of financial downside risk with heavy tailed conditional distributions. Working paper, Johann Wolfgang Goethe-Universit¨at. Nelson, D. B. (1991). Conditional heteroskedasticity in asset pricing: a new approach. Econometrica 59, 347–370. ˜ Niguez, T. and J. Perote (2004). Forecasting the density of asset returns. Working paper, London School of Economics and Political Science. Taylor, S. J. (1986). Modelling Financial Time Series. Chichester: John Wiley. Venter, J. and P. de Jongh (2002). Risk estimation using the normal inverse Gaussian distribution. Journal of Risk 6, 27–52. Venter, J. and P. de Jongh (2004). Selecting an innovation distribution for GARCH models to improve efficiency of risk and volatility estimation. Journal of Risk 4, 1–23.
APPENDIX A: MOMENTS OF THE NIG DISTRIBUTION

Jensen and Lunde (2001) give the first four moments for a stochastic variable X distributed NIG(ᾱ_t, β̄_t, μ, δ_t). The first four conditional moments for the return r_t, which is distributed NIG(ᾱ_t, β̄_t, μ, δ_t γ̄_t^{3/2}/ᾱ_t), are then easily computed and given below:

E(r_t) = δ_t γ̄_t^{1/2} ρ_t + μ    (A.1)

Var(r_t) = δ_t²    (A.2)

Skew(r_t) = 3 ρ_t / (ᾱ_t^{1/2} (1 − ρ_t²)^{1/4})    (A.3)

Kurt(r_t) = 3 [1 + (4ρ_t² + 1) / (ᾱ_t √(1 − ρ_t²))]    (A.4)
with ρ_t = β_t/α_t = β̄_t/ᾱ_t. The moments of the NIG-ACD model are obtained by noticing that ρ̃_t = log((1 + ρ_t)/(1 − ρ_t)) ⇔ ρ_t = (exp(ρ̃_t) − 1)/(exp(ρ̃_t) + 1), β̄_t = ρ_t γ̄_t (1 − ρ_t²)^{−1/2} and ᾱ_t = (β̄_t² + γ̄_t²)^{1/2}, with γ̄_t and ρ̃_t given in (3.4) and (3.5), respectively.
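The moment formulas (A.1)-(A.4) and the mapping back from (γ̄_t, ρ̃_t) to (ᾱ_t, β̄_t) translate directly into code; the sketch below simply evaluates them for given parameter values.

    import numpy as np

    def nig_acd_parameters(gammabar, rhotilde):
        """Map (gammabar_t, rhotilde_t) from (3.4)-(3.5) to (alphabar_t, betabar_t, rho_t)."""
        rho = (np.exp(rhotilde) - 1.0) / (np.exp(rhotilde) + 1.0)
        betabar = rho * gammabar / np.sqrt(1.0 - rho ** 2)
        alphabar = np.sqrt(betabar ** 2 + gammabar ** 2)
        return alphabar, betabar, rho

    def nig_conditional_moments(delta, mu, alphabar, rho):
        """Conditional mean, variance, skewness and kurtosis of r_t, eqs. (A.1)-(A.4)."""
        gammabar = alphabar * np.sqrt(1.0 - rho ** 2)
        mean = delta * np.sqrt(gammabar) * rho + mu
        variance = delta ** 2
        skewness = 3.0 * rho / (np.sqrt(alphabar) * (1.0 - rho ** 2) ** 0.25)
        kurtosis = 3.0 * (1.0 + (4.0 * rho ** 2 + 1.0) / (alphabar * np.sqrt(1.0 - rho ** 2)))
        return mean, variance, skewness, kurtosis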
APPENDIX B: ADDITIONAL MODELS USED IN THE VAR EVALUATION

The GARCH-NIG model

Forsberg (2002) as well as Forsberg and Bollerslev (2002) present the GARCH-NIG model by the parametrization ᾱ = αδ, σ² = δ/α and by setting β = 0. This results in the density

f_FB(z; ᾱ, σ_t²) = [√ᾱ exp(ᾱ) / (π σ_t)] q(z/(σ_t √ᾱ))^{−1} K_1(ᾱ q(z/(σ_t √ᾱ)))    (B.1)
for the zero-mean variable z, where q(x) = √(1 + x²). This parametrization sets the second central moment equal to σ², making it straightforward to incorporate and evaluate temporal dependence. The model is given by the mean equation

r_t = μ + σ_t η_t    (B.2)

and variance equation

σ_t² = c + a ε²_{t−1} + b σ²_{t−1}    (B.3)

with ε_t = σ_t η_t, c, a, b > 0 and ε_t ∼ NIG(ᾱ, 0, 0, σ_t²). The sufficient stationarity condition is given by a + b < 1. The σ_t² parameter, which is the mean of the IG distribution and the conditional variance of the returns, follows the variance equation from Bollerslev (1986). The GARCH-NIG model is nested by the NIG-S&ARCH model (and consequently also by the NIG-ACD model), as can be seen by setting β = 0, τ = 0, υ = 2 and δ²/ᾱ = σ².
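Density (B.1) involves only elementary functions and the modified Bessel function K_1; the sketch below is a direct transcription under the assumption q(x) = sqrt(1 + x^2). For very large values of ᾱ one would combine the exponential with a scaled Bessel function (scipy.special.kve) to avoid overflow.

    import numpy as np
    from scipy.special import kv   # modified Bessel function of the second kind

    def garch_nig_density(z, alphabar, sigma2):
        """Evaluate f_FB(z; alphabar, sigma_t^2) of eq. (B.1) for a zero-mean z."""
        sigma = np.sqrt(sigma2)
        q = np.sqrt(1.0 + (z / (sigma * np.sqrt(alphabar))) ** 2)
        return np.sqrt(alphabar) * np.exp(alphabar) / (np.pi * sigma) / q * kv(1, alphabar * q)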
The ACD model

Hansen (1994) introduces the autoregressive conditional density (ACD) model. A random variable ε_t with probability density function given by

g(ε_t | η_t, λ_t) = b c [1 + (1/(η − 2)) ((b ε_t + a)/(1 − λ))²]^{−(η+1)/2}   for ε_t < −a/b
g(ε_t | η_t, λ_t) = b c [1 + (1/(η − 2)) ((b ε_t + a)/(1 + λ))²]^{−(η+1)/2}   for ε_t ≥ −a/b    (B.4)

with a = 4λc(η − 2)/(η − 1), b² = 1 + 3λ² − a² and c = Γ((η + 1)/2) / (√(π(η − 2)) Γ(η/2)), is said to be distributed Generalized t. The conditional kurtosis and skewness are made functions of prior return innovations according to

η_t = θ_0 + θ_1 ε_{t−1} + θ_2 ε²_{t−1}    (B.5)

λ_t = λ_0 + λ_1 ε_{t−1} + λ_2 ε²_{t−1}    (B.6)
The mean equation is given by (B.2) and variance equation by (B.3).
The shape parameters for the GT distribution need to be restricted with η_t ∈ (2, ∞) and λ_t ∈ (−1, 1). This is obtained by using the logistic mapping

θ_{t,restricted} = L + (U − L) / (1 + exp(−θ_t))    (B.7)
with θ t = {η t , λ t }. U and L are the upper and the lower bounds for the restricted parameter, respectively. The skewness parameter, λ t , is restricted to (−0.9,0.9) and the kurtosis parameter, η t , is restricted to (2.1,30), the same values as Hansen (1994) used when introducing the distribution.
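A sketch of the generalized-t density (B.4), with the constants a, b and c written out, and of the logistic restriction (B.7); the Gamma functions are evaluated on the log scale for numerical stability.

    import numpy as np
    from scipy.special import gammaln

    def hansen_skewt_pdf(z, eta, lam):
        """Hansen's (1994) generalized (skewed) t density, eq. (B.4), for eta > 2 and |lam| < 1."""
        c = np.exp(gammaln((eta + 1.0) / 2.0) - gammaln(eta / 2.0)) / np.sqrt(np.pi * (eta - 2.0))
        a = 4.0 * lam * c * (eta - 2.0) / (eta - 1.0)
        b = np.sqrt(1.0 + 3.0 * lam ** 2 - a ** 2)
        s = np.where(z < -a / b, 1.0 - lam, 1.0 + lam)          # left/right branch of (B.4)
        return b * c * (1.0 + ((b * z + a) / s) ** 2 / (eta - 2.0)) ** (-(eta + 1.0) / 2.0)

    def logistic_restrict(theta, lower, upper):
        """Logistic mapping (B.7) keeping a dynamic shape parameter inside (lower, upper)."""
        return lower + (upper - lower) / (1.0 + np.exp(-theta))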
The GARCH-n model

This is the GARCH model of Bollerslev (1986), with mean equation given by (B.2) and variance equation by (B.3). The error distribution is Gaussian.
The Econometrics Journal (2009), volume 12, pp. 105–126. doi: 10.1111/j.1368-423X.2008.00253.x
Bayesian estimation of a Markov-switching threshold asymmetric GARCH model with Student-t innovations

DAVID ARDIA†

†Department of Quantitative Economics, University of Fribourg, Switzerland
E-mail:
[email protected] First version received: June 2007; final version accepted: July 2008
Summary  A Bayesian estimation of a regime-switching threshold asymmetric GARCH model is proposed. The specification is based on a Markov-switching model with Student-t innovations and K separate GJR(1,1) processes whose asymmetries are located at free non-positive threshold parameters. The model aims at determining whether or not: (i) structural breaks are present within the volatility dynamics; (ii) asymmetries (leverage effects) are present, and are different between regimes and (iii) the threshold parameters (locations of bad news) are similar between regimes. A novel MCMC scheme is proposed which allows for a fully automatic Bayesian estimation of the model. The presence of two distinct volatility regimes is shown in an empirical application to the Swiss Market Index log-returns. The posterior results indicate no differences with regards to the asymmetries and their thresholds when comparing highly volatile periods with the milder ones. Comparisons with a single-regime specification indicate a better in-sample fit and a better forecasting performance for the Markov-switching model.

Keywords: Asymmetry, Bayesian, GARCH, Markov-switching, SMI, Threshold.
1. INTRODUCTION

Markov-switching GARCH models (MSGARCH) have received a lot of attention in recent years as they provide an explanation for the high persistence in volatility observed with single-regime GARCH models (see, e.g. Lamoureux and Lastrapes, 1990). Furthermore, MSGARCH models allow for a sudden change in the (unconditional) volatility level, which leads to significant improvements in volatility forecasts (see, e.g. Dueker, 1997, Klaassen, 2002, and Marcucci, 2005). In the framework of MSGARCH models, a hidden Markov sequence {s_t} with state space {1, . . . , K} allows for discrete changes in the GARCH parameters. Following the seminal work of Hamilton and Susmel (1994), different parametrizations have been proposed to account for changes in the scedastic function's parameters (see, e.g. Gray, 1996, Dueker, 1997, and Klaassen, 2002). However, these specifications lead to computational difficulties. The evaluation of the likelihood function for a sample of length T requires integration over all K^T possible paths, rendering the Maximum Likelihood (ML) estimation infeasible. Approximations are thus required to shorten the dependence on the state variable history, but these lead to difficulties in the interpretation of the variance dynamics in each regime.
In order to avoid these problems, Haas et al. (2004) hypothesize K separate GARCH(1,1) processes for the conditional variance of the MSGARCH model. The conditional variances at time t can be written in vector form as follows:

(h_t^1, . . . , h_t^K)′ = (α_0^1, . . . , α_0^K)′ + (α_1^1, . . . , α_1^K)′ y²_{t−1} + (β^1, . . . , β^K)′ ⊙ (h_{t−1}^1, . . . , h_{t−1}^K)′,    (1.1)
where ⊙ denotes element-by-element multiplication. The MSGARCH process {y_t} is then simply obtained by setting y_t = ε_t (h_t^{s_t})^{1/2}, where {ε_t} is i.i.d. standard Normal distributed. The parameters α_0^k, α_1^k and β^k are therefore the GARCH(1,1) parameters related to the kth state of nature. Under this specification, the conditional variance is solely a function of the past data and the current state s_t, which renders the ML estimation feasible. In addition to its appealing computational aspects, the MSGARCH model of Haas et al. (2004) has conceptual advantages. In effect, one reason for specifying Markov-switching models that allow for different GARCH behaviour in each regime is to capture the differences in the variance dynamics in low- and high-volatility periods. As pointed out by Haas et al. (2004, p. 498), a relatively large value of α_1^k and relatively low values of β^k in high-volatility regimes may indicate a tendency to over-react to news, compared to regular periods, while there is less memory in these sub-processes. This interpretation requires a parametrization of MSGARCH models that implies a clear association between the GARCH parameters within regime k, i.e. α_0^k, α_1^k and β^k, and the corresponding {h_t^k} process. Specification (1.1) allows for that clear-cut interpretation of the variance dynamics in each regime. This paper generalizes the model of Haas et al. (2004) in three ways. First, we extend the symmetric GARCH specification to account for asymmetric movements between the conditional variances and the underlying time series. This is achieved by replacing the K separate GARCH(1,1) processes by K separate GJR(1,1) processes. The GJR specification of Glosten et al. (1993) has proved to be effective in single-regime models for reproducing the asymmetric behaviour of the conditional variance, a phenomenon observed especially on equity markets and referred to as the leverage effect (Black, 1976).¹ One explanation of this empirical fact is that negative returns raise a firm's financial leverage, which increases its risk and therefore its equity volatility. In the context of regime-switching models, this first extension will allow us to determine whether the impact of past negative shocks on the conditional variance is different between the regimes. Second, instead of centering the asymmetric function at zero, as in the original GJR specification, we center it at a free non-positive threshold parameter. Our aim is to test whether it is the sign of the past shock (i.e. the leverage effect) or the effect of bad news (i.e. a return's value below a given non-positive level) which influences the conditional variance. In a regime-switching framework, this second extension will allow us to determine if bad news is similar between the different volatility regimes. Finally, we consider Student-t innovations instead of Normal innovations. The use of a Student-t distribution enhances the stability of the states and allows us to focus on the conditional variance's behaviour instead of capturing some outliers (Klaassen, 2002). Moreover, the Student-t distribution includes the Normal distribution as the limiting case, so we have additional flexibility in the modelling.
C The Author(s). Journal compilation C Royal Economic Society 2008.
The Bayesian estimation of MSGARCH models has several advantages over the classical approach. First, proper computational methods based on Markov chain Monte Carlo (MCMC) procedures avoid the common problem of local maxima encountered in ML estimation of this class of models (Ardia, 2008, section 7.7). Second, the exploration of the joint posterior distribution gives a complete picture of the parameter uncertainty, and this cannot be achieved via the classical approach (see, e.g. Hamilton and Susmel, 1994). In particular, for the model considered in this paper, the asymptotic distribution theory of the ML estimator is inoperable because of the threshold parameters (Geweke and Terui, 1991, p. 43). Third, constraints on the model parameters can be incorporated through prior specifications. Finally, discrimination between models can be achieved through the calculation of model likelihoods and Bayes factors. In the classical framework, testing the null of K versus K′ states is not possible (see, e.g. Frühwirth-Schnatter, 2006, section 4.4). This paper proposes a novel MCMC scheme to perform the Bayesian estimation of MSGARCH models. Our methodology has the advantage of being fully automatic and thus avoids the time-consuming and difficult task of choosing and tuning a sampling algorithm. Non-expert users who need to run the estimation frequently and/or for a large number of time series should find this new procedure helpful in that regard. As an application, we fit a single-regime threshold GJR(1,1) model and a two-state Markov-switching threshold GJR(1,1) model to the Swiss Market Index log-returns. We use the random permutation sampler of Frühwirth-Schnatter (2001) to find suitable identification constraints and show the presence of two distinct volatility regimes in the time series. The posterior results indicate no difference between asymmetries and locations of the asymmetry for high versus low volatility periods. By using the Deviance information criterion and by estimating the model likelihoods, we show the in-sample superiority of the regime-switching specification. To test the predictive performance of the models, we run a forecasting analysis based on the Value at Risk and estimate the predictive likelihoods. Both measures indicate the better performance of the Markov-switching model. The paper proceeds as follows. We set up the model in Section 2. The MCMC scheme is detailed in Section 3. The models are estimated in Section 4, where we assess their goodness-of-fit and test their predictive performance. Section 5 concludes.
2. THE MODEL AND THE PRIORS

A Markov-switching threshold GJR(1,1) model with Student-t innovations may be written as follows:

y_t = ε_t (ϱ e_t′ h_t)^{1/2},   ε_t ∼ i.i.d. S(0, 1, ν),   for t = 1, . . . , T,    (2.1)

where e_t ≐ (I{s_t = 1} · · · I{s_t = K})′; I{·} is the indicator function; the sequence {s_t} is assumed to be a stationary, irreducible Markov process with discrete state space {1, . . . , K} and transition matrix P ≐ [P_ij], where P_ij ≐ P(s_{t+1} = j | s_t = i); S(0, 1, ν) denotes the standard Student-t density with ν degrees of freedom; ϱ ≐ (ν − 2)/ν is a scaling factor which ensures that the conditional variance is given by e_t′ h_t. We define the vector of threshold GJR(1,1) conditional
variances in (2.1) in a compact form as follows:

h_t ≐ α_0 + α_1 y²_{t−1} + α_2 ⊙ I{y_{t−1} < τ} (τ − y_{t−1})² + β ⊙ h_{t−1},    (2.2)

where h_t ≐ (h_t^1 · · · h_t^K)′, α_• ≐ (α_•^1 · · · α_•^K)′, β ≐ (β^1 · · · β^K)′, τ ≐ (τ^1 · · · τ^K)′ and I{y_{t−1} < τ} ≐ (I{y_{t−1} < τ^1} · · · I{y_{t−1} < τ^K})′. In order to ensure the positivity of the conditional variance in every regime, we require that α_0 > 0 and α_1, α_2, β ≥ 0 element-wise, where 0 is a vector of zeros. Moreover, we require that τ ≤ 0 to ensure that the conditional variance is minimal when the past return is zero (i.e. when the underlying price remains constant over the last time period). Indeed, a positive threshold would yield inconsistencies such as having a return of zero affecting the volatility by more than a positive return. It is also hard to justify why a positive return would be construed as bad news.² Finally, we set h_0 ≐ 0 and y_0 ≐ 0 for convenience.³ Specification (2.2) encompasses the Markov-switching GARCH(1,1) model of Haas et al. (2004) when α_2 = 0 and the Markov-switching GJR(1,1) model studied in Ardia (2008, chapter 7) when τ = 0. Single-regime models are obtained in a straightforward manner by setting K = 1. As pointed out in Section 1, expression (2.2) allows one to determine whether or not: (i) structural breaks are present within the volatility dynamics (e.g. α_0^k ≠ α_0^{k′}); (ii) an asymmetric response is present (i.e. α_2^k > 0 for at least one k), and is different between the regimes (i.e. α_2^k ≠ α_2^{k′}) and (iii) locations of bad news are similar between the regimes (i.e. τ^k = τ^{k′}). Moreover, expression (2.2) leads to a smooth news impact curve (Engle and Ng, 1993) at the threshold values and is convenient in the construction of the proposal density for the generation of the thresholds (see the Appendix). The use of a Student-t instead of a Normal distribution is quite popular in the standard single-regime GARCH literature. For regime-switching models, a Student-t distribution might be seen as superfluous since the switching regime can account for large unconditional kurtosis in the data. However, as empirically observed by Klaassen (2002), allowing for Student-t innovations within regimes enhances the stability of the states and allows one to focus on the conditional variance's behaviour instead of capturing some outliers. Moreover, the Student-t distribution includes the Normal distribution as the limiting case where the degrees of freedom parameter goes to infinity. We therefore have additional flexibility in the modelling and can impose Normality by constraining the lower boundary for the degrees of freedom parameter through the prior distribution. The Student-t specification in (2.1) needs to be re-written in order to perform a convenient Bayesian estimation (see, e.g. Geweke, 1993):

y_t = ε_t (ϖ_t ϱ e_t′ h_t)^{1/2},   ε_t ∼ i.i.d. N(0, 1),   ϖ_t ∼ i.i.d. IG(ν/2, ν/2),   for t = 1, . . . , T,
where N (0, 1) is the standard Normal and IG the Inverted Gamma density. The degrees of freedom parameter ν characterizes the density of t as follows:
ν ν2 ν −1 − ν −1 ν . (2.3) p(t |ν) = t 2 exp − 2 2 2t 2 A sensitivity analysis has been performed in which the maximum value of the threshold was set to a small positive value. Results were however similar to those obtained by setting the maximum value at zero. . 3 The assumption h = 0 could be relaxed, but for a large number of observations which is often the case with financial 0 data, this should have a negligible impact on the posterior distribution. A sensitivity analysis has been performed and did not show significant differences in the posterior results.
C The Author(s). Journal compilation C Royal Economic Society 2008.
Bayesian Markov-switching threshold asymmetric GARCH
109
. For a parsimonious expression of the likelihood function, we define the vectors y = . . . (y 1 · · · y T ) , = ( 1 · · · T ) , s = (s 1 · · · s T ) and α = (α 0 α 1 α 2 ) . The model parameters are . . then regrouped into = (ψ, , s), where ψ = (α, β, τ , ν, P ). Finally, we define the diagonal . t T matrix = ( ) = diag ({t et h }t=1 ) where we recall that , e t and h t are both functions of the model parameters. We can now express the likelihood function of as follows: 1 L( | y) ∝ (det )−1/2 exp − y −1 y . 2 This likelihood function is invariant with respect to relabelling the states, which leads to a lack of identification of the state-specific parameters. So, without a prior inequality restriction on some state-specific parameters, a multimodal posterior is obtained and is difficult to interpret and summarize. To overcome this problem, we make use of the permutation sampler of Fr¨uhwirthSchnatter (2001) to find suitable identification constraints. The permutation sampler requires priors that are labelling invariant. Furthermore, we cannot be completely non-informative about the state specific parameters since, from a theoretical viewpoint, this would result in improper posteriors (see, e.g. Diebolt and Robert, 1994). In addition, a diffuse prior on τ would lead to a flat posterior since the threshold parameters are unidentified for α 2 = 0. For the scedastic function’s parameters α, β and τ , we use truncated Normal densities: p(α) ∝ N3K (α|μα , α ) I{α 0} p(β) ∝ NK (β|μβ , β ) I{β 0} p(τ ) ∝ NK (τ |μτ , τ ) I{τmin τ 0} , where μ • , • and τ min are the hyperparameters and Nd is the d-dimensional Normal density (d > 1). The assumption of labelling invariance is fulfilled if we assume that the hyperparameters are . . . . the same for all states. In particular, we set [μα ]i = μα0 , [ α ]ii = σα20 , [μβ ]i = μβ , [ β ]ii = . . . . . σβ2 , [μτ ]i = μτ , [ τ ]ii = στ2 , [τmin ]i = τmin for i = 1, . . . , K; [μα ]i = μα1 , [ α ]ii = σα21 for . . 2 i = K + 1, . . . , 2K and [μα ]i = μα2 , [ α ]ii = σα2 for i = 2K + 1, . . . , 3K, where μ • , σ 2• and τ min are fixed values. Note that a lower boundary τ min is used since the likelihood function is invariant for threshold values below the minimum value of the observed data. Hence, the prior on τ could be considered to correspond to an empirical Bayes approach rather than a fully Bayesian one. The prior density of the vector conditional on ν is found by noting that t are independent and identically distributed from (2.3), which yields: T − ν2 −1 T
ν T2ν ν −T 1 ν p( |ν) = t exp − . 2 2 2 t=1 t t=1 Following Deschamps (2006), we choose a translated Exponential with parameters λ > 0 and δ 2 for the prior on the degrees of freedom parameter: p(ν) = λ exp[−λ(ν − δ)] I{δ < ν < ∞} . For large values of λ, the mass of the prior is concentrated in the neighbourhood of δ and a constraint on the degrees of freedom can be imposed in this manner. The Normality for the errors is obtained when δ becomes large. As pointed out by Deschamps (2006), this prior density is useful for two reasons. First, for numerical reasons, to bound the degrees of freedom parameter C The Author(s). Journal compilation C Royal Economic Society 2008.
110
D. Ardia
away from two which avoids explosion of the conditional variance. Second, we can approximate the Normality for the errors while maintaining a reasonably tight prior which can improve the convergence of the MCMC sampler. Conditionally on the transition probabilities matrix P, the prior on vector s is Markov: p(s|P ) = π (s1 )
K K
N
Pij ij ,
i=1 j =1
. where N ij = #{s t+1 = j |s t = i} is the number of one-step transitions from state i to j in the vector s. The mass function for the initial state, π (s 1 ), is obtained by calculating the ergodic probabilities of the Markov chain (see, e.g. Hamilton, 1994, section 22.2). The prior density for the transition matrix is obtained by assuming that the K rows are . independent and that the density of the ith row is Dirichlet with parameter η i = (η i1 · · · η iK ): p(P ) =
K
D(ηi ) ∝
K K
η −1
Pijij
.
i=1 j =1
i=1
. . Due to the labelling invariance assumption, we require that η ii = η p for i = 1, . . . , K and η ij = η q for i, j ∈ {1, . . . , K; i = j }. Finally, we form the joint prior by assuming prior independence between α, β, τ , ( , ν) and (s, P ). The joint posterior density is then obtained by combining the likelihood function and the joint prior via Bayes’ rule.
3. SIMULATING THE JOINT POSTERIOR We draw an initial value from an arbitrary proper distribution and then we cycle through the full conditionals: p(α|β, τ , , ν, s, y)
p(β|α, τ , , ν, s, y) p(τ |α, β, , ν, s, y)
p( |α, β, τ , ν, s, y)
p(ν| )
p(s|α, β, τ , , ν, P , y) p(P |s) , using the most recent conditional values. Among the full conditional densities listed above, only and P can be simulated from known expressions. Draws of α, β and τ are achieved by a multivariate extension of the methodology proposed by Nakatsuma (2000). The generation of state vector s is made by using the FFBS algorithm described in Chib (1996). Finally, sampling ν is achieved by an efficient rejection technique. The reader is referred to the Appendix for further details on the simulation techniques. As noted previously, we rely on the permutation sampler of Fr¨uhwirth-Schnatter (2001) to overcome the identification problem encountered with mixture models. 4 We use the random permutation sampler to determine suitable identification constraints; in this version of the permutation sampler, each pass of the MCMC scheme is followed by a random permutation of the regime definition. This algorithm improves the mixing of the MCMC sampler and allows 4 The permutation-augmented sampler by Geweke (2007) can also be used to that purpose. The constrained permutation sampler is however useful as a diagnostic test to determine whether the constraint is well suited.
C The Author(s). Journal compilation C Royal Economic Society 2008.
Bayesian Markov-switching threshold asymmetric GARCH
111
to explore the full unconstrained parameter space. Then, we post-process the MCMC output of the random permutation sampler in an exploratory way to determine appropriate identification constraint. At this stage, the model parameters are estimated again under the constraint by enforcing the corresponding permutation of the regimes; this version of the permutation sampler is referred to as the constrained permutation sampler.
4. AN APPLICATION TO THE SWISS MARKET INDEX We apply our Bayesian estimation method to demeaned daily log-returns {y t } of the Swiss Market Index (SMI). The sample period is from November 12, 1990 to April 10, 2006 for a total of 4000 observations and the log-returns are expressed in percent. The first 2500 observations, which represent slightly less than two third of the sample, are used to estimate the models while the remaining 1500 log-returns are used in a forecasting performance analysis. The time series under investigation is plotted in Figure 1. We test for autocorrelation in the times series by testing the joint nullity of autoregressive coefficients for {y t }. We estimate the regression with autoregression coefficients up to lag 15 and compute the covariance matrix using the White estimate. The p-value of the Wald test is 0.1427, which does not support the presence of autocorrelation. When testing for the autocorrelation in the series of squared observations we strongly reject the null of the absence of autocorrelation. Thus, as is typical for financial data, the presence of GARCH effects in the time series of returns is confirmed. 4.1. Estimation We apply our Bayesian estimation to a single-regime threshold GJR(1,1) model and a two-state Markov-switching threshold GJR(1,1) model, henceforth referred to as TGJR and MSTGJR. In order to diminish the impact of the prior on the joint posterior, we estimate both models under rather vague priors. For the priors on the scedastic function’s parameters, we set μ • to zero, σ 2• to 10,000 and τ min to −1.38, which is the 2.5th percentile of the observed log-returns. For the prior on the degrees of freedom parameter, we choose λ = 0.01 and δ = 2 to ensure the existence of the conditional variance. Finally, we set η ii = 2 and η ij = η j i = 1 for i, j ∈ {1, 2} so that we have a prior belief that the probabilities of persistence are bigger than the probabilities of transition. A sensitivity analysis has been performed and confirmed that our initial choice is vague enough and does not introduce significant information in our estimation. We run two chains for 50,000 iterations each and assess the convergence of the sampler by using the diagnostic test of Gelman and Rubin (1992). The convergence appears rather quickly, but we nevertheless consider the first half of the iterations as a burn-in phase for precaution. For the TGJR model, the acceptance rate is 88% for α and 97% for β, indicating that the proposal distributions for these parameters are close to the full conditionals. The acceptance rate for τ is 32%. The one-lag autocorrelations in the chain range from 0.34 for α 1 to 0.94 for β which is reasonable. For the MSTGJR model, the random permutation sampler is run first to determine suitable identification constraints. In Figure 2, we show the contour plots of the posterior density for (β k , α k0 ), (β k , α k1 ), (β k , α k2 ) and (β k , τ k ), respectively. 5 As we can notice, the bimodality of the posterior density is clear for the parameter β k on the four graphs, suggesting a constraint of the type β 1 < β 2 for identification. Therefore, the model is estimated again under this constraint; 5
The value k is arbitrary since all marginal distributions contain the same information (Fr¨uhwirth-Schnatter, 2001).
C The Author(s). Journal compilation C Royal Economic Society 2008.
112
D. Ardia
7.5
5.0
2.5
0.0
–2.5
–5.0
–7.5
1991
1993
1995
1997
1999
2001
2003
2005
year
Figure 1. Swiss Market Index daily log-returns (in %).
in that case, label switching only appeared 26 times after the burn-in phase thus confirming the suitability of the identification constraint. The acceptance rates obtained with the constrained permutation sampler range from 15% for τ to 93% for β. The one-lag autocorrelations range from 0.7 for α 21 to 0.97 for α 10 . We keep every fifth draw from the MCMC output for both models in order to diminish the autocorrelation in the chains. The two chains are then merged to get a final sample of length 10,000. Our algorithm is not only fully automatic, the results described here also show desirable convergence properties. Note that a three-state MSTGJR model has also been estimated. However post-processing the MCMC output has not allowed to find any clear identification constraint, which suggests that the number of regime is too large (Fr¨uhwirth-Schnatter, 2006, section 4.2). Moreover, the model likelihood estimate clearly indicates that a three-state Markov-switching model is not supported by the data (see Table 3). The posterior statistics for both models are reported in Table 1. In the case of the TGJR model (upper panel), we note the presence of an asymmetric response to past shocks; the posterior mean of α 2 is 0.174 and the probability P(α2 > 0| y) is one. The posterior mean of β is 0.803, indicating a high memory in the conditional variance process. The posterior mean and median of the threshold parameter are −0.094 and −0.071, respectively. However, the 95% highest posterior density interval (HPDI) for τ does contain the value of zero, suggesting that the threshold value C The Author(s). Journal compilation C Royal Economic Society 2008.
113
Bayesian Markov-switching threshold asymmetric GARCH
αk0
αk1 0.08
0.50 0.45 0.40 0.35 0.30 0.25 0.20 0.15 0.10 0.05 0.00
0.07 5
0.06
4
0.02
35
2
0.01
45
12
6
40
0.4
0.6
0.8
1.0
30
0.0
0.2
20
0.2
90
0.00 0.0
50 40
25
70
16
8
18 14
12
20
0.4
0.6
60 65
0.03
10
14
25
35
15
8
20
30
10
0.04
4
18
10
0.05
75
16
20
10
15
6
45
2
55
0.8
1.0
k
βk
β αk2
τk
3
2
8
4
6.5
4.5
5.5 4.5
3.5 2.5
–0.4
5
7
2.5
13 3
0.10
4
1
7
2
15
5
0.15
–0.3
6.5
2
6
6
7
–0.2
14
8
0.20
3
17 11 9
9
0.25
–0.1
8.5
7.5
6
10
5
5
6
5.
9
0.30
5 7 9 8 7.5
4 0 12 8 1
4
3
0.0
2
0.35
3.5
0.40
1
–0.5
0.05 0.00
1.5
–0.6 0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.2
0.4
0.6
βk
0.8
1.0
βk
Figure 2. Contour plots for (β k , α k0 ), (β k , α k1 ), (β k , α k2 ) and (β k , τ k ).
is not significantly different from zero at the 5% level. Finally, the value of 8.64 for the posterior mean of the degrees of freedom parameter indicates conditional leptokurtosis in the data set. In the MSTGJR case (lower panel), we observe an asymmetric response to past shocks on the conditional variance for both states; the posterior mean of α k2 is 0.253 and 0.243 for the first and the second state, respectively; both estimated probabilities P(α2k > 0| y) are equal to one. A comparison of the scedastic function’s parameters between the two regimes indicates similar 95% HPDI for the components of α 1 and α 2 while the difference for components of α 0 is slightly more pronounced. The 95% HPDI for parameters α 11 and α 21 does contain the value of zero. Moreover, the 95% HPDI of (α 12 − α 22 ) (not shown) does contain the value of zero, suggesting that the asymmetric function is similar for both regimes. We note that the difference between the two regimes is significant for β since the 95% HPDI intervals do not overlap. The posterior mean of β 1 is 0.411, indicating a low memory in the variance process, while the posterior mean of β 2 is 0.788, similar to the value observed in the single-regime model. For the threshold parameter, the C The Author(s). Journal compilation C Royal Economic Society 2008.
114
D. Ardia
ψ
ψ
Table 1. Estimation results. ψ 0.5 [95% HPDI]
NSE
IF
TGJR α1 α2 τ
0.058 0.174 −0.094
0.057 0.170 −0.071
0.024 0.100 −0.267
0.095 0.257 0.000
0.242 0.598 2.179
1.74 2.18 6.81
β ν
0.803 8.648
0.804 8.502
0.745 6.348
0.856 11.100
1.118 42.178
15.25 11.19
0.256 0.185
0.256 0.179
0.151 0.088
0.358 0.295
2.946 1.984
29.39 12.24
α 11 α 21
0.023 0.024
0.017 0.020
0.000 0.000
0.063 0.058
0.325 0.290
2.76 2.45
α 12 α 22 τ1
0.253 0.243 −0.089
0.245 0.229 −0.062
0.096 0.109 −0.247
0.421 0.394 0.000
2.007 1.957 3.214
5.27 5.82 11.71
τ2 β1 β2
−0.223 0.411 0.788
−0.172 0.410 0.791
−0.616 0.208 0.700
0.000 0.627 0.872
11.664 4.126 1.882
35.34 14.41 17.02
ν p 11
10.020 0.996
9.866 0.997
7.097 0.993
12.930 1.000
54.942 0.023
12.61 1.22
p 22
0.995
0.996
0.990
0.999
0.027
1.17
MSTGJR α 10 α 20
Note: ψ: posterior mean; ψ 0.5 : posterior median; [95% HPDI]: 95% highest posterior density interval; NSE: numerical standard error (×103 ) and IF: inefficiency factor (i.e. ratio of the squared numerical standard error and the variance of the sample mean from a hypothetical i.i.d. sampler). The posterior statistics are based on 10,000 draws from the constrained posterior sample.
posterior mean is −0.089 in the first regime and −0.223 in the second regime. However, the 95% HPDI for (τ 1 − τ 2 ) (not shown) does contain the value of zero, indicating that the thresholds are not significantly different at the 5% level. Overall, these results suggest that both the asymmetric responses and the locations of the asymmetry are similar between the two regimes. 6 As in the single-regime model, the posterior distribution for the degrees of freedom parameter indicates conditional leptokurtosis. We note however that the posterior mean and median are larger than for the TGJR model. The posterior means for probabilities p 11 and p 22 are, respectively 0.996 and 0.995, indicating infrequent mixing between states. The 95% HPDI of (p 11 − p 22 ) (not shown) does contain the value of zero. Finally, the inefficiency factors (IF) reported in the last column of Table 1 indicate that using 10,000 draws out of the MCMC sampler seems appropriate
6 We could continue with a reduced model, i.e. by setting α 1 = α 2 , α 1 = α 2 and τ 1 = τ 2 . However, this is not the goal 1 1 2 2 of the paper to perform such model selection.
C The Author(s). Journal compilation C Royal Economic Society 2008.
115
Bayesian Markov-switching threshold asymmetric GARCH
Daily log–returns (in percent)
Pr(st=2 | y) 1.00
7.5
5.0
0.75 2.5
0.0 0.50
–2.5
0.25 –5.0
–7.5 0.00
1991
1992
1993
1994
1995
1996
1997
1998
1999
2000
2001
year Figure 3. Smoothed probabilities for the second regime (solid line, left-hand axis) together with the insample log-returns (circles, right-hand axis).
if we require that the Monte Carlo error in estimating the mean is smaller than one percent of the variation of the error due to the data. 7 In Figure 3, we present the smoothed probabilities for the second regime (solid line, left-hand axis) together with the in-sample daily log-returns (circles, right-hand axis). The 95% (robust) confidence bands are shown in dashed lines but are almost indistinguishable from the point estimates. As we can notice, the second state is clearly associated with high-volatility periods, while the first regime corresponds to more tranquil periods. The beginning of the year 1991 is 7 The inefficiency factors are computed as the ratio of the squared numerical standard error (NSE) of the MCMC simulations and the variance estimate divided by the number of iterations. The NSE are estimated by the method of Andrews (1991), using a Parzen kernel and AR(1) pre-whitening as presented in Andrews and Monahan (1992). Note also that the inefficiency factor is the inverse of the relative numerical efficiency (RNE) introduced by Geweke (1989).
C The Author(s). Journal compilation C Royal Economic Society 2008.
116
D. Ardia
associated with the high-volatility state. Then, from the second half of 1991 to 1997, the returns are clearly associated with the low-volatility regime, with the exception of 1994. From 1997 to 2000, the model remains in the high-volatility regime with a transition during the second semester 2000 to the low-volatility state. 4.2. In-sample performance analysis 4.2.1. Model diagnostics. We check for model misspecification by analyzing the predictive probabilities also referred to as p-scores (see, e.g. Kaufmann and Fr¨uhwirth-Schnatter, 2002). We make use of a simpler version of this method, as proposed by Kim et al. (1998), which . consists in conditioning on point estimates of ψ = (α, β, τ , ν, P ). To be meaningful, the point estimate has to be chosen when the identification is imposed. Hence, we consider the mean ψ of the constrained posterior sample. Upon defining Ft−1 as the information set up to time (t − 1), the (approximate) p-scores are defined as follows: . zt = P(Yt yt |st = k, ψ, Ft−1 ) P(st = k|ψ, Ft−1 ) . K
k=1
The probability P(Yt yt |st = k, ψ, Ft−1 ) can be estimated by the Student-t integral and the filtered probability P(st = k|ψ, Ft−1 ) is obtained as a byproduct from the FFBS algorithm (see, e.g. Chib, 1996, p. 83). Under a correct specification, the p-scores should have independent uniform distributions asymptotically (Rosenblatt, 1952). A further transformation through the . Normal integral is often applied for convenience, i.e. u t = −1 (z t ) where −1 (•) denotes the inverse cumulative standard Normal function. If the model is correct, {u t } should be independent standard Normal and common tests can be used to check these features. In particular, we test the presence of autocorrelation in the series {u t } and {u2t } using a Wald test. We also report the results of a joint test for zero mean, unit variance, zero skewness, and the absence of excess kurtosis (Berkowitz, 2001). In the case of the TGJR model, the Wald statistic for testing the joint nullity of autoregressive coefficients up to lag 15 for u t has a p-value of 0.0903, and for u2t , a p-value of 0.3838. In the case of the MSTGJR model, the p-values are 0.0841 and 0.3909, respectively. Therefore, both models adequately capture the volatility clustering present in the data. The Normality test of Berkowitz (2001) yields p-values of 0.1184 for the TGJR model and 0.0756 for the MSTGJR model. Overall, these results indicate no evidence of misspecification at the 5% level for both models. 4.2.2. Deviance information criterion. In order to evaluate the goodness-of-fit of the models, we first use the Deviance information criterion (DIC) introduced by Spiegelhalter et al. (2002). Given a set of models, the one with the smallest DIC has the best balance between goodness-of-fit and model complexity. As noted in Celeux et al. (2006), difficulties arise when applying this criterion to mixture models. To overcome these problems, we integrate out the state vector by considering the observed likelihood and make use of the constrained posterior sample in the estimation (Celeux et al., 2006, section 3.1). In the context of Markov-switching models, the observed likelihood is C The Author(s). Journal compilation C Royal Economic Society 2008.
117
Bayesian Markov-switching threshold asymmetric GARCH Table 2. Deviance information criterion. DIC
Model TGJR MSTGJR
6762.8 6706.6
95% CI [6762.3 6763.3] [6705.8 6707.1]
Note: 95% CI: 95% confidence interval of the DIC computed by a resampling technique.
(see, e.g. Kaufmann and Fr¨uhwirth-Schnatter, 2002, p. 457): L(ψ| y) =
K T t=1
p(yt |ψ, st = k, Ft−1 ) P(st = k|ψ, Ft−1 ) ,
(4.1)
k=1
. and DIC = 2 {ln L ψ| y − 2 Eψ| y [ln L(ψ| y)]}. The DIC estimates are reported in Table 2 together with their 95% confidence intervals obtained by a resampling technique (Ardia, 2008, section 7.4.2). From this table, we can notice that the criterion favours the MSTGJR model. Indeed, the DIC estimates based on the initial joint posterior sample is 6762.8 for the TGJR model and 6706.6 for the MSTGJR model. Both 95% confidence intervals do not overlap which suggests significant improvement of the Markov-switching model. 4.2.3. Model likelihood. As a second criterion to discriminate between the models, we consider the model likelihood: p( y) = L(ψ| y)p(ψ)dψ , where L(ψ| y) is given in (4.1) and p(ψ) is the joint prior density on ψ. The estimation of p( y) requires the integration over the whole set of parameters, which is a difficult task in practice. In the context of mixture models, Fr¨uhwirth-Schnatter (2004) documents that the bridge sampling technique (Meng and Wong, 1996) using the MCMC output of the random permutation sampler and an i.i.d. sample from an importance density q(ψ) which approximates the unconstrained posterior yields the best (and a robust) estimator. Specifically, the model likelihood is approximated as the limit of: . pt ( y) = pt−1 ( y) ×
L
pt−1 (ψ [l] | y) l=1 Lq(ψ [l] )+Mpt−1 (ψ [l] | y) q(ψ [m] ) 1 M m=1 Lq(ψ [m] )+Mpt−1 (ψ [m] | y) M 1 L
,
{ψ [l] }Ll=1 are i.i.d. draws from where {ψ [m] }M m=1 are MCMC draws from the joint posterior, . the importance sampling density q(ψ) and pt (ψ| y) = L(ψ| y)p(ψ)/pt ( y). This sequence is typically initialized with the reciprocal importance sampling estimator of Gelfand and Dey (1994). We adopt the approach of Kaufmann and Fr¨uhwirth-Schnatter, (2002, pp. 438–439) to construct an importance density which reproduce the K! modes of the unconstrained posterior. Specifically, q(ψ) is obtained from the MCMC output of the random permutation sampler using C The Author(s). Journal compilation C Royal Economic Society 2008.
118
D. Ardia
Model
Table 3. Model likelihood estimators. ln p 0 ( y)
ln p( y)
TGJR MSTGJR
−3405.33 (2.92) −3394.04 (3.19)
−3408.04 (2.85) −3401.00 (3.40)
MSTGJR (three states)
−3427.82 (8.73)
−3492.74 (9.89)
Note: ln p 0 ( y): natural logarithm of the model likelihood estimate using reciprocal importance sampling; ln p( y): natural logarithm of model likelihood estimate using bridge sampling; (•) numerical standard error of the estimator (×102 ).
a mixture of the proposal and conditional densities: R . 1 qα α|α [r] , β [r] , τ [r] , [r] , ν [r] , s[r] , y q(ψ) = R r=1 × qβ β|α [r] , β [r] , τ [r] , [r] , ν [r] , s[r] , y
[r] [r] [r] [r] [r] [r] [r] × qτ τ |α , β , τ , , ν , s , y × p(P |s ) × qν (ν) ,
where q α , q β and q τ are the proposal densities for α, β and τ , respectively, and p(P |s[r] ) is the product of Dirichlet posterior densities for the transition probabilities (see the Appendix). q ν is a truncated skewed Student-t density whose parameters are estimated by ML from the posterior sample (Ardia, 2008, section 7.4.3). In Table 3, we report the natural logarithm of the model likelihoods obtained using the reciprocal importance sampling estimator (second column) and the bridge sampling estimator (last column) for M = L = 1000 draws with R = 1000 components. From this table, we can notice that both estimators are higher for the MSTGJR model, indicating a better in-sample fit for the regime-switching specification. As an additional discrimination criterion, we compute the (transformed) Bayes factor in favour of the MSTGJR model (see, e.g. Kass and Raftery, 1995, section 3.2). The estimated value is 2 × ln BF = 2 × [−3401.00 − (−3408.04)] = 10.08, which strongly supports the in-sample evidence in favour of the regime-switching model. 8 4.3. Forecasting performance analysis In order to evaluate the ability of the competing models to predict the future behaviour of the volatility process, we first study the forecasted one-day ahead Value at Risk (VaR), which is a common tool to measure financial and market risks. The one-day ahead VaR at risk level φ ∈ (0, 1), VaRφ , is estimated by calculating the (1 − φ)th percentile of the one-day ahead predictive 8 The model likelihood is sensitive to the choice of the prior distribution, so we must test whether an alternative joint prior specification would have modified the conclusion of our analysis. To answer this question, we modified the hyperparameters’ value and ran the sampler again. We considered slightly more informative priors for the scedastic function’s parameters. As an alternative prior on the degrees of freedom parameter we chose λ = 0.02 and δ = 2 which implies a prior mean of 52. Finally, the hyperparameters for the prior on the transition probabilities were set to η ii = 3 and η ij = η j i = 1 for i, j ∈ {1, 2}. The results were similar to those obtained previously, confirming the better fit of the MSTGJR model.
C The Author(s). Journal compilation C Royal Economic Society 2008.
Bayesian Markov-switching threshold asymmetric GARCH
119
distribution which is obtained by simulation from the joint posterior sample. 9 For the two models, the predictions are obtained for the out-of-sample window which consists of 1500 daily logreturns. To verify the accuracy of the VaR estimates, we adopt the testing methodology of φ Christoffersen (1998). This approach is based on the study of the random sequence {V t } where φ . φ . Vt = I{yt+1< VaRφt } if φ > 0.5 and Vt = I{yt+1>VaRφt } if φ 0.5. A sequence of VaR forecasts at φ risk level φ has correct conditional coverage if {V t } is an independent and identically distributed sequence of Bernoulli random variables with parameter (1 − φ) if φ > 0.5 and with parameter φ if φ 0.5. This hypothesis can be verified by testing jointly the independence on the series and the unconditional coverage of the VaR forecasts. 10 The forecasting results of the VaR with typical (short and long positions) risk levels used in financial risk management are reported in Table 4. The second and third columns give the expected and observed number of violations. The last three columns report the p-values for the tests of correct unconditional coverage (UC), independence (IND) and correct conditional coverage (CC). From this table, we note that the observed number of violations for the MSTGJR model are closer to the expected values than for the TGJR model. Indeed, at the 5% significance level, the UC test is rejected two times for the MSTGJR model while it is rejected five times for the TGJR model. The IND test is rejected two times for both models. We can notice that for some risk levels this test is not applicable since no consecutive violations have been observed. The joint hypothesis of correct unconditional coverage and independent sequence is obtained via the CC test. In the case of the MSTGJR model, the CC test is (slightly) rejected for risk levels 0.9 and 0.075. The test is strongly rejected for the TGJR model at all risk levels except for φ = 0.1. As a second measure of the forecasting performance, we compute the natural logarithm of the predictive likelihood p(y T +1 . . . y T +h | y) for the two models, where T = 2500 and h = 1500. The estimation is obtained by Monte Carlo integration from the joint posterior sample (see, e.g. Geweke, 2005, section 2.6.2). The estimate is −5242.8 for the single-regime model and −5217.8 for the Markov-switching model. The numerical standard errors of the estimates, obtained by a resampling technique (Ardia, 2008, section 7.4.2), are 0.152 and 0.173, respectively. Hence, the predictive likelihood for the MSTGJR model is significantly higher than for the TGJR model. Overall, these results indicate the better out-of-sample performance of the Markov-switching model compared to the single-regime specification.
5. CONCLUSION MSGARCH models provide an explanation for the high persistence in volatility observed in single-regime GARCH models and allow for a sudden change in the (unconditional) 9 In order to simulate from the predictive distribution over the out-of-sample observations window, the posterior sample of should be updated using the most recent information. As a consequence, forecasting the one-day ahead VaR would necessitate the estimation of the joint posterior sample at each time point in the out-of-sample observation window. However, such an approach is computationally impractical for large data set such as ours. Combination of MCMC and importance sampling to estimate efficiently this predictive distribution is proposed by Gerlach et al. (1999). Nevertheless, for the sake of simplicity, we consider the same joint posterior sample, based on the in-sample data set, when forecasting the VaR. 10 A (conservative) joint test for multiple VaR risk levels can be obtained using the Bonferroni correction. Specifically, a joint test of k VaR risk levels at significance level s is rejected if the smallest p-value among the k individual tests is smaller than s/k.
C The Author(s). Journal compilation C Royal Economic Society 2008.
120
D. Ardia
E(Vtφ )
φ
Table 4. Results of the VaR. # UC
IND
CC
TGJR 0.99 0.975 0.95
15.0 37.5 75.0
17 41 97
0.611 0.568 0.013
NA NA 0.124
NA NA 0.013
0.925 0.9
112.5 150.0
130 161
0.093 0.349
0.024 0.002
0.019 0.006
0.1 0.075 0.05
150.0 112.5 75.0
129 88 50
0.065 0.013 0.002
0.288 0.276 NA
0.103 0.025 NA
0.025 0.01
37.5 15.0
13 5
0.000 0.003
NA NA
NA NA
0.99 0.975
15.0 37.5
15 36
1.000 0.803
NA NA
NA NA
0.95 0.925 0.9
75.0 112.5 150.0
87 115 149
0.165 0.807 0.931
0.105 0.134 0.014
0.103 0.315 0.048
0.1 0.075
150.0 112.5
149 101
0.931 0.252
0.075 0.022
0.203 0.038
0.05 0.025 0.01
75.0 37.5 15.0
62 18 5
0.113 0.000 0.003
NA NA NA
NA NA NA
MSTGJR
φ
Note: φ: risk level; E(Vt ): expected number of violations; #: observed number of violations; UC: p-value for the correct unconditional coverage test; IND: p-value for the independence test; CC: p-value for the correct conditional coverage test and NA: not applicable.
volatility level which improves significantly the volatility forecasts. Different parametrizations of MSGARCH models have been proposed in the literature but they raise difficulties both in their estimation and their interpretation. Haas et al. (2004) overcome these problems by providing a model which can be estimated by ML and which allows for a clear-cut interpretation of the variance dynamics in each regime. This paper generalizes the model of Haas et al. (2004) in three ways. First, we account for asymmetric movements between the conditional variances and the underlying time series by replacing the separate GARCH(1,1) processes by GJR(1,1) processes. Second, a free non-positive threshold parameter is used to test whether it is the sign of the past shock or the effect of a particularly bad news which influences the conditional variance. Finally, we consider Student-t innovations instead of Normal innovations to gain flexibility in the modelling. The model is estimated under the Bayesian approach using a novel, fully automatic MCMC procedure. As an application, we fit a single-regime threshold GJR(1,1) model and a two-state Markov-switching threshold GJR(1,1) model to the Swiss Market Index logreturns. We show the presence of two distinct volatility regimes in the time series. Moreover, the posterior results indicate no difference between asymmetries and locations of the asymmetry C The Author(s). Journal compilation C Royal Economic Society 2008.
Bayesian Markov-switching threshold asymmetric GARCH
121
for highly volatile periods relative to more subdued ones. Finally, we test the in- and out-ofsample performance of the two competing models and document the better performance of the Markov-switching specification. In this study, we have considered a fixed transition matrix for the state process. As a consequence, the expected persistence of the regimes is constant over time, which is questionable. In a more general formulation, we could allow the transition probabilities to change over time depending on some observables (see, e.g. Gray, 1996). Also, the assumption of recurring states could be relaxed using the hierarchical approach of Pesaran et al. (2006). Finally, we could allow the degrees of freedom parameter to be state dependent as in Dueker (1997) and Perez-Quiros and Timmermann (2001). We leave these extensions for future research.
ACKNOWLEDGMENTS This version has been written while the author was visiting the Econometric Institute, Erasmus University Rotterdam, the Netherlands. The author sincerely acknowledges the hospitality of Herman K. van Dijk and is grateful to the Swiss National Science Foundation (under grant #FN PB FR1-121441) for financial support. The author wishes to thank Luc Bauwens, Cathy W. Chen, Carlos Ord´as Criado, Philippe J. Deschamps, Dennis Fok, Richard H. Gerlach, Lennart F. Hoogerheide, Richard Paap, Denis Pelletier, J´erˆome Ph. Taillard, Herman K. van Dijk and Martin Wallmeier for helpful comments. He also acknowledges two anonymous reviewers and the Editor, Olivier Linton, for numerous helpful suggestions for improvement of the paper. Finally, the author thanks participants of the Econometric Institute seminars, Erasmus University Rotterdam, participants of the 2nd International Workshop on Computational and Financial Econometrics, University of Neuchˆatel, and participants of the 14th International Conference on Computing in Economics and Finance, University la Sorbonne Paris. Any remaining errors or shortcomings are the author’s responsibility.
REFERENCES Andrews, D. W. K. (1991). Heteroskedasticity and autocorrelation consistent covariance matrix estimation. Econometrica 59, 817–58. Andrews, D. W. K. and J. C. Monahan (1992). An improved heteroskedasticity and autocorrelation consistent covariance matrix estimator. Econometrica 60, 953–66. Ardia, D. (2008). Financial Risk Management with Bayesian Estimation of GARCH Models: Theory and Applications, Volume 612: Lecture Notes in Economics and Mathematical Systems. Heidelberg, Germany: Springer. Berkowitz, J. (2001). Testing density forecasts, with applications to risk management. Journal of Business and Economic Statistics 19, 465–74. Black, F. (1976). The pricing of commodity contracts. Journal of Financial Economics 3, 167–79. Celeux, G., F. Forbes, C. P. Robert and M. Titterington (2006). Deviance information criterion for missing data models. Bayesian Analysis 1, 651–706. Chib, S. (1996). Calculating posterior distributions and modal estimates in Markov mixture models. Journal of Econometrics 75, 79–97. Christoffersen, P. F. (1998). Evaluating interval forecasts. International Economic Review 39, 841–62. C The Author(s). Journal compilation C Royal Economic Society 2008.
122
D. Ardia
Deschamps, P. J. (2006). A flexible prior distribution for Markov switching autoregressions with Student-t errors. Journal of Econometrics 133, 153–90. Diebolt, J. and C. P. Robert (1994). Estimation of finite mixture distributions through Bayesian sampling. Journal of the Royal Statistical Society, Series B 56, 363–75. Dueker, M. J. (1997). Markov switching in GARCH processes and mean-reverting stock-market volatility. Journal of Business and Economic Statistics 15, 26–34. Engle, R. F. and V. K. Ng (1993). Measuring and testing the impact of news on volatility. Journal of Finance 48, 1749–78. Fr¨uhwirth-Schnatter, S. (2001). Markov chain Monte Carlo estimation of classical and dynamic switching and mixture models. Journal of the American Statistical Association 96, 194–209. Fr¨uhwirth-Schnatter, S. (2004). Estimating marginal likelihoods for mixture and Markov switching models using bridge sampling techniques. The Econometrics Journal 7, 143–67. Fr¨uhwirth-Schnatter, S. (2006). Finite Mixture and Markov Switching Models. Springer Series in Statistics. New York/Berlin/Heidelberg: Springer. Gelfand, A. E. and D. K. Dey (1994). Bayesian model choice: Asymptotics and exact calculations. Journal of the Royal Statistical Society, Series B 56, 501–14. Gelman, A. and D. B. Rubin (1992). Inference from iterative simulation using multiple sequences. Statistical Science 7, 457–72. Gerlach, R. H., C. Carter and R. Kohn (1999). Diagnostics for time series analysis. Journal of Time Series Analysis 20, 309–330. Geweke, J. F. (1989). Bayesian inference in econometric models using Monte Carlo integration. Econometrica 57, 1317–39. Geweke, J. F. (1993). Bayesian treatment of the independent Student-t linear model. Journal of Applied Econometrics 8, S19–40. Geweke, J. F. (2005). Contemporaneous Bayesian Econometrics and Statistics. Series in Probability and Statistics. Hoboken, New Jersey, USA: John Wiley & Sons. Geweke, J. F. (2007). Interpretation and inference in mixture models: Simple MCMC works. Computational Statistics and Data Analysis 51, 3529–50. Geweke, J. F. and N. Terui (1991). Threshold autoregressive models for macroeconomic time series: A Bayesian approach. In American Statistical Association 1991 Proceedings of the Business and Economic Statistics Section, 42–50. Glosten, L. R., R. Jaganathan, and D. E. Runkle (1993). On the relation between the expected value and the volatility of the nominal excess return on stocks. Journal of Finance 48, 1779–801. Gray, S. F. (1996). Modeling the conditional distribution of interest rates as a regime-switching process. Journal of Financial Economics 42, 27–62. Haas, M., S. Mittnik and M. S. Paolella (2004). A new approach to Markov-switching GARCH models. Journal of Financial Econometrics 2, 493–530. Hamilton, J. D. (1994). Time Series Analysis. Princeton, USA: Princeton University Press. Hamilton, J. D. and R. Susmel (1994). Autoregressive conditional heteroskedasticity and changes in regime. Journal of Econometrics 64, 307–33. Kass, R. E. and A. E. Raftery (1995). Bayes factors. Journal of the American Statistical Association 90, 773–95. Kaufmann, S. and S. Fr¨uhwirth-Schnatter (2002). Bayesian analysis of switching ARCH models. Journal of Time Series Analysis 23, 425–58. Kim, S., N. Shephard and S. Chib (1998). Stochastic volatility: Likelihood inference and comparison with ARCH models. Review of Economic Studies 65, 361–93.
C The Author(s). Journal compilation C Royal Economic Society 2008.
Bayesian Markov-switching threshold asymmetric GARCH
123
Klaassen, F. (2002). Improving GARCH volatility forecasts with regime-switching GARCH. Empirical Economics 27, 363–94. Lamoureux, C. G. and W. D. Lastrapes (1990). Persistence in variance, structural change, and the GARCH model. Journal of Business and Economic Statistics 8, 225–43. Marcucci, J. (2005). Forecasting stock market volatility with regime-switching GARCH models. Studies in Nonlinear Dynamics and Econometrics 9, 1–53. Meng, X.-L. and W. H. Wong (1996). Simulating ratios of normalizing constants via a simple identity: A theoretical exploration. Statistica Sinica 6, 831–60. Nakatsuma, T. (2000). Bayesian analysis of ARMA-GARCH models: a Markov chain sampling approach. Journal of Econometrics 95, 57–69. Perez-Quiros, G. and A. Timmermann (2001). Business cycle asymmetries in stock returns: evidence from higher order moments and conditional densities. Journal of Econometrics 103, 259–306. Pesaran, M. H., D. Pettenuzzo and A. Timmermann (2006). Forecasting time series subject to multiple structural breaks. Review of Economic Studies 73, 1057–84. R Development Core Team (2007). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. Rosenblatt, M. (1952). Remarks on a multivariate transformation. Annals of Mathematical Statistics 23, 470–72. Sentana, E. (1995). Quadratic ARCH models. The Review of Economic Studies 62, 639–61. Spiegelhalter, D. J., N. G. Best, B. P. Carlin and A. van der Linde (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society, Series B 64, 583–639.
APPENDIX: SIMULATING THE JOINT POSTERIOR A.1. Generating α, β and τ The methodology used to draw α, β and τ can be viewed as a multivariate extension of the approach . . proposed by Nakatsuma (2000). Let us consider the vector w t = y 2t ι K /η t − h t where we define η t = t . 2 and ι K is a K × 1 vector of ones. In order to simplify the notations further, we define v t = y t /η t which yields w t = v t ι K − h t . From there, we can transform the expression (2.2) as follows: 2 + α 2 I{yt−1 < τ }(τ − yt−1 )2 + βht−1 ht = α 0 + α 1 yt−1 2 ⇔ (vt ιK − wt ) = α 0 + α 1 yt−1 + α 2 I{yt−1 < τ }(τ − yt−1 )2
+ β(vt−1 ιK − wt−1 ) 2 ⇔ wt = vt ιK − α 0 − α 1 yt−1 − α 2 I{yt−1 < τ }(τ − yt−1 )2
− βvt−1 ιK + βwt−1 . . . Moreover, let us define w t = et w t , h t = et h t and note that w t can be written as follows: yt2 . − 1 ht = χ12 − 1 ht , wt = et wt = vt − ht = t ht where χ 21 denotes a Chi-squared variable with one degree of freedom. This comes from the fact that the conditional distribution of y t is Normal with zero mean and variance t h t . Therefore, the conditional mean of w t is zero and the conditional variance is 2h2t . As this is done for the single-regime GARCH model (Nakatsuma, 2000), this variable can be approximated by z t , a Normal variable with a mean of zero and a . variance of 2h2t . The variable z t can be further expressed as z t = et z t where z t is a function of α, β and τ C The Author(s). Journal compilation C Royal Economic Society 2008.
124
D. Ardia
given by: 2 − α 2 I{yt−1 < τ }(τ − yt−1 )2 z t (α, β, τ ) = vt ιK − α 0 − α 1 yt−1
− βvt−1 ιK + βz t−1 (α, β, τ ) .
(A.1)
. . . From there, we construct the vector z = (z 1 · · · z T ) where z t = et z t as well as the diagonal matrix =
(α, β, τ ) = diag {2et h2t (α, β, τ )}Tt=1 and express the approximate likelihood function of (α, β, τ ) as follows: 1 (A.2) L(α, β, τ |, ν, s, y) ∝ (det )−1/2 exp − z −1 z . 2 The construction of the proposal densities for α, β and τ is based on this likelihood function.
A.1.1. Generating α. We first note that z t (•) in (A.1) can be expressed as a linear function of α. To show this, we note that the kth component of the vector z t can be written as follows: ⎛ k⎞ α0 ⎜ ⎟ [z t ]k = vt − lt∗ (β k ) vt∗ (β k ) vt∗∗ (β k , τ k ) ⎝ α1k ⎠ , α2k . . . with l ∗t (β k ) = 1 + β k l ∗t−1 (β k ), v ∗t (β k ) = y 2t−1 + β k v ∗t−1 (β k ) and vt∗∗ (β k , τ k ) = (τ k − yt−1 )2 I{yt−1 < τ k } + ∗ ∗ ∗∗ k ∗∗ k k β vt−1 (β , τ ) where l 0 , v 0 , v 0 are set to zero. Let us now regroup the recursive values into a K × 3K matrix C t as follows: . Ct = ⎛ ∗ 1 ⎞ lt (β ) 0 · · · 0 vt∗ (β 1 ) 0 · · · 0 vt∗∗ (β 1 , τ 1 ) 0 ··· 0 ⎜ ⎟ ⎜ ⎟ .. .. .. ∗ 2 ∗∗ 2 2 ⎜ 0 l ∗ (β 2 ) 0 ⎟ (β ) 0 (β , τ ) 0 . 0 v . 0 v . t t t ⎜ ⎟ ⎜ ⎟. ⎜ . ⎟ . . . . . . . . . . ⎜ .. ⎟ . . . 0 0 0 0 . 0 . 0 ⎝ ⎠ 0
···
0 lt∗ (β K )
0
···
0 vt∗ (β K )
0
···
0 vt∗∗ (β K , τ K )
. It is straightforward to show that z t = v t ι K − C t α and since z t = et z t we get z t = v t − et C t α. Then, by . . defining the vectors z = (z 1 · · · z T ) and v = (v 1 · · · v T ) as well as the T × 3K matrix C whose tth row is et C t , we end up with z = v − Cα which is the desired linear expression for z. The proposal density to sample α is obtained by combining the approximate likelihood (A.2) and the prior density by Bayes’ update: α ) I{α 0} , α , β, τ, , ν, s, y) ∝ N3K (α| μα , qα (α| with . −1 α−1 = C C + α−1 . −1 −1 μα = α C v + α μα , . = where diag({2et h2t ( α , β, τ )}Tt=1 ), the value α being the previous draw of α in the Metropolis–Hasting (M–H) sampler. A candidate α is sampled from this proposal density and accepted with probability: α |α , β, τ , , ν, s, y) p(α , β, τ , , ν, s, P | y) qα ( , 1 . min p( α , β, τ , , ν, s, P | y) qα (α | α , β, τ , , ν, s, y)
C The Author(s). Journal compilation C Royal Economic Society 2008.
Bayesian Markov-switching threshold asymmetric GARCH
125
A.1.2. Generating β. The function z t (•) in (A.1) could be expressed as a linear function of α but cannot be expressed as a linear function of β. To overcome this problem, we linearize the vector z t (β) by a first order Taylor expansion at point β: dz t β) + × (β − β) , z t (β) z t ( dβ β=β where β is the previous draw of β in the M–H sampler. Furthermore, let us define the following: dz t . . , r t = z t (β) + Gt β ; Gt = − dβ β=β
(A.3)
. where the K × K matrix G t can be computed by the recursion Gt = vt−1 IK − Zt−1 + Gt−1 β where Z t−1 is a K × K diagonal matrix with z t−1 ( β) in its diagonal, I K is a K × K identity matrix and G 0 is a K × K matrix of zeros. From (A.3) we get z t r t − G t β and the approximation for z t is obtained as z t r t . . − et G t β where r t = et r t . Let us now define the vector r = (r 1 · · · r T ) as well as the T × K matrix G whose tth row is et G t . It turns out that z r − Gβ thus we can approximate the exponential in (A.2). The proposal density to sample β is obtained by combining this approximation with the prior density by Bayes’ update: β ) I{β 0} , β, τ , , ν, s, y) ∝ NK (β| μβ , qβ (β|α, with . −1 β−1 = G G + β−1 . −1 −1 μβ = β G r + β μβ , . = β, τ )}Tt=1 ). A candidate β is sampled from this proposal density and accepted where diag({2et h2t (α, with probability: p(α, β , τ , , ν, s| y) qβ ( β|α, β , τ , , ν, s, y) ,1 . min p(α, β, τ , , ν, s | y) qβ (β |α, β, τ , , ν, s, y)
A.1.3. Generating τ . As for β, we linearize the vector z t (τ ) by a first order Taylor expansion . τ ) + Gt τ where the at point τ , the previous draw of τ in the M–H sampler. In this case, r t = z t ( . τ }( τ − yt−1 )] − Gt−1 τ where K × K matrix G t is computed by the recursion Gt = 2IK [α 2 I{yt−1 < I K is a K × K identity matrix and G 0 is a K × K matrix of zeros. It turns out that z r − Gτ , thus we can approximate the exponential in (A.2). The proposal density to sample τ is obtained by combining this approximation with the prior density by Bayes’ update: τ ) I{τmin τ 0} , τ , , ν, s, y) ∝ NK (τ | μτ , qτ (τ |α, β, with . −1 τ−1 = G G + τ−1 . −1 μτ = τ G r + τ−1 μτ , . = τ )}Tt=1 ). A candidate τ is sampled from this proposal density and accepted where diag({2et h2t (α, β, with probability: τ |α, β, τ , , ν, s, y) p(α, β, τ , , ν, s| y) qτ ( min , 1 . p(α, β, τ , , ν, s| y) qτ (τ |α, β, τ , , ν, s, y)
C The Author(s). Journal compilation C Royal Economic Society 2008.
126
D. Ardia
A.2. Generating The components of are independent a posteriori and the full conditional posterior of t is obtained as follows: bt − (ν+3) , p(t |α, β, τ , ν, s, y) ∝ t 2 exp − t . which is the kernel of an Inverted Gamma density with parameters (ν + 1)/2 and b t = (y 2t / . (et h t (α, β, τ )) + ν)/2 where = (ν − 2)/ν.
A.3. Generating ν Draws from p(ν| ) are made by optimized rejection sampling from a translated Exponential source density. The result in Deschamps (2006) can be used without modifications.
A.4. Generating s and P The results in Chib (1996) can be used without modifications.
A.5. Computational details The MCMC scheme is implemented in R (R Development Core Team, 2007), version 2.6.1, with some subroutines written in C. The estimation of the MSTGJR model takes about 15 minutes on a Genuine Intel R CPU T2400 1.83 Mhz processor. Moreover, the validity of the algorithm as well as the correctness of the . computer code are verified using the following methodology. We sample = (α, β, τ , ν, P , , s) from a proper joint prior and generate some passes of the M–H algorithm; at each pass, we simulate the dependent variable y from the full conditional p( y| ) which is given by the conditional likelihood. This way, we draw a sample from the joint density p( y, ). If the algorithm is correct, the resulting replications of should reproduce the prior. The Kolmogorov-Smirnov empirical density test does not reject this hypothesis at the 1% level.
C The Author(s). Journal compilation C Royal Economic Society 2008.
The
Econometrics Journal Econometrics Journal (2009), volume 12, pp. 127–146. doi: 10.1111/j.1368-423X.2008.00276.x
Causality and forecasting in temporally aggregated multivariate GARCH processes C HRISTIAN M. H AFNER † †
Institut de statistique and CORE, Universit´e catholique de Louvain, Voie du Roman Pays 20, B-1348 Louvain-la-Neuve, Belgium E-mail:
[email protected] First version received: October 2007; final version accepted: November 2008
Summary This paper discusses the effects of temporal aggregation on causality and forecasting in multivariate GARCH processes. It is shown that spurious instantaneous causality in variance will only appear in degenerated cases, but that spurious Granger causality will be more common. For forecasting volatility, it is generally advisable to aggregate forecasts of the disaggregate series rather than forecasting the aggregated series directly, and unlike for vector autoregressive moving average (VARMA) processes, the advantage does not diminish for large forecast horizons. Results are derived for the distribution of multivariate realized volatility if the high-frequency process follows multivariate GARCH. A numerical example illustrates some of the results. Keywords: Causality in variance, Multivariate GARCH, Realized volatility, Temporal aggregation, Volatility forecasts.
1. INTRODUCTION The consequences of temporal aggregation of time series for empirically relevant problems such as causality and forecasting have been of interest in econometrics. Most of available results are for linear ARMA models, e.g. L¨utkepohl (1987) and Marcellino (1999). Few results are available for volatility or multivariate GARCH models. Recently, results for temporal aggregation of volatility models have been derived by Meddahi and Renault (2004) and Hafner (2008). This paper investigates the effects of temporal aggregation on some properties of multivariate GARCH processes such as Granger causality, forecasting volatility and realized volatility. These issues are important in practice when dealing with, e.g. portfolio allocation, the problem of volatility spillover between markets or the optimal forecasting frequency. We first discuss some issues related to causality in volatility. In VARMA processes, Breitung and Swanson (2002) investigate the phenomenon of ‘spurious instantaneous causality’, that is, instantaneous causality of the low frequency process, which is solely induced by temporal aggregation without any causal relationship at the high frequency. For multivariate GARCH processes, we show that such misleading causality can be ruled out whenever there is a non-zero conditional correlation between the series, or if the dimension is not larger than two. Spurious Granger causality, i.e. uni- or bi-directional causality, is of more practical relevance since if the C The Author(s). Journal compilation C Royal Economic Society 2009. Published by Blackwell Publishing Ltd, 9600 Garsington Road,
Oxford OX4 2DQ, UK and 350 Main Street, Malden, MA, 02148, USA.
128
C. M. Hafner
parameter matrices of the high-frequency process are diagonal (i.e. no Granger causality), those of the low frequency will in general not be diagonal. However, as measures for causality suggest, this spurious Granger causality is typically much smaller than the instantaneous causality. All Granger causality in volatility disappears as the series is more and more aggregated. Moreover, the normalized series converges to a multivariate Gaussian white noise series with increasing aggregation level. For the prediction of volatility, it is no surprise that the method that predicts the disaggregate process and then aggregates the forecasts has a smaller mean square prediction error than the method that directly predicts the aggregated series. In the VARMA framework, this has been demonstrated, e.g. by L¨utkepohl (1987). However, whereas in VARMA models the two methods become identical when the prediction horizon increases, this is not the case for multivariate GARCH processes. The reason is an additional noise term in the aggregated series, which is absent in the aggregation of VARMA processes. Finally, we try to build a link to the increasing literature on so-called realized volatilities, that is, aggregation of the high-frequency (typically intraday) second-order process to obtain a measure rather than a model for the low-frequency volatility, see, e.g. Andersen et al. (2003). Based on results of Breitung and Swanson (2002), it can be shown that if the high frequency process follows multivariate GARCH, then the multivariate realized volatility process for finite but large aggregations can be approximated by a vector moving average (VMA) (1) process. Throughout the paper, we assume that the original and aggregate timescales are regularly spaced. Other types of aggregation have been considered recently, where either the original or the aggregate timescale, or both, are irregularly spaced (see, e.g. Jorda and Marcellino, 2004). This is of interest, for example, when dealing with financial transaction data, where the durations between transactions are typically random. In our more classical case, we consider a typical aggregation to be fromed daily to weekly or monthly data, with regularly spaced time intervals on both scales. The paper is organized as follows. Section 2 summarizes the main results of temporal aggregation of multivariate GARCH(1,1) processes as developed by Hafner (2008) and partially extends them to models of higher order. Section 3 discusses the causality in volatility and Section 4 the prediction of volatility. Section 5 derives results for realized volatility. Finally, Section 6 concludes. Throughout the paper, a numerical example is used to illustrate the results. Proofs of the theorems are given in the Appendix.
2. TEMPORAL AGGREGATION OF MULTIVARIATE GARCH PROCESSES Let ε t denote a stochastic vector process with K components and E[εt | Ft−1 ] = 0, where Ft = σ (εs , −∞ < s ≤ t). We consider multivariate GARCH processes, defined by a positive definite and symmetric matrix H t with vech(H t ) = h t , ht = ω +
q i=1
Ai ηt−i +
p
Bj ht−j ,
(2.1)
j =1
where ω = vech(), η t = vech(ε t εt ) and N × N parameter matrices , A i , B j , with N = K(K + 1)/2. The operator vech(·) stacks the lower triangular part including the diagonal of a symmetric matrix into a column vector. Extending the definitions of strong, semi-strong and weak GARCH models of Drost and Nijman (1993) to the multivariate case, we say that ε t is a C The Author(s). Journal compilation C Royal Economic Society 2009.
129
Temporally aggregated multivariate GARCH −1/2
strong multivariate GARCH(p, q) process, if ξ t = H t ε t is an iid process with mean zero and variance the identity matrix, i.e. all temporal dependence comes from the conditional covariance matrix H t . It is said to be semi-strong multivariate GARCH(p, q) if Var(εt | Ft−1 ) = Ht but −1/2 ξ t = H t ε t is not necessarily iid, and there could be temporal dependence in third or higher moments of ξ t . Finally, ε t is called weak multivariate GARCH(p, q) if h t is the best linear predictor of η t in terms of a constant and lagged values of η t , that is ht = P (ηt | Ht−1 ) = P (ηt,1 | Ht−1 ), . . . , P (ηt,N | Ht−1 ) , where Ht = sp{1, ηt−τ,1 , . . . , ηt−τ,N , τ ≥ 0} denotes the infinite dimensional Hilbert space spanned by all linear combinations of a constant and η t−τ,1 , . . . , η t−τ,N . Strong multivariate GARCH processes allow for a VARMA representation of the process η t ,
η_t = ω + ∑_{i=1}^{max(p,q)} Q_i η_{t−i} − ∑_{j=1}^{p} B_j u_{t−j} + u_t,    (2.2)

where Q_i = A_i + B_i, u_t = η_t − h_t, and where we set A_{q+1} = · · · = A_p = 0 if p > q and B_{p+1} = · · · = B_q = 0 if q > p. If ε_t is strong or semi-strong GARCH, then u_t is a martingale difference sequence, but if ε_t is weak GARCH, then u_t is only weak white noise. We assume that ε_t has finite fourth moments and define the following matrices: Σ = E[ε_t ε_t′], Σ_η = E[η_t η_t′], Σ_h = E[h_t h_t′], Σ_u = E[u_t u_t′], assuming that they have full rank. If ε_t is semi-strong multivariate GARCH, then Σ_u = Σ_η − Σ_h. This follows directly by writing out the expectations and applying the law of iterated expectations. In semi-strong and strong GARCH(p, q) processes, Σ exists if and only if ε_t is covariance stationary. This is the case if and only if all eigenvalues of the matrix ∑_{i=1}^{max(p,q)} Q_i have moduli smaller than one, see Engle and Kroner (1995). The unconditional covariance matrix Σ = Var(ε_t) would then be given by

σ = vech(Σ) = (I_N − ∑_{i=1}^{max(p,q)} Q_i)^{−1} ω,    (2.3)
where the (N × 1) vector σ contains the K unconditional variances and the K(K − 1)/2 unconditional covariances of ε_t. The temporal aggregation results of Hafner (2008) are derived under the following moment condition,

E[vec(ε_t ε_{t−i}′) vec(ε_t ε_{t−j}′)′] = 0,   ∀ i, j ≥ 0, i ≠ j,    (2.4)

which can be shown to be satisfied if the high-frequency process is strong GARCH with spherical innovations. A weaker sufficient condition for (2.4) to hold is that all conditional skewness and co-skewness coefficients of ε_t are zero, i.e. E[η_t ε_t′ | F_{t−1}] = 0, and that the conditional variance of ε_t is conditionally uncorrelated with all lagged ε_t, E[η_t ε_{t−i}′ | F_{t−i−1}] = 0, ∀ i ≥ 1. A similar condition has been imposed by Drost and Nijman (1993) in the univariate case. To derive the autocovariance structure of η_t, it is convenient to work with the pure vector moving average (VMA(∞)) representation of η_t. The assumed covariance stationarity of ε_t and existence of Σ_u imply covariance stationarity of η_t, which in turn implies existence of the VMA(∞) representation of η_t. From the VARMA representation (2.2), we obtain

η_t = σ + ∑_{i=0}^{∞} Φ_i u_{t−i},    (2.5)
where the N × N matrices Φ_i can be determined recursively, starting from Φ_0 = I_N, by

Φ_i = −B_i + ∑_{j=1}^{i} Q_j Φ_{i−j},   i = 1, 2, . . . ,    (2.6)

and the infinite sum in (2.5) is well defined as a limit in mean square (see Lütkepohl, 1993, p. 220). From (2.5), we see directly that E[η_t] = σ and Var(η_t) = ∑_{i=0}^{∞} Φ_i Σ_u Φ_i′, whereas the autocovariance matrix is given by

Γ(τ) = E[(η_t − σ)(η_{t−τ} − σ)′] = ∑_{i=0}^{∞} Φ_{τ+i} Σ_u Φ_i′,    (2.7)
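The recursion (2.6) and the autocovariance formula (2.7) are straightforward to evaluate numerically once Σ_u is available (e.g. from the fourth-moment results in Hafner, 2003). The following sketch is ours, not part of the original paper; function and variable names are illustrative and the infinite sums are truncated.

```python
import numpy as np

def vma_coefficients(A_mats, B_mats, n_terms):
    """Truncated VMA(infinity) coefficients of eta_t, equation (2.6):
    Phi_0 = I_N,  Phi_i = -B_i + sum_{j=1}^{i} Q_j Phi_{i-j},  Q_j = A_j + B_j,
    where A_j and B_j are taken as zero beyond the model orders."""
    N = A_mats[0].shape[0]
    zero = np.zeros((N, N))
    def A(j): return A_mats[j - 1] if j <= len(A_mats) else zero
    def B(j): return B_mats[j - 1] if j <= len(B_mats) else zero
    Phi = [np.eye(N)]
    for i in range(1, n_terms):
        Phi_i = -B(i)
        for j in range(1, i + 1):
            Phi_i = Phi_i + (A(j) + B(j)) @ Phi[i - j]
        Phi.append(Phi_i)
    return Phi

def autocovariance(Phi, Sigma_u, tau):
    """Gamma(tau) = sum_{i>=0} Phi_{tau+i} Sigma_u Phi_i', equation (2.7),
    truncated at the number of Phi matrices supplied."""
    return sum(Phi[tau + i] @ Sigma_u @ Phi[i].T for i in range(len(Phi) - tau))
```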
which is finite if η_t is covariance stationary and Σ_u exists. Conditions for existence of Σ_η are given by Hafner (2003), where, also, explicit expressions for Σ_η and Σ_u are provided for the case of spherical innovations. We now summarize the main results of Hafner (2008) for the case of multivariate GARCH(1,1) models and aggregation of flow variables. The best linear predictor of η_t in terms of a constant and η_{t−1}, η_{t−2}, . . . , is given by h_t = P(η_t | H_{t−1}) = ω + Aη_{t−1} + Bh_{t−1}. Recall from (2.2) that η_t has the VARMA(1,1) representation

η_t = ω + Qη_{t−1} − Bu_{t−1} + u_t,    (2.8)

where Q = A + B and u_t = η_t − h_t. Denote the process ε_t that is aggregated over m periods by {ε_{mt}^{(m)}, t ∈ Z}, which for the case of flow variables is given by

ε_{mt}^{(m)} = ε_{mt} + ε_{mt−1} + · · · + ε_{mt−m+1}.
Since ε_t is a white noise process, it follows immediately that the unconditional variance matrix of the aggregated process ε_{mt}^{(m)} is Σ in the case of stock variables and mΣ in the case of flow variables, where vech(Σ) is given in (2.3). This implies that in both cases the unconditional correlation matrix remains unchanged under temporal aggregation. Now denote by η_{mt}^{(m)} = vech(ε_{mt}^{(m)} ε_{mt}^{(m)′}) the vector process that contains the squares and cross-products of the aggregated process ε_{mt}^{(m)}. Since for arbitrary vectors a and b of dimension K, vech(ab′) + vech(ba′) = 2D_K^+ vec(ab′), where D_K^+ = (D_K′ D_K)^{−1} D_K′ is the generalized inverse of the duplication matrix D_K, we have

η_{mt}^{(m)} = η_{mt} + η_{mt−1} + · · · + η_{mt−m+1} + w_{mt}^{(m)},    (2.9)

where, using the lag operator L^k x_t = x_{t−k}, we get

w_{mt}^{(m)} = 2D_K^+ {∑_{i=0}^{m−2} L^i vec(ε_{mt} ε_{mt−1}′) + ∑_{i=0}^{m−3} L^i vec(ε_{mt} ε_{mt−2}′) + · · · + vec(ε_{mt} ε_{mt−m+1}′)}.    (2.10)

For example, if m = 2 then w_{2t}^{(2)} = 2D_K^+ vec(ε_{2t} ε_{2t−1}′). Each term of w_{mt}^{(m)} has expectation zero and is uncorrelated with every other term. Thus, it acts as a noise term that is added to the sum of the
high-frequency second-order process η_t. It turns out that this noise complicates the analysis of temporal aggregation when compared with VARMA processes, where this term is missing. See, however, Section 5 for the approach of realized volatility that suppresses this term and thus aims at aggregating not the returns but rather volatility directly. For later reference, we can calculate the variance matrix of w_{mt}^{(m)}, Σ_w^{(m)} say, as

vec(Σ_w^{(m)}) = 4G_K ∑_{i=1}^{m−1} (m − i) vec(Γ(i) + σσ′),    (2.11)

where G_K = (D_K^+ ⊗ D_K^+)(I_K ⊗ C_{KK} ⊗ I_K)(D_K ⊗ D_K), with C_{mn} denoting the commutation matrix. The aggregated process η_{mt}^{(m)} has the following VARMA representation,

(I_N − Q^m L) η_{mt}^{(m)} = ω^{(m)} + v_{mt}^{(m)},    (2.12)

where

ω^{(m)} = m(I_N + Q + · · · + Q^{m−1}) ω,    (2.13)
and v_{mt}^{(m)} is a vector moving average process of order one, that is, it has expectation zero, finite covariance matrix Σ_v^{(m)} = E[v_{mt}^{(m)} v_{mt}^{(m)′}], first-order autocovariance matrix Γ_v^{(m)} = E[v_{mt}^{(m)} v_{m(t−1)}^{(m)′}] and higher order autocovariances equal to zero. By convention, the lag operator in (2.12) that operates on an aggregated process lags it on the low-frequency scale, that is, Lη_{mt}^{(m)} = η_{m(t−1)}^{(m)}. The coefficient matrix of the autoregressive part is given by Q^m. All eigenvalues of Q have modulus smaller than one, so that Q^m converges to the zero matrix exponentially fast. The moving average part takes the form

v_{mt}^{(m)} = ∑_{i=0}^{2m−1} J_i u_{mt−i} + w_{mt}^{(m)} − Q^m w_{m(t−1)}^{(m)}.    (2.14)

The J_i matrices are determined as follows:

J_0 = I_N,
J_i = I_N + A + QA + · · · + Q^{i−1}A,   i = 1, . . . , m − 1,
J_m = (I_N + Q + · · · + Q^{m−2})A − Q^{m−1}B,
J_i = (Q^{i−m} + Q^{i−m+1} + · · · + Q^{m−2})A − Q^{m−1}B,   i = m + 1, . . . , 2m − 2,
J_{2m−1} = −Q^{m−1}B.

From equation (2.14), we obtain the variance and first-order autocovariance of v_{mt}^{(m)} as

Σ_v^{(m)} = ∑_{i=0}^{2m−1} J_i Σ_u J_i′ + Σ_w^{(m)} + Q^m Σ_w^{(m)} (Q′)^m,    (2.15)

Γ_v^{(m)} = ∑_{i=0}^{m−1} J_{i+m} Σ_u J_i′ − Q^m Σ_w^{(m)},    (2.16)

where Σ_w^{(m)} is the variance matrix of w_{mt}^{(m)} given in (2.11).
By Theorem 1 of Hafner (2008), the class of weak multivariate GARCH(1,1) processes is closed under temporal aggregation. This means that for the aggregated process ε_{mt}^{(m)}, E[ε_{mt}^{(m)} | F_{m(t−1)}^{(m)}] = 0, where F_{mt}^{(m)} = σ(ε_{ms}^{(m)}, −∞ < s ≤ t). Moreover, h_{mt}^{(m)} = P(η_{mt}^{(m)} | H_{m(t−1)}^{(m)}), with H_{mt}^{(m)} = sp(1, η_{m(t−τ),1}^{(m)}, . . . , η_{m(t−τ),N}^{(m)}, τ ≥ 0), and where

h_{mt}^{(m)} = ω^{(m)} + A^{(m)} η_{m(t−1)}^{(m)} + B^{(m)} h_{m(t−1)}^{(m)},    (2.17)

where ω^{(m)} is given by (2.13) and B^{(m)} is given by the solution to the system of quadratic equations

B^{(m)} Γ_v^{(m)} B^{(m)′} + B^{(m)} Σ_v^{(m)} + Γ_v^{(m)} = 0,    (2.18)

such that all eigenvalues of B^{(m)} are smaller than one in modulus, where the matrices Σ_v^{(m)} and Γ_v^{(m)} are given by (2.15) and (2.16). The matrix A^{(m)} in (2.17) is given by

A^{(m)} = Q^m − B^{(m)},    (2.19)

and the projection error {u_{mt}^{(m)}, t ∈ Z}, u_{mt}^{(m)} = η_{mt}^{(m)} − h_{mt}^{(m)}, is a weak white-noise vector process with covariance matrix Σ_u^{(m)} with

vec(Σ_u^{(m)}) = (I_{N²} + B^{(m)} ⊗ B^{(m)})^{−1} vec(Σ_v^{(m)}).    (2.20)

Introducing the notation Q^{(m)} = A^{(m)} + B^{(m)}, it follows from (2.19) that Q^{(m)} = Q^m. Furthermore, the aggregated process η_{mt}^{(m)} follows a weak VARMA(1,1) process that can be written as

η_{mt}^{(m)} = ω^{(m)} + Q^{(m)} η_{m(t−1)}^{(m)} − B^{(m)} u_{m(t−1)}^{(m)} + u_{mt}^{(m)}.    (2.21)

From the weak VARMA representation (2.21), one obtains the weak VMA(∞) representation

η_{mt}^{(m)} = σ^{(m)} + ∑_{i=0}^{∞} Φ_i^{(m)} u_{m(t−i)}^{(m)},    (2.22)

where σ^{(m)} = (I_N − Q^{(m)})^{−1} ω^{(m)}, and where the N × N matrices Φ_i^{(m)} are given by Φ_0^{(m)} = I_N and

Φ_i^{(m)} = (Q^{(m)})^{i−1} A^{(m)},   i = 1, 2, . . . .    (2.23)
We end this section with a new result on temporal aggregation of higher order multivariate GARCH models.

THEOREM 2.1. The class of weak finite-order multivariate GARCH(p, q) models is closed under temporal aggregation. The orders p* and q* of the aggregated model satisfy p* ≤ N(r − 1) + 1 and q* ≤ N(r − 1) + 1, respectively, where r = max(p, q).

Theorem 2.1 states that higher order weak multivariate GARCH models remain in the same class but with possibly changing orders p* and q*. For example, the upper bound for the order of an aggregated bivariate GARCH(2,2) model is (4,4). Note that if p = q = 1, then also p* = q* = 1, as in Theorem 1 of Hafner (2008).
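To see how (2.13) and (2.17)–(2.20) fit together computationally, the following sketch (ours, not the paper's) maps the high-frequency GARCH(1,1) parameters into the implied low-frequency parameters. The matrices Σ_v^{(m)} and Γ_v^{(m)} are assumed to have been computed beforehand from (2.15) and (2.16); the quadratic matrix equation (2.18) is solved here by a simple fixed-point iteration, one of several possible numerical schemes.

```python
import numpy as np

def aggregate_garch11(omega, A, B, m, Sigma_v, Gamma_v, max_iter=1000, tol=1e-12):
    """Low-frequency weak GARCH(1,1) parameters implied by (2.13), (2.17)-(2.20).
    Sigma_v, Gamma_v: variance and first-order autocovariance of v_mt^(m),
    assumed precomputed from (2.15)-(2.16)."""
    N = len(omega)
    Q = A + B
    Qm = np.linalg.matrix_power(Q, m)
    # (2.13): omega^(m) = m (I_N + Q + ... + Q^{m-1}) omega
    omega_m = m * sum(np.linalg.matrix_power(Q, i) for i in range(m)) @ omega
    # (2.18): B^(m) Gamma_v B^(m)' + B^(m) Sigma_v + Gamma_v = 0, iterated via
    # the fixed-point map B -> -Gamma_v (Gamma_v B' + Sigma_v)^{-1}
    B_m = np.zeros((N, N))
    for _ in range(max_iter):
        B_new = -Gamma_v @ np.linalg.inv(Gamma_v @ B_m.T + Sigma_v)
        if np.max(np.abs(B_new - B_m)) < tol:
            B_m = B_new
            break
        B_m = B_new
    A_m = Qm - B_m                                              # (2.19)
    # (2.20): vec(Sigma_u^(m)) = (I_{N^2} + B^(m) kron B^(m))^{-1} vec(Sigma_v^(m))
    vec_Su = np.linalg.solve(np.eye(N * N) + np.kron(B_m, B_m),
                             Sigma_v.flatten(order="F"))
    Sigma_u_m = vec_Su.reshape(N, N, order="F")
    return omega_m, A_m, B_m, Sigma_u_m
```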
3. CAUSALITY There is a substantial literature on the effects of temporal aggregation for causality between time series, see, e.g. Marcellino (1999) for a recent overview and references. The general difficulty in empirical work is that only data of the temporally aggregated series is available, for which one typically observes contemporaneous correlation between the series. The question for the investigator is whether this correlation stems from a true causal relation of the high-frequency series or whether it is a mere artefact of temporal aggregation. We will address this issue here in the volatility context and show that, again, there are important differences from the VARMA case. As is common in econometrics, we use the term causality in the sense of ‘Granger causality’, which for volatility has been defined by Granger et al. (1984). However, there are at least three alternative versions of Granger causality, one based on the entire distribution of a variable to be forecast, another on the conditional expecation and yet another on optimal linear forecasts. Knowing from Section 2 that temporally aggregated multivariate GARCH processes are only weak multivariate GARCH, we have to be careful in defining causality in variance, because notions based on conditional expectations or conditional variances become difficult to check for the aggregated series. Rather, one has to weaken the concept and use the notion of best linear predictors, but this stands in the tradition of, for example, Boudjellaba et al. (1992) and Comte and Lieberman (2000). Also, we use the term ‘Granger causality’ for the case of a causal lag greater than zero (sometimes this is also called ‘directional causality’), whereas we use ‘instantaneous causality’ for the causal lag being actually zero. Comte and Lieberman (2000) distinguish between causality in variance and second-order causality, where the difference occurs in the way the caused series is demeaned. As in our setting the series ε t has conditional mean zero, we do not need to make this distinction here. Furthermore, Comte and Lieberman (2000) provide conditions on the parameters for absence of the alternative forms of causality. We are mainly interested in measures for causality in variance and how they behave under temporal aggregation, an issue which has not been addressed by Comte and Lieberman (2000). In the following, we define the alternative concepts of causality in variance. The only difference with respect to Comte and Lieberman (2000) is that we allow for more than two subgroups of the vector process ε t , and for a causal lag h that can be larger than one. Comte and Lieberman (2000) restrict h to be equal one, which in their case of only two subgroups is without loss of generality, due to the results of Dufour and Renault (1998). On the other hand, to simplify the presentation, we assume that the first two subgroups are scalar, whereas in Comte and Lieberman (2000) they can be vectors. Thus, we are interested in the impact of one scalar variable on the conditional variance (or best linear predictor of the square) of another scalar variable at some arbitrary lag, allowing for the presence of other variables. Suppose we are interested in the causality in variance between the first two elements of ε t , ε t,1 and ε t,2 . Let us introduce the following notation. Denote the σ -algebra generated by ε s,i , s ≤ t, i = 1, 3, 4, . . . , K by Ft(−2) . Moreover, denote by Ht the set of all linear combinations of a constant and ε s,i ε s,j , s ≤ t, i, j = 1, . . 
. , K, by H_t^{(−2)} the set of all linear combinations of a constant and ε_{s,i} ε_{s,j}, s ≤ t, i, j = 1, 3, 4, . . . , K, and by H_t^{(+2)} the set of all linear combinations of a constant, ε_{s,i} ε_{s,j}, s ≤ t, i, j = 1, . . . , K and ε_{t+1,2} ε_{t+1,i}, i = 2, . . . , K.
DEFINITION 3.1.

(1) We say that ε_{t,2} Granger causes ε_{t,1} in variance (GCV), denoted by ε_{t,2} →^{GCV} ε_{t,1}, if for some h ≥ 1,

Var(ε_{t+h,1} | F_t) ≠ Var(ε_{t+h,1} | F_t^{(−2)}).    (3.1)

(2) There is said to be instantaneous causality in variance (ICV) between ε_{t,2} and ε_{t,1}, denoted by ε_{t,1} ↔^{ICV} ε_{t,2}, if

Var(ε_{t+1,1} | F_t) ≠ Var(ε_{t+1,1} | F_t ∨ σ(ε_{t+1,2})),    (3.2)

where F_t ∨ σ(ε_{t+1,2}) denotes the augmentation of F_t by the information contained in ε_{t+1,2}.

(3) We say that ε_{t,2} linearly Granger causes ε_{t,1} in variance (LGCV), denoted by ε_{t,2} →^{LGCV} ε_{t,1}, if for some h ≥ 1,

P(ε_{t+h,1}² | H_t) ≠ P(ε_{t+h,1}² | H_t^{(−2)}).    (3.3)

(4) There is said to be linear instantaneous causality in variance (LICV) between ε_{t,2} and ε_{t,1}, denoted by ε_{t,1} ↔^{LICV} ε_{t,2}, if

P(ε_{t+1,1}² | H_t) ≠ P(ε_{t+1,1}² | H_t^{(+2)}).    (3.4)
For weak multivariate GARCH processes, it is only possible to investigate linear causality since the conditional variances are not specified or not known. On the other hand, for semi-strong multivariate GARCH processes, it is well possible to investigate causality, but that would only be relevant for the high-frequency process. To obtain more intuition about the concepts, consider a semi-strong multivariate GARCH model of the form (2.1). The variance of ε_{t,1} conditional on the past is a linear function of lagged ε_{t,i} ε_{t,j}. There would be a direct impact of lagged ε_{t,2} if one of the coefficients corresponding to ε_{s,2}² or ε_{s,2} ε_{s,i} for some i and some s < t were different from zero. But even when these coefficients are zero, it is still possible to have causality chains going from the second to a third and back to the first variable at some lag larger than one. The definition of GCV thus requires that taking away ε_{s,2}, s ≤ t, from the information set F_t changes the conditional variance of ε_{t+h,1} for some h ≥ 1. For the case of only two subgroups, Comte and Lieberman (2000) and Hafner and Herwartz (2008) give sufficient (and necessary) conditions for absence of GCV (and LGCV). In temporally aggregated VARMA models, Breitung and Swanson (2002) have investigated the effect of so-called 'spurious instantaneous causality', as first investigated by Renault and Szafarz (1991) and Renault et al. (1998). This occurs if there is no causality between the disaggregated time series but instantaneous causality between the aggregated time series. We adapt this definition to the volatility case. If there is no causality in volatility (instantaneous or directional) between the series ε_{t,1} and ε_{t,2}, we denote this by ε_{t,1} ↮^{CV} ε_{t,2}, and correspondingly we write ε_{t,1} ↮^{LCV} ε_{t,2} if there is no linear causality in volatility (instantaneous or directional) between the series.
DEFINITION 3.2.

(1) There is said to be spurious ICV if ε_{t,1} ↮^{CV} ε_{t,2}, but ε_{mt,1}^{(m)} ↔^{ICV} ε_{mt,2}^{(m)} for some m ≥ 2 and some t ∈ Z.
(2) There is said to be spurious LICV if ε_{t,1} ↮^{LCV} ε_{t,2}, but ε_{mt,1}^{(m)} ↔^{LICV} ε_{mt,2}^{(m)} for some m ≥ 2 and some t ∈ Z.
It has sometimes been argued that spurious instantaneous causality can be problematic in empirical work, since if two aggregated time series are found to show instantaneous causality, it may be because there is causality between the disaggregated series or because it is induced by temporal aggregation. Breitung and Swanson (2002) give sufficient conditions to exclude spurious instantaneous causality in VARMA models. In the volatility case, the following theorem gives a necessary condition for spurious instantaneous causality.

THEOREM 3.1. If the high-frequency process follows strong multivariate GARCH(1,1) with Gaussian innovations, then a necessary condition for spurious LICV between (ε_{t,1}) and (ε_{t,2}) is

h_{t,2} = 0 and K ≥ 3,

for all t, where h_{t,2} is the second component of h_t, i.e. the conditional covariance of ε_{t,1} and ε_{t,2}.

In the following, let us be a bit more loose in terminology and only refer to GCV and ICV when it could also mean LGCV or LICV. Theorem 3.1 implies that in empirical work spurious ICV is of much less relevance than spurious instantaneous causality in the conditional mean, because the two series will in most cases show some non-zero conditional covariance, be it constant or not. Financial series such as stock returns, for example, tend to be positively correlated at high frequencies. So, ICV will be the rule rather than the exception if high-frequency financial series are investigated. Rather than ICV, it is far more interesting to see whether there is GCV. It turns out that there may be absence of GCV between the disaggregate series, but presence of GCV between the aggregated series. This might be called 'spurious Granger causality in volatility'. A sufficient condition for absence of GCV is that the parameter matrices A and B of the multivariate GARCH model are diagonal. Many empirical studies have shown that diagonal GARCH models may give good descriptions of the DGP at many frequencies. This can be due to the fact that even though there may be GCV induced by temporal aggregation, it is possibly much less important numerically than ICV. To see whether this is the case for a given multivariate GARCH model, we need measures for the alternative causalities, which we will look at in the following. Measures for the causality in variance have been considered by Hafner (2003), based on well-known measures for causality in VARMA models introduced by Geweke (1982). For simplicity, we only consider the bivariate case in the following, but extensions to causality measures conditional on other variables follow in analogy to Geweke (1984). Let x_t = ε_{t,1}² and y_t = ε_{t,2}². By the results of Nijman and Sentana (1996), the marginal process ε_{t,1} follows a weak univariate GARCH process, and therefore x_t has a weak ARMA(q*, p*) representation such as

x_t = ω^x + ∑_{i=1}^{q*} (α_i^x + β_i^x) x_{t−i} − ∑_{j=1}^{p*} β_j^x w_{t−j} + w_t,    (3.5)
where w_t = x_t − P(ε_{t,1}² | H_{t−1}^{(−2)}), and ω^x, α_i^x and β_j^x are parameters. Upper bounds for the AR and MA orders are given by q* ≤ 3 and p* ≤ 3, respectively, by Corollary 4.2.2 of Lütkepohl (1987) or Nijman and Sentana (1996). The process w_t is univariate weak white noise with variance σ_w², say. A measure for GCV from y_t to x_t is given by

GCV_{y→x} = log( σ_w² / [Σ_u]_{11} ).    (3.6)

By symmetry, one obtains a causality measure for the reverse causality direction, GCV_{x→y}. Summing up these unidirectional causality measures, we can define a measure for bidirectional causality as

GCV_{y↔x} = GCV_{y→x} + GCV_{x→y}.    (3.7)

A measure for ICV between x_t and y_t is given by

ICV_{x↔y} = log( Σ_{u,11} Σ_{u,33} / (Σ_{u,11} Σ_{u,33} − Σ_{u,13}²) ).    (3.8)

Finally, the measure for linear dependence between x_t and y_t is denoted by CV_{x,y}. This measure can be decomposed into the three causality measures:

CV_{x,y} = GCV_{x→y} + GCV_{y→x} + ICV_{x↔y} = GCV_{y↔x} + ICV_{x↔y}.    (3.9)
Now suppose one is mainly interested in the bidirectional GCV measure, GCV_{y↔x}, because, for example, one wants to see how important spurious GCV can become. For example, the hypothesis of a diagonal GARCH model amounts to testing whether this bidirectional measure is zero. For a given multivariate GARCH process, there is no obvious way to find the unidirectional measures GCV_{y→x} and GCV_{x→y}, other than via determining the univariate GARCH models for the marginal processes, which is straightforward but tedious, see Nijman and Sentana (1996). However, there is a simple way to find the bidirectional measure GCV_{y↔x}, as we will see immediately. The measure for linear dependence can be decomposed in the frequency domain as

CV_{x,y} = (1/2π) ∫_{−π}^{π} log[ f_{11}(λ) f_{33}(λ) / ( f_{11}(λ) f_{33}(λ) − |f_{13}(λ)|² ) ] dλ,

see, e.g. Geweke (1982), where f(λ) denotes the spectral density matrix of η_t = vech(ε_t ε_t′), which is given by

f(λ) = (1/2π) (∑_{j=0}^{∞} Φ_j e^{ijλ}) Σ_u (∑_{j=0}^{∞} Φ_j e^{ijλ})*,    (3.10)

where * denotes the conjugate transpose. The bidirectional measure GCV_{y↔x} can now easily be obtained as a residual of equation (3.9), i.e. by the difference between CV_{x,y} and ICV_{x↔y}. The advantage of this approach is that f(λ) and therefore the bidirectional measure can be calculated directly using the representation of the joint process ε_t. The alternative way of summing up the two unidirectional measures requires the determination of the marginal processes ε_{t,1} and ε_{t,2}, which is somewhat more involved, see Section 3 of Nijman and Sentana (1996). The above causality measures can now also be obtained for the aggregated series η_{mt}^{(m)} by replacing Σ_u in (3.10) and (3.8) by Σ_u^{(m)} given in (2.20) and replacing Φ_i in (3.10) by Φ_i^{(m)} given
in (2.23). This gives us a measure of bidirectional causality in volatility for the aggregated series, defined as

GCV_{y↔x}^{(m)} = CV_{x,y}^{(m)} − ICV_{x↔y}^{(m)}.    (3.11)

Since Φ_i^{(m)} → 0 as m → ∞ for all i ≥ 1, the spectral density matrix of the series m^{−1} η_{mt}^{(m)} converges to the limit of m^{−2} Σ_u^{(m)}, U say. For example, by the results of Section 2, this would be given by vec(U) = (cG_K − I_{N²}) vec(σσ′) under the assumption of spherical innovations. Thus, CV_{x,y}^{(m)} and ICV_{x↔y}^{(m)} converge to the same limit given by

lim_{m→∞} CV_{x,y}^{(m)} = lim_{m→∞} ICV_{x↔y}^{(m)} = log( U_{11} U_{33} / (U_{11} U_{33} − U_{13}²) ).

Using (3.11), this implies that lim_{m→∞} GCV_{y↔x}^{(m)} = 0, meaning that all directional Granger causality in variance disappears eventually as the series is aggregated. This is of course no surprise, as it corresponds to the aggregation results for VARMA processes. To illustrate the results we will use the following bivariate example process:

ε_t = H_t^{1/2} ξ_t,   ξ_t ∼ iid N(0, I_2),

vech(H_t) = h_t = (1, 0, 1)′ + A η_{t−1} + B h_{t−1},    (3.12)

with A = [0.16, 0.08, 0.01; 0, 0.12, 0.03; 0, 0, 0.09] (rows separated by semicolons) and B = diag(0.64, 0.72, 0.81).
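As an illustration of how the measures in (3.6)–(3.11) can be evaluated in practice, the following sketch (ours, not the paper's) sets up the example process (3.12), estimates Σ_u by simulation instead of using the closed-form fourth-moment expressions of Hafner (2003), and approximates the frequency-domain integral for CV_{x,y} on a grid; names, sample size and truncation choices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Example process (3.12): bivariate GARCH(1,1) written in vech form (N = 3)
omega = np.array([1.0, 0.0, 1.0])
A = np.array([[0.16, 0.08, 0.01],
              [0.00, 0.12, 0.03],
              [0.00, 0.00, 0.09]])
B = np.diag([0.64, 0.72, 0.81])
Q = A + B

def simulate(T, burn=1000):
    """Simulate (3.12); return eta_t = vech(eps_t eps_t') and h_t."""
    h = np.linalg.solve(np.eye(3) - Q, omega)          # unconditional value, cf. (2.3)
    etas, hs = [], []
    for t in range(T + burn):
        H = np.array([[h[0], h[1]], [h[1], h[2]]])
        eps = np.linalg.cholesky(H) @ rng.standard_normal(2)
        eta = np.array([eps[0] ** 2, eps[0] * eps[1], eps[1] ** 2])
        if t >= burn:
            etas.append(eta)
            hs.append(h)
        h = omega + A @ eta + B @ h
    return np.array(etas), np.array(hs)

eta, h = simulate(50_000)
u = eta - h                                            # VARMA innovations u_t = eta_t - h_t
Sigma_u = u.T @ u / len(u)

# Truncated VMA coefficients Phi_j = Q^{j-1} A (GARCH(1,1) case of (2.6))
Phis = [np.eye(3)] + [np.linalg.matrix_power(Q, j - 1) @ A for j in range(1, 200)]

def log_ratio(lam):
    """Integrand of the linear-dependence measure, built from (3.10);
    constant scale factors cancel in the ratio."""
    z = np.exp(1j * lam)
    Pz = sum(P * z ** j for j, P in enumerate(Phis))
    f = Pz @ Sigma_u @ Pz.conj().T
    d = f[0, 0].real * f[2, 2].real
    return np.log(d / (d - abs(f[0, 2]) ** 2))

lams = np.linspace(-np.pi, np.pi, 400, endpoint=False)
CV = np.mean([log_ratio(l) for l in lams])             # (1/2pi) * integral over [-pi, pi]
ICV = np.log(Sigma_u[0, 0] * Sigma_u[2, 2]
             / (Sigma_u[0, 0] * Sigma_u[2, 2] - Sigma_u[0, 2] ** 2))   # (3.8)
GCV = CV - ICV                                         # bidirectional measure via (3.9)
print(CV, ICV, GCV)
```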
The parameter values of ω, A and B have been chosen in order to resemble typical parameter estimates for financial time series. In particular, the diagonal elements of A are typically much smaller than those of B, and their sum is close to one. The process satisfies the condition for existence of fourth moments given by Hafner (2003). In (3.12), the conditional variance of the second component of ε_t is only affected by its own squared lagged values, and therefore one can speak of absence of causality from the first to the second component in volatility. Figure 1 shows the alternative causality measures for the example process (3.12). The dashed line shows the instantaneous causality measure ICV_{y↔x}^{(m)}, the dotted line shows the linear dependence measure CV_{y↔x}^{(m)} and the solid line shows the bi-directional Granger causality measure GCV_{y↔x}^{(m)}, where x = ε_{mt,1}^{(m)} and y = ε_{mt,2}^{(m)}. Clearly, the bidirectional GCV measure is much smaller here than the ICV measure and also dissipates to zero very quickly. Note that the bidirectional GCV measure of the disaggregate process (3.12) is equal to the unidirectional GCV measure from ε_{t,2} to ε_{t,1}, since the matrices A and B are upper triangular, so that there is no GCV from ε_{t,1} to ε_{t,2}. However, the bidirectional GCV measure of the aggregated process incorporates some causality from ε_{t,1} to ε_{t,2}, although smaller than from ε_{t,2} to ε_{t,1}. But this is not shown in the figure. Finally, the discussed causality measures could be used for testing causality for a given empirical time series. If the errors u_t of the VARMA representation of multivariate GARCH models were Gaussian, then an estimate of the GCV measure, multiplied by the sample size T, would be the usual likelihood ratio statistic, having an asymptotic χ² distribution (see Geweke, 1982). Now u_t is not Gaussian but skewed and conditionally heteroskedastic. Thus, T times the estimated GCV measure could be called a pseudo likelihood ratio statistic with a non-standard asymptotic distribution. To obtain valid critical values, one can use the bootstrap as in Hafner and Herwartz (2008). They find that this statistic has similar size and power properties to the so-called CCF test of Cheung and
Figure 1. Causality measures for the example process (3.12) as a function of the aggregation level m.
Ng (1996). The CCF test estimates univariate GARCH models and computes cross-correlations of standardized residuals. It is therefore in the spirit of Lagrange Multiplier statistics. A third way to approach the testing problem is to use Wald type statistics, for example, based on QML estimation and inference of the multivariate model. Hafner and Herwartz (2008) show that the Wald test has more power under local alternatives than both the CCF and the pseudo-likelihood ratio test. However, their framework is a semi-strong multivariate GARCH model, and it is unlikely that this carries over to weak GARCH models.
4. FORECASTING

Suppose one is interested in the prediction of multivariate volatility of the aggregated series h periods ahead. That is, given information at time mt one wants to predict the volatility of ε_{m(t+h)}^{(m)}. Let us only consider the flow variable case here, so that ε_{m(t+h)}^{(m)} = ε_{m(t+h)} + ε_{m(t+h)−1} + · · · + ε_{m(t+h−1)+1}, and assume that the high-frequency process is multivariate GARCH(p, q). Prediction of the volatility of ε_{m(t+h)}^{(m)} is the same as prediction of η_{m(t+h)}^{(m)}, which by Theorem 2.1 follows a finite-order VARMA process. One can now build a forecast of η_{m(t+h)}^{(m)}, based on the VMA(∞) representation of this process with coefficient matrices {Φ_i^{(m)}}_{i=0}^{∞}. This forecast is defined by

η_{mt}^{(m)}(h) = σ^{(m)} + ∑_{i=0}^{∞} Φ_{h+i}^{(m)} u_{m(t−i)}^{(m)}.
The mean square error of this forecast is given by the matrix

Σ_a(h) = ∑_{i=0}^{h−1} Φ_i^{(m)} Σ_u^{(m)} Φ_i^{(m)′}.

Another possibility is to predict the disaggregated series and then aggregate the forecasts. Based on the VMA(∞) representation of the disaggregated series in (2.5), the optimal r-step forecast in a mean-square-error sense is given by

η_t(r) = σ + ∑_{i=0}^{∞} Φ_{r+i} u_{t−i}.

The forecast for η_{m(t+h)}^{(m)} is then given by η_{mt}(mh) + η_{mt}(mh − 1) + · · · + η_{mt}(m(h − 1) + 1). The mean square error of this forecast is given by Σ_d(h) = F Σ_{dm}(h) F′, where F = (I_N, . . . , I_N) is an (N × mN) aggregation matrix and Σ_{dm}(h) is a symmetric, positive definite (mN × mN) matrix whose (r, s) block, for r, s = 1, . . . , m with r ≥ s, is ∑_{i=0}^{m(h−1)+s−1} Φ_{i+r−s} Σ_u Φ_i′ (the block for r < s is the transpose of the (s, r) block);
see, e.g. Chapter 8 of Lütkepohl (1987). There it is also shown that for VARMA models, in general Σ_d(h) ≤ Σ_a(h) in the sense that the matrix Σ_a(h) − Σ_d(h) is positive semi-definite, and that equality only holds in special cases such as periodicity with period equal to the aggregation level. An implication of this result is that the forecasts based on the disaggregated series are superior to the forecasts based on the aggregated series in terms of forecast precision. On the other hand, both forecasts become equivalent as the forecast horizon increases, as both mean square error matrices approach the same unconditional covariance matrix. For the aggregation of multivariate GARCH processes, however, the difference between the two forecasts turns out to be stronger than for VARMA processes and does not dissipate for increasing horizons. The reason is the additional noise term in the aggregated series, w_{mt}^{(m)}. The expectation of this term is zero, but it has a positive definite covariance matrix Σ_w^{(m)} given by (2.11). Therefore, the unconditional variance of η_{mt}^{(m)} is larger than that of η_{mt} + η_{mt−1} + · · · + η_{m(t−1)+1}, and the forecast mean square error matrices converge to two different levels with increasing horizon. Thus, we have a strict inequality, Σ_d(h) < Σ_a(h) for all h > 0. Asymptotically, the difference is given by

lim_{h→∞} [Σ_a(h) − Σ_d(h)] = Σ_w^{(m)},    (4.1)
Figure 2. Mean square prediction error of forecasting the volatility of ε (m) mt,1 for the example process (3.12) with m = 2 as a function of the forecast horizon h.
where Σ_w^{(m)} is given by (2.11). While the difference between the two forecasting methods is negligible in VARMA models for sufficiently large horizons, it turns out to be substantial in multivariate GARCH models. Equation (4.1) says that in the limit, this difference is just given by the variance matrix of the noise term w_{mt}^{(m)} in (2.10), which was added to the sum of the individual η_{mt} in constructing the aggregate η_{mt}^{(m)}. It should be emphasized that this noise term is missing in the aggregation of VARMA processes. Note also that (4.1) is not affected by the order of the multivariate GARCH process. The implication of (4.1) is that forecasting weekly volatility, for example, by aggregating daily volatility forecasts will always be better than forecasting the weekly series directly, no matter how large the forecasting horizon. This is also the reason why in forecasting volatility, one should use the highest frequency for which data are available, provided that there are no biases coming from microstructure effects, for example. Recent empirical research has shown that predicting daily volatility of a financial time series using intraday returns can substantially improve the precision of forecasts using the daily series only (see, for example, Andersen et al., 2003). See also Section 5, where this so-called realized volatility is investigated in the context of multivariate GARCH models. Figure 2 shows the mean square prediction errors of the two forecasting methods for the example process (3.12) with m = 2. The solid line shows a prediction using the disaggregated process and then aggregating the forecasts, the dashed line shows a prediction of the aggregated process. The values are scaled by the factor m^{−2}. In this example, the mean square prediction
error can be reduced by almost 50% for all forecasting horizons by doubling the sampling frequency and using the high-frequency data for prediction.
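A compact way to compare the two forecasting routes numerically is sketched below (our illustration, not the paper's code). The high-frequency VMA coefficients Φ_i and innovation variance Σ_u, and their aggregated counterparts Φ_i^{(m)} and Σ_u^{(m)}, are assumed to be available from the formulas of Section 2; the second helper groups disaggregate forecast errors by the date of the underlying shock, which is algebraically equivalent to the block-matrix expression for Σ_{dm}(h).

```python
import numpy as np

def mse_direct(Phi_m, Sigma_u_m, h):
    """Sigma_a(h): forecast MSE when the aggregated series is predicted directly
    from its own weak VARMA representation (VMA coefficients Phi_m, innovation
    variance Sigma_u_m)."""
    return sum(Phi_m[i] @ Sigma_u_m @ Phi_m[i].T for i in range(h))

def mse_via_disaggregate(Phi, Sigma_u, m, h):
    """Sigma_d(h): forecast MSE when the high-frequency series is predicted and
    the forecasts are then aggregated. Phi are the high-frequency VMA coefficients
    from (2.6) (at least m*h of them), Sigma_u the high-frequency innovation
    variance. Errors are grouped by the date t+s of the underlying shock."""
    N = Sigma_u.shape[0]
    total = np.zeros((N, N))
    for s in range(1, m * h + 1):
        C = np.zeros((N, N))
        for r in range(max(s, m * (h - 1) + 1), m * h + 1):
            C += Phi[r - s]
        total += C @ Sigma_u @ C.T
    return total

# Per (4.1), mse_direct(...) - mse_via_disaggregate(...) approaches Sigma_w^(m) as h grows.
```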
5. MULTIVARIATE REALIZED VOLATILITY

There is a growing literature on so-called realized volatilities; see, for example, Andersen et al. (2003) for an overview. Realized volatilities are estimates of low-frequency volatilities using high-frequency data. For example, the volatility of a daily return series could be estimated by the sum of squared intraday returns. When the sampling frequency goes to infinity, realized volatilities converge to the actual volatility and are therefore consistent, unbiased estimates of daily volatility. In the multivariate context, the same idea applies to the vector of squares and cross-products, η_t = vech(ε_t ε_t′). The aggregation scheme is no longer ε_{mt}^{(m)} = ε_{mt} + ε_{mt−1} + · · · + ε_{mt−m+1} but

η̄_{mt} = η_{mt} + η_{mt−1} + · · · + η_{mt−m+1}.

Thus, all the cross-terms that appeared in our previous aggregation scheme η_{mt}^{(m)} = vech(ε_{mt}^{(m)} ε_{mt}^{(m)′}) are absent here. First, it is clear that for any finite m, η̄_{mt} is an unbiased estimate of the unobservable daily volatility. It is more efficient than the noisy η_{mt}^{(m)} = vech(ε_{mt}^{(m)} ε_{mt}^{(m)′}), but for every finite m, it is inefficient compared with h̄_{mt} = h_{mt} + h_{mt−1} + · · · + h_{mt−m+1}. The practical advantage of using η̄_{mt} is, of course, that no parametric model of volatility needs to be specified, but a drawback is given by the restriction that m cannot be chosen arbitrarily large. In other words, the time interval between observations cannot be arbitrarily small due to market microstructure effects. If the true volatility process follows multivariate GARCH, we quantify below the loss of efficiency of η̄_{mt} compared with h̄_{mt}. To calculate the variance of η̄_{mt}, note that this is just the sum of the variances of the individual terms η_{mt}, each one equal to Σ_η − σσ′, plus the sum of all covariances. This is given by

Var(η̄_{mt}) = m(Σ_η − σσ′) + ∑_{i=1}^{m−1} (m − i)(Γ(i) + Γ(i)′).

Similarly, we obtain for the variance of h̄_{mt}

Var(h̄_{mt}) = m(Σ_h − σσ′) + ∑_{i=1}^{m−1} (m − i)(Γ(i) + Γ(i)′),    (5.1)

so that the difference is given by

Var(η̄_{mt}) − Var(h̄_{mt}) = m(Σ_η − Σ_h),    (5.2)
which is positive semi-definite. Note that (5.1) is O(m²) and (5.2) is O(m), so that the relative difference between the two variances is O(m^{−1}). In other words, the loss of efficiency of realized volatilities w.r.t. the model (supposing that this is correctly specified) is diminishing with rate O(m^{−1}). In practice, m cannot increase without bounds, so that the relative efficiency for a given m depends on features such as the volatility persistence and the correlation. Let us define the relative efficiency of the ith component of realized volatility w.r.t. the model as the ith diagonal element of Var(η̄_{mt}), divided by the corresponding diagonal element of Var(h̄_{mt}), that is,

RE_i(m) = [Var(η̄_{mt})]_{ii} / [Var(h̄_{mt})]_{ii}.    (5.3)
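Given the moment matrices that enter (5.1), the relative efficiency (5.3) is a one-line computation. The sketch below is ours; Σ_η, Σ_h, σ and the autocovariances Γ(i) of η_t are treated as inputs (they can be obtained, for example, from (2.7) together with the fourth-moment formulas of Hafner, 2003).

```python
import numpy as np

def relative_efficiency(Sigma_eta, Sigma_h, sigma, Gamma, m):
    """RE_i(m) of (5.3). Sigma_eta = E[eta_t eta_t'], Sigma_h = E[h_t h_t'],
    sigma = E[eta_t]; Gamma[i] is the lag-i autocovariance matrix of eta_t from
    (2.7) (Gamma[0] is not used). All inputs are assumed given."""
    cross = sum((m - i) * (Gamma[i] + Gamma[i].T) for i in range(1, m))
    var_realized = m * (Sigma_eta - np.outer(sigma, sigma)) + cross   # Var(eta_bar_mt)
    var_model = m * (Sigma_h - np.outer(sigma, sigma)) + cross        # Var(h_bar_mt), (5.1)
    return np.diag(var_realized) / np.diag(var_model)
```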
Table 1. Relative efficiencies of realized volatilities.

  m     RE_1(m)   RE_2(m)   RE_3(m)
  2     3.2264    4.3470    6.8479
  3     2.3839    3.0344    4.4386
  4     2.0356    2.5008    3.4832
  5     1.8460    2.2121    2.9713
  10    1.5075    1.6985    2.0663
  20    1.3632    1.4774    1.6745
  30    1.3217    1.4127    1.5585
  40    1.3030    1.3832    1.5056
  50    1.2925    1.3668    1.4763

Notes: Relative efficiencies according to definition (5.3) of realized volatilities with respect to the optimal estimates when the high-frequency process is known to be the process given in (3.12). RE_1 is the measure for the conditional variance of ε_{mt,1}^{(m)}, RE_2 is the measure for the conditional covariance of ε_{mt,1}^{(m)} and ε_{mt,2}^{(m)}, and RE_3 is the measure for the conditional variance of ε_{mt,2}^{(m)}.
Note that RE i (m) = 1 + O(m−1 ) so that for m sufficiently large, the efficiency loss is negligible. However, if m cannot be chosen arbitrarily large in practice, the efficiency loss may be substantial. For our example process (3.12), Table 1 lists the values of RE i (m) for selected levels m. Obviously, even at m = 50, the variance of the realized volatility estimator is still 29% higher than that of the optimal one for the first component of η(m) mt . For the other two components, the loss is even higher. For their exchange rate example, Andersen et al. (2003) use a value of m = 48, having half-hourly data for a 24 hours per day market. They cannot choose m much larger because of the problems with interfering microstructure effects, such as bid–ask bounces. The values of RE i (m) in Table 1 therefore appear relevant if our example process can be considered as a typical high frequency process. In such a situation, the practitioner has to weigh the risk of misspecifying a parametric volatility model for the high-frequency process against the efficiency loss of the nonparametric estimation using realized volatilities. There is a second issue concerning standardized residuals using realized volatilities, which turns out to be intimately related to the relative efficiency issue. Standardized residuals are −1/2 (m) , where H¯ mt is the de-vectorized h¯ mt for the given multivariate typically obtained by H¯ mt εmt GARCH model. Alternatively, without an assumption on the underlying process, one can define −1/2 ¯ mt . Due to the higher standardized residuals by ϒ mt ε(m) mt , where ϒ mt is the de-vectorized η variance of η¯ mt compared with h¯ mt , the kurtosis of the residuals standardized by realized volatilities η¯ mt will be smaller than that of re siduals standardized by h¯ mt . In particular, if the innovation distribution is Gaussian, the kurtosis of the residuals standardized by realized volatilities is smaller than three, which is also apparent in the empirical results of Andersen et al. (2003), Table 1. They claim that standardized residuals are close to being Gaussian, but for their sample of 10 years of daily returns on the DM/Dollar exchange rate, a value of 2.57 for the kurtosis of standardized residuals is likely to violate the normality assumption. 1 The negative bias of the kurtosis estimate could be related approximately to the efficiency loss 1 This can be seen by noting that for an iid Gaussian white noise, the standard error of the kurtosis estimator is (24/n)1/2 , where n is the sample size. If n = 2500, which roughly corresponds to 10 years of daily data, the standard error takes the value 0.098, so that with an estimate of 2.57, one would reject the null hypothesis of Gaussian white noise at the 95% significance level.
expressed by RE_i(m). To see this point, consider the process ε_t = √h_t ξ_t with ξ_t ∼ Nid(0, 1) and h_t a univariate GARCH process of arbitrary order. Suppose there is an F_{t−1}-measurable volatility estimator V_t with E[V_t] = E[h_t] but Var[V_t] > Var[h_t]. Then the kurtosis of ε_t/√V_t is by definition E[ε_t⁴/V_t²]/E[ε_t²/V_t]². The expectation in the denominator can be written as E[ε_t²/V_t] = E[h_t/V_t], which can be approximated by 1, using the first-order expansion

h_t/V_t ≈ E[h_t]/E[V_t] + (h_t − E[h_t])/E[V_t] − E[h_t](V_t − E[V_t])/E[V_t]².

Thus, the kurtosis can be written as E[ε_t⁴/V_t²] = 3E[h_t²/V_t²]. Using a similar expansion for the ratio h_t²/V_t², the kurtosis is approximately equal to 3E[h_t²]/E[V_t²], which is smaller than 3 if Var[V_t] > Var[h_t]. In terms of the relative efficiency RE = Var(V_t)/Var(h_t), the kurtosis can be approximated by K = 3(1 + c)/(RE + c), where c = E[h_t]²/E[h_t²]. Thus, if RE > 1, then there is a downward bias in the kurtosis of residuals standardized with the inefficient estimator V_t. Recently, interest has focused on the distribution of realized volatilities. If the true underlying DGP is multivariate GARCH(p, q) and m is sufficiently large, this may be approximated by the asymptotic distribution of the centred and normalized realized volatilities, which is given in the following theorem.

THEOREM 5.1. Under covariance stationarity of ε_t, the asymptotic distribution of realized volatilities for m → ∞ is given by

m^{−1/2}(η̄_{mt} − mσ) →_D N(0, 2π f(0)),

where f(λ) is the spectral density matrix of η_t at frequency λ given in (3.10). Moreover, we have
j ∞ ⎨ ∞ j =1 i=j +1 i , τ = 1 i=0 i u lim Cov(η¯ mt , η¯ m(t+τ ) ) = m→∞ ⎩ 0, τ ≥ 2. An implication of this theorem is that, for m sufficiently large, the centred and normalized realized volatilities may be approximated by a multinormal distribution. However, due to the asymmetric nature of the distribution of volatilities, typically being strongly skewed to the righthand side, it may require very large values of m before the normality result of Theorem 5.1 applies. In fact, Andersen et al. (2003) find that for moderately large m, the distribution of foreign exchange realized volatilities can be well approximated by a log-normal distribution. Further empirical evidence is required to assess how these results depend on the aggregation level m. Also, one may do Monte Carlo simulations to find the distribution of m−1/2 (η¯ mt − mσ ) for finite m and a known high-frequency process such as (3.12). This is beyond the scope of this paper but interesting for future research. The second result of Theorem 5.1 implies that the aggregated process η¯ mt for large but finite aggregation levels m can be approximated by a VMA(1) process. This is because Cov(η¯ mt , η¯ m(t+τ ) ) is O(m) for τ = 0, O(1) for τ = 1 and o(1) for τ ≥ 2. That is, for m → ∞, the process converges to white noise since the autocorrelations tend to zero, but for finite m, the first-order autocorrelation will be much larger than higher order autocorrelations. In other words, the vector of realized volatilities can be approximated by a VMA(1) process for large but finite values of m if the underlying DGP is multivariate GARCH. Hence, in practice one may directly specify a VMA(1) model for the realized volatilities for finite but large aggregation C The Author(s). Journal compilation C Royal Economic Society 2009.
level m. Alternatively, one may even use standard model selection procedures to specify a VARMA(p, q) model for the realized volatilities.
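Constructing the multivariate realized volatility series from high-frequency data amounts to summing vech(ε_t ε_t′) within each low-frequency period. A minimal sketch (ours; array names are illustrative):

```python
import numpy as np

def realized_vech(eps, m):
    """Multivariate realized volatility: eta_bar_mt = sum over m high-frequency
    periods of vech(eps_t eps_t'). eps: (T, K) array of high-frequency returns,
    with T a multiple of m; returns one row per low-frequency period."""
    T, K = eps.shape
    idx = [(i, j) for j in range(K) for i in range(j, K)]   # column-wise vech ordering
    eta = np.array([[e[i] * e[j] for (i, j) in idx] for e in eps])
    return eta.reshape(T // m, m, -1).sum(axis=1)
```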
6. CONCLUSIONS AND OUTLOOK The main conclusion of this paper is that there are remarkable differences in the consequences of temporal aggregation of multivariate GARCH and VARMA models. First, spurious causality induced by temporal aggregation will be much less problematic in volatility than it is in the mean. Secondly, the forecasting performance of the method that directly predicts the aggregated process does not become identical to the optimal procedure for increasing horizons. Thus, there is a substantial difference between forecasting a VARMA process and the volatility of a multivariate GARCH process. Concerning realized volatility, it will be important to shed more empirical light on the multivariate distribution of realized volatilities, for which this paper derives an asymptotic result if the high-frequency process is multivariate GARCH. Finally, it will be important to bridge the gap to continuous time processes, as was done in the univariate case by Nelson (1990) and Drost and Werker (1996). This is left to future research.
ACKNOWLEDGMENTS The author would like to thank J¨org Breitung, Rob Engle, Nour Meddahi and Jeroen Rombouts for helpful discussions. The paper has been presented at the Quantitative Finance seminar of Humboldt University Berlin, January 2003, the Workshop on Econometric Time Series Analysis—Methods and Applications, Linz, September 2003, the North American Winter Meeting of the Econometric Society, San Diego, January 2004, and the European Meeting of the Econometric Society, Madrid, August 2004. Valuable comments of seminar participants are gratefully acknowledged. Of course, the author assumes full scientific responsibility.
REFERENCES Andersen, T. G., T. Bollerslev, F. X. Diebold and P. Labys (2003). Modelling and forecasting realized volatility. Econometrica 71, 579–625. Boudjellaba, H., J.-M. Dufour and R. Roy (1992). Testing causality between two vectors in multivariate autoregressive moving average models. Journal of the American Statistical Association 87, 1082–1090. Breitung, J. and N. R. Swanson (2002). Temporal aggregation and spurious instantaneous causality in multiple time series models. Journal of Time Series Analysis 23, 651–665. Cheung, Y. W. and L. K. Ng (1996). A causality in variance test and its application to financial market prices. Journal of Econometrics 72, 33–48. Comte, F. and O. Lieberman (2000). Second order noncausality in multivariate GARCH processes. Journal of Time Series Analysis 21, 535–557. Drost, F. C. and T. E. Nijman (1993). Temporal aggregation of GARCH processes. Econometrica 61, 909– 927. Drost, F. C. and B. J. M. Werker (1996). Closing the GARCH gap: continuous time GARCH modeling. Journal of Econometrics 74, 31–57. Dufour, J. M. and E. Renault (1998). Short run and long run causality in time series: theory. Econometrica 66, 1099–1126. C The Author(s). Journal compilation C Royal Economic Society 2009.
Engle, R. F. and K. F. Kroner (1995). Multivariate simultaneous generalized ARCH. Econometric Theory 11, 122–150. Geweke, J. (1982). Measurement of linear dependence and feedback between multiple time series. Journal of the American Statistical Association 77, 304–313. Geweke, J. (1984). Measures of conditional linear dependence and feedback between time series. Journal of the American Statistical Association 79, 907–915. Granger, C. W. J., R. P. Robins and R. F. Engle (1984). Wholesale and retail prices: bivariate time series modelling with forecastable error variances. In D. Belsley and E. Kuh (Eds.), Model Reliability, 1–17, Cambridge, MA: MIT Press. Hafner, C. M. (2003). Fourth moment structure of multivariate GARCH models. Journal of Financial Econometrics 1, 26–54. Hafner, C. M. (2008). Temporal aggregation of multivariate GARCH processes. Journal of Econometrics 142, 467–483. Hafner, C. M. and H. Herwartz (2008). Testing for causality in variance using multivariate GARCH models. Annales d’Economie et de Statistique, forthcoming. Jorda, O. and M. Marcellino (2004). Time-scale transformations of discrete time processes. Journal of Time Series Analysis 25, 873–894. L¨utkepohl, H. (1987). Forecasting Aggregated Vector ARMA Processes. Lecture Notes in Economics and Mathematical Systems, volume 284, Berlin: Springer Verlag. L¨utkepohl, H. (1993). Introduction to Multiple Time Series Analysis (2nd ed.). Berlin, New York: Springer Verlag. Marcellino, M. (1999). Some consequences of temporal aggregation in empirical analysis. Journal of Business and Economic Statistics 17, 129–136. Meddahi, N. and E. Renault (2004). Temporal aggregation of volatility models. Journal of Econometrics 119, 355–379. Nelson, D. B. (1990). ARCH models as diffusion approximations. Journal of Econometrics 45, 7–39. Nijman, T. and E. Sentana (1996). Marginalization and contemporaneous aggregation in multivariate GARCH processes. Journal of Econometrics 71, 71–87. Renault, E. and A. Szafarz (1991). True versus spurious instantaneous causality. Discussion paper 9103, CEME, Universit´e Libre de Bruxelles. Renault, E., K. Sekkat and A. Szafarz (1998). Testing for spurious causality in exchange rates. Journal of Empirical Finance 5, 47–66.
APPENDIX

Proof of Theorem 2.1: Consider the multivariate GARCH(p, q) model (2.1) and define r = max(p, q), A_{q+1} = · · · = A_r = 0 if p > q, and B_{p+1} = · · · = B_r = 0 if q > p. Then (2.1) has the following multivariate GARCH(1,1) representation,

H_t = Ω̄ + A E_{t−1} + B H_{t−1},

where Ω̄ = (ω′, 0′, . . . , 0′)′, H_t = (h_t′, h_{t−1}′, . . . , h_{t−r+1}′)′ and E_t = (η_t′, η_{t−1}′, . . . , η_{t−r+1}′)′ are Nr × 1 vectors, and where A (with first block row A_1, . . . , A_{r−1}, A_r and all other blocks equal to zero) and B (with first block row B_1, . . . , B_{r−1}, B_r, identity matrices I_N on the first block subdiagonal and zeros elsewhere)
are N r × N r parameter matrices. By Theorem 1 of Hafner (2008), the aggregated process has a weak (m) (m) (m) multivariate GARCH(1,1) representation H (m) + A(m) E (m) H m(t−1) , where (m) , A(m) mt = m(t−1) + B (m) and B are functions of , A and B, respectively. Moreover, there is a corresponding VARMA(1,1) representation for E (m) mt . Define the N × N r transformation matrix F = (I N , 0, . . . , 0). Then, applying (m) ∗ ∗ corollary 4.2.2 of L¨utkepohl (1987); the process η(m) mt = FE mt has a VARMA(p , q ) representation with ∗ ∗ ∗ has a multivariate GARCH(p , q ) representation with the p ∗ , q ≤ N (r − 1) + 1. Correspondingly, h(m) mt same upper bounds for p ∗ and q ∗ . LI CV
Proof of Theorem 3.1: First, εt,1 εt,2 is equivalent to [ u ] 13 = 0, by proposition 2.3 of L¨utkepohl (1993, p. 40). Now [ u ] 13 = E[ε2t,1 ε2t,2 ] − E[h t,1 h t,3 ]. Under the assumption of conditional normality, the first term is given by E[ε2t,1 ε2t,2 ] = E[h t,1 h t,3 + 2h2t,2 ] by Theorem 1 of Hafner (2003). Thus, [ u ] 13 = 0 is GCV
equivalent to h t,2 = 0. But if ht,2 = 0, εt,1 εt,2 and K = 2, then the diagonality of the matrices A, B, (m) (m) , B (m) and (m) and u implies also diagonality of the matrices (m) v and v , and therefore A u . Thus, if LCV
LCV
(m) (m) εt,1 εt,2 and K = 2, then we also have εmt,1 εmt,2 . Hence, spurious LICV can only appear if h t,2 = 0 and K ≥ 3
Proof of Theorem 5.1: Under our conditions, the aggregated process η¯ mt has a weak finite-order VARMA representation that is stationary and invertible. Thus, it also has a linear VMA(∞) representation, for which Breitung and Swanson (2002) have shown the asymptotic results for m−1 Var(η¯ mt ) and Cov(η¯ mt , η¯ m(t+τ ) ), τ ≥ 1. Covariance stationarity of η¯ mt implies that there are constants C, j ∗ and ρ ∈ (0, 1) such that for all j > j ∗ , j ≤ Cρ j for any matrix norm ·, where j are the coefficients of the VMA(∞) representation. This in turn implies that ∞ j =0 j j < ∞, the condition given by Breitung and Swanson (2002), such that the partial sums for τ = 1 converge in mean square. The asymptotic normality follows similar to proposition 3.3 of L¨utkepohl (1993). The formulae for f (λ) and for u have been derived by Hafner (2003) for a multivariate GARCH(1,1) process. If p > 1 and/or q > 1, then one can obtain f (λ) and u from the multivariate GARCH(1,1) representation of the proof of Theorem 2.1.
The
Econometrics Journal Econometrics Journal (2009), volume 12, pp. 147–163. doi: 10.1111/j.1368-423X.2008.00261.x
Testing for volatility interactions in the Constant Conditional Correlation GARCH model

TOMOAKI NAKATANI†,‡ AND TIMO TERÄSVIRTA‡,§

†
Department of Agricultural Economics, Hokkaido University, Kita 9, Nishi 9, 060-8589, Sapporo-shi, Japan Email:
[email protected] ‡
Department of Economic Statistics, Stockholm School of Economics, P.O. Box 6501, SE-113 83 Stockholm, Sweden
§ CREATES, School of Economics and Management, University of Aarhus, Building 1322, DK-8000 Aarhus C, Denmark Email:
[email protected] First version received: March 2007; final version accepted: August 2008
Summary In this paper, we propose a Lagrange multiplier test for volatility interactions among markets or assets. The null hypothesis is the Constant Conditional Correlation generalized autoregressive conditional heteroskedasticity (GARCH) model in which volatility of an asset is described only through lagged squared innovations and volatility of its own. The alternative hypothesis is an extension of that model in which volatility is modelled as a linear combination not only of its own lagged squared innovations and volatility but also of those in the other equations while keeping the conditional correlation structure constant. This configuration enables us to test for volatility transmissions among variables in the model. Monte Carlo experiments show that the proposed test has satisfactory finite-sample properties. The size distortions become negligible when the sample size reaches 2500. The test is applied to pairs of foreign exchange returns and individual stock returns. Results indicate that there seem to be volatility interactions in the pairs considered, and that significant interaction effects typically result from the lagged squared innovations of the other variables. Keywords: Conditional correlations, Lagrange multiplier test, Monte Carlo simulation, Multivariate GARCH, Volatility interactions.
1. INTRODUCTION

During the last few decades, considerable attention has been paid to the conditional second moments of financial time series. Models of generalized autoregressive conditional heteroskedasticity (GARCH), either univariate or multivariate, have become standard tools for studying such series. In financial econometrics, the analysis of interdependence in volatility is important for portfolio risk management on the one hand, and is necessary for research on the degree of market integration on the other. A large number of researchers have found ample evidence that the conditional variances of financial time series are interacting. Empirical studies
include Hamao et al. (1990), Baillie and Bollerslev (1990), Cheung and Ng (1996), Hong (2001) and Cifarelli and Paladino (2005), among others. The Extended Constant Conditional Correlation (ECCC) GARCH model that Jeantheau (1998) introduced offers a platform for modelling volatility interactions between markets or assets. The model nests the Constant Conditional Correlation (CCC) GARCH model by Bollerslev (1990). This extension of the CCC-GARCH model allows the interaction in the form of both lagged squared observations and lagged conditional variances from the other equations of the system. The CCC-GARCH model only allows contemporaneous dependence through conditional correlations, which is not sufficient for volatility interaction. Since the ECCC-GARCH model nests the CCC-GARCH one, a natural idea would be to test for volatility interactions in the ECCC-GARCH framework. In fact, Wong and Li (1997) and Wong et al. (2000) employed the ECCC-GARCH model for describing volatility interactions between the daily Standard and Poor’s 500 and the Sydney All Ordinaries index returns, and among three major foreign exchange rates, albeit without first testing the hypothesis of no interactions. In order to save computational efforts, a test of this hypothesis should merely involve estimating the CCC-GARCH model. Estimating the ECCC-GARCH model would become an issue only when the null hypothesis is rejected. Consequently, the aim of this paper is to construct such a test using the score or Lagrange multiplier (LM) principle and to investigate its finitesample properties. The existing misspecification tests of the CCC-GARCH model are designed for testing the constancy of correlations. Tse (2000) and Bera and Kim (2002) formed tests of the assumption of constant conditional correlations against an unspecified alternative. Berben and Jansen (2005) and Silvennoinen and Ter¨asvirta (2005) provided LM tests of the CCC-GARCH model against parametric alternatives with time-varying correlations that are variants of the Smooth Transition Conditional Correlation (STCC) GARCH model. Misspecification of the GARCH structure of this model has received less attention in the literature, and the contribution of this paper will lie in that direction. This paper is organized as follows. In Section 2, the ECCC-GARCH model is defined and stationarity conditions are mentioned. Section 3 contains the first and second partial derivatives of the log-likelihood function of the ECCC-GARCH model as well as the asymptotic properties of the quasi-maximum-likelihood estimator (QMLE) of its parameter vector. The LM test is derived in Section 4, and its finite-sample properties are studied by Monte Carlo simulations in Section 5. In Section 6, the test is applied to pairs of daily foreign exchange returns and daily stock returns. Section 7 concludes. There is an accompanying paper (Nakatani and Ter¨asvirta, 2008a) which contains the mathematical derivations, proofs, an illustrative example and additional figures associated with the applications.
2. THE EXTENDED CONSTANT CONDITIONAL CORRELATION GARCH MODEL

2.1. Definition

Following Jeantheau (1998) and He and Teräsvirta (2004), consider the following vector stochastic process:

y_t = μ + ε_t,    (2.1)
ε_t = D_t z_t,    (2.2)
where y_t is a stochastic (N × 1) vector, μ is an (N × 1) intercept vector and D_t = diag(h_{1,t}^{1/2}, . . . , h_{N,t}^{1/2}) is a diagonal matrix of conditional standard deviations of ε_t. The above formulation is a special case of the vector ARMA-GARCH model in Ling and McAleer (2003). The sequence {z_t} with the stochastic vector z_t = [z_{1,t}, . . . , z_{N,t}]′ is a sequence of independent and identically distributed variables with mean 0 and time-invariant positive definite covariance matrix P = [ρ_{ij}] with ones on the main diagonal. With these assumptions, E[ε_t | F_{t−1}] = 0 and E[ε_t ε_t′ | F_{t−1}] = H_t, where the (i, j) element of H_t is h_{i,t} for i = j and h_{i,t}^{1/2} h_{j,t}^{1/2} ρ_{ij} for i ≠ j, F_t is the information set up to and including time t, and H_t = D_t P D_t. Matrix H_t is the conditional covariance matrix and P the constant conditional correlation matrix of the process {ε_t}. The vector GARCH(p, q) process of ε_t is defined as follows.
$$h_t = [h_{1,t}, \ldots, h_{N,t}]' = a_0 + \sum_{i=1}^{q} A_i \varepsilon_{t-i}^{(2)} + \sum_{j=1}^{p} B_j h_{t-j}, \qquad (2.3)$$
where $\varepsilon_t^{(2)} = (\varepsilon_{1,t}^2, \ldots, \varepsilon_{N,t}^2)'$, $a_0$ is an (N × 1) vector, and $A_i$ and $B_j$ are (N × N) matrices with elements such that $h_{i,t}$ in $h_t$ are positive for all t. The superscript within the parentheses in a vector or a matrix denotes an elementwise exponent. A sufficient condition for $h_t$ to have positive elements for all t is that all elements in $a_0$ are positive and all elements in $A_i$ and $B_j$ for each i and j are non-negative.¹ This guarantees that, together with the positive definiteness of P, the conditional variance matrix $H_t$ is positive definite almost surely for all t. Equations (2.1), (2.2) and (2.3) jointly define the N-dimensional ECCC-GARCH(p, q) model.

Note that if both $A_i$ and $B_j$ are diagonal for all i and j, the ECCC-GARCH(p, q) model collapses into the CCC-GARCH(p, q) model of Bollerslev (1990). If, furthermore, $B_j = 0$, j = 1, . . . , p, the model is the CCC-ARCH(q) model of Cecchetti et al. (1988). Wong and Li (1997) applied the ECCC-ARCH(1) model to the S&P 500 and Sydney's All Ordinaries index returns. Wong et al. (2000) considered the first-order ECCC-GARCH model with the restriction that $B_1$ be diagonal, and applied it to the Hang Seng index and the S&P 500 index.

For simplicity, and given the fact that first-order models describe many heteroskedastic time series well in a vast majority of empirical applications, we restrict our discussion to the case of p = q = 1 unless otherwise stated. Excluding $\mu$, there are, in the ECCC-GARCH(1, 1) model, N(5N + 1)/2 parameters to be estimated, of which N(2N + 1) parameters appear in $h_t$ and the remaining N(N − 1)/2 in P.
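As an illustration of the model just defined, the following sketch simulates a bivariate ECCC-GARCH(1, 1) path with Gaussian $z_t$. It is not the code used in the paper; the function name, the burn-in length and the parameter values (which mimic DGP 1 of Section 5) are purely illustrative.

```r
## Minimal sketch (not the authors' implementation): simulate a bivariate
## ECCC-GARCH(1,1) path from equations (2.1)-(2.3) with Gaussian z_t and mu = 0.
simulate_eccc <- function(Tn, a0, A1, B1, P, burn = 500) {
  N <- length(a0)
  L <- t(chol(P))                                  # L %*% t(L) = P
  h <- matrix(0, Tn + burn, N)                     # conditional variances h_t
  e <- matrix(0, Tn + burn, N)                     # epsilon_t
  h[1, ] <- a0 / (1 - diag(A1) - diag(B1))         # rough starting value
  e[1, ] <- sqrt(h[1, ]) * drop(L %*% rnorm(N))
  for (t in 2:(Tn + burn)) {
    h[t, ] <- a0 + A1 %*% e[t - 1, ]^2 + B1 %*% h[t - 1, ]   # equation (2.3)
    z      <- drop(L %*% rnorm(N))                 # z_t with correlation matrix P
    e[t, ] <- sqrt(h[t, ]) * z                     # equation (2.2)
  }
  list(eps = e[-(1:burn), ], h = h[-(1:burn), ])
}

## illustrative parameter values in the spirit of DGP 1 (Table 1)
sim <- simulate_eccc(Tn = 2500, a0 = c(0.1, 0.2),
                     A1 = diag(c(0.1, 0.2)), B1 = diag(c(0.8, 0.7)),
                     P  = matrix(c(1, 0.3, 0.3, 1), 2, 2))
```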
2.2. Stationarity of the ECCC-GARCH process

In the context of vector GARCH processes with constant conditional correlations, a sufficient condition for weak and strict stationarity of an ECCC-GARCH(p, q) process with a constant conditional mean was established in Jeantheau (1998). Ling and McAleer (2003) found a more general condition that allows the process to have an ARMA structure in the conditional mean. A sufficient condition for weak and strict stationarity of an ECCC-GARCH(1, 1) model with normal errors is immediate from Theorem 1 of He and Teräsvirta (2004). Define a sequence of i.i.d. stochastic matrices $\{C_t\}$ such that
$$C_t = A_1 Z_t^2 + B_1, \qquad (2.4)$$
where $Z_t = \mathrm{diag}(z_{1,t}, \ldots, z_{N,t})$. The ECCC-GARCH(1, 1) process is weakly and strictly stationary if
$$\lambda(\Gamma_C) < 1, \qquad (2.5)$$
where $\Gamma_C = \mathrm{E}[C_t]$ and $\lambda(\cdot)$ is the spectral radius, or the modulus of the largest eigenvalue, of its argument. If N = 1, the inequality (2.5) collapses into the condition for the univariate GARCH(1, 1) process with unit variance to be weakly stationary.

¹ Nakatani and Teräsvirta (2008b) show that off-diagonal elements in $B_j$ can assume negative values while positive definiteness of $H_t$ is retained. This extension does not affect the asymptotic distribution of the test statistic in Section 4 because off-diagonal elements are all zero under the null hypothesis. Therefore, we maintain the simple sufficient condition throughout the paper.
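Under the normal-error assumption above, $\mathrm{E}[Z_t^2] = I_N$, so $\Gamma_C = \mathrm{E}[C_t] = A_1 + B_1$ and condition (2.5) can be checked directly. The following sketch (illustrative names, not code from the paper) does this for the parameter values of DGPs 2 and 3 in Section 5.

```r
## Sketch: check the stationarity condition (2.5) for an ECCC-GARCH(1,1) model.
## With z_t ~ N(0, P), E[Z_t^2] = I_N, so Gamma_C = E[C_t] = A1 + B1.
spectral_radius <- function(M) max(Mod(eigen(M, only.values = TRUE)$values))

A1 <- diag(c(0.04, 0.05))
B1 <- diag(c(0.95, 0.90))
spectral_radius(A1 + B1) < 1    # TRUE here (lambda = 0.99), so (2.5) is satisfied
```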
3. LIKELIHOOD FUNCTION AND ASYMPTOTIC PROPERTIES

3.1. The log-likelihood function

The aim of this paper is to construct a test of the hypothesis that $A_1$ and $B_1$ are diagonal matrices in the model (2.1), (2.2) and (2.3) with p = q = 1. For this we need the (quasi) log-likelihood function of the model and its first two partial derivatives. Without loss of generality, we can assume $\mu = 0$. Let then $\theta = [\omega', \rho']'$ where $\omega$ contains the parameters in $h_t$ and $\rho = \mathrm{vecl}(P)$. The operator vecl stacks the lower off-diagonal elements of a symmetric (N × N) matrix into an N(N − 1)/2 vector. The quasi-log-likelihood function for observation t is given by
$$\ell_t(\theta) = -\frac{N}{2}\ln(2\pi) - \frac{1}{2}\ln|D_t P D_t| - \frac{1}{2}\,\varepsilon_t' D_t^{-1} P^{-1} D_t^{-1}\varepsilon_t = -\frac{N}{2}\ln(2\pi) - \ln|D_t| - \frac{1}{2}\ln|P| - \frac{1}{2}\,\varepsilon_t' H_t^{-1}\varepsilon_t. \qquad (3.1)$$
The QMLE $\hat\theta$ equals
$$\hat\theta = \arg\max_\theta \sum_{t=1}^{T}\ell_t(\theta). \qquad (3.2)$$
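For a bivariate first-order model, (3.1) and (3.2) can be evaluated along the lines of the following sketch, in which eps is a T × 2 matrix of mean-adjusted returns. The initialization of $h_1$ and the omission of positivity and stationarity restrictions are simplifying assumptions, and the function is not the authors' implementation; it could, for instance, be passed to optim() to obtain the QMLE.

```r
## Sketch: Gaussian quasi-log-likelihood (3.1), summed over t, for a bivariate
## (E)CCC-GARCH(1,1); maximizing it over the parameters gives the QMLE in (3.2).
ccc_loglik <- function(a0, A1, B1, rho, eps) {
  Tn <- nrow(eps); N <- ncol(eps)
  P  <- matrix(c(1, rho, rho, 1), 2, 2)
  Pinv <- solve(P)
  logdetP <- as.numeric(determinant(P, logarithm = TRUE)$modulus)
  h  <- matrix(0, Tn, N)
  h[1, ] <- colMeans(eps^2)                       # crude initialization of h_1
  ll <- 0
  for (t in 1:Tn) {
    if (t > 1) h[t, ] <- a0 + A1 %*% eps[t - 1, ]^2 + B1 %*% h[t - 1, ]
    z  <- eps[t, ] / sqrt(h[t, ])                 # z_t = D_t^{-1} eps_t
    ll <- ll - 0.5 * N * log(2 * pi) - 0.5 * sum(log(h[t, ])) -
               0.5 * logdetP - 0.5 * drop(t(z) %*% Pinv %*% z)
  }
  ll
}
```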
The following assumptions are made to ensure the asymptotic normality of the QMLE $\hat\theta$ in (3.2) (see Ling and McAleer, 2003).

ASSUMPTION 3.1. The spectral radius $\lambda(P)$ has a positive lower bound over the parameter space $\Theta$ that is a compact subspace of the Euclidean space such that all true parameters lie in the interior of $\Theta$. In addition, each element of $a_0$ has positive lower and upper bounds over $\Theta$.

ASSUMPTION 3.2. All the roots of $\det\bigl(I_N - \sum_{i=1}^{q} A_i x^i - \sum_{j=1}^{p} B_j x^j\bigr)$, where $I_N$ denotes the N-dimensional identity matrix, lie outside the unit circle.

ASSUMPTION 3.3. The identifiability conditions discussed in Jeantheau (1998) are satisfied.
ASSUMPTION 3.4. $\mathrm{E}|\varepsilon_{i,t}|^{6} < \infty$, i = 1, . . . , N.
3.2. The score and the Hessian of the log-likelihood function

We next define the first and second partial derivatives of (3.1). Let $S_t(\theta) = \partial\ell_t(\theta)/\partial\theta$ be the score vector for observation t, and let $\bar S(\theta) = (1/T)\sum_{t=1}^{T} S_t(\theta) = (1/T)S(\theta)$ be the average score. We use the notation $S_t(\hat\theta)$ for the score evaluated at $\theta = \hat\theta$. The population information matrix is given by the expectation of the outer product of the score evaluated at the true parameter $\theta_0$, i.e.,
$$I(\theta_0) = \frac{1}{T}\,\mathrm{E}[S(\theta_0)S(\theta_0)'] = \mathrm{E}[S_t(\theta_0)S_t(\theta_0)']. \qquad (3.3)$$
Using similar notation, the Hessian of the log-likelihood function evaluated at $\theta_0$ equals
$$H(\theta_0) = \sum_{t=1}^{T}\left.\frac{\partial^2\ell_t(\theta)}{\partial\theta\,\partial\theta'}\right|_{\theta=\theta_0} = \sum_{t=1}^{T} H_t(\theta_0). \qquad (3.4)$$
We further define the negative of the expected Hessian evaluated at $\theta_0$ by
$$J(\theta_0) = -\frac{1}{T}\,\mathrm{E}[H(\theta_0)] = -\mathrm{E}[H_t(\theta_0)]. \qquad (3.5)$$
The next two lemmas give expressions for $S_t(\theta)$ and $J(\theta)$, namely, the first- and second-order partial derivatives of (3.1) with respect to the parameters of interest.

LEMMA 3.1. (The score vector.) The score vector for observation t of (3.1) has the following form
$$S_t(\theta) = \begin{bmatrix} \partial\ell_t(\theta)/\partial\omega \\ \partial\ell_t(\theta)/\partial\rho \end{bmatrix} = -\frac{1}{2}\begin{bmatrix} \nabla D_t\,\mathrm{vec}\bigl(2D_t^{-1} - z_t z_t' P^{-1} D_t^{-1} - D_t^{-1} P^{-1} z_t z_t'\bigr) \\ \nabla P\,\mathrm{vec}\bigl(P^{-1} - P^{-1} z_t z_t' P^{-1}\bigr) \end{bmatrix} \qquad (3.6)$$
where $\nabla D_t = \partial\,\mathrm{vec}(D_t)'/\partial\omega$ and $\nabla P = \partial\,\mathrm{vec}(P)'/\partial\rho$.

For the proof of Lemma 3.1, see Section A of Nakatani and Teräsvirta (2008a).

LEMMA 3.2. (The negative of the expected Hessian.) The negative of the expected Hessian for observation t has the form
$$J(\theta) = -\mathrm{E}[H_t(\theta)] = -\mathrm{E}\begin{bmatrix} \dfrac{\partial^2\ell_t(\theta)}{\partial\omega\,\partial\omega'} & \dfrac{\partial^2\ell_t(\theta)}{\partial\omega\,\partial\rho'} \\[4pt] \dfrac{\partial^2\ell_t(\theta)}{\partial\rho\,\partial\omega'} & \dfrac{\partial^2\ell_t(\theta)}{\partial\rho\,\partial\rho'} \end{bmatrix}$$
$$= \frac{1}{2}\,\mathrm{E}\begin{bmatrix} \nabla D_t\bigl\{2\bigl(D_t^{-1}\otimes D_t^{-1}\bigr) + H_t^{-1}\otimes P + P\otimes H_t^{-1}\bigr\}\nabla D_t' & \nabla D_t\bigl\{D_t^{-1}P^{-1}\otimes I_N + I_N\otimes D_t^{-1}P^{-1}\bigr\}\nabla P' \\[4pt] \nabla P\bigl\{P^{-1}D_t^{-1}\otimes I_N + I_N\otimes P^{-1}D_t^{-1}\bigr\}\nabla D_t' & \nabla P\bigl(P^{-1}\otimes P^{-1}\bigr)\nabla P' \end{bmatrix} \qquad (3.7)$$
where $\otimes$ denotes the Kronecker product.
For the proof of Lemma 3.2, see Section B of Nakatani and Teräsvirta (2008a). Expressions (3.6) and (3.7) are rather general in that the conditional variances in $D_t$ are not specified in detail. For this reason, a number of different specifications are possible for $h_t$.
3.3. Asymptotic behaviour of the quasi-maximum-likelihood estimator

The consistency and asymptotic normality of the QMLE $\hat\theta$ were established by Ling and McAleer (2003) for a class of vector ARMA-GARCH models with constant conditional correlations. Since our ECCC-GARCH model falls into this class, we can make use of their results. When Assumptions 3.1–3.4 hold, the asymptotic normality of $\hat\theta$ is given by
$$\sqrt{T}\bigl(\hat\theta - \theta_0\bigr) \xrightarrow{D} N\bigl(0,\; J^{-1}(\theta_0)\, I(\theta_0)\, J^{-1}(\theta_0)\bigr). \qquad (3.8)$$
If we further assume that $z_t \sim N(0, P)$, (3.1) is an exact log-likelihood function, so that $I(\theta_0) = J(\theta_0)$ holds. It then follows that
$$\sqrt{T}\bigl(\hat\theta - \theta_0\bigr) \xrightarrow{D} N\bigl(0,\; I^{-1}(\theta_0)\bigr). \qquad (3.9)$$
In either case, $I(\theta_0)$ and $J(\theta_0)$ can be consistently estimated by
$$\hat I(\hat\theta) = \frac{1}{T}\sum_{t=1}^{T} S_t(\hat\theta)S_t(\hat\theta)' \qquad (3.10)$$
and by
$$\hat J(\hat\theta) = -\frac{1}{T}\sum_{t=1}^{T} H_t(\hat\theta), \qquad (3.11)$$
respectively. In developing the asymptotic theory for the QMLE of the ECCC-GARCH model, the existence of the fourth- and the sixth-order moments of $\{\varepsilon_t\}$ is necessary. Ling and McAleer (2003) and He and Teräsvirta (2004) reported the fourth-order moment conditions. Under the assumption that $z_t$ is normally distributed, simplified conditions are available in Nakatani and Teräsvirta (2008a).
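In practice, (3.10) and (3.11) can also be computed from numerical derivatives of the per-observation log-likelihood contributions, as in the following sketch. The use of the numDeriv package and the helper names are assumptions made for illustration; the analytic expressions (3.6) and (3.7) could be used instead.

```r
## Sketch: estimate I(theta) and J(theta) by (3.10)-(3.11) from per-observation
## log-likelihood contributions, using numerical derivatives (package 'numDeriv').
library(numDeriv)

estimate_I_J <- function(loglik_t, theta_hat, Tn) {
  ## loglik_t(theta, t) returns the contribution l_t(theta) of observation t
  scores <- t(sapply(1:Tn, function(t) grad(function(th) loglik_t(th, t), theta_hat)))
  I_hat  <- crossprod(scores) / Tn                               # (3.10)
  H_sum  <- Reduce(`+`, lapply(1:Tn, function(t)
                     hessian(function(th) loglik_t(th, t), theta_hat)))
  J_hat  <- -H_sum / Tn                                          # (3.11)
  list(I = I_hat, J = J_hat)
}
```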
4. TEST FOR VOLATILITY INTERACTIONS

Assuming p = q = 1, we now construct an LM test for volatility interactions with the hypothesis
H0: $A_1$ and $B_1$ in (2.3) are diagonal matrices
against the alternative
H1: either $A_1$ or $B_1$ or both are non-diagonal matrices.
The null hypothesis defines a CCC-GARCH(1, 1) model, and the alternative is an ECCC-GARCH(1, 1) model. The test may best be viewed as a test of no volatility interaction among the variables in the model, while conditional correlations between them are allowed.
Let $\tilde\theta = [\tilde\omega', \tilde\rho']'$ be the maximum-likelihood (ML) estimator of $\theta$ under the null. Since $\rho$ is a vector of nuisance parameters, the average score evaluated at $\tilde\theta$ equals
$$\bar S(\tilde\theta) = \frac{1}{T}\sum_{t=1}^{T}\begin{bmatrix} \left.\partial\ell_t(\theta)/\partial\omega\right|_{\theta=\tilde\theta} \\ \left.\partial\ell_t(\theta)/\partial\rho\right|_{\theta=\tilde\theta} \end{bmatrix} = \begin{bmatrix} \bar S_\omega(\tilde\theta) \\ 0_M \end{bmatrix}, \qquad (4.1)$$
where $0_M$ denotes an (M × 1) null vector with M = N(N − 1)/2. $\bar S_\omega(\tilde\theta)$ has (2N² + N) elements, of which the ones corresponding to the other nuisance parameters $a_0$ and the diagonal elements in $A_1$ and $B_1$ are equal to zero. To keep the notation tractable, we leave these 3N zero elements in $\bar S_\omega(\tilde\theta)$ and do not define a separate block for them.
As already mentioned, the information matrix can be consistently estimated either by (3.10) or by (3.11). Due to the fact that the score under the null (4.1) has zero elements, we only require the relevant part of the inverse of the information matrix to derive the LM_ECCC statistic. Applying the formula for the inverse of a partitioned matrix to (3.11), the inverse of the relevant block equals $J_{11}^{-1}(\tilde\theta) = (\tilde J_{11} - \tilde J_{12}\tilde J_{22}^{-1}\tilde J_{12}')^{-1}$, where
$$J(\tilde\theta) = \begin{bmatrix} \tilde J_{11} & \tilde J_{12} \\ \tilde J_{12}' & \tilde J_{22} \end{bmatrix}. \qquad (4.2)$$
In (4.2), $\tilde J_{11}$ and $\tilde J_{22}$ correspond to the second partial derivatives with respect only to $\omega$ and to $\rho$, respectively, and $\tilde J_{12}$ contains the cross-derivatives, all evaluated at $\theta = \tilde\theta$. The partitioning in (4.2) corresponds to the one in (3.7). We are now able to state the main result.

THEOREM 4.1. (The LM test statistic.) Let Assumptions 3.1–3.3 hold and assume that the fourth-order moment matrix of $\{\varepsilon_t\}$ exists. Then, the LM test statistic for testing $H_0$, given by the quadratic form
$$LM_{ECCC} = T\,\bar S_\omega(\tilde\theta)'\, J_{11}^{-1}(\tilde\theta)\,\bar S_\omega(\tilde\theta), \qquad (4.3)$$
has an asymptotic $\chi^2$ distribution with 2N(N − 1) degrees of freedom when the null hypothesis is valid.

A bivariate illustration of this general result can be found in Nakatani and Teräsvirta (2008a).
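Given the average score of the restricted (CCC) model with respect to $\omega$ and the partitioned estimate of J in (4.2), the statistic (4.3) and its asymptotic p-value can be assembled as in the following sketch (illustrative names; not the authors' code).

```r
## Sketch: the LM_ECCC statistic (4.3) and its chi-squared p-value.
## 's_omega' is the average score w.r.t. omega evaluated at the CCC estimates;
## 'J11', 'J12', 'J22' are the blocks of the estimated J in (4.2).
lm_eccc <- function(s_omega, J11, J12, J22, Tn, N) {
  J11_inv <- solve(J11 - J12 %*% solve(J22) %*% t(J12))
  stat    <- Tn * drop(t(s_omega) %*% J11_inv %*% s_omega)
  df      <- 2 * N * (N - 1)               # degrees of freedom under the null
  c(statistic = stat, df = df,
    p.value = pchisq(stat, df, lower.tail = FALSE))
}
```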
5. SIMULATION EXPERIMENTS OF THE LM_ECCC TEST STATISTIC

In this section, we conduct simulation experiments in a bivariate case to see how the proposed test behaves in finite samples.² In both size and power simulations, we use the sample sizes T = 1000, 2500, 5000 and 10,000 for each data-generating process (DGP). Additional experiments are carried out to see how the test performs when we remove the assumption of constant conditional correlations.

² All numerical calculations are carried out in the free statistical environment R ver. 2.3.1 or later (R Development Core Team, 2008). In producing source codes, we used C routines from GNU Scientific Library (GSL) ver. 1.8.
Table 1. DGPs for size simulations.
                 DGP 1             DGP 2              DGP 3              DGP 4             DGP 5
A_1              diag(0.1, 0.2)    diag(0.04, 0.05)   diag(0.04, 0.05)   diag(0.1, 0.2)    diag(0.1, 0.2)
B_1              diag(0.8, 0.7)    diag(0.95, 0.90)   diag(0.95, 0.90)   diag(0.45, 0.6)   diag(0.45, 0.6)
ρ                0.30              0.90               0.30               0.90              0.30
λ(Γ_C)           0.90              0.99               0.99               0.80              0.80
λ(Γ_{C⊗C})       0.89              0.98               0.98               0.72              0.72
Note: The constant term in the conditional variance equation is set to a_0 = [0.1 0.2]' for all DGPs. λ(Γ_C) and λ(Γ_{C⊗C}) denote the stationarity and the fourth-order moment conditions, respectively. See Section 2.2 and Nakatani and Teräsvirta (2008a) for the definitions.
Table 2. Size simulations for test of CCC-GARCH(1,1) model.
T          DGP 1    DGP 2    DGP 3    DGP 4    DGP 5
1000       0.060    0.107    0.075    0.052    0.057
2500       0.062    0.064    0.073    0.062    0.053
5000       0.055    0.061    0.063    0.056    0.051
10,000     0.055    0.063    0.058    0.053    0.057
Note: The numbers represent actual rejection frequencies in 5000 replications based on the nominal 5% level of significance.
correlation. To minimize initial effects, the first 500 observations are discarded. The number of replications equals 5000. The empirical rejection frequencies are compared with the 5% nominal significance level. 5.1. Size simulations The size simulations are carried out for five different CCC-GARCH DGPs whose parameter values can be found in Table 1. The intercept vector a 0 = [0.1 0.2] . In addition, the DGPs have the following dynamic properties. DGP 1 has moderate persistence in volatility (a ii + b ii = 0.9) with low correlation (ρ = 0.3) in z t . The parameter values for A 1 and B 1 are common to both DGPs 2 and 3 with very high persistence (a ii + b ii = 0.99, 0.95 for i = 1, 2). DGP 2 has high correlation (ρ = 0.9) whereas DGP 3 has low correlation (ρ = 0.3). DGPs 4 and 5 have low persistence in volatility with high and low correlations, respectively. The parameter values satisfy the stationarity condition (2.5) and the fourth-order moment condition (3.11) of Nakatani and Ter¨asvirta (2008a). The actual rejection frequencies at the nominal 5% level are reported in Table 2. As may be expected, the rejection frequencies are approaching the nominal significance level as the number of observations increases. There are size distortions, however, in DGPs 1–3 when T ≤ 5000. In DGPs 4 and 5, the actual sizes are close to the nominal size already when T = 1000, which is a modest number in many GARCH applications. C The Author(s). Journal compilation C Royal Economic Society 2009.
A1
B1 ρ λ( C ) λ( C⊗C )
DGP 6
0.05 0.004 0.9 0.002
0.001 0.04 0.004 0.85
Table 3. DGPs for power simulations. DGP 7 DGP 8 DGP 9 0.09 0.001 0.05 0.001 0.07 0.01 0.004 0.04 0.001 0.04 0.03 0.04 0.8 0.04 0.9 0.001 0.9 0.004 0.03 0.75 0.001 0.85 0.02 0.89
DGP 10 0.25 0.01 0.02 0.1
0.35 0.01 0.04 0.2
0.7
0.95
0.7
0.8
0.3
0.95
0.992
0.95
0.814
0.604
0.9084
0.9996
0.9076
0.8978
0.4881
Note: The constant term in the conditional variance equation is set to a 0 = [0.1 0.2] for all DGPs. λ( C ) and λ( C⊗C ) denote the stationarity and the fourth-order moment conditions, respectively. See Section 2.2 and Nakatani and Ter¨asvirta (2008a) for the definitions.
Table 4. Power simulations for test of CCC-GARCH(1,1) model.
T          DGP 6    DGP 7    DGP 8    DGP 9    DGP 10
1000       0.113    0.820    0.088    0.427    0.182
2500       0.116    0.996    0.083    0.813    0.293
5000       0.143    1.000    0.103    0.982    0.497
10,000     0.221    1.000    0.107    1.000    0.783
Note: The numbers represent actual rejection frequencies in 5000 replications based on the nominal 5% level of significance.
The size properties of the test suggest that at least a couple of thousand observations are required for empirical analyses. This requirement does not appear to be an obstacle for implementing the test since these amounts of observations are readily available in financial time series. However, our results are only valid for bivariate models, and longer series may be needed for higher-dimensional processes.

5.2. Power simulations

The DGPs for the power simulation are listed in Table 3. The weak stationarity and the fourth-order moment conditions are satisfied for all DGPs. DGP 6 has high persistence in volatility ($a_{ii} + b_{ii} = 0.99, 0.93$ for i = 1, 2) and a moderate correlation coefficient ($\rho = 0.7$). In DGP 7, one of the off-diagonal coefficients has a large value ($b_{21} = 0.02$) with high persistence as in DGP 6. DGP 8 has a design similar to DGP 7 but the off-diagonal elements have a small value ($a_{ij} = b_{ij} = 0.001$). DGPs 9 and 10 have a rather unusual structure in the sense that both DGPs have moderate persistence and large values for all off-diagonal elements.
The results are summarized in Table 4. It can be seen that the power of the test is low for DGPs 6 and 8 for all T. This is expected, however, because the true parameters under test are close to zero. Despite that, small changes in the values of $a_{21}$ and $b_{21}$ may already bring an increase in the power of the test. In all the other cases, the power reaches a reasonable level as T increases. In DGP 7, the power is already high for T = 1000, and the same is true for DGP 9 when T = 2500. DGP 10 with less variable conditional variances constitutes an exception: the
Table 5. DGPs for simulations under changing conditional correlations.
Constants DCC and EDCC [0.02 0.01]
STCC and ESTCC
[0.01 0.03]
ARCH parameters
DCC 0.04 0 0 0.06
EDCC
STCC 0.04 0 0 0.05
ESTCC 0.04 0 0 0.05
EDCC 0.9 0.004
STCC 0.94 0
ESTCC 0.9 0.004
0.09 0.001 0.004 0.04
GARCH parameters
DCC 0.95 0 0
0.93
0.02 0.89
0
0.02 0.89
0.92
Other parameters and assumptions DCC and EDCC ¯ = 0.6 Q
ρ(1) = 0, ρ(2) = 0.5
STCC and ESTCC
[α1 β1 ] = [0.05 0.8]
$s_t = h_{s,t}^{1/2} u_t$, $u_t \sim N(0, 1)$, with $h_{s,t} = 0.02 + 0.04\, s_{t-1}^2 + 0.95\, h_{s,t-1}$
γ = 5, c = 0
Note: For definitions of these models, see Engle (2002) and Silvennoinen and Teräsvirta (2005).
power is not yet high for T = 5000. Obtaining higher power for ECCC-GARCH processes such as DGP 10 requires a sample of 10,000 observations, which corresponds to a daily time series of about 40 years of data. It should be noted, however, that observed time series are typically quite different from realizations generated by such DGPs.

5.3. The test under changing conditional correlations

We conduct size and power simulations under changing conditional correlations using the DCC- and the STCC-GARCH models and their extended versions (abbreviated as the EDCC- and ESTCC-GARCH, respectively). The definition of the DGPs can be found in Table 5. For the EDCC- and the ESTCC-GARCH processes, the simulated design is such that both processes have the same parameter matrices in the conditional variance equation as DGP 7. In the STCC and the ESTCC processes, the exogenous transition variable $s_t$ is driven by a GARCH(1, 1) process following Silvennoinen and Teräsvirta (2005).
The actual rejection frequencies based on the 5% level of significance can be found in Table 6. The size of the test is distorted when the data are generated by the DCC-GARCH process, but the degree of distortion is not a monotonically increasing function of T. It is also worth emphasizing that when the DGP is an STCC-GARCH model, the size of the test converges towards the nominal level as T increases. Overall, the results suggest that the power of the test does lie in the non-zero off-diagonal elements of the parameter matrices, and that non-constancy of the conditional correlation matrix only has a minor role to play in rejecting the null hypothesis. Therefore, the test seems reasonably robust against changing conditional correlations.
Table 6. Size and power simulations under changing conditional correlations.
                 Size                Power
T          DCC      STCC       EDCC     ESTCC
1000       0.127    0.083      0.979    0.807
2500       0.110    0.063      1.000    0.994
5000       0.118    0.060      1.000    1.000
10,000     0.159    0.057      1.000    1.000
Note: The numbers represent actual rejection frequencies in 5000 replications based on the nominal 5% level of significance.
6. APPLICATIONS TO DAILY RETURN SERIES 6.1. Data In this section, the LM ECCC statistic is applied to pairs of foreign exchange rates as well as stock prices. Before fitting GARCH models, the level series are first transformed to the continuously compounded rate of returns by 100 ln (p t /p t−1 ), thereafter the sample means of returns are subtracted from the series. The considerations include daily foreign exchange rates and stock price series. 3 The exchange rates are daily noon buying rates in New York of the Japanese yen (JPY) and the Swiss franc (CHF) against the U.S. dollars certified by the Federal Reserve Bank of New York. 4 The foreign exchange rate series extend from 2 January 1975 to 2 December 2005, with the total of 7766 observations in each series. The stock prices are the daily closing prices of General Motors (GM) and IBM traded at the New York Stock Exchange, and of two Japanese leading electronic firms, NEC and Hitachi, traded at the Tokyo Stock Exchange. The sample stretches from 2 January 1962 to 28 February 2006 with 11,116 observations for the U.S. stock data, and from 4 January 1983 to 1 March 2006 with 5914 observations for the Japanese data. 5 Descriptive statistics for all the return series can be found in Table 7. As is typical for many financial time series, one can see strong excess kurtosis (KR) and non-zero skewness (SK). The Japanese stock returns are positively skewed, whereas the U.S. stocks and the foreign exchange returns contain negative skewness. The Lomnicki–Jarque–Bera (LJB) test of normality suggests non-normality of the return series. The McLeod and Li (1983) test indicates that higher-order dependence is present in the return series. As mentioned in Section 3.3, one can still obtain the quasi-maximum likelihood estimates of the parameters even when the data do not follow a multivariate normal distribution. An alternative method is to use a leptokurtic distribution such as a multivariate Student density to take into account the non-normality of the data (see e.g. Kawakatsu, 2006). However, the outlier-robust versions of the skewness (Rob.SK) and the excess kurtosis (Rob.KR) measures described in Kim and White (2004) yield different results. In particular, the robust excess kurtosis values are 3
³ To save space, the figures of the relevant time series are contained in Nakatani and Teräsvirta (2008a).
⁴ The data sets are downloadable from the Economic Research and Data section at the internet page of the Board of Governors of the Federal Reserve System (http://www.federalreserve.gov/).
⁵ The data are downloaded from Yahoo! Finance and Yahoo! Japan Finance for the U.S. and Japanese data, respectively.
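The return transformation described above amounts to the following short sketch, where prices stands for any of the level series (an illustrative name).

```r
## Sketch: continuously compounded returns in per cent, with the sample mean removed.
to_demeaned_returns <- function(prices) {
  r <- 100 * diff(log(prices))
  r - mean(r)
}
```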
Table 7. Descriptive statistics for the mean subtracted return series.
            JPY        CHF        GM          IBM         NEC        Hitachi
Minimum     −5.619     −4.400     −23.653     −26.857     −19.012    −15.466
Maximum     3.569      5.836      16.644      12.327      13.446     11.122
Std.dev     0.656      0.735      1.691       1.636       2.229      2.041
SK          −0.488     −0.016     −0.093      −0.320      0.282      0.266
KR          4.349      2.843      8.026       13.054      3.364      2.854
Rob.SK      0.010      −0.011     0.009       0.017       −0.078     −0.081
Rob.KR      0.318      0.224      0.147       0.150       0.080      0.161
LJB         6427.97    2614.75    29846.14    79106.37    2865.89    2075.84
            [0.000]    [0.000]    [0.000]     [0.000]     [0.000]    [0.000]
Q²(25)      1338.69    952.21     1499.38     902.19      1174.04    1798.48
            [0.000]    [0.000]    [0.000]     [0.000]     [0.000]    [0.000]
T           7765       7765       11115       11115       5913       5913
Note: SK and KR denote the skewness and the excess kurtosis, respectively. Rob.SK and Rob.KR are outlier-robust versions of SK and KR described as SK_2 and KR_2 in Kim and White (2004). LJB is the test of normality by Lomnicki (1961) and Jarque and Bera (1980). The numbers in square brackets are p-values. Q²(25) is the McLeod and Li (1983) portmanteau test statistic for serial correlation up to lag 25 in the squared return series.
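The conventional moment-based statistics in Table 7 can be reproduced from a demeaned return series r along the lines of the following sketch (illustrative helper name); the quantile-based robust measures of Kim and White (2004) are not reproduced here.

```r
## Sketch: skewness, excess kurtosis and the Lomnicki-Jarque-Bera statistic
## for a demeaned return series r.
describe_returns <- function(r) {
  Tn <- length(r)
  m  <- function(k) mean((r - mean(r))^k)
  SK  <- m(3) / m(2)^1.5                       # skewness
  KR  <- m(4) / m(2)^2 - 3                     # excess kurtosis
  LJB <- Tn * (SK^2 / 6 + KR^2 / 24)           # LJB statistic, chi-squared(2) under normality
  c(SK = SK, KR = KR, LJB = LJB,
    p.value = pchisq(LJB, df = 2, lower.tail = FALSE))
}
```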
Table 8. Skewness and kurtosis for the standardized residuals.
            JPY       CHF       GM        IBM       NEC       Hitachi
SK          −0.472    −0.142    −0.065    −0.103    0.287     0.268
KR          3.049     1.487     4.933     4.719     1.749     1.269
Rob.SK      −0.017    −0.015    0.009     0.013     −0.055    −0.043
Rob.KR      0.222     0.134     0.086     0.071     0.060     0.069
Note: SK and KR denote the skewness and the excess kurtosis, respectively. Rob.SK and Rob.KR are outlier-robust versions of SK and KR described as SK_2 and KR_2 in Kim and White (2004). LJB is the test of normality by Lomnicki (1961) and Jarque and Bera (1980).
much smaller than the non-robust ones. These outcomes strongly suggest that large values of the standard measures of skewness and kurtosis and, consequently, those of the LJB test statistic are caused by a small number of outliers. Apart from them, the marginal distributions of the returns do not appreciably deviate from the normal distribution.

6.2. Results of CCC/ECCC-GARCH estimation

Given the fact that the robust skewness and kurtosis measures do not contain much evidence against normality of the return distributions, we assume that $z_t \sim N(0, P)$ in the applications. This assumption is supported by the descriptive statistics of the standardized residuals reported in Table 8. According to the conventional measures of skewness and kurtosis, the standardized residuals are skewed and have fat tails. In contrast, the robustified statistics, Rob.SK and Rob.KR, do not suggest non-normality.
During iterations, only the positivity constraints on a 0 , A 1 and B 1 are imposed to alleviate numerical difficulties in parameter restrictions. The weak stationarity and the fourth-order moment conditions are checked ex post. All the estimation results are summarized in Panels A, B and C of Table 9. Within each panel, the upper half contains the results of the CCC-GARCH model, whereas the lower half has to do with the ECCC-GARCH model. 6 6.2.1. Foreign exchange returns. The estimation results of the CCC-GARCH(1, 1) and the ECCC-GARCH(1, 1) model fitted to the foreign exchange series appear in Panel A of Table 9. Values of the LM ECCC statistic are reported in the third column from the right. In Panel A (the pair of JPY and CHF returns) the test statistic has the p-value 3 × 10−6 , so the null hypothesis of no interaction is rejected at any conventional significance level. The estimates of the ECCCGARCH(1, 1) model conform to this outcome. It is seen from Panel A that both aˆ 12 = 0.0038 and bˆ21 = 0.0080 deviate significantly from zero at conventional levels of significance. This means that the lagged volatility in JPY has a positive effect on the current day’s volatility in CHF, and that the squared innovation of CHF of day t − 1 has a positive influence on the volatility of JPY at day t. Although in both cases a statistically significant spillover exists, the magnitude of the contributions of one currency to the other is small. Accordingly, the estimates of the conditional variance from the two models are similar to each other, and the same is true for the conditional correlation coefficients. 7 Furthermore, both models satisfy the conditions for the stationarity and the existence of the unconditional fourth moments. 6.2.2. Stock returns. The estimation results of the models applied to the pair of the U.S. and another pair of Japanese stock returns can be found in Panels B and C of Table 9. The values of the LM ECCC statistic are large enough to reject the null hypothesis of no volatility interaction for both pairs of returns. None of the estimated off-diagonal elements in Bˆ 1 is statistically significant, ˆ 1 are significant in the equation for the two U.S. stocks, whereas both off-diagonal estimates in A and aˆ 12 = 0.0132 is significant in the one for the Japanese pair of returns. Interestingly, the ˆ 1 is not much less than that of the diagonal magnitude of the significant off-diagonal elements in A elements. By comparison, it is much larger than in the exchange rate examples. The dynamic behaviour of volatility in stock returns is affected by the lagged squared innovation but not by the lagged conditional variance of the other asset. The estimate of the conditional correlation coefficient is positive and significant for both pairs of returns. The diagonal elements of Bˆ 1 are smaller in magnitude in the ECCC-GARCH model than in the CCC-GARCH one. This may be a consequence of the non-zero off-diagonal elements ˆ 1 in the ECCC-GARCH model. However, even here the changes in parameter estimates of A have left the estimated conditional variances and correlations unaffected. The figures reported in Nakatani and Ter¨asvirta (2008a) are similar for the extended and the standard CCC-GARCH model. Finally, the estimated stock return models satisfy the conditions for the stationarity and the existence of the fourth moments.
⁶ In estimating an ECCC-GARCH(1, 1) model, convergence is achieved in around 9 seconds for the foreign exchange returns, 50 seconds for the U.S. stock returns and 13 seconds for the Japanese stock returns, all measured in CPU time on an Intel Pentium M 2 GHz processor.
⁷ To save space, the graphs of volatility estimates are reported in Nakatani and Teräsvirta (2008a) together with the ones in the stock return applications.
Panel B
Panel A
CCC
ECCC
CCC
Model
IBM
GM
CHF
JPY
CHF
JPY
Asset
0.0684 (0.0018)
0.0263
(0.0024)
(0.0019)
(0.0022)
0.0553
(0.0019)
(0.0018)
0.0266
0.0592
6 × 10−8
0.0068 (0.0008)
(0.0017)
0.0038
(0.0012)
(0.0002)
0.0459
(0.0008)
(0.0018)
0.0012
0.0580
(0.0018)
(0.0002)
0.0079
0.0522
a0
0.0022
(0.0020)
0.9369
(0.0034)
0.0080
(0.0026)
0.9482
(0.0032)
0.9450
(0.0022)
0.9249
(0.0050)
0.9223
(0.0036)
3 × 10−5
(0.0042)
0.9278
(0.0066)
0.3809
(0.0065)
0.5414
(0.0064)
0.5413
39720.62
13904.03
13923.07
Table 9. Estimation results of bivariate CCC/ECCC-GARCH(1,1) models. A1 B1 ρ −LogLik
[8 × 10
−31
]
147.358 × 10−31 ]
[3 × 10 ]
−8
40.56
LM ECCC
λ( C )
0.9933
0.9962
0.9972
0.9960
0.9961
0.9998
λ( C⊗C )
ECCC
CCC
Hitachi
NEC
Hitachi
NEC
IBM
GM
0.0087 (0.0081)
0.0386
(0.0041)
(0.0119)
(0.0051)
0.0672
0.1011
(0.0066)
0.0519
(0.0058)
0.0132
0.0512 (0.0046)
0.0419
(0.0055)
(0.0111)
(0.0025)
0.0698
(0.0026)
0.0131
(0.0042)
0.0658
(0.0036)
(0.0025)
0.0992
0.0108
(0.0011)
(0.0022)
0.0203
0.0567
0.0259
(0.0038)
2 × 10
−9
(0.0019)
0.9006
(0.0032)
0.9139
(0.0054)
0.9281
(0.0043)
0.0012
(0.0034)
0.9382
(0.0039)
0.9147
1 × 10−6 (0.0028)
(0.0026)
8 × 10−10
(0.0015)
0.9240
0.3844
(0.0068)
0.6309
(0.0066)
0.6316
(0.0068)
23218.93
23238.84
39652.12
[3 × 10
−17
83.87 ]
0.9866
0.9894
0.9946
0.9782
0.9841
0.9947
Note: −LogLik is the negative value of the log-likelihood function at the maximum. The numbers in parentheses are ML standard errors while those in square brackets are p-values for LM ECCC statistics. The stationarity and the fourth-order moment conditions are satisfied if λ( C ) < 1 and λ( C⊗C ) < 1, respectively.
Panel C
ECCC
7. CONCLUDING REMARKS

In this article, we propose an LM test for detecting the presence of volatility interactions or transmission in the CCC-GARCH framework. Simulation experiments indicate that the test statistic has favourable finite-sample properties. Its empirical size is typically close to the nominal one for time series with 2500 observations. Besides, when the assumptions of the null model are violated by assuming that the conditional correlations of the model change over time, this violation does not have a large effect on the size, although minor size distortion is observed in the DCC type of changing conditional correlations. This is a useful property when the aim of the researcher is to investigate volatility interactions.
All three pairs of daily return series analysed in the paper seem to have volatility interactions. In the exchange rate example, the interactions appear through the lagged conditional variance, whereas in the two ECCC-GARCH models for pairs of stock returns the lagged squared innovations form the channel for interactions. Although the interaction effects found are small, they are detected by the tests. It may thus not be a good idea to exclude such interactions a priori, which makes our test useful. Since it also works reasonably well in the case of time-varying conditional correlations, the test is a practical tool in modelling volatility in multivariate financial time series.
The test is derived assuming constant conditional correlations, which is not always a realistic restriction. Although the test seems rather robust against changing conditional correlations, it is of interest to extend the test to cover CC-GARCH models with time-varying correlations. Such extensions, however, are left for future research.
ACKNOWLEDGMENTS This research has been supported by the Jan Wallander and Tom Hedelius Foundation, Project No. P2005-0033:1, and the Danish National Research Foundation. Material from this paper has been presented at the 13th Forecasting Financial Markets Conference, Aix-en-Provence, France, May 2006, useR!: the R user conference in Vienna, Austria, June 2006, the Volatility Day workshop in Stockholm, Sweden, November 2006, the 2007 Far Eastern Meeting of the Econometric Society in Taipei, Taiwan, and the 2007 European Meeting of the Econometric Society in Budapest, Hungary. We would like to thank participants for helpful comments. We are grateful to Mika Meitz and Annastiina Silvennoinen for stimulating discussions. We also acknowledge the comments from the former editor (Siem Jan Koopman) and an anonymous referee which have improved the presentation. The responsibility for any errors and shortcomings in this paper remains ours. Part of the research was conducted while the first author was visiting CREATES, University of Aarhus, whose kind hospitality is gratefully acknowledged.
REFERENCES

Baillie, R. T. and T. Bollerslev (1990). Intra-day and inter-market volatility in foreign exchange rates. Review of Economic Studies 58, 565–85.
Bera, A. K. and S. Kim (2002). Testing constancy of correlation and other specifications of the BGARCH model with an application to international equity returns. Journal of Empirical Finance 9, 171–95.
Berben, R.-P. and W. J. Jansen (2005). Comovement in international equity markets: A sectoral view. Journal of International Money and Finance 24, 832–57.
Bollerslev, T. (1990). Modeling the coherence in short-run nominal exchange rates: A multivariate generalized ARCH approach. Review of Economics and Statistics 72, 498–505. Cecchetti, S., R. Cumby, and S. Figlewski (1988). Estimation of the optimal futures hedge. Review of Economic Studies 70, 623–30. Cheung, Y.-W. and L. K. Ng (1996). A causality-in-variance test and its application to financial market prices. Journal of Econometrics 72, 33–48. Cifarelli, G. and G. Paladino (2005). Volatility linkages across three major equity markets: A financial arbitrage approach. Journal of International Money and Finance 24, 413–39. Engle, R. F. (2002). Dynamic conditional correlation: A simple class of multivariate generalized autoregressive conditional heteroskedasticity models. Journal of Business and Economic Statistics 20, 339–50. Hamao, Y., R. W. Masulis, and V. Ng (1990). Correlations in price changes and volatility across international stock markets. The Review of Financial Studies 3, 281–307. He, C. and T. Ter¨asvirta (2004). An extended constant conditional correlation GARCH model and its fourthmoment structure. Econometric Theory 20, 904–26. Hong, Y. (2001). A test for volatility spillover with application to exchange rates. Journal of Econometrics 103, 183–204. Jarque, C. M. and A. K. Bera (1980). Efficient tests for normality, homoscedasticity and serial independence of regression residuals. Economics Letters 6, 255–59. Jeantheau, T. (1998). Strong consistency of estimators for multivariate ARCH models. Econometric Theory 14, 70–86. Kawakatsu, H. (2006). Matrix exponential GARCH. Journal of Econometrics 134, 95–128. Kim, T.-H. and H. White (2004). On more robust estimation of skewness and kurtosis. Finance Research Letters 1, 56–73. Ling, S. and M. McAleer (2003). Asymptotic theory for a vector ARMA-GARCH model. Econometric Theory 19, 280–310. Lomnicki, Z. (1961). Tests for departure from normality in the case of linear stochastic processes. Metrika 4, 37–62. McLeod, A. and W. Li (1983). Diagnostic checking ARMA time series models using squared residual autocorrelations. Journal of Time Series Analysis 4, 269–73. Nakatani, T. and T. Ter¨asvirta (2008a). Appendix to Testing for Volatility Interactions in the Constant Conditional Correlation GARCH Model. Department of Economic Statistics, Stockholm School of Economics, available at http://swopec.hhs.se/hastef/abs/hastef0649.htm. Nakatani, T. and T. Ter¨asvirta (2008b). Positivity constraints on the conditional variances in the family of conditional correlation GARCH models. Finance Research Letters 5, 88–95. R Development Core Team (2008). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing, available at http://www.r-project.org. Silvennoinen, A. and T. Ter¨asvirta (2005). Multivariate autoregressive conditional heteroskedasticity with smooth transitions in conditional correlations. SSE/EFI Working Paper Series in Economics and Finance No. 577, Stockholm School of Economics. Tse, Y. K. (2000). A test for constant correlations in a multivariate GARCH model. Journal of Econometrics 98, 107–27. Wong, H. and W. K. Li (1997). On a multivariate conditional heteroscedastic model. Biometrika 84, 111–23. Wong, H., W. K. Li, and S. Ling (2000). A cointegrated conditional heteroscedastic model with financial applications. Preprint No. 2000-5, Department of Applied Mathematics, The Hong Kong Polytechnic University, Hong Kong.
The Econometrics Journal (2009), volume 12, pp. 164–186. doi: 10.1111/j.1368-423X.2008.00272.x
EM algorithms for ordered probit models with endogenous regressors

HIROYUKI KAWAKATSU† AND ANN G. LARGEY†

†Business School, Dublin City University, Dublin 9, Ireland
E-mails: [email protected], [email protected]

First version received: February 2008; final version accepted: October 2008
Summary We propose an EM algorithm to estimate ordered probit models with endogenous regressors. The proposed algorithm has a number of computational advantages in comparison to direct numerical maximization of the (limited information) log-likelihood function. First, the sequence of conditional M(aximization)-steps can all be computed analytically. Second, the algorithm updates the model parameters so that positive definiteness of the covariance matrix and monotonicity of cutpoints are naturally satisfied. Third, the variance parameters normalized for identification can be activated to accelerate convergence of the algorithm. The algorithm can be applied to models with dummy endogenous, continuous endogenous or latent endogenous regressors. A small Monte Carlo simulation experiment examines the finite sample performance of the proposed algorithms. Keywords: EM algorithm, Endogeneity, Ordered probit.
1. INTRODUCTION This paper considers limited information maximum likelihood (LIML) estimation of ordered probit models with endogenous regressors. The log-likelihood function for limited dependent variable models with endogenous regressors is relatively straightforward to write down. Despite its asymptotic efficiency and well established results for statistical inference, however, LIML does not appear to be the estimator of choice for models in this class. This is perhaps because ‘LIML suffers from a number of computational disadvantages, especially in large models. As a consequence, the LIML estimator has generally been avoided in favor of less efficient but computationally simpler estimation methods’ (Rivers and Vuong, 1988, p. 351). Rivers and Vuong (1988) proposed a simple two-step estimator that requires only least squares and estimation of a standard (non-endogenous) ordered probit model. Although consistent, this twostep estimator is generally less efficient than LIML. Newey (1987) proposed an asymptotically efficient two-step minimum chi-square estimator that applies generally to limited dependent variable models with continuous endogenous regressors. Our main contribution is to propose a numerically stable LIML estimator based on the EM algorithm. In Section 2, we describe the class of models that can be estimated by our proposed algorithm. These include ordered response models with dummy endogenous regressors (Section 2.1), with continuous endogenous regressors (Section 2.2), and with latent endogenous C The Author(s). Journal compilation C Royal Economic Society 2009. Published by Blackwell Publishing Ltd, 9600 Garsington Road,
regressors (Section 2.3). We focus our discussion on the dummy endogenous regressor model as we are not aware of generally applicable consistent two-step type estimators for this class of models. Empirical application of this type of model has been considered, for example, by Evans and Schwab (1995) and Brueckner and Largey (2008) and has attracted some attention in the recent literature of treatment effects where the treatment status may be endogenous (Angrist 2001, Abadie 2003). Section 3 describes the proposed EM algorithm. The EM algorithm has the well known general feature that it monotonically increases the log-likelihood function at each iteration. For the algorithm to be usefully applied in practice, both the E(xpectation)-step and M(aximization)step of the algorithm must be tractable and easy to implement. Section 3.1 shows that the E-step for the class of models under consideration boils down to computation of the conditional first two moments from the multivariate Gaussian distribution. For models with more than one endogenous dummy regressor, the E-step is the main computational bottleneck as the multiple integral needs to be evaluated for each observation. However, for the empirically important special case of one endogenous regressor, we show how to compute the E-step using a bivariate normal cumulative distribution function (cdf) routine. For the continuous endogenous regressors model, the E-step only requires the first two moments from the univariate truncated normal distribution which can be easily computed. The key feature of our algorithm is that the M-step can be computed analytically without the need for numerical optimization. To achieve this analytical tractability, Section 3.2 shows how to break up the M-step into a sequence of conditional M-steps (Meng and Rubin, 1993). As shown by Meng and Rubin (1993), the Expectation-Conditional-Maximization (ECM) algorithm maintains the property that each iteration of the algorithm monotonically increases the likelihood function. The regression parameters in our ECM algorithm are updated by least squares. Another important feature of our ECM algorithm is that certain parameter restrictions are naturally satisfied when the estimates are updated in the M-step. These restrictions include positive definiteness of the covariance matrix and monotonicity of the cutpoint parameters. A well-known drawback of the EM algorithm is that convergence can be slow. Liu et al. (1998) proposed a parameter expansion algorithm (PX-ECM) that can accelerate convergence. In Section 3.3, we show how parameter expansion naturally arises in our application due to the need to normalize certain variance parameters for identification. Section 4 reports results from a small Monte Carlo simulation experiment that examines the finite sample performance of the proposed ECM algorithms. Section 4.1 examines the accuracy and computational cost of using Monte Carlo integration for the conditional moments required for the E-step. In Section 4.2, we examine how the EM algorithms compare with direct maximization of the log likelihood in obtaining LIML estimates. We compare EM algorithms with and without parameter expansion and variants where the covariance parameters are updated in separate steps. We find that the PX-ECM algorithm substantially accelerates convergence of the ECM algorithm.
2. ORDERED PROBIT WITH ENDOGENOUS REGRESSORS

2.1. Dummy endogenous regressors

The EM algorithms we propose depend on whether the endogenous regressors are dummy variables or continuous regressors. We focus our exposition on the following model with
endogenous dummy regressors:
$$y_i^* = Y_i'\beta + X_{1i}'\gamma + u_i, \qquad i = 1, \ldots, n \qquad (2.1a)$$
$$Y_i^* = \Pi_1 X_{1i} + \Pi_2 X_{2i} + V_i = \Pi X_i + V_i \qquad (2.1b)$$
$$\begin{pmatrix} u_i \\ V_i \end{pmatrix} \sim N(0, \Sigma), \qquad \Sigma = \begin{pmatrix} \sigma_{11} & \sigma_{21}' \\ \sigma_{21} & \Sigma_{22} \end{pmatrix} \qquad (2.1c)$$
$$y_i = j \ \text{ if } \alpha_{j-1} \le y_i^* < \alpha_j, \qquad j = 1, \ldots, m, \qquad \alpha_0 = -\infty, \quad \alpha_m = +\infty \qquad (2.1d)$$
$$Y_i = 1 \ \text{ if } 0 \le Y_i^*, \qquad Y_i = 0 \ \text{ otherwise.} \qquad (2.1e)$$
$y_i$ is the observed scalar dependent variable with m ordered outcomes and $y_i^*$ is the continuous latent variable underlying $y_i$. $Y_i$ is the r × 1 vector of endogenous dummy regressors and $Y_i^*$ is the corresponding vector of continuous latent variables underlying $Y_i$. $X_{1i}$ is the s × 1 vector of included exogenous regressors and $X_{2i}$ is the (k − s) × 1 vector of excluded exogenous regressors, the so-called instruments. (The analysis that follows will be conditional on the exogenous regressors $X_i$. The conditioning on $X_i$ will not be made explicit to avoid notational clutter.) The error terms $(u_i, V_i')'$ are assumed to have a joint Gaussian distribution with mean zero and (r + 1) × (r + 1) covariance matrix $\Sigma$. For later use, we partition $\Sigma$ into the scalar $\sigma_{11}$, the r × 1 vector $\sigma_{21}$ and the r × r submatrix $\Sigma_{22}$. Denote the (r + s + kr + (r + 1)(r + 2)/2 + m − 1) × 1 parameter vector to be estimated as $\theta = (\beta', \gamma', \mathrm{vec}(\Pi)', \mathrm{vech}(\Sigma)', \alpha')'$.¹
In empirical work, the parameters θ are usually not of direct interest on their own (Angrist, 2001). More commonly, we are interested in how the various regressors in (2.1a) affect the probability of ordered outcomes y. In particular, the partial effect of an endogenous regressor $Y_k$ is given by $\Pr(y = j \mid Y_k = 1, Z = \bar Z) - \Pr(y = j \mid Y_k = 0, Z = \bar Z)$, where $\bar Z$ denotes the values at which the other regressors in (2.1a) are to be evaluated. These partial effects generally depend on both the entire parameter vector θ and the regressor values $\bar Z$. The rather strong parametric assumptions in model (2.1) allow straightforward evaluation of these partial effects of interest once we obtain an estimate of θ. EM algorithms to obtain limited information maximum likelihood (LIML) estimates of θ are described in the next section.
An alternative method to obtain maximum likelihood estimates is to numerically maximize the (observed) log-likelihood function $\ell(\theta)$. $\ell(\theta)$ for model (2.1) is relatively straightforward to write down and takes the form
$$\ell(\theta) = \sum_{i=1}^{n}\log \Pr\bigl(\alpha_{y_i - 1} \le y_i^* < \alpha_{y_i},\ b_i \le Y_i^* < \bar b_i\bigr), \qquad (2.2)$$
where $b_i$, $\bar b_i$ are r × 1 vectors with j-th element
$$(b_{ij}, \bar b_{ij}) = \begin{cases} (-\infty, 0) & \text{if } Y_{ij} = 0 \\ (0, +\infty) & \text{if } Y_{ij} = 1 \end{cases}, \qquad j = 1, \ldots, r.$$
1 vec(A) is the vector obtained from stacking the columns of matrix A on top of each other and vech(A) is the vector obtained from stacking the lower triangular part of the symmetric matrix A on top of each other.
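For the empirically important case of a single endogenous dummy (r = 1), the model and the likelihood (2.2) can be sketched as follows. The bivariate rectangle probability is computed with pmvnorm from the mvtnorm package; this choice, the parameter values and the helper names are illustrative assumptions and not the authors' implementation.

```r
## Sketch (r = 1): simulate model (2.1) and evaluate the log-likelihood (2.2).
library(mvtnorm)

set.seed(1)
n <- 500; beta <- 0.8; gamma <- 0.5; pi1 <- 0.3; pi2 <- 1.0
alpha <- c(-Inf, -0.5, 0.5, Inf)               # cutoffs for m = 3 ordered outcomes
Sigma <- matrix(c(1, 0.4, 0.4, 1), 2, 2)       # unit diagonals for identification

x1 <- rnorm(n); x2 <- rnorm(n)                 # included / excluded exogenous regressors
uv <- rmvnorm(n, sigma = Sigma)                # (u_i, V_i)
Ystar <- pi1 * x1 + pi2 * x2 + uv[, 2]         # (2.1b)
Y     <- as.numeric(Ystar >= 0)                # endogenous dummy, (2.1e)
ystar <- beta * Y + gamma * x1 + uv[, 1]       # (2.1a)
y     <- cut(ystar, breaks = alpha, labels = FALSE)   # observed ordered outcome, (2.1d)

loglik_i <- function(i) {                      # contribution of observation i to (2.2)
  b <- if (Y[i] == 1) c(0, Inf) else c(-Inf, 0)
  log(pmvnorm(lower = c(alpha[y[i]], b[1]),
              upper = c(alpha[y[i] + 1], b[2]),
              mean  = c(beta * Y[i] + gamma * x1[i], pi1 * x1[i] + pi2 * x2[i]),
              sigma = Sigma))
}
sum(sapply(1:n, loglik_i))                     # log-likelihood at the true parameters
```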
The difficulty in evaluating $\ell(\theta)$ is that the probability in (2.2) requires evaluation of an r + 1 dimensional multivariate normal integral. The numerical evaluation of this integral is a major source of the computational cost of direct maximum likelihood estimation. Moreover, when numerically maximizing $\ell(\theta)$, we need to ensure that the parameter estimates satisfy certain constraints. These include positive definiteness of the covariance matrix and monotonicity of the cutoff points $\alpha_1 < \cdots < \alpha_{m-1}$. These constraints are commonly imposed via transformations of unconstrained parameters. A commonly used reparametrization of the covariance matrix to ensure positive definiteness is to estimate its lower triangular Cholesky factor L with positive diagonals such that $LL' = \Sigma$.² To ensure monotonicity of the cutoff points, we may use the reparametrization $\alpha_1 = a_1$, $\alpha_j = \alpha_{j-1} + \exp(a_j)$ for j = 2, . . . , m − 1 and estimate the unconstrained parameters $a_j$ instead of $\alpha_j$.
For identification purposes, we need to impose additional restrictions on the parameters. The probability in the likelihood (2.2) is invariant to scale shifts. Thus the (r + 1) (positive) diagonal elements of the covariance matrix are arbitrary. One natural identification restriction is to set these diagonal elements to unity and estimate the correlations in $\Sigma$. To respect this identification restriction, the Cholesky transform discussed above needs to be modified so that the diagonals of $LL'$ are unity. The free parameters in L are now the r(r + 1)/2 strict lower triangular elements. The diagonal elements of L are then set to $L_{1,1} = 1$ and $L_{i,i} = \sqrt{1 - \sum_{j=1}^{i-1} L_{i,j}^2}$ for 1 < i ≤ r + 1. To identify all cutoff parameters $\alpha_1, \ldots, \alpha_{m-1}$, the constant term must be excluded from the regressor list $X_{1i}$. Alternatively, we can arbitrarily fix a cutoff parameter, say $\alpha_1 = 0$, and instead estimate the constant term as part of γ. For the binary outcome case with m = 2, this is the normalization commonly used in probit models. After imposing these identification restrictions, the number of parameters to estimate is r + s + kr + r(r + 1)/2 + m − 2.

2.2. Continuous endogenous regressors

In this section, we briefly discuss the model where the endogenous regressors $Y_i$ in (2.1a) are continuous and observable to the econometrician. In this case, $Y_i^*$ in (2.1b) is replaced by $Y_i$. Two-step estimation of this model has been considered by Newey (1987) and Rivers and Vuong (1988). The log-likelihood function for this model can be written as
$$\ell(\theta) = \sum_{i=1}^{n}\bigl[\ell(y_i \mid Y_i, X_i, \theta) + \ell(Y_i \mid X_i, \theta)\bigr]$$
$$= \sum_{i=1}^{n}\left\{\log\left[\Phi\!\left(\frac{\alpha_{y_i} - \mu_i}{\sqrt{\sigma_{1|2}}}\right) - \Phi\!\left(\frac{\alpha_{y_i - 1} - \mu_i}{\sqrt{\sigma_{1|2}}}\right)\right] - \frac{1}{2}\bigl[r\log(2\pi) + \log|\Sigma_{22}| + V_i'\Sigma_{22}^{-1}V_i\bigr]\right\}, \qquad (2.3)$$
where $\Phi(\cdot)$ is the cumulative distribution function (cdf) of the standard normal distribution and
$$\mu_i \equiv Y_i'\beta + X_{1i}'\gamma + V_i'\Sigma_{22}^{-1}\sigma_{21} \qquad (2.4a)$$
$$\sigma_{1|2} \equiv \sigma_{11} - \sigma_{21}'\Sigma_{22}^{-1}\sigma_{21}. \qquad (2.4b)$$
2 For alternative parameterizations of the covariance matrix that ensure positive definiteness, see Pinheiro and Bates (1996).
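The two reparameterizations described in the previous subsection can be coded directly. The following sketch (illustrative helper names, not the authors' code) returns monotone cutpoints from unconstrained values and a unit-diagonal matrix $LL'$ from the strict lower-triangular elements of L, under the assumption that the free elements keep each row sum of squares below one.

```r
## Sketch: monotone cutpoints alpha_1 = a_1, alpha_j = alpha_{j-1} + exp(a_j)
cutpoints <- function(a) cumsum(c(a[1], exp(a[-1])))

## Sketch: unit-diagonal covariance (correlation) matrix Sigma = L L' from the
## r(r+1)/2 strict lower-triangular elements of L (filled column by column).
corr_from_chol <- function(l, r) {
  L <- diag(r + 1)
  L[lower.tri(L)] <- l
  for (i in 2:(r + 1)) L[i, i] <- sqrt(1 - sum(L[i, 1:(i - 1)]^2))
  L %*% t(L)
}

cutpoints(c(-0.5, 0))                  # e.g. gives alpha = (-0.5, 0.5)
corr_from_chol(c(0.3, 0.2, 0.1), r = 2)
```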
The main difference between this model and the model with dummy endogenous regressors in the previous section is that the likelihood function is much easier to evaluate in the continuous case. We only need to evaluate the univariate standard normal cdf irrespective of the number of endogenous regressors r. This makes LIML estimation of this model computationally less challenging than the dummy endogenous regressors case. As shown below, the EM algorithm for this model is considerably simpler to apply than for the dummy endogenous model. For the continuous case, the diagonals of the submatrix 22 are identified and are to be estimated as part of the parameter vector. The scale parameter σ 11 for the ordered outcome equation (2.1a) is still unidentified and can be normalized to σ 11 = 1. The decomposition (2.3) of the joint likelihood into the product of the conditional and marginal likelihoods suggests an alternative normalization σ 1|2 = 1. This is the normalization used by Rivers and Vuong (1988) and simplifies the computation of the conditional log-likelihood (y i |Y i , X i , θ ) and especially its scores. Under this normalization, the two-step estimates (Newey, 1987, Rivers and Vuong, 1988) and LIML estimates are numerically identical for just-identified models (r = k − s). For the continuous case we thus have r additional parameters to estimate compared to the dummy endogenous case. 2.3. Latent and mixed endogenous regressors In this section we consider other variations of the ordered probit model with endogenous regressors. 3 For the dummy endogenous model considered in Section 2.1, the endogenous regressor in the ordered outcome equation may be the latent variable Y i∗ rather than the observed dummy variable Yi . 4 yi∗ = Yi∗ β + X1i γ + ui
Yi∗ = Xi + Vi . Semi-parametric estimation of models of this type was considered by Lee (1995). The loglikelihood for this model takes the same form (2.2) with a slightly different joint multivariate Gaussian distribution. 5 The computational difficulty of LIML estimation of this model is essentially the same as the dummy endogenous regressor case and requires evaluation of r + 1 dimensional multivariate normal integrals. We can also consider models where there are mixed types of endogenous regressors. For example, the r endogenous variables Y i could be of two types: r 1 dummy variables Y1i (with underlying continuous variables Y ∗1i ) and r 2 continuous variables Y2i where r 1 + r 2 = r. The observed log-likelihood for this model takes the form (θ ) =
$$\sum_{i=1}^{n}\log \Pr\bigl(\alpha_{y_i - 1} \le y_i^* < \alpha_{y_i},\ b_i \le Y_{1i}^* < \bar b_i,\ Y_{2i}\bigr).$$
For identification, we set the first $1 + r_1$ diagonal elements of $\Sigma$ to unity. To evaluate the likelihood for this model, we require evaluation of multivariate integrals of dimension $1 + r_1$.
³ These and several other variants are considered in Maddala (1983, 5.7).
⁴ Models of this type are of interest when the latent variables $Y_i^*$ are measurable quantities but are only available as indicators in the sample data.
⁵ In this case
$$\begin{pmatrix} y_i^* \\ Y_i^* \end{pmatrix} \sim N\!\left(\begin{pmatrix} X_i'\Pi'\beta + X_{1i}'\gamma \\ \Pi X_i \end{pmatrix},\ \begin{pmatrix} \sigma_{11} + 2\beta'\sigma_{21} + \beta'\Sigma_{22}\beta & \sigma_{21}' + \beta'\Sigma_{22} \\ \sigma_{21} + \Sigma_{22}\beta & \Sigma_{22} \end{pmatrix}\right).$$
3. EM ALGORITHM To address the computational difficulties with the LIML estimator, we develop an EM algorithm as a computational device to obtain LIML estimates. EM algorithms generally produce parameter estimates that should be numerically identical to the estimates obtained from directly maximizing the (observed) log-likelihood functions for well behaved problems. A well known feature of the algorithm is that updating parameter estimates at each iteration monotonically increases the loglikelihood function. Our proposed algorithm has the following additional computational features. First, the sequence of conditional M-steps can all be computed analytically. In particular, most parameter estimates can be updated by solving a system of linear equations for which there are well established computationally efficient algorithms. For the continuous endogenous regressors case, these updates can be carried out via least-squares regressions. Second, the parameters are updated in such a way that certain restrictions are naturally satisfied without the need of using transforms. These restrictions include positive definiteness of the covariance matrix and monotonicity of cutoff parameters. In this section, we describe the EM algorithm for the dummy endogenous regressors case discussed in Section 2.1. The changes in the algorithm for the continuous (Section 2.2) and latent (Section 2.3) endogenous regressors models are described in the Appendix. 3.1. E-step The E-step requires computation of the expected complete data log likelihood q(θ ) ≡ E[ (y ∗ , Y ∗ )|y, Y , θ ] where y ∗ = (y ∗1 , . . ., y ∗n ), y = (y 1 , . . ., y n ) and similarly for Y ∗ , Y . This notation emphasizes the fact that q(θ ) is the expectation conditional on all observed data as well as the current parameter values. As pointed out by Ruud (1991), the main difficulty for the ordered probit model is that the cutoff parameters α are not identified in the EM algorithm as the latent variable y∗ in (2.1a) depends on the parameter α. To remove this parameter dependency of y∗ , we follow Ruud (1991) and apply the following location and scale shift to y∗ . yi = j ⇔ αj −1 ≤ yi∗ < αj ⇔ 0 ≤
$\dfrac{y_i^* - \alpha_{j-1}}{\alpha_j - \alpha_{j-1}} < 1$
and redefine the transformed variable y ∗i ← (y ∗i − α j −1 )/δj as the latent variable where δ j ≡ α j − α j −1 . 6 With this transformation, the complete data likelihood for model (2.1) is the r + 1 dimensional multivariate Gaussian p(y ∗i , Y ∗i |y i , Y i ) ∼ N (μ∗i , ∗i ) where
σ11 /δj2 σ21 /δj γ − αj −1 )/δj (Yi β + X1i ∗ ∗ μi ≡ , i ≡ . (3.1) Xi σ21 /δj 22 As the complete data likelihood belongs to the exponential family, computation of the expected complete data log-likelihood q(θ ) boils down to evaluating the complete data sufficient statistics at current parameter values. For the Gaussian distribution, the sufficient statistics are the first two conditional moments from the truncated Gaussian. Therefore, the E-step boils down 6 To handle the end point cases, the transform is modified to y ∗ ← y ∗ − α for j = 1 and to y ∗ ← y ∗ − α 1 m−1 for i i i i j = m. C The Author(s). Journal compilation C Royal Economic Society 2009.
to evaluating conditional expectations of the form a b ∗r ∗s E[y Y |y, Y ] = y ∗r Y ∗s p(y ∗ , Y ∗ |y, Y )dY ∗ dy ∗ a
=
(3.2a)
b
a a
b
y ∗r Y ∗s p(y ∗ , Y ∗ |a ≤ y ∗ ≤ a, b ≤ Y ∗ ≤ b)dY ∗ dy ∗
(3.2b)
b
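For illustration only, the conditional expectation in (3.2b) can be checked by brute-force numerical integration in the bivariate case (r = 1). The sketch below is not how the moments are computed in the paper (analytic recursions are used for r = 1 and quasi-Monte Carlo for r > 1 in Appendix B); the function name and argument conventions are ours.

```python
import numpy as np
from scipy import integrate
from scipy.stats import multivariate_normal

def truncated_moment(r, s, mean, cov, y_lo, y_hi, Y_lo, Y_hi):
    """E[y*^r Y*^s | y_lo <= y* <= y_hi, Y_lo <= Y* <= Y_hi] for a bivariate
    normal (y*, Y*) ~ N(mean, cov), by direct numerical integration.
    Exposition only; see Appendix B for the methods actually used."""
    rv = multivariate_normal(mean, cov)
    num, _ = integrate.dblquad(
        lambda Y, y: y**r * Y**s * rv.pdf([y, Y]), y_lo, y_hi, Y_lo, Y_hi)
    den, _ = integrate.dblquad(
        lambda Y, y: rv.pdf([y, Y]), y_lo, y_hi, Y_lo, Y_hi)
    return num / den
```

Passing ±np.inf for the open-ended limits covers the end-point categories.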
As these expectations involve (r + 1)-dimensional integrals and need to be evaluated for each observation, the E-step is the computational bottleneck for the dummy endogenous regressor model. In the Appendix, we provide details of how to carry out the E-step for r > 1 using Monte Carlo integration based on the method of Genz (1992). For the special case r = 1, we also show how the conditional expectations can be evaluated using a bivariate normal cdf routine without Monte Carlo integration.

To derive the M-step, we express the expected complete data log-likelihood in two ways. An expression based on the joint distribution is
$$q(\theta) = -\frac{1}{2}\sum_{i=1}^n\Big[(r+1)\log(2\pi) + \log|\Sigma| - \log\delta_j^2 + \mathrm{tr}\big(\Sigma^{-1}E[x_i^*x_i^{*\prime}\,|\,y_i, Y_i]\big) + v_i'\Sigma^{-1}v_i - v_i'\Sigma^{-1}\bar x_i^* - \bar x_i^{*\prime}\Sigma^{-1}v_i\Big], \tag{3.3a}$$
where
$$x_i^* \equiv \begin{pmatrix}\delta_j y_i^*\\ Y_i^*\end{pmatrix}, \qquad \bar x_i^* \equiv E\big[x_i^*\,|\,y_i, Y_i\big], \qquad v_i \equiv \begin{pmatrix}Y_i'\beta + X_{1i}'\gamma - \alpha_{j-1}\\ \Pi'X_i\end{pmatrix} \tag{3.3b}$$
and the conditional expectations are computed from (3.2). This expression turns out to be useful for updating the (r + 1)-dimensional covariance matrix Σ as a block in the conditional M-step described below. For updating the other parameters, we use an alternative (but equivalent) expression based on the decomposition of the joint distribution into the product of the conditional and marginal distributions. Denote the conditional expectations computed from (3.2) as $\bar y_i^* \equiv E[y_i^*|y_i, Y_i]$ (scalar) and $\bar Y_i^* \equiv E[Y_i^*|y_i, Y_i]$ (r × 1) for the conditional means, and $\widehat\sigma^{*2}_{11,i} \equiv E[y_i^{*2}|y_i, Y_i] - (\bar y_i^*)^2$ (scalar), $\widehat\Sigma^*_{22,i} \equiv E[Y_i^*Y_i^{*\prime}|y_i, Y_i] - \bar Y_i^*\bar Y_i^{*\prime}$ (r × r) and $\widehat\sigma^*_{21,i} \equiv E[y_i^*Y_i^*|y_i, Y_i] - \bar y_i^*\bar Y_i^*$ (r × 1) for the conditional variances.^7

^7 To cover the end point cases as in footnote 6, α_0 is understood to be α_1 and δ_1 = δ_m = 1 in (3.4).

Then we can write
$$q(\theta) = -\frac{n}{2}\big[(1+r)\log(2\pi) + \log\sigma_{1|2} + \log|\Sigma_{22}|\big] - \frac{1}{2}\sum_{i=1}^n \mathrm{tr}\Big[\Sigma_{22}^{-1}\big(\widehat\Sigma^*_{22,i} + \widehat V_i\widehat V_i'\big)\Big] - \frac{1}{2}\sum_{j=1}^m\sum_{y_i=j}\Big[\frac{1}{\sigma_{1|2}}E\big[(\delta_j y_i^* - \mu_i + \alpha_{j-1})^2\,\big|\,y_i, Y_i\big] - \log\delta_j^2\Big], \tag{3.4}$$
where μ_i, σ_{1|2} are as defined in (2.4), $\widehat V_i \equiv \bar Y_i^* - \Pi'X_i$, and
$$E\big[(\delta_j y_i^* - \mu_i + \alpha_{j-1})^2\,\big|\,y_i, Y_i\big] = \delta_j^2\widehat\sigma^{*2}_{11,i} + \lambda'\widehat\Sigma^*_{22,i}\lambda - 2\delta_j\lambda'\widehat\sigma^*_{21,i} + \big(\delta_j\bar y_i^* + \alpha_{j-1} - Y_i'\beta - X_{1i}'\gamma - \widehat V_i'\lambda\big)^2 \tag{3.5}$$
with λ ≡ Σ_22^{-1}σ_21.
3.2. M-step

The M-step of the algorithm updates the parameters so as to maximize the expected complete data log-likelihood q(θ) from the previous E-step. To ensure monotonicity of the cutpoint parameters, we use the parametrization θ = (β', γ', vec(Π)', vech(Σ)', δ')', where δ is (m − 2) × 1 with typical element δ_j = α_j − α_{j−1} > 0 if m > 2.^8 The first order conditions ∂q(θ)/∂θ = 0 for the M-step do not appear to have a closed form solution. Our algorithm breaks the M-step up into a sequence of conditional M-steps where we maximize over a subset of the parameters conditional on the remaining parameter values (Meng and Rubin, 1993). By breaking the parameter vector up into natural blocks, we obtain closed form solutions and are able to perform the M-step analytically. More specifically, we update the current parameter estimates θ^(t) to θ^(t+1) in the following order (a brief numerical sketch of the root selection in step 4 follows the list).

^8 α_j = α_1 + Σ_{k=2}^{j} δ_k for j > 1.

1. Update β^(t+1) and γ^(t+1) via a least-squares regression of $(\delta_j\bar y_i^* + \alpha_{j-1} - \widehat V_i'\lambda)$ on (Y_i, X_{1i}), where the parameters in the dependent variable are evaluated at θ^(t).

2. Update Π^(t+1) by solving the system of linear equations
$$\sum_{i=1}^n\Big[\big(\lambda\lambda' + \sigma_{1|2}\Sigma_{22}^{-1}\big)\otimes X_iX_i'\Big]\,\mathrm{vec}(\Pi) = \mathrm{vec}\Big(\sum_{i=1}^n X_i\bar Y_i^{*\prime}\big(\lambda\lambda' + \sigma_{1|2}\Sigma_{22}^{-1}\big)\Big) - \sum_{i=1}^n\big(\delta_j\bar y_i^* + \alpha_{j-1} - Y_i'\beta - X_{1i}'\gamma\big)\big(\lambda\otimes X_i\big), \tag{3.6}$$
where the parameters are evaluated at β = β^(t+1), γ = γ^(t+1) and Σ = Σ^(t), δ = δ^(t).

3. Update Σ^(t+1) as
$$\Sigma = \frac{1}{n}\sum_{i=1}^n\Big[\widehat V(x_i^*) + (\bar x_i^* - v_i)(\bar x_i^* - v_i)'\Big],$$
where $\widehat V(x_i^*) \equiv E[x_i^*x_i^{*\prime}|y_i, Y_i] - \bar x_i^*\bar x_i^{*\prime}$, $\bar x_i^*$ and v_i are as defined in (3.3b), and the parameters are evaluated at β = β^(t+1), γ = γ^(t+1), Π = Π^(t+1) and δ = δ^(t). Note that this update ensures positive definiteness of Σ^(t+1).

4. If m > 2, update δ_j in the order j = 2, ..., m − 1 by solving the quadratic equation
$$\Big(\sum_{y_i=j}\big[\bar y_i^{*2} + \widehat\sigma^{*2}_{11,i}\big] + n_{j+1} + \cdots + n_m\Big)\delta_j^2 + \Big(\sum_{y_i=j}\big[(\alpha_1 + \delta_2 + \cdots + \delta_{j-1} - \mu_i)\bar y_i^* - \lambda'\widehat\sigma^*_{21,i}\big] + \sum_{k=j+1}^m\sum_{y_i=k}\big(\alpha_1 + \delta_2 + \cdots + \delta_{j-1} + \delta_{j+1} + \cdots + \delta_{k-1} + \delta_k\bar y_i^* - \mu_i\big)\Big)\delta_j - n_j\sigma_{1|2} = 0, \tag{3.7}$$
where n_j is the number of observations with y_i = j and $\mu_i \equiv Y_i'\beta + X_{1i}'\gamma + \widehat V_i'\lambda$.^9 This quadratic equation has real roots, the larger of which is positive as required for δ_j. The coefficients of the quadratic equation are evaluated at β = β^(t+1), γ = γ^(t+1), Π = Π^(t+1), Σ = Σ^(t+1), δ_2 = δ_2^(t+1), ..., δ_{j−1} = δ_{j−1}^(t+1), and δ_{j+1} = δ_{j+1}^(t), ..., δ_{m−1} = δ_{m−1}^(t).

^9 For j = 2, the first sum in the coefficient of δ_j is $\sum_{y_i=j}(\alpha_1 - \mu_i)\bar y_i^*$, while for j = m − 1, the second double sum in the coefficient of δ_j becomes $\sum_{y_i=m}(\alpha_1 + \delta_2 + \cdots + \delta_{m-2} + \bar y_i^* - \mu_i)$.
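Step 4 amounts to selecting the larger root of a quadratic. A minimal sketch of that root selection, assuming the coefficients of (3.7) have already been assembled from the E-step moments (the helper name is ours):

```python
import numpy as np

def update_delta_j(a, b, nj, sigma_1_2):
    """Solve a*d^2 + b*d - nj*sigma_1_2 = 0, cf. (3.7), and return the larger
    root, which is the admissible (positive) cutpoint increment delta_j.
    a and b are the assembled quadratic and linear coefficients."""
    disc = b * b + 4.0 * a * nj * sigma_1_2   # discriminant; >= 0 when a > 0
    roots = np.array([(-b - np.sqrt(disc)) / (2.0 * a),
                      (-b + np.sqrt(disc)) / (2.0 * a)])
    return roots.max()
```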
As shown by Meng and Rubin (1993), each step of the ECM algorithm monotonically increases the (observed) likelihood function.

3.3. Parameter expansion and identification

The main feature of the ECM algorithm is its monotonic (in observed likelihood) convergence towards a local optimum. However, a well-known drawback of the EM algorithm is that convergence can be slow. A solution proposed by Liu et al. (1998) is to expand the complete-data model with a larger parameter space without altering the original observed-data model. The key to implementing this parameter expansion (PX-EM) algorithm is to find a suitable expansion of the parameter space that accelerates convergence. The probit example (without endogeneity) considered in Liu et al. (1998) suggests a PX-ECM algorithm for the ordered probit model (2.1) in which the normalized parameters are 'activated' and included in the parameter vector to be estimated.^10

^10 The expanded or activated parameter is called a working parameter in Meng and van Dyk (1997).

The ECM algorithm described above does not impose the identification restrictions discussed in Section 2.1, i.e. unit diagonals of Σ. Though the M-step updates the parameter Σ so that it is positive definite, it does not impose the unit diagonals restriction. In fact, it is rather cumbersome to update only the strict lower triangular elements of Σ. For the dummy endogenous regressors model, the PX-ECM algorithm in which the diagonals of Σ are activated therefore arises naturally as shown above. To complete the description of our PX-ECM algorithm, we describe how to normalize the expanded parameter space. After each M-step, we normalize the parameter vector as
$$\theta_N = \big(\beta'/\sqrt{\sigma_{11}},\ \gamma'/\sqrt{\sigma_{11}},\ \mathrm{vec}\big(\Pi D_{22}^{-1/2}\big)',\ \mathrm{vech}\big(D^{-1/2}\Sigma D^{-1/2}\big)',\ \delta'/\sqrt{\sigma_{11}}\big)', \tag{3.8}$$
where D = diag(Σ) is the (r + 1) × (r + 1) diagonal matrix with diagonal elements from the (unnormalized) Σ, and D_22 is its r × r lower right submatrix. After each M-step, we check convergence based on the normalized parameter vector θ_N. As shown by Liu et al. (1998), monotonic convergence of the ECM algorithm is preserved for the PX-ECM algorithm.
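A minimal sketch of the normalization (3.8), assuming the unnormalized parameters are held in plain NumPy arrays (argument names are ours):

```python
import numpy as np

def normalize_px(beta, gamma, Pi, Sigma, delta):
    """Normalization (3.8) applied after each M-step of the PX-ECM algorithm.
    Sigma is the (r+1) x (r+1) unnormalized covariance; its (1,1) element is
    sigma_11 and its lower-right r x r block corresponds to D_22."""
    s11 = np.sqrt(Sigma[0, 0])
    d = np.sqrt(np.diag(Sigma))              # sqrt of diag(Sigma)
    D_inv_half = np.diag(1.0 / d)            # D^{-1/2}
    D22_inv_half = np.diag(1.0 / d[1:])      # D_22^{-1/2}
    Sigma_n = D_inv_half @ Sigma @ D_inv_half  # unit-diagonal (correlation) form
    Pi_n = Pi @ D22_inv_half                 # rescale reduced-form coefficients
    return beta / s11, gamma / s11, Pi_n, Sigma_n, delta / s11
```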
3.4. Special cases

We first consider the model with continuous endogenous regressors discussed in Section 2.2. The E-step for this model is considerably simpler as it boils down to obtaining the first two moments from the univariate truncated Gaussian distribution given in Appendix A.2. The M-step can be considered a special case of that described in Section 3.2 where $Y_i^* = \bar Y_i^* = Y_i$, $\widehat V_i = V_i = Y_i - \Pi'X_i$, and $\widehat\sigma^*_{21,i} = \widehat\Sigma^*_{22,i} = 0$. As the diagonals of Σ_22 are identified without normalization for the continuous endogenous regressors case, the normalization of the parameter vector in (3.8) is done with D replaced by the identity matrix except for one element, D_11 = σ_11. As this case has only one unidentified parameter, σ_11, an EM algorithm without parameter expansion can be constructed based on the parametrization
$$\theta = \big(\beta',\ \gamma',\ \mathrm{vec}(\Pi)',\ \lambda',\ \mathrm{vech}(\Sigma_{22})',\ \delta'\big)', \tag{3.9}$$
where λ ≡ Σ_22^{-1}σ_21.^11 The M-step can then be based solely on the partitioned form of the expected complete data log-likelihood (3.4), with either the σ_11 = 1 or the σ_{1|2} = 1 normalization. This variant of the M-step, which can also be used for the dummy endogenous regressors model, is described in Appendix A.3. As this variant breaks the Σ block up into further sub-blocks of λ and Σ_22, it can be considered an alternative form of conditioning the M-step.

^11 The λ parametrization was also used in Smith and Blundell (1986) and Rivers and Vuong (1988).

The algorithm for the latent endogenous regressors model in Section 2.3 requires certain modifications to the M-step, which are outlined in Appendix A.4. The E-step is essentially unchanged from the dummy endogenous regressors case (except for the mean and variance of the complete data likelihood) and remains computationally expensive.
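The overall structure of the resulting (PX-)ECM iteration can be summarized as follows. This is only a schematic driver loop under our own naming conventions; the E-step, the conditional M-steps and the normalization are the ones described above, passed in as callables, and the tolerances are illustrative.

```python
import numpy as np

def px_ecm(theta0, e_step, cm_steps, normalize, loglik,
           tol_l=1e-10, tol_t=1e-8, max_iter=9000):
    """Outer loop of a (PX-)ECM algorithm: alternate the E-step, a sequence of
    analytic conditional M-steps and the normalization (3.8), stopping when
    the observed log-likelihood or the parameter vector stops changing."""
    theta = np.asarray(theta0, dtype=float)
    ll = loglik(theta)
    for it in range(max_iter):
        stats = e_step(theta)               # conditional moments, cf. (3.2)
        theta_new = theta.copy()
        for step in cm_steps:               # conditional M-steps, Section 3.2
            theta_new = step(theta_new, stats)
        theta_new = normalize(theta_new)    # PX normalization, cf. (3.8)
        ll_new = loglik(theta_new)          # monotone by construction
        if (ll_new - ll < tol_l) or np.max(np.abs(theta_new - theta)) < tol_t:
            return theta_new, ll_new, it + 1
        theta, ll = theta_new, ll_new
    return theta, ll, max_iter
```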
4. MONTE CARLO SIMULATIONS

In this section we examine the performance of the proposed PX-ECM algorithms via Monte Carlo simulations. For the data generating process, we consider model (2.1) with m = 5 ordered outcomes and δ_j = α_j − α_{j−1} = 1. To cover a wide variety of parameter configurations, we redraw all exogenous data and parameters for each Monte Carlo replication. Data for the k exogenous variables X are all drawn from the standard normal distribution N(0, 1), except for the first column of X_1, which is always a constant term. The parameters β, γ, Π are also drawn from N(0, 1), except for the constant terms. The constants in Π are set to zero, while the constant in γ is set to (m − 2)δ_j/2 so that the ordered outcomes do not concentrate at the endpoints. The strict lower triangular elements of the Cholesky factor L of the covariance matrix Σ are drawn uniformly from the interval [−1/2, 1/2], and the diagonal elements of L are set to ensure unit diagonals of Σ as described in Section 2.1.

4.1. E-step with Monte Carlo integration

As discussed in Section 3.1, the main computational cost of the EM algorithm is the computation of conditional moments for the E-step. For models with dummy endogenous regressors, the E-step requires evaluation of multivariate normal integrals. We first assess the accuracy and computational cost of using Monte Carlo integration to carry out the E-step. To do so, we simulate one data set for the case r = 1 with n = 1000 observations, s = 10 included exogenous regressors and one extra instrument (k = r + s + 1). We evaluate both the (observed) log-likelihood value (2.2) and the conditional moments (3.2) using the algorithms described in Appendices B.1 and B.2. The results from using the BVND routine of Genz (2004) mentioned in Appendix B.2 are taken to be 'exact' as they are accurate to about 16 digits. For Monte Carlo integration, it is now well documented that methods based on quasi-random numbers are more effective than those based on pseudo-random draws (Genz and Bretz, 2002, Sándor and András, 2004). Here we compare two widely used deterministic quasi-random sequences (Halton and Sobol) and a randomized Richtmeyer sequence (Genz and Bretz, 2002).
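For reference, quasi-random uniforms of the kind used in this comparison can be generated as in the following sketch. scipy.stats.qmc provides Sobol and Halton sequences; the randomized Richtmeyer sequence used in the paper is not available there and is replaced by a plain pseudo-random fallback (the function name is ours).

```python
import numpy as np
from scipy.stats import qmc

def uniform_draws(dim, n_draws, kind="sobol", seed=0):
    """Uniform variates on [0, 1]^dim for the Monte Carlo integration of the
    E-step, from a quasi-random sequence or an ordinary pseudo-random one."""
    if kind == "sobol":
        return qmc.Sobol(d=dim, scramble=True, seed=seed).random(n_draws)
    if kind == "halton":
        return qmc.Halton(d=dim, scramble=True, seed=seed).random(n_draws)
    rng = np.random.default_rng(seed)       # pseudo-random fallback
    return rng.uniform(size=(n_draws, dim))
```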
Figure 1. Accuracy of Monte Carlo integration. (Panels: log-likelihood, E(y*_1), E(Y*_2), Var(y*_1), Cov(y*_1, Y*_2) and Var(Y*_2); vertical axis: mean accurate digits; horizontal axis: log10(draws); sequences: Sobol, Halton, RRich.)

Figure 1 shows how the accuracy of Monte Carlo integration improves with the number of Monte Carlo draws. The panels show the approximate number of accurate digits of Monte Carlo integration as a function of the (base 10) log number of Monte Carlo draws. Three types of quasi-random draws are compared: Halton, Sobol, and randomized Richtmeyer (denoted RRich). The first panel shows the accuracy for the (observed) log-likelihood and the remaining panels that for the conditional moments of the E-step in the EM algorithm. The figure is based on one draw of n = 1000 observations of data generated with m = 5 ordered outcomes, r = 1 endogenous regressor, s = 10 exogenous regressors, and k = r + s + 1 (one extra instrument). The number of accurate digits is approximated by the negative of the base 10 logarithm of the absolute difference in the
values from Monte Carlo integration and those from the BVND routine.^12 The number of accurate digits is computed for each observation and the average over the sample (n = 1000) is shown in Figure 1.

^12 We truncate our measure of accuracy at 16 digits (double precision) to deal with log(0).

The figure shows that the number of accurate digits is nearly linear in base 10 log draws. That is, a tenfold increase in the number of draws reduces the approximation error by roughly another decimal point. There is little difference in accuracy between the three quasi-random sequences. An important result in Figure 1 is that the log-likelihood values (which are the integrating constants of the conditional distribution needed for the E-step) are more accurately computed than the conditional moments for the same number of draws. Although not reported, the standard deviation of the accurate digits (over the n = 1000 observations) is much larger for the log-likelihood than for the conditional moments. For example, for the randomized Richtmeyer sequence (RRich) with 10^6 draws, the standard deviation is 2.45 for the log-likelihood and in the range 0.38–0.91 for the conditional moments.

As for the computational cost, the execution time of Monte Carlo integration is roughly linear in the number of draws. In our implementation with 10^4 draws, evaluation of the log-likelihood takes about 38 seconds and the E-step about 60 seconds for n = 1000 observations. By comparison, the algorithm in Appendix B.2 using the BVND routine takes about 0.004 seconds for the log-likelihood and 0.006 seconds for the E-step. As we increase the dimension of the integral r (the number of endogenous regressors), the computation time increases, but less than linearly as we compute all moments simultaneously. For 10^4 draws and n = 1000 observations, evaluation of the log-likelihood takes about 70 seconds for r = 2 and about 106 seconds for r = 3, and the E-step takes about 98 seconds for r = 2 and about 138 seconds for r = 3. Finally, though there are some differences in the computational cost of obtaining the different quasi-random sequences used in the Monte Carlo integration, these differences are negligible for any decent sample size n as we use common random numbers which are drawn once at the beginning.

4.2. Estimation with EM algorithms

We next examine the performance of the EM algorithm as an estimator to obtain LIML estimates. We compare EM with direct numerical maximization of the observed log-likelihood (2.2). The estimates obtained from the two should be numerically the same. However, the two algorithms may differ in terms of computational cost and robustness of converging to a (local) optimum. As mentioned in Section 3.3, the main concern with EM type algorithms is their slow convergence, and it is of particular interest to examine how the various parameter expansion algorithms can accelerate convergence. Because of the computational cost of Monte Carlo integration for dummy endogenous regressors, we focus our experiments on the case r = 1. For the continuous endogenous regressors model, we consider the case r = 2. For both models, we set n = 1000 observations, m = 5 ordered outcomes, s = 10 or s = 20 included exogenous regressors and k = r + s + 1 for one extra instrument. The numbers of parameters to estimate are 27 (s = 10) and 47 (s = 20) for the dummy endogenous model and 46 (s = 10) and 76 (s = 20) for the continuous case. In our Monte Carlo experiments, we compare the following four variants of the ECM algorithm.
PX0 denotes the standard ECM algorithm without parameter expansion and PX1 denotes the algorithm where only one parameter, σ_11, is activated. PX2 and PXR both activate all diagonal elements of the covariance matrix Σ, but PX2 breaks the Σ update up into those for λ and Σ_22 (Appendix A.3), whereas PXR updates Σ itself as in Section 3.2.
Figure 2. Distribution of execution time of EM algorithms relative to BFGS. (Dummy endogenous regressor, r = 1; panels for s = 10 and s = 20; vertical axis: time relative to BFGS; boxplots for PX0, PX1, PX2 and PXR, with sample medians shown as filled circles.)
PX0, PX1 and PX2 are based on the parametrization (3.9) and PXR is based on (3.8).^13 For direct numerical maximization of the observed log-likelihood, we use the trust region BFGS algorithm of Gay (1983).^14 Constraints on the parameters δ_j and Σ are imposed using transformations as described in Section 2.1. All algorithms are started with the same initial parameter values: β, γ, Π are set to OLS estimates, Σ is set to the identity matrix, and δ_j = 0.1. The convergence criterion for the EM algorithms was set to ℓ(θ^(t)) − ℓ(θ^(t−1)) < ε_ℓ or max_i |θ_i^(t) − θ_i^(t−1)| < ε_θ, with ε_ℓ = 2.22 × 10^{−16} and ε_θ = 10^{−8}.^15 We note that one of the features of the EM algorithm is that it does not require evaluation of the observed likelihood ℓ(θ). However, we have found it advantageous to exploit the monotonicity of ℓ(θ) in the algorithm and to use it as part of the convergence criterion despite the additional cost of evaluating ℓ(θ).

^13 For the continuous case, PX1 and PX2 are the same as the diagonals of Σ_22 are not normalized. This is why PX2 is not reported for the continuous case in Figure 3. PX1 differs from PXR in this case as σ_11 and λ are separately updated in PX1.
^14 More specifically, we use the routine SMSNO based on numerical first derivatives.
^15 For the BFGS routine SMSNO we use the default convergence criteria.

Figures 2 and 3 show the distribution of time needed to achieve convergence of the four ECM algorithms relative to that of direct maximization of ℓ(θ) using the BFGS algorithm. Figure 2 is for the model with one dummy endogenous regressor and is based on 100 Monte Carlo replications. The panels compare the time it takes to estimate the parameters of the model with one dummy endogenous regressor using the EM algorithms.
Figure 3. Distribution of execution time of EM algorithms relative to BFGS. (Continuous endogenous regressors, r = 2; panels for s = 10 and s = 20; vertical axis: time relative to BFGS; boxplots for PX0, PX1 and PXR, with sample medians shown as filled circles.)
The timings are relative to numerical maximization of the observed log-likelihood using BFGS. PX0 denotes the ECM algorithm without parameter expansion. PX1 and PX2 are the PX-ECM algorithms that expand σ_11 and the diagonals of Σ_22 based on the parametrization (3.9). PXR is the PX-ECM algorithm based on the parametrization (3.8). The boxplots summarize the distribution over 100 replications of sample size n = 1000, m = 5 ordered outcomes, s exogenous regressors, and k = r + s + 1 (one extra instrument). The numbers in the figure are the sample medians (corresponding to the filled circles). The model with s = 10 has 27 parameters and the model with s = 20 has 47 parameters to estimate.

The four ECM algorithms achieved convergence in all replications, while the BFGS algorithm failed in three cases due to difficulty in evaluating the numerical gradient.^16 Excluding the failed cases, the maximum absolute differences between the converged parameter values from BFGS and those from ECM were on the order of 10^{−3} for PX0 and 10^{−5} for PX1, PX2 and PXR. The parameter estimates are therefore practically identical. The median reduction in computation time relative to BFGS is about 40–50% for both PX0 and PX1. The whiskers of the boxplots indicate that for these two algorithms there are cases where it takes longer than BFGS to achieve convergence. Both PX2 and PXR always converged faster than BFGS, with a median reduction of about 80% in computing time.

^16 The return code from SMSNO was 65.

Figure 3 shows the results for the model with r = 2 continuous endogenous regressors and is based on 1000 replications. The panels compare the time it takes to estimate the parameters of the model with two continuous endogenous regressors using the EM algorithms.
The timings are relative to numerical maximization of the observed log-likelihood using BFGS. PX0 denotes the ECM algorithm without parameter expansion. PX1 is the PX-ECM algorithm that expands σ_11 based on the parametrization (3.9). PXR is the PX-ECM algorithm based on the parametrization (3.8). The boxplots summarize the distribution over 1000 replications of sample size n = 1000, m = 5 ordered outcomes, s exogenous regressors, and k = r + s + 1 (one extra instrument). The numbers in the figure are the sample medians (corresponding to the filled circles). The model with s = 10 has 46 parameters and the model with s = 20 has 76 parameters to estimate.

For this model with s = 20, PX0 was stopped in 64 cases and PXR in 6 cases after 9000 EM iterations, while the BFGS algorithm terminated in 23 cases due to difficulty with the numerical gradient. PX1 achieved convergence in all replications. However, examination of the maximum absolute difference between the converged parameter values from BFGS and those from PX1 revealed that in two cases the Σ_22 estimates from PX1 were on the order of 10^7. For these two cases, BFGS achieved convergence and PXR reached the maximum number of iterations at parameter values close to those from BFGS. Excluding these two cases and the failed cases, the maximum absolute differences between the converged parameter values from BFGS and those from ECM were on the order of 10^0 for PX0 and 10^{−3} for PX1 and PXR. The gain in computation time from using the EM algorithm is even more dramatic for this model. Even without parameter expansion, the median reduction in computation time is about 80% compared to BFGS. Despite there being only one parameter to activate, both PX1 and PXR take only a fraction of the time of BFGS to achieve convergence.

A somewhat interesting result in Figures 2 and 3 is that PX2 appears to be slightly faster than PXR. Recall that PX2 breaks the PXR update of Σ up into two steps, a λ = Σ_22^{-1}σ_21 update and a Σ_22 update. This result suggests that having fewer blocks of parameters to update in the conditional M-step does not necessarily improve the speed of convergence of the algorithm. This is the case not only when we measure computational cost in terms of execution time, as in Figures 2 and 3, but also in terms of the number of EM iterations required to achieve convergence. For the dummy endogenous regressor model with r = 1 and s = 10, the median number of PX2 iterations relative to PX0 was 0.36 and the median number of PXR iterations relative to PX0 was 0.39. For the continuous endogenous regressors with r = 2 and s = 10, the median number of PX1/PX2 iterations relative to PX0 was 0.16 and the median number of PXR iterations relative to PX0 was 0.22. However, the two false convergence cases for PX1 in the continuous case (r = 2 and s = 20) suggest that the PXR algorithm may be more numerically stable for models with a large number of parameters, at a slight cost in computation time. By breaking the update up into two steps, PX1/PX2 does not necessarily ensure positive definiteness of the full covariance matrix, which may explain its numerical instability in very large models.

4.3. Starting values

Although the EM algorithm monotonically increases the (observed) likelihood after each iteration, it may converge to a local maximum depending on the starting values. The speed of convergence (and hence computing time) may also depend on the choice of starting values.
As the shape of the likelihood surface is difficult to characterize for models with as many parameters as the ones considered in the simulations, we briefly discuss how the algorithm performs when we use alternative starting values. We may expect that the algorithm performs better when we start from parameter values that are 'close' to the final estimates.
As mentioned above, the starting values used in the simulations are set to OLS estimates for the slope parameters (ignoring the discrete nature of the dependent variables), with the covariance parameters all set to zero and the cutpoint parameters arbitrarily set to δ_j = 0.1. These starting values are not consistent estimates but have the virtue of being very easy to obtain. For the model with continuous endogenous regressors, the two-step estimators of Newey (1987) and Rivers and Vuong (1988) can be used as starting values. While consistent, these two-step estimators are more expensive to compute as they involve estimating an ordered probit model (and we are back to the issue of selecting appropriate starting values). We are not aware of generally consistent two-step type estimators for the discrete endogenous case. However, one may still use the two-step estimates for the continuous case (ignoring the discrete nature of the endogenous regressors) as starting values. An alternative set of starting values is to set all slope parameters and covariance parameters to zero. Under these restrictions, the maximum likelihood estimates of the constant term and the cutpoint parameters are γ_0 = −Φ^{−1}(n_1/n) and α_j = Φ^{−1}(Σ_{s=1}^{j} n_s/n) + γ_0, where Φ^{−1}(·) is the standard normal quantile function. These starting values are easy to obtain and may be suitable for models with a relatively large number of ordered outcomes or uneven cutpoints. In empirical applications, it is generally recommended to try alternative starting values to check the sensitivity of the final estimates.

For the data generating processes with s = 10 considered in Section 4.2, we have compared the iterations of the algorithms under some of the alternative starting values mentioned above. For the simulation design under consideration, they generally have very little effect on the number of iterations. The exception was the continuous endogenous regressor case (with r = 2), where the iterations starting from the consistent two-step estimates of Rivers and Vuong (1988) resulted in median iteration reductions of 24–46% compared to the OLS starting values used in Section 4.2. However, as mentioned above, this reduction in iteration count comes at the cost of computing the two-step starting values.
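A minimal sketch of these intercept-only starting values (outcome codes 1, ..., m; the function name is ours):

```python
import numpy as np
from scipy.stats import norm

def intercept_only_start(y, m):
    """Starting values with all slope and covariance parameters set to zero:
    gamma_0 = -Phi^{-1}(n_1/n) and alpha_j = Phi^{-1}(sum_{s<=j} n_s/n) + gamma_0."""
    n = len(y)
    counts = np.array([(y == j).sum() for j in range(1, m + 1)])
    cum = np.cumsum(counts) / n
    gamma0 = -norm.ppf(cum[0])
    alpha = norm.ppf(cum[:-1]) + gamma0     # alpha_1, ..., alpha_{m-1}; alpha_1 = 0
    return gamma0, alpha
```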
5. CONCLUDING REMARKS

The proposed ECM algorithm to estimate ordered probit models with endogenous regressors is numerically stable and is the preferred algorithm to use to obtain LIML estimates. The simulations indicate that it works well even for models with a large number of parameters. Although the present paper focused on the ordered probit model with endogenous regressors, the EM algorithm can easily be modified or extended to other classes of discrete or limited dependent variable models, as considered by Newey (1987). The binary probit model with endogenous regressors is a special case of the model considered in this paper and requires no modification. One can also modify the EM algorithm to estimate limited dependent variable (e.g. Tobit) models with endogenous regressors. As the variance parameter is identified without normalization in the Tobit model, it remains to be seen whether an effective parameter expansion algorithm can be found for this class of models.

As inference procedures using LIML are well established, we expect more widespread application of LIML estimation using the proposed algorithm in applied work. The parameter covariance matrix required for statistical inference can be estimated from the second derivatives of the observed log-likelihood (Jamshidian and Jennrich, 2000). The so-called robust covariance matrix that allows for within group (clustered) dependence can be obtained in the usual manner (Wooldridge, 2001, Chapter 13).
The PX-ECM algorithm is stable and fast for the continuous endogenous regressor case, even for models with more than one endogenous regressor. The dummy endogenous variable case, however, is computationally more demanding, especially for models with more than one endogenous regressor. The main computational difficulty for this class of models is the E-step, which requires evaluation of conditional moments from multivariate normal integrals. Monte Carlo integration using quasi-random numbers is a feasible solution but computationally quite expensive. It remains to be seen whether a more effective method to compute the E-step can be developed, such as the use of parallelization.
ACKNOWLEDGMENT

We thank the editor and referees for many useful comments and suggestions.
REFERENCES

Abadie, A. (2003). Semiparametric instrumental variable estimation of treatment response models. Journal of Econometrics 113, 231–63.
Angrist, J. D. (2001). Estimation of limited dependent variable models with dummy endogenous regressors: simple strategies for empirical practice. Journal of Business and Economic Statistics 19, 2–16.
Brueckner, J. K. and A. G. Largey (2008). Social interaction and urban sprawl. Journal of Urban Economics 64, 18–34.
Dyer, D. D. (1973). On moments estimation of the parameters of a truncated bivariate normal distribution. Applied Statistics 22, 287–91.
Evans, W. N. and R. M. Schwab (1995). Finishing high school and starting college: do Catholic schools make a difference? Quarterly Journal of Economics 110, 941–74.
Gay, D. M. (1983). Algorithm 611: subroutines for unconstrained minimization using a model/trust-region approach. ACM Transactions on Mathematical Software 9, 503–24.
Genz, A. (1992). Numerical computation of multivariate normal probabilities. Journal of Computational and Graphical Statistics 1, 141–9.
Genz, A. (2004). Numerical computation of rectangular bivariate and trivariate normal and t probabilities. Statistics and Computing 14, 151–60.
Genz, A. and F. Bretz (2002). Methods for the computation of multivariate t-probabilities. Journal of Computational and Graphical Statistics 11, 950–71.
Jamshidian, M. and R. I. Jennrich (2000). Standard errors for EM estimation. Journal of the Royal Statistical Society, Series B 62, 257–70.
Johnson, N. L. and S. Kotz (1970). Continuous Univariate Distributions, Volume 1. New York: John Wiley & Sons.
Lee, M.-J. (1995). Semi-parametric estimation of simultaneous equations with limited dependent variables: a case study of female labour supply. Journal of Applied Econometrics 10, 187–200.
Liu, C., D. B. Rubin and Y. N. Wu (1998). Parameter expansion to accelerate EM: the PX-EM algorithm. Biometrika 85, 755–70.
Maddala, G. (1983). Limited-Dependent and Qualitative Variables in Econometrics. Cambridge: Cambridge University Press.
Meng, X.-L. and D. B. Rubin (1993). Maximum likelihood estimation via the ECM algorithm: a general framework. Biometrika 80, 267–78.
Meng, X.-L. and D. van Dyk (1997). The EM algorithm—an old folk-song sung to a fast new tune. Journal of the Royal Statistical Society, Series B 59, 511–67.
Newey, W. K. (1987). Efficient estimation of limited dependent variable models with endogenous explanatory variables. Journal of Econometrics 36, 231–50.
Pinheiro, J. C. and D. M. Bates (1996). Unconstrained parametrizations for variance-covariance matrices. Statistics and Computing 6, 289–96.
Rivers, D. and Q. H. Vuong (1988). Limited information estimators and exogeneity tests for simultaneous probit models. Journal of Econometrics 39, 347–66.
Ruud, P. A. (1991). Extensions of estimation methods using the EM algorithm. Journal of Econometrics 49, 305–41.
Sándor, Z. and P. András (2004). Alternative sampling methods for estimating multivariate normal probabilities. Journal of Econometrics 120, 207–34.
Smith, R. J. and R. W. Blundell (1986). An exogeneity test for a simultaneous equation Tobit model with an application to labor supply. Econometrica 54, 679–86.
Wooldridge, J. M. (2001). Econometric Analysis of Cross Section and Panel Data. Cambridge: MIT Press.
APPENDIX A: EM ALGORITHM

A.1. Derivatives for M-step

A.1.1. β, γ update. The parameters β and γ appear in the expected complete data log-likelihood (3.4) only through μ_i, which is a linear function of these parameters. Denote the stacked (r + s) × 1 vectors w_i ≡ (Y_i', X_{1i}')' and θ_w ≡ (β', γ')'. Then
$$\frac{\partial q(\theta)}{\partial\theta_w} = \frac{1}{\sigma_{1|2}}\sum_{j=1}^m\sum_{y_i=j}\big(\delta_j\bar y_i^* + \alpha_{j-1} - Y_i'\beta - X_{1i}'\gamma - \widehat V_i'\lambda\big)w_i$$
and the first order conditions ∂q(θ)/∂θ_w = 0 can be written as
$$\sum_{i=1}^n w_iw_i'\,\theta_w = \sum_{j=1}^m\sum_{y_i=j}\big(\delta_j\bar y_i^* + \alpha_{j-1} - \widehat V_i'\lambda\big)w_i.$$
Therefore, θ_w can be updated from a least-squares regression of $(\delta_j\bar y_i^* + \alpha_{j-1} - \widehat V_i'\lambda)$ on w_i.

A.1.2. Π update.
$$\frac{\partial\widehat V_i}{\partial\mathrm{vec}(\Pi)} = -(I_r\otimes X_i), \qquad \frac{\partial\widehat V_i'\lambda}{\partial\mathrm{vec}(\Pi)} = -(\lambda\otimes X_i),$$
$$\frac{\partial\widehat V_i'\Sigma_{22}^{-1}\widehat V_i}{\partial\mathrm{vec}(\Pi)} = -2(I_r\otimes X_i)\Sigma_{22}^{-1}\widehat V_i = -2\big(\Sigma_{22}^{-1}\otimes X_i\big)\widehat V_i,$$
$$\frac{\partial q(\theta)}{\partial\mathrm{vec}(\Pi)} = \sum_{i=1}^n\big(\Sigma_{22}^{-1}\otimes X_i\big)\widehat V_i - \frac{1}{\sigma_{1|2}}\sum_{j=1}^m\sum_{y_i=j}\big(\delta_j\bar y_i^* + \alpha_{j-1} - Y_i'\beta - X_{1i}'\gamma - \widehat V_i'\lambda\big)(\lambda\otimes X_i).$$
The first order conditions ∂q(θ)/∂vec(Π) = 0 can be written as (3.6), which is just a system of linear equations in the parameters Π.

A.1.3. Σ update. For any square matrices A, B and vectors x, y,
$$\frac{\partial\,\mathrm{tr}(A\Sigma^{-1}B)}{\partial\mathrm{vec}(\Sigma)} = -\mathrm{vec}\big(\Sigma^{-1}A'B'\Sigma^{-1}\big), \qquad \frac{\partial\,x'\Sigma^{-1}y}{\partial\mathrm{vec}(\Sigma)} = -\mathrm{vec}\big(\Sigma^{-1}xy'\Sigma^{-1}\big), \qquad \frac{\partial\log|\Sigma|}{\partial\mathrm{vec}(\Sigma)} = \mathrm{vec}\big(\Sigma^{-1}\big).$$
For the Σ update, we take the first order conditions from the joint form of the expected complete data log-likelihood (3.3a):
$$\frac{\partial q(\theta)}{\partial\Sigma} = -\frac{1}{2}\sum_{i=1}^n\Big[\Sigma^{-1} - \Sigma^{-1}E\big[x_i^*x_i^{*\prime}\,|\,y_i, Y_i\big]\Sigma^{-1} - \Sigma^{-1}v_iv_i'\Sigma^{-1} + \Sigma^{-1}v_i\bar x_i^{*\prime}\Sigma^{-1} + \Sigma^{-1}\bar x_i^*v_i'\Sigma^{-1}\Big].$$

A.1.4. δ_j update.
$$\frac{\partial q(\theta)}{\partial\delta_j} = \sum_{y_i=j}\Big[\frac{1}{\delta_j} - \frac{1}{\sigma_{1|2}}\Big(\delta_j\big(\bar y_i^{*2} + \widehat\sigma^{*2}_{11,i}\big) - \lambda'\widehat\sigma^*_{21,i} + \big(\alpha_{j-1} - Y_i'\beta - X_{1i}'\gamma - \widehat V_i'\lambda\big)\bar y_i^*\Big)\Big] - \frac{1}{\sigma_{1|2}}\sum_{k>j}\sum_{y_i=k}\big(\delta_k\bar y_i^* + \alpha_{k-1} - Y_i'\beta - X_{1i}'\gamma - \widehat V_i'\lambda\big).$$
The first-order condition ∂q(θ)/∂δ_j = 0 is a quadratic equation in δ_j, as given in (3.7).
A.2. E-step for continuous endogenous regressors

From the first two moments of the truncated normal distribution (Johnson and Kotz, 1970, pp. 81–3), we have the conditional moments
$$\bar y_i^* \equiv E\big[y_i^*\,|\,y_i = j\big] = \begin{cases} \mu_i - \alpha_1 - \sqrt{\sigma_{1|2}}\,\dfrac{\phi(z_{0,ij})}{\Phi(z_{0,ij})}, & j = 1,\\[2ex] \dfrac{1}{\delta_j}\Big[\mu_i - \alpha_{j-1} + \sqrt{\sigma_{1|2}}\,\dfrac{\phi(z_{0,ij}) - \phi(z_{1,ij})}{\Phi(z_{1,ij}) - \Phi(z_{0,ij})}\Big], & 1 < j < m,\\[2ex] \mu_i - \alpha_{m-1} + \sqrt{\sigma_{1|2}}\,\dfrac{\phi(z_{0,ij})}{\Phi(-z_{0,ij})}, & j = m, \end{cases} \tag{A.1a}$$
$$\widehat\sigma_i^{*2} \equiv E\big[(y_i^* - \bar y_i^*)^2\,|\,y_i = j\big] = \begin{cases} \Big[1 - \dfrac{z_{0,ij}\phi(z_{0,ij})}{\Phi(z_{0,ij})} - \Big(\dfrac{\phi(z_{0,ij})}{\Phi(z_{0,ij})}\Big)^2\Big]\sigma_{1|2}, & j = 1,\\[2ex] \dfrac{\sigma_{1|2}}{\delta_j^2}\Big[1 + \dfrac{z_{0,ij}\phi(z_{0,ij}) - z_{1,ij}\phi(z_{1,ij})}{\Phi(z_{1,ij}) - \Phi(z_{0,ij})} - \Big(\dfrac{\phi(z_{0,ij}) - \phi(z_{1,ij})}{\Phi(z_{1,ij}) - \Phi(z_{0,ij})}\Big)^2\Big], & 1 < j < m,\\[2ex] \Big[1 + \dfrac{z_{0,ij}\phi(z_{0,ij})}{\Phi(-z_{0,ij})} - \Big(\dfrac{\phi(z_{0,ij})}{\Phi(-z_{0,ij})}\Big)^2\Big]\sigma_{1|2}, & j = m, \end{cases} \tag{A.1b}$$
where μ_i, σ_{1|2} are defined in (2.4) and
$$z_{0,ij} \equiv -\frac{\mu_i - \alpha_{j-1}}{\sqrt{\sigma_{1|2}}}, \qquad z_{1,ij} \equiv \frac{\delta_j - \mu_i + \alpha_{j-1}}{\sqrt{\sigma_{1|2}}}.$$
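Equivalently, these moments are the mean and variance of a univariate truncated normal after the location and scale shift of Section 3.1, which is how the following sketch computes them (via scipy.stats.truncnorm; the argument names and conventions are ours, with lo/hi the untransformed truncation bounds, e.g. ±∞ for the end categories).

```python
import numpy as np
from scipy.stats import truncnorm

def estep_moments_continuous(mu, sigma_1_2, lo, hi, alpha_jm1, delta_j):
    """Conditional mean and variance of (y* - alpha_{j-1})/delta_j given that
    the untransformed y* lies in [lo, hi); mu and sigma_1_2 are the conditional
    mean and variance of y* given Y, cf. (A.1a)-(A.1b)."""
    s = np.sqrt(sigma_1_2)
    a, b = (lo - mu) / s, (hi - mu) / s        # standardized truncation points
    t = truncnorm(a, b, loc=mu, scale=s)        # truncated N(mu, sigma_1|2)
    mean = (t.mean() - alpha_jm1) / delta_j     # apply the location/scale shift
    var = t.var() / delta_j ** 2
    return mean, var
```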
A.3. Alternative M-step

Here we describe the M-step using the alternative parametrization (3.9), which breaks the (r + 1)-dimensional covariance matrix Σ up into blocks of σ_{1|2} ≡ σ_11 − σ_21'Σ_22^{-1}σ_21 (scalar), λ ≡ Σ_22^{-1}σ_21 (r × 1), and Σ_22 (r × r). A brief numerical sketch of steps 3 and 4 follows the list.

1. Update β^(t+1), γ^(t+1), λ^(t+1) by solving the linear system
$$\sum_{i=1}^n\big(w_iw_i' + J'\widehat\Sigma^*_{22,i}J\big)\theta_w = \sum_{i=1}^n\Big[\big(\delta_j\bar y_i^* + \alpha_{j-1}\big)w_i + \delta_jJ'\widehat\sigma^*_{21,i}\Big],$$
where θ_w ≡ (β', γ', λ')', w_i ≡ (Y_i', X_{1i}', V̂_i')', and J ≡ (0_{r×(r+s)}, I_{r×r}).

2. Update Π^(t+1) as in (3.6).

3. Update Σ_22^(t+1) as
$$\Sigma_{22} = \frac{1}{n}\sum_{i=1}^n\big(\widehat\Sigma^*_{22,i} + \widehat V_i\widehat V_i'\big).$$

4. Update σ_{1|2}^(t+1) as
$$\sigma_{1|2} = \frac{1}{n}\sum_{i=1}^n E\big[(\delta_jy_i^* - \mu_i + \alpha_{j-1})^2\,\big|\,y_i, Y_i\big],$$
where the conditional expectation is given in (3.5).

5. If m > 2, update δ_j as in (3.7).
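Steps 3 and 4 are simple averages of E-step quantities. A minimal vectorized sketch (array shapes and names are ours):

```python
import numpy as np

def update_sigma22_sigma12(Sigma22_hat, V_hat, e2):
    """Conditional M-step updates for Sigma_22 and sigma_{1|2} (steps 3-4 of
    Appendix A.3): averages of E-step quantities over the sample.
    Sigma22_hat: (n, r, r) conditional variances of Y*_i;
    V_hat:       (n, r) residuals Ybar*_i - Pi'X_i;
    e2:          (n,) values of E[(delta_j y*_i - mu_i + alpha_{j-1})^2 | .]."""
    outer = np.einsum('ni,nj->nij', V_hat, V_hat)   # V_hat_i V_hat_i' per obs
    Sigma22 = (Sigma22_hat + outer).mean(axis=0)
    sigma12 = e2.mean()
    return Sigma22, sigma12
```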
A.4. Latent endogenous regressors

For the latent endogenous regressors model in Section 2.3, the complete data likelihood is p(y*_i, Y*_i | y_i, Y_i) ∼ N(μ*_i, Σ*_i), where
$$\mu_i^* \equiv \begin{pmatrix} (X_i'\Pi\beta + X_{1i}'\gamma - \alpha_{j-1})/\delta_j \\ \Pi'X_i \end{pmatrix}, \qquad \Sigma_i^* \equiv \begin{pmatrix} \big(\sigma_{11} + 2\beta'\sigma_{21} + \beta'\Sigma_{22}\beta\big)/\delta_j^2 & \big(\sigma_{21}' + \beta'\Sigma_{22}\big)/\delta_j \\ \big(\sigma_{21} + \Sigma_{22}\beta\big)/\delta_j & \Sigma_{22} \end{pmatrix},$$
and the expected complete data log-likelihood function can be written as (3.3a) with Σ and v_i replaced by
$$\Sigma \leftarrow B'\Sigma B, \qquad B \equiv \begin{pmatrix} 1 & 0_r' \\ \beta & I_r \end{pmatrix}, \qquad v_i \equiv \begin{pmatrix} X_i'\Pi\beta + X_{1i}'\gamma - \alpha_{j-1} \\ \Pi'X_i \end{pmatrix},$$
or as (3.4) with $\mu_i \equiv \bar Y_i^{*\prime}\beta + X_{1i}'\gamma + \widehat V_i'\lambda$ and
$$E\big[(\delta_jy_i^* - \mu_i + \alpha_{j-1})^2\,\big|\,y_i, Y_i\big] = \delta_j^2\widehat\sigma^{*2}_{11,i} + (\beta + \lambda)'\widehat\Sigma^*_{22,i}(\beta + \lambda) - 2\delta_j(\beta + \lambda)'\widehat\sigma^*_{21,i} + \big(\delta_j\bar y_i^* + \alpha_{j-1} - \bar Y_i^{*\prime}\beta - X_{1i}'\gamma - \widehat V_i'\lambda\big)^2.$$

A.4.1. M-step.

1. Update β^(t+1) by solving the linear system
$$\sum_{i=1}^n\big(\bar Y_i^*\bar Y_i^{*\prime} + \widehat\Sigma^*_{22,i}\big)\beta = \sum_{i=1}^n\Big[\big(\delta_j\bar y_i^* + \alpha_{j-1} - X_{1i}'\gamma - \widehat V_i'\lambda\big)\bar Y_i^* + \delta_j\widehat\sigma^*_{21,i} - \widehat\Sigma^*_{22,i}\lambda\Big].$$

2. Update γ^(t+1) via a least-squares regression of $(\delta_j\bar y_i^* + \alpha_{j-1} - \bar Y_i^{*\prime}\beta - \widehat V_i'\lambda)$ on X_{1i}.

3. Update Π^(t+1) by solving the linear system
$$\sum_{i=1}^n\Big[\big(\lambda\lambda' + \sigma_{1|2}\Sigma_{22}^{-1}\big)\otimes X_iX_i'\Big]\,\mathrm{vec}(\Pi) = \mathrm{vec}\Big(\sum_{i=1}^n X_i\bar Y_i^{*\prime}\big(\lambda\lambda' + \sigma_{1|2}\Sigma_{22}^{-1}\big)\Big) - \sum_{i=1}^n\big(\delta_j\bar y_i^* + \alpha_{j-1} - \bar Y_i^{*\prime}\beta - X_{1i}'\gamma\big)\big(\lambda\otimes X_i\big).$$

4. Update Σ^(t+1) as
$$\Sigma = B'^{-1}\Big[\frac{1}{n}\sum_{i=1}^n\big(\widehat V(x_i^*) + (\bar x_i^* - v_i)(\bar x_i^* - v_i)'\big)\Big]B^{-1}.$$

5. If m > 2, update δ_j in the order j = 2, ..., m − 1 by solving the quadratic equation
$$\Big(\sum_{y_i=j}\big[\bar y_i^{*2} + \widehat\sigma^{*2}_{11,i}\big] + n_{j+1} + \cdots + n_m\Big)\delta_j^2 + \Big(\sum_{y_i=j}\big[(\alpha_1 + \delta_2 + \cdots + \delta_{j-1} - \mu_i)\bar y_i^* - (\beta + \lambda)'\widehat\sigma^*_{21,i}\big] + \sum_{k=j+1}^m\sum_{y_i=k}\big(\alpha_1 + \delta_2 + \cdots + \delta_{j-1} + \delta_{j+1} + \cdots + \delta_{k-1} + \delta_k\bar y_i^* - \mu_i\big)\Big)\delta_j - n_j\sigma_{1|2} = 0,$$
where $\mu_i \equiv \bar Y_i^{*\prime}\beta + X_{1i}'\gamma + \widehat V_i'\lambda$.
APPENDIX B: EVALUATION OF MULTIVARIATE NORMAL INTEGRALS

In this section, we provide details for the computation of the multivariate normal integrals (3.2) for the E-step,
$$\frac{\displaystyle\int_{\underline a}^{\bar a}\!\!\int_{\underline b}^{\bar b} y^{*r}Y^{*s}\,p(y^*, Y^*\,|\,y, Y)\,dY^*\,dy^*}{\displaystyle\int_{\underline a}^{\bar a}\!\!\int_{\underline b}^{\bar b} p(y^*, Y^*\,|\,y, Y)\,dY^*\,dy^*}. \tag{B.1}$$
The integration limits $\underline a$, $\bar a$ (scalars) and $\underline b$, $\bar b$ (r × 1) take values of 0, 1, or ±∞ depending on y_i, Y_i. We denote the stacked limits as $\underline c \equiv (\underline a, \underline b')'$ and $\bar c \equiv (\bar a, \bar b')'$ and refer to the j-th element with subscripts $\underline c_j$, $\bar c_j$ for j = 1, ..., r + 1. (i indexes the observation number in the data sample.)

B.1. r > 1 case

For the case with r > 1, we can evaluate the integrals using Monte Carlo integration as described by Genz (1992). The algorithm of Genz (1992) assumes a zero mean, which is not suitable for computing second moments when the mean is not zero. Here we describe the algorithm in pseudo-code in a more general form with a non-zero mean. The pseudo-code computes the multivariate normal conditional moments $h(\mu, \Sigma, \underline c, \bar c) \equiv \int_{\underline c}^{\bar c} g(x)p(x)\,dx \big/ \int_{\underline c}^{\bar c} p(x)\,dx$, where x is K × 1, p(x) ∼ N(μ, Σ) and g(x) = (x', vech(xx')')'.
1. Input: K (dimension of the integral), μ (mean), Σ (variance), $\underline c$, $\bar c$ (integration limits), M (number of Monte Carlo draws).
2. Initialization. Compute the Cholesky factor L such that LL' = Σ and set $a = \underline c - \mu$, $b = \bar c - \mu$, dsum = vsum = 0, N = 0, c_1 = Φ(a_1/L_{11}), d_1 = Φ(b_1/L_{11}), f_1 = d_1 − c_1.
3. Repeat until N = M:
   (a) Draw K independent uniform variates w_1, ..., w_K ∈ [0, 1].
   (b) For j = 2, 3, ..., K, set $y_{j-1} = \Phi^{-1}\big(c_{j-1} + w_{j-1}(d_{j-1} - c_{j-1})\big)$, $c_j = \Phi\big((a_j - \sum_{k=1}^{j-1} L_{j,k}y_k)/L_{j,j}\big)$, $d_j = \Phi\big((b_j - \sum_{k=1}^{j-1} L_{j,k}y_k)/L_{j,j}\big)$, and $f_j = (d_j - c_j)f_{j-1}$.
   (c) Set $y_K = \Phi^{-1}\big(c_K + w_K(d_K - c_K)\big)$ and compute x = Ly + μ.
   (d) Set N ← N + 1 and accumulate the weighted averages dsum ← dsum + (f_K − dsum)/N and vsum ← vsum + (f_K g(x) − vsum)/N.
4. Output: h(μ, Σ, $\underline c$, $\bar c$) ≈ vsum/dsum.

The draws from the uniform distribution in step 3(a) can be replaced by quasi-random numbers (Sándor and András, 2004). For each E-step of the EM iterations, we use the same sequence of M draws in step 3(a), with only the parameter vector updated at each iteration.
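A direct transcription of this pseudo-code into Python might look as follows; it uses plain pseudo-random uniforms and returns the normalizing constant together with the first two conditional moments (the function name and return convention are ours).

```python
import numpy as np
from scipy.stats import norm

def genz_conditional_moments(mu, Sigma, lo, hi, M, rng=None):
    """Monte Carlo sketch of the Genz (1992)-style algorithm of Appendix B.1.
    Returns (den, m1, m2): den approximates P(lo <= x <= hi), m1 approximates
    E[x | lo <= x <= hi] and m2 approximates E[x x' | lo <= x <= hi] for
    x ~ N(mu, Sigma).  Quasi-random draws can be substituted for rng.uniform."""
    rng = np.random.default_rng(rng)
    mu = np.asarray(mu, dtype=float)
    K = len(mu)
    L = np.linalg.cholesky(Sigma)
    a, b = np.asarray(lo) - mu, np.asarray(hi) - mu
    dsum, m1, m2 = 0.0, np.zeros(K), np.zeros((K, K))
    for n in range(1, M + 1):
        w = rng.uniform(size=K)
        y = np.zeros(K)
        c, d = norm.cdf(a[0] / L[0, 0]), norm.cdf(b[0] / L[0, 0])
        f = d - c
        for j in range(1, K):
            y[j - 1] = norm.ppf(c + w[j - 1] * (d - c))
            t = L[j, :j] @ y[:j]
            c, d = norm.cdf((a[j] - t) / L[j, j]), norm.cdf((b[j] - t) / L[j, j])
            f *= d - c
        y[K - 1] = norm.ppf(c + w[K - 1] * (d - c))
        x = L @ y + mu
        dsum += (f - dsum) / n                   # running mean of the weights
        m1 += (f * x - m1) / n                   # running mean of f * x
        m2 += (f * np.outer(x, x) - m2) / n      # running mean of f * x x'
    return dsum, m1 / dsum, m2 / dsum
```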
B.2. r = 1 case

The Monte Carlo integration algorithm described above can be applied generally for any r, but it is computationally costly. The accuracy improves with a larger number of draws M, but so does the computational cost. Moreover, the integrals need to be evaluated n times for each E-step as the mean of the normal distribution depends on the sample observation data. Here we describe a method to compute the conditional moments for the special (but empirically important) case r = 1 that requires only a bivariate normal cdf routine.

For r = 1, we require five moments m*_{r,s} = m_{r,s}/m_{0,0} for r, s = 0, 1, 2, where m_{r,s} is the numerator integral in (B.1). To avoid evaluating the double integrals in m_{r,s}, we follow the approach in Dyer (1973) and use the following recursions. Denote the bivariate normal joint density
$$p(x_1, x_2, \mu, \rho) \equiv \frac{1}{s_1s_2}\,\phi_2\Big(\frac{x_1-\mu_1}{s_1}, \frac{x_2-\mu_2}{s_2}, \rho\Big),$$
where φ_2(·) is the standard bivariate normal density and ρ ≡ σ_12/(s_1s_2) is the correlation. The partial derivatives of p(x_1, x_2, μ, ρ) with respect to the first two arguments are
$$\frac{\partial}{\partial x_1}p(x_1, x_2, \mu, \rho) = \frac{1}{s_1(1-\rho^2)}\Big(\rho\,\frac{x_2-\mu_2}{s_2} - \frac{x_1-\mu_1}{s_1}\Big)p(x_1, x_2, \mu, \rho),$$
$$\frac{\partial}{\partial x_2}p(x_1, x_2, \mu, \rho) = \frac{1}{s_2(1-\rho^2)}\Big(\rho\,\frac{x_1-\mu_1}{s_1} - \frac{x_2-\mu_2}{s_2}\Big)p(x_1, x_2, \mu, \rho).$$
Multiplying both sides by x_1^r x_2^s and integrating by parts for a few values of r and s gives the linear system
$$\begin{pmatrix} K_{12} & A_1 & B & 0 & 0 & 0\\ 1 & K_{12} & 0 & B & A_1 & 0\\ 0 & 0 & K_{12} & A_1 & 0 & B\\ K_{21} & B & A_2 & 0 & 0 & 0\\ 0 & K_{21} & 0 & A_2 & B & 0\\ 1 & 0 & K_{21} & B & 0 & A_2 \end{pmatrix} \begin{pmatrix} m_{0,0}\\ m_{1,0}\\ m_{0,1}\\ m_{1,1}\\ m_{2,0}\\ m_{0,2} \end{pmatrix} = \begin{pmatrix} f_{0,0}\\ f_{1,0}\\ f_{0,1}\\ g_{0,0}\\ g_{1,0}\\ g_{0,1} \end{pmatrix}, \tag{B.2}$$
where
$$K_{ij} \equiv \frac{1}{(1-\rho^2)s_i}\Big(\frac{\mu_i}{s_i} - \frac{\rho\mu_j}{s_j}\Big), \qquad A_i \equiv -\frac{1}{(1-\rho^2)s_i^2}, \qquad B \equiv \frac{\rho}{(1-\rho^2)s_1s_2},$$
$$f_{r,s} \equiv \frac{1}{s_1}\int_{\underline z_2}^{\bar z_2}\big[\bar c_1^{\,r}\,\phi_2(\bar z_1, v, \rho) - \underline c_1^{\,r}\,\phi_2(\underline z_1, v, \rho)\big](\mu_2 + s_2v)^s\,dv,$$
$$g_{r,s} \equiv \frac{1}{s_2}\int_{\underline z_1}^{\bar z_1}\big[\bar c_2^{\,s}\,\phi_2(u, \bar z_2, \rho) - \underline c_2^{\,s}\,\phi_2(u, \underline z_2, \rho)\big](\mu_1 + s_1u)^r\,du,$$
and z_j ≡ (c_j − μ_j)/s_j (with the corresponding underlined and barred versions). The integrals f_{r,s}, g_{r,s} can be evaluated using the relations
$$\int_a^b \phi_2(x, y, \rho)\,dy = \phi(x)\big[\Phi(b_x) - \Phi(a_x)\big],$$
$$\int_a^b y\,\phi_2(x, y, \rho)\,dy = \phi(x)\Big[\sqrt{1-\rho^2}\,\big(\phi(a_x) - \phi(b_x)\big) + \rho x\big(\Phi(b_x) - \Phi(a_x)\big)\Big],$$
where $a_x \equiv (a - \rho x)/\sqrt{1-\rho^2}$ and $b_x \equiv (b - \rho x)/\sqrt{1-\rho^2}$.

The system (B.2) is singular, so one of the double integrals m_{r,s} needs to be evaluated directly. Given a routine to evaluate the standard bivariate normal cdf $\Phi_2(c_1, c_2, \rho) \equiv \int_{-\infty}^{c_1}\!\int_{-\infty}^{c_2}\phi_2(x_1, x_2, \rho)\,dx_2\,dx_1$, m_{0,0} can be evaluated as $m_{0,0} = \Phi_2(\bar z_1, \bar z_2, \rho) - \Phi_2(\underline z_1, \bar z_2, \rho) - \Phi_2(\bar z_1, \underline z_2, \rho) + \Phi_2(\underline z_1, \underline z_2, \rho)$.^17 The main computational cost of this algorithm arises from four calls to Φ_2(·), a few calls to the univariate standard normal cdf Φ(·), and solving a linear system of dimension five for each observation.
^17 For example, the BVND routine described by Genz (2004) can be used.
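For instance, m_{0,0} can be obtained from four bivariate normal cdf evaluations as in the following sketch, with scipy's multivariate normal cdf standing in for the BVND routine (finite standardized limits are assumed; the function name is ours).

```python
import numpy as np
from scipy.stats import multivariate_normal

def rectangle_prob(z1_lo, z1_hi, z2_lo, z2_hi, rho):
    """m_{0,0}: probability that a standard bivariate normal with correlation
    rho falls in the rectangle [z1_lo, z1_hi] x [z2_lo, z2_hi], from four
    bivariate normal cdf calls."""
    Phi2 = multivariate_normal(mean=[0.0, 0.0],
                               cov=[[1.0, rho], [rho, 1.0]]).cdf
    return (Phi2([z1_hi, z2_hi]) - Phi2([z1_lo, z2_hi])
            - Phi2([z1_hi, z2_lo]) + Phi2([z1_lo, z2_lo]))
```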