Since ψ(L) = θ(L)/φ(L) for ARMA(p, q), we have

AR(p): g_y(z) = σ² / [φ(z) φ(z⁻¹)],   (6.2.21)

ARMA(p, q): g_y(z) = σ² θ(z) θ(z⁻¹) / [φ(z) φ(z⁻¹)].   (6.2.22)
QUESTIONS FOR REVIEW

1. If {y_t} is the covariance-stationary solution to the AR(1) equation (6.2.1) with |φ| < 1, what is Ê(y_t | 1, y_{t−1}) (the least squares projection)? Hint: E(y_{t−1}ε_t) = 0. Do the projection coefficients depend on t? Is Ê(y_t | 1, y_{t−1}) = E(y_t | y_{t−1})? Hint: E(y_{t−1}ε_t) = 0, but does it imply E(ε_t | y_{t−1}) = 0? What is Ê(y_t | 1, y_{t−1}, y_{t−2})? Now suppose that {y_t} is the covariance-stationary solution to the AR(1) equation when |φ| > 1. Is it true that E(y_{t−1}ε_t) = 0? [Answer: No.] Does Ê(y_t | 1, y_{t−1}) = c + φy_{t−1}, as was the case when |φ| < 1?
2. (AR(1) with φ = −1) For simplicity, set c = 0 in the AR(1) equation (6.2.1). Show that the AR(1) equation with φ = −1 has no covariance-stationary solution. Hint: The same sort of successive substitution we did for the case φ = 1 yields

y_t − (−1)^j y_{t−j} = ε_t − ε_{t−1} + ⋯ + (−1)^{j−1} ε_{t−j+1}.

From this, derive

(−1)^j ρ_j = 1 − (σ²/(2γ₀)) · j.
3. (About Proposition 6.4) How do we know that φ(1) ≠ 0 in (6.2.8)? Hint: Use the stationarity condition. Prove (b) of Proposition 6.4. Hint: Take the expectation of both sides of (6.2.6) and exploit the covariance-stationarity of {y_t}.

4. (About Proposition 6.5) Verify that y_t − μ = φ(L)⁻¹θ(L)ε_t is a solution to the ARMA(p, q) equation.

5. ({ψ_j} for ARMA(p, q)) For AR(p), we used the equations (6.1.12) to calculate the MA coefficients {ψ_j}. For ARMA(3, 1), write down the corresponding equations. From which lag j does {ψ_j} start following
which is (6.1.17) for p = 3? Do the same for ARMA(1, 2). From which j does {ψ_j} start following (6.1.17) for p = 1? [Answer: j = 3.]

6. (How filtering alters the spectrum) Let h(L) be absolutely summable. By Proposition 6.2, if {x_t} is covariance-stationary with absolutely summable autocovariances, then so is {y_t}, where y_t = h(L)x_t. Using (6.2.20), verify that

s_y(ω) = h(e^{−iω}) h(e^{iω}) s_x(ω).
7. (The spectrum is real valued) Show that the spectrum is real valued. Hint: Substitute (6.2.16) into (6.2.14). Use the following facts from complex analysis: [cos(ω) − i sin(ω)]^j = cos(jω) − i sin(jω); sin(−ω) = −sin(ω).
6.3
Vector Processes
It is straightforward to extend all the concepts introduced so far to vector processes. A vector white noise process {ε_t} is a jointly covariance-stationary process satisfying

E(ε_t) = 0,  E(ε_t ε_t') = Ω (positive definite), and  E(ε_t ε_{t−j}') = 0 for j ≠ 0.   (6.3.1)
Since Ω is not restricted to being diagonal, there can be contemporaneous correlation between the elements of ε_t. Perfect correlation among the elements of ε_t is ruled out because Ω is required to be positive definite. A vector MA(∞) process is the obvious vector version of (6.1.6):

y_t = μ + Σ_{j=0}^∞ Ψ_j ε_{t−j},  with Ψ₀ = I,   (6.3.2)
where {Ψ_j} are square matrices. The sequence of coefficient matrices is said to be absolutely summable if each element is absolutely summable. That is, if ψ_{kℓj} is the (k, ℓ) element of Ψ_j,

"{Ψ_j} is absolutely summable"  ⇔  "Σ_{j=0}^∞ |ψ_{kℓj}| < ∞ for all (k, ℓ)."   (6.3.3)
With this definition, Proposition 6.1 generalizes to the multivariate case in an obvious way. In particular, if Γ_j (= E[(y_t − μ)(y_{t−j} − μ)']) is the j-th order autocovariance matrix, then the expression for autocovariances in part (b) of Proposition 6.1 becomes

Γ_j = Σ_{k=0}^∞ Ψ_{j+k} Ω Ψ_k'   (j = 0, 1, 2, ...).   (6.3.4)

(This formula also covers j = −1, −2, ... because Γ_{−j} = Γ_j'.) A multivariate filter can be written as

H(L) = H₀ + H₁L + H₂L² + ⋯,

where {H_j} is a sequence of (not necessarily square) matrices. So if h_{kℓj} is the
(k, ℓ) element of H_j, the (k, ℓ) element of H(L) is h_{kℓ0} + h_{kℓ1}L + h_{kℓ2}L² + ⋯. The multivariate version of Proposition 6.2 is obvious, with y_t = H(L) x_t.

Product of Filters
Let A(L) and B(L) be two filters where {A_j} is m × r and {B_j} is r × s so that the matrix product A_j B_k can be defined. The product of two filters, D(L) = A(L)B(L), is an m × s filter whose coefficient matrix sequence {D_j} is given by the multivariate version of the convolution formula (6.1.12):

D₀ = A₀B₀,
D₁ = A₀B₁ + A₁B₀,
D₂ = A₀B₂ + A₁B₁ + A₂B₀,
⋯   (6.3.5)

Inverses
Let A(L) and B(L) be two filters whose coefficient matrices are square. B(L) is said to be the inverse of A(L), denoted A(L)⁻¹, if

A(L) B(L) = I.   (6.3.6)
For any arbitrary sequence of square matrices {A_j}, the inverse of A(L) exists if A₀ is nonsingular. It is easy to show that inverses are commutative (see Review Question 2), so A(L) A(L)⁻¹ = A(L)⁻¹ A(L).
Absolutely Summable Inverses of Lag Polynomials
A p-th degree lag matrix polynomial is

Φ(L) = I_r − Φ₁L − Φ₂L² − ⋯ − Φ_p L^p,   (6.3.7)

where {Φ_j} is a sequence of r × r matrices with Φ_p ≠ 0. From the theory of homogeneous difference equations, the stability condition for the multivariate case is as follows:
All the roots of the determinantal equation

|I_r − Φ₁z − Φ₂z² − ⋯ − Φ_p z^p| = 0   (6.3.8)

are greater than 1 in absolute value (i.e., lie outside the unit circle). Or, equivalently,

All the roots of the determinantal equation

|z^p I_r − z^{p−1}Φ₁ − ⋯ − zΦ_{p−1} − Φ_p| = 0   (6.3.8')

are less than 1 in absolute value (i.e., lie inside the unit circle).

As an example, consider a first-degree lag polynomial Φ(L) = I − Φ₁L, where Φ₁ = (φ_{kℓ}) is 2 × 2. Equation (6.3.8) can be written as

|I − Φ₁z| = (1 − φ₁₁z)(1 − φ₂₂z) − φ₁₂φ₂₁z² = 0.
With the stability condition thus generalized, Proposition 6.3 generalizes in an obvious way. Let Ψ(L) = Φ(L)⁻¹. Each component of the coefficient matrix sequence {Ψ_j} will be bounded in absolute value by a geometrically declining sequence.
The multivariate analogue of an AR(p) is a vector autoregressive process of p-th order (VAR(p)). It is the unique covariance-stationary solution under stationarity condition (6.3.8) to the following vector stochastic difference equation:

y_t − Φ₁y_{t−1} − ⋯ − Φ_p y_{t−p} = c + ε_t  or  Φ(L)(y_t − μ) = ε_t,
where Φ(L) = I_r − Φ₁L − Φ₂L² − ⋯ − Φ_p L^p  and  μ = Φ(1)⁻¹c,   (6.3.9)
where Φ_p ≠ 0. For example, a bivariate VAR(1) can be written out in full as a two-equation system with common regressors:

y_{1t} = c₁ + φ₁₁y_{1,t−1} + φ₁₂y_{2,t−1} + ε_{1t},
y_{2t} = c₂ + φ₂₁y_{1,t−1} + φ₂₂y_{2,t−1} + ε_{2t},

where

Φ₁ = [ φ₁₁  φ₁₂
       φ₂₁  φ₂₂ ].
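As a quick illustration (not from the text), the following Python sketch simulates the bivariate VAR(1) above with made-up coefficient values and checks the stationarity condition through the eigenvalues of Φ₁; all numbers and names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
c = np.array([0.5, 0.2])                       # assumed intercepts
Phi1 = np.array([[0.5, 0.1],
                 [0.2, 0.3]])                  # assumed coefficient matrix
# Stationarity: the roots of |I - Phi1 z| = 0 lie outside the unit circle
# iff all eigenvalues of Phi1 lie inside it.
assert np.all(np.abs(np.linalg.eigvals(Phi1)) < 1)

n = 500
Omega = np.array([[1.0, 0.3],
                  [0.3, 1.0]])                 # error covariance (positive definite)
eps = rng.multivariate_normal(np.zeros(2), Omega, size=n)
y = np.zeros((n, 2))
for t in range(1, n):
    y[t] = c + Phi1 @ y[t - 1] + eps[t]        # y_t = c + Phi_1 y_{t-1} + eps_t
```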
More generally, an M-variate VAR(p) is a set of M equations with Mp + 1 common regressors. Proposition 6.4 generalizes straightforwardly to the multivariate case.
In particular, each element of Γ_j is bounded in absolute value by a geometrically declining sequence. Vector autoregressions are a very popular tool for analyzing the dynamic interrelationship between key macro variables. A thorough treatment can be found in Hamilton (1994, Chapter 11). The multivariate analogue of an ARMA(p, q) is a vector ARMA(p, q) (VARMA(p, q)). It is the unique covariance-stationary solution under the stationarity condition to the following vector stochastic difference equation:
y_t = c + Φ₁y_{t−1} + ⋯ + Φ_p y_{t−p} + ε_t + Θ₁ε_{t−1} + Θ₂ε_{t−2} + ⋯ + Θ_q ε_{t−q}

or

Φ(L)(y_t − μ) = Θ(L)ε_t,
where Φ(L) = I − Φ₁L − ⋯ − Φ_p L^p,  Θ(L) = I + Θ₁L + ⋯ + Θ_q L^q,  μ = Φ(1)⁻¹c,   (6.3.10)

where Φ_p ≠ 0 and Θ_q ≠ 0. Proposition 6.5 generalizes in an obvious way.
Autocovariance-Generating Function
The multivariate version of the autocovariance-generating function for a vector covariance-stationary process {y_t} is

G_y(z) = Σ_{j=−∞}^∞ Γ_j z^j = Γ₀ + Σ_{j=1}^∞ (Γ_j z^j + Γ_j' z^{−j}).   (6.3.11)

The spectrum of a vector process {y_t} is defined as

s_y(ω) = (1/2π) G_y(e^{−iω}).   (6.3.12)
Proposition 6.6 generalizes easily. In particular, if H(L) is an r × s absolutely summable filter and G_x(z) is the autocovariance-generating function of an s-dimensional covariance-stationary process {x_t} with absolutely summable autocovariances, then the autocovariance-generating function of y_t = H(L)x_t is given by

G_y(z) = H(z) G_x(z) H(z⁻¹)',   (6.3.13)

where G_y(z) is r × r, H(z) is r × s, G_x(z) is s × s, and H(z⁻¹)' is s × r.
As special cases of this, we have

vector white noise: G_y(z) = Ω,   (6.3.14)
VMA(∞): G_y(z) = Ψ(z) Ω Ψ(z⁻¹)',   (6.3.15)
VAR(p): G_y(z) = [Φ(z)⁻¹] Ω [Φ(z⁻¹)⁻¹]',   (6.3.16)
VARMA(p, q): G_y(z) = [Φ(z)⁻¹][Θ(z)] Ω [Θ(z⁻¹)]'[Φ(z⁻¹)⁻¹]'.   (6.3.17)
QUESTIONS FOR REVIEW
1. Verify (6.3.4) for vector MA(1).

2. (Commutativity of inverses) Verify the commutativity of inverses, that is, "A(L) B(L) = I" ⇔ "B(L) A(L) = I" for two square filters of dimension r with |A₀| ≠ 0 and |B₀| ≠ 0. Hint: Set D₀ = I, D_j = 0 (j ≥ 1) in (6.3.5) and solve for the B's. Show that this B(L) satisfies B(L) A(L) = I.
3. (Lack of commutativity for multivariate filters) Provide an example where A(L) and B(L) are 2 × 2 but A(L) B(L) ≠ B(L) A(L). Hint: How about A(L) = I + A₁L and B(L) = I + B₁L? Find two square matrices A₁ and B₁ such that A₁B₁ ≠ B₁A₁.

4. Verify the multivariate version of (6.1.15b), namely,

"A(L) B(L) = D(L)"  ⇔  "B(L) = A(L)⁻¹ D(L)"  ⇔  "A(L) = D(L) B(L)⁻¹,"

provided A₀ and B₀ are nonsingular. So, to solve A(L) B(L) = D(L) for B(L), "multiply both sides from the left" by A(L)⁻¹.

5. Show that [A(L) B(L)]⁻¹ = B(L)⁻¹ A(L)⁻¹, provided that A₀ and B₀ are nonsingular. Hint: By definition, A(L)B(L)[A(L)B(L)]⁻¹ = I.
6.4
Estimating Autoregressions
Autoregressive processes (autoregressions) are popular in econometrics, not only because they have a natural interpretation but also because they are easy to estimate. This section considers the estimation of autoregressions. Estimation of ARMA processes will be touched upon at the end.
Estimation of AR(1)
Recall that an AR(1) is an MA(∞) with absolutely summable coefficients. So if {ε_t} is independent white noise (an i.i.d. sequence with zero mean and finite variance), {y_t} is strictly stationary and ergodic (Proposition 6.1(d)). Letting x_t = (1, y_{t−1})' and β = (c, φ)', the AR(1) equation (6.2.1) can be written as a regression equation

y_t = x_t'β + ε_t.   (6.4.1)

We assume the sample is (y₀, y₁, y₂, ..., y_n), y₀ inclusive, so that (6.4.1) for t = 1 (which has y₀ on the right-hand side) can be included in the estimation. So the sample period is from t = 1 to n. We now show that all the conditions of Proposition 2.5 about the asymptotic properties of the OLS estimator of β with conditional homoskedasticity are satisfied here. First, obviously, linearity (Assumption 2.1) is satisfied. Second, as just noted, {y_t, x_t} is jointly stationary and ergodic (Assumption 2.2). Third, since y_{t−1} (the nonconstant regressor in x_t) is a function of {ε_{t−1}, ε_{t−2}, ...}, it is independent of the error term ε_t by the i.i.d. assumption for {ε_t}. Thus

E(ε_t² | x_t) = E(ε_t²) = σ²,   (6.4.2)

that is, the error is conditionally homoskedastic (Assumption 2.7). Fourth, if g_t = x_t · ε_t = (ε_t, y_{t−1}ε_t)', {g_t} is a martingale difference sequence, as required in Assumption 2.5, because

first element of E(g_t | g_{t−1}, g_{t−2}, ...)
  = E(ε_t | g_{t−1}, g_{t−2}, ...)
  = E(ε_t | ε_{t−1}, ε_{t−2}, ..., y_{t−2}ε_{t−1}, y_{t−3}ε_{t−2}, ...)
  = 0
(since {ε_t} is i.i.d. and y_{t−j} is a function of (ε_{t−j}, ε_{t−j−1}, ...))
and

second element of E(g_t | g_{t−1}, g_{t−2}, ...)
  = E(ε_t y_{t−1} | g_{t−1}, g_{t−2}, ...)
  = E[E(ε_t y_{t−1} | y_{t−1}, g_{t−1}, g_{t−2}, ...) | g_{t−1}, g_{t−2}, ...]   (by the Law of Iterated Expectations)
  = E[y_{t−1} E(ε_t | y_{t−1}, g_{t−1}, g_{t−2}, ...) | g_{t−1}, g_{t−2}, ...]
  = 0   (since E(ε_t | y_{t−1}, ε_{t−1}, ε_{t−2}, ..., y_{t−2}ε_{t−1}, y_{t−3}ε_{t−2}, ...) = 0).   (6.4.3)

In particular, E(ε_t y_{t−1}) = 0, so x_t is orthogonal to the error term (Assumption 2.3). Finally, the rank condition about the moment matrix E(x_t x_t') (Assumption 2.4) is satisfied because the determinant of

E(x_t x_t') = [ 1        μ
                μ   γ₀ + μ² ]   (6.4.4)

is γ₀ > 0. This also means that E(g_t g_t') is nonsingular because under conditional homoskedasticity E(g_t g_t') = σ² E(x_t x_t'). So the nonsingularity condition in Assumption 2.5 is met. Thus the AR(1) with an independent white noise process satisfies all the assumptions of Proposition 2.5, so parts (a)-(c) of the proposition hold true here, with (c, φ)' as β, (1, y_{t−1})' as x_t, and σ² as the error variance. In particular, the OLS estimate of the error variance,

s² = (1/(n − 2)) Σ_{t=1}^n ê_t²,  where ê_t = y_t − ĉ − φ̂ y_{t−1} and (ĉ, φ̂) are the OLS estimates,

is consistent for σ².
Estimation of AR(p)
Similarly for AR(p), the autoregressive coefficients (φ₁, ..., φ_p) can be consistently estimated by regressing y_t on the constant and lagged y's, (y_{t−1}, ..., y_{t−p}). We assume the sample is (y_{−p+1}, y_{−p+2}, ..., y₀, y₁, ..., y_n), inclusive of p lagged values prior to y₁, so that the AR(p) equation for t = 1 can be included and the sample period is from t = 1 to n.

Proposition 6.7 (Estimation of AR coefficients): Let {y_t} be the AR(p) process following (6.2.6) with the stationarity condition (6.1.18). Suppose further that {ε_t} is independent white noise. Then the OLS estimator of (c, φ₁, φ₂, ..., φ_p)
obtained by regressing y_t on the constant and p lagged values of y for the sample period of t = 1, 2, ..., n is consistent and asymptotically normal. Letting β = (c, φ₁, φ₂, ..., φ_p)' and x_t = (1, y_{t−1}, ..., y_{t−p})', and β̂ = the OLS estimate of β, we have

Avar(β̂) = σ² [E(x_t x_t')]⁻¹,

which is consistently estimated by

Avar̂(β̂) = s² · ((1/n) Σ_{t=1}^n x_t x_t')⁻¹,

where s² is the OLS estimate of the error variance given by

s² = (1/(n − p − 1)) Σ_{t=1}^n (y_t − ĉ − φ̂₁y_{t−1} − ⋯ − φ̂_p y_{t−p})².   (6.4.5)
The proof is to verify the assumptions of Proposition 2.5, which can be done in the same way as we did for AR(1). The only difficult part is to show that E(x_t x_t') is nonsingular (Assumption 2.4), which is equivalent to requiring the p × p autocovariance matrix Var(y_t, ..., y_{t−p+1}) to be nonsingular (see Review Question 1(b) below). It can be shown that the autocovariance matrix of a covariance-stationary process is nonsingular for any p if γ₀ > 0 and γ_j → 0 as j → ∞.⁶ So Assumption 2.4, too, is met. We have not covered the maximum likelihood estimation of AR(p), but for those of you who are curious about the connection to ML, the OLS estimator is numerically the same as the conditional Gaussian ML estimator and asymptotically equivalent to the exact Gaussian ML estimator. See Section 8.7.
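The estimation described in Proposition 6.7 is just OLS on the constant and p lags. The following sketch (our own illustration; the function and variable names are not from the text) computes the OLS estimates, s² as in (6.4.5), and the estimated asymptotic variance.

```python
import numpy as np

def fit_ar_ols(y, p):
    """OLS for an AR(p): regress y_t on (1, y_{t-1}, ..., y_{t-p}).

    The first p observations of `y` are treated as presample lags, so the
    sample period is t = 1, ..., n with n = len(y) - p.
    """
    y = np.asarray(y, dtype=float)
    T, n = len(y), len(y) - p
    X = np.column_stack([np.ones(n)] +
                        [y[p - j : T - j] for j in range(1, p + 1)])  # (1, y_{t-1}, ..., y_{t-p})
    b = np.linalg.solve(X.T @ X, X.T @ y[p:])     # (c_hat, phi_1_hat, ..., phi_p_hat)
    resid = y[p:] - X @ b
    s2 = resid @ resid / (n - p - 1)              # (6.4.5)
    avar = s2 * np.linalg.inv(X.T @ X / n)        # estimate of Avar(beta_hat)
    se = np.sqrt(np.diag(avar) / n)               # standard errors
    return b, s2, se
```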
Choice of Lag Length
All these nice results about estimation of autoregressions assume that the order (the lag length) of the autoregression, p, is known. How should we proceed if the order p is unknown? First of all, it is clear that Proposition 6.7, which is about p-th order autoregressions, is applicable to autoregressions of lower order, say r < p. That is, even if φ_r ≠ 0 but φ_{r+1} = φ_{r+2} = ⋯ = φ_p = 0, the proposition remains applicable as long as (φ₁, φ₂, ..., φ_r) satisfies the stationarity condition. In particular, the OLS estimates of (φ_{r+1}, φ_{r+2}, ..., φ_p) will converge to the true value of zero. To put the same argument differently, suppose that the true order of

⁶ See, e.g., Proposition 5.1.1 of Brockwell and Davis (1991).
the autoregression is p (so φ_p ≠ 0) and suppose that all we know about p is that it is less than or equal to some known integer p_max. By setting the r in the argument just given to p and the p to p_max, we see that Proposition 6.7 is applicable to autoregressions with p_max lags: if {y_t} is stationary AR(p) and if we regress y_t on the constant and p_max lags of y, then the OLS coefficient estimates will be consistent and asymptotically normal. It is natural to think that the true lag length may be estimated from this OLS estimate of (φ₁, ..., φ_{p_max}). We consider two classes of rules for determining the lag length. The first is the

General-to-specific sequential t rule: Start with an autoregression of p_max lags. If the last lag is significant at some prespecified significance level (say, 10 percent), then end the procedure and set the lag length to p_max. Otherwise drop the last lag, reestimate the autoregression with one fewer lag, and repeat the same test. If this process continues until there is only one lag in the autoregression and if that lag is insignificant, then set the lag length to 0 (no lags).

This rule has the following properties, to be illustrated in a moment. Since the t-test is consistent, the lag length thus chosen will never be less than p (the true lag length) in large samples. However, the probability of overfitting (that the lag length chosen is greater than the true lag length p) is not zero, even in large samples. That is, if p̂ is the lag length chosen by the sequential t rule,

lim_{n→∞} Prob(p̂ < p) = 0  and  lim_{n→∞} Prob(p̂ > p) > 0.   (6.4.6)

To illustrate, suppose p = 2 and p_max = 3. The sequential t rule starts out with three lags in the autoregression:

y_t = c + φ₁y_{t−1} + φ₂y_{t−2} + φ₃y_{t−3} + ε_t.
We test the null hypothesis that φ₃ = 0 at the 10 percent significance level. With a probability of 10 percent, we reject the (true) null and set the lag length to 3, or accept the null with a probability of 90 percent, in large samples. In the event of accepting the null, we set φ₃ = 0, estimate an autoregression with two lags, and test the hypothesis that φ₂ = 0. Since φ₂ ≠ 0 in truth (i.e., since p = 2), the absolute value of the t-value on the OLS estimate of φ₂ would be very large in large samples, so we would never accept the (false) null that the second lag is zero. Therefore, in this example, Prob(p̂ = 2) = 90% and Prob(p̂ = 3) = 10% in large samples. There are two minor variations in the choice of the sample period with data (y₁, ..., y_n). One is to use the same sample period of t = p_max + 1, p_max + 2, ..., n
throughout. The other is to let the sample period be "elastic" by expanding it as fewer and fewer lags are included. That is, when the autoregression being estimated has j lags, the sample period is t = j + 1, j + 2, ..., n. The second rule for selecting the lag length uses either the Akaike information criterion (AIC) or the Schwarz information criterion (SIC), also called the Bayesian information criterion (BIC). This procedure sets the lag length to the j that minimizes

log(SSR_j / n) + (j + 1) · C(n)/n,   (6.4.7)
over j = 0, 1, 2, ..., p_max, where SSR_j is the sum of squared residuals for the autoregression with j lags. In this formula, "j + 1" enters as the number of parameters (the coefficients of the constant and j lags). For the AIC, C(n) = 2, while for the BIC, C(n) = log(n). A few comments about this information-criterion-based rule:

• As the number of parameters increases with j, the first term of the objective function (6.4.7) declines because the fit of the equation gets better, but the second term increases. The information criterion strikes a balance between a better fit and model parsimony.

• Without an upper bound p_max, a ridiculously large value of j might be chosen (indeed, the information criterion achieves a minimum of −∞ at j + 1 = n because SSR_j = 0 for j = n − 1). In the present context where the true DGP is a finite-order autoregression, a reasonable upper bound is the maximum possible lag length.

• Again there can be two ways to set the sample period. You can use the fixed sample period of t = p_max + 1, p_max + 2, ..., n throughout, in which case SSR_j is the sum of n − p_max squared residuals. Or you can use the "elastic" sample period, in which case SSR_j is the sum of n − j squared residuals. In either case, there is yet another variation, which is to replace the n in the information criterion (6.4.7) by the actual sample size. So you replace the n by n − p_max if the fixed sample is to be used. Based on simulations and some theoretical considerations, Ng and Perron (2000) recommend the use of a fixed, rather than an "elastic," sample period, with n − p_max replacing n in the information criterion.

Let p̂_AIC and p̂_BIC be the lag lengths chosen by the AIC and BIC, respectively. Like the lag length selected by the general-to-specific t rule, they are functions of data, and hence are random variables. For them it is possible to show the following (see, e.g., Lütkepohl, 1993, Section 4.3):
(1) p̂_BIC ≤ p̂_AIC unless the sample size is very small. (If the fixed sample of t = p_max + 1, ..., n is used and "n − p_max" replaces n in (6.4.7), this is true for any finite sample with n > p_max + 8.) This is an algebraic result and holds for any sample on y.

(2) Suppose that {y_t} is stationary AR(p) and {ε_t} is independent white noise with a finite fourth moment. Then

lim_{n→∞} p̂_BIC = p,  lim_{n→∞} Prob(p̂_AIC < p) = 0,  lim_{n→∞} Prob(p̂_AIC > p) > 0.   (6.4.8)

That is, just like the general-to-specific sequential t rule, the AIC rule has a positive probability of overfitting. In contrast, the BIC rule is consistent. Furthermore, Hannan and Deistler (1988, Theorem 5.4.1(c) and its Remark 1) have shown that the consistency of p̂_BIC holds when the upper bound p_max is made to increase at a rate of log(n) (i.e., when it is set to [c log(n)], the integer part of c log(n), for any given c > 0). This means that you do not need to know an upper bound in the consistent estimation by BIC of the order of an autoregression.
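As an illustration of the information-criterion rule, the following sketch (our own; names are not from the text) selects the lag length by minimizing (6.4.7) over a fixed sample period of t = p_max + 1, ..., n, with the actual sample size replacing n as recommended by Ng and Perron (2000).

```python
import numpy as np

def select_ar_lag(y, p_max, criterion="bic"):
    """Pick the AR lag length by minimizing (6.4.7) over j = 0, 1, ..., p_max."""
    y = np.asarray(y, dtype=float)
    T = len(y)
    m = T - p_max                                     # fixed effective sample size
    best_j, best_ic = 0, np.inf
    for j in range(p_max + 1):
        Y = y[p_max:]
        X = np.column_stack([np.ones(m)] +
                            [y[p_max - k : T - k] for k in range(1, j + 1)])
        b = np.linalg.lstsq(X, Y, rcond=None)[0]
        ssr = float(np.sum((Y - X @ b) ** 2))
        C = 2.0 if criterion == "aic" else np.log(m)  # C(n): 2 for AIC, log(n) for BIC
        ic = np.log(ssr / m) + (j + 1) * C / m
        if ic < best_ic:
            best_j, best_ic = j, ic
    return best_j
```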
Estimation of VARs
Estimation of VAR coefficients is equally easy. If y_t is M-dimensional, the VAR(p) (6.3.9) is an M-equation system and can be written as

y_{tm} = x_t'δ_m + ε_{tm}   (m = 1, 2, ..., M),   (6.4.9)

where y_{tm} is the m-th element of y_t, ε_{tm} is the m-th element of ε_t, and

x_t = (1, y_{t−1}', ..., y_{t−p}')'   ((Mp + 1) × 1),
δ_m = (c_m, φ_{1m}', ..., φ_{pm}')',  with c_m = m-th element of c and φ_{jm}' = m-th row of Φ_j.
So the M equations share the same set of regressors. It is easy to verify that x, is orthogonal to the error term in each equation and that the error is conditionally homoskedastic (the proof is very similar to the proof for AR(p)). So the M-equation system with i.i.d. {Er) satisfies all the conditions for the multivariate regression model of Section 4.5. Thus, efficient GMM estimation is very easy:
just do equation-by-equation OLS. The expressions for the asymptotic variance and its consistent estimate can be obtained from Proposition 4.6 with z_{im} = x_t (so (4.5.13') on page 280 becomes (1/n) Σ_t x_t x_t'). If δ is the (Mp + 1)M-dimensional stacked vector created from (δ₁, ..., δ_M) and if δ̂ is the equation-by-equation OLS estimate of δ, we have from (4.5.17)

Avar̂(δ̂) = Ω̂ ⊗ ((1/n) Σ_{t=1}^n x_t x_t')⁻¹,  where  Ω̂ = (1/n) Σ_{t=1}^n ê_t ê_t',   (6.4.10)

with

ê_t = y_t − ĉ − Φ̂₁ y_{t−1} − ⋯ − Φ̂_p y_{t−p}.   (6.4.11)
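Because all M equations share the same regressors x_t, efficient GMM estimation of the VAR is just equation-by-equation OLS. The sketch below (our own illustration, not from the text) computes the OLS coefficients, the residuals in (6.4.11), and Ω̂.

```python
import numpy as np

def fit_var_ols(Y, p):
    """Equation-by-equation OLS for a VAR(p); all M equations share x_t.

    Y is a (T, M) array; the first p rows serve as presample lags.
    Returns the (Mp+1, M) coefficient matrix B (constant in the first row,
    so column m stacks delta_m), the residual matrix E, and Omega_hat.
    """
    Y = np.asarray(Y, dtype=float)
    T, M = Y.shape
    n = T - p
    X = np.column_stack([np.ones(n)] +
                        [Y[p - k : T - k] for k in range(1, p + 1)])   # rows are x_t'
    B = np.linalg.lstsq(X, Y[p:], rcond=None)[0]                       # OLS, column by column
    E = Y[p:] - X @ B                                                  # e_t_hat as in (6.4.11)
    Omega_hat = E.T @ E / n                                            # (1/n) sum e_t e_t'
    return B, E, Omega_hat
```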
If we do not know the lag length p, we estimate it from data. In the information-criteria-based rule, the lag length we choose is the k that minimizes

log | (1/n) Σ_{t=1}^n ê_t ê_t' | + (k M² + M) · C(n)/n,   (6.4.12)
over k = 0, 1, ..., p_max. Here, as in the univariate case, C(n) = 2 for the AIC and log(n) for the BIC. Exactly the same results, listed as (1) and (2) above for the univariate case, hold for the multivariate case (see, e.g., Lütkepohl, 1993, Section 4.3).
Estimation of ARMA(p, q)
Letting

u_t = ε_t + θ₁ε_{t−1} + θ₂ε_{t−2} + ⋯ + θ_q ε_{t−q},
z_t = (1, y_{t−1}, y_{t−2}, ..., y_{t−p})',
δ = (c, φ₁, φ₂, ..., φ_p)',

the univariate ARMA(p, q) equation (6.2.9) can be written as

y_t = z_t'δ + u_t.   (6.4.13)
The moving-average component of ARMA creates two problems. First, obviously, the error term u, is serially correlated (in fact, it is MA(q)). Second, since the lagged dependent variables included in z, are correlated with lags of e included in the error term, the regressors z, are not orthogonal to the error term u,. That is, serial correlation in the presence of lagged dependent variables results in biased estimation. The second problem could be dealt with by using suitably lagged
dependent variables (y_{t−q−1}, y_{t−q−2}, ...) as instruments. Regarding the first problem, the method to be developed later in this chapter provides consistent estimates of the coefficient vector in the presence of serial correlation. Given a consistent estimate of δ, the MA parameters (the θ's) can be estimated from the residuals. This procedure, however, is not efficient.⁷ We will not cover the topic of efficient estimation of ARMA(p, q) processes with Gaussian errors. Interested readers are referred to, e.g., Fuller (1996, Chapter 8) or Hamilton (1994, Chapter 5).

QUESTIONS FOR REVIEW

1. (Avar of AR coefficients) (a) For AR(1), show that

Avar(φ̂) = σ²/γ₀ = 1 − φ²,

where φ̂ is the OLS estimator of φ. Hint: It is σ² times the (2, 2) element of the inverse of (6.4.4).
(b) For AR(p), let φ̂ be the OLS estimate of (φ₁, φ₂, ..., φ_p)'. Show that

Avar(φ̂) = σ² V⁻¹,

where V is the p × p autocovariance matrix

V = Var(y_t, ..., y_{t−p+1}) =
[ γ₀       γ₁       γ₂     ⋯  γ_{p−1}
  γ₁       γ₀       γ₁     ⋯  γ_{p−2}
  ⋮                         ⋱   ⋮
  γ_{p−1}  γ_{p−2}   ⋯     γ₁   γ₀ ].

Hint: If x_t is as in Proposition 6.7, then

E(x_t x_t') = [ 1       μ1'
                μ1   V + μ²11' ],

where 1 is a p × 1 vector of ones. Show that the inverse of this matrix is
⁷ The model yields infinitely many orthogonality conditions, E(u_t · y_s) = 0 for all s ≤ t − q − 1. The GMM procedure can exploit only finitely many orthogonality conditions.
given by

[ 1 + μ²1'V⁻¹1    −μ1'V⁻¹
     −μV⁻¹1          V⁻¹   ].

This also shows that E(x_t x_t') is nonsingular if and only if V is. (If V is nonsingular, then E(x_t x_t') is nonsingular because it is invertible. If V is singular, then there exists a p-dimensional vector d ≠ 0 such that Vd = 0 and E(x_t x_t')f = 0, where f = (−μ1'd, d')'.)

2. (Estimated VAR(1) coefficients) Consider a VAR(1) without the intercept: y_t = Ay_{t−1} + ε_t. Verify that the estimate of A by multivariate regression using the sample (y₁, y₂, ..., y_n) (so the sample period is t = 2, 3, ..., n) is

Â = (Σ_{t=2}^n y_t y_{t−1}') (Σ_{t=2}^n y_{t−1} y_{t−1}')⁻¹.
Hint: The coefficient vector in the m-th equation is the m-th row of A.
6.5
Asymptotics for Sample Means of Serially Correlated Processes
This section studies the asymptotic properties of the sample mean

ȳ = (1/n) Σ_{t=1}^n y_t
for serially correlated processes. (It should be kept in mind that y depends on n, although the notation does not make it explicit.) For the consistency of the sample mean, we already have a sufficient condition in the Ergodic Theorem of Chapter 2. The first proposition of this section provides another sufficient condition in the form of restrictions on covariance-stationary processes. We also provide two CLTs for serially correlated processes, one for linear processes and the other for ergodic stationary processes. Chapter 2 includes a CLT for ergodic stationary processes, but it rules out serial correlation because the process is assumed to be a martingale difference sequence. The CLT for ergodic stationary processes of this section generalizes it by allowing for serial correlation.
LLN for Covariance-Stationary Processes
Proposition 6.8 (LLN for covariance-stationary processes with vanishing autocovariances): Let {y_t} be covariance-stationary with mean μ and {γ_j} be the autocovariances of {y_t}. Then

(a) ȳ →_{m.s.} μ  if γ_j → 0 as j → ∞, and

(b) lim_{n→∞} Var(√n ȳ) = Σ_{j=−∞}^∞ γ_j < ∞  if {γ_j} is summable.⁸   (6.5.1)
Since a mean square convergence implies a convergence in probability, part (a) shows that a very mild condition on covariance-stationary processes suffices for the sample mean to be consistent for μ. To prove part (a), by Chebychev's LLN (see Section 2.1), it suffices to show that lim_{n→∞} Var(ȳ) = 0, which is fairly easy (Analytical Exercise 9) once the expression for Var(ȳ) is derived. Unlike in the case without serial correlation in {y_t}, calculating the variance of ȳ is more cumbersome because covariances between two different terms have to be taken into account:

n · Var(ȳ) = Var(√n ȳ)
 = (1/n)[Cov(y₁, y₁ + ⋯ + y_n) + ⋯ + Cov(y_n, y₁ + ⋯ + y_n)]
 = (1/n)[(γ₀ + γ₁ + ⋯ + γ_{n−2} + γ_{n−1}) + (γ₁ + γ₀ + γ₁ + ⋯ + γ_{n−2}) + ⋯ + (γ_{n−1} + γ_{n−2} + ⋯ + γ₁ + γ₀)]
 = (1/n)[nγ₀ + 2(n − 1)γ₁ + ⋯ + 2(n − j)γ_j + ⋯ + 2γ_{n−1}]
 = γ₀ + 2 Σ_{j=1}^{n−1} (1 − j/n) γ_j.   (6.5.2)
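As a quick numerical check of (6.5.2) (our own illustration, with made-up AR(1) autocovariances γ_j = γ₀φ^j), the right-hand side of (6.5.2) should equal 1'V1/n, where V is the n × n autocovariance matrix:

```python
import numpy as np

phi, gamma0, n = 0.6, 1.0, 3                        # hypothetical values
gam = gamma0 * phi ** np.arange(n)                  # gamma_0, ..., gamma_{n-1}
lhs = gam[0] + 2 * sum((1 - j / n) * gam[j] for j in range(1, n))
V = gamma0 * phi ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
rhs = np.ones(n) @ V @ np.ones(n) / n               # n*Var(ybar) = 1'V1/n
assert np.isclose(lhs, rhs)
```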
(If you have difficulty deriving this, set n to, say, 3 to verify.) For each fixed j ≥ 1, the coefficient of γ_j in (6.5.2) goes to 1 as n → ∞, so that all the autocovariances eventually enter the sum with a coefficient of one. This is not enough to show the desired result (6.5.1). Filling in the rest of the proof of (b) is Analytical Exercise 10. For a covariance-stationary process {y_t}, we define the long-run variance to be the limit as n → ∞ of Var(√n ȳ) (if it exists). So part (b) of the proposition says that the long-run variance equals Σ_{j=−∞}^∞ γ_j, which in turn equals the value of
8 Anderson (1971, Lemma 8.3.1, p. 460), Absolute summability of {Yj} is not needed.
the autocovariance-generating function g_y(z) at z = 1 (see (6.2.15)) or 2π times the spectrum at frequency zero (see (6.2.17)).

Two Central Limit Theorems
We present two CLTs to cover serial correlation. The first is a generalization of Lindeberg-Levy.
Proposition 6.9 (CLT for MA(∞)): Let y_t = μ + Σ_{j=0}^∞ ψ_j ε_{t−j}, where {ε_t} is independent white noise and Σ_{j=0}^∞ |ψ_j| < ∞. Then

√n (ȳ − μ) →_d N(0, Σ_{j=−∞}^∞ γ_j).   (6.5.3)
We will not prove this result (see Anderson, 1971, Theorem 7.7.8 or Brockwell and Davis, 1991, Theorem 7.1.2 for a proof), but we should not be surprised to see that Avar(v) (the variance of the limiting distribution of Jri (ji - f.L)) equals the long-run variance L:)':,_ 00 Yj· We know from Lemma 2.1 that "xn ---+d x," "Var(xn) ---+ c < oo" =} "Var(x) =c." Here, set Xn = Jri (ji- f.L). By (6.5.1), Var(xn) converges to a finite limit L yj. So the variance of the limiting distribution of .fo (y- f.L) is LYj· By Proposition 6.1 (d), the process in Proposition 6.9 is (stationary and) ergodic. Remember that ergodicity limits serial correlation by requiring that two random variables sufficiently far apart in the sequence be almost independent. But ergodic stationarity alone is not sufficient to ensure the asymptotic normality of ji. A natural question arises: is there a further restriction on ergodicity, short of requiring the process to be linear as in Proposition 6.9, that delivers asymptotic normality? The restriction that has become increasingly popular is what this book calls Gordin's condition for ergodic stationary processes. It has three parts. The first two parts follow: (a) EC_v?) < oo. (This is a restriction on [strictly] stationary processes because, strictly speaking, a stationary process might not have finite second moments.) (b) E(y, I Yr-j. Yr-j-I •... ) ---+m.,, 0 as j ---+ 00. (Since (y,} is stationary, this condition is equivalenttoE(yo I Y-j. Y-j-I· ... ) ---+m.s. 0 as j---+ oo. Because E(y, I Yt-j• Yr-j-I •... ) is a random variable, the convergence cannot be the usual convergence for real numbers.)
This condition implies that the unconditional mean is zero, E(y,) = 0. 9 As the forecast (the conditional expectation) is based on less and less information, it should approach the forecast based on no information, i.e., the unconditional expectation. To prepare for the third part, let I,= (y,, y,_ 1, y,_2 , ..• ) and write y, as
y, = y,- [E(y, I I,_1)- E(y, I IH)1- [E(y, I I,_2)- E(y, I I,_2)1- · · ·
I I,_ 1 )- E(y, I I,_ 1)1 [y, - E(y, I I,_Jl] + [E(y, I I,_ I)- E(y, I I,_2)1 + · · · + [E(y, I I,_/+1)- E(y, I I,_;)]+ E(y, I I,_;) (r,o + rn + · · · + r,,;-1l + E(y, I Yt-J• y,_J-1, ... ) - [E(y,
=
=
where
This
r,k
is the revision of expectations about y 1 as the information set increases
from It-k-1 to It-k· By (b), y, - (r1o + rn + · · · + r,,;-1l converges in mean square to zero as j -+ oo for each t. So y, can be written as 00
Yr
=
Lrrj· J=O
This sum is called the telescoping sum. The third part of Gordin's condition is 00
(c) L[E(r1))J 112 < oo. J~O
The telescoping sum indicates how the "shocks" represented by (r, 0 , rn, ... ) influence the current value of y. Condition (c) says, roughly speaking, that shocks that occurred a long time ago do not have disproportionately large influence. As such, the condition restricts the extent of serial correlation in {y,}. To understand the condition, take for example a zero-mean AR(l) with independent errors: y, = Yt-1
+ £1,
I
{e1 } independent white noise with o- 2 = Var(e1 ).
Evidently, Gordin's condition (a) is satisfied. Condition (b) too is satisfied because
9See Lemma 5.14 of White (1984) for a proof.
404
Chapter 6
(6.5.4) and E[(¢>j Yr-j) 2 ]--> 0 as j--> oo. The expectation revision r,j can be written as (6.5.5) So the telescoping sum for AR(l) is the MA(oo) representation. Condition (c) is satisfied because [E(r~) ] 112 =
I.Pij a
and
The CLT we have been seeking is Proposition 6.10 (Gordin's CLTforzero-mean ergodic stationary processes): 10
Suppose (y,} is stationary and ergodic and suppose Gordin's condition is satisfied. Then E(y,) = 0, the autocovariances ( yj} are absolutely summable, and 00
.fiiy-:
N(o, L Yj)· j=-00
Since Gordin's condition is satisfied if (y,} is a martingale difference sequence, which exhibits no serial correlation, Proposition 6.10 is a generalization of the ergodic stationary Martingale Difference Sequence CLT of Chapter 2. Multivariate Extension
Extension of the above results to the multivariate case is straightforward. Let
J
Y=-n
n
LYr r=1
be the sample mean of a vector process (y, }. • The multivariate version of Proposition 6.8 is that (a)
y -->m.,.
fJ if each diagonal element of rj goes to zero as j -->
00,
and
(b) limn~oo Var(y'fi y) (which is the long-run covariance matrix of (y,}) equals
L:)':,_
00
rj if {fj} is summable (i.e., if each component of rj is summable).
10 This result is due to Gordin (1969), restated as Theorem 5.15 of White (1984), The claim about the absolute summability of autocovariances is noted in footnote 19 of Hansen ( 1982).
• Therefore, the long-run covariance matrix of {y, l can be written as 00
00
(6.5.6) j=-00
j=I
where Gv(z) (the autocovariance-generating function) and Sy (w) (the spectrum) for the vector process {y,} are defined in Section 6.3. • The multivariate version of Proposition 6.9 is that 00
-fiiCY-1-'l--;
N(o. L rj)
(6.5.7)
j=-oo
if {y, lis vector MA(oo) with absolutely summable coefficients and {e, lis vector independent white noise. • The multivariate extension of Gordin's condition on ergodic stationary processes is obvious: 11
Gordin's condition on ergodic stationary processes (a) E(y, y;) exists and is finite. (b) E(y, I Yr-j. Yr-j-1• ... ) ~m.,. 0 as j ~DO. 00
(c) L[E(r;jrlj )] 112 is finite, where j=O
r,j
= E(y,
I Yt-j· Yr-j-1, ... ) - E(y, I Yr-j-1> Yr-j-2 •... ).
• The multivariate version of Proposition 6.10 is obvious: Suppose Gordin's condition holds for vector ergodic stationary process {y1 }. Then E(y1 ) = 0, {fj} is absolutely summable, and 00
Jny--;
N(o. L rj)·
(6.5.8)
j=-oo
QUESTIONS FOR REVIEW 1.
Show that Var(.fii ji) = 1' Var(y,, Yr-1, ... , Yt-n+l)ljn, where Var(y1, Yr-I, ... , Yr-n+I) is then x n autocovariance matrix of a covariance-stationary process {y, }.
llStated as Assumption 3.5 in Hansen (1982).
406
Chapter 6
2. (Avar(y) for MA(oo)) Verify that Avar(y) for the MA(oo) process in Proposition 6. 9 equals
Hint: It should equal gy(l) for MA(oo). 3. (Asymptotics of y for AR(l )) Consider the AR(l) process y, = c+¢>y,_ 1 +er with 1¢>1 < 1. Does y --+p j.i.? Add the assumption that(&,} is an independent white noise process. Calculate Avar(y). [Answer: [(I + ¢>)/(1 - ¢>)]y0 .]
4. (Long-run covariance matrix of filtered series) Let H(L) = H 0 + H 1L and (x,} be covariance-stationary with absolutely summable autocovariances. Verify that the long-run covariance matrix of y, = H(L) x, is given by (Ho
+ H1) G,(l) (H~ + u;).
Hint: Use (6.3.13) and (6.5.6).
5. (Long-run variance of "unit-root MA'' processes) What is the long-run variance of y, = &, - &r-I> where"' is white noise? [Answer: zero.]
6.6
Incorporating Serial Correlation in GMM
The Model and Asymptotic Results
We have now all the tools needed to introduce serial correlation in the singleequation GMM model of Chapter 3. Recall that g, in Chapter 3 (with the subscript "i" replaced by "t") is a K -dimensional vector defined as x, · &,, the product of the K -dimensional vector of instruments x, and the scalar error term &, . By Assumption 3.3 (orthogonality conditions), the mean of g, is zero. We assumed in Assumption 3.5 that (g,) was a mar1ingale difference sequence. Since no serial correlation is allowed under this assumption, the matrix S, defined to be the asymptotic variance of g (= ~ L~~I g, ), was simply the variance of g,. We can now allow for serial correlation in (g,] by relaxing Assumption 3.5 as
Assumption 3.5' (Gordin's condition restricting the degree of serial correlation): (g,] satisfies Gordin's condition. Its long-ron covariance matrix is nonsingular.
407
Serial Correlation
Then by the multivariate version of Proposition 6.10, .jii g converges to a normal distribution. The variance of this limiting distribution (namely, Avar(g)), denoted S, equals the long-run covariance matrix, 00
00
(6.6.1) j=l
j=-00
where
rj
is the j -th order autocovariance matrix
rj = E(g,g;_j)
u=
o. ±I. ±2.... ).
(6.6.2)
(Since E(g,) = 0 by the orthogonality conditions, there is no need to subtract the mean from g,.) That is, S (= Avar(g)) equals r 0 under Assumption 3.5 and L~-oo rj under Assumption 3.5'. This is the only difference between the GMM model with and without serial correlation. Accordingly, all the results we developed in Sections 3.5-3.7 carry over, provided that Sis now given by (6.6.1) and S, some consistent estimation of S, is redefined to take into account serial correlation. More specifically:
-
a
• The GMM estimator (W) remains consistent and asymptotically normal. Its asymptotic variance is consistently estimated by (3.5.2), provided that the S there is consistent for S defined in (6.6.1 ). This estimated asymptotic variance sometimes gets the adjective heteroskedasticity and autocorrelation consistent (HAC). • The GMM estimator achieves the minimum variance when p1imn-oo \V =
s-t
where S is given by (6.6.1 ).
-
• The two-step procedure described in Section 3.5 still provides an efficient GMM estimator, provided that S calculated in Step 1 is consistent for S. • With S thus properly defined, the expressions for the t, W, J, C, and LR statistics of Sections 3.5-3.7 remain valid; those statistics retain the same asymptotic distributions in the presence of serial correlation. Estimating S When Autocovariances Vanish after Finite Lags
All these results assume that you have a consistent estimate, S, of the long-run variance matrix S. Obtaining such an estimate takes some intellectual effort, particularly when autocovariances do not vanish after finite lags.
408
Chapter 6
We start our quest with consistent estimation of individual autocovariances. The natural estimator is (j = 0, I, ... , n - I)
(6.6.3)
where
g,
=Xr · S,,
£, = Yr- z;J,
3consistent for~.
If we had the true value ~ in place of the estimated value ~ in this formula, then fj would be consistent for rj under Assumptions 3.1 and 3.2, the assumptions implying ergodic stationarity for {g, ). But we have to use the estimated series llld rather than the true series to calculate autocovariances. For this reason we need some suitable fourth-moment condition (which we will not bother to state) for the estimated autocovariances to be consistent. The proof of consistency is not given here because it is very similar to the proof of Proposition 3 .4. Given the estimated autocovariances, there are two cases to consider for the calculation of S. If we know a priori that rj = 0 for j > q where q is known and finite, then clearly Scan be consistently estimated by
-
(recall: j=l
-r-
j
=
_,
r).
(6.6.4)
j=-q
More difficult is the other case where we do not know q (which may or may not be infinite). How can we estimate the long-run variance matrix, which involves possibly infinitely many parameters? There are two approaches. Using Kernels to EstimateS
Recently, several procedures have been proposed to estimate the long-run variance matrix S. 12 A class of estimators, called the kernel-based (or "nonparametric") estimators, can be expressed as a weighted average of estimated autocovariances: (6.6.5)
12 For the results to be stated in this and the next subsection to hold, we need to impose a set of technical conditions (in addition to Gordin's condition above) that further restrict the nature of serial correlation. For a statement of those conditions and a more detailed discussion of the subject treated in this and the next subsection, see Den Haan and Levin ( !9%b).
Here, the function k(·), which gives weights for autocovariances, is called a kernel and q (n) is called the bandwidth. The bandwidth can depend on the sample size n. The estimator (6.6.4) above is a special kernel-based estimator with q(n) = q and
k(x) =
I
for lxl
{0
for lxl > 1.
(6.6.6)
This kernel is called the truncated kernel. In the present case of unknown q, we could use the truncated kernel with a bandwidth q(n) that increases with the sample size. As q(n) increases to infinity, more and more (and ultimately all) autocovariances can be included in the calculation of S. This truncated kernel-based S, however, is not guaranteed to be positive semidefinite in finite samples. This is a problem because then the estimated asymptotic variance of the GMM estimator may not be positive semidefinite. Newey and West (1987) noted that the kernel-based estimator can be made nonnegative definite in finite samples if the kernel is the Bartlett kernel:
k(x) =
I {0
-lxl
for lxl
(6.6.7)
for lxl > 1.
The Bartlett kernel-based estimator of S is called (in econometrics) the Newey-West estimator. For example, for q(n) = 3, the kernel-based estimator includes autocovariances up to two (not three) lags:

Ŝ = Γ̂₀ + (2/3)(Γ̂₁ + Γ̂₁') + (1/3)(Γ̂₂ + Γ̂₂').   (6.6.8)
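The following sketch (our own illustration; the function name and arguments are not from the text) computes the Bartlett (Newey-West) estimate in (6.6.5) from a matrix whose rows are the estimated ĝ_t; with q = 3 it reproduces (6.6.8).

```python
import numpy as np

def newey_west_S(g, q):
    """Bartlett-kernel (Newey-West) estimate of the long-run variance S.

    g : (n, K) array whose t-th row is g_t = x_t * eps_t_hat (no demeaning,
        since E(g_t) = 0 under the orthogonality conditions).
    q : bandwidth q(n); lag j gets weight 1 - j/q for j = 1, ..., q - 1.
    """
    g = np.asarray(g, dtype=float)
    n = g.shape[0]
    S = g.T @ g / n                                  # Gamma_hat_0
    for j in range(1, q):
        Gamma_j = g[j:].T @ g[:-j] / n               # Gamma_hat_j
        S += (1.0 - j / q) * (Gamma_j + Gamma_j.T)   # Bartlett weight
    return S
```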
-
!3 A sequence of random variables {x11 } is said to be stochastically bounded and written x 11 = Op if for any there exists an M > 0 such that Prob(lx11 1 > M) < f: for all n, If a sequence converges in distribution to a random variable, then the sequence is stochastically bounded.
£
410
Chapter 6
kernels in this subset should be preferred to Bartlett. An example of such kernels is the quadratic spectral (QS) kernel: k(x) =
25 (sin(6rrx/5) 6rrxj5 12rr 2 x 2
-
) cos(6rrxj5) .
(6.6.9)
Since k(x) f= 0 for lxl > I in the QS kernel, all the estimated autocovariances fj (j = 0, I, ... , n - I) enter the calculation of S even if q(n) < n - I. To be sure, the knowledge of the growth rate is not enough to determine the value of the bandwidth for any particular data of finite length. Andrews (1991) also considered a data-dependent forrnula for determining the bandwidth that not only grows at the optimal rate but also minimizes some appropriate criterion (the asymptotic mean square error). But even in this forrnula, some parameter values need to be provided by the user, so the forrnula does not really provide automatic data-based selection of the bandwidth. ~
VARHAC The other procedure for estimating S, called the VAR Heteroskedasticity and Autocorrelation Consistent (VARHAC) estimator, was proposed by Den Haan and Levin (1996a). The idea is to fit a finite-order VAR to the K -dimensional series Ill,} and then construct the long-run covariance matrix implied by the estimated VAR. This is not to say that g, is assumed to be a finite-order VAR. In fact, in this procedure (and also in the kernel-based procedure), g, does not even have to be a linear process, Jet alone a finite-order autoregression. The VARHAC procedure consists of two steps.
Step 1: Lag length selection for each VAR equation. In this step, for each VAR equation, you determine the lag length by the Bayesian information criterion (RIC). Let gk, be the k-th element of the K -dimensional vector g,. The k-th VAR equation can be written as A.(k)8kt = 'f'Jl 8J,t-l A,(k)-
A.(k)A.(k)+ ¥'12 8J,t-2 + ''' + '~-'lp 8I,t-p A,(k)-
A.(k)-
+ "'" g,,,_, + "'" g,_,_, + ... + '~-'2p g2,t-p
-_
A.(k)A.(k)A,(k)+ '~-'KlgK,t-l + '1"'1( 2 gK,t-2 +.". + '1-'KpgK,t-p + ekt ... (k).<(k)(6610) "'' g,_, + · · · + '~"'P g,_P + ek,, .•
where
(j = I, 2, ... , p).
In picking the lag length by the BIC, the maximum lag length Pmax needs to be specified. In Section 6.4, when the BIC was introduced in the context of estimating the true order (lag length) of a finite-order autoregression, Pmax was set to some integer known to be greater than or equal to the true lag length. In the present case where g, is not necessarily a finiteorder VAR, Den Haan and Levin (1996a, Theorem 2(c)) show that the VARHAC estimator is consistent for S when Pmax grows at rate n 113 and recommend setting Pmax to [n 113 ] (the integer part of n 113 ). To emphasize its dependence on the sample size, we denote the maximum lag length by Pmax (n). Let SSR~k) be the sum of squared residuals (SSR) from the OLS estimation of (6.6.10) (the k-th equation in the VAR) fort = PmaxCn) + I, Pmax(n)+2, . .. ,n. 14 For p = O,SSR~) = L;~pm,(,)+l(g,,) 2 , theSSR from the regression of g,, on no regressors. Since there are pK coefficients in the k-th equation, the BIC criterion is log(SSR~) jn)
+ pK ·log(n)jn.
(6.6.11)
(Then in this expression could be replaced by the actual sample size n Pmax with no asymptotic consequences.) Let p(k) be the p that minimizes the information criterion over p = 0, I, ... , Pmax(n) for the k-th VAR equation. Also let
4>t) (j
= I, 2.... , p(k)) be the OLS estimate of the
VAR coefficients <J>j') (j = I, 2, ... , p(k)) when (6.6.10) is estimated by OLS for p = p(k). Step 2: Calculating the implied long-run variance. Let P
= max{p(l), p(2), ... , p(K)).
the largest lag length in the K equations. By construction, P :5 PmaxCn). Step 1 produces an estimated VAR which can be written as
g, (Kxl)
=
4>1
~;,_I
(KxK)(Kxl)
+ · · · + 4>p g,_p + (KxK)(Kxl)
e, (Kxl)
(t = PmaxCn) +I, PmaxCn)
+ 2, ... , n).
14 So the sample period is fixed throughout the process of picking the lag length.
(6.6.12)
Since the lag length differs across equations, some rows of ~P (p = -
'(k)
.
I, 2, ... , P) maybe zero. Therowsof~P are related tot/>i (;=I, 2, ... , p(k)) obtained in Step 1 as '(k)
-
t/>
k-th row of ~P =
{
(KxK)
for I < p ::" p(k),
P
0
(6.6.13)
for p(k) < p ::" P.
Let fi be the VAR residual variance matrix:
<> .3co (KxK)
-
I
n
" "" L
, ,, e,e,.
(6.6.14)
f=Pnw..-(n)+I
Recall that the long-run variance matrix of a VAR is given by (6.3.16) for z = I. The estimator of S implied by the estimated VAR (6.6.12) is then (6.6.15)
This VARHAC estimator is positive semidefinite by construction. Den Haan and Levin (1996a) show that (under suitable regularity conditions) the estimator converges to S at a faster rate than any positive semidefinite kernel-based estimator for almost all autocovariance structures of g, (which is not necessarily a finite-order VAR). In particular, if g, is finite-order VARMA, then the lag order chosen by the BIC grows at a logarithmic rate and the VARHAC estimator converges at a rate arbitrarily close ton 112 , which is faster than the maximum rate n215 achieved by positive semidefinite kernel-based estimators. QUESTIONS FOR REVIEW
1. When x, = z,, what is the efficient GMM estimator? 2. (Consequence of ignoring serial correlation) Consider the efficient GMM estimator that presumes {g,} to be serially uncorrelated. If {g,} is in fact serially correlated, is the estimator consistent? Efficient?
6. 7
Estimation under Conditional Homoskedasticity (Optional)
We saw in Section 3.8 that the GMM estimator reduced to the 2SLS estimator under conditional homoskedasticity. This section shows how the 2SLS can be generalized to incorporate serial correlation. It will then become clear that the GLS estimator is not a GMM estimator with serial correlation. The latter half of this section examines the relationship between the GLS and GMM. Kernel-Based Estimation of S under Conditional Homoskedasticity
The relationship between serial correlation in g, = x, · £ 1 and serial correlation in the error terrn £ 1 becomes clearer under conditional homoskedasticity. Let (6.7.1) Since {£,} is stationary by Assumptions 3.1 and 3.2, wi does not depend on t. If E(£1 ) = 0 (which is the case if x, includes the constant), then wi equals the j-th order autocovariance of {£ 1 }. Suppose the error terrn is conditionally homoskedastic: conditional homoskedasticity: E(£,£,_i I x,, x,_i) = wi
(j = 0, ±I, ±2, ... ).
(6. 7.2)
Under this additional condition, the usual argument utilizing the Law of Total Expectations can be applied to r / ri = E(g,g;_i)
(since E(g,) = 0)
= E(£1 £1-ix,X,_i) = E[E(£ 1£1-i
I x,, x,_j)x,x;_i] (by the Law of Total Expectations)
= wi E(x,x;_i).
(6.7.3)
Thus, unless E(x,x;_i) is zero, {g,} is serially correlated if and only if{£,} is. To estimate ri, we exploit the fact that r i is a product of two second moments, E(x,x;_i) and E(£,£,_j). A natural estimator ofE(x,x;_) is~ L;~i+ 1 x,x;_j· It is easy to show that
Wj
is consistent1y estimated by
*L~=j+l £,f:,_j where£, is
the residual from some consistent estimator (the proof is almost the same as in the case with no serial correlation, see Proposition 3.2). Thus, a natural estimate of ri
414
Chapter 6
under conditional homoskedasticity is (6.7.4)
Given these estimated autocovariances, the kernel-based estimation of S is exactly the same as in the case without conditional homoskedasticity described in the previous section. Data Matrix Representation of Estimated Long-Run Variance
For the purpose of relating the GMM estimator with serial correlation but with conditional homoskedasticity to conventional estimators, the data matrix representation ofS is useful. By defining ann x n matrix Q suitably, we can writeS as -S = I X'Si!X, n
(6.7.5)
where X is then x K data matrix whose t-th row is like the autocovariance matrix of {e, }: ' wo '
-
WJ
w,'
' "'2
' wo
' WJ
x;. The matrix !'2 looks much ' lVn-I '
Wn-2
(6.7.6)
Sil
(n xn)
WJ
' wo
' w,
' w,
' w,
' wo
'
' Wn-2 '
lVn-I
If we know a priori that Cov(e,, e,_ j) = 0 (so that elements of this !'2 as I
"
ri
= 0) for j > q, specify the
" ' f:,f:,_i nL
ifO < j < q,
0
if j > q.
(6.7.7)
t=J+I
-
Then the Sin (6.6.4) can be written as (6.7.5). If we do not know that there is such a q, then use the kernel-based estimator of S. For the Bartlett kernel, let
' = Wj
1. J I " ' - - L [I - -q(n) n •~i+l
'
f:tf:t-.
0
ifO < j < q(n)- I, (6.7.8)
J
if j > q(n)- I.
Then the consistent estimator (6.6.3) with (6.6.4) equals (6.7.5).
It is now easy to show the following extension of the 2SLS estimator to encompass serial correlation:
• The efficient GMM estimator under conditional homoskedasticity can be written as (6.7.9) where Z and yare the data matrices of the regressors and the dependent variable. This generalizes (3.8.3') on page 230. • The consistent estimator of Avar(~(S- 1 )) reduces to
(I _
[I )-11 ]--:-:::Avar(d(S-1)) = n Z'X ;; X'QX n X'Z = n · [Z'X(X'QX)- 1X'Zr'.
1
(6.7.10)
So the square roots of the diagonal elements of [Z'X(X'QX)- 1X'z]- 1 are the standard errors. This generalizes (3.8.5') on page 230. Relation to GLS
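The following sketch (our own illustration; the function name and arguments are assumptions, not from the text) computes the efficient GMM estimator in (6.7.9) and the standard errors from (6.7.10), given the regressor and instrument data matrices and the n × n matrix Ω̂ of (6.7.6).

```python
import numpy as np

def gmm_serial_homoskedastic(y, Z, X, Omega_hat):
    """Efficient GMM under conditional homoskedasticity with serial correlation.

    Implements delta_hat(S^{-1}) = [Z'X (X'Omega_hat X)^{-1} X'Z]^{-1}
                                   Z'X (X'Omega_hat X)^{-1} X'y
    and the estimated asymptotic variance n*[Z'X (X'Omega_hat X)^{-1} X'Z]^{-1}.
    """
    XOX = X.T @ Omega_hat @ X                               # X' Omega_hat X
    A = Z.T @ X @ np.linalg.solve(XOX, X.T @ Z)             # Z'X (X'OX)^{-1} X'Z
    b = Z.T @ X @ np.linalg.solve(XOX, X.T @ y)             # Z'X (X'OX)^{-1} X'y
    delta_hat = np.linalg.solve(A, b)
    avar_hat = len(y) * np.linalg.inv(A)                    # as in (6.7.10)
    se = np.sqrt(np.diag(avar_hat) / len(y))                # standard errors
    return delta_hat, se
```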
The procedure for incorporating serial correlation into the GMM estimation we have described is different from the GLS procedure. To see the difference most clearly, consider the case where x, = z1 . If there are L regressors, X'Z and X'QX in (6.7.9) are all L x L matrices, so the efficient GMM estimator that exploits the orthogonality conditions E(z, · £ 1 ) = 0 is the OLS estimator ~OLS = (Z'Z)- 1Z'y. The consistent estimate of its asymptotic variance is obtained by setting X = Z in (6.7.10): ~
Avar(~oLs) = n · (Z'Z)- 1 (Z'QZ)(Z'z)- 1 .
(6.7.11)
This may seem strange because we know from Section 1.6 that the GLS estimator is more efficient than OLS when the error is serially correlated. This is a finite sample result, but, given that we have a consistent estimator fi of Q, we could take the GLS formula and estimate d as "
---1
dm_s = (Z'Q
1
---1
Z)- Z'Q y.
Is not this estimator asymptotically more efficient than OLS in that its asymptotic variance is smaller? The problem with GLS is that consistency is not guaranteed,
if the regressors are merely predetennined, as shown below. As we emphasized in Chapter 2, the regressors are not strictly exogenous in most time series models. It follows that GLS should not be used to correct for serial correlation in the error tennfor models lacking strict exogeneity. In particular, for the x, = z, case, the correct procedure to adjust for serial correlation is to leave the point estimate unchanged while incorporating serial correlation in the estimate of the asymptotic vanance.
That the GLS estimator is generally inconsistent can be seen as follows. To focus on the problem at hand, assume that the true autocovariance matrix is proportional to some known n x n matrix V:
As shown in Section 1.6, the GLS estimator of the coefficient li in the regression model y = Zli +e can be calculated as the OLS estimator on the transformed model Cy = CZ/i + Ce, where Cis the "square root" of v- 1 such that C'C = v- 1. But look at the t-th observation of the transformed model, which can be written as (6.7.12) where = CrtYI
+ · · · + CrnYn,
Z, = CrtZI f, = CrJCI
+ · · · + CrnZn, + ·' · + CtnC:n,
ji,
(c,1 .... , c,) is the t-th row of C.
For the GLS estimator to be consistent, it is necessary that E(z, · i',) left-hand side of this condition can be written as
0. The
E(z,. l,) =(a)+ (b), n
(a)=
L c:, E(z, · e,), s=I
(b)=
L c, · c, E(z, · ev). s¥v
By the orthogonality condition E(z, · e,) - 0, (a) is zero. For (b), because the regressors are merely predetermined and not assumed to be strictly exogenous, E(z, · ev) is not guaranteed to be zero for s # v. So (b) is not zero, unless the c's take particular values to offset the nonzero terms. There is one important special case where GLS is consistent, and that is when the error is a finite-order autoregressive process. To take the simplest case, suppose
that(£,} is AR(I): £, =
Var(£,, £z, ... , En)= a 2 · V, V =
--"7
I -
q,z
I
I
It is easy to verify that the square root, C, ofv-' (satisfying C'C = V) is given by
Ji - q,z C=
-
0 I
0
-
0 0 I
0
0
0
0 0 0
-
0 0 0
(6.7.13)
So the t-th transformed equation is Yr- Yr-1
= (z,-
(6.7.14)
The GLS estimate of 8, which is the OLS estimate in the regression of y, -
1.
Verify (6.7.5) for n
= 3, q = I.
2. (Sargan's statistic) Derive the generalization of Sargan's statistic (3.8.1 O') on
page 230 to the case of serially correlated errors. 3. Suppose that z, is a strict subset of x,. We have seen in Section 3.8 that the
2SLS estimator reduces to the OLS estimator. Is this true here as well? That is, does (6.7.9) reduce to (Z'Z)- 1Z'y? (Answer: No.] ~
4. (Data matrix representation of S without conditional homoskedasticity) Define fi suitably so that (6.7.5) equals the Bar1lett kernel-based estimator of S without conditional homoskedasticity, namely, (6.6.5) with (6.6.7). Hint: Write fi as (6.7.6). If the lag length q is known, w1 = £1£,_ 1 tor 0 < j < q and 0 for j > q. How would you define w1 tor the Bartlett kernel-based estimator?
5. Consider the simple regression model: y, = f3z, +e, where {e,) is AR(l ), e, = ¢e,_ 1 +~~with known¢. Assume E(z,e,) = 0, E(z,~,) = 0, E(zt-t~,) = 0, and conditional homoskedasticity. Also, to simplify, assume E(z,) = 0. Show that
-
2 a, I + 2 Li:t
Avar(f3oLs) = 2 a,
I - ¢2
_
a,2
I
Avar(fJGLs) = al -:-:(1:-+---:¢"" ,-:-
where a}= Var(z,), a; = Var(~ 1 ), Pj = Corr(z,, Zr-j). Find a configuration of(¢, {yj}) such that Avar(fi'm.sl > Avar(fi'm.s). Explain why this is consistent with the fact that f3oLs is an efficient GMM estimator.
6.8
Application: Forward Exchange Rates as Optimal Predictors
In foreign exchange markets, the percentage difference between the forward exchange rate and the spot exchange rate is called the forward premium. The relation between the forward premium and the expected rate of currency appreciation/depreciation over the life of the forward contract has been the subject of a large number of empirical studies. The simplest hypothesis about the relationship is that the two are the same. In this section, we test this hypothesis using the same weekly data set used by Bekaert and Hodrick (1993). The data cover the period 1975-1989 and include the following variables:

S_t = spot exchange rate, stated in units of the foreign currency per dollar (e.g., 125 yen per dollar) on the Friday of week t,
F_t = 30-day forward exchange rate on Friday of week t, namely, the price in foreign currency units of the dollar deliverable 30 days from Friday of week t,
S30, = the spot rate on the delivery date on a 30-day forward contract made on Friday of week t. The important feature of the data is that the maturity of the contract covers several sampling intervals; the delivery date is after the Friday of week t + 4 but before the Friday of week t + 5.
419
Serial Correlation
The Market Efficiency Hypothesis To indicate the assumptions underlying the hypothesis, consider three strategies for investing dollar funds for 30 days at date t. The first is to invest in a domestic money market instrument (e.g., U.S. Treasury bills) whose maturity is 30 days. Let i, be the rate of return over the 30-day period. The second investment strategy is to convert dollars into foreign currency units, buy a 30-day money market instrument denominated in the foreign currency whose interest rate is denoted i,', and convert back to dollars at maturity. The rate of return from this investment strategy is
S, ·(I+ i,')/530,. This strategy involves foreign exchange risk because at time t of investing you do not know S30, which is the spot rate 30 days hence. The third investment strategy differs from the second in that it hedges against exchange risk by buying dollars forward at price F, at the same time the dollar fund is converted into foreign currency units. This hedged foreign investment provides a sure return equal to
S, ·(I+ i,')/F,. Because the first and the third strategies involve no risk, arbitrage requires that the rates of return be the same: I
+ i, =
S, ·(I
+ i,')/ F,.
Taking logs of both sides and using the approximation log (I the covered interest parity equation:
.
'+ lr·•
lr=Sr-Jr
or
'
Jr -
Sr
= t·•r
-
+ x)
.
lr.
"'"x, we obtain
(6.8.1)
where lower case letters for Sand Fare natural logs. That is, the forward premium, J, - s, equals the interest rate differential. This relationship holds almost exactly in real data. In fact, people routinely calculate the forward premium as the interest rate differential. The rate of return from the second investment strategy is uncertain, but if investors are risk-neutral, its expected return is equal to the sure rate of return. Furthermore, if their expectations are rational, the expected return is the conditional expectation based on the information currently available, so E[S, · (I
+ i,')/ S30, I 1,] =
S, · (I
+ i,')/ F,,
(6.8.2)
420
Chapter 6
where E(- I /,) is the expectations operator and I, is the information set as of date t. Since S,, F,, and;: are known at date t, this equality reduces to E(F,j S30,
I /,) = I.
Again, by log approximation, F,j S30, "" I E(s30,
I /,) = j, or
E(e,
+ j, -
s30,. Therefore,
I /,) = 0 where e,
=s30, -
j,.
(6.8.3)
This is the market efficiency hypothesis. As is clear from the derivation, the hypothesis assumes risk-neutrality and rational expectations. The hypothesis is sometimes described as the forward rate being the optimal forecast of future spot rates, because the conditional expectation is the optimal forecast in that it minimizes the mean square error (see Proposition 2.7). Testing Whether the Unconditional Mean Is Zero
Taking the unconditional expectation of both sides of (6.8.3), we obtain the unconditional relationship: E(s30,) = E(j,)
or
E(e,) = 0.
(6.8.4)
That is, the forecast must be correct on average. Figure 6.1 is a plot of the forecast errore, for the yen/dollar exchange rate. It looks like a stationary series, and we proceed under the assumption that it is. 15 Because the forecast error is observable, we can test the hypothesis that its unconditional mean is zero. This is in contrast to the Fama exercise, where the forecast error (about future inflation) is observable only up to a constant. To test the hypothesis of market efficiency, we need to derive the asymptotic distribution of the sample mean, £, under the hypothesis. The task is somewhat involved because (e,} is serially correlated. To see why the forecast error has serial correlation, note that !,, while including (e,_s, e,-6, ... }, does not include (e,, ... , fr-4} because e,, which depends on s30,, cannot be observed until after t + 4. Therefore, E(e,
I fr-5, fr-6, ... )
= 0.
(6.8.5)
It follows that Cov(e,, e,_i) = 0 for j > 5 but not necessarily for j < 4. In 15 Using covered interest rate parity, t = (s30 - s ) - Ut'- it). If you believe that both the 30-day rate of 1 1 1 change s30r - sr and the interest rate differential i( - ir are stationary, you have to believe that the forecast error is stationary.
421
Serial Correlation
150,---------------------------------------,
- 100 <1>
Ol
c
""'"' 0
<1>
-"' Ol
50
c
<1>
0
<1>
"-
"C <1>
-~
-50
"'
:J
c
~ ~100
-I
0
C'l-150
"'
-200~--------------------~----------------~
1/75
1/78
1/81
1/84
1/87
12/89
Figure 6.1: Forecast Error, Yen/Dollar
the Fama exercise, the ex-post real interest rate (which is the inflation forecast error pins a constant) was serially uncorrelated because the maturity for the interest rate and the sampling interval were the same. Here, the forecast error has serial correlation because the maturity of the forward contract (30 days) is longer than the sampling interval (a week). The estimated correlogram is displayed along with the two-standard error band in Figure 6.2. 16 The standard error is calculated under the assumption that the forecast error is i.i.d. So it equals I 1.fii, as in Table 2.1. The pattern of autocorrelation is broadly consistent with the hypothesis: it stays well above two standard errors for the first four lags and generally lies within the band thereafter. Under the null of market efficiency, since serial correlation vanishes after a finite lag, it is easy to see that {t: 1} satisfies the essential part of Gordin's condition restricting serial correlation. By the Law of Iterated Expectations, (6.8.5) implies E(t:, I t:,_j, t:t-j-1 •... ) = 0 for j > 5, which immediately implies part (b) of Gordin's condition. Since the revision of expectations, r,j. is zero for j > 5, part (c) is satisfied, provided that r,j (0 < j < 4) have finite second moments. It then
16 Although
theory implies that the population mean is zero, I subtracted the sample mean in the caJculation of sample autocorrelations.
422
ChapterS
1
-"'
c: 0.8
" "'""'c:0 .-0 "'
"' " "' E
0.6 0.4
~ ~
0
0.
0.2 -------- ------
---------------~--------?
0
"'"' -0.2
- - - - - - - -- - - - - - - - - - - - - - - - - - - - - - - - - - -
0
5
10
- -- - -"' - - - - -- - - - - -- - - -
20
15
25
30
35
- - - - - - -
40
lag ~---
two standard errors
Figure 6.2: Correlogram of s30- f, Yen/Dollar
follows from Proposition 6.10 that
.foe
(6.8.6) 4
Jyo+2tn
(Yo+ 2L>! )In j=l
j=l
Furthermore, the denominator is consistently estimated by replacing y1 by its sample counterpart y1. So the ratio (6.8. 7) 4
(9o+2L9t)!n j=l
is also asymptotically standard normal. The denominator of this expression will be called the standard error of the sample mean. Table 6.1 reports results for testing the implication of the hypothesis that the unconditional mean of the forecast error is zero, for three foreign exchange rates against the dollar. The three currencies are the Deutsche mark, the British pound, and the Japanese yen. Columns I and 2 of the table report the mean and the standard deviation of the actual rate of change of the spot rate, s30, - s,, and the forward premium, J, - s,. (Because the spot rate is per dollar, a positive rate of change means that the dollar strengthened.) The monthly rates of change and the
423
Serial Correlation
Table 6.1: Means of Rates of Change, Forward Premiums, and Differences betw~en the Two: 1975-1989
Exchange rate
Means and Standard Deviations s30- s f-s Difference (expected rate (unexpected rate (actual rate of change) of change) of change)
¥/$
-4.98 (41.6)
-3.74 (3.65)
-1.25 (42.4) standard error = 3. 56
DM/$
-1.78 (40.6)
-3.91 (2.17)
2.13 (41.1) standard error = 3.26
£/$
3.59 (39.2)
2.16 (3.49)
1.43 (39.9) standard error= 3.29
NOTE: s30- s is the rate of change over 30 days and f - s is the forward premium, expressed as annualized percentage changes. Standard deviations are in parentheses. The standard errors are the values of the denominator of (6.8.7). The data are weekly, from 1975 to 1989. The sample size i• 778.
forward premium have been multiplied by I ,200 to express the rates in annualized percentages. The market efficiency hypothesis is that the forward premium is the expected rate of change of the spot rate. It is evident that the actual rate of change is much more volatile than the forward premium. The third column reports the sample mean of £ 1 , which equals the difference between the first two columns for the sample mean. The numbers in parentheses below the sample mean are the standard errors calculated as described above. In no case is there significant evidence against the null hypothesis of zero mean. Regression Tests The unconditional test, however, does not exploit the implication of market efficiency that the forecast error is orthogonal to conditioning information I,. Probably the most important variable in the information set I 1 for future spot rates is the forward rate J,. We can test whether the forecast error is uncorrelated with the forward rate by regressing £ 1 on the constant and f,. This is equivalent to running
424
Chapter 6
300
250
200
150
1/78
1/81
1/84
1/87
12/89
Figure 6.3: Yen/Dollar Spot Rate, Jan. 1975-Dec. 1989
the following regression (6.8.8) and testing whether {30 = 0 and {3 1 = I. Because the error term is the forecast error orthogonal to anything known at date t including ft. the regressors are guaranteed to be orthogonal to the error term. So it might appear that we can estimate (6.8.8) by OLS and adjust for serial correlation in the error term as indicated in Section 6.6 for correct inference. However, as the graph in Figure 6.3 suggests, the spot exchange rate looks like a process with increasing variance. As we will verify in Chapter 9, we cannot reject the hypothesis that the process has a unit root. The forward rate has the same property. (Its plot is not shown here because it is indistinguishable from the plot of the spot rate.) Thus, Assumption 3.2, which requires the regressor (/1 ) and the dependent variable (s301 ) to be stationary, is not satisfied. The problem actually runs deeper. We have observed in Figure 6.1 that the difference between s30, and ft looked like a stationary series. In the language to be introduced in Chapter 10, s301 and ft are "cointegrated" with a cointegrating vector of (I, -I), in which case the OLS estimate of {3 1 in (6.8.8) converges quickly to I. This can be verified in Figure 6.4 where s30, is plotted against f,. If a regression line is fitted, its slope
425
Serial Correlation
5.8
"c: .c: ,.,""' "' "'1ii" 0
"0 0
-
5.7 5.6 5.5 5.4 5.3
~
0
c.
"'0
"'
"'"
~
0
"'"'
5.2
... .... .. • , 4,:..• ... .• .... .,... -:· . ~
.
......
4.9
5
5.1 5
I
~
.... '4.;: o-~·· ;..p:r·;' ..
4.9 4.8 4.7 4.6
4.7
4.8
5.1 f(t).
5.2
5.3
5.4
5.5
5.6
5.7
5.8
log forward rate
Figure 6.4: Plot of s30 against f, Yen/Dollar
is 0.99. The problem from the point of view of testing market efficiency is that the OLS estimate converges to I regardless of whether or not the hypothesis is true. Therefore, the test has no power. The problem can be dealt with by estimating a different regression, s30,- s, = f3o
+ f3,(j,- s,) + e,.
(6.8.9)
Judging from the plot of s30, - s, and J, - s, (not shown), it is reasonable to assume that they are stationary, which is consistent with Assumption 3.2 (ergodic stationarity). That the equation satisfies other GMM assumptions under market efficiency can be verified as follows. For {30 = 0 and f3, = I, the error term £, equals the forecast error s30,- f, and so the orthogonality conditions E(e,) = 0 and E[ (J, - s,) · £,] = 0 (Assumption 3.3 with z, = x,) are satisfied. The identification condition is that the second moment of (I, f, - s,) be nonsingular, which is true if and only if Var(J, - s,) > 0. This is evidently satisfied; if the population variance were zero, we would not observe any variations in / 1 - s1 • This leaves us with Assumption 3.5', which requires that there is not too much serial correlation in
g,
= x, . £,
e,
= [ (j, - s,) £,
]
.
426
Chapter 6
Table 6.2: Regression Tests of Market Efficiency: 1975-1989 s301
-
s1 =
fJo + f3t(ft- s1 ) + £ 1
Regression Coefficients Currency
Constant
Wald Statistic for
Forward
Ho: fJo
prermum
= 0,
fJ,
= I
¥/$
-12.8 (4.01)
-2.10 (0.738)
0.034
18.6 (p = 0.009%)
DM/$
-13.6 (5.72)
-3.01 (1.37)
0.026
8.7 (p = 1.312%)
7.96 (3.54)
-2.02 (0.85)
0.033
12.9 (p=0.156%)
£1$
NOTE: Standard errors are in parentheses. They are heteroskedasticity-robust and allow for serial correlation up to four lags calculated by the formula (6.8.11 ). The
sample size is 778. Our results differ slightly from those in Bekaert and Hodrick ( 1993) because they used the Bartlett kernel-based estimator with q (n) = 4 to estimateS.
As already noted, since 11 includes {<1 -5, £ 1 _ 6 , .•• }, the forecast error satisfies (6.8.5). Review Question 3 will ask you to show that {g 1 } inherits the same property, namely, E(g,
I g,_5, g,-6· ... ) = 0.
(6.8.10)
So, as in the unconditional test, Gordin's condition is satisfied, provided that the relevant second moments are finite. Since E(g,ft,_) = 0 for j > 5, the long-run covariance matrix S can be estimated as in (6.6.4) with q = 4. (In the empirical exercise, you will be asked to estimate S by the Bartlett kernel and also by VARHAC.) To summarize, the efficient GMM estimator of p = (fJo, {J 1)' in (6.8.9) is OLS and the consistent estimator of its asymptotic variance is given by
-
-
-..
-
-1-..
-I
Avar(floLs) = Sxx S Sxx,
(6.8.11)
where S is given in (6.6.4) with q = 4. Table 6.2 reports the regression test results for the three currencies. The corresponding plot for the yen!$ exchange rate is in Figure 6.5. Notice that the estimated slope coefficient is not only significantly different from I but also negative, for all three currencies. The null hypothesis
427
Serial Correlation
150
"
100
"'
50
Ol t:
.<::
u
-
"' u "' "" " 2
.
.'.. ... . . •••·, .
~
~
0
.,
-50
I
. . . ............ .. ...:.....-.... .. .·.
-150 -200 -20
-15
-1 0
-
-.
-5
f(t)-s(t},
..
.. . .
. •. .•.
0-100
"'"
.'
0
5
10
forward premium
regression line
Figure 6.5: Plot of s30 - s against
f -
s, Yen/Dollar
implied by market efficiency that f!o = 0 and {! 1 = I can be rejected decisively. This is reflected in the too many observations in the north-west quadrant of Figure 6.5; too often expected dollar depreciation (negative f, - s,) is associated with actual dollar appreciation (positive s30, - s, ). QUESTIONS FOR REVIEW
1. (The standard error in the unconditional test) Lets, = s30, - f, and consider a regression of s, on the constant. To account for serial correlation, use (6.6.4) with q = 4 for S. Verify that the /-value for the hypothesis that the intercept is zero is given by (6.8.7). 2. (Lack of identification) For simplicity suppose F, is the two-week forward exchange rate so that the s30, in (6.8.9) can be replaced by s,+2· Suppose that the spot exchange rate is a random walk and I, = {s,, s,_. 1 , •.. }. What should J, be under the hypothesis? Which of the required assumptions are violated? [Answer: The identification condition.] Suppose, instead, that s, is AR(I) satisfying s, = c + ¢s,_ 1 + ry, where {ry,} is i.i.d. and 1>1 < I. Verify that f, =(I +
428
Chapter 6
3. Prove (6.8.10). Hint: E(e, I I,) = 0. Verify that I, includes s, {g,_ 5 , ~- 6 , .•. }. Use the Law of Iterated Expectations.
.ft and
PROBLEM SET FOR CHAPTER 6 - - - - - - - - - - - - - ANALYTICAL EXERCISES
1. (Mean square limits and their uniqueness) In Chapter 2, we defined the mean square convergence of a sequence of random variables {Zn} to a random variable z. In this and other questions, we will deal with sequences of random variables with finite second moments. For those sequences, the definition of mean square convergence
IS
DEFINITION. Let {Zn} be a sequence of random variables with E(z~) < oo.
We say that lzn} converges in mean square to a random variable z with E(z 2 ) < 00, written Zn -+m.,. z, if E[(Zn - z) 2 ] -+ 0 as n -+ 00. (a) Show that, if Zn -+m.,. z and Zn -+m.,. z', then E[(z - z') 2] = 0. Hint: z- z' = (z- Zn) + (Zn - z'). Define llxll = JE(x 2 ). Then liz- z'll 2 =liz- Znll 2 < liz- Znll
2
+ 2E[(z- Zn)(Zn- z')] + llzn- z'll 2
+ 2JE[(z- ZnPl JE[(Zn- z') 2 ] +
llzn- z'll 2
2 )) (by Cauchy-Schwartz inequality that E(xy) < JE(x 2 ) J:::-E(:-y~
= (liz- Zn II
+ llzn- z'll) 2
So liz- z'll < liz- Zn II+ llzn- z'll. This is called the triangle Inequality. (b) Show that Prob(z = z') = I. Hint: Use Chebychev's inequality: Prob(lz- z'l >e) <
~ E [liz- z'll 2 ]. £
2. (Proof of Proposition 6.1) A well-known result about mean square convergence (see, e.g., Brockwell and Davis, 1991, Proposition 2.7.1) is (i) lzn} converges in mean square if and only if E[ (zm - Zn ) 2 ] -+ 0 as m, n -+ 00.
(ii) If Xn
-+m.s. X
and Zn -+m.s. Z, then
lim E(xn) = E(x)
n---+00
and
lim E(XnZn) = E(xz).
n---+00
429
Serial Correlation
Take this result for granted in answering. (a) Prove that (6.1.6) converges in mean square as n -+ oo under the hypothesis of Proposition 6.1. Hint: Let n
Y<.n =
!-'+
L•irJe,_j· J=O
What needs to be proved is that (y, ·"} converges in mean square as n -+ oo for each t. Given (i) above, it is sufficient to prove that (assuming m > n without loss of generality) as m, n -Too.
Use the fact that an absolutely summable sequence is square summable, i.e., 00
00
I:l•hl
< oo
"* L"'f < oo,
j=O
j=O
and the fact that a sequence of real numbers (an} converges (to a finite limit) if and only if"" is a Cauchy sequence:
an
-T
a
¢}
lam - an I -T 0 as m, n
-T
00.
Set "" = L:J~o l/rf. (b) Prove that E(y,) = 1-'· Hint: Let Y<.n = /-' shown in (a) that Y<.n -+m.,. y,.
+ L:J=o l/rje<-j as in (a).
It was
(c) (Proof of part (b) of Proposition 6.1) Show that E[(y,- {t)(y<-j- {L)] = lim E[(y,,n- {t)(y,-j,n- {L)]. n~oo
Have we shown that (y,} is covariance-stationary? [Answer: Yes.] (d) (Optional) Prove part (c) of Proposition 6.1, taking the following facts from calculus for granted. (i) If (aj} is absolutely summable, then (aj} is summable (i.e., -oo < :L'f'=oaj < oo) and
Chapter 6
430 00
00
L>j
<[:lajl·
j=O
j=O
{ajd (j, k
(ii) Consider a sequence with two subscripts, < oo for each k and lets, Suppose I:}::o {sd is summable. Then
lajkl
00
00
00
L:(:[>jk) j=O
< oo
and
L:}::o
00
lajkl· Suppose
00
L(Laj,) = L(Lajk) < oo. j=O
k=O
00
=
= 0, I, 2, ... ).
k=O
k=O
j=O
Hint: Derive 00
00
00
L IYjl
j=O
k=O
3. (Autocovariances of h(L)x,) We wish to derive the autocovariances of {y,) of Proposition 6.2. As in Proposition 6.2, assume throughout this question that {x,) is covariance-stationary. First consider a weighted average of finitely many terms:
(a) Showthatforthis {Yr."l' n
n
Yj = LLh,h,yf-k+, k=O l=O
where Y/ is the j -th order autocovariance of {x,). Verify that this reduces to (6.1.2) if {x,) is white noise. Hint: If you lind this difficult, first set n = I, and then increase n to establish the pattern. (b) Now consider a weighted average of possibly infinitely many terms: y, = hox 1 + h1xr-1
+ ··· .
Taking Proposition 6.2(a) for granted, show that 00
00
Yj = L L h,h,yf-k+,
if
{hj) is absolutely summable.
k=O f=O
Hint: Yr." --+m.,. y, as n --+ oo by Proposition 6.2. Proceed as in 2(c).
431
Serial Correlation
4. (Homogeneous linear difference equations) Consider a p-th order homogeneous linear difference equation: Yi-
=0
(j
= p, P +I, ... ),
(I)
where (
Let Ak (k = I, 2, ... , K) be the distinct roots of this equation (they can be complex numbers) and rk be the multiplicity of ).k· So r 1 + r, + · · · + rK = p. The polynomial equation can be written as K
fl (I -
Ai:
1
zY'
(3)
= 0.
k=l
For example, consider the second degree polynomial equation I-
z + ~z 2
(4)
= 0.
Its root is 2 with the multiplicity of 2 because I-
z + ~z 2
=(I
-!d.
So K = I and r 1 = 2 and ). 1 = 2 in this example. As you verify below, the solution to the p-th order homogeneous equation can be written as K r*-1
Yi =
L :~::>kn. un;j
(5)
k=I n=O
The coefficients ckn (there are p of those because r 1 + · · · + rK = p) can be determined from the initial conditions which specify the values for (y0 , y 1 , ••• , Yp-J). In the above example of (4),
(6) In the important special case where all the roots are distinct, K = p and
rk
= I
432
Chapter 6
for all k, and (5) becomes p
" Yj = L.. CkO
(7)
)._k- j .
k=l
(a) For the special case of p = 2 and distinct roots, (7) is
(8) Verify that this solves the homogeneous difference equation (I) with p = 2 for any given (c 10 , c20 ). To detennine (c 10 , c 20 ), write down (8) for j = 0 and j = I, and solve for c's as a function of (y0 , y 1 , A~o "A,). (b) To learn (remember) how to deal with multiple roots, consider example
(4) above. Verify that (6) solves the second-order homogeneous difference
equation
Yj- Yj-1
+ ~Yj-2 =
0.
(c) (Optional) Prove the following lemma. LEMMA. Let 0 < ~ < I and Jet n be a non-negative integer. Then there
exist two real number.s, A and b, such that~ < b < I and (j)"~j < Abj for j =0,1,2, .... Hint: Pick any b such that ~ < b < I.
Because bj eventually gets larger than (j)"~j as j increases, there exists some j, call it J, such that (j)"~j < bjlorallj > J.
(d) Suppose the stability condition holds for the p-th order homogeneous linear difference equation. Prove that there exist two numbers, A and b, such thatO < b < I and
Hint: K
IYjl
=
r,.,-1
I: I: k=l n=O
Ckn.
un;j
433
Serial Correlation K
<
rt-l
L L ickn I · (j)" · l).k"IIi k=I n=O K
rt-l
< c I : I:
where c = max{lcknll. Since l).k"II < I by the stability condition, we can apply the lemma with~ = l).k"II and claim
for some Ak > Oand I).;II < bk
IYil < c
max{Ad and b
rt-I
L L A'bi = cpA'bi k=l n=O
(Recall that p = ri + .. · + rK.) Then define A to be cpA'. You may find this sort of argument using inequalities hard to follow. If so, set p, K, - I, and (ri, ... , rK) to some specific numbers (e.g., p = 3, K = 2, 'I '2
= 2).
5. (Yule-Walker equations) In the text, the autocovariances of an AR(I) process were calculated from the MA(oo) representation. Here, we use the Yule-
Walker equations. (a) From (6.2.1') on page 376, derive
Yi- >Yi-I = E[<, · (y,_i- J,t)]
(j = 0, I, 2, ... ).
(b) (Easy) From the MA representation, show that (T2
E[<,. (y,_i- J.L)] =
{
0
(j = 0)
(j =I, 2, ... ).
(c) Derive the Yule-Walker equations: Yo - YI =
2
Yi- >Yi-I = 0
(9) (j = I, 2, ... ).
(10)
434
Chapter 6
(d) Calculate (yo, y 1 , ... ) from (9) and (10). Hint: First set j = I in (10) to obtain Y1 =>Yo· This and (9) can be solved for (yo, YJ). 6. (AR(l) without using filters) We wish to prove, without the help of Proposition 6.2, that (y,} following the AR(l) equation (6.2.1), with the stationarity condition lc/>1 < I, must have an MA(oo) representation if it is covariancestationary. (a) Let x, = y,- fJ. so that E(x,) = 0. By successive substitution, derive from (6.2.1') (on page 376) Xr = Xr ,n
+ 4Jn Xr-n
where
(b) Show that x,,, -->m.s. x,. Hint: What needs to be shown is that E[(x,,, x,) 2 ]--> 0 as n--> oo. (x,,,- x,) 2 = cf> 2"x?_,. Because (y,} is assumed to be covariance-stationary, E(x;_,) < oo. (c) (Trivial) Recalling that the infinite sum
is defined to be the mean square limit of the partial sum x, ·"' show that y, can be written as co
y, = fJ.
+ L,
7. (Companion form) We wish to generalize the argument just developed in question 6 to AR(p). Without loss of generality, we can assume that fJ. = 0 (or redefine y, to be the deviation from the mean of y, ). Define
~.
-
y,
£,
Yr-l
0 0
Yt-2
v, -
(px I)
(px I)
Yr-p+l
0
435
Serial Correlation
F
-
¢2
0
0
0 I
"'' 0 0
(pxp)
0
I
0
Consider the following first-order vector stochastic difference equation: ~I= F~t-I
+ v,.
(II)
In this system of p equations, called the companion form, the first equation is the same as the AR(p) equation (6.2.6') on page 379 and the rest are just identities about Yt-j (j = I, 2, ... , p- 1). (a) By successive substitution, show that ~~=~'·"+(F)"~,_,,
~~,
=
v,
+ Fv,_I + F 2v,_, + ·· · + (F)"-Iv,_,+I·
(12)
The eigenvalues or characteristic roots of a p x p matrix F are the roots of the determinantal equation (13)
It can be shown that the left-hand side of this is V- ¢IV-1 - · · · -rPr-I A- rPrSo the determinantal equation becomes the p-th order polynomial equation: (14)
(b) Verify this for p = 2. In the rest of this question, assume that all the roots of the p-th order polynomial equation (which, as just seen, are the eigenvalues of F) are distinct. Let (AI, A2 , ••• , Ap) be those distinct eigenvalues. We have from matrix algebra that there exists a nonsingular p x p matrix T such that AI 0 F=TAT-I,
0 A2
0 0
0 0
A -
(15)
(pxp)
0 0
0 0
Ar-I 0
0 Ar
436
Chapter 6
(This fact does not depend on the special form of the matrix F; all that is needed for (15) is that the A's are distinct eigenvalues of F.) From (15) it follows that
(F)"
=
T(A)"T- 1.
(c) (Very easy) Prove this for n = 3. (d) Show that~'·" ~m.,. ~' as n ~ oo if the stationarity condition about the AR(p) equation is satisfied and if (y, l is covariance-stationary. Hint: The stationarity condition requires that all the roots of (13) be less than 1 in absolute value. By (12) what needs to be shown is that (F)"~,_, ~ m.,. 0. Recall that z, ~ m.,. " if the mean square convergence occurs component-wise. (e) Show that y, has the MA(oo) representation
How is Vrj related to the characteristic roots? Hint: From (d) it immediately follows that
~' = v,
+ Fv,_ 1 + (F) 2v,_2 + · · ..
The top equation of this is the MA(oo) representation for y,. (If all the eigenvalues are not distinct, we need what is called the "Jordan
decomposition" ofF in place of (15).) 8. (AR(l) that started at I = 0) In the text, the MA(oo) representation (6.2.2) of an AR(I) process following y, = c +
defined for I = -I, -2, ... as well as for I = 0, I, 2, .... Instead, consider an AR(l) process that started at 1 = 0 with the initial value of y0 . Suppose that 1<1>1 < I and that y 0 is uncorrelated with the subsequent white noise process (EJ • <2,
· · · ).
(a) Write the variance of y, as a function of a 2 (= Var(r,)), Var(y0 ), and
y, =(1 + > + >2 + ... + <1>'-I)c 2 1 +<, + <,-2+··· +<1>'- <1
+>'yo.
Let J.L and (Yj} be the mean and autocovariances of the covariancestationary AR(l) process, so J.L = cj(l- >)and (yj} are as in (6.2.5).
437
Serial Correlation
Show that lim E(y,) =J1,
lim Var(y,) =Yo.
t-'>- 00
lim Cov(y,,y,_,) = Yi·
and
t-'>- 00
I-'>- 00
(b) Show that (y,} is covariance-stationary if E(yo) = 11 and Var(yo) = Yo· 9. (Proof of Proposition 6.8(a)) We wish to prove Proposition 6.8(a). By Chebychev's inequality, it suffices to show that limHoo Var(ji) = 0. From (6.5.2), n-1
.
2 Var(ji) =Yo+n n~
"(I- :!.)y n
1
j==!
(since I-:!.= 0 for j = n) n
Yo
+-n2 L~( I - -nj) IY·I1 j==!
n
Yo 2 <-+n n
L IYI· 1
j=!
(a) (Optional) Prove the following result: n
lim a1 = 0
=}
1 __.,_ oo
lim
n-"'- oo
~n L
a1 = 0.
j=!
(a1} converges to 0. So (i) la1 I < M for all j, and (ii) for any given 8 > 0 there exists a positive integer N such that la1I < 8 j2 for all j > N. So Hint:
]I I<;; L j
;; Lai n
j=!
n
Ia, I
j=!
IN =-
L
n.
J=!
NM
= -n
1
n
lail +n.
L
J=N+l
+
(n- N)
n
8
-
2
IN Ja1 J<"M nL j=!
NM < --
n
1
n
+-n L " 2:_ j=N+l
8
+ -. 2
By taking n large enough, N M j n can be made less than 8 j2. (b) Taking the result in (a) as given, prove that Var(ji) --> 0 as n -+
00 .
438
Chapter 6
10. (Proof of Proposition 6.8(b)) (a) (optional) Prove the following result: I
oo
n
'\' ai < oo => lim - '\' jai = 0. Ln-oon Lj=I
j=I
Hint: n
L jai =
a1
+ 2a2 + 3a3 + · ·· + na.
j=l
= (a 1 + a2
+ · ·· +a.) + (a2 + a3 + · ·· +a.) + · · ·
+(an-I +an)+ a. n
n
n
n
= LLak < L j=I k=j
N
= L j=I
n
Lak k=j
Lak k=j
j=l
n
+
L j=N+I
n
Lak. k=j
Since {ai] is summable, (i) I L~~i ak I < M lor all j, n, and (ii) there exists a positive integer N such that I L~~i akl < s/2 for any givens > 0 for all
j, n > N (this follows because II:}~ 1 ai} is Cauchy).
(b) Taking this result as given, prove Proposition 6.8(b). EMPIRICAL EXERCISES
1. Read at least pp. 115-119 (except the last two paragraphs of p. 119) of Bekaert and Hodrick (1993) before answering. Data files DM.ASC (for the Deutsche Mark), POUND.ASC (British Pound), and YEN.ASC (Japanese Yen) contain weekly data on the following items: Column 1: the date of the observation (e.g., "19850104" is January 4, 1985) Column 2: the ask price of the dollar in units of the foreign currency in the spot market on Friday of the current week (S,) Column 3: the ask price of the dollar in units of the foreign currency in the 30-day forward market on Friday of the current week ( F,) Column 4: the bid price of the dollar in units of the foreign currency in the spot market on the delivery date on a current forward contract (S30, ).
Serial Correlation
439
The sample period is the first week of 1975 through the last week of 1989. The sample size is 778. As in the text, define s, = log(S,), f, = log(F, ), s30, = log(S30,). If / 1 is the information available on the Friday of week 1, it includes (s,, s,_ 1 , ... , f,, f 1_ 1 , ••• , s30,_ 5 , s30,_6 , •.. }. Note that s30, is not observed until after the Friday of week 1 + 4. Defines, = s30, - f,. Pick your favorite currency to answer the following questions. (a) (Library/internet work) For the foreign currency of your choice, identify the week when the absolute value of the forward premium is largest. For that week, find some measure of the domestic one-month interest rate (e.g., the one-month CD rate) for the United States and the currency's home country, to verify that the interest rate differential is as large as is indicated in the forward premium. (b) (Correlogram of (e,)) Draw the sample correlogram of e, with 40 lags. Does the autocorrelation appear to vanish after 4 lags? (It is up to you to decide whether to subtract the sample mean in the calculation of sample correlations. Theory says the population mean is zero, which you might want to impose in the calculation. In the calculation of the correlogram for the yen/$ exchange rate shown in Figure 6.2, the sample mean was subtracted.) (c) (Is the log spot rate a random walk?) Draw the sample correlogram of s,+ 1 - s, with 40 lags. For those 40 autocorrelations, use the Box-Ljung statistic to test for serial correlation. Can you reject the hypothesis that (s,} is a random walk with drift? (d) (Unconditional test) Carry out the unconditional test. Can you replicate the results of Table 6.1 for the currency? {e) (Optional, regression test with truncated kernel) Carry out the regression test. Can you replicate the results of Table 6.2 for the currency?
RATS Tip: Use LINREG with the ROBUSTERRORS option. To include autocorrelations up to the fourth, set LAGS= 4 in LINREG. The DAMP= 0 option instructs L INREG to use the truncated kernel. TSP Tip: TSP's GMM procedure has the KERNEL option, but the choice is limited to Bartlett and the kernel called Parzen. You would have to use TSP's matrix commands to do the truncated kernel estimation. (f) (Bartlett kernel) Use the Bartlett kernel-based estimator of S to do the
regression test. Newey and West (1994) provide a data-dependent auto-
440
Chapter 6
matic bandwidth selection procedure. Take for granted that the autocovariance lag length detennined by this procedure is 12 for yen!$ (so autocovariances up to the twelfth are included in the calculation of S), 8 for OM/$, and 16 for Pound/$.
-
RATS Tip: The ROBUSTERRORS, DAMP~ 1 option instructs LINREG to use Bartlett. For yen!$, for example, set LAGS = 12. TSPTip: Set het, kernel =bartlett in the GMM procedure. Set NMA ~ 12 for yen!$. The standard error for the f - s coefficient for yen!$ should be 0.6815. (g) (Optional, VARHAC) Use the VARHAC estimator for Sin the regression test. Set Pmax to the integer part of n 113 In Step I of the VARHAC, you would estimate a bivariate VAR, and for each of the two VAR equations, you would pick the lag length p by B!C on the fixed sample oft = Pmax + I, ... , T. The information criterion could be (6.6.11) or, alternatively, Jog(SSR~k) /(n- Pmaxl)
+ pK ·log(n- Pmax)/(n- Pmaxl·
(16)
Just to be clear about how you picked p in your answer, use this latter objective function in selecting p by BIC. For the case of yen!$, the lag length selected should be 4 for the first VAR equation and 6 for the second equation. For yen!$, the standard error for the f - s coefficient should be 0.8027. 2. (Continuation of the Fama exercise of Chapter 2) Now we know how to use the three-month T-bill rate to test the efficiency of the U.S. Treasury bills market using monthly data. Let :n:1 = inflation rate (in percent, annual rate) over the three-month
period ending in month
t -
I,
R, = three-month T-bill rate (in percent, annual rate) at the beginning of month t.
Consider the Fama regression for testing market efficiency:
(a) (Trivial) Why is the dependent variable :n:,+ 3 rather than, say, :n:,?
Serial Correlation
441
(b) Show that, under the efficient market hypothesis, {! 1 I and the autocovariance of {s1 } vanishes after the second lag (i.e., Yj = 0 for j > 2). (c) Let r,+3
=
R, - 1ft+3 be the ex-post real rate on the three-month T-bill. Use PA/3 in MISHKIN.ASC as the measure of the three-month inflation rate to calculate the ex-post real rate. (The timing of the variables in MISHKIN.ASC is such that a January interest rate observation uses end-ofDecember bill rate data and a January observation for a three-month inflation rate is from the December to March CPI data. So PA/3 and TB3 (threemonth T-bill rate) for the same month can be paired in the regression.) Draw the sample correlogram for the sample period of 1/53-7171. What is the prediction of the efficient market hypothesis about the correlogram?
(d) Test market efficiency by estimating the Fama regression for the sample
period of 1/53-7171. (The sample size should be 223; do not throw away two-thirds of the observations, as Fama did in his Tables &-8.) Use (6.6.4) with q = 2 for S.
-
ANSWERS TO SELECTED Q U E S T I O N S - - - - - - - - - ANALYTICAL EXERCISES
2d. Since {•h} is absolutely summable, t/lj --> 0 as j --> 00. So for any j, there exists an A > 0 such that lt/lj+ki < A for all j, k. So lt/lj+k · ifJ1i < Ait/lkl· Since {t/Jkl (and hence {At/ld) is absolutely summable, so is {t/lj+k · hl (k = 0, I, 2, ... ) for any given j. Thus by (i), 00
IYjl
=
a
2
L t/lj+kifJk
k=O
00
00
k=O
k=O
L lt/lj+kt/lkl =a' L lt/lj+klit/Jkl < oo.
Now set ajk in (ii) to lt/lj+kl ·iifJki· Then 00
00
00
I: 1aj11 =I: lt/Jkllt/lj+kl < iifJkl I: lt/ljl < oo. j=O
j=O
Let 00
M=
L lt/ljl j=O
00
and
sk
= lifJkl L lt/lj+llj=O
442
Chapter 6
Then {sd is summable because Iski < l1h I· M and {1/rkl is absolutely summable. Therefore, by (ii), 00
00
j=O
k=O
L(L 11/rj+'l · 11/r•l) < oo. This times ?e.
1/rj
<1
2 is L.:J,.o IYjl· So {yj) is absolutely summable.
is the (1, I) element ofT(A)jT- 1
Sa. I->' E(y,) = I-> c+¢'E(y0 ) =(I-
<1>' ·
(since f.l = cf(l- >))
[E(yo) - J.L],
Var(y,) =(I + q,2 + q,• + ... + q,2'-2)"2 + ¢ 2' Var(y0 ) (since Cov(£,, Yo) = 0) I - ¢2' I - ¢2 "2 + q,2' Var(yo)
2
= (I - > ')yo +
= Yo+
<1>
2
<1> '
Var(yo)
(since Yo=
2
2
j(l - > )),
2 ' · [Var(yo) - YoL
Cov(y,, y,_j) = (>j + q,H2 + ... + >2,-j-2)"2 + >2,-j Var(yo) =
q,j. (I+ q,2 +
=
q,1. . I
_
... + q,2,-2j-2)"2 +
-.2U- j)
~
q,2,-j Var(yo)
" 2 + q, 2,_1. Var(y0 )
I - ¢2 = >j · (I - > 2u- j))Yo + ¢ 2H Var(yo) = Yo
.
= Yj + > - 1 • [Var(yo) -Yo]. EMPIRICAL EXERCISES
1c. The Ljung-Box statistic with 40 lags is 84.0 with a p-value of 0.0058%. So the hypothesis that the spot rate is a random walk with drift can be rejected. See Figure 6.6 for the correlograrn for yen/$.
1f. For the yen/$ exchange rate, the standard errors are 3. 72 for the constant and 0.68 for the f- s coefficient. The Wald statistic for the hypothesis that {30 = 0 and fJ1 = I is 21.85 (p = 0.002%).
443
Serial Correlation
-" c
u
1 0.6
=" 0.6 0
u
c 0
~ 0.4
" ~ ~
0
0.2
u
"c. E
0
"'"'
-0.2 10
5
0
15
25
20
35
30
40
lag two standard errors
Figure 6.6: Correlogram of sr+l - s,, Yen/Dollar
c
.~
u
1 0.6
=" 0.6 .Q -"' 0.4 0
u
c
-
" ~ ~
0
u
"Ec. "'"'
0.2
---------
-------------------------~----~----~--~~
...--
0 -0.2
..L-,---,--,---,--,--....,--~-.--~--,------,~--,-----,c:-'
0
1
2
3
4
5
6
7
6
9
10
11
12
lag ---- two standard errors
Figure 6. 7: Correlogram of Three-Month Ex-Post Real Rate 2c. The correlogram for the three-month ex-post real rate is shown above (Figure 6. 7). Theory predicts that serial correlation vanishes after the second lag. The correlogram is mostly consistent with this prediction.
444
Chapter 6
References Anderson, T. W., 1971, The Statistical Analysis of Time Series, New York: Wiley. Andrews, D., 1991, "Heteroskedasticity and Autocorrelation Consistent Covariance Matrix Estimation," Econometrica, 59,817-858. Bekaert, G., and R. Hodrick, 1993, "On Biases in the Measurement of Foreign Exchange Risk Premiums," Journn.l of International Money and Finance, 12, 115-138. Brockwell. P., and R. Davis, 1991, Time Series: Theory and Metlwds (2d ed.), Berlin: SpringerVerlag. Den Haan, W., and A. Levin, l996a, "Inferences from Parametric and Non-Parametric Covariance Matrix Estimation Procedures," Technical Working Paper Series 195, NBER. - - - · 1996b, "A Practitioner's Guide to Robust Covariance Matrix Estimation," Technical Working Paper Series 197, NBER. Fuller, W., 1996, Introduction to Statistical lime Series (2d ed.), New York: Wiley. Gordin, M. 1., 1969. "The Centra1 Limit Theorem for Stationary Processes," Soviet Math. Dokl., 10,
I 174-1176. Gourieroux, C., and A. Monfort, 1997, Time Series and Dynamic Models, Cambridge: Cambridge University Press. Hamilton, J., 1994, lime Series Analysis, Princeton: Princeton University Press. Hannan, E. J., 1970, Multiple Time Series, New York: Wiley. Hannan, E. J., and M. Deistler, 1988, The Statistical Theory of Linear Systems, New York: Wiley. Hansen, L., 1982, "Large Sample Properties of Genera1ized Method of Moments Estimators," Econo~ metrica, 50, 1029-1054. Li.itkepohl, H., 1993, lmroduction to Multiple lime Series Analysis (2d eci.), New York: SpringerVerlag. Newey, W., and K. West. 1987, "Hypothesis Testing with Efficient Method of Moments Estimation," International Economic Review, 28, 777-787. _ _ _ , 1994, "Automatic Lag Selection in Covariance Matrix Estimation," Review of Economic Studies, 61.631-653. Ng, S., and P. Perron, 2000, "A Note on Model Selection in Time Series Ana1ysis," mimeo., Boston: Boston College. White, H., 1984, Asymptotic Theory for Econometn.cians, New York: Academic.
CHAPTER 7
Extremum Estimators
ABSTRACT
In the previous chapters, the generalized method of moments (GMM) has been our choice for estimating the various models we have considered. The method of maximum likelihood (ML) has been discussed, but only in passing. In this chapter, we expand our repertoire of estimation techniques by examining a class of estimators called "extremum estimators," which includes (linear and nonlinear) least squares, (linear and nonlinear) GMM, and ML as special cases. As has been emphasized by Amemiya (1985) and others, this unified approach is useful for bringing out the common structure that underlies those apparently diverse estimation principles. Section 7.1 introduces extremum estimators. The two sections that follow develop the asymptotic properties of extremum estimators. Section 7.4 defines the trio of test statistics -the Wald, Langrange multipler, and likelihood-ratio statistics- for extremum estimators in general, and shows how they can be specialized to individual cases. Section 7.5 is a brief discussion of the computational aspects of extremum estimators. The discussion of this chapter may appear daunting, as it combines a heavy use of asymptotic theory with calculus. The essence, however, is really straightforward. If you can read graduate textbooks on micro and macroeconomics without too much difficulty. you should be able to understand most of the discussion on the second (if not the first) reading. If you are not interested in the asymptotics of extremum estimators per se but plan to study Chapter 8, there is no need to read this chapter in its entirety. Just read Section 7.1 and then try to understand the statements in Propositions 7.5, 7.6, 7.8, 7.9, and 7.11. A Note on Notation: Unlike in the previous chapters, we use the parameter vector with subscript 0 for its true value. So Do is the true parameter value, while (J represents a hypothetical parameter value. This notation is standard in the "highbrow" literature on nonlinear estimation. Also, we will use "t" rather than "i" for the observation index.
446
7.1
Chapter 7
Extremum Estimators
An estimator iJ is called an extremum estimator if there is a scalar objective function Q, (0) such that
iJ maximizes
Q,(O) subject to 0 E
e c ffi!P,
(7.1.1)
where e. called the parameter space, is the set of possible parameter values. In this book, we restrict our attention to the case where e is a subset of the finite dimensional Euclidean space JRP. The objective function Q,(O) depends not only on 0 but also on the sample or the data, (w 1, w2, ... , w,), where w, is the t-th observation and n is the sample size. In our notation the dependence of the objective function on the sample of size n is signalled by the subscript n. The maximum likelihood (ML) and the linear and nonlinear generalized method of moments (GMM) estimators are particular extremum estimators. This section presents these and other extremum estimators and indicates how they are related to each other. '
"Measurability" ot 0 To be sure, the maximization problem (7.1.1) may not necessarily have a solution. But recall from calculus the following fact: Let h : EtP ~ IR be continuous and A c JRP be a compact (closed and bounded) set. Then h has a maximum on the set A. That is, there exists an x• E A such that h(x•) > h(x) for all x in A. Therefore, if Q,(O) is continuous in 0 for any data (w,, W2, ... , w,) and e is compact, then there exists a 0 that solves the maximization problem in (7 .1.1) for any given data (w 1, ••. , w,). In the event of multiple solutions, we would choose one from them. So iJ, thus uniquely determined for any data (w 1, ••• , w,), is a function of the data. Strictly speaking, however, being a function of the vector of random variables (w 1, ... , w,) is not enough to make iJ a well-defined random variable; iJ needs to be a "measurable" function of (w 1 , ... , w,). (If a function is continuous, it is measurable.) The following lemma shows that iJ is measurable if Q,(O)is Lemma 7.1 (Existence of extremum estimators): Suppose that (i) the parameter Space e is a COmpact Subset of ITtP, (ii) Q, (0) is COntinUOUS in 0 for any data (w 1, ... , w,), and (iii) Q,(O) is a measurable function of the data for all 0 in e. Then there exists a measurable function iJ of the data that solves (7.1.1).
447
Extremum Estimators
See, e.g., Gourieroux and Monfort (1995, Property 24.1) for a proof. In all the examples of this chapter, Q.(ll) is measurable. In most applications, we do not know the upper or lower bound for the true parameter vector. Even if we do, those bounds are not included in the parameter space, so the parameter space is not closed (see the CES example below for an example). So the compactness assumption for e is something we wish to avoid. In some of the asymptotic results of this chapter, we will replace the compactness assumption by some other conditions that are satisfied in many applications. Two Classes ot Extremum Estimators In the next two sections, we wil1 develop asymptotic results for extremum estimators and then specialize those results to the following two classes of extremum estimators.
I. M-Estimators. An extremum estimator is an M-estimator if the objective function is a sample average: ]
Q.(ll) = -
n
n
L
(7.1.2)
m(w,; II).
t=I
where m is a real-valued function of (w,, II). Two examples of an M-estimator we study are the maximum likelihood (ML) and the nonlinear least squares (NLS). 2. GMM. An extremum estimator is a GMM estimator if the objective function can be written as Q.(ll) = -
I
g,(II)'Wg,(ll) with g,(ll)
2
(Kxl)
]
n
=- L n
g(w,;ll),
(7.1.3)
t=l
where Wis a K x K symmetric and positive definite matrix that defines the distance of g. (II) from zero. It can depend on the data. Maximizing this objective function is equivalent to minimi1ing the distance g,(II)'Wg,(ll), which is how we defined GMM in Chapter 3. As will become clear, the deflation of the distance by 2 simplifies the expression for the derivative of the objective function. We have introduced GMM estimation for linear models where g(w,; II) is linear in II. Below we will show that GMM estimation can be extended easily to nonlinear models. There are extremum estimators that do not fall into either class. A prominent example is classical minimum distance estimators. Their objective function can be
448
Chapter 7
written as -g, (6)'Wg, (6) but g, (6) is not necessarily a sample mean. The consistency theorem of the next section is general enough to cover this class as well. In the rest of this section, we present examples of M-estimators and GMM estimators. Each example is a combination of the model (a set of data generating processes [DGPs)) and the estimation method. Maximum Likelihood (ML)
A prime example of an M-estimator is the maximum likelihood for the case where {w,} is i.i.d. The model here is a set of i.i.d. sequences {w,} where the density of w, is a member of the family of densities indexed by a finite-dimensional vector 6: j(w,; 6), 6 E 8. (Since {w,] is identically distributed, the functional form of j(-; ·)does not depend on t.) The functional form off(·;·) is known. The model is parametric because the parameter vector 6 is finite dimensional. At the true parameter value 6 0 , the density of the true DGP (the DGP that actually generated the data) is j(w,; 6 0) 1 We say that the model is correctly specified if 6 0 E 8. Since {w,} is independently distributed, the joint density of the data (w 1, W2, .•. , Wn) is n
j(w 1 , w2, ... , w.; 6o)
=IT f(w,; 6o).
(7.1.4)
f=J
With the distribution of the data thus completely specified, the natural estimation method is maximum likelihood, which proceeds as follows. With the true parameter vector 6o replaced by its hypothetical value 6, this density, viewed as a function of 6, is called the likelihood function. The maximum likelihood (ML) estimator of 60 is the 6 that maximizes the likelihood function. Since the log transformation is a monotone transformation, maximizing the likelihood function is equivalent to maximizing the log likelihood function: 2
log
j(w~o w 2, ... , w.; 6) =log
[D
f(w,; 6)] =
t.
log j(w,; 6).
(7.1.5)
The ML estimator of 6 0 , therefore, is an M-estimator with 1Had we adhered to the notational co~vention of the previous sections, the lnle parameter value would be denoted by 9 and a hypothetical value by 9. In this chapter, we use 9o for the lnle value and 9 for a hypothetical value of the parameter vector. 2 For 6 such that j(wr; 9) = 0. the log cannot be taken. This does not present a problem because such 6 never maximi7es the (level) likelihood function f(WJ, ... , Wn; 6). For such 6 we can assign an arbitrarily small number to log f(w 1 ; 6) so that the maximizer of the \eve\lilc:ellhood is also the maximization of the log likelthood. See White (1994, p. 15) for a more formal treatment of thls issue.
449
Extremum Estimators
J
m(w,; 0) =Jog f(w,; 0), that is, Qn(O) = -
n
n
L
Jog J(w,; 0). (7. 1.6)
t=l
Before illustrating ML by an example, it is useful to make three remarks. • (Serially correlated observations) The average log likelihood would not take a form as simple as this if (w,} were serially correlated. Examples of ML for serially correlated observations will be studied in the next chapter. • (Alternative estimators) Maximum likelihood is not the only way to estimate the parameter vector. For example, let JL(O) be the expectation of w, implied by the density function f(w,; 0). It is a known function ofO. By construction, the following zero-mean condition holds: E[w, - JL(Oo)] = 0.
(7.1.7)
The parameter vector 00 could be estimated by GMM with g(w,; 0) = w,- JL(O) in the GMM objective function (7. I .3). • (Efficiency of ML) As will be shown in the next two sections, ML is consistent and asymptotically normal under suitable conditions. It was widely believed that ML is efficient (i.e., achieves minimum asymptotic variance) in the class of all consistent and asymptotically normal estimators. This belief was shown to be wrong by a counterexample (described, for example, in Amemiya, 1985, Example 4.2.4). Nevertheless, ML is efficient in quite general classes of asymptotically normal estimators. One such general class is GMM estimators. In Section 7.3, we will note that the asymptotic variance of any GMM estimator of 00 is no smaller than that of the ML estimator. Example 7.1 (Estimating the mean of a normal distribution): Let the data (w 1, ••• , wn) be a scalar i.i.d. sequence with the distribution of w, given by N(/L. a 2 ). So 0 = (/L. a 2 )' and 2 f(w 1 ;/L,a)=
I [ (w,-11-)2] expv'2rra 2 2a 2
•
The average Jog likelihood of the data (w,, ... , Wn) is
L
w,-:f] .
I " I I I " [( - Llogj(w 1;/L,a 2 )=--log(2rr)--Jog(a 2 ) - n 2 2 n 2a t=l
t=I
(7. 1.8)
450
Chapter 7
The ML estimator of (JLo, aJ) is an extremum estimator with Qn (II) taken to be this average log likelihood. Suppose that the variance is assumed to be positive but that there is no a priori restriction on f.Lo, so the parameter space E> is rn: x rn:++, where ~+ is the set of positive real numbers. It is easy to check that the ML estimator of JLo is the sample mean of w1 • The GMM estimator of JLo based on the zero-mean condition E(w,- JLo) = 0, too, is the sample mean of w 1 • Conditional Maximum Likelihood In most applications, the vector w, is partitioned into two groups, y, and x,, and the researcher's interest is to examine how x1 influences the conditional distribution of y, given x,. We have been calling the variable y, the dependent variable and x, regressors. None of the results of this chapter depends on y, being a scalar; y, can be a vector and still all the results will remain valid with the scalar "y,'' replaced by a vector "y,." We nevertheless use the scalar notation, simply because in all the examples we consider in this chapter, there is only one dependent variable. Let f(y, I x,; 11 0 ) be the conditional density of y, given x,, and let j(x,; tal be the marginal density of x,. Then f(y,, x,; llo. tal= f(y, I x,; llo)f(x,; tal
(7.1.9)
is the joint density of w, = (y,, x;)'. Suppose, for now, that 11 0 and to are not functionally related (see the next paragraph for an example of a functional relationship). The average log likelihood of the data (w 1 , •.. , wn) is I
n
-n L
l=l
]
log f(w,; II, tl =n
n
L l=l
)
log f(y, I x,; II)+n
n
L
log f(x,; tl.
t=l
(7.1.10) The first term on the right-hand side is the average log conditional likelihood. The conditional ML estimator of 11 0 maximizes this first term, thus ignoring the second term. It is an M-estimator with m(w,; II)= log f(y,
I x,; II), that is,
I Qn(ll) =-
n
L n
logf(y,
I x,;ll).
1=1
(7.1.11) The second term on the right-hand side of (7.1.10) is the average log marginal likelihood. It does not depend on II, so the conditional ML estimator of 11 0 is
451
Extremum Estimators
numerically the same as the (joint or full) ML estimator that maximizes the average joint likelihood (7.1.1 0). Now suppose that Oo and 1(r 0 are functionally related. For example, Oo and 1(r 0 might be partitioned as 00 = (a~. {3~)'. 1(r 0 = ({3~. y 0)'. Then the full (i.e., unconditional) ML and conditional ML estimators are no longer numerically equal. It is intuitively clear that the conditional ML misses information that could have been obtained from the marginal likelihood. Indeed it can be shown that the conditional ML estimator of 00 is less efficient than the full ML estimator obtained from maximizing the joint log likelihood (7.1.10) (see, e.g., Gourieroux and Monfort, 1995, Section 7.5.3). In most applications, this loss of efficiency is unavoidable because we do not know, or do not wish to specify, the parametric form, f(x,; 1{r), of the marginal distribution. Here are two examples of conditional ML (other examples will be in the next chapter). Example 7.2 (Linear regression model with normal errors): In Chapter 2, we considered the linear regression model with conditional homoskedasticity. Assume further that the error term is normally distributed, so [y,, x,} is i.i.d., y, = x:Po + s,, s, I x, ~ N(O, rrJ).
(7.1.12)
The conditional likelihood of y, given x,, f(y, I x,; 0), is the density function of N(x;p, rr 2). So, with 0 = (/3'. rr 2)' and w, = (y,. x;)', them function is m(w,; 0) =log f(y, I x,; {3, rr 2)
x;p)2]) = --log(2rr)- log(rr)-- (y,- x;f3) 2 2 I [ (y,= Iog ( exp ~2rrrr2 2rr 2 I
I
2
(7.1.13)
(I
The parameter space e is JFtK X 114+· where K is the dimension of {3 and 114+ is the set of positive real numbers reflecting the a priori restriction that rrJ > 0. As was verified in Chapter I, the ML estimator of {3 0 is the OLS estimator and the ML estimator of rrJ is the sum of squared residuals divided by the sample size n. Example 7.3 (Probit): In the probit model, a scalar dependent variable y, is a binary variable, y, E [0, 1}. For example, y, = I if the wife chooses to enter the labor force andy, = 0 otherwise, and x, might include the husband's
452
Chapter 7
income. The conditional probability of y, given a vector of regressors x, is given by
I x,; llo) = 4>(X,IIo), { f(y, = 0 I x,; llo) =I- 4>(X,IIo), f(y, =I
(7.1.14)
where 4>0 is the cumulative density function of the standard normal distribution. This can be written compactly as f(y,
I x,; llo)
= 4>(x;llol'' [I - 4>(X,IIo)] 1-y'.
If there is no a priori restriction on 11 0 , the parameter space e is JRP. The ML estimator of llo is an M-estimator with them function given by m(w,;ll) = logf(y1 I x,;ll) = y,log4>(x;ll)+(l- y,)log[l -4>(x;ll)]. (7.1.15) lnvariance of ML The joint ML and conditional ML estimators have a number of desirable properties. One of them, which holds true in finite samples, is invariance. To describe invariance for an extremum estimator in general, consider reparameterizing the model by a mapping or function l = r(ll) defined on e. Let A be the range of the mappmg: A
=r(e) ={l ll = r(ll) for some II e e].
(7.1.16)
By definition, for every l in A, there exists at least one II such that l = r(ll). The mapping 1: : e -+ A is called a reparameterization if it is one-to-one; namely, for every l in A, there is only one II in e such that l = 1: (II). This unique II is therefore a function of l. This mapping from A to e is called the inverse of 1: and denoted r- 1• We say that an extremum estimator 8 is invariant to reparameterization r A
if the extremum estimator for the reparameterized model is r(ll). Let Q. (l) be the objective function associated with the reparameterized model. An extremum estimator is invariant if and only if Q.(l) = Q.(r -1 (l))
for all
leA.
(7.1.17)
Toseethis,Ieti. = r(iJ). Foranyl e A, wehaveQ.(l) = Q.(r- 1 (l)) < Q.(O) because 0 maximizes Q.(ll) one and "f- 1 (l) is in e. But Q.(O) = Q.(c 1 (l)) = Q.(i). So Q.(l) < Q.(i) for alll eA.
453
Extremum Estimators
ML (be it conditional or joint) is invariant to reparameterization because the likelihood after reparameterization is j(.; A) = f(.; T- 1(A)), which produces the objective functions that satisfy (7.1.17). Can there be an estimator that is not invariant? The answer is yes; Review Question 5 will ask you to verify that GMM is not invariant. Nonlinear Least Squares (NLS)
In NLS, as in conditional ML, the observation vector w, is partitioned into two groups, y, and x,: w, = (y,, x;)'. The model in NLS is a set of stochastic processes {y1 , x,} such that the conditional mean E(y, I x,), which is a function of x,, is a member of the parametric family of functions
+ <,,
E(<,
I x,) = 0, llo
E
e.
(7 1.18)
The most widely used estimation method to estimate llo here is least squares, which is to minimize the sum of squared residuals. Therefore, an NLS estimator, which is least squares applied to the model just described, is an M-estimator with I m(w,; II)= -[y,-
n
L [y
1 -
t=!
(7.1.19) Here maximization of Qn(ll) is the same as minimization of the sum of squared residuals. Example 7.4 (CES production function with additive errors): illustration, consider the CES production function -PO ' Q, = Ao · [ ooK 1
+ (I -
') L,-P0]-1/PO oo
+ <1 ,
As an
(7.1.20)
where Q, is output in period t, K, is the capital stock, L, is labor input, and <1 is an additive shock to the production function with E(< 1 1 K,, L,) = 0. This is a conditional mean model (7.1.18) withy,= Q,, X1 = (K 1 , L,)', 11 0 = 11 (A 0 , 80, p0)',
454
Chapter 7
Linear and Nonlinear GMM In Chapter 3, we applied GMM to linear equations. The linear equation to be estimated was (7.1.21) With x, as the vector of instruments, the orthogonality (zero-mean) conditions were (7.1.22)
E[x, · (y,- z;llo)l = 0.
The correctly specified model here is a set of ergodic stationary processes w, = (y,, z;, x;)' such that these zero-mean conditions hold for 11 0 in e. The linear GMM estimator of llo is a GMM estimator with the g function in the GMM objective function (7.1.3) given by g(w,; II)
= x, · (y, -
z;ll) = x, · y, - x,z;ll.
(7.1.23)
GMM can be readily applied to nonlinear equations. Suppose that the estima-
tion equation is nonlinear and also implicit in y,: a(y1 , z,; llo) = £,.
(7.1.24)
Just set g(w,; II) = x,·a(y,, z,; II). This estimator is called a generalized nonlinear instrumental variables estimator.' It is still a special case of GMM because the g function here can be written as a product of the vector of instruments and the error term. The properties of GMM estimators to be developed in this chapter do not rely on this special structure. As an example of nonlinear GMM, consider the nonlinear Euler equation of Hansen and Singleton (1982). Example 7.5 (Nonlinear consumption Euler equation): The Euler equation for the household optimization problem, standard in macroeconomics, is
E[Rt+i fJou'(cr+ll j I]= 1 '( ) u c,
I
'
(7.1.25)
where Rr+i is the gross ex-post rate of return (I plus the rate of return) from the asset in question, c, is consumption, {3 0 is the discounting factor, u'(c) is the marginal utility, and I, is the information available at date t. Assume for 311 is more general than the usual nonlinear instrumental variables estimator because conditional homoskedasticity is not assumed.
455
Extremum Estimators
the utility function that u(c) = c 1-"0 /(l- a 0 ). Then u'(c) = c-•o and the Euler equation becomes Cr+l
E [ a ( - - , Rr+!; O'o, fJo
)
c,
(7.1.26) If x, is a vector of variables whose values are known at date t, then x, and it follows from (7.1.26) that
E /1
(7.1.27) Taking the unconditional expectation of both sides and using the Law of Total Expectations (that E[E(x II,)]= E(x)), we obtain the orthogonality (zeromean) conditions E[g(w,; tlo)] = 0 where
Ct+l
g(w,; tlo) = x, ·a ( --;;-· Rr+!; ao, fJo
)
.
w1th
w, =
[c,;::~l
tlo =
[~:l
(7.1.28)
QUESTIONS FOR REVIEW
1. (Estimating probit model by NLS) For the probit model, verify that E(y,
I
x,) =
456
Chapter 7
for any (measurable) function h.] Would the NLS estimator of 11 0 be consistent? [Answer: No.] 5. (Lack of invariance in GMM) Consider the linear model y, = e0 z, + E:, where e0 and z, are scalars. The orthogonality conditions are E[x, · (y,- e0 z,)] = 0. Write down the GMM objective function Qn(e). Assume that E> = JR.,_+ (i.e., e0 > 0) and consider the reparameterization A = I ;e. With this repararneterization, the linear equation can be rewritten as Zr = A0 y 1 - A0 e 1 and the orthogonality conditions can be rewritten as E[x, · (z,- J. 0 y,)] = 0. Let Qn(J.) be the GMM objective functions associated with this set of orthogonality conditions. Is it the case that Qn(e) = Qn(l(e) or Qn(A) = Qn(I(J.) when Wis the same in the two objective functions? [Answer: No.]
7.2
Consistency
ln this section, we first present a set of sufficient conditions under which an extre-
mum estimator is consistent. Those conditions will then be specialized to NLS, ML,andGMM. Two Consistency Theorems lor Extremum Estimators
The objective function QnO is a random function because for each (J its value Qn(ll), being dependent on the data (w 1, ... , wn), is a random variable. The basic idea for the consistency of an extremum estimator is that if Qn(ll) converges in probability to Qo(ll), and the true parameter llo solves the "lintit problem" of maximizing the limit function Q0 (11), then the limit of the maximum 0 should be 11 0 . Sufficient conditions for the maximum of the limit to be the limit of the maximum are that the convergence in probability be "uniform" (in the sense made precise in a moment) and that the parameter space E> be compact. As already mentioned, the parameter space is not compact in most applications. We therefore provide an alternative set of consistency conditions, which does not require E> to be compact. To state the first consistency theorem for extremum estimators, we need to make precise what we mean by convergence being "uniform." If you are already familiar from calculus with the notion of uniform convergence of a sequence of functions, you will recognize that the definition about to be given is a natural extension to a sequence of random functions, QnO (n = I, 2, ... ). Pointwise convergence in probability of QnO to some nonrandom function QoO simply means plimn~oo Qn(ll) = Qo(O) for all II, namely, that the sequence of random
457
Extremum Estimators
variables IQ" (IJ) - Qo(IJ) I (n = I, 2, ... ) converges in probability to 0 for each IJ. Uniform convergence in probability is stronger. The convergence has to occur "uniformly" over the parameter space e in the following sense: (7.2.1)
sup IQn (IJ) - Qo(IJ)I --+ 0 as n --+ oo. Oee
P
As in the case of convergence in probability, uniform convergence in probability can be extended readily to sequences of vector random functions or matrix random functions (by viewing a matrix as a vector whose elements have been rearranged), by requiring uniform convergence for each element: a sequence of vector random functions {hn (-)) converges uniformly in probability to a nonrandom function h 0 (-) if each element converges uniformly. This element-by-element convergence is equivalent to convergence in the norm:
(7.2.2)
sup llhn(IJ)- ho(IJ)II--+ 0 as n--+ oo, Oee
P
where 11.11 is the usual Euclidean norm 4 In the following statement of our first consistency theorem, the conditions ensuring that IJ is well defined are labeled (i)-(iii), while those ensuring consistency are (a) and (b). A
Proposition 7.1 (Consistency with compact parameter space): Suppose that (i) e is a compact subset ofllf.P, (ii) Qn (IJ) is continuous in IJ for any data (w 1 , ... , Wn). and (iii) Qn(IJ) is a measurable function of the data for aJIIJ in e (so by Lemma 7.1 the extremum estimator jj defined in (7.1.1) is a well-defined random variable). If there is a function Q0 (1J) such that
(a) (identification) Qo(IJ) is uniquely maximized on 8 at IJo
E
8,
(b) (uniform convergence) QnO converges uniformly in probability to Qo(-), A
then IJ
-+p
IJo.
We will not prove this result; see, e.g., Amemiya (1985, Theorem 4.1.1 )for a proof. For estimates such as ML, NLS, and GMM, we will state later the identification condition (a) in more primitive terms so that the condition can be interpreted more intuitively.
4The Euclidean norm of a p-dimensional vector x =
Jxf+x5+ ··+x~..._
(XJ,
x2 .... , Xp) 1, denoted IJx(l, is defined as
458
Chapter 7
The theorem that does away with the compactness of e is Proposition 7.2 (Consistency without compactness): Suppose that (i) the true parameter vector (1 0 is an element of the interior of a convex parameter space e (c JRP), (ii) Q.(O) is concave 5 over the parameter space for any data (w 1, ...• w.) • and (iii) Q.({l) is a measurable function of the data for all (I in e. Let (I be the extremum estimator defined by (7.1.1) (wait for a moment to Jearn about its existence). If there is a function Q0 ((1) such that
.
(a) (identification) Q 0 ({1) is uniquely maximized on
e at (1 0 E e.
(b) (pointwise convergence) Q.((l) converges in probability to Q 0 ({1) for all
o E e.
then, as n -> oo,
. exists with probability approaclting 1 and (I. ->p 0
(I
0.
See Newey and McFadden (1994, pp. 2133-2134) for a proof. In most applications, e is an open convex set (as in Examples 7.1-7.4), so condition (i) is satisfied. The convergence of Q.((l) to Q 0 ((1) is required to be only pointwise, which is easier to verify in applications. The price of these niceties is the requirement that the objective function Q.(O) be concave for any data (under which pointwise convergence implies uniform convergence), but in many applications this condition is satisfied. Below we will verify that concavity is satisfied for ML for the linear regression model (after a reparameterization), probit ML, and linear GMM. Consistency of M-Estimators For an M-estimator, the objective function is given by (7.1.2). If {w1 } is ergodic stationary, the Ergodic Theorem implies pointwise convergence for Q.(O): Q.(O) converges in probability pointwise for each (I E e to Q 0 (0) given by Qo(O) = E[m(w,: (I)]
(7.2.3)
(provided, of course, that E[m(w,: 0)] exists and is finite). To apply the first consistency result (Proposition 7.1) toM-estimators, we need to show that the convergence of~ I:;= I m(w,: .) to E[m(w,: .)] is uniform. A standard method to prove uniform convergence in probability, which exploits the fact that the expression is 5 A scalar function h(x) is said to be concave if ah(XJ) + ( I - a)h(xz) ::=: h(ax1 + ( I - a)x2) for all Xt, "2· 0 ::=: a ::=: I. So a linear function is concave. Some authors use the tenn "globally concave" to distinguish it from the concept of "local concavity" that lhe Hessian evaluated at the parameter vector in question is negative semidefinite. If a function is concave, then it is continuous. So condition (ii) in this proposition is stronger lhan condition (ii) in the previous proposition.
459
Extremum Estimators
a sample mean, is the uniform law of large numbers. Its version for ergodic stationary processes (stated as Lemma 2.4 in Newey and McFadden, 1994, and as Theorem A.2.2 in White, 1994) is Lemma 7.2 (Uniform law of large numbers): Let {w,} be an ergodic stationary process. Suppose that (i) the set 8 is compact, (ii) m(w,; {I) is continuous in (I for all w,, and (iii) m(w,; {I) is measurable in w, for all (I in e. Suppose, in addition, the dominance condition: there exists a function d(w,) (sometimes called the "dominating function") such that lm(w,; 0)1 < d(w,) for all (I E e and E[d(w,)] < oo. Then ~ L:7~ 1 m(w,; .) converges uniformly in probability to E[m(w,; .)] over e. Moreover, E[m(w,; {!)] is a continuous function of (1. Since by construction lm(w,; 0)1 < sup9E8 lm(w,; 0)1 for all (I E e. we can use sup9E8 lm(w,; 0)1 for d(w,), so the above dominance condition can be stated simply as E[ sup 9E8 lm(w,; 0)11 < oo. The Uniform Law of Large Numbers can be extended immediately to vector random functions as follows: Let {w,J be an ergodic stationary process. Suppose that (i) the parameter space e is compact, (ii) h(w,; (I) is continuous in (I for all w,, and (iii) h(w,; {I) is measurable in w, for all (I in e. Suppose, in addition, the dominance condition: E[sup llh(w,; 0)111 < oo. 9e0
Then E[h(w,; (I)] is a continuous function of (I and ~ L:7~ 1 h(w,: .) converges uniformly in probability to E[h(w,; .)] over e. Just by combining this lemma with the first consistency theorem for extremum estimators, and noting that Qn ((I) is continuous if m(w,; (I) is continuous in (I and that Qn((l) is a measurable function of the data if m(w,; 0) is measurable in w,, we obtain Proposition 7.3 (Consistency ofM-estimators with compact parameter space): Let {w,} be ergodic stationary. Suppose that (i) the parameter space e is a compact subset of JR!P, (ii) m(w,; {I) is continuous in (I for all w,, and (iii) m(w,; {I) is measurable in w, for all (I in e. Let (I be theM-estimator defined by (7.1.1) and (7.1.2). Suppose, further, that A
460
Chapter 7
(a) (identification) E[m(w,; II)] is uniquely maximized on
e atll 0 E e,
(b) (dominance) E[sup 8 E8 1m(w,; 11)1] < oo. Then
0 -+p llo.
It is even more straightforward to specialize the second consistency theorem for extremum estimators toM-estimators. The objective function Qn(ll) is concave if m(w,; II) is concave in 11. 6 As noted above, the pointwise convergence of Qn(ll) to Q 0 (11) (= E[m(w,; II)]) is assured by the Ergodic Theorem if E[m(w,; II)] exists and is finite for alii!. Thus Proposition 7-4 (Consistency of M-estimators without compactness): Let [w,} be ergodic stationary. Suppose that (i) the true parameter vector 11 0 is an element of the interior of a convex parameter space e (C IR.P), (ii) m(w,; II) is concave over the parameter space for all w,, and (iii) m(w,; II) is measurable in w, for all II in 8. Lett! be theM-estimator defined by (7.1. I) and (7. I .2). Suppose, further, that
-
(a) (identification) E[m(w,; II)] is uniquely maximized on 8 atllo
E
8,
(b) E[ lm(w,; II) I] < oo (i.e., E[m(w,; II)] exists and is finite) for all II
E
Then, as n
-+
-
-
e.
oo,ll exists with probability approaching I and II -+p 11 0 •
The following example shows that probit conditional ML satisfies the concavity
condition. Example 7.6 (Concavity of the probit likelihood function): The m function for probit is given in (7.1.15) where <1>0 is the cumulative density function of the standard normal distribution. Since the normal distribution is symmetric,( -v) = I -
+ (I
- y,) log (-X, II).
(7 .2.4)
Concavity of m is implied if both log
(7.2.5)
+ g(x).
461
Extremum Estimators
I
where ¢(v) (=
Concavity after Reparameterlzatlon
Proposition 7.4 can be extended easily to estimators for objective functions that are concave after reparameterization. Suppose that all the conditions of the proposition except concavity are satisfied and that there is a continuous one-to-one mapping -r(6) : 0-+ A = -r(E>) such that
is concave inland A = -r(E>) is a convex set. Let
be the objective function after this reparameterization. Clearly, all the assumptions of Proposition 7.4 are satisfied for l, A, and m(w,; l): lo = -r(6 0 ) is an interior point of A, E[m(w,; l)] is uniquely maximized at l 0 , and E[ lm(w,; l)ll < oo for alll in A. Thus for sufficiently large n, there exists an M-estimator i such that plimn_ooi = l 0 . Since Qn(l) satisfies (7.1.17) for the one-to-one mapping T, il = T- 1(l) maximizes the original objective function Qn(6). The estimator il is consistent because
n-+00
n-+00
= -r- 1(plim i)
(since -r- 1 is continuous)
n-oo (7.2.6)
Example 7.7 (Concavity of linear regression likelihood function): In the conditional ML estimation of the linear regression model, m(w,; 6) is given in (7.1.13) with 6 = ({3', a 2 )' and w, = (y,, X,)'. This function is not necessarily concave, 7 but consider a reparameterization: y = l(a and & = f3 1a. It is a continuous one-to-one mapping between e = JRK x JR,_+ and A= IRK x JR,_+· The new parameter space A is convex. With l = (&', y)',
7Forexample, the second partial derivative of m with respect to a 2 may or may not be non~negative depending on the value of y 1 -Lx;fJ. Recall that a necessary condition for a twice differentiable function h(xl, ... , Xp) to be concave is that 2hj((hj ) 2 be non~negative for all j = I, 2, ... , p.
a
462
Chapter7
the m function after reparameterization becomes
_ I m(w,; A)= - log(2rr)
2
+ log(y)-
I , z(yy,- x,~) 2 .
(7.2.7)
Since log(y) is concave in y, it is concave in A. Since v = yy, - x;~ is linear in l. and -~v 2 is concave, -!(YYr - x;0) 2 is concave in l.. So above, which is the sum of two concave functions (plus a constant), is concave in A for any w,. Therefore, provided that all the conditions of Proposition 7.4 besides concavity are satisfied for 9, 0, and m(w,; 9), the conditional ML estimator iJ is consistent.
m
Identification in NLS and ML The identification condition (a) (that E[m(w,; fl)] be uniquely maximized at flo) in these consistency theorems forM-estimators can be stated in more primitive terms for NLS and ML. NLS
Consider firstNLS, wherem(w,; {I)= -(y,-rp(x,; 9)) 2 and E(y, I x,) = rp(x,; {1 0 ). We have shown in Section 2.9 that the conditional expectation minimizes the mean squared error, that is, for any (measurable) function h(x, ), mean squared error
= E[{y,- h(x,)f] = E[{y,- E(y, I x,)1 2 ] ;::: E[ {y, - E(y, I x,) 12].
+ E[{E(y, I x,)- h(x, )1 2 ] (7.2.8)
This is just reproducing (2.9.4) in the current notation. Furthermore, the conditional expectation is essentially the unique minimizer in the following sense. Being functions of x,, h (x,) and E(y, I x,) are random variables. For two random variables X and Y, let X 'I Y mean Prob(X 'I Y) > 0 (i.e., Prob(X = Y) < I) and let X = Y mean Prob(X 'I Y) = 0. (If X and Y differ only for events whose probability is zero, we do not need to care about the difference.) Therefore, h(x,) 'I E(y1 1 x,) means Prob(h(x,) 'I E(y, I x,)) > 0. If h(x,) 'I E(y, I x,) in this sense, then {E(y, 1 x,)- h(x,)l 2 > 0 with positive probability. Consequently, E[{E(y, I x,)- h(x,)fj in (7.2.8) is positive and the mean squared error is strictly greater than E[{y,- E(y, I x, lf].
463
Extremum Estimators
Just setting h(x,) =
if
{ E[IYr-
I x, ),
we conclude
'I
(7.2.9)
if
Therefore, the identification condition- that E[ m(x,; II) J= - E[ {y,-
conditional mean identification:
for all II
'I llo.
(7.2.10)
ML
For ML, the role played by (7.2.8) for NLS is played by the Kullback-Leibler information inequality. Here we examine identification for conditional ML (doing the same for unconditional or joint ML is more straightforward). The hypothetical conditional density f(y, I x,; II), being a function of (y,, x,), is a random variable for any II. The expected value of the random variable log f(y, I x,; II) can be written as 8 E[log f(y,
I x,; II)]
=
J
log f(y,
I x,; ll)f(y,, Xr; llo, t
0 )dy,dx,.
(7.2.11)
Note well that the expected value is with respect to the true joint density f (y,, x,; 110 , t 0 ). The Kullback-Leibler information inequality about density functions (adapted to conditional densities) states that E[Iog f(y, { E[log f(y,
I x,; II)] I x,; II)]
< E[log f(y, = E[log f(y,
I x,; lloll I x,; llo)]
if f(y, if f(y,
I x,; II) 'I I x,; II) =
f(y, f(y,
I x,; llo). I x,; llo) (7.2.12)
(provided, of course, that E[log f(y, I x,; II)] exists and is finite). (See Analytical Exercise I for derivation.) It immediately follows that the identification condition (that E[log f(y, I x,; II)] be maximized uniquely at llo) is satisfied if
conditional density identification: f(y,
I x,; II) 'I
f(y,
for all II '111 0 .
I x,; llo) (7.2.13)
BJf f{y 1 I x1: 6) = 0, then its log cannot be defined. If this bothers you, assume f(Yt I Xt: 6) is positive for all Yt, x1, and 6. See White ( 1994, p. 9) for an argument that avoids such a simplifying assumption.
464
Chapter7
We can tum the two consistency theorems forM-estimators, Propositions 7.3 and 7.4, into ones for conditional ML. with w, = (y,. x;)' and m(w,; {/)=log f · (y, I x,; (!). Replacing the identification condition forM-estimators by conditional density identification, we obtain the following two self-contained statements about consistency of ML. Proposition 7.5 (Consistency of conditional ML with compact parameter space): Let fy,. x,} be ergodic stationary wich conditional density f(y, I x,; (1 0 ) and Jet (I be the conditional ML estimator, which maximizes che average Jog conditional likelihood (derived under che assumption chat fy,. x,} is i.i.d.):
-
-
(I=
]
argmax9E0 n
L log f(y, I x,; {/). n
t=l
Suppose the model is correctly specified so chat (1 0 is in e. Suppose chat (1} che parameter space e is a compact subset ofJRP. (ii) f(y, I x,; {I) is continuous in (I for all (y,, x,). and (iii) f(y, I x,; {/)is measurable in (y,. x,) for all (I E e (so by Lemma 7.1 (/is a well-defined random variable). Suppose, furcher, chat
-
(a) (identification) Prob[J(y, I x,; 0)
f
(b) (dominance) E[ sup8 E 8 1log f(y 1 over y, and x, ).
I x,; 0)11
Then
(I ~P
f(y,
I x,; 0 0 )]
> 0 fora!! (If 0 0 in
e.
< oo (note: che expectation is
Oo.
Proposition 7.6 (Consistency of conditional ML without compactness): Let fy,, x,} be ergodic stationary wich conditional density f (y, I x,; (/ 0 ) and let (I be che conditional ML estimator, which maximizes che average Jog conditional likelihood (derived under the assumption chat fy,, x,} is i.i.d.): ]
iJ = argmax8Ee
n
n
L log f(y, I x,; 0). t=l
Suppose che model is correctly specified so chat (1 0 is in e. Suppose that (i) che true parameter vector (1 0 is an element ofthe interior of a convex parameter space e (c JRP ). (ii) log f (y, I x,; (I) is concave in (I for all (y,, x, ). and (iii) log f (y, I x,; {I) is measurable in (y,, x,) for all (I in e. (For sufficiently large n, (I is well defined; see a few Jines below.) Suppose, furlher, that (a) (identification) Prob[J(y, I x,; 0)
f
f(y,
I x,; 0 0 )]
> 0 for all (If 0 0 in
e.
465
Extremum Estimators
(b) E[ I log f(y 1 I x,; 11)1] < oo (i.e., E[log f(y, I x,; II)] exists and is finite) for all II E 0 (note: the expectation is over Yr and x,). Then, as n ~ oo, iJ exists with probability approaching 1 and
iJ
~P llo.
There are two remarks worth making. • The Kullback-Leibler information inequality (7.2.12) holds for unconditional densities as well (the proof is easier for unconditional densities). Therefore, Propositions 7.5 and 7.6 hold for unconditional ML; just replace j(y, I x,; II) by j(w,; II). • The noteworthy feature of these two ML consistency theorems is that, despite the fact that the log likelihood is derived under the i.i.d. assumption for {y1 , x,}. consistency is assured even if {y1 , x,} is not i.i.d. but merely ergodic stationary. This is because the ML estimator is an M-estimator. An ML estimator that maxintizes a likelihood function different from the model's likelihood function is called a quasi-ML estimator or QML estimator. Put differently, a QML estimator maxintizes the likelihood function of a ntisspecified model. Therefore, the "maximum likelihood" estimators in Propositions 7.5 and 7.6 are QML estimators if {Yr. x,} is ergodic stationary but not i.i.d. The propositions say that those particular QML estimators are consistent. The identification condition in conditional ML can be stated in even more primitive terms for specific examples. Example 7.8 (Identification in linear regression model): For the linear regression model with normal errors, the log conditional likelihood for observation t is
logf(y,
I
I
I
2
2
2a 2
I x,; II)= --log(2rr)- -log(a 2 ) - -
, (y,- x,{J) 2 .
Since the log function is strictly monotonic, conditional density identification is equivalent to the condition that log j(y, I x,; II) of log j(y, I x,; 11 0 ) with positive probability if II of llo. This condition is satisfied if E(x,X,) is nonsingular (and hence positive definite). To see why, let II = (fJ', a 2 )' and llo = ({J~. aJY. Clearly, log f(y, I x,; II) of log f(y, I x,; llo) with positive probability if a 2 of aJ. So consider the case where a 2 = aJ but fJ of {J 0 • Then
466
Chapter 7
Hence, x;fJ ol x;fJ 0 with positive probability; if >
>
The last term is finite if E(x,x;) is. These results, together with the fact noted earlier that the log likelihood is concave after reparameterization, imply that the conditional ML estimator of the linear regression model with normal errors (derived under the i.i.d. assumption for {y,, x,}) is consistent if {y,, x,} is ergodic stationary and E(x,x;) is nonsingular. Example 7.9 (Identification in probit): The same conclusion just derived in the previous example for the linear model also holds for the probit model, whose conditional likelihood for observation t is
The same argument used in Example 7.8 implies that, under the nonsingularity of E(x,x;), x;o ol x;Oo with positive probability if 0 ol 00 . But since
(7.2.14)
Combining this bound for I log
467
Extremum Estimators
singularity of E(x,x;n• We therefore conclude that the probit ML estimator (whose log likelihood is derived under the i.i.d. assumption for {y,, x,}) is consistent if {y,, x,} is ergodic stationary and E(x,x;) is nonsingular. Consistency of GMM We now turn to GMM and specialize Proposition 7.1 to GMM by verifying the conditions of the proposition under a set of appropriate sufficient conditions. The GMM objective function is given by (7.1.3). Clearly, the continuity of Qn(ll) in II is satisfied if g(w,; II) is continuous in II for all w,. If {wr} is ergodic stationary, then g.(ll) (= ~ L:~~• g(w,; II)) -->p E[g(w,; II)]. So the limit function Q 0 (11) is
(7.2.15) This function is nonpositive if W (= plim W) is positive definite. It has a maximum of zero at 11 0 because E[g(w,; 11 0 )] = 0 by the orthogonality conditions. Thus, the identification condition (that Q 0 (11) be uniquely maximized at 11 0 ) in Proposition 7.1 is satisfied if E[g(w,; II)] i= 0 for II i= 11 0 . Regarding the uniform convergence of Qn(ll) to Q 0 (11), it is not hard to show that the condition is satisfied if g.O converges uniformly to E[g(w,; ·)](see Newey and McFadden, 1994, p. 2132). The multivariate version of the Uniform Law of Large Numbers stated right below Lemma 7.2, in turn, provides a sufficient condition for the uniform convergence. So we have proved Proposition 7.7 (Consistency of GMM with compact parameter space): Let {w,} be ergodic stationary and let II be the GMM estimator defined by
-
where the symmetric matrix
Wis assumed
to conve~ge in probability to some
90nly for interested readers, here is a proof: I log /(Yr I xr; 6)1 ~ IYt II log $(x;B)I ~ pog $(x;B)I
Yrlllog $( -x;B)I
+ I log$( -x;B)I
(since IYr I ~ I and II - Y1 I ~ 1)
+ 1x;o1 + 1x;01 2] (by (7.2.14)) 2 x [llog(O)I + llxrii·IIOII + llxrll 2 ·11611 2 ].
<0 2
<0
+ II -
x [I log (O)I
The last inequality is due to the Cauchy-Schwartz inequality that lx'yl :5 llxiiiiYII for any two conformable vectors x and y. Also, E(llx1 11 2 ), which is the sum of E(x~) over i, is finite because the nonsingularity of E(x1 x;) implies E(x~) < oo for all i. Jf E(llxJ 11 2 ) < oo, then ECIIx1 ID < oo.
468
Chapter 7
symmetric positive definite matrix W. Suppose that the mndel is correctly specified in that E[g(w 1; 0 0 )] = 0 (i.e., the orthogonality conditions hold) for 0 0 E EJ. Suppose that (i) the parameter space EJ is a compact subset of JRP, (ii) g(w,; 0) is continuous in 0 for all w,, and (iii) g(w,; 0) is measurable in w, for all I} in EJ (so I} is a well-defined random variable by Lemma 7. 1). Suppose, further, that
.
(a) (identification) E[g(w,; 0)]
#
0 for all 0
#
Oo in EJ,
(b) (dominance) E[sup 9 E 8 IIg(w,; 0)111 < oo. Then iJ
__.P
Oa.
If Qn(O) is concave, then the compactness of EJ, the continuity of g(w,; 0), and the dominance condition can be replaced by the condition that E[g(w,; 0)]
exist and be finite for all 0. We will not state this result as a proposition because there are very few applications in which Qn(O) is concave. For example, the GMM objective function for the Euler equation model in Example 7.5 is not concave. Just about the only well-known case is one in which g(w,; I}) is linear in 0, but the linear case has been treated rather thoroughly in Chapter 3. When g(w,; 0) is nonlinear in 0, specifying primitive conditions for (a) (identification) and (b) (dominance) in Proposition 7.7 is generally quite difficult. So in most applications those conditions are simply assumed. QUESTIONS FOR REVIEW
1.
(Necessity of density identification) We have shown for conditional ML that the conditional density identification (7.2.13) is sufficient for condition (a) (that E[log f(y, I x,; 0)] be maximized uniquely at 00 ). That it is also a necessary condition is clear from the Kullback-Leibler information inequality. Without using this inequality, show the necessity of conditional mean identification. Hint: Show that, if (7.2.13) were false, then (a) would not hold. Suppose that there exists a 0 1 # 0 0 such that 0 1 E EJ and f(y, I x,; 0 1 ) = j(y, I x,; 0 0 ). Then E[log f (y, I x,; 0 1)] = E[log f (y, I x,; Oo)].
2. (Necessity of conditional mean identification) Without using (7.2.8), show that conditional mean identification is necessary as well as sufficient for identification in NLS. 3. (Identification in NLS) Suppose that the rp function in NLS is linear: rp(x1 ; 0) = X,O. Verify that identification is satisfied ifE(x,X,) is nonsingular.
469
Extremum Estimators
4. (Identification in NLS estimation of pro bit) In the pro bit model, E(y, I x,) =
6. (Identification in linear GMM) Consider the linear GMM model of Chapter 3. Verify that the rank condition for identification derived in Chapter 3 is equivalent to the identification condition in Proposition 7. 7. 7. (Concavity of the objective function in linear GMM) Verify that the GMM objective function in linear GMM is concave in 9. Hint: The Hessian (the ~ L:~~~ x,z;. matrix of second derivatives) of Qn(9) is -S:U WS., where Sxz A twice continuously differentiable function is concave if and only if its Hessian is everywhere negative semidefinite.
=
8. (GMM identification with singular W) For W, if we require it to be merely
positive semidefinite, not necessarily positive definite, what would be the identification condition in Proposition 7.7? Hint: Suppose W is positive semidefinite. Show that if E[g(w,; 9 0 )] = 0 and WE[g(w,; 9)] oft 0 for 9 oft 90 , then Q0 (9) = E[g(w,; 9)]'W E[g(w,; 9)] has a unique maximum at 9 0 .
-4
7.3 Asymptotic Normality The basic tool in deriving the asymptotic distribution of an extremum estimator is the Mean Value Theorem from calculus. How it is used depends on whether the estimator is an M-estimator or a GMM estimator. Therefore, the proof of asymptotic normality will be done separately forM-estimators (including ML as a special case) and GMM estimators. This is followed by a brief remark on the relative efficiency of GMM and ML. By way of summing up, we will point out at the end of
470
Chapter 7
the section that the Taylor expansion for the sampling error has a structure shared by both M-estimators and GMM. Asymptotic Normality of M-Estimators Reproducing (7.1.2), the objective function forM-estimators is
I Q.(O) = n
n
L
(7.3.1)
m(w,; 0).
t=l
It will he convenient to give symbols to the gradient (vector of first derivatives) and the Hessian (matrix of second derivatives) of them function as . n)- am(w,; 0)
S (Wr, u (px I)
H(w,; OJ (pxp)
-
=
a0 as(w,; 0)
ao'
(7.3.2)
,
a2 m(w,; 0)
= aoao'
(7.3.3)
In analogy to ML, s(w,; 0) will be referred to as the score vector for observation t. This s(w,; 0) should not be confused with the score in the usual sense, which is the gradient of the objective function Q. (0). The score in the latter sense will be denoted s.(O) later in this chapter. The same applies to the Hessian: H(w,; 0) will be referred to as the Hessian for observation t, and the Hessian of Q. (0) will be denoted H.(O). The goal of this subsection is the asymptotic normality of 0 described in (7.3.10) below. In the process of deriving it, we will make a number of assumptions, which will he collected in Proposition 7.8 below. Assume that m(w,; 0) is differentiable in 0 and that 0 is in the interior of e. So 0, being the interior solution
-
-
-
to the problem of maximizing Q. (0), satisfies the first-order conditions
0
(pxl)
=
aQ.
n
'
= - L::s(w,; 0)
n
(hy (7.3.1) and (7.3.2)).
(7 .3.4)
t=l
We now use the following result from calculus: Mean Value Theorem: Let h : JRP -> ffi:" be continuously differentiable. Then h(x) admits the mean value expansion
471
Extremum Estimators
(qxl)
ah(x)
+
h(x) = h(Xo)
Ox'
(qxl)
(x- Xo), (pxl)
(qxp)
where xis a mean value lying between x and J
2
-
aQn(6) - aQn(6o) +a Qn(6) (0- 6o) a6 a6 a6a6' (px '> (pxl)
(pxl)
1
(pxp)
n
=- I:s(w,; 6 0 ) n t=!
+ [In
n
LH(w,;
0) ] (0- 6 0)
t=I
(px I)
(by (7 .3.1 )-(7 .3.3)),
(px I)
(pxp)
(7.3.5) -
'
where 6 is a mean value that lies between 6 and 60 • The continuous differentiability requirement of the Mean Value Theorem is satisfied if m (w,; 6) is twice continuously differentiable with respect to 6. Combining this equation with the first-order condition above, we obtain
I"
[I"
+ -n L
0 =- I:s(w,; 6 0 )
(pxl)
9-
n
t=l
-]·
H(w,; 6) (6- 6 0 ).
(7.3.6)
t=1
Assuming that ~ L:7~ 1 H(w,; 6 0 to yield
•
0) is nonsingular,
[I"L
this equation can be solved for
-J-l]Jn
./ii(6- 60 ) =- -;;
H(w,; 6)
t=l
n
I:s(w,; 60 ).
(7.3.7)
t=1
This expression for (.jii times) the sampling error will be referred to as the mean value expansion for the sampling error. Note that the score vector s(w,; 6) is evaluated at the true parameter value 60 . " " Now, since 6 lies between 60 and 6, 6 is consistent for 6 0 if 6 is. If {w,} is ergodic stationary, it is natural to conjecture that
-nI L H(w,; 6) ->P E[H(w,; 6 n
-
0 )].
(7.3.8)
t=l
10 The Mean Value Theorem only applies to individual elements of h, so that i actually differs from element to element of the vector equation. This complication does not affect the discussion in the text.
472
Chapter 7
The ergodic stationarity of w, and consistency of(} alone, however, are not enough to ensure this; some technical condition needs to be assumed. One such technical condition is the uniform convergence of~ L:7~ 1 H(w,; ·)to E[H(w,; -)]in a neighborhood of 9 0 . 11 But hy the Uniform Convergence Theorem, the uniform convergence of~ L:7~ 1 H(w,; ·) is satisfied if the following dominance condition is satisfied for the Hessian: for some neighhorhood Jl of 9o, E[ sup 9 E.N IIH(w,; 9) Ill < oo (this is condition (4) below). 12 This is a sufficient condition for (7 .3.8); if you can directly verify (7.3.8), then there is no need to verify the dominance condition for asymptotic normality. Finally, if (7.3.8) holds and if ]
r.:: -vn
n
L s(w,; 9 t=!
0 ) --> d
N(O, :E),
(7.3.9)
then by the Slutzky theorem (see Lemma 2.4(c)) we have
./ii(iJ- 9o)-:
N(o. (E[H(w,; 9ol]r':E(E[H(w,; 9ol]r').
(7.3.10)
Collecting the assumptions we have made so far, we have Proposition 7.8 (Asymptotic normality of M-estirnators): Suppose that the conditions of either Proposition 7.3 or Proposition 7.4 are satisfied, so that {w,} is ergodic stationary and theM-estimator(} defined by (7.1.1) and (7.1.2) is consistent. Suppose, further, that A
(1) 9o is in the interior of
e.
(2) m(w,; {})is twice continuously differentiable in(} for any w,, (3) }. L:7~, s(w,; 9 0 ) in (7.3.2),
->d
N(O, 1:), l: positive definite, where s(w,; {})is defined
11 This is a consequence of the following result (see, e.g., Theorem 4.1.5 of Amemiya, 1985): Suppose a sequence of random functions h 11 (6) converges to a nonstochastic function ho(9) uniformly in fJ over an open neighborhood of 80 . Then plim11 _... 00 hn(8) = ho(fJo) if plim 8 = 8o and ho(fJ) is continuous at 8 0 .
Set h11 (8) to~ L~=l H(w 1 ; ·)and ho(fJ) to E[H(wz; 8)]. Continuiry of E[H(w 1 ; 8)] will be assured by the Uniform Convergence Theorem. The uniform convergence needs to be only over a neighborhood of 8o, not over the entire parameter space, because 8 is close to 8o for n sufficiently large. 12 A matrix can be viewed as a vector whose elements have been rearranged. So the Euclidean norm of a matrix, such as IIH(w1; 8) 11. is the square root of the sum of squares of all its elements.
473
Extremum Estimators
(4) (local dominance condition on the Hessian) for some neighborhood oN of 0 0 , E[ sup JIH(w,;
IJ)IIl
< oo,
6eJ/
so that for any consistent estimator IJ, ~ 'L7~ 1 H(w,; 0) where H(w1 ; IJ) is defined in (7.3.3), (5) E[H(w1 ,
--+p
E[H(w,;
IJo)],
IJo)] is nonsingular.
Then iJ is asymptotically normal with Avar(il) = (E[H(w,;
1
IJol]r :E(E[H(w,; IJol]r
1
•
(This is Theorem 4.1.3 of Amemiya (1985) adapted toM-estimators.) Two remarks are in order. • Of the assumptions we have made in the derivation of asymptotic normality, the following are not listed in the proposition: (i) IJ is an interior point, and (ii) ~ 'L7~ 1 H(w,; 0) is nonsingular. It is intuitively clear that these conditions hold hecause iJ converges in prohability to an interior point 0 0 , and ~ 'L7~ 1 H(w,; 0) converges in probability to a nonsingular matrix E[H(w,; IJ 0 )]. See Newey and McFadden (1994, p. 2152) for a rigorous proof. This sort of technicality will be ignored in the rest of this chapter.
-
• If w, is ergodic stationary, then so is s(w,; IJ 0 ) and the matrix :E is the long run variance matrix of (s(w,; IJ 0)]u A sufficient condition for (3) is Gordin's condition introduced in Section 6.5. So condition (3) in the proposition can he replaced by Gordin's condition on {s(w,; IJ 0 )]. It is satisfied, for example, if w, is i.i.d. and E[s(w,; IJ 0 )] = 0. The assumption that :E is positive definite is not really needed for the conclusion of the proposition, but we might as well assume it here because in virtually all applications it is satisfied (or assumed) and also because it will be required in the discussion of hypothesis testing later in this chapter. Consistent Asymptotic Variance Estimation To use this asymptotic result for hypothesis testing, we need a consistent estimate of Avar(O) = (E[H(w,;
1
IJol]r :E(E[H(w,: IJol]r
13 The long-run variance was introduced in Section 6.5.
1 •
474
Chapter 7
Since 9' -->p 9 0 , condition (4) of Proposition 7.8 implies that ]
n
,
-n L
H(w,; 9) --> E[H(w,; 9 0 )]. P
t=l
Therefore, provided that there is available a consistent estimator f of :E, -
Avar(B) =
{]
n
L n
H(w,; B)
t=l
~-1
f { ]-;; Ln
H(w,; 0)
~-[
(7.3.11)
t=l
is a consistent estimator of the asymptotic variance matrix. To obtain f. the methods introduced in Section 6.6, such as the VARHAC, can be applied to the estimated series {s(w,; 0)} under some suitable technical conditions. Asymptotic Normality of Conditional ML We now specialize this asymptotic normality result to ML for the case of i.i.d. data. In ML, where the m function is a log (conditional) likelihood, the first-order conditions (7.3.4) are called the likelihood equations. The fact that them function is a log (conditional) likelihood leads to a simplification of the asymptotic variance ' matrix of 9. We show this for conditional ML; doing the same for unconditional or joint ML is easier. Consider the two equalities in condition (3) of Proposition 7.9 below. The second equality is called the information matrix equality (not to be confused with the Kullback-Leibler information inequality). It is so called because E[s(w,; 9 0 ) s(w,; 90 )'] is the information matrix for observation t. It is left as Analytical Exercise 2 to derive these two equalities under some technical conditions pennitting the interchange of differentiation and integration. Here, we just note the implication of these two equalities for Proposition 7.8. If w, is i.i.d., the Lindeberg-Levy CLT and the first equality (that E[s(w,; 9 0 )] = 0) imply ]
n
.Jri L
s(w,; 9 0 )
7
N(O, :E) where :E = E[s(w,; 9 0 ) s(w,; 9 0 )'].
t=l
(7.3.12) '
This and the information matrix equality imply that Avar(9) in Proposition 7.8 is simplified in two ways as Avar(O) = -jE[H(w,; 9o)
lr
I
= jE[s(w,; 9o) s(w,; 9ol']
r'.
(7.3.13)
475
Extremum Estimators
Thus, we have proved Proposition 7.9 (Asymptotic normality of conditional ML): Let w, (= (y,, x;)') be i.i.d. Suppose the conditions of either Proposition 7.5 or Proposition 7. 6 are satisfied, so that ii ~P 11 0 . Suppose, in addition, that (1) llo is in the interior of e, (2) f(y,
1
x,; II) is twice continuously differentiable in II for all
(y,,
x, ),
(3) E[s(w,; 11 0 )] = 0 and -E[H(w,; 11 0 )] = E[s(w,; 11 0 ) s(w,; 11 0 )'], where sand
H functions are defined in (7.3.2) and (7.3.3), (4) (local dominance condition on the Hessian)
for some neighborhood
J{
of 11 0 ,
E[sup IIH(w,; II) Ill< oo, 8e.N
so that for any consistent estimator ii, ~ L:;~, H(w,; ii) ~P E[H(w,; 11 0) ], (5) E[H(w,; 11 0 ) J is nonsingular.
Then
ii is asymptotically nonnal with Avar(ii) given by (7.3.13).
Two remarks about the proposition: • Condition (3) is stated as an assumption here because its derivation requires interchange of integration and differentiation (see Analytical Exercise 2). A technical condition under which the interchange is legal is readily available from calculus (see, e.g., Lemma 3.6 of Newey and McFadden, 1994). So condition (3) could be replaced by such a technical condition on f(y, I x,; II). We do not do it here because in most applications condition (3) can be verified directly. • It should be clear from the above discussion how this proposition can be adapted to unconditional ML: simply replace f(y, I x,; II) by f(w,; II). To use this asymptotic result for hypothesis testing, we need a consistent estia natural mate of Avar(O). Noting that it can be written as -{E(H(w,; 11 0 ) J
r',
estimator is
1 first estimator of Avar(O) = - { ~
L n
H(w,;
0) ~-1
(7.3.14)
t=l
Since
9 ~P
11 0 , this estimator is consistent hy condition (4) of Proposition 7.9.
476
Chapter 7
Another estimator, hased on the relation Avar(ii) = IE[s(w,; llo) s(w,; llol'] ]
second estimator of Avar(ii) =
{
n
;; ~ s(w,; ii) s(w,; ii)'
rl' is
} -I
(7.3.15)
To insure consistency for this estimator, we need to make an assumption in addition to the hypothesis of Proposition 7.9 (see, e.g., Theorem 4.4 of Newey and McFadden, 1994). We will not show it here hecause in many applications the consistency of this second estimator can be verified directly. There is no compelling reason for preferring one estimator of Avar(ll) over the other. The second estimator, which requires only the first derivative of the log likelihood, is easier to compute than the first. This can be an important consideration when it is impossible to calculate the derivatives of the density function analytically and some numerical method must he used. On the other hand, in at least some cases the first provides a better finite-sample approximation of the asymptotic distribution (see, e.g., Davidson and MacJUnnon, 1993).
-
Two Examples To get a better grasp of the asymptotic normality results about conditional ML, it is useful to see how the results can be applied to familiar examples. We consider two such examples. Example 7.10 (Asymptotic normality in linear regression model): To reproduce the log conditional density for observation t for the linear regression model with normal errors,
With e = JRK x Il4+ (where K is the dimension of {3), condition (I) of Proposition 7.9 is satisfied. Condition (2) is obviously satisfied. To verify (3), a routine calculation yields: 14
s(w,; II) =
I-] [ + ~~-~
" l
- 2a2
I '"'2
2a4
e,
, H(w,; II)=
[I' -~~~
"
-
1
f
"
;,4Xt . Cr
I - + I c,-3] '
-2a4Xt. Cr I 4<:r4 -
l '"'2 la6cr
2a6Xt.
+
I "4 4asC 1
(7.3.17) 14For lhe parameter a2 the differentiation is with respect to a 2 rather than a.
477
Extremum Estimators
where w, = (y 1, x;)', 9 = (fJ', a 2 )' and f:, = y,- x;p (which should be distinguished from£, = y,- x;fJ 0 ). So for 9 = 9o the f:, in these expressions can be replaced by £ 1 • In the linear regression model, E(£1 1 x,) = 0. Also, since £ 1 is N(O, a~). we have E(
-.!, E(x,x;)
- E[H(w,; 9o) J = E[s(w,; 9o) s(w,; 9ol'] =
"o
[
o'
: ] . (7.3.18) 2a04
If E(x,x;) is nonsingular, then E[H(w,; 9 0 ) J is nonsingular and (5) is satisfied.
-
Regarding condition (4 ), let E:, = y, - x; p for some consistent estimator and 2 . Condition (7.3.8) in this example is
p
a
(7.3.19)
It is straightforward to show this, given that E:, = £,-x;
Example 7.11 (Asymptotic nonnality of probit ML): To reproduce the log conditional likelihood of the pro bit model, log f(y,
1
x,; 9) = y,log (x;9) +(I- y,) log[! - (x;9)]. (7.3.20)
With 8 = JRP, condition (I) of Proposition 7.9 is satisfied. Condition (2) is obviously satisfied. To verify (3), a tedious calculation yields (y,- (x;9))(x;9))(x;9) ''
H
(7.3.21)
. - {- [ y, - (x;9l ]' ' ' (x;9)(1 - (x;9)) [
y,- (x;9l
+ [ (x;9)(1 (To derive (7.3.22), use the fact that
Y?
J '( , )} x,x,.,
- (x;9))
(7.3.22)
= y,.) Here, (·) is the cumulative
478
Chapter 7
density function of N(O, I) and¢(·)= <1>'0 is its density function. It is then easy to prove the conditional version of (3): E[s(w,; llo) I x,] = 0 and - E[H(w,; llo) I x,] = E[s(w,; llo) s(w,; llo)' I x,].
(7.3.23)
So (3) is satisfied by the Law of Total Expectations. In particular, for probit, it is easy to show that - E[H(w,; llo)] = E[s(w,; llo) s(w,; llol'] = E[ ).(x;llo)A( -x;llo)x,x;]. (7.3.24) where ).(v)
=
t/>(v) I -
(7.3.25)
is called the inverse Mill's ratio or the hazard for N(O, 1). It can be shown that the term in braces in (7.3.22) is between 0 and 2 (see Newey and McFadden, 1994, p. 2147). So (7.3.26)
IIH(w,; IIlii < 211x,x;11.
The Euclidean norm llx,x; II is the square root of the sum of squares of the elements ofx,x;. Therefore, the expectation of llx,x;\1 2 and hence that of llx,x;ll are finite if E(x,x;) (exists and) is finite. So the local dominance condition on the Hessian (condition (4)) is satisfied ifE(x,x;) is nonsingular. It can be shown that condition (5) is satisfied ifE(x,x;) is nonsingular (see footnote 29 on p. 2147 of Newey and McFadden, 1994). We conclude, again, that all the conditions of Proposition 7.9 are satisfied ifE(x,x;) is nonsingular. Asymptotic Normality of GMM Now tum to GMM. The GMM objective function is I
-
I
"
Qn(ll) = --g,(II)'Wg,(ll) with g,(ll) =- I)<w,; II). (Kxl) n t=l 2
(7.3.27)
As in the case ofM-estimators, we apply the Mean Value Theorem to the first-order condition for maximization, but the theorem will be applied to g,(ll), not its first derivative. Thus, unlike in the case of M-estimators, the objective function needs to be continuously differentiable only once, not twice. Titis reflects the fact that the sample average enters the objective function differently for the case of GMM.
479
Extremum Estimators
Assuming that 9' is an interior point, the first-order condition for maximization of the objective function is 0
aQ (OJ , , , " = -G,(9) W g,(9), 88 (pxK) (KxK) (Kxl)
=
(pxl)
(7.3.28)
(px I)
where G,(9) is the Jacobian ofg,(9): G,(9J
= ag,(~J
(7.3.29)
a9
(K xp)
Now apply the Mean Value Theorem to g,(9), notto aQ, (9)(a9 as in M-estimation, to ohtain the mean value expansion g,(O) = g,(9o) (Kxl)
+ G,(O) (0 -
(Kxl)
(Kxp)
9o).
(7.3.30)
(pxl)
Substituting this into the first-order condition above, we obtain
,.,,_ 0
= -G,(9)
(pxl)
Solving this for
(pxK)
,.,,
W g,(9 0 ) - G,(9) (KxK)
(Kxl)
(pxK)
(0- 9 0 ) and noting that g,(9) =
__
,.,
W G,(9) (9- 9 0 ). (7.3.31)
(KxK) (Kxp)
(pxl)
~ L:;~, g(w,; 9), we obtain
-1 "
"
./n(9- 9 0 ) = - [ G,(9) (pxl)
(pxK)
I
..-..
-
W G,(9) ]
(KxK) (Kxp)
" G,(9) (pxK)
W
"
Ir.: L...., " g(w,; 9 0 ).
(KxK) yn
t= I
(Kx I)
(7.3.32) In this expression, the Jacobian G, (9) of g,(9) is evaluated at two different points, 0 and 9. Since g,(9) = ~ L:;~, g(w,; 9), the Jacobian at any given estimator 0 can be written as
"
-
G (O) = ~" ag(w,; 9) n n L. ao' .
(7.3.33)
t=l
If w, is ergodic stationary and 9 is consistent, it is natural to conjecture that this expression converges to E[ag(w,; 90 )(a9']. As was true forM-estimators, this conjecture is true if ag(w,; 9)(a9' satisfies the suitable dominance condition specified in condition (4) below. It should now be clear that the GMM equivalent of Proposition 7.8 is
480
Chapter 7
Proposition 7.10 (Asymptotic normality of GMM): Suppose that the conditions of Proposition 7. 7 are satisfied, so that {w,} is ergodic stationary, W(K x K) converges in probability to a symmetric positive definite matrix W, and the p-dimensional GMM estimator 9 is consistent. Suppose, further, that
.
(I) 9o is in the interior of
e.
(2) g(w,; 9) is continuously differentiable in 9 for any w,, (Kx I)
(3)
!o L:7= 1 g(w,; 9o) -->d N(O, (KxK) S ), S positive definite,
vn
(4) (local dominance condition on •·~~',' 81 ) for some neighborhood N of 90 ,
, 9) II] < oo, E[ sup I og(w,; 9 a eo1
~ any consistent · · ag(w 9- • 111 "'" so that 10r estimator ~r=l ao ·B) --> p E[ag(w,;8o)J 88' ' 1 ,'
(Kxp)
(5) E["•<;~: 801 ] is of full column rank. (Kxp)
Then the following results hold:
.
(a) (asymptotic normality) 9 is asymptotically normal with Avar(O) = (G' WG)- 1G' WS WG(G' WG)- 1, (pxp)
where G
= E[ag(w,;Bol],
(Kxp)
CO
(b) (consistent asymptotic variance estimation) Suppose there is available a consistent estimateS of S. The asymptotic van·ance is consistently estimated by
-
Avar(O) =
(G' WG) -'G' WSWG(G' WG) -I,
(7.3.34)
The assumption that S is positive definite is not really needed for the conclusion of the proposition, hut we might as well assume it here because in virtually all applications it is satisfied (or assumed) and also because it will be required in the discussion of hypothesis testing later in this chapter. It is useful to note the analogy between linear and nonlinear GMM by comparing this proposition to Proposition 3.1. In linear GMM, g, (9) is given by
481
Extremum Estimators ]
g,(ll) = (;;-
11
1
11
I:X, · y,) - (;;- L x,z; )11. 1=1
(7.3.35)
t=.l
So G _ G.(il) in Proposition 7.10 reduces to -(~ L: x,z;), and G reduces to - E(x,z;). The thrust of Proposition 7.10 is that the asymptotic distribution of the nonlinear GMM estimator can be ohtained hy taking the linear approximation of g, (II) around the true parameter value. The analogy to linear GMM extends further: • (Efficient nonlinear GMM) Using the current notation, the matrix inequality (3.5.11) in Section 3.5 is (7.3.36) which says that the lower bound for Avar(iJ) is (G'S- 1G)-'- The lower bound is 1 achieved ifW satisfies the efficiency condition that plim.~oo = • namely, = for some consistent estimator of s. that
w s-
w s-l
• (J statistic for testing overidentifying restrictions) Suppose that the conditions
w s-l
w
of Proposition 7.10 are satisfied and that = so that satisfies the ef= s- 1 • Then the minimized distance J = ng,(O)' ficiency condition: plim 1g,(0) (= -2nQ.(O)) is asymptotically chi-squared with K- p degrees of freedom. The proof of this result is essentially the same as in the linear case; see Newey and McFadden (1994, Section 9.5) for a proof.
w
s-
• (Estimation of S) Sis the long-run variance of {g(w,; 11 0 )}. Under some suitable technical conditions, the methods introduced in Section 6.6- such as the VARHAC-can be applied to the estimated series {g(w,; 0)} to obtainS. In particular, ifw, is i.i.d., then S = E[g(w,; 11 0 )g(w,; llol'] and the sample second ~ moment of g(w,; II) can serve asS.
-
GMM versus Ml The question we have just answered is: taking the orthogonality conditions as given, what is the optimal choice of the weighting matrix W? A related, but distinct, question is: what is the optimal choice of orthogonality conditions?" This question, too, has a clear answer, provided that the parametric form of the density of the data (w 1 , ••• , w.) is known. Here, we focus on the i.i.d. case with the 15 Yet another efficiency question concerns the optimal choice of instruments in generalized nonlinear instrumental estimation (a special case ofGMM where the g function is a product of the vector of instruments and the error rerrn for an estimation equation). See Newey and McFadden (1994, Section 5.4) for a discussion of optimal instruments.
482
Chapter7
density of w, given by f(w,; IJ). Let ii be the GMM estimator associated with the orthogonality conditions E[g(w,; IJ 0)] = 0. Its asymptotic variance Avar(O) is given in Proposition 7.10. It is not hard to show that Avar(ii)::: E[s(w,; IJ 0 )s(w,; IJ 0
)'r'
. IJ ) where s (w, 0 = (pxl)
alog J(w,; IJo) . aiJ
(7.3.37) That is, the inverse of the information matrix E(ss') is the lower bound for the asymptotic variance of GMM estimators. This matrix inequality holds under the conditions of Proposition 7.10 plus some technical condition on f (w,; IJ) which allows the interchange of differentiation and integration. See Newey and McFadden (1994, Theorem 5.1) for a statement of those conditions and a proof. Asymptotic efficiency of ML over GMM estimators then follows from this result because the lower bound [E(ss')]- 1 is the asymptotic variance of the ML estimator. The superiority of ML, however, is not very surprising, given that ML exploits the knowledge of the parametric form f(w,; IJ) of the density function while GMM does not. As you will show in Review Question 4 below, GMM achieves this lower bound when the g function in the orthogonality conditions is the score for observation t: g(w 1 ; IJ) = s(w,; IJ) (Kxl)
=
alog f(w,; IJ) aiJ
(pxl)
.
(7.3.38)
Therefore, the GMM estimator with optimal orthogonality conditions is asymptotically equivalent to ML. Actually, they are numerically equivalent, which can be shown as follows. Since K (the number of orthogonality conditions) equals p (the number of parameters) when g is chosen optimally as in (7.3.38), the GMM estimator ii should satisfy g,(ll) = 0, which under (7.3.38) can be written as I n - Z:::s(w,;
n
0) =
0.
(7.3.39)
t=l
This is none other than the likelihood equations (i.e., the first-order condition for ML). They have at least one solution because the ML estimator satisfies them. If they have only one solution, then IJ is the ML estimator as well as the GMM estimator. If they have more than one solution, we need to choose one from them as the GMM estimator. Since one of the solutions is the ML estimator, we can A
simply choose it as the GMM estimator.
483
Extremum Estimators
Expressing the Sampling Error in a Common Format To prepare for the next section's discussion, it is useful to develop a slightly different proof of asymptotic normality. We have used the Mean Value Theorem to derive the mean value expansion for the sampling error, which is (7 .3. 7) for M-estimators and (7.3.32) for GMM. We show in this suhsection that they can be written in a common format, (7.3.43) below. Consider M-estimators first. Noting that aQa~901 = ~ L::7~ 1 s(w,; 60 ), the mean value expansion (7.3.7) can be written as
.Jri (660 ) =
-
L
[I " H(w,; 6) n t=l
]-I .Jri
oQ.(6o) . a6
(7.3.40)
By condition (4) of Proposition 7.8, ~ L::7~ 1 H(w,; i}) converges in probability to some p x p symmetric matrix >II given by
>II = E[H(w,; 6o)].
(7.3.41)
(pxp)
So rewrite (7.3.40) as
By construction, the term in braces converges in probability to zero. By condition (3) of Proposition 7.8, .Jri aQa~901 (= }. L::7~ 1 s(w,; 6 0 )) converges to a random
variable. So the last term, which is the product of the term in hraces and .Jri aQa~901 , converges to zero in probability by Lemma 2.4(b). This fact can be written compactly as a Taylor expansion: 16
.Jri
~-~
o6 (pxl)
a, ,
~xl)
(7.3.43)
where the term "o," means "some random variable that converges to zero in probability."17 The exact expression for the o, term will depend on the context. Here, it equals the last term in (7.3.42). As you will see, all that matters for our argument
1611 is apt to caJl this equation a Taylor expansion rather than a mean vaJue expansion because the matrix that multiplies Jn aQato) is evaJuated at 9o rather than at 6 as in the mean vaJue expansion. 17 The "op" notation was introduced in Section 2.1.
484
Chapter 7
is that the Op term vanishes (converges to zero in probability), not its expression. Since the difference between .jii(iJ -8 0 ) and -'i'- 1./ii aQ.~Do> vanishes, the asymptotic distribution of .jii(iJ -8 0 ) is the same as that of -'i'- 1./ii aQ.~OoJ by Lemma 2.4(a). It then follows that .jii(iJ- IJ 0 ) converges to a normal distrihution with mean zero and asymptotic variance given by Avar(IJ) = '1'- >1:'1'- , where
l: (pxp)
=Avar (aQ.(IJo)) ao .
(7.3.44)
ForM-estimators, .jii aQ.~Dol = }. L:7~ 1 s(w,; IJ 0 ) and l: is the long-run variance of s(w,; IJo). Setting 'I' = E[H(w,; IJo) J gives the expression for Avar(O) in Proposition 7.8. Next tum to GMM. The expression for .jii aQg~Do) is
r.: aQ.(IJo) ,- I ~ "" = -[G.(IJo)l W r.: L..,g(w,; IJo).
-vn
aiJ
(7.3.45)
t=l
L:
(Recall that G.(IJ) = 8 ~9<~> .) Since G.(IJ 0 ) = ~ '•';/o> ~P G (= E['•';~;Do>]), W~P W, and }. L:7~ 1 g(w,; IJ 0) ~d N(O, S) underthe conditions of Proposition 7.10, we have .jii aQ.~Oo) converging to a normal distribution with mean zero and 1:
= Avar(aQ.(IJo))
=
(J(J
(pxp)
G'
W
S
W
G .
(7.3.46)
(pxK)(KxK)(KxK)(KxK)(Kxp)
Now rewrite the mean value expansion for the sampling error for GMM (7 .3.32) as
n
c
- - .jii I "L..,g(w,; IJo). = -G.(IJ)'W
(7.3.47)
t=I
It is left as Review Question 5 below to show that this can be written as the Taylor expansion (7.3.43), with the symmetric matrix 'I' given by '1'=-G' (pxp)
W
G.
(7.3.48)
(pxK) (KxK) (Kxp)
Substituting (7.3.48) and (7.3.46) into (7.3.44), we obtain the GMM asymptotic variance indicated in Proposition 7.10. Table 7.1 summarizes the discussion of this subsection by indicating how the substitution should be done to make the common formulas applicable to M-estimators and GMM.
Table 7.1: Taylor Expansion for the Sampling Error
Terms for substitution
M-estimators
GMM
Q.(ll)
I n - Lm(w,; II) n
-2g.(ll) Wg.(ll)
I
,-
t=l
Jii aQ.(llo) all
"' l:
I
n
Jii L
s(w,; 11 0 )
-[G.(II 0 )]'W
t=l
Jn
tg(w,; llo) t=l
E[H(w,; 11 0 )]
-G'WG
long-run variance ofs(w,; 11 0 )
G'WSWG
NOTE: See (7.3.2) and (7.3.3) for definition ofs(w1 ; 9) and H(w 1 ; 9). Also.
486
Chapter 7
In passing, it is useful for the next section's discussion to note that :E = - ljl 1). for ML with i.i.d. observations and efficient GMM (the GMM with W =
s-
For ML, :E = E[s(w,; 11 0 ) s(w,; 11 0 )'], which equals the negative of the "' given in (7.3.41) thanks to the information matrix equality. For efficient GMM, setting w = s-l in (7.3.46) gives (7.3.49) which is the negative of the "' in (7.3.48). QUESTIONS FOR REVIEW
1. (Score and Hessian of objective function) ForM-estimators and ML in particular, we defined the score for ohservation 1, s(w,; II), to be the gradient (vector of first derivatives) of m(w,; II) and the Hessian for observation 1, H(w,; II), to be the Hessian of m (w,; II). Let the score (without the qualifier "for observation 1") be the gradient of the objective function Q,(ll) and denote it as s,(ll). Let the Hessian (without the qualifier "for observation t") be the Hessian of Q, (II) and denote it as H,(ll). Under the conditions of Proposition 7.9, show that E[s,(llo)] = 0,
-~ E[H,(IIo)] =
E[s,(llo)s,(ll 0 )'].
n
Hint: {s(w,; 11 0 )} is i.i.d. 2. (Conditional information matrix equality for the linear regression model) For the linear regression model of Example 7.10, verify that E[s(w,; llo)
I x,]
=0
and - E[H(w,; llo)
I x,]
= E[s(w,; llo) s(w,; llo)' I x,].
3. (Avarfor the linear regression model) For the linear regression model of Example 7.10, let (ji, &2 ) be the ML estimate of (fJ, 0" 2 ). Write down the first esti-
-
..----.,.
mator of Avar(ll) defined in (7.3.14). Does Avar(fJ) equal & 2 (~ [Answer: No.]
L x,x;)- 1?
4. (GMM with optimal orthogonality conditions) Assume that the relevant technical conditions are satisfied for the hypothetical density j(w,; II) so that the information matrix equality holds. Using Proposition 7.10, show that when the g function in the orthogonality conditions is given as in (7 .3.38), the asymptotic variance matrix of the GMM estimator equals the inverse of the information matrix for observation I (which is the lower bound in (7.3.37)). Hint: Since K equals p, G is a square matrix and the expression tor Avar(ll) in
-
Extremum Estimators
487
Proposition 7.10 reduces to Avar(O) = G-'SG'- 1 • The choice of g implies G = E[H(w,; 9 0 )] and S = E[s(w,; 9 0 )s(w1 ; 90 )']. 5. (Taylor expansion of the sampling error for GMM) Show that (7.3.47) can be written as the Taylor expansion (7.3.43) under the conditions of Proposition 7 .I 0, by taking the following steps. (a) Show that B- 1 = 'it- 1 + Yn where Yn ~P 0. (b) Show that c = Jn aQa~Oo) + Yn where Yn ~P 0. Hint: Gn(O)- Gn(9 0 ) = ' I (Gn(O) -G)- (Gn(Oo)- G) ~P 0 and W,;;; L:7~t g(w,; Oo) converges in distribution to a random variable.
!il 8 Qn(Ool+x wherexn=- -[Yn'\1" !il (c) Showthat-B-'c= -IJI- 1 V" 88 r~> >Jt-IYn + YnYn]. (d) Show that Xn
~P
8Qn(Oo)+ 88
0.
6. (Why the GMM distance is deflated by 2) Verify that, if the negative of the distance g.(O)'S- 1g.(9) were not divided by 2 in the definition of the efficient GMM objective function, then we would not have :E = -'it.
7.4 Hypothesis Testing As is well known for ML, there is a trio of statistics-the Wald, Lagrange multiplier (LM), and likelihood ratio (LR) statistics-that can be used for testing the null hypothesis. The three statistics are asymptotically equivalent in that they share the same asymptotic distribution (of x2 ). In this section we show that the asymptotic equivalence of the trinity can be extended to GMM by developing an argument applicable to both ML and GMM. The key to the argument is the common format for the sampling error derived at the end of the previous section. On the first reading, you may wish to read the next subsection and then jump to the last subsection. The Null Hypothesis We have already considered, in the context of the linear regression model, the problem of testing a set of r possibly nonlinear restrictions (see Section 2.4). To recapitulate, let 9 0 be the p-dimensional model parameter. The null hypothesis can
488
Chapter 7
be expressed as
Ho : a(Oo) = 0.
(7.4.1)
(rx I)
We assume that a(-) is continuously differentiable. Also, let A(O)
= oa(~)
(>xp)
ao
(7.4.2)
be the Jacobian of a(O). We assume that
= A(0 0 ) is offull row rank.
Ao
(7.4.3)
(rxp)
This ensures that the r restrictions are not redundant. This rank condition implies that r s p. Let 0 be the extremum estimator in question. It is either ML or GMM. It solves the unconstrained optimization problem (7 .1. I). The Wald statistic for testing the null hypothesis utilizes the unconstrained estimator. The LM statistic, in contrast, utilizes the constrained estimator, denoted 0, which solves max Q.(O) s.t. a(O) = 0.
(7.4.4)
8E8
As was shown in Section 7.2, the true parameter value 0 0 solves the "limit unconstrained problem" where Q. in the unconstrained optimization problem (7 .1.1) is replaced by the limit function Q0 . It also solves the "limit constrained problem" where Q. in the above constrained optimization problem is replaced by Q0 , because 0 0 satisfies the constraint. It turns out that the uniform convergence in probability of Q. 0 to Q 0 0 assures that the limit of the constrained estimator 0 is the solution to the limit constrained problem, which is Oo. For a proof, see Newey and McFadden (1994, Theorem 9.1) for GMM and Gallant and White (1988, Lemma 7.3(b)) for general extremum estimators. It is also possible to show that, if all the conditions ensuring the asymptotic normality of the unconstrained estimator are satisfied, then 0. too, is asymptotically normal under the null. For a proof for GMM, see Newey and McFadden (1994, Theorem 9.2); for ML, see, e.g., Amemiya (1985, Section 4.5). 18
18 If you are interested in the proof. it is just the mean value version of the discussion in the subsection on the LM statistic below, which is based on Taylor expansions.
489
Extremum Estimators
The Working Assumptions The argument for showing that the Wald, LM, and LR statistics are all asymptotically x2 (r) is based on the truth of the following conditions.
(A) .jfi(O -11 0 ) admits the Taylor expansion (7.3.43), which is reproduced here (7.4.5) (B) .fii aQa~•oJ __., N(O. I:), I: positive definite. (C) .jfi(O - llo) converges in distribution to a random variable. (D) I:= -'it.
We have shown in the previous section that the first two conditions are satisfied for conditional ML under the hypothesis of Proposition 7.9, for unconditional ML under the hypothesis suitably modified, and for GMM under the hypothesis of Proposition 7.10. As just noted, under the same hypothesis, the constrained estimator II is consistent and asymptotically normal, ensuring condition (C). So conditions (AHC) are implied by the hypothesis assuring the asymptotic normality of the extremum estimator. The Wald and LM statistics, if defined appropriately so as not to depend on condition (D), are asymptotically x2 (r). But their expressions can be simplified if condition (D) is invoked. Furthermore, without the condition, the LR statistic would not be asymptotically x2 As was noted at the end of the previous section, 1• condition (D) is satisfied for ML and also for efficient GMM with W = Therefore, the part of the discussion that relies on condition (D) is not applicable to nonefficient GMM.
s-
The Wald Statistic Throughout the book we have had several occasions to derive the Wald statistic using the "delta method" of Lemma 2.5. Here, we provide a slightly different derivation based on the Taylor expansion. Applying the Mean Value Theorem to a(/1) around llo and noting that a(llo) = 0 under the null, the mean value expansion of .fii a(ii) is
-
.fii a(il) = A(ii) .fii(il- llo). (rxl)
(rxp)
(7.4.6)
(pxl)
Derivation of the Taylor expansion from the mean value expansion is more transparent than in the previous section and proceeds as follows. Since II, lying between
490 (} 0
and
Chapter 7
iJ, is consistent and since A(.) is continuous, A(ii) converges in probability
to Ao (= A(0 0 )). Furthermore, the term multiplying A(ii), y'n(iJ- 0 0 ), converges to a random variable. So (A(ii) - A 0)y'n(iJ - 0 0 ) vanishes (converges to zero in probability) by Lemma 2.4(b). Denoting this term by oP, the mean value expansion can be rewritten as a Taylor expansion:
v'n a(iJ) = Ao (rxl)
y'n(iJ- Oo)
+
(pxi)
(rxp)
oP .
(7.4.7)
(rxl)
(Note that this argument would not be valid if y'n(iJ - 0 0 ) did not converge to a random variable.) Substituting the Taylor expansion (7.4.5) into this, we obtain an equation link1: ing VIi a(O) to VIi
aQab'' c
.
1
c 3Qn(0o)
vna(O)=-Ao "'- vn (rx 1)
(rxp) (pxp)
ao
+op.
(7.4.8)
(px I)
(The Op here is A 0 times the oP in (7.4.5) plus the oP in (7.4.7), but it still converges to zero in probability.) It immediately follows from this expression and the asymptotic normality of y'n
aQab'' 1 (condition (B)) that y'n a(O) converges to a normal
distribution with mean zero and Avar(a(O)) = Ao"'- 1 1:"'-'A~ (rxr)
= A 0 :E- 1 A~
(if condition (D) is invoked).
(7.4.9)
(rxp) (pxp) (pxr)
Since the r x p matrix Ao is of full row rank and :E is positive definite (by condition (B)), this r x r matrix is positive definite. Therefore, the associated quadratic form na(O)'[ Ao :E- 1A~r'a(O)
(7.4.10)
is asymptotically x 2 (r) under the null. As noted above, A(iJ) ~P A0 . So if there is available a consistent estimator f of :E, then A(O)f-' A(O)' is consistent for A 0 :E- 1A 0, which implies by Lemma 2.4(d) that the Wald statistic defined by (7.4.11) too is asymptotically x2 (r) under the null. How the consistent estimator f of :E (= - "'l is obtained depends on whether the extremum estimator is ML or efficient GMM. For ML, it is given by a consistent
491
Extremum Estimators
estimate of- lji = - E[H(w,; 9 0)]: a2 Qn(9) I n - a9a9' = -;; LH(w,; 9), A
(7.4.12)
t=l
or by a consistent estimate of l: = E[ s(w,; 9o) s(w,; 9o)']:
I"
- L, s(w,; 9) s(w,; 9)'. A
(7.4.13)
A
n
For efficient GMM, l: = G' s-'G (see (7.3.49)), which is consistently estimated by (7.4.14) ~
As already mentioned, S is an estimate of the long-run variance matrix constructed from the estimated series (g(w,; 9) }. A
The Lagrange Multiplier (LM) Statistic Let y" (r x I) be the Lagrange multiplier for the constrained problem (7.4.4). The constrained estimator 9 satisfies the first-order conditions
.J,i a Qn(9) a9
+ A(iJ)' .J,i y n =
0 ' (pxl)
.J,i a(ii) = 0 .
(7.4.15) (7.4.16)
(rx I)
We seek the Taylor expansion of these first-order conditions. By condition (C), .jii(i! - 9 0 ) converges to a random variable. So .J,i a(il) admits the Taylor expansiOn (7.4.17) It is left as Review Question 2 to show that the following Taylor expansion holds for both ML and GMM:
(7.4.18) An implication of this equation is that .J,i '~1 9 > converges to a random variable. So .J,i y "'satisfying (7.4.15), converges to a random variable. Consequently, since
492
Chapter 7
A(/1) ---> P A0 , we obtain
Substituting these three equations into the first-order conditions, we obtain "' (pxp) [ Ao
(rxp)
l
A~ [.,fo.(iJ
(pxr)
-/loll [-.Jii oQ;~ o>l +
(pxl)
6
_
-
(pxi)
0
VnYn
0
(rxr)
(rxl)
(rxl)
Op.
(7.4.20)
Using the formula for partitioned inverses, this can be solved to yield 1A .,,-I r.: aQ,(IIo) r.: --(A0.., ·•·- 1A')yny,0 0.., yn 3(1
+ Op,
(7 •• 4 21)
].Jii aQ,(IIo) + all
.Jii(iJ -llo) = -["'-I -"'-I A~(Ao"'-l A~)- 1 Ao"'- 1
op.
(7.4.22) The rest is the same as in the derivation of the Wald statistic. From (7 .4.21 ), .Jii y, converges to a normal distribution with mean zero and variance given by Avar(y ,) = (A 0 "'-I A~)- 1 Ao"'- 11:"'-l A~(A 0 "'-I A~)- 1 (rx r)
= (A 0 l:- 1 A~)- 1
(if condition (D) is invoked).
(7.4.23)
Since this r x r matrix is positive definite, the associated quadratic form (7.4.24) is asymptotically x2 (r) under the null. If there is available a consistent estima1 tor, denoted 1:, of 1:, then A(O)l:- A(O)' is consistent for Aol:- 1A~. So the LM statistic defined by the first line below is asymptotically x2 (r) under the null. This statistic can be beautified by the first-order conditions (7.4.15): LM = n y~ [A(O)i- A(O)'j Yn 1
(rxl)
(lxr) (r
x r)
-
= n(aQ,(/1))'
all
(lxp)
1; _,(aQ.(/1)) (since A(ii)'y. (pxp)
ao
=
-•~·JB>
by (7.4.15)).
(pxl)
(7.4.25)
493
Extremum Estimators
The last expression for the LM statistic is the reason why the statistic is also called the score test statistic. It is the distance from zero of the score evaluated at the constrained estimate 8. You should not be fooled by its beauty, however: despite being a quadratic form associated with a p-dimensional vector, the degrees of freedom of its x2 asymptotic distribution are r, not p, as is clear from the derivation. For ML, I: is given by (7.4.12) or (7.4.13), evaluated at 8. For GMM, I: is consistently estimated by
-
-
.
With
G (Kxp)
1 " " ag(w,; 8) = G, (8) = -n £.__, a8 ' . -
(7.4.26)
r=l
Here, S is an estimate of the long-run variance matrix constructed from the estimated series {g(w,; 8)}. The Likelihood Ratio (LR) Statistic
-
-
The LR statistic is defined as 2n times Q"(8)- Q,(8). (For efficient GMM, calling this the likelihood ratio statistic is a slight abuse of the language because Q. for GMM is not the likelihood function, but we will continue to use the term for both ML and GMM.) It is not hard to show for ML and GMM (see Review Questions 6 and 7) that LR
= 2n · [Q.(II)- Q.(iJ)] =
-n(O- ii)''i1(0- iJ)
+ oP.
(7.4.27)
The validity of this expression does not depend on condition (D). From this expression, it is easy to derive the asymptotic distribution of the LR statistic. Subtracting (7.4.22) from (7.4.5), we obtain
Substituting this into (7.4.27), the LR statistic can be written as LR- - ( r.: vn
aQ.a(8o) )' w-1 A'0 (A0 w-rA'0)-I A0 w-r ( v r.:n aQ.a(8o)) + Op. 8
8
(7.4.29) Now invoke assumption (D). The LR statistic becomes LR- ( r.: aQ.(8 o))' I:- 1A' [A I:-rA' - ...;n a8 0 0 0
8 J- 1A0 I:-r ( ...;nr.: aQ.( a8 o)) +
Op·
(7.4.30)
494
Chapter 7
This is asymptotically x2 (r) because the variance of the limiting distribution of A0 :E- 1 'Qa~Oo)) is the positive definite matrix in brackets.
(vn
This completes the proof of the asymptotic equivalence of the trio of statistics. But it should be clear from the proof that something stronger than asymptotic equivalence (that the three statistics share the same asymptotic distribution) holds. We have shown in the proof that the difference between the Wald statistic and (7.4.10) is op. But by (7.4.8) the difference between the latter and the quadratic form in the expression for the LR statistic (7.4.30) is op when :E = -lll. So the numerical difference between the Wald statistic and the LR statistic is oP when :E = -'it. The same applies to the LM statistic. The difference between it and (7.4.24) is op. But the difference between the latter and the quadratic form in (7 .4.30) is again oP when :E = - lll. Summary of the Trinity To summarize the rather lengthy discussion of this section,
-
Proposition 7.11 (The trinity): Let 6 the extremum estimator defined in (7.1.1). Consider the null hypothesis (7.4.1) where a(·) is continuously differentiable (so the J~cobian A(6) = ";~~) is continuous) and the rank condition (7.4.3) is satisfied. Let 6 be the constrained extremum estimator defined in (7.4.4). Assume:
• the hypothesis of Proposition 7.9, if the extremum estimator is conditional ML, • the hypothesis appropriately modified as indicated right below Proposition 7.9, in the case of unconditional ML, • the hypothesis of Proposition 7.10 with W = s- 1 and that there is available a consistent estimatorS of S constructed from 0 and S constructed from ii, in the case of efficient GMM. Define the Wald, LM, and LR statistics as in Table 7.2. Then (a) the constrained extremum estimator 6 is consistent and asymptotically normal, (b) the three statistics all converge in distribution to is the number of restrictions in the null,
x2 (r)
under the null where r
(c) moreover, the numerical difference between the three statistics converges in probability to zero under the null.
Table 7.2: Trinity 1
1
Wald: na(th'[ A(O) i:- A(O)'r a(O) (lxr)
(rxp)(pxp)(pxr)
(rxl)
-
-
LM: n(aQn(IJ))' 1;-1 (aQn(IJ))
atJ
atJ
(pxp)
(pxl)
(lxp)
LR: 2n · [Qn(t/)- Qn(ii)] Conditional ML
Terms for substitution
I
n
-n L
log f(y,
Efficient GMM
I x,; IJ)
t=l
2
•
a Qn(IJ) I " · · , G = Gn(B) - atJatJ' or n L..,s(w,; IJ)s(w,; IJ) G's- 1G, (Kxp) replace
8 by ii in above
G'S- 1G, G = Gn(ii)
NOTE: See (7.3.2) and (7.3.3)for definitions of s(w,; 0) and H(w,; 0). Also, l
gn(O)
n
=- Lg(w1 ; 0), n
J=l
G (O) n
= ag, (O)
ao' ·
The same middle matrix is used in Qn(O) and Qn (0) for GMM.
(Kxp)
496
Chapter 7
QUESTIONS FOR REVIEW
1. Verify that the LM statistic for ML can be written as
-)'II"'-~-'(1~ -) n;; L..s(w,;/1) n L..s(w,;ll)s(w,;/1)' ;; L..s(w,;/1). ( I~ 1=1
2.
1=1
Derive (7.4.18). Hint: For ML, replace -
.Jri
iJ
by ii in (7.3.5). For GMM, derive
n
- ,- 1 " ' - ,= -G.(/1) W y'il L..g(w,; llo)- G.(ll) WG.(II)./Ii(ll-11 0 ),
aQ.(IIJ
all
1::1
-
-
where lilies between 11 0 and II. The derivation of this should be as easy as the derivation of (7.3.31 ). 3. Explain why the LR statistic would not be asymptotically
x2 if l:
= -lfl did
not hold.
4. For ML, suppose the trio of statistics are calculated as indicated in Table 7 .2, but suppose w, is actually serially correlated. Which one is still asymptotically
x2 ? [Answer:
None.]
-
5. Suppose l: = -lfl does not hold but suppose that a consistent estimate lJI of lJI is available along with the consistent estimate I: of l:. Propose an LM statistic that is asymptotically x2 (r). Hint: Use the first equality in (7.4.23), which is valid without l: = -lfl. The answer is
-
n (a ~n/1(/1)
)'Iii_, A(iiJ'[A(ii)lii- I:lii-' A(O)' r' A(ii)lii 1
-I
(a Q;/1(/1)). (7.4.31)
6.
(Optional, proof of (7.4.27)) Prove (7.4.27) forM-estimators. Hint: Take for granted the "second-order'' mean value expansion around II,
-
- By the lirst-order condition, aQ.. c9, , where II- lies between II- and II. = 0 . Also, 39 2
-
' a9a9' Q,rel
--+p
E[H( w,,· II oJ] ·
497
Extremum Estimators
7. (Optional, proof of (7.4.27) for GMM) Prove (7.4.27) for GMM. Hint: From the mean value expansion of g, (II) around II, derive the Taylor expansion A
(7.4.33) Substitute this into Qn(O) = - !g, (0)' Wg, (ii). Then use the first-order condition that Gn(IJ)' Wg,(t}) = 0.
7.5
Numerical Optimization
So far, we have not been concerned about the computational aspect of extremum estimators. In the ML estimation of the linear regression model, for example, the objective function Qn(li) is quadratic in II, so there is a closed-fonn fonnula for the maximizer, which is the OLS estimator. The objective function is quadratic for linear GMM also, with the linear GMM estimator being the closed-fonn fonnula. In most other cases, however, no such closed-fonn fonnulas are available, and some numerical algorithm must be employed to locate the maximum. A number of algorithms are available, but this section covers only the two most important ones. More details can be found in, e.g .• Judge et al. (1985, Appendix B). The first is used for calculating M-estimators, while the second is used for GMM. Newton-Raphson
We first consider M-estimators, whose objective function Qn(li) is (7.1.2). Since Qn(ll) is twice continuously differentiable forM-estimators, there is a secondorder Taylor expansion
where ilj is the estimate in the j-th round of the iterative procedure to be described in a moment, and Sn and Hn are the gradient and the Hessian of the objective function: Sn(IJ)
=
aQn(li)
a/1' ,
(7 .5.2)
498
Chapter 7
.
The (j + 1)-th round estimator 6j+t is the maximizer of the quadratic function on the right-hand side of (7.5.1). It is given by (7.5.3) This iterative procedure is called the Newton-Raphson algorithm. If the objective function is concave, the algorithm often converges quickly to the global maximum . • For ML estimators, when the global maximum 6 is thus obtained, the estimate of Avar(i)) is obtained as -H.(Ii)- 1 (recall that H.(6) ==~I: H(w,; 6)). Gauss-Newton Now tum to GMM, whose objective function is Q.(6) = -~g.(6)'Wg.(6) (see (7.1.3)). As in the derivation of the asymptotic distribution, we work with a linearization of the g. (6) function. The first-order Taylor expansion of g. (6) around 9j is
g.(6)
~
. + G.(6j)(6. 6j) .
g.(6j)
(K xI)
(K xp)
=
(px I)
[g.(iJj)- G.(iJj)iJj]- [-G.(iJj)j6
= Vj- Gj6,
(7.5.4)
where (7.5.5) (Recall that G. (6) = "';;;,'.9>.) If the g. (6) function in the expression for the GMM objective function were this linear function Vj - Gj6, then the objective function would be quadratic in 6 and the maximizer (or the minimizer of the GMM distance) would be the linear GMM estimator: (7.5.6) This is the (j + 1)-th round estimate in the Gauss-Newton algorithm. 19 Unlike in Newton-Raphson, there is no need to calculate second-order derivatives. Writing Newton-Raphson and Gauss-Newton in a Common Format There are also similarities between Gauss-Newton and Newton-Raphson. Substituting (7.5.5) into (7.5.6) and rearranging, we obtain 19 The Gauss-Newton algorithm was originally developed for nonlinear least squares. See Review Question 2 for how Gauss-Newton is applied to NLS. The algorithm presented here is therefore its GMM adaptation.
499
Extremum Estimators
(7 .5.7) '
The second bracketed term is none other than the gradient evaluated at II j for the GMM objective function Qn(ll) = -!g,(II)'Wg,(ll). The role played by the Hessian of the objective function in (7.5.3) is played here by the first bracketed term, which is the estimate based on Oj of the -G'WG in Table 7.1. Therefore, the analogy between M-estimators and GMM noted in Table 7.1 also holds here. Furthermore, if g, is linear, then the Hessian equivalent (the first bracketed term in (7 .5. 7)) is the Hessian of the GMM objective function. So the analogy is exact: the Gauss-Newton algorithm coincides with the Newton-Raphson algorithm. Equations Nonlinear in Parameters Only In the Gauss-Newton algorithm (7.5.7), when g,(ll) (= ~ L: g(w,; II)) is not linear in II, it is generally necessary to take averages over observations to evaluate the gradient and the Hessian equivalent in each iteration. (This is also generally true in Newton-Raphson if the objective function is not quadratic in II.) For large data sets it is a very CPU time-intensive operation. There is, however, one case where the averaging over observations in each iteration is not needed. That occurs in a class of generalized nonlinear instrumental variable estimation (a special case of nonlinear GMM where g(y,, z,, II) can be written as a(y,, z,; ll)x,). Suppose that the equation a(y,, z,; II) takes the form
+ a 1(y,, z, )'«(II),
a (y,, z,; II) = ao(y,, z,)
(lxq)
(7.5.8)
(qxl)
where a 0 0 and a 1 0 are known functions of (y,, z,) (but not functions of II). A function of this form is said to be nonlinear in parameters only or linear in variables. It can be easily seen that I
g,(ll)
I
n
n
=- Lg(y,,z,;ll) =- La(y,,z,;ll) ·X, n
n
t==l ]
t=l
I
n
= (- L:ao(y,, z,) · x,) n r=l
n t=l (Kx I)
Gn(ll) =
n
+ (- I:x,at(y,, z,l')cx(ll),
(7.5.9)
(qxl) (Kxq)
ag, (II) (1~ ') acx (II) all' = n L.. x,at (y,, z,) all' . t=l
(qxp)
(K xq)
(7 .5.10)
500
Chapter 7
These equations make it clear that the sample averages need to be calculated just once, before the start of iterations. QUESTIONS FOR REVIEW
1. (Does Q. increase during iterations?) Set II = Oj+I in (7 .5.1) and rewrite it as
'
Consider the Newton-Raphson algorithm (7.5.3). Suppose that llj+I is sufficiently close to Oj so that the sign of Q.(Oi+Il- Q.(Oj) is the same as that of the right hand side of the equation. Show that Q.(Oj+Il > Q.(Oj) if Q.(ll) is concave. Hint: Using (7.5.3), the right-hand side can be written .as
2. (Gauss-Newton for NLS) In NLS (nonlinear least squares), the objective function is given by (7.1.19). Let Oi be the estimate in the j-th round and consider the linear approximation
(7.5.11) Replace I' (x,; II) in the m function by this linear function in II. Show that the maximizer of this linearized problem is
(7.5.12)
501
Extremum Estimators
PROBLEM SET FOR CHAPTER 7 - - - - - - - - - - - - ANALYTICAL EXERCISES
1. (Kullback-Leibler information inequality) Let f(y I x; IJ) be a parametric family of hypothetical conditional density functions. with the true density function given by f(y I x; 9 0 ). Suppose E[log f(y 1 x; IJ)] exists and is finite for aliiJ. The Kullback-Leibler information inequality states that
I x; IJ) f= f(y I x; IJo)] > 0 ==> E[logf(y I x;IJ)] < E[logf(y I x;IJo)].
Prob[f(y
Prove this by taking the following steps. In the proof, for simplicity only, assume that f (y I x; IJ) > 0 for all (y. x) and IJ, so that it is legitimate to take logs of the density f(y I x; IJ). (a) Suppose Prob[f(y I x; IJ) define a(w)
f= f(y I
x;
90 )]
>
0. Let w
= (y, x')' and
= f(y I x; IJ)jf(y I x; IJo).
Verify that a(w) f= I with positive probability, so that a(w) is a nonconstant random variable. (b) The strict version of Jensen's inequality states that if c(x) is a strictly concave function and xis a nonconstant random variable, then E[c(x)] < c(E(x)). Using this fact, show that E[log a(w)] < log{E[a(w)]}. Hint: log(x) is strictly concave. (c) Show that E[a(w)] = I. Hint: The conditional mean of a(w) equals I because
502
Chapter7
E[a(w) I x]
=I =
I
=I
a(w)f(y
I x; llo)dy
f(y I x; II) f( I . II )d f(y I x; llo) y x, 0 y f(y
I x; ll)dy
=1.
The last equality holds because f(y I x; II) is a density function far any x and 11. Note well that the conditional expectation is taken with respect to the true cand~ianal dens·lty f(y 1 x:llo). (d) Finally, show the desired result. 2. (Infannatian matrix equality) For ML, the score vector and the Hessian for observation (y, x) can be rewrinen (with w = (y, x')') as: s(w; II) =
alog f(y I x; II), all
(pxl)
. II _ as(w; II) _
"<~':~) ) -
all'
a2 1ag J(y I x; II)
-
a11a11'
·
As usual, far simplicity, assume f(y I x, II) > 0 for all (y, x) and II, so it is legitimate to take logs of the density function. (a) Assuming that integration (i.e., taking expectations) and differentiation can be interchanged, show that E[s(w; llo)l = 0. Hint: Since f(y I x; II) is a hypothetical density, its integral is unity:
I
f(y
I x; ll)dy =
I.
Differentiate bath sides of this equation with respect to II and then change the order of differentiation and integration to obtain the identity
I
s(w; ll)f(y I x; ll)dy =
0 .
(px I)
Set II = llo and obtain E[s(w; llo) I x] = 0. (b) Show the infonnation matrix equality: - E[H(w; llo) J = E[ s(w; llo) s(w; llo)'].
503
Extremum Estimators
Hint: Differentiating both sides of the above identity in the previous hint and assuming that the order of differentiation and integration can be interchanged, we obtain
f
"a ,[s(w;6)f(y lx;6)]dy=
o6
0.
(pxp)
(px I)
Show that the integrand can be written as
a
a6,[s(w; 6)J(y I x; 6)] I x; 6) + s(w; 6) s(w; 6)' f(y I x; 6).
= H(w; 6)f(y
3. (Trinity for linear regression model) Consider the linear regression model with normal errors, whose conditional density for observation t is logf(y,
I x,;{J,a 2 )
I I ( = - - Iog(2tr)- -log(a 2 ) - y,
2
2
x' {J) 2
2a
;
Let (fj, & 2 ) be the unrestricted ML estimate of 6 = (fJ' a 2 )' and let (jj, a2 ) be the restricted ML estimate subject to the constraint R{J = c where R is an r x K matrix of known constants. Assume that 8 = JRK x Ill.;.+ and that E(x, x;) is nonsingular. Also, let II""'
~ =
U2 ;; L......~;=t XI x, [
0I ]
'
1: =
2(&2)2
I I "" z;l;; L......~,= I XI XI' [
0I ]
.
2(i72)2
/i
(a) Verify that minimizes the sum of squared residuals. So it is the OLS estimator. Verify that 1J minimizes the sum of squared residuals subject to the constraint R{J = c. So it is the restricted least squares estimator. (b) Let Q.(6) = ~ I:7~ 1 1og f(y,
I x,; {J, a 2 ). Show that
I Iog (SSRu) n , 2 2 2 I log(2tr)- I - I 1og (SSRR) 2 2 2 -n- ,
- = - I Iog(2tr)- I Q.(6)
Q.(6) = -
where SSRu (= L;(y,- x;/i) 2 ) is the unrestricted sum of squared residuals and SSRR (= L;(y, - x;p) 2 ) is the restricted sum of squared residuals. Hint: Show that &2 = SSRufn and a2 = SSRRfn.
504
Chapter 7
(c) Verify that the 1; given here. although not the same as-~ I:7~t H(w,; iJ), is consistent for - E[H(w,; 11 0 ) ). Verify that the 1': given here, although not the same as-~ I:7~t H(w,; 0), is consistent for- E[H(w,; 9 0 )). Hint: From the discussion in Example 7.8, ii is consistent. As mentioned in Section 7.4, 9 too is consistent under the null. Given the consistency of ii and ii, it should be easy to show that 8 2 and ii 2 are consistent for cr 2 (d) Show that the Wald, LM, and LR statistics using 1; and 1; given here in the formulas in Table 7.2 can be written as
(RP- c)'[R(X'X)-'R'r'
where y (n x I) and X (n x K) are the data vector and matrix associated with y, and x,, and P = X(X'X)- 1X'. Hint: The A(IJ) (r x (K + I)) in Table 7.2 is
R : 0 ). ( (rxK) (rxl) (e) Show that the three statistics can also be written as W=n. SSRR -SSRu SSRu ' SSRRLM = n · .:___;_:_ _SSRu _.::.. SSRR ' LR = n . iog(-SS_R_R). SSRu
Hint: As we have shown in an analytical exercise of Chapter 1, SSRR - SSRv =
= (RP- c)'[R(X'X)- 1RT 1
505
Extremum Estimators
(f) Show that W > LR > LM. (These inequalities do not always hold in
nonlinear regression models.)
ANSWERS TO SELECTED Q U E S T I O N S - - - - - - - - - - -
2a. Since j(y I x; IJ) is a hypothetical density, its integral is unity:
I
j(y I x; IJ)dy = I.
This is an identity, valid for any IJ identity with respect to IJ, we obtain
aa(}
I
E
(I)
EJ. Differentiating both sides of this
j(y I x; IJ)dy =
0 .
(2)
(pxl)
If the order of differentiation and integration can be interchanged, then
:(} I
f(y I x; IJ)dy
=I :(}
j(y I x; IJ)dy.
(3)
But by the definition of the score, s(w; IJ) j(y I x; IJ) = :e j(y I x; IJ). Substituting this into (3), we obtain
I This holds for any(}
I
E
s(w; IJ)j(y I x; IJ)dy =
(4)
0 . (p X I)
EJ, in particular, for IJo. Setting(} = 8 0 , we obtain
s(w; IJo)f(y I x; IJo)dy = E[s(w; IJo) I x] =
0 .
(px I)
(5)
Then, by the Law of Total Expectations, we obtain the desired result.
References Amemiya, T., 1985, Advanced Econometrics, Cambridge: Harvard University Press. Davidson, R., and J. MacKinnon, 1993, Estimation and Inference in Econometrics, New York: Oxford University Press. Gallant, R., and H. White, 1988, A Unified Theory of Estimation and Inference for Nonlinear Dynamic Models. New York: Basil Blackwell.
506
Chapter 7
Gourieroux, C., and A. Monfort, 1995, Statistics and Econometric Models, New York; Cambridge University Press. Hansen, L. P., and K. Singleton, 1982, "Generalized Instrumental Variables Estimation of Nonlinear Rational Expectations Models," Econometrica, 50, 1269-1286. Judge, G., W. Griffiths, R. Hill, H. Ltitkepohl, and T, Lee, 1985, The Theory and Practice of Econometrics (2d ed.), New York: Wiley. Newey, W., and D. McFadden, 1994, "Large Sample Estimation and Hypothesis Testing," Chapter 36 in R. Engle and D. McFadden (eds.), Handbook of Econometrics, Volume IV, New York: North-Holland. White, H., 1994, Estimation, Inference, and Specification Analysis, New York: Cambridge University
Press.
CHAPTER 8
Examples of Maximum Likelihood
ABSTRACT The method of maximum likelihood (ML) was developed in the previous chapter as a special case of extremum estimators. In view of its importance in econometrics, this chapter considers applications of ML to some prominent models. We have already seen some of those models in Chapters 3 and 4 in the context of GMM estimation. In deriving the asymptotic properties ofML estimators for the models considered here, the general results for ML developed in Chapter 7 will be utilized. Before proceeding, you should review Section 7.1, Propositions 7.5, 7.6, 7.8, 7.9, and 7.11. A Note on Notation:
We will maintain the notation, introduced in the previous
chapter, of indicating the true parameter value by subscript 0. So 9 is a hypothetical parameter value, while 8o is the true parameter value.
8.1
Qualitative Response (QR) Models
In many applications in economics and other sciences, the dependent variable is a categorical variable. For example, a person may be in the labor force or not; a commuter chooses a particular mode of transportation; a patient may die or not; and so on. Regression models in which the dependent variable takes on discrete values are called qualitative response (QR) models. A QR model is described by a parameterized density function J(y, I x,; (I) representing a family of discrete probability distributions of y, given x,, indexed by 0. A QR model is called a binary response model if the dependent variable can take on only two values (which can be taken to be 0 and I without loss of generality) and is called a multinomial response model if the dependent variable can take on more than two values. Only binary models are treated in this subsection. For a thorough treatment of binary and multinomial models, see Amemiya (1985, Chapter 9).
508
Chapter 6
The most popular binary response model is the probit model, which has already been presented in Example 7.3. Another popular binary model is the logit model:
I x,; Oo) = A(x;Oo). = 0 I x,; 0 0 ) =I- A(x;Oo),
J(y, =I { j(y,
(8.1.1)
where A is the cumulative density function of the logistic distribution: A (v)
This density
exp(v)
= -:---'----':--:1 + exp(v)
(8.1.2)
f (y, I x,: 0 0 ) can be written as f(y,
I x,; Oo) = A(x;o 0 )Y' x [I- A(x;0 0 )] 1-Y'.
(8.1.3)
Taking logs of both sides and replacing the true parameter value 0 0 by its hypothetical value 0, we obtain the log likelihood for observation t: log f (y,
1
x,: 0) = y, log A (x;o) + (I - y,) log[! - A (x;o)].
(8.1.4)
The logit objective function Q,(O) is 1/n times the log likelihood of the sample (y,, x 1 , y2 , x2 .... , y,, x,). Under the assumption that (y,, x,} is i.i.d., the log likelihood of the sample is the sum of the log likelihood for observation t over t. Therefore, Qn(O) =
~ :t{y,logA(x;O)+(I- y,)log[l- A(x;OJJ).
n
(8.1.5)
t=l
For probit, we have proved that its log likelihood function is concave (Example 7.6), and that the ML estimator is consistent (Example 7.9) and asymptotically normal (Example 7.11) under the assumption that E(x,x;) is nonsingular. You are about to find out that doing the same for logit is easier. (On the first reading, however, you may wish to skip the discussion on consistency and asymptotic normality.) Score and Hessian for Observation t
The logistic cumulative density function has the convenient property that A'(v) = A(v)(l - A(v)), A"(v) =[I- 2A(v)]A(v)[l - A(v)]. (8.1.6)
509
Examples of Maximum Likelihood
Using this, it is easy to derive (see Review Question I (a)) the following expressions for the score and the Hessian for observation t: s(w,; II) ( = H(w,; II) (
, a log j(y, I x,; II)) all = [y,- A(x,ll)]x,,
= as(w,;ll)) all' =
' ' ' -A(x,ll)[l - A(x,ll)]x,x,.
(8.1.7) (8.1.8)
where w, = (y,, x;)'. Consistency Since x,x; is positive semidefinite, it immediately follows from its expression that H(w,: II) is negative semidefinite and hence Q"(ll) is concave. The relevant consistency theorem, therefore, is Proposition 7.6. We show here that the conditions of the proposition are satisfied under the nonsingularity of E(x,x;). Condition (a) of the proposition is the (conditional) density identification: j(y, I x,; II) oft f(y, I x,; 11 0 ) with positive probability for II oft 11 0 1 Since a logarithm is a strictly monotone transformation, this condition is equivalent to log f(y, I x,; II)
oft log f(y, I x,; 11 0 ) with positive probability for II oft 11 0 , (8.1.9)
where llo is the true parameter value and the density f is given in (8.1.3). Verification of this condition is the same as in the probit example in Example 7.9 and proceeds as follows. The discussion in Example 7.8 has established that E(x,x;) nonsingular
=> x;11 oft x;llo with positive probability for II oft 11 0 . (8.1.10)
Since the logit function A(v) is a strictly monotone function of v, we have A(x;ll) oft A(x;llo) when x;11 oft x;11 0 . Thus, condition (a) is implied by the nonsingularity of E(x,x;). Turning to condition (b) (that E[llog f(y,lx,; Ill I) < oo for all II), it is easy to show that I log A(v)l
(8.1.11)
Condition (b) can then be verified in a way analogous to the verification of the same condition for probit in Example 7.9. Thus, as in probit, the nonsingularity of E(x,x;) ensures thatlogit ML is consistent. 1In other words, let X = fC.vt I x ; 0) and Y =- j(y I x : 8o). X and Y are random variables because they 1 1 1 are functions of w 1 = (:y 1 , x:)'. The condition can be seated simply as Prob(X f" Y) > 0 for 0 f" Oo.
510
Chapter 8
Asymptotic Normality To prove asymptotic normality under the same condition, we verify the conditions of Proposition 7.9. Condition (I) of the proposition is satisfied if the parameter space e is taken to be JRP. Condition (2) is obviously satisfied. Condition (3) can be checked easily as follows. Since E(y, I x,) = A(x;8), we have E[s(w,; 8 0 ) I x,] = 0 from (8.1.7). Hence, E[s(w,; 8 0 )] = 0 by the Law of Total Expectations. It is left as Review Question l(b) to derive the conditional information matrix equality
E[s(w,; 8 0 ) s(w,; 8 0 )' I x,] = - E[H(w,; 8 0 ) I x,].
(8.1.12)
Hence, (3) is satisfied by the Law of Total Expectations. The local dominance condition (condition (4)) is easy to verify for logit because, since IA(x;8)[1 A(x;8)]1 < l,wehave: IIH,(w,;8)11 < llx,x;llforall8. Theexpectationofllx,x;ll' (the sum of squares of the elements of x,x;) is finite if E(x,x;) is nonsingular (and hence finite). So E(llx,x; II) is finite. Regarding condition (5), the argument in footnote 29 of Newey and McFadden (1994) used for verifying the same condition for probit can be used for logit as well 2 Thus, we conclude that if (y,, x,} is i.i.d. and E(x,x;) is nonsingular, then the logit ML estimator 8' of 8o is consistent and asymptotically normal, with the asymptotic variance consistently estimated by
_
Avar(ii) = -
[l L n
n
H(w,;
9)
]-1 ,
(8.1.13)
t=I
where w, = (y,, x;)' and H(w,; 8) is given in (8.1.8). QUESTIONS FOR REVIEW
1. (Score and Hessian for observation t for general densities) In (8.1.1), replace the logit cumulative density function (c.d.f.) A(x;8o) by a general c.d.f. F(x;8o). (a) Verify that the score and Hessian for observation
t
are given by
y,- F, s(w,; 8) = -=F,'"'·-(""l-'=F,.,.,) f,x,,
20nly for interested readers, here's lhe argument. For any ii > 0, lhere exists a C such lhat A(v)[J - A(v)] ~ C > 0 for Ivi .:::;: ii. So E( A(x19)[1- A(x'O)]xx'J ~ E(l (lx'lll .S ii)A(x'O)[I - A(x'O)]xx'J 2::. C E(1Cix'81 _:::: ii)xx'] in lhe matrix inequality sense, where HI vi_::::: ii) is lhe indicator function. The last term is positive definite (not just positive semi-definite) for large enough ii by non-singularity ofE(xx1).
511
Examples of Maximum Likelihood
Yt - Ft 2 , H(w,;IJ) =- [ F,. (1- F,) ] 2 j, x,x,
Yt - Fr
,
,
+ [ F,. (1- F,) ] j,x,x,.
where F, = F(x;IJ), j, = j(x;IJ), and JO = F'O is the density function. (b) Verify the conditional information matrix equality (8.1.12) for the general c.d.f. F. Hint: Show that the expected value of the second term in the expression for the Hessian in the previous question is zero. 2. Verify (8.1.7) and (8.1.8). 3. (Logit for serially correlated observations) Suppose {y,, x,} is ergodic stationary but not necessarily i.i.d. Is the logit ML estimator described in the text consistent? [Answer: Yes, because the relevant consistency theorem, Proposition ' 7.6, does not require a random sample. However, the expression for Avar(IJ) is not given by the negative of the inverse of the sample Hessian.]
8.2 Truncated Regression Models A limited dependent variable (LDV) model is a regression model in which either the dependent variable y, is constrained in some way or the observations for which y, does not meet some prespecified criterion are excluded from the sample. The LDV model of the former sort is called a censored regression model, while the latter is called a truncated regression model. This section presents truncated regression models; censored regression models are presented in the next section. The Model Suppose that {y,, x,} is i.i.d. satisfying y, =
x;Po + s,, s, I x,
~ N(O, cr~).
(8.2.1)
The model would be just the standard linear regression model with normal errors, were it not for the following feature: only those observations that meet some prespecified rule for y, are included in the sample. We will consider only the simplest truncation rule: y, > c where c is a known constant. This rule is sometimes called a "truncation from below." Once this case is worked out, it should not be hard to consider more general truncation rules.
512
Chapter 8
''
'
'
\~ '
''
''
''
''
'
X 0
density after truncation
densityofN(O,t)
''
''
'"'"
.... ---
c
Figure 8.1: Effect of Truncation
Truncated Distributions
To proceed, we need two results from probability theory.
Density of a Truncated Random Variable: If a continuous random variable y has a density function f(y) and cis a constant, then its distribution after the truncation y > cis defined over the interval (c, oo) and is given by f(y) f(y I y > c) = Prob(y > c)
(8.2.2)
Figure 8.1 illustrates how truncation alters the distribution for N(O, 1). The solid curve is the density of N(O, 1). Because it is a density, the area under the curve is unity. The dotted curve is the distribution after truncation, which is defined over the interval (c, oo). It lies above the solid curve over this interval so that the area under the dotted curve is unity. The second result follows.
Moments of the Truncated Normal Distribution: If y ~ N(f.lo, aJ) and c is a constant, then the mean and the variance of the truncated distribution are
I y > c) = f.lo + aoA(v), Var(y I y >c)= aJ(l - A(v)[A(v)-
(8.2.3)
E(y
where v = (c- f.lo)/ao and A(v)
= 1 °':;/,).
v]),
(8.2.4)
513
Examples of Maximum Likelihood
This formula for the mean clearly shows that the sample mean of draws from the truncated distribution is not consistent for J.Lo. This is an example of the sample selection bias. The bias equals ao'-(v). The function '-(v) is called the inverse Mill's ratio or the hazard function. The function A(v) is convex and asymptotes to vas v--+ oo and to zero as v--+ -oo. So its derivative '-'(v) is between 0 and I. It is easy to show that '-'(v) = A(V)(A(v)- v).
(8.2.5)
Using (8.2.3) and (8.2.4), we can show how truncation alters the form of the regression of y, on x,. By (8.2.1), the distribution of y, conditional on x, before truncation is N(x;/3 0 , aJ). Observation I is in the sample if and only if y, > c. So E(y, I x,, I in the sample)= x;/3 0 + ao'-(c-;;Po), Var(y, I x,, I in the sample) =
aJ
II - '-(c-;jPo) ['-(c-;;Po) - c-;;Po JJ.
(8.2.6) (8.2.7)
This formula shows that the OLS estimate of the x, coefficient from the linear regression of y1 on x1 is not consistent because the correction tenn a0 AC-;~Po), which is correlated with x,, would be included in the error term in the linear regression. Since the functional form of the hazard function A(v) is known (under the normality assumption for c, ), we could avoid the sample selection bias by applying nonlinear least squares to estimate ({3 0 , aJ), but maximum likelihood should be the preferred estimation method because it is asymptotically more efficient. In passing, we note that the sample selection bias does not arise if selection is based on the regressors, not on the dependent variable. Suppose that observation 1 is in the sample if cf>(x,) > 0. Then E(y, I x,, 1 in the sample)= E[x;/3 0 + £ 1 I x,, cf>(x,) > 0] = x;/3 0 + E[c, I x,, cf>(x,) > 0]
=
x;f3o·
(8.2.8)
The last equality holds since E [c, I x,, cf>(x,) > 0] = E(c, I x,) = 0 if cf>(x,) > 0. The Likelihood Function
Derivation of the likelihood function utilizes (8.2.2). As just noted, the pretruncation distribution of y, I x, is N(x;f3 0 , aJ), whose density function is
514
Chapter 8
exp[-~(y'- x;{J 0 ) 2 ]
I
2
J2rrag
ao
= _1 ao
(8.2.9)
ao
where
x;Po
< c-
ao
<~>C -"':;Po)
x;Po I x,)
ao (since Y•-;JPo
1 x,
~ N(O.
t)). (8.2.10)
Therefore, by (8.2.2), the post-truncation density, defined over (c, oo), is
(8.2.11)
This is the density (conditional on x,) of observation t in the sample. Taking logs and replacing ({J 0 , ag) by their hypothetical values ({J, a 2 ), we obtain the log conditional likelihood for observation t: logf(y,
I x,; {J, a 2 )
2 (Y'- x;fJ)2} -log [I- (c- x;p)]
I I tog(a ) - 2:I = { -2log(2rr)2
17
.
17
(8.2.12) Were it not for the last term, this would be the log likelihood (conditional on x,) for the usual linear regression model. The last term is needed because the value of the dependent variable for observation t, having passed the test y, > c and thus being in the sample, is never less than or equal to c. Reparameterizi11g the Likelihood Function This log likelihood function can be made easier to analyze if we introduce the
following reparameterization: ~ =
{Jfa, y =!fa.
(8.2.13)
(We have considered this reparameterization in Example 7. 7 for the linear regression model.) This is a one-to-one mapping between (fJ, a 2 ) and (~. y), with the inverse mapping given by fJ = ~fy, a 2 = ljy 2 The reparameterized log
515
Examples of Maximum Likelihood
conditional likelihood is log j(y,
I x,; l)
= [-
I
2
1og(2rr)
+ log(y)-
I
,
Z(yy,- x,8)
2] -log[l-
The objective function in ML estimation is I In times the log likelihood of the sample. For a random sample, the log likelihood of the sample is the sum over t of the log likelihood for observation t. Thus, the objective function in the ML estimation of the reparameterized truncated regression model, denoted Q.(8, y), is the average overt of (8.2.14). The ML estimator (J, y) of (8 0 , y0 ) is the (8, y) that maximizes this objective function. Verifying Consistency and Asymptotic Normality
The next task is to derive the score and the Hessian and check the conditions for consistency and asymptotic normality for the reparameterized log likelihood (8.2.14). (On the first reading, you may wish to skip the discussion on consistency and asymptotic normality.) Score and Hessian for Observation t A tedious calculation yields
s(w,; 8, y) = ((K+l)x1)
[ ;;
H(w,; 8, y) = ((K+l)x(K+l))
1
(yy, - x;8)x, , -
[
]
(yy, - x,8)y,
XtXt'
-~~
,
1
-ytxr
r'
2
+ Y,
+ l(v,) [-x,] ,
(8.2.15)
e
]
+ l(v,)[l(v,)- v,]
[-ex; '
~~
-ex,] e2
'
(8.2.16) where K is the number of regressors, w, = (y,, x;)', and v, = ye- x;8 (= '-;:~). The Hessian is not everywhere negative semidefinite even after the reparameterization. So the objective function Q.(8, y) is not concave 3 The relevant consistency theorem, therefore, is Proposition 7.5, not Proposition 7.6, while the relevant asymptotic normality theorem remains Proposition 7.9. 3However. it can be shown that the Hessian of the objective function Qn(9) is negative semidefinite at any solution to the likelihood equations (see Orme, 1989). Since this rules out any saddle points or local minima, having more than two local maxima is impossible. Therefore, provided that the maximum of Qn(d, y) occurs at a point interior to the parameter space, the first-order condition is necessary and sufficient for a global maximum. If an algorithm such as Newton-Raphson produces convergence, the point of convergence is the global maximum.
516
Chapter 8
Consistency Condition (a) of Proposition 7.5 is the (conditional) density identification
-
f(y,
I x,; 8,
y)
i=
-
f(y,
I x,; 8o, Yo)
with positive probability for (8, y)
i= (8o, Yo).
where (8o, yo) is the true parameter value and the parameterized log density f(y, I x,; 8, y) is given in (8.2.14). Clearly, given (y,, x,), the value of j(y, I x,; 8, y) is different for different values of y. So identification boils down to the condition that X,8 i= x;oo with positive probability for 8 i= 80 • But (8.1.1 0) indicates that this condition is satisfied ifE(x,x;) is nonsingular. Turning to condition (b) of the proposition (the dominance condition for j(y, I x,; 8, y)), it is easy to show that the condition is satisfied under the nonsingularity of E(x,x;l 4 Therefore, the ML estimator (J, j)) of (80 , y0 ) is consistent under the nonsingularity ofE(x,x;). Asymptotic Normality A rather tedious calculation using (8.2.6), (8.2.7), and (8.2.13) shows that E[s x,] = 0. An extremely tedious calculation yields the conditional information matrix equality: E[ss' I x,] = - E[H I x, ]. Therefore, condition (3) of Proposition 7.9 is satisfied. We will not discuss verification of conditions (4) and (5) of the proposition 5 Thus, we conclude: if {y,, x,) is i.i.d. and E(x,x;) is nonsingular, then the ML estimator (J, 9) of the truncated regression model is consistent and asymptotically normal with the asymptotic variance consistently estimated by -
Avar(8.
y) =
-
[]
n
L H(w,; J, y) ]-1 n
(8.2.17)
t=l
where w, = (y,, x;)' and H(w,; IJ) is given in (8.2.16).
4 0nly for interested readers, what needs to be shown is E[sup Ee pogj(y I x ;9)1]
517
Examples of Maximum Likelihood
Recovering Original Parameters By the invariance property of ML (see Section 7.1), the ML estimate of ({j 0 , aJ) is = ~jy and &2 = I jf/ 2 , the value of the inverse mapping. Clearly, if the ML estimate of the reparameterized model, (~. y), is consistent, then so is the ML estimate (ti, &2) thus recovered. The asymptotic variance of & 2) can be obtained
/i
(/i,
-
by the delta method (see Lemma 2.5) as JAvar(~. y)J', where J is the estimate of the Jacobian matrix for the inverse mapping ([j = ~/y, a 2 = ljy 2 ) evaluated at (~. y):
-- [ti -j,] J-
0'
- ..1.. yJ
.
(8.2.18)
QUESTIONS FOR REVIEW
1. (Second moment of y,) From (8.2.6) and (8.2.7), derive:
E(y; I x,, tin sample)=
where vt
= yc -
~[I+ !.(v,)yc + (x;~) 2 + A.(v,)x;~]. y
x;o.
2. (Estimation of truncated regression model by GMM) The truncated regression model can also be estimated by GMM. Consider the following K + I orthogonality conditions. The first K of them are g 1 (w,:~.y)= (Kxi)
where w, = (y,, x;)' and v,
!.(v,)) I , ( y,--x,~--- -x,,
Y
Y
(8.2.19)
= yc-X,~. The (K + 1)-th orthogonality condition
IS
g2(w,;
~. y) = l- ~[I+ !.(v,)yc + (x;~) 2 + A.(v,)x;~]. y
(8.2.20)
(a) Verify that E[g1 (w,: ~ 0 • y0 ) 1 = 0 and E[g 2(w,: ~ 0 , y 0 ) 1 = 0. Hint: Use the Law of Total Expectations. v, is a function of x,. (b) Show that(~', y') satisfies the likelihood equations if and only if it satisfies the K +I orthogonality conditions. Hint: Let s 2(w,; ~- y) be the (K + 1)-th
518
Chapter 8
element of s(w,;
~.
y) in (8.2.15). Then
,
(
I
s 2 (w,; ~. y) + yg2 (w1 ; ~. y) = (x,~) y,- y x',~-
8.3
y),(v . 1) )
Censored Regression (Tobit) Models
The censored regression model is also called the Tobit model to pay respect to Tobin ( 1958), who was the first to introduce censoring in economics. The simplest Tobit model can be written as follows: t=l,2, ... ,n, y, = {
Yt
ifyt* > c,
c
if Yr"'
(8.3.1) (8.3.2)
Here, e1 I x1 is N (0, a~) and {y,, x1 ) (I = I, 2, ... , n) is i.i.d. The threshold value c is known. Unlike in the truncated regression model, there is no truncation here: observations for which the value of the dependent variable y,• fails to pass the test y,' > c are in the sample. The feature that distinguishes the censored regression model from the usual regression model is that the dependent variable is censored, that is, constrained to lie in a certain range, which necessitates a distinction between the observable dependent variable y, and the unobservable or latent variable y,'. The former is the censored value of the latter. An equivalent way to write the model is y, =
max{x;Po + E:,, c).
(8.3.3)
Tobit Likelihood Function For those observations that remain intact after censoring (Tobin ( 1958) called those
observations non limit observations), the density of y, (conditional on x,) is _1 q,(Y'- x;Po).
o-o
ao
(8.3.4)
This differs from the density for the truncated model (8.2.11) simply because there is no truncation. For those observations, called limit observations, for which the value of the dependent variable is altered by censoring, all we know is that y,• < c,
519
Examples of Maximum Ukelihood
the probability of which is given by Prob(y,' < c I x,) = Prob(Y:- x',Po < c- x',Po / x,) o-o o-o =
<~>(c -x',Po) (since Y!-•!Po 1 x, ao
~ N(O, 1)).
(8.3.5)
o-o
Therefore, the density of y,, defined over the interval [c, oo), is (8.3.4) for y, > c and a probability mass of size at y, = c. This density can be written as
c-;;Po)
(8.3.6) where the dummy variable D, is a function of y, defined as 0
D,= { I
if y, > c (i.e., y,' > c),
(8.3.7)
if y, = c (i.e., y,' :5 c).
Taking logs and replacing (fJ 0 , o-J) by their hypothetical values (/J, o- 2 ), we obtain the log conditional likelihood for observation t: log f(y,
I x,; p, o- 2 )
I
=(I - D,) log [ o- >
(Y ' -x'P)] o- ' + D, log (c-x'p o-' ). (8.3.8)
Thus, the Tobit average log likelihood for a random sample can be written as Q.(e) =
=
~ ~{o- D,)Iog[>C' ~x;p) J+D,Iog<~>C-o-x;P)} ~ L {log[>C' ~ x;P)]} + ~ L {log <~>(c -o-x;P) ~-
l·
(8.3.9)
~-
Reparameterlzatlon
As in the truncated regression model, the discussion is simplified if we introduce the reparameterization (8.2.13). The reparameterized log conditional likelihood is log j(y, I x,; 8, y) =(I-
D,)~-~ log(2rr) + log(y)- ~(yy,- x;8) 2 ) + D,log
(8.3.10)
520
Chapter 8
We have seen that the term in braces is concave in (d, y) (see Example 7.7). We have also seen that log <J>(v) is concave (see Example 7.6), which implies that log <J>(yc- x;d) is concave in (d. y). It follows that the reparameterized Tobit average log likelihood for observation t, being a non-negative weighted average of two concave functions, is concave in the parameters (this finding is due to Olsen, 1978). Accordingly, as in probit and logit, the relevant consistency theorem for Tobit is Proposition 7.6. The remaining task is to write down the score and the Hessian and check the conditions for consistency and asymptotic normality for the reparameterized log likelihood (8.3.10). Score and Hessian for Observation t
A straightforward calculation utilizing the fact that!.( -v) = (8.2.5) produces
s(w,;d, y) = (1- D,) · ((K+l)xl)
(yy,- x;d)x, [
(
1
y- YYt-
H(w,; d, y) = -(I - D,) · ((K+l)x(K+l))
[
XrXt
") Xta Yt
',
-ytxt
]
+ D,
I
· 1.(-v,)
¢( -v) ¢'( v)
[-x,]
and
=
¢(v) ¢'(v)
,
(8.3.11)
C
]
- YrXt 1 2
y2
+ Yr
- D, · 1.(-v,)[!.(-v,) + v,]
x,x;,
[ -ext
-ex,] 2 c
•
(8.3.12)
where
Vr
= yc -
c-x;p) " (= xta a- .
Consistency and Asymptotic Normality
For Tobit, we will not verify the conditions for consistency and asymptotic normality. For the case where (x,} is a sequence of fixed constants, Amemiya (1973) has proved the consistency and asymptotic normality of Tobit ML under the assumption that x, is bounded and lim~ L~~~ x,x; is nonsingular. It seems a reasonable guess that for the case where x, is stochastic (as here), a sufficient condition for consistency and asymptotic normality is the nonsingularity ofE(x,x;). Recovering Original Parameters
As in the truncated regression model, the ML estimate of ({J 0 , O"C) can be recovered as {i = 8;9 and i7 2 = l/y 2 . The asymptotic variance of ({i, i7 2 ) can be obtained
521
Examples of Maximum likelihood
by the delta method as JAvar(~. ' yA)]-1 . - [ ~I"" L...•~I H(w,,. o,
ylJ', where J is as in (8.2.18) and Avar(~. y) is
QUESTIONS FOR REVIEW
1. (Censored vs. truncated regression) One way to see the difference between truncation and censoring is to calculate E(y, I x,) for the censored regression model. Show that
Hint: Conditional on y, > c, the distribution of y, (conditional on x,) is none
other than the truncated distribution derived in the previous section, whose expectation is given in (8.2.6). The probability that y, >cis given in (8.2.10).
2. Show that the Hessian given in (8.3.12) is negative semidefinite. Hint:
1y,x, 2] = [~'Yr ] + Yr
[ x;
y2
x,x; [ -ex'
'
-ex,] _ [ x, ] [ , _ cJ. c -c 2
Also, ).(v) > v for all v (i.e., A(-v)
-
+v >
~]·
y·
xt
0 for all -v).
3. (Tobit for serially correlated observations) Suppose {y,, x,) is ergodic stationary but not necessarily i.i.d. Is the Tobit ML estimator described in the text consistent? [Answer: Yes.]
8.4 Multivariate Regressions In Section 1.5 and also in Example 7.2 of Chapter 7, we pointed out that the OLS estimator of the linear regression model is numerically the same as the ML estimator of the same model with normal errors. We have derived the GMM estimators for the single-equation models with endogenous regressors in Chapter 3 and for their multiple-equation versions in Chapter 4, but we have not indicated what the ML estimators for those models are. In this section and the next, we derive those ML counterparts. This section examines the ML estimation of the multivariate regression model, which is the simplest multiple-equation model.
522
Chapter 8
The Multivariate Regression Model Restated
The multivariate regression model is described by Assumptions 4.1-4.5 and 4.7, with z,m = x,m = x, for all m = I, 2, ... , M. To change the notation of Chapter 4, the system of M equations can be written as
Y<m =
x;
:n:om
+ v,m
(m =I, 2, ... , M; t =I, 2, ... , n).
(8.4.1)
(I xK) (Kx I)
Here, we maintain the convention of designating the true parameter vector by subscript 0. If we define Y
y,
Y<2
v,
,
no = [:n:ol
7ro2
· · · 7roM], (8.4.2)
(KxM)
(Mxl)
(Mxl)
Y<M then the M -equation system can be written compactly as
y,
= n~x, + v,
(t
= I, 2, ... , n).
(8.4.3)
The model has the following features. • {y,, x,} is ergodic stationary (Assumption 4.2).
• The common regressors x, are predetermined in that E(x, · v,m) = 0 for all m (Assumption 4.3). • The rank condition for identification (Assumption 4.4) reduces to the familiar condition that E(x,X,) be nonsingular. • The error vector is conditionally homoskedastic in that E(v, v; Sl 0 is positive definite (Assumption 4.7).
I x,)
= Sl0 , and
The GMM estimator of :n: om is the OLS estimator: (8.4.4) So the GMM (OLS) estimate of no is given by
fi
(KxM)
1
n
= ("x,x;) n L..t t=l
(KxK)
-I
1 n (- "x,y;). n
L..t
t=l (KxM)
(8.4.5)
523
Examples of Maximum Likelihood
The Likelihood Function
The model is not yet specific enough to lend itself to ML estimation. We make the supplementary assumptions that (I) v, I x, ~ N(O. 2 0 ), and (2) (y,, x,} is i.i.d. (This assumption strengthens Assumptions 4.2 and 4.5.) The multivariate regression model with the normality assumption (I) implies that y, 1 x, ~ N(D~x,, Q 0 ). The density of this multivariate normal distribution is (8.4.6) Replacing the true parameter values (D 0 , Ro) by their hypothetical values (D, Q) and taking logs, we obtain the log conditional likelihood for observation t: logf(y, I x,; D, Q) = -
M
2
I
I
I
log(2rr) + zlog(IR- D- z(y,- D x,)
sz-
ff)
f
(y,- D x,), (8.4.7)
where use has been made of the fact that -log(IQI) = log(IQ- 1 1). For a random sample, I In times the log conditional likelihood of the sample is the average over t of the log conditional likelihood for observation t. So the objective function Q.(ll, Q) is Q.(ll, Q) =- M log(2rr) 2
+~
2
log(IR- 1 1)-
-
1
2n
n
l)y,- D'x,)'Q- 1 (y,- D'x,). t=l
(8.4.8) It is left as Review Question 2 to show that the last term can be written as n
'"' - I L...-(y,D f x,) f 2n t=l
sz- I (y,- D x,) =-2I trace[Q- 1-Q(D)], f
(8.4.9)
-
where Q(D) is defined as n
Q(ll) (MxM)
'"' =-I L...-(y,D ,x,)(y,- D ,x,)., n
(8.4.10)
t=l
So the objective function can be rewritten as Q.(ll, Q) = -
M
2
log(2rr)
+I
2
log(IQ- 1 I)-
lI
-
trace[Q- 1 Q(ll)]. (8.4.11)
The ML estimate of (D 0 , Q0 ) is the (D, Q) that maximizes this objective function.
524
Chapter 8
The parameter space for (ll 0 , S! 0) is {(ll, S!) IS! is symmetric and positive definite}.
Maximizing the Likelihood Function We now prove that the solution to the maximization problem is numerically the same as the GMM estimator. That is, the ML estimator of llo is given by the GMM/OLS estimator fi in (8.4.5) and the ML estimate of S! 0 is given by n
-S! = -I~-., L- v, v, = S!(ll), n
vt
(8.4.12)
t=I
_,
n
where = YtXr. It is left as Analytical Exercise 2 to show that, under the assumptions of the multivariate regression model stated above, Si(ll) is positive definite with probability one for any given n for sufficiently large n. So we can assume that Si(ll) is positive definite, not just positive semidefinite. The proof that the GMM estimator is numerically the same as the ML estimator is based on the following two-step maximization of the objective function.
Step 1: The first step is to maximize Qn(ll, S!) with respect to S!, taking
n
as given. For this purpose, the following fact from matrix algebra is useful.
An Inequality Involving Trace and Determinan/: 6 Let A and B be two symmetric and positive definite matrices of the same size. Then the function f(A) =log (lA I)- trace (AB)
is maximized uniquely by A = B- 1 . This result, with A = S!- 1 and B = Si(ll), immediately implies that the objective function (8.4.11) is maximized uniquely by S! = §(ll), given n. Substituting this S! into (8.4.11) gives (1/n times) the concentrated log likelihood function (concentrated with respect to S!):
6This result is implied by Lemma A.6 of Johansen (1995). The uniqueness of the minimizer is due to the linearity of the trace operator and the strict concavity oflog(IAD noted in Magnus and Neudecker ( 1988, p. 222).
525
Examples of Maximum Likelihood
Q:(ll)
=
Qn(ll, Q(ll))
M I I = -21og(2rr) + 21og(IQ(ll)- 11)- trace[Q(ll)- 1Q(ll)] 2
= -
M
2
tog(2rr)-
I
2
-
tog(IQ(ll)l)-
M 2'
(8.4.13)
Step 2: Looking at the concentrated log likelihood function the ML estimator of ll 0 should minimize
I'~
-
IQ(ll)l = -;; L..(y,t=l
·
Q~ (ll),
· ·1
n x,)(y,- n x,)
we see that
(8.4.14)
.
It is left as Analytical Exercise I to show that this is minimized by the
OLS estimator gives (8.4.12).
fi
given in (8.4.5). Finally, setting
n
=
fi
in (8.4.10)
Consistency and Asymptotic Normality We have shown in Chapter 4, without the normality assumption (that e, x, 1s normal) or the i.i.d. assumption for {y1 , x, }, that the GMM/OLS estimator fi in (8.4.5) is consistent and asymptotically normal and that Q in (8.4.12) is consistent. It then follows, trivially, that the ML estimator of ll 0 is consistent and asymptotically normal and the ML estimator of Q0 is consistent. It is worth emphasizing that the ML estimator of llo based on the likelihood function that assumes normality is consistent and asymptotically normal even if the error term is not normally distributed. We will not be concerned with the asymptotic normality of the ML estimator of Q0 , Q(ll). One way to demonstrate the asymptotic normality would be to verify the conditions of the relevant asymptotic normality theorem of the previous chapter.
-
QUESTION FOR REVIEW
1. Prove (8.4.9). Hint: Use the following properties of the trace operator. (1) trace(x) = x if xis a scalar, (2) trace(A +B) = trace(A) + trace(B), and (3) trace(AB) = trace(BA) provided AB and BA are well defined. The answer is n
.!.n L(y,- ll'x,)'Q- 1(y,- ll'x,) t=l
=trace[~ t(y,- ll'x,)'Q- 1(y,- n'x,)J t=l
526
Chapter 8
= .!_ ttrace[(y,n t= L
ll'x,n~-'(y,- ll'x,)]
n
= .!_ n
2:: trace[ ~r' (y, -
n'x, )(y, - n'x,)'J
t=l
=trace[~ tQ- 1(y,- ll'x,)(y,- ll'x,)'] t=L
~
=trace[ Q- 1 t(y,- ll'x,)(y,- ll'x,)l
8.5 FIML As seen in Chapter 4, the multivariate regression model is a specialization of the seemingly unrelated regression (SUR) model, which in tum is a specialization of the multiple-equation system to which three-stage least squares (3SLS) is appli· cable. The two-stage least squares (2SLS) estimator is the GMM estimator when each equation of the system is estimated separately, while 3SLS is the GMM estimator when the system as a whole is estimated. This section presents the ML counterparts of 2SLS and 3SLS. The Multiple-Equation Model with Common Instruments Restated To refresh your memory, the model is an M -equation system described by Assump-
tions 4.1-4.5 and 4.7, with Xrrn = x, for all m =I, 2, ... , M (so the set of instruments is the same across equations). Still maintaining the convention of denoting the true parameter value by subscript 0, we can write the M equations of the system as Yrrn
=
z;rn
~om +errn
(m= 1,2, ... ,M;t
= 1,2, ... ,n).
(8.5.1)
(lxL,.) (L,xl)
The model has the following features. • Unlike in the multivariate regression model, for each equation m, the regressors Zrrn may not be orthogonal to the error term, but there are available K predetermined variables x, that are known to satisfy the orthogonality conditions E(x, · Brrnl = 0 for all m (this is Assumption 4.3). Defining the M x I vector Etas
527
Examples of Maximum Likelihood
Et (Mxl)
(8.5.2)
=
the orthogonality conditions can be expressed as E(x,<;) =
0
(KxM)
(8.5.3)
.
These K predetermined variables serve as the common set of instruments in the GMM estimation. • The endogenous variables of the system are those variables (apart from the errors) in the system that are not included in x,. • The rank condition for identification (Assumption 4.4) is that the K x Lm matrix E(x,z;ml be of full column rank for all m. Equation m is said to be overidentified if the rank condition is satisfied and Lm < K. • The error vector is conditionally homoskedastic in that E(s,s; I: 0 is positive definite (Assumption 4.7).
I x,)
= I: 0 , and
• E(x,x;) is nonsingular (a consequence of conditional homoskedasticity and the nonsingularity requirement in Assumption 4.5). The GMM estimator for this system as a whole is the 3SLS estimator given in (4.5.12), while the single-equation GMM estimator of each equation in isolation is 2SLS given in (3.8.3). The following examples will be used to illustrate the concepts to be introduced shortly.
Example 8.1 (A model of market equilibrium): An expanded version of Working's example considered in Section 3.1 is a two-equation system:
q, = YliP< + fJll + fJ12a, +
(demand),
(8.5.4)
q, = Y21P< + fJ21 + fJnw, +
(supply).
(8.5.5)
<<2
In this system, q, is the amount of coffee transacted and p, is the coffee price. The variable a, appearing in the demand equation is a taste shifter, while the variable w, is the amount of rainfall that affects the supply of coffee. The
528
Chapter 8
system can be cast in the general format by setting
Yn =q,, Y<> =q,, Zn
=
[~:], z,, = [ ~],
~01 = [;::]' ~02 = [;~:]. Jl,, /lJ2
(Note that Yn andy,, are the same variable.) IfE(&,j) = 0, E(a,&,j) = 0, and E(w,&,j) = Ofor j = 1,2, then (l,a,,w,) can be included in the common set of instruments. So
The other nonerror variables in the system, p, and q,, are endogenous variables. Provided that the rank condition for identification is satisfied, each equation is just identified because the number of right-hand side variables equals that of instruments. Example 8.2 (Wage equation): In the empirical exercise of Chapter 3, we estimated a standard wage equation. Consider supplementing the wage equation by an equation explaining KWW (the score on the "Knowledge of the World of Work" test):
+ il11 + Jl12IQ, + ""' Y21S, + il21 + ea,
LW, = Y11S, KWW, =
(8.5.6) (8.5.7)
where S,, for example, is years in schooling for individual I. Assume that E(&,j) = 0 and E(IQ,&,j) = 0 for j = I, 2. Assume also that there is a variable MED (mother's education), which does not appear in either equation but which is predetermined in that E(MED,eu) = 0 for j = I, 2. So the common set of instruments is x, = (I, IQ,, MED,)'. The rest of the nonerror variables in the system are endogenous variables. This two-equation system has three endogenous variables, LW, S, and KWW. Provided that the rank condition is satisfied, the first equation is just identified because the number of right-hand side variables equals the number of instruments, while the second equation is overidentified.
529
Examples of Maximum Likelihood
The Complete System of Simultaneous Equations The ML counterpart of 3SLS is called the fun-information maximum likelihood (FIML) estimator. If it is to be estimated by FIML, the multiple-equation model needs to be further specialized before introducing the normality and i.i.d. assumptions. The required specialization is that those M equations form a "complete" system of simultaneous equations. Completeness requires two things. I. There are as many endogenous variables as there are equations. This condition implies that, ify, is an M-dimensional vector collecting those M endogenous variables, then (Y<~, ... , y, M, z, 1 , .•• , z, M) are all elements of (y,, x, ), which allows one to write the M -equation system (8.5.1) as
ro
y,
(MxM)(Mxl)
+
Bo
x,
e,
=
(MxK)(Kxl)
(1=1,2, ... ,n),
(8.5.8)
(Mxl)
with the m-th equation being y,m = z;moom + e,m· (This is illustrated for Example 8.1 below.) The equation system written in this way is called the structural form, and (r 0 , B 0) are called the structural form parameters. As will be illustrated below, each of the structural form parameters is a function of (,lot, ... , OoM). The system in Example 8.1 satisfies this first requirement. The system in Example 8.2 is not a complete system because the number of endogenous variables is greater than the number of equations. It is not possible to estimate incomplete systems by FIML (unless we complete the system by adding appropriate equations, see the discussion about LIML below). This is in contrast to GMM; incomplete systems can be estimated by 3SLS (or more generally by multiple-equation GMM if conditional homoskedasticity is not assumed), as long as they satisfy the rank condition for identification. 2. The square matrix r 0 is nonsingular. This implies that the structural form can be solved for the endogenous variables y, as (8.5.9) where n~ (MxK)
vt (Mxl)
=-
ro -• Bo , (MxM)
= ro (MxM)
-1
(8.5.10)
(MxK)
et .
(8.5.11)
(Mxl)
Expression (8.5.9) is known as the reduced-form representation of the structural form, and the elements of 0 0 are called the reduced-form coefficients. In
530
Chapter 8
each equation of the reduced form, the regressors are the same set of variables, x,. From (8.5.3) and (8.5.11), it follows that E(x,v;) = 0, namely, that all the regressors are predetermined. The reduced form, therefore, is a multivariate regression model. Relationship between (r0 , B,) and
&o
Let So be the stacked vector collecting all the coefficients in the M -equation system (8.5.1):
(8.5.12)
For understanding the mechanics of FIML, it is important to see how the coefficient matrices (f 0 , B0 ) depend on S0 . To illustrate, consider Example 8.1. The S0 vector is YII fJII
So = (6x I)
fJ12
(8.5.13)
Y21
fJ21 fl22
The two endogenous variables are (p,, q,). It does not matter how the endogenous variahles are ordered, so arhitrarily put p, first in y,: y, = (p,, q, )'. The structural form parameters can be written as
ro
(2x2)
=
-YII [
-y21
Bo = (2x3)
-fJII [ -fJ21
-fJI2
0
(8.5.14)
This example serves to illustrate three points. First, each row of r 0 has an element of unity, reflecting the fact that the coefficient of the dependent variable in each of theM equations (8.5.1) is unity. In this sense r 0 is already normalized. Second, some elements of r 0 and B0 are zeros, reflecting the fact that some endogenous or predetermined variables do not appear in certain equations of the M-equation system. This feature of the structural form coefficient matrices is called the exclusion restrictions. (In Example 8.1, no elements of r 0 happen to be zero because the two endogenous variables appear in both equations.) Third,
531
Examples of Maximum Likelihood
the system involves no cross-equation restrictions, so each element of ~ 0 appears in (f 0 , B0 ) just once. The FIML Likelihood Function
To make the complete system of simultaneous equations estimable by FIML, we assume (I) that the structural error vector e, is jointly normal conditional on x,, i.e., e, I x, ~ N(O, 1: 0 ), and (2) that {y,,x,} is i.i.d., not just ergodic stationary martingale differences (this assumption strengthens Assumptions 4. 2 and 4.5). By the normality assumption (I) about the structural errors e, and (8.5.11), we have v, I x, ~ N(O, f 011: 0 (f 01)'). Combining this and the reduced form y, = II~x, + v, with (8.5.1 0) produces (8.5.15) So the log conditional likelihood for observation 1 is log f(y, I x,; ~. 1:) = -
M
2
~[y, +
log(2rr) -
I
2
log (If - 11: (r- 1l'l) 1
r- 1Bx,l'[r- 11:(r- 1)']- [y, + r- 1Bx,].
(8.5.16)
The likelihood is a function of(~, 1:) because, as illustrated in (8.5.14), the structural form coefficients (f, B) are functions of~. Now
)T 1 Iy, + r- 1Bx,]
[y, + r- 1Bx,l'[r- 11:(f- 1
= [y, + r- 1Bx,l'[r'1:- 1 f]ly, + r- 1Bx,J = [ry, +Bx,]'1:- 1 [ry, +Bx,]
(8.5.17)
and (8.5.18) Substituting these into (8.5 .16), logj(y, I x,; ~. 1:) = -
M
2
I 1 1og(2rr) + 21og(lfl 2) - 2log(l1:1)
-
~[ry, +
Bx,]'1:- 1[ry, + Bx,]. (8.5.19)
Taking the average over 1, we obtain the FIML objective function for a random sample:
532
Chapter 8
I
Af
Qn(~, l:) = -2Jog(2rr)+ 21og(lfl
2
)-
I 21og(ll:l)
n
-_I "'[ry, +Bx,]'l:- 1 [ry, +Bx,]. 2n L...,
(8.5.20)
t=l
The FIML estimate of (~ 0 , 1: 0 ) is the(~, l:) that maximizes this objective function. The FIML Concentrated Likelihood Function As in the maximization of the objective function of the multivariate regression model, we can proceed in two steps. The first step is to maximize Qn(~, l:) with respect to l: given ~. This yields -l:(~) =-I~ L...,(fy, + Bx,)(fy, + Bx,).' (MxM) n r::::l
(8.5.21)
Its (m, h) element can be written as n
&mh =
~ L(Yrm- z;m~m)(y,h- z;h~h).
n
(8.5.22)
t=l
By substituting (8.5.21) back into the objective function (8.5.20), (1/n times) the FIML concentrated likelihood function can be derived as Q;(~);;: Qn(~, 1:(~)) =-
~ log(2rr)- ~
+
= - Af log(2rr)- Af-
2
2
~ log(lfl 2 ) - ~ Iog(li:(~)l)
~log~~ ~(y, + 2 nL...,
r- 1Bx,)(y, + r- 1Bx,)'l·
t=l
(8.5.23) The second step is to maximize this FIML concentrated likelihood with respect to ~. The FIML estimator &of ~ 0 is the ~ that maximizes this function 7 The FIML estimator of l:o is 1: (&l,
7So the FIM:L estimator
&is an extremum estimator that maximizes
" + r- 1BxrHYt + r- 1Bxd]. -1~ L
Some aspects of this extremum estimator are explored in an optional analytical exercise (see Analytical Exercise 3).
533
Examples of Maximum Likelihood
Testing Overidentifying Restrictions Now compare this FlML concentrated likelihood Q~(8) given in (8.5.23) with that for the multivariate regression model, Q~ (ll) given in (8.4.13). The FIML concentrated likelihood Q~(8) is what obtains when we impose on Q~(ll) the restrictions implied hy the null hypothesis that Ho: n~ = -f 0 B0 or foll~ + B0 = 0. 1
(8.5.24)
Put differently, FIML estimation of 8o is a constrained multivariate regression. Therefore, the truth of the null hypothesis can be tested by the likelihood ratio principle. Since the OLS estimator fi given in (8.4.5) maximizes the multivariate regression concentrated likelihood function Q~ (ll) given in (8.4.13) and the FIML estimator ii maximizes the FIML concentrated likelihood function Q~(ii) given in (8.5.23), the likelihood ratio test statistic is LR = 2n · (Q~(fi) - Q~(/i)) 1
= n x (logiS!(-(f- Bl')l-log
IQ(fi)l).
(8.5.25)
where Q(ll) is defined in (8.4.1 0) and (f, B) is the value of (f, B) implied hy ii. Since the dimension of 8 equals I::::'= I Lm, the number of restrictions implied by the null hypothesis H0 above is M
M
KM - L Lm = L(K - Lm), m=l
(8.5.26)
m:=:l
which is the total number of overidentifying restrictions of the system. Therefore, the likelihood ratio test based on the LR statistic (8.5.25) is a test of overidentifying restrictions. It is a specification test because the restriction being tested is a condition assumed in the model. Properties of the FIML Estimator There is no closed-form solution to the FIML estimator. So, in order to establish the consistency and asymptotic normality, we would have to verify the conditions of relevant theorems of the previous chapter. We summarize here, without proof, the main properties of the FIML estimator. Identification Stating the identification condition in various forms for complete simultaneous equations systems is a major topic in most textbooks. Here it is left as an optional
534
Chapter 8
analytical exercise (Analytical Exercise 3) to show that the identification condition for FIML as an extremum estimator is equivalent to the rank condition for identification (that E(x,z;ml be of full column rank for all m = I, 2, ... , M). Asymptotic Properties Although the FIML ohjective function (8.5.20) is derived under the normality of e, I x,, the FIML estimator of 80 is consistent and asymptotically normal without the normality assumption. For a proof of consistency without normality, see, e.g., Amemiya (1985, pp. 232-233). There is a nice proof of asymptotic normality of 8 due to Hausman (1975), which shows that an iterative instrumental variable estimation can be viewed as an algorithm for solving the first-order conditions for the maximization of Qn(o, :E). 8 This argument shows that the FIML estimator of 80 is asymptotically equivalent to the 3SLS estimator9 Therefore, its asymptotic variance is given hy (4.5.15), and a consistent estimator of the asymptotic variance is (4.5.17). A
lnvariance Since the FIML estimator is an ML estimator, it has the invariance property discussed in Section 7.1. To illustrate, consider Example 8.1 and suppose you wish to reparameterize the supply equation as p, = ji,,q,
+ !h1 + fJ,w, + 1!, 2
(supply).
(8.5.27)
The old supply parameters (y21 , fJ21, fJ22l and the new supply parameters (ji21 , fJ21 , fJ22 ) are connected by the one-to-one mapping (8.5.28) If (y 21 , {i21 , {i22 ) is the FIML estimator of (y21 , {J21 , fJ22) for the system consisting of the demand equation (8.5.4) and the old supply equation (8.5.5), then the FIML estimator based on the system consisting of (8.5.4) and the new supply equation is numerically the same as the value of the above mapping at (y21 , {i21 , {i22 ). The 3SLS estimator (and 2SLS for that matter) does not have this invariance property.
8we will not display the first-order condition.~ (the likelihood equations) for FIML because it requires some cumbersome matrix notation. See Amemiya (1985, pp. 233-234) or Davidson and MacKinnon ( 1993. pp. 658660) for more details. For the SUR model, the likelihood equations and the iterative procedure are much easier
to describe: see the next subsection. 9This asymptotic equivalence does not carry over to nonlinear equation systems; nonlinear FIML is more efficient than nonlinear 3SLS. See Amemiya (1985, Section 8.2).
535
Examples of Maximum Likelihood
The foregoing discussion about FIML can be summarized as
Proposition 8.1 (Asymptotic properties ofFIML): Consider theM-equation system (8.5. I) and Jet ~ 0 be the stacked vector collecting all the coefficients in the M-equation system. Make the following assumptions. • the rank condition for identification (that E(x,z;m) be of full column rank for all m = 1 , 2, ... , M) is satisfied,
• E(x,X,) is nonsingular, • the M-equation system can be written as a complete system of simultaneous equations (8.5.8) with r 0 nonsingular,
• e, I x,
~
N(O, 1: 0 ), 1: 0 positive definite,
• (y,, x,} is i.i.d.,
• the parameter space for (~ 0 • 1: 0 ) is compact with the true parameter vector (~ 0 • 1: 0 ) included as an interior point. Then (a) the FIML estimator(~. 1:), which maximizes (8.5.20), is consi.
(b) the asymptotic variance of~ is given by (4.5.15), and a consistent estimator of 1 the asymptotic variance is (4.5.17) where &-mh is the (m, h) element of l:- , (c) the likelihood ratio statistic (8.5.25) for testing overidentifying restrictions is asymptotically x2 with K M - L~~ 1 Lm degrees of freedom. A
Furthermore, these asymptotic results about
~hold
even if e, 1 x, is not normal.
ML Estimation of the SUR Model
The SUR model is a specialization of the FlML model with z,m being a subvector of x,. It is therefore an M-equation system (8.5.1) where the regressors z,m are predetermined. Unlike in the multivariate regression model, the set of regressors could differ across equations. It is easy to show (see Review Question 3) that no two dependent variables can be the same variable, so y, (M x 1) in the structural form (8.5.8) simply collects the M dependent variables. Furthermore, since z,m includes no endogenous variables, r 0 in the structural form is the identity matrix. Thus, the FIML objective function (8.5.20) becomes
536
Chapter 8
Qn(
M
2
1og(2n)-
I
I " 1og(ll:l)- n .L;[ry, 2 2 t=I
+ Bx,]'l:-'[ry, + Bx,], (8.5.29)
where f is understood to be the identity matrix. (We continue to carry f around in order to facilitate the comparison to the objective function for LIML to be introduced in the next subsection.) As in FIML, the l: that maximizes the objective function given
L;;;=,
n
.L;[ry,
+ Bx,]'l:- 1 [ry, + Bx,]
= (y- Z
t=I
Therefore, (8.5.31) (Doing the same for the more general case ofFIML is not as easy because, although (8.5.30) holds for FIML as well, the score is more complicated thanks to the 2 term log(lfl ) in the FIML likelihood function. Put differently, if If I is a constant (i.e., independent of <1), then for FIML is the same as (8.5.31).) Setting this equal to zero and solving for <1, we obtain
'fl'
'S'
(8.5.32) Given some consistent estimator f of 1: 0 , 3(f) is the SUR estimator of <1 0 derived in Chapter 4. These two functions (3(1:), l:(
537
Examples of Maximum Likelihood
part updates the estimate of 1: 0 using the current estimate of .1 0 . If this process converges, then the limit is a solution to the first-order conditions. QUESTIONS FOR REVIEW
1. (Deriving f(y, I x,) from f(e, I x,)) In this exercise, we wish to derive the likelihood for observation t for the FIML model by the following fact from probability theory:
Change of Varklble Formula: Suppose e is a continuous M -dimensional random vector with density J,(e) and g : m:.M --> m:.M is a oneto-one and differentiable mapping on an open set S, with the inverse mapping denoted as g- 1( •). Let the M x M matrix J(e) be the J acohian matrix af~~l (the M x M matrix of partial derivatives, whose (i, j) element is a~;•l). Assume that IJ(e)l is not zero at any point in S. ' of y = g(e) is given by Then the density J,(g-l(y)) f(y) = abs(IJ(g 1(y))l) · (Note that IJ(g- 1(y))l is a determinant, which may or may not be positive.) Using this fact, derive the density of y, I x, from the density of e, I x, and verify that the associated log likelihood is given by (8.5.16). Hint: The inverse mapping g-l(y,) is the left-hand side of (8.5.8). SoJ(e,) = f 01 Also, Iog(abs(lfl)) = ~ Iog(lfl 2 ). 2. (Instruments outside the system) Consider the complete system of simultaneous equations (8.5.8) with If ol ,P 0. Suppose that one of the K predetermined variables, say x, K, does not appear in any of the M-equations. Let E' (y I x) he the least squares projection of yon x. Show that E'(y,m I x,) = E'(y,m I x,), where x, = (xu, ... , x,_K _J)' (so x, = (x;, x,K )'). (As you will show in Analytical Exercise 4, including x, K in the list of instruments in addition to will not change the asymptotic variance of the FIML estimator of .1 0 ; it probably makes the small sample properties only worse.)
x,
3. Recall that in Example 8.1 y, 1 and y, 2 are the same variahle. In the SUR model, this cannot happen; that is, no two elements of (y, 1, y, 2 , .•• , y,M) are the same variahle if the assumptions descrihing the model are satisfied. Prove this. Hint: Suppose y, 1 and y, 2 are the same variable. Then
538
Chapter 8
Since z" and z, 2 are subvectors of x,, the right-hand side can be written as for some K-dimensional vector"'· Use the orthogonality conditions that E(x,e;) = 0 and the nonsingularity of E(x,x;) to show that"' = 0.
x;"'
8.6
LIML
LIML Defined
The advantage of the FIML estimator is that it allows you to exploit all the information afforded by the complete system of simultaneous equations. This, however, is also a weakness because, as is true with any other system or joint estimation procedure, the estimator is not consistent if any part of the system is misspecilied. If you are confident that the equation in question is correctly specified hut not so sure about the rest of the system, you may well prefer to employ single-equation methods such as 2SLS. The rest of this section derives the ML estimator called the limited-infonnation maximum likelihood (LIML) estimator, which is the ML counterpart of 2SLS. Let the m-th equation of theM-equation system he the equation of interest. The
Lm regressors, z,m, are either endogenous or predetermined. Let y, be the vector of Mm endogenous variables and be the vector of Km predetermined variables, with Mm + Km = Lm. Partition ~Om conformably as ~Om = (Y~m• {3~)'. Then the m-th equation can he written as
x,
Yrm =
z;m
+ Erm
8om
(lxLm)(Lmxl)
y;
=
+ x;
Yom
(IXMmHMmxl)
/3om
+ Erm·
(8.6.1)
(lxKm)(Kmxl)
Obviously, this single equation in isolation is an incomplete system because of the existence of the Mm included endogenous variables y,. However, if this equation is supplemented by the Mm reduced-form equations for y, taken from (8.5.9), then the system of I + Mm equations thus formed is a complete system of simultaneous equations. To describe this idea more fully, let llo be the associated reducedform coefficient matrix and collect appropriate Mm columns from ll 0 and write the reduced-form equations for the Mm included endogenous variables as
y, (Mmxl)
_, flo
X1
(M111 xK)(Kx!)
+
V1 (Mmxl)
(8.6.2)
539
Examples of Maximum Ukelihood
Combining (8.6.1) and (8.6.2), we obtain the system of I + Mm equations:
+
y,
fo
((l+Mm)x(l+Mm)) ((1+Mm)x1)
Bo
x,
e,
=
(8.6.3)
((l+Mm)xl)
((l+Mm)XK) (Kxi)
where y,
=
[
Ytm ]
(M:~ I)
Er '
=
[
Erm ] (M:rxl)
IM.' l
'
- Yom
fo
=
[M;xl)
(I xMm)
,
Bo
=
[- (lx~m) p'
l
(lx(K-Km)) 0'
if0
-
.
(8.6.4)
(Mm xK)
Here, we have assumed, without loss of generality, that the included predetermined variables x, are the first Km elements ofx,. The equation system (8.6.3) is a complete system of I + Mm simultaneous equations hecause (8.6.5)
lfol =I ;iO.
For example, consider Example 8.2. The included endogenous variable in the wage equation is S,. Supplement the wage equation by a reduced-form equation for S,, a regression of S, on a constant, !Q, and MED,. The r 0 matrix for the two-equation system that results is
- [I fo =
0
The hypothetical value r of r 0 has the same structure shown in (8.6.4) for f 0 . So lfl = I. Replacing in the FIML objective function (8.5.20) (~,B) by (ym, Pm, ll), r hy r, and 1: by 1:, and noting that If I= I, we obtain the LIML objective function:
-Mm I Q.(y m• Pm' ll, 1:) =- 21og(2rr) - :zlog(I:EI) n
-
2~ L;[ ry, + Bx,]' 1: t=1
-I [
ry, + Bx,],
(8.6.6)
--
where 1: is the (I + M.,) x (I + Mm) variance matrix of e,. Let (y .,, Pm' ll, 1:) be the FIML estimator of this system. The LIML estimator of ~Om = (Y~m, P~m)' is (y~, P~)'. This derivation of the LIML estimator, which is different from the
540
Chapter 8
original derivation hy Anderson and Ruhin (1949), is due to Pagan (1979). Since LIML is a FIML estimator, Proposition 8.1 assures us that the LIML estimator is consistent and asymptotically normal even if the error is not normally distributed. Compulalion ol LIML
For the SUR model, we have shown that the ML estimator can be obtained from the iterative SUR procedure. If you reexamine the argument, you should see that it does not depend on r in the SUR objective function (8.5.29) being the identity matrix; all that is needed is that the determinant of r is a constant. Therefore, the same argument also applies to the LIML objective function (8.6.6). That is, we can obtain the LIML estimator of 80m by applying the iterative SUR procedure to the (I +Mm )-equation system (8.6.3). In particular, if l:o is known, the LIML estimator is the SUR estimator. On the face of it, this conclusion -that SUR can be used to estimate the coefficients of included endogenous variables- is surprising, but it is true. The essential reason is this: the sampling error Sm - Som depends not only on the correlation between the included endogenous variables y, and the error term e,m but also on the correlation between y, and v,. Those two correlations cancel each other out. Review Question I verifies this for a simple example. The iterative SUR procedure, however, is not how the LIML estimator is usually calculated, because there is a closed-form expression for the estimator. The simultaneous equation system (8.6.3) has two special features. One is the special structure of r, which we have noted, and the other is the fact that there are no exclusion restrictions in the rows of B corresponding to the included endogenous variables y,. Thanks to these features, the LIML estimator Sm = (y~, {i~)' and also the likelihood ratio statistic (8.5.25) can be calculated explicitly. To write down the closed-form expressions for the LIML estimator and the likelihood ratio statistic, we need to introduce some more notation. Let Zm, Y, X, X, and Ym be the data matrices and the data vector associated with z,m, y,, x,, x,, and Y<m• respectively:
(8.6.7)
541
Examples of Maximum Likelihood
and let M and M be the annihilators hased on X and X, respectively:
M =In- X(X'X)- 1X', M =I.- X(X'X)- 1X'. (n xn)
(8.6.8)
(n xn)
Then the LIML estimator ~m can be written as (8.6.9) where k is the smallest characteristic root of (8.6.10) with (I
+ Mm)
x (I
+ Mm)
matrices Wand W defined hy (8.6.11)
(See, e.g., Davidson and MacKinnon, 1993, pp. 645--649, for a derivation.) The likelihood ratio statistic (8.5.25) for testing overidentifying restrictions reduces to LR = n logk.
(8.6.12)
Since there are no overidentifying restrictions in the supplementary Mm equations (8.6.2), the equation of interest is the only source of overidentifying restrictions, which are K- Lm in numher. Therefore, by Proposition 8.1, this statistic for testing overidentifying restrictions is asymptotically x2 (K - Lm). When the equation is just identified so that K = Lm, then the LR statistic should be zero. Indeed, it can be shown that k = I in this case (see, e.g., p. 647 of Davidson and MacKinnon, 1993). If we do not necessarily require k to be as just defined, the estimator (8.6.9) is called a k-class estimator. The LIML estimator is thus a k-class estimator with k defined above. Inspection of the 2SLS formula in terms of data matrices (see (3.8.3') on page 230) immediately shows that the 2SLS estimator is a k-class estimator with k = I, and the OLS estimator is a k-class estimator with k = 0. It follows that LIML and 2SLS are numerically the same when the equation is just identified (so that k = I). We have already shown that LIML is consistent and asymptotically normal. Moreover, using the explicit formula (8.6.9), it can be shown that the LIML estimator of llom has the same asymptotic distribution as 2SLS (see, e.g., Amemiya, 1985, Section 7.3.4). So the formula for the asymptotic variance ofLIML is given by (3.8.4) and its consistent estimate by (3.8.5).
542
Chapter 8
LIML versus 2SLS
Since LIML and 2SLS share the same asymptotic distrihution, you cannot prefer one over the other on asymptotic grounds. For any finite sample, LIML has the invariance property, while 2SLS is not invariant. Furthennore, the conclusion one could draw from the large literature on finite-sample properties (see, e.g., Judge et al., 1985, Section 15.4) is that LIML should he preferred to 2SLS. Major conclusions from the literature are listed below. They assume that the predeterrnined variables x, are fixed constants, however. • Suppose errors are nonnally distributed. The p-th moment of the 2SLS estimator exists if and only if p < K- Lm+ I, where K is the number ofpredetennined variables and Lm is the numher of regressors in the m-th equation. Thus, 2SLS does not even have a mean if the equation is just identified (i.e., if K = Lm), and this is true even if the distrihution of errors is a nonnal distribution, whose moments are all finite. • The LIML estimator has no finite moments even if errors are nonnally distrihuted. • However, in most Monte Carlo simulations (see, e.g., Anderson, Kunitomo, and Sawa, 1982), LIML approaches its asymptotic nonnal distribution much more rapidly than does 2SLS.
QUESTIONS FOR REVIEW
1. (Why SUR can be used for a structural equation) To see why SUR can be used to estimate the coefficients of included endogenous variables, consider the following simple example: Yil = YoYr2 { Yr2 = fJoX 1
+ 6'rJ, + v,,
where x, is predetennined. Suppose for simplicity that
is known. Let (y,
{J) be the SUR estimator of (yo, flo).
543
Examples of Maximum Ukelihood
(a) Show that
[~]- [;:] a 121>' -;;- L... Yr2Xr ] - ! [ai l l-;;-> L...' Yr2Br! +a 121>' -;;- L... Yr2Vr ] a22lnL...t "'x2
a2I.!nL...tti "'x e
+ a22lnL...tt "'x v
'
where amh is the (m. h) element of I: 01 • Hint: The formula for the SUR estimator is given by (4.5.12) with (4.5.13') and (4.5.14') on page 280. (b) Show that
a
" I l- lL..., "" ' Yt2Br1 +a 121"' - L..., Yr2Vr n
n
t=I
t=I
converges to zero in probability. Hint:
8.7
Serially Correlated Observations
So far, we have assumed i.i.d. observations in deriving the asymptotic variance of ML estimators. This section is concerned with ML estimation when observations are serially correlated. Two Questions
When observations are serially correlated, there arise two distinct questions. The first is whether the ML estimator that maximizes a likelihood function derived from the assumption of i.i.d. observations is consistent and asymptotically normal when indeed the observations are serially correlated. More precisely, suppose that the observation vector w, is ergodic stationary hut not necessarily i.i.d., and consider the ML estimator that maximizes
L
I " Q,(O) = log f(w,; 9), n t=I
(8.7.1)
544
Chapter B
where log f(w,; 6) is the log likelihood for ohservation t. This objective function would he (I/n times) the log likelihood of the sample (w 1, ... , Wn) if the observations were i.i.d. Even though the log likelihood for observation t, logf(w,; 6), is correctly specified, this objective function is misspecified because it assumes, incorrectly, that ( w,} is i .i .d. The ML estimator that maximizes this objective function is a quasi-ML estimator because the likelihood function is misspecified. Is a quasi-ML estimator that incorrectly assumes no serial correlation consistent and asymptotically normal? We have actually answered this first question in the previous chapter. By Propositions 7.5 and 7.6, the quasi-ML estimator is consistent because it is an M-estimator, with m(w,; 6) given hy the log likelihood for ohservation t. For asymptotic normality, the relevant theorem is Proposition 7.8, which spells out conditions under which an M-estimator is asymptotically normal. Just reproducing the conclusion of Proposition 7.8, the expression for the asymptotic variance is given by
(8. 7.2) where 6 0 is the true value of 6, H(w,; IJ) is the Hessian for observation t, and l: is the long-run variance of the score for ohservation t evaluated at IJ 0 , s(w,; 6 0 ). This asymptotic variance can he consistently estimated by
~
Avar(O) =
{ 1 -;;
L n
t=l
H(w,; 0)
}-1 f {-;;1 L n
H(w,; 0)
}-1 ,
(8.7.3)
t=I
where f is a consistent estimate of :E. Any of the "nonparametric" methods discussed in Chapter 6 (such as the VARHAC) can be used to calculate l: from the estimated series (s(w,; 0) }; there is no need to parameterize the serial correlation in (s(w,; 6o)}. The second question is what is the likelihood function that properly accounts for serial correlation in observations and what are the properties of the "genuine" ML estimator that maximizes it. The question can be posed for conditional ML as well as unconditional ML. In conditional ML, we divide the observation vector w, into two groups, the dependent variable y, and the regressors x,, and examine the conditional distribution of (y 1, ••. , y,) conditional on (x 1, ••• , x.). Parameterizing this conditional distribution is relatively straightforward when (w,} is i.i.d., hecause it can he written as a product over t of the conditional distrihution of y, given x,. In fact, this i.i.d. case is what we have examined in this and previous chapters. If (w,} is serially correlated, however, the researcher does not know the
-
545
Examples of Maximum Likelihood
dynamic interaction between y, and x, well enough to be able to write down the conditional distribution. 10 In the rest of this section, we deal only with examples of unconditional ML estimation with serially correlated ohservations. Unconditional ML for Dependent Observations
We start out with the simplest case of a univariate series. Let (yo, Y1, ... , Yn) be the sample. (For notational convenience, we assume that the sample includes observation fort = 0.) When the observations are serially correlated, the density function of the sample can no longer be expressed as a product of individual densities. However, it can be written down in terms of a series of conditional probability density functions. The joint density function of (yo, y 1) can he written as (8.7.4) Similarly, for (yo. Y1, Y2), (8.7.5) Substituting (8.7.4) into this, we obtain (8.7.6) A repeated use of this sort of sequential substitution produces n
f(yo, Y~> · · ·, Yn) =
[l f(y,
I Yr-1. · · ·, Yl· Yo)f(yo).
(8.7.7)
t=l
If the conditional density f(y,
I y,_ 1 , ... , y 1 , y 0 ) and the unconditional density
f(y0 ) are parameterized by a finite-dimensional parameter vector 6, then ( 1/ n times) the log likelihood of the sample (y0 , y 1 , ... , Yn) can be written as
I " - l:)og f(y, n r=l
I n
I Y•-1· ... , Y~> Yo; 6) +-log f(yo; 6).
(8.7.8)
The ML estimator 6 of the true parameter value 6 0 is the 6 that maximizes this log likelihood function. We will call this ML estimator the exact ML estimator.
10 For a discussion of conditional ML when the complete specification of the dynamic interaction is not possible or undesirable, see Wooldridge (1994. Section 5). For a discussion of possible restrictions on the dynamic interaction (such as "Granger causality" and "weak exogeneity"), see, e.g .• Davidson and MacKinnon (1993, Section 18.2).
546
Chapter B
ML Estimation of AR(l) Processes
In Section 6.2, we considered in some detail first-order autoregressive (AR(l)) processes. If we assume for an AR(l) process that the error term is normally distributed, we obtain a Gaussian AR(l) process: y,
=co+ ¢oy,_, + £,,
(8.7.9)
with <, ~ i.i.d. N(O. aJ). We assume the stationarity condition that I
" {J 1 f(yo,YJ, ... ,y.;fi)=fl 2rra
t=I
x
2
y.) for a
2 exp[-2a1 2 (y,-c-¢y,_ 1) ] }
I
j2rr
... ,
,~:2
exp
-(yo[
'~•)']
2 1 ~¢2 2
•
(8.7.12)
Taking logs and dividing both sides by n, we obtain the objective function for exact ML:
""I n
I L- --log(2rr)I I 2 ) - - I (y,- c- ¢YH) ') Q.(fi) =-log(a n t=l 2 2 2a 2
+-I
2
) - [(Yo- _ { --log(2rr)I I ( a c) -log I-¢ 2 2 n 2 2 I- ¢ 2 1-¢2 "
2 ] } .
(8.7.13)
The exact ML estimator fJ of the AR( I) process is the fJ that maximizes this objective function. Assuming an interior solution, the estimato~ solves the first-order conditions (the likelihood equations) obtained by setting 'a~' to zero. They are a
547
Examples of Maximum Likelihood
system of equations that are nonlinear in IJ, so finding IJ requires iterative algorithms such as those described in Section 7.5.
Conditional ML Estimation of AR(l) Processes The source of the nonlinearity in the likelihood equations is the log likelihood of y 0 . As an alternative, consider dropping this term and maximizing
~I --log(2rr)I I 2 ) - -I , (y,- c- ¢y,_ ) Qn(IJ) = -I L...., -log(CT 1 n 2 2 2CT·
'l
.
(8.7.14)
t=I
Using the relation f(yo, y,, ... , Ynl = f(v,, ... , Yn I .vo)f(yo) and (8.7.12), we see that this objective function is (1/ n times) the log of the likelihood conditional on the first observation yo,
n
1 1 (y,- c- ¢y,_,)'] } . f(y,, y,, ... , Yn I .vo; IJ) = " { exp[- 2 2 2 r~I .J2rrCT C1 (8.7.15) Let 0 be the IJ that maximizes the log conditional likelihood (8.7.14) (conditional on y 0 ). This ML estimator will henceforth he called the y 0 -conditional ML estimator. Here, conditioning is on y0 , which makes the estimator here conceptually different from the conditional ML estimators of the previous sections where conditioning is on x,. Before examining the relationship between the two ML estimators, the exact ML estimator IJ and the y 0 -conditional ML estimator IJ, we give two derivations of the asymptotic properties of the latter. The first derivation is based on the explicit "
A
solution for the estimator. Recall from Section 1.5 and Example 7.2 that the objective function in the ML estimation of the linear regression model y, = x;t3 0 + e, for i.i.d. observations is
I2
I I 2 ) - -I( y - x , Pl -] Ln --log(2rr)-log(CT
n
t=l
2
2"2
'
'
'l .
(8.7.16)
It was shown in Section 1.5, and is very easy to show, that the ML estimator of Po is the OLS coefficient estimator and the ML estimator of equals the sum of squared residuals divided by the sample size. Now, with x, = (I, y,_ 1)' and Po = (c0 , ¢ 0 )', the objective function for the linear regression model reduces to Qn (IJ) in (8.7.14), the conditional log likelihood for the AR(l) process. Therefore, algebraically, the conditional ML estimator (c, ¢l of (c0 , ¢ 0 ) is numerically the
"J
548
Chapter 8
same as the OLS coefficient estimator in the regression of y, on a constant and Y•-1> and the conditional ML estimator 8- 2 of aJ is the sum of squared residuals divided by n. Regarding the asymptotic properties of the estimator, we have shown in Section 6.4 that (c, if,) is consistent and asymptotically normal with the asymptotic variance indicated in Proposition 6.7 for p = I and that 8-2 is consistent for aJ. Therefore, for the asymptotic normality of the conditional ML estimator (c, if,), which maximizes the y0 -conditional likelihood for normal errors, the normality assumption is not needed. The second derivation is more indirect but nevertheless more instructive. As just noted, the objective function Q"(ll) coincides with (8.7.16) with x, = (I, y,_ 1)' and {J 0 = (c0 , ¢ 0 )'. Since (y,, x,) is merely ergodic stationary and not necessarily i.i.d., the conditional ML estimator (c, if,, 8- 2 ) can be viewed as the quasi-ML estimator considered in Proposition 7 .6. It has been veri tied in Example 7.8 that the objective function satisfies all the conditions of Proposition 7.6 if E(x,x;) where x, =(I, y,_ 1)' is nonsingular. We have shown in Section 6.4 that this nonsingularity condition is satisfied for AR(l) processes if aJ > 0. Thus, the conditional ML is consistent. To prove asymptotic normality, view the conditional ML estimator as an M-estimator with w, = (y,, y,_ 1 )' and m(w,; II)= -
I
2
Iog(2Jr)-
I
2
1og(a 2 ) -
I
2
2
a 2 (y,- c-
The relevant theorem, therefore, is Proposition 7.8. Let s(w,; II) be the score for observation t and let H(w,; II) be the Hessian for observation t associated with this m function, and consider the conditions (I )-(5) of Proposition 7.8. The discussion in Example 7.I 0 implies that all the conditions except for (3) hold if E(x1 x;) where x, = (1. _y1_ 1)'is nonsingular, which as just noted is satisfied if aJ > 0. This leaves condition (3) that )n L7~ 1 s(w,; llo) N(O, l:). In the AR(l) model, f(y, I Yr-I .... , YJ, Yo) = f(y, I y,_,). As you will show in Review Question I, this special feature of the joint density function implies that the sequence (s(w,; 11 0 )) is a martingale difference sequence. Since s(w 1 ; llo) is a function of w1 = (_y,, y,_J)' and (y,) is ergodic stationary, the sequence (s(w,; 11 0 )) also is ergodic stationary. So by the ergodic stationary martingale difference CLT of Chapter 2, we can claim that condition (3) is satisfied with
--+,
l: = E[s(w 1; 11 0 ) s(w,; 11 0 )'].
(8.7.18)
Finally, it is easy to show the information matrix equality for observation t that - E[H(w1; 11 0 ) J = E[s(w 1; 11 0 ) s(w,; 11 0 )'] (a proof is in Example 7.1 0). Substituting this and (8.7.18) into (8.7.2), we conclude that the asymptotic variance of the
549
Examples of Maximum Likelihood
'
y 0 -conditional ML estimator 8 of Oo is given by
1 1 Avar(i}) = - (E[H(w,; Oo) Jr = (E[s(w,; Oo) s(w,; Oo)'Jr . (8.7 .19) That is, the usual conclusion for a correctly specified ML estimator for i.i.d. observations is applicable to the y 0 -conditional ML estimator that maximizes the correctly specified conditional likelihood (8.7.14) for dependent observations. 11 The difference between (8.7.13) and (8.7.14) is the log likelihood of y 0 divided by n. If the sample size n is sufficiently large, then the first observation Yo makes a negligible contribution to the likelihood of the sample. The exact ML estimator and the Yo-conditional ML estimator tum out to have the same asymptotic distribution, provided that 1<1>1 < I (see, e.g., Fuller, 1996, Sections 8.1 and 8.4). For this reason, in most applications the parameters of an autoregression are estimated by OLS (Yo-conditional ML) rather than exact ML. Conditional ML Estimation of AR(p) and VAR(p) Processes A Gaussian AR(p) process can be written as
Yr =co+
(8.7.20)
uJ). The conditional ML estimation of the AR(p) process is
completely analogous to the AR(l) case. If the sample is
x, = (1, y,_ 1, ... , Yr-p)'
and
{3 0 =(co, >or.
Therefore, the conditional ML (conditional on (Y-p+l• Y-p+'· ... , Yo)) estimator of (c0 , ¢>01 , ..• , t/>op) is numerically the same as the OLS coefficient estimator in the regression of y, on a constant and p lagged values of y,. The conditional ML estimate of is the sum of squared residuals divided by the sample size. By Proposition 6.7, the conditional ML estimator of the coefficients is consistent and
uJ
asymptotically normal if the stationarity condition (that the roots of I -¢>o 1z- · · ·-
t/>opzP = 0 be greater than I in absolute value) is satisfied. Again as in the AR(l) case, the error term does not need to be normal for consistency and asymptotic normality. The exact ML estimates and the conditional ML estimates have the same asymptotic distribution if the stationarity condition is satisfied. 11 This argument can be applied to models more genera] than autoregressive processes. See Lemma 5.2 of Wooldridge (1994).
550
Chapter 8
These results generalize readily to vector processes. A Gaussian VAR(p) (p-th order vector autoregression) can be written as y,
=Co+ '~>oiYt-1 + · · · + '~>opYt-p + e,,
(8.7.21)
(Mxl)
withe, ~ i.i.d. N(O, 2 0 ). The objective function in the conditional ML estimation is (1/ n times) the log conditional likelihood: M 1 1 ~~ Qn(ll) = --log(2rr) +-log( IS!- l l - - L.)(y,-
2
2
2n
n x,) Q-
11]
(y,-
n x,)}, I
l=l
(8.7.22) where
TI'
=[c
11> 1
11> 2
'l>p],
·· ·
x,
=
Yt-t
(8.7.23)
Yt-2
Yt-p
This coincides with the objective function in the ML estimation of the multivariate regression model considered in Section 8.4. Therefore, the conditional ML estimate of (c0 , TI 0 ) is numerically the same as the equation-by-equation OLS estimator. It is consistent and asymptotically normal if the coefficient matrices satisfy the stationarity condition (that all the roots of liM- '1> 01 z- · · · - '~>opzPI = 0 be greater than I in absolute value). This result does not require the error vector e, to be normally distributed. QUESTIONS FOR REVIEW 1.
(Score is m.d.s.) Consider the Gaussian AR(l) process (8.7.9) and let s(w,; (}) be the score for observation t (the gradient of (8.7.17)) with w, = (y,, y,_ 1)' and(} = (c, >. a 2 )'. (a) Show that
s(w,; 9) =
where
P=
"\x, · (y,- x;{l) [ - ~1 2
+ 2 ~4 (yt
(c, >)'and x, = (1, Yt-IY·
-
x;f3)
] 2
,
551
Examples of Maximum likelihood
(b) Show that E[s(w,; IJo) I Y<-1,
Y<-2· ... l =
0.
Hint: At IJ = 90 , y, - x;fJ = s,. (c) Show that {s(w,; 1} 0 )) is a martingale difference sequence. Hint: s(w,; 1} 0 ) is a (measurable) function of (y, y,_ 1). So s(w,_i; IJ 0 ) (j > 1) can be calculated from (y,_J. Y<-2• ... ). (d) Does this result in (c) carry over to Gaussian AR(p)? [Answer: Yes.]
PROBLEM SET FOR CHAPTER 8 - - - - - - - - - - - - ANALYTICAL EXERCISES
1. Show that
L...(y,- n ' x,)(y,- n ''I x,) > 1'~'''1 -;; L... v,v, , I -;;I~ t=l
(I)
t=l
~,
~
where v, = y, - TI x, and TI is given in (8.4.5). (Since v, does not depend on n by construction, this inequality shows that the left-hand side is minimized at TI = TI.) Hint: Use the "add-and-subtracf' strategy to derive ~
y,- TI'x,
=
~
v, + (TI- TI)'x,.
So n
L(y,- TI'x,)(y,- TI'x,)' t=l
n
=
n
n
I:v,v; + (fi- TI)'I:x,v; + I:v,x;(fi- TI) t=1
t=l
n
+ (fi- TI)'(I:x,x;)
n
n
= I:v,v; + (fi- TI)'(I:x,x;)
1=1
n
(since
I:x,v; = 0). 1=1
552
Chapter 8
Then use the following result from matrix algebra: Let A and B be two positive semidefinite matrices of the same size. Then
lA + Bl
>
IAI.
~
2. (Q(II) is positive definite.) Under the conditions of the multivariate regression model, show that Q(II) defined in (8.4.10) is positive definite for any given II. Hint: Derive y,- II'x, = v, + (II 0 - II)'x,. Since {y1, x,} is i.i.d., Q(II) converges almost surely toE[ (y, - II'x, )(y, - II'x, )']. Show that
Then use the matrix inequaltiy mentioned in the previous exercise. Q 0 is positive definite by assumption. 3.
(Optional, identification of~ in FlML) As noted in footnote 7 of Section 8.5, the FIML estimator of ~ 0 is an extremum estimator whose objective function is ~
Q"(~) =-In(~) I,
(2)
where
Q(~)
1
n
=- L(y1 + r-'Bx,)(y, + r-'Bx,)'. n
(3)
t=l
Let Qo(~)
= plim Qn(~).
(4)
n~oo
To show the consistency of this extremum estimator, we would verify the two conditions of Proposition 7.1 of the previous chapter. Reproducing these two conditions in the current notation: identification: at ~ 0 ,
Q 0 (~)
is uniquely maximized on the compact parameter space
uniform convergence: QnO converges uniformly in probability to Q 0 0. ln this exercise we prove that the rank condition for identification (that E(x,z;ml be of full column rank for all m) is equivalent to the above extremum estimator identification condition. ln the proof, take for granted all the assumptions we made about FIML in the text.
Examples of Maximum Likelihood
553
To set up the discussion, we need to introduce some new notation. Reproducing the m-th equation of the structural form from the text: Ytm
=
z;m
l)Om
+ Ctm.
(5)
(lxLm) (Lmxl)
The regressors Z1m consist of Mm endogenous variables and Km predetermined variables with Mm + Km = Lm. Define
Sm
: a matrix that selects those Mm endogenous regressors from y,,
(MmxM)
Cm : a matrix that selects those Km predetermined regressors from x,. (Km
x K)
To illustrate, S 1 and C 1 for Example 8.1 withy, = (p 1 • q,)' and x, = (I, a,, w, )' IS
So the regressors in the m-th equation can be written as
'('s'·'c') = y t m : XI m '
2 tm
(a) Show that E(x,z;ml = E(x,x;) ( 11o (KxLm)
(KxK)
S~
: C~ ].
(KxM)(MxMmJ
(6)
(KxKm)
(KxLm)
Hint: Use the reduced form (8.5.9) and the fact that E(x,V,) = 0. Since multiplication by a nonsingular matrix (E(x,x;) here) does not alter rank, equation (6) implies that the rank condition for identification is equivalent to the condition that the rank of (11 0 S~ : C~] be Lm. (b) Show that plim Q(d) = f n~oo
0 I:o(f 0 )' + (11~ + r-'B] E(x,x;l[11~ + r-'BJ', 1
1
(7)
where 11~ = -f 01B0 . Hint: Since {y,, x,} is i.i.d., the probability limit is given by E((y, + r-'Bx,)(y, + r-'Bx,)']. Also, y,
+ r-'Bx,
= v,
+ (11~ + r-'B)x,.
554
Chapter a
(c) Show that IR(~)I is minimized only if rrr~ + B = 0. Hint: Since E(x,x;) is positive definite, the second term on the right-hand side of (7) is zero only if II 0 + B'(f- 1 )' = 0. Use the fact from matrix algebra noted in Exercise 1 above. Therefore, the extremum estimator identification condition (that Q 0 (~) be maximized uniquely at ~ 0 ) holds if and only if (f, B) = (f 0 , B0 ) is the only solution to r II~+ B = 0 viewed as a system of simultaneous equations for (f, B). (d) Let a~ (I x (M + K)) be them-throw of [r : BJ, and let e~ be a (I x (M + K)) vector whose element corresponding to Ytm is one and whose other elements are zero. To illustrate, e'1 for Example 8.1 withy, = (p,, q, )' and x, = (I, a,, w, )' is (0, I, 0, 0, 0). Verify for Example 8.1 the general result that Sm I
am (lx(M+K))
(e) Rewrite
fiT~+
B
0'
= e' m
(MmxM)
m
(!x(Mm+Km))
[
0
(8)
(KmxM)
= 0 as (9)
Show that them-throw of rrr~
+B=
0 can be written as (10)
where (11) (f) Verify that ~m = ~om is a solution to (10). Show that a necessary and sufficient condition that ~m = ~Om is the only solution to (10) is the rank condition for identification for equation m. Hint: Recall the following fact
from linear algebra: Suppose Ax = y has a solution XQ. A necessary and sufficient condition that it is the only solution is that A is of full column rank. (g) Explain why we can conclude that the only solution to r II 0 + B = 0 is (f, B) = (f 0 , B 0 ). Hint: Each element of~ appears in either r orB only once.
555
Examples of Maximum Likelihood
4. (Optional, instruments that do not show up in the system) Consider the complete system of simultaneous equations discussed in Section 8.5. Let
x, = (Kxl)
[I(K~{)x))]. x,K
Suppose that x, K does not appear in any of the M structural equations. Show that dropping x,K from the Jist of instruments changes neither the rank condition for identification nor the asymptotic variance of the FIML estimator of 80 . Hint: Define the two matrices, Sm and Cm as in Exercise 3, so that = (y;s~ : x;c;.,) and (6) holds. FIML is asymptotically equivalent to 3SLS, so its asymptotic variance is given in (4.5.15). The last row of is a vector of zeros. Let 0 0 ((K- I) x M) be the matrix of reduced-form coefficients when x, K is dropped from the system. Then
z;m
c;,
0
= 0
[IlK ~?xM)] 0' (l xM)
References Amemiya, T., 1973, ··Regression Analysis When the Dependent Variable is Truncated Normal," Econometrica, 41,997-1016. - - - · 1985, Advanced Ecorwmetrics, Cambridge: Harvard University Press. Anderson, T. W., and H. Rubin. 1949, "Estimator of the Parameters of a Single Equation in a Complete System of Stochastic Equations." Annals of Mathematical Statistics, 20,46----63. Anderson, T. W.. N. Kunitomo. and T. Sawa, 1982, "Evaluation of the Distribution Function of the Limited Information Maximum Likelihood Estimator," Econometrica, 50. 1009-1027. Davidson, R., and J. MacKinnon. 1993, Estimation and Inference in Et·onometrics, New York: Oxford University Press. Fuller, W.. 19%, Imroduction to Statistical Time Series (2d ed.), New York: Wiley. Hausman, J., 1975, "An Instrumental Variable Approach to Full Information Estimators for Linear and Certain Nonlinear Econometric Models." Econometrica, 43, 727-738. Johansen, S .. 1995, Likelihood-Based I11ference ill Cointegrated Vector Auto-RegreJsive Modeh, New York: Oxford University Press. Judge, G., W. Griffiths. R. Hill, H. Liitkepohl. and T. Lee, 1985, Tlte Theory m1d Practice af Econometrics (2d ed.). New York: Wiley. Magnus, J .. and H. Neudecker, 1988. Matrix Differential Calculm· with Applications in Statistics and Econometrics. New York: Wiley. Newey, W., and D. McFadden. 1994. "Large Sample Estimation and Hypothesis Testing," Chapter 36 in R. Engle and D. McFadden (eds.), Handbook of Econometrics, Volume IV, New York: North-Holland.
Olsen, R. J., 1978, "Note on the Uniqueness of the Maximum Likelihood Estimator for the Tobit Model," Econometrica, 46, 1211–1215.
Orme, C., 1989, "On the Uniqueness of the Maximum Likelihood Estimator in Truncated Regression Models," Econometric Reviews, 8, 217–222.
Pagan, A., 1979, "Some Consequences of Viewing LIML as an Iterated Aitken Estimator," Economics Letters, 3, 369–372.
Sapra, S. K., 1992, "Asymptotic Properties of a Quasi-Maximum Likelihood Estimator in Truncated Regression Model with Serial Correlation," Econometric Reviews, 11, 253–260.
Tobin, J., 1958, "Estimation of Relationships for Limited Dependent Variables," Econometrica, 26, 24–36.
Wooldridge, J., 1994, "Estimation and Inference for Dependent Processes," Chapter 45 in R. Engle and D. McFadden (eds.), Handbook of Econometrics, Volume IV, New York: North-Holland.
CHAPTER 9
Unit-Root Econometrics
ABSTRACT
Up to this point our analysis has been confined to stationary processes. Section 9.1 introduces two classes of processes with trends: trend-stationary processes and unit-root processes. A unit-root process is also called difference-stationary or integrated of order 1 (I(1)) because its first difference is a stationary, or I(0), process. The technical tools for deriving the limiting distributions of statistics involving I(0) and I(1) processes are collected in Section 9.2. Using these tools, we will derive some unit-root tests of the null hypothesis that the process in question is I(1). Those tests not only are popular but also have better finite-sample properties than most other existing tests. In Section 9.3, we cover the Dickey-Fuller tests, which were developed to test the null that the process is a random walk, a prototypical I(1) process whose first difference is serially uncorrelated. Section 9.4 shows that these tests can be generalized to cover I(1) processes with serially correlated first differences. These and other unit-root tests are compared briefly in Section 9.5. Section 9.6, the application of this chapter, utilizes the unit-root tests to test purchasing power parity (PPP), the fundamental proposition in international economics that exchange rates adjust to national price levels.
9.1
Modeling Trends
There is no shortage of examples of economic time series that could reasonably be described as having trends. Figure 9.1 displays the log of U.S. real GDP. Log GDP clearly has an upward trend in that its mean, instead of being constant as with stationary processes, increases steadily over time. The trend in the mean is called a deterministic trend or time trend. For the case of log U.S. GDP, the deterministic trend appears to be linear. Another kind of trend can be seen from Figure 6.3 on page 424, which graphs the log yen/dollar exchange rate. The log exchange rate does not have a trend in the mean, but its every change seems to have a permanent effect on its future values, so that the best predictor of future values is
its current value.

Figure 9.1: Log U.S. Real GDP, Value for 1869 Set to 0

A process with this property, which is not shared by stationary processes, is called a stochastic trend. If you recall the definition of a martingale, it fits this description of a stochastic trend. Stochastic trends and martingales are synonymous. The basic premise of this chapter and the next is that economic time series can be represented as the sum of a linear time trend, a stochastic trend, and a stationary process.¹

Integrated Processes

A random walk is an example of a class of trending processes known as integrated processes. To give a precise definition of integrated processes, we first define I(0) processes. An I(0) process is a (strictly) stationary process whose long-run variance (defined in the discussion following Proposition 6.8 of Section 6.5) is finite and positive. (We will explain why the long-run variance is required to be positive in a moment.) Following, e.g., Hamilton (1994, p. 435), we allow I(0) processes to have possibly nonzero means (some authors, e.g., Stock, 1994, require I(0) processes to have zero mean). Therefore, an I(0) process can be written as

δ + u_t,    (9.1.1)

where {u_t} is zero-mean stationary with positive long-run variance.

¹ A note on semantics. Processes with trends are often called "nonstationary processes." We will not use this term in the rest of this book because a process can fail to be stationary without containing a trend. For example, let ε_t be an i.i.d. process with unit variance and let d_t take a value of 1 for t odd and 2 for t even. Then the process {u_t} defined by u_t = d_t ε_t is not stationary because its variance depends on t. Yet the process cannot reasonably be described as a process with a trend.
The definition of integrated processes follows from the definition of I(0) processes. Let Δ be the difference operator, so, for a sequence {ξ_t},

Δξ_t = (1 − L)ξ_t = ξ_t − ξ_{t−1},  Δ²ξ_t = (1 − L)²ξ_t = (ξ_t − ξ_{t−1}) − (ξ_{t−1} − ξ_{t−2}), etc.    (9.1.2)

Definition 9.1 (I(d) processes): A process is said to be integrated of order d (I(d)) (d = 1, 2, ...) if its d-th difference Δ^d ξ_t is I(0). In particular, a process {ξ_t} is integrated of order 1 (I(1)) if the first difference, Δξ_t, is I(0).

The reason the long-run variance of an I(0) process is required to be positive is to rule out the following definitional anomaly. Consider the process {v_t} defined by

v_t = ε_t − ε_{t−1},    (9.1.3)

where {ε_t} is independent white noise. As we verified in Review Question 5 of Section 6.5, the long-run variance of {v_t} is zero. If it were not for the requirement on the long-run variance, {v_t} would be I(0). But then, since Δε_t = v_t, the independent white noise process {ε_t} would have to be I(1)! If a process {v_t} is written as the first difference of an I(0) process, it is called an I(−1) process. The long-run variance of an I(−1) process is zero (proving this is Review Question 1).

For the rest of this chapter, the integrated processes we deal with are of order 1. A few comments about I(1) processes follow.

• (When did it start?) As is clear with random walks, the variance of an I(1) process increases linearly with time. Thus, if the process had started in the infinite past, the variance would be infinite. To focus on I(1) processes with finite variance, we assume that the process began in the finite past, and without loss of generality we can take the starting date to be t = 0. Since an I(0) process can be written as (9.1.1) and since by definition an I(1) process {ξ_t} satisfies the relation Δξ_t = δ + u_t, we can write ξ_t in levels as

ξ_t = ξ_0 + δ·t + (u_1 + u_2 + ··· + u_t),    (9.1.4)

where {u_t} is zero-mean I(0). So the specification of the levels process {ξ_t} must include an assumption about the initial condition. Unless otherwise stated, we assume throughout that E(ξ_0²) < ∞. So ξ_0 can be random.

• (The mean in I(0) is the trend in I(1)) As is clear from (9.1.4), an I(1) process can have a linear trend, δ·t. This is a consequence of having allowed I(0) processes to have a nonzero mean δ. If δ = 0, then the I(1) process has no trend
and is called a driftless I(1) process, while if δ ≠ 0, the process is called an I(1) process with drift. Evidently, an I(1) process with drift can be written as the sum of a linear trend and a driftless I(1) process.

• (Two other names of I(1) processes) An I(1) process has two other names. It is called a difference-stationary process because the first difference is stationary. It is also called a unit-root process. To see why it is so called, consider the model

(1 − ρL)y_t = δ + u_t,    (9.1.5)

where u_t is zero-mean I(0). It is an autoregressive model with possibly serially correlated errors represented by u_t. If the autoregressive root ρ is unity, then the first difference of y_t is I(0), so {y_t} is I(1).

Why Is It Important to Know if the Process Is I(1)?

For the rest of this chapter, we will be concerned with distinguishing between trend-stationary processes, which can be written as the sum of a linear time trend (if there is one) and a stationary process, on the one hand, and difference-stationary processes (i.e., I(1) processes with or without drift) on the other. As will be shown in Proposition 9.1, an I(1) process can be written as the sum of a linear trend (if any), a stationary process, and a stochastic trend. Therefore, the difference between the two classes of processes is the existence of a stochastic trend. There are at least two reasons why the distinction is important.

1. First, it matters a great deal in forecasting. To make the point, consider the following simple AR(1) process with trend:

y_t = α + δ·t + z_t,  z_t = ρ z_{t−1} + ε_t,    (9.1.6)
where {ε_t} is independent white noise. The s-period-ahead forecast of y_{t+s} conditional on (y_t, y_{t−1}, ...) can be written as

E(y_{t+s} | y_t, y_{t−1}, ...) = E[α + δ·(t + s) | y_t, y_{t−1}, ...] + E(z_{t+s} | y_t, y_{t−1}, ...)
                               = α + δ·(t + s) + E(z_{t+s} | y_t, y_{t−1}, ...).    (9.1.7)

Both (y_t, y_{t−1}, ...) and (z_t, z_{t−1}, ...) contain the same information because there is a one-to-one mapping between the two. So the last term can be written as
E(z_{t+s} | y_t, y_{t−1}, ...) = E(z_{t+s} | z_t, z_{t−1}, ...) = ρ^s z_t,    (9.1.8)

since z_{t+s} = ε_{t+s} + ρ ε_{t+s−1} + ··· + ρ^{s−1} ε_{t+1} + ρ^s z_t. Substituting (9.1.8) into (9.1.7), we obtain the following expression for the s-period-ahead forecast:

E(y_{t+s} | y_t, y_{t−1}, ...) = α + δ·(t + s) + ρ^s z_t = α + δ·(t + s) + ρ^s·(y_t − α − δ·t).    (9.1.9)

There are two cases to consider.

Case |ρ| < 1. If |ρ| < 1, then {z_t} is a stationary AR(1) process and thus is zero-mean I(0). So {y_t} is trend stationary. In this case, since E(z_t²) < ∞, we have E[(ρ^s z_t)²] = ρ^{2s}·E(z_t²) → 0 as s → ∞. Therefore, the s-period-ahead forecast E(y_{t+s} | y_t, y_{t−1}, ...) converges in mean square to the linear time trend α + δ·(t + s) as s → ∞. More precisely,

E{[E(y_{t+s} | y_t, y_{t−1}, ...) − α − δ·(t + s)]²} → 0 as s → ∞.    (9.1.10)

That is, the current and past values of y do not affect the forecast if the forecasting horizon s is sufficiently long. In particular, if δ = 0, then

E(y_{t+s} | y_t, y_{t−1}, ...) →_{m.s.} E(y_t) as s → ∞.    (9.1.11)
That is, a long-run forecast is the unconditional mean. This property, called mean reversion, holds for any linear stationary process, not just for stationary AR(1) processes.² For this reason, a linear stationary process is sometimes called a transitory component.

Case ρ = 1. Suppose, instead, that ρ = 1. Then {z_t} is a driftless random walk (a particular stochastic trend), and y_t can be written as

y_t = (α + z_0) + δ·t + ε_1 + ε_2 + ··· + ε_t,    (9.1.12)
which shows that {y_t} is a random walk with drift, with initial value α + z_0 (a particular I(1), or difference-stationary, process). Setting ρ = 1 in (9.1.9), we obtain

E(y_{t+s} | y_t, y_{t−1}, ...) = δ·s + y_t.    (9.1.13)

² See, e.g., Hamilton (1994, p. 439) for a proof.
That is, a random walk with drift δ is expected to grow at a constant rate of δ per period from whatever its current value y_t happens to be. Unlike in the trend-stationary case, the current value has a permanent effect on the forecast for all forecast horizons. This is because of the presence of the stochastic trend z_t. A stochastic trend is also called a permanent component.
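To make the two cases concrete, here is a minimal sketch (plain Python; the parameter values are hypothetical and not from the text) that evaluates the s-period-ahead forecast formula (9.1.9) for a trend-stationary case (|ρ| < 1) and for the unit-root case (ρ = 1).

```python
def forecast(y_t, t, s, alpha, delta, rho):
    # s-period-ahead forecast from (9.1.9):
    # E(y_{t+s} | y_t, y_{t-1}, ...) = alpha + delta*(t+s) + rho**s * (y_t - alpha - delta*t)
    return alpha + delta * (t + s) + rho ** s * (y_t - alpha - delta * t)

alpha, delta, t, y_t = 1.0, 0.02, 100, 4.5   # hypothetical values
for s in (1, 10, 100):
    trend_line = alpha + delta * (t + s)
    f_stationary = forecast(y_t, t, s, alpha, delta, rho=0.8)  # |rho| < 1: mean reversion
    f_unit_root = forecast(y_t, t, s, alpha, delta, rho=1.0)   # rho = 1: equals y_t + delta*s
    print(s, round(trend_line, 3), round(f_stationary, 3), round(f_unit_root, 3))
# As s grows, f_stationary approaches the trend line, while
# f_unit_root stays exactly delta*s above the current value y_t.
```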
2. The second reason for distinguishing between trend-stationary and difference-stationary processes arises in the context of inference. For example, suppose you are interested in testing a hypothesis about β in a regression equation

y_t = x_t β + u_t,    (9.1.14)

where u_t is a stationary error term. If the regressor x_t is trend stationary, then, as was shown in Section 2.12, the t-value on the OLS estimate of β is asymptotically standard normal. On the other hand, if x_t is difference stationary, that is, if it has a stochastic trend, then the t-value has a nonstandard limiting distribution, so inference about β cannot be made in the usual way. The regression in this case is called a "cointegrating regression." Inference about cointegrating regressions will be studied in the next chapter.

Which Should Be Taken as the Null, I(0) or I(1)?
To distinguish between trend-stationary processes and difference-stationary, or I(1), processes, the usual practice in the literature is to test the null hypothesis that the process is I(1), rather than taking the hypothesis of trend stationarity as the null. Presumably there are two reasons for this. One is that the researcher believes the variables of the equation of interest to be I(1) and wishes to control type I error with respect to the I(1) null. This, however, is not ultimately convincing, because usually the researcher does not have a strong prior belief as to whether the variables are I(0) or I(1). For a concise statement of the problem this poses for inference, see Watson (1994, pp. 2867–2870). The other reason is simply that the literature on tests of the trend-stationary null against the I(1) alternative is still relatively underdeveloped. Several tests of the null hypothesis that the process is I(0), or more generally trend stationary, are available; see contributions by Tanaka (1990), Kwiatkowski et al. (1992), and Leybourne and McCabe (1994), and surveys by Stock (1994, Section 4) and Maddala and Kim (1998, Section 4.5). As of this writing, there is no single I(0) test commonly used by the majority of researchers that has good finite-sample properties. In this book, therefore, we will concentrate on tests of the I(1) null.
Other Approaches to Modeling Trends

In addition to stochastic trends and time trends, there are two other popular approaches to modeling trends: fractional integration and broken trends with an unknown break date. They will not be covered in this book. For a survey of the former approach, see Baillie (1996). The literature on trend breaks is growing rapidly. Surveys can be found in Stock (1994, Section 5) and Maddala and Kim (1998, Chapter 13).

QUESTION FOR REVIEW
1. (Long-run variance of differences of I(0) processes) Let {u_t} be I(0) with Var(u_t) < ∞. Verify that {Δu_t} is covariance stationary and that the long-run variance of {Δu_t} is zero. Hint: The long-run variance of {v_t} (where v_t = Δu_t) is defined to be the limit as T → ∞ of Var(√T·v̄), where √T·v̄ = (v_1 + v_2 + ··· + v_T)/√T = (u_T − u_0)/√T.

9.2
Tools for Unit-Root Econometrics
The basic tool in unit-root econometrics is what is called the "functional central limit theorem" (FCLT). For the FCLT to be applicable, we need to specialize I(1) processes by placing restrictions on the associated I(0) processes. Which restrictions to place, that is, which class of I(0) processes to focus on, differs from author to author, depending on the version of the FCLT in use. In this book, following the practice in the literature, we focus on linear I(0) processes. After defining linear I(0) processes, this section introduces some concepts and results that will be used repeatedly in the rest of this chapter.

Linear I(0) Processes

Our definition of linear I(0) processes is as follows.

Definition 9.2 (Linear I(0) processes): A linear I(0) process can be written as a constant plus a zero-mean linear process {u_t} such that

u_t = ψ(L)ε_t,  ψ(L) = ψ_0 + ψ_1 L + ψ_2 L² + ···  for t = 0, ±1, ±2, ...,    (9.2.1)

{ε_t} is independent white noise (i.i.d. with mean 0 and E(ε_t²) = σ² > 0),    (9.2.2)

Σ_{j=0}^∞ j·|ψ_j| < ∞,    (9.2.3a)

ψ(1) ≠ 0.    (9.2.3b)

The innovation process {ε_t} is assumed to be i.i.d. white noise. This requirement is for simplifying the exposition and can be relaxed substantially.³ In the rest of this chapter, we will refer to linear I(0) processes simply as I(0) processes, without the qualifier "linear." Also, unless otherwise stated, {ε_t} will be an independent white noise process and {u_t} a zero-mean I(0) process.

Condition (9.2.3a) (sometimes called the one-summability condition) is stronger than the more familiar condition of absolute summability. This stronger assumption will make it easier to prove the large-sample results of the next section, with the help of the Beveridge-Nelson decomposition to be introduced shortly. Since {ψ_j} is absolutely summable, the linear process {u_t} is strictly stationary and ergodic by Proposition 6.1(d). The j-th order autocovariance of {u_t} will be denoted γ_j. Since E(u_t) = 0, γ_j = E(u_t u_{t−j}). By Proposition 6.1(c) the autocovariances are absolutely summable.

To understand condition (9.2.3b), recall that the long-run variance (which we denote λ² in this chapter) equals the value of the autocovariance-generating function at z = 1 by Proposition 6.8(b). By Proposition 6.6 it is given by

λ² = long-run variance of {u_t} = γ_0 + 2·Σ_{j=1}^∞ γ_j = σ²·[ψ(1)]².    (9.2.4)

Condition (9.2.3b) thus ensures that the long-run variance is positive.

³ For example, the innovation process can be stationary martingale differences. See, e.g., Tanaka (1996, p. 80).

Approximating I(1) by a Random Walk
Let {ξ_t} be I(1) so that Δξ_t = δ + u_t, where u_t = ψ(L)ε_t is a zero-mean I(0) process satisfying (9.2.1)–(9.2.3) with E(ξ_0²) < ∞. Using the following identity (to be verified in Review Question 1):

ψ(L) = ψ(1) + Δ·α(L),  Δ = 1 − L,  α(L) = α_0 + α_1 L + α_2 L² + ···,  α_j = −(ψ_{j+1} + ψ_{j+2} + ···),    (9.2.5)

we can write u_t as

u_t = ψ(L)ε_t = ψ(1)·ε_t + η_t − η_{t−1},  with η_t = α(L)ε_t.    (9.2.6)
It can be shown (see Analytical Exercise 7) that α(L) is absolutely summable. So, by Proposition 6.1(a), {η_t} is a well-defined zero-mean covariance-stationary process (it is actually ergodic stationary by Proposition 6.1(d)). Substituting (9.2.6) into (9.1.4), we obtain (what is known in econometrics as) the Beveridge-Nelson decomposition:

ξ_t = δ·t + Σ_{s=1}^t [ψ(1)·ε_s + η_s − η_{s−1}] + ξ_0
    = δ·t + ψ(1)·(ε_1 + ε_2 + ··· + ε_t) + η_t − η_0 + ξ_0.    (9.2.7)

Thus, any linear I(1) process can be written as the sum of a linear time trend (δ·t), a driftless random walk or stochastic trend (ψ(1)·(ε_1 + ε_2 + ··· + ε_t)), a stationary process (η_t), and an initial condition (ξ_0 − η_0). Stated somewhat differently, we have

Proposition 9.1 (Beveridge-Nelson decomposition): Let {u_t} be a zero-mean I(0) process satisfying (9.2.1)–(9.2.3). Then

u_1 + u_2 + ··· + u_t = ψ(1)·(ε_1 + ε_2 + ··· + ε_t) + η_t − η_0,

where η_t = α(L)ε_t, α_j = −(ψ_{j+1} + ψ_{j+2} + ···), and {η_t} is zero-mean ergodic stationary.⁴
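As a quick numerical sanity check of Proposition 9.1 (a sketch only, assuming Python with NumPy; the MA(1) example, sample size, and seed are my own choices, not from the text), take u_t = ε_t + ψ_1 ε_{t−1}, for which ψ(1) = 1 + ψ_1 and η_t = −ψ_1 ε_t, and verify that the cumulative sum of u equals ψ(1) times the cumulative sum of ε plus η_t − η_0.

```python
import numpy as np

rng = np.random.default_rng(0)
T, psi1 = 200, 0.5
eps = rng.standard_normal(T + 1)           # eps[0] plays the role of epsilon_0

u = eps[1:] + psi1 * eps[:-1]              # u_t = eps_t + psi1*eps_{t-1}, t = 1,...,T
eta = -psi1 * eps                          # eta_t = alpha(L)eps_t, alpha_0 = -psi1, alpha_j = 0 otherwise
psi_of_1 = 1.0 + psi1

lhs = np.cumsum(u)                                       # u_1 + ... + u_t
rhs = psi_of_1 * np.cumsum(eps[1:]) + eta[1:] - eta[0]   # psi(1)(eps_1+...+eps_t) + eta_t - eta_0
print(np.max(np.abs(lhs - rhs)))           # ~0 up to floating-point error
```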
For a moment, set δ = 0 in (9.2.7) so that {ξ_t} is driftless I(1). An important implication of the decomposition is that any driftless I(1) process is dominated by the stochastic trend ψ(1)·Σ_{s=1}^t ε_s in the following sense. Divide both sides of the decomposition (9.2.7) (with δ = 0) by √t to obtain

ξ_t/√t = ψ(1)·(ε_1 + ε_2 + ··· + ε_t)/√t + [(η_t − η_0 + ξ_0)/√t].    (9.2.8)

Since E(ξ_0²) < ∞ by assumption, E[(ξ_0/√t)²] → 0 as t → ∞. So ξ_0/√t converges in mean square (and hence in probability) to zero. The same applies to η_t/√t and η_0/√t. So the terms in the brackets on the right-hand side of (9.2.8) can be ignored asymptotically. In contrast, the first term has a limiting distribution of N(0, σ²·[ψ(1)]²) by the Lindeberg-Levy CLT of Section 2.1. In this sense, the stochastic trend grows at rate √t. Using (9.2.4), the variance of the changes in the stochastic trend can be written as

Var(ψ(1)·ε_t) = σ²·[ψ(1)]² = λ²,    (9.2.9)

which shows that changes in the stochastic trend in ξ_t have a variance of λ², the long-run variance of {Δξ_t}.

Now suppose δ ≠ 0 so that {ξ_t} is I(1) with drift. As is clear from dividing both sides of (9.2.7) by t rather than √t and applying the same argument just given for the δ = 0 case, the stochastic trend as well as the stationary component can be ignored asymptotically, with ξ_t/t converging in probability to δ. In this sense the time trend dominates the I(1) process in large samples.

⁴ For a more general condition, which does not require the I(0) process {u_t} to be linear, under which u_1 + ··· + u_t is decomposed as shown here, see Theorem 5.4 of Hall and Heyde (1980). In that theorem, {ε_t} is a stationary martingale difference sequence, so the stochastic trend is a martingale.

Relation to ARMA Models
In Section 6.2, we defined a zero-mean stationary ARMA process {u_t} as the unique covariance-stationary process satisfying the stationary ARMA(p, q) equation

φ(L)u_t = θ(L)ε_t,    (9.2.10)

where φ(L) satisfies the stationarity condition. If {ε_t} is independent white noise and if θ(1) ≠ 0, then the ARMA(p, q) process is zero-mean I(0). To see this, note from Proposition 6.5(a) that u_t can be written as (9.2.1) with

ψ(L) = φ(L)⁻¹ θ(L).    (9.2.11)

Since φ(1) ≠ 0 by the stationarity condition and θ(1) ≠ 0 by assumption, condition (9.2.3b) (that ψ(1) ≠ 0) is satisfied. By Proposition 6.4(a), the coefficient sequence {ψ_j} is bounded in absolute value by a geometrically declining sequence, which ensures that (9.2.3a) is satisfied.

Given the zero-mean stationary ARMA(p, q) process {u_t} just described, the associated I(1) process {ξ_t} defined by the relation Δξ_t = u_t is called an autoregressive integrated moving average (ARIMA(p, 1, q)) process. Substituting the relation u_t = (1 − L)ξ_t into (9.2.10), we obtain

φ*(L)ξ_t = θ(L)ε_t,    (9.2.12)

with φ*(L) = φ(L)(1 − L), φ(L) stationary, and θ(1) ≠ 0. One of the roots of the associated autoregressive polynomial equation φ*(z) = 0 is unity, and all other roots lie outside the unit circle. More generally, a class of I(d) processes can be represented as satisfying (9.2.12), where φ*(L) is now defined as

φ*(L) = φ(L)(1 − L)^d.    (9.2.13)

So φ*(z) = 0 now has a root of unity with a multiplicity of d (i.e., d unit roots) and all other roots lie outside the unit circle. This class of I(d) processes is called autoregressive integrated moving average (ARIMA(p, d, q)) processes. The first parameter (p) refers to the order of autoregressive lags (not counting the unit roots), the second parameter (d) refers to the order of integration, and the third parameter (q) is the number of moving average lags. Since φ(L) satisfies the stationarity condition, taking d-th differences of an ARIMA(p, d, q) process produces a zero-mean stationary ARMA(p, q) process. So an ARIMA(p, d, q) process is I(d).
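To illustrate (9.2.4) and (9.2.11) numerically (a sketch under my own choice of ARMA(1,1) parameters, not an example from the text; Python with NumPy assumed), ψ(1) = θ(1)/φ(1) and the long-run variance λ² = σ²[ψ(1)]² can be checked against the sum of autocovariances built from the MA(∞) coefficients:

```python
import numpy as np

phi, theta, sigma2 = 0.6, 0.3, 1.0   # hypothetical ARMA(1,1): (1 - phi*L)u_t = (1 + theta*L)eps_t

# MA(infinity) coefficients psi_j from (9.2.11): psi_0 = 1, psi_j = phi**(j-1)*(phi + theta) for j >= 1
J = 200
psi = np.empty(J)
psi[0] = 1.0
psi[1:] = (phi + theta) * phi ** np.arange(J - 1)

psi_of_1 = psi.sum()                 # should equal theta(1)/phi(1) = (1 + theta)/(1 - phi)
print(psi_of_1, (1 + theta) / (1 - phi))

# Long-run variance via (9.2.4): gamma_0 + 2*sum_{j>=1} gamma_j = sigma2 * psi(1)^2
gamma = np.array([sigma2 * np.sum(psi[:J - j] * psi[j:]) for j in range(J)])
print(gamma[0] + 2 * gamma[1:].sum(), sigma2 * psi_of_1 ** 2)
```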
The Wiener Process

The next two sections will present a variety of unit-root tests. The limiting distributions of their test statistics will be written in terms of Wiener processes (also called Brownian motion processes). Some of you may already be familiar with this from continuous-time finance, but to refresh your memory,

Definition 9.3 (Standard Wiener processes): A standard Wiener (Brownian motion) process W(·) is a continuous-time stochastic process, associating each date t ∈ [0, 1] with the scalar random variable W(t), such that

(1) W(0) = 0;

(2) for any dates 0 ≤ t_1 < t_2 < ··· < t_k ≤ 1, the changes W(t_2) − W(t_1), W(t_3) − W(t_2), ..., W(t_k) − W(t_{k−1}) are independent multivariate normal with W(s) − W(t) ~ N(0, s − t) for s > t (so in particular W(1) ~ N(0, 1));

(3) for any realization, W(t) is continuous in t with probability 1.
Chapter 9
Roughly speaking, the functional central limit theorem (FCLT), also called the invariance principle, states that a Wiener process is a continuous-time analogue
of a driftless random walk. Imagine that you generate a realization of a standard random walk (whose changes have a unit variance) of length T, scale the graph vertically by deflating all its values by ../f, and then horizontally compress this normalized graph so that it fits the unit interval [0, I]. In Panel (a) of Figure 9.2, that is done for a sample size of T = I 0. The sample size is increased to I00 in Panel (b), and then to 1,000 in Panel (c). As the figure shows, the graph becomes increasingly dense over the unit interval. The FCLT assures us that there exists a well-defined limit as T --> oo and that the limiting process is the Wiener process. Property (2) of Definition 9.3 is a mathematical formulation of the property that the sequence of instantaneous changes of a Wiener process is i.i.d. The continuoustime analogue of a driftless random walk whose changes have a variance of rr 2 , rather than unity, can be written as rr W (r ). To pursue the analogy between a driftless random walk and a Wiener process a bit further, consider a "demeaned standard random walk" which is constructed from a standard driftless random walk by subtracting the sample mean. That is, let {~,} be a driftless random walk with Var(Ll~,) = I and define (9.2.14) The continuous-time analogue of this demeaned random walk is a demeaned standard Wiener process defined as W''(r)
= W(r)
-1
1
W(s)ds.
(9.2.15)
(We defined the demeaned series for 1 = 0, I, ... , T - I, to match the dating convention of Proposition 9.2 below. If the demeaned series is defined for 1 I, 2, ... , T, then its continuous-time analogue, too, is (9.2.15).) We can also create from the standard random walk {~~} a detrended series: ~,' - ~~ - &. -
8· 1
(1 = 0. I, ... , T - I),
(9.2.16)
where &. and 8 are the OLS estimates of the intercept and the time coefficient in the regression of~~ on (I, 1) (I = 0, I, ... , T - I). The continuous-time analogue of
0.6 0.4 0.2
t
0 -0.2 -0.4 -0.6 -0.8 -1
(a) T=10
-1.2 2
1.5
1
0.5
0 1
(b) T=100 -0.5 0.8
0.7 0.6
0.5 0.4
0.3 0.2 0.1 0
1
-0.1
(c) T=1000 -0.2
Figure 9.2: Illustration of the Functional Central Limit Theorem
570
Chapter 9
the detrended random walk is a detrended standard Wiener process defined as W'(r)
= W(r)- a-d· r,
1 =1 1
a=
(4- 6s)W(s) ds,
1
d
(-6+ l2s)W(s)ds.
(9.2.17)
The coefficients a and dare the limiting random variables of & and 8, respectively, as T -+ oo (see, e.g., Phillips and Durlauf, 1986, for a derivation). Therefore, if a linear time trend is fitted to a driftless random walk by OLS, the estimated linear time trend 8 converges in distribution to a random variable d; even in large samples 8 will never be zero unless by accident. This phenomenon is sometimes called spurious detrending. A
A Useful Lemma Unit-root tests utilize statistics involving !(0) and !(I) processes. The key results
collected in the following proposition will be used to derive the limiting distributions of the test statistics of the unit-root tests of Sections 9.3-9.4. Proposition 9.2 (Limiting distributions of statistics involving 1(0) and 1(1) variables): Let {$1) be driftless !(I) so that L'l$1 is zero-mean !(0) satisfying (9.2.1)-(9.2.3) and E($J) < 00. Let ).
2
=long-run variance of {L'I$1 ) and y0
= Var(L'I$,).
Then
1'
; 2 -+). 2 · (a) - I 2 . L..)$r-1) W(r) 2 dr, T d o t=I
1 (b)T
L L'l$ $r-1 -+ -W(l) ). 2
T
1
ct
t=I
2
2
Yo. - -
2
Let{$,"} be the demeaned series created from{$,) by the formula (9.2.14). Then I (c) 2
T 2 "'<$,"_ 1) -+
T L. t=I
I (d) T
T
2
). ·
d
). 2
11 o
[WM(r)fdr,
L 1'1$, $,"_1--: z-{[WM(I)]2- [WM(O)J2]- ~. t=I
571
Unit-Root Econometrics
Let
{~,'}
be the detrended series created from {~,} by the formula (9.2. I 6). Then
I T (e) T'l)~,'_,)'
7
{'
).2.
O
t=I
I (f) T
T
Jo [W'(r)J' dr,
A2
I>'>~.~.'-1--: 2{[W'(l)J'- [W'(O)J'} t=l
-7·
The convergence is joint. That is, a vector consisting of the statistics indicated in (a)-(!) converges to a random vector whose elements are the corresponding random variables also indicated in (a)-(!). Part (a), for example, reads as follows: the sequence of random variables {(1/ T) 2
I:;~, (~t-1) 2 } indexed by T converges in distribution to a random variable A2 f W 2 dr. The limiting random variables in the proposition are all written in terms of standard Wiener processes. Note that the same Wiener process W (-) appears in (a) and (b) and that the demeaned and detrended Wiener processes in (c)-(f) are derived from the Wiener process appearing in (a) and (b). So the limiting random variables in (a)-(!) can be correlated. To gain some understanding of this result, suppose temporarily that {~,} is a driftless random walk with Var(t:.~,) = a 2 Given that a Wiener process is the continuous-time analogue, it is not surprising that "I:;~, (~1 _J) 2 " suitably normalized by a power of T becomes "a 2 f W2 ." Perhaps what you might not have guessed is the normalization by T 2 . One way to see why the square of T provides the right normalization, is to recall that E[(~ 1 ) 2 ] = Var(~,) = a 2 · t. Thus, (9.2.18) So the mean of L;~,(~,_J) 2 grows at rate T 2 In order to construct a random variable that could have a limiting distribution, I:;~, (~ 1 _ 1 ) 2 will have to be divided by T 2 • Now suppose that {~,} is a general driftless I (I) process whose first difference is a zero-mean 1(0) process satisfying (9.2.1)-(9.2.3). Then serial correlation in t:.~, can be taken into account by replacing a 2 by A2 (the long-run variance of {L':.~1 }).
This is due to implication (9.2.9) of the Beveridge-Nelson decomposition that a driftless I(I) process is dominated in large samples by a random walk whose changes have a variance of A2 . Put differently, the limiting distribution of L:;~,(~1 - 1 ) 2 /T 2 would be the same if{~,}, instead of being a general driftless
572
Chapter 9
I( I) process with serially correlated first differences, were a driftless random walk whose changes have a variance of A2 For a proof of Proposition 9.2, see, e.g., Stock (1994, pp. 2751-2753). It is a rather mechanical application of the FCLT and a theorem called the "continuous mapping theorem." Since W(l) ~ N(O, 1), the limiting random variable in (b) is (9.2.19) where X ~ x2 ( I) (chi-squared with I degree of freedom). Proof of (b) can be done without the fancy apparatus of the FCLT and the continuous mapping theorem and is left as Analytical Exercise I . QUESTIONS FOR REVIEW
1. (Verifying Beveridge-Nelson for simple cases) Verify (9.2.5) for ljr(L) = I+ ljr 1 L. Do the same for ljr(L) =(I - t/JL)- 1 where 1¢1 < I. 2. (A CLT for 1(0)) Proposition 6.9 is about the asymptotics of the sample mean of a linear process. Verify that the paragraph right below Proposition 9.1 is a proof of Proposition 6.9 with the added assumption of one-summability. 3. (Is the stationary component in Beveridge-Nelson 1(0)?) Consider a zeromean 1(0) process
Verify that the long-run variance of {ry,j defined in (9.2.6) is zero. 4. (Initial values and demeaned values) Let {~,} be driftless 1(1) so that~~ can be written as ~~ = ~o + u 1 + u2 + · ·· + u1 • Does the value of ~o affect the value of ~i where W'} is the demeaned series created from {~~} by the formula (9.2.14)? [Answer: No.] 5. (Linear trends and detrended values) Let ~~ be I (I) with drift, so it can be written as the sum of a linear time trend a+ 8 ·I and a driftless I (I) process. Do the values of a and 8 affect the value of ~ 1' where {~,'} is the detrended series created from {~1 } by the formula (9.2.16)? [Answer: No.]
573
Unit-Root Econometrics
9.3
Dickey-Fuller Tests
In this and the following section, we will present several tests of the null hypothesis that the data generating process (DGP) is !(I). In this section, the I( I) process is a random walk with or without drift, and the tests are based on the OLS estimation of appropriate AR(l) equations. Those tests were developed by Dickey (!976) and Fuller (1976) and are referred to as Dickey-Fuller tests or DF tests. We take the convention that the sample has T + I observations (yo, y 1, ••• , y 7 ), so that the AR(l) equation can be estimated for 1 =I, 2, ... , T. The AR(l) Model
The Dickey-Fuller test statistics are derived from the estimation of the first-order autoregressive model:
v, =
PYt-1
+ £,.
{£,}independent white noise.
(9.3.1)
(The case with intercept will be studied in a moment.) This model includes both I (I) and I(O) processes: if p = I, that is, if the associated polynomial equation I - pz = 0 has a unit root, then !'!.y, = £ 1 and {y,} is a driftless random walk, whereas if IPI < I, the process is zero-mean stationary AR(l). Thus, the hypothesis that the data are generated by a driftless random walk can be formulated as the null hypothesis that p = I. If we take the position that the DGP is either 1(0) or !(I), then -I < p < I. Given this a priori restriction on p, the !(0) alternative can be expressed as p < I. So one-tailed tests will be used to test the null of p = I against the 1(0) alternative. The tests will be based on fj, the OLS estimate of p in the AR(l) equation (9.3.1). Under the I(I) null of p = I, the sampling error is fj- I, whose expression is given by (9.3.2) As will be verified in a moment, the numerator has a limiting distribution if divided by T and the denominator has one if divided by T 2 So consider the sampling error magnified by T:
"T
T·(p-l)=T·Lt~l
!';
YtYt-1
I:,~~
TI
"T Lt=l ~Yt Yt-l I
"T (Yt-l )2 .
T2 Lt=l
(9.3.3)
574
Chapter 9
Deriving the Limiting Distribution under the 1(1) Null
Deriving the limiting distribution ofT· (p- I) under the I (I) null (a driftless random walk) is very straightforward. Since {y,] is driftless I (I), parts (a) and (b) of Proposition 9.2 are applicable with S< = y,. Furthennore, since l'!.y, is independent white noise, A2 (the long-run variance of {l'!.y,j) equals y0 (the variance of l'!.y,). Thus,
IL l'!.y, T
WIT=-
T
V<-1-+
-
d
Yo 2 Yo -W(l) --
2
2
=
WJ,
(9.3.4)
f=l
(9.3.5) As noted in the Proposition, the convergence is joint, so wT = (w 1T. w 2T)' as a vector converges to a 2 x I random vector, w = (w~o w2 )'. Since (9.3.3) is a continuous function of wT (it equals w 1T fw 2T ). it follows from Lemma 2.3(b) that
WI
-+ d
-=
w,
lfW0) 2 -lf Yo.
J0I W(r)' dr (9.3.6)
The test statistic T · (p- I) is called the Dickey-Fuller, or DF, p statistic. Several points are worth noting: • T · (p - 1), rather than the usual ./T · (p - I). has a nondegenerate limiting distribution. The estimator p is said to be superconsistent because it converges to the true value of unity at a faster rate (T).
• The null hypothesis does not specify the values of Var(e,) and y0 (the initial value) of the DGP, yet they do not affect the limiting distribution, the distribution of the random variable D FP. That is, the limiting distribution does not involve those nuisance parameters. So we can use T · (p- I) as the test statistic. This test of a driftless random walk is called the Dickey-Fuller p test. • Note that both the numerator and the denominator involve the same Wiener process, so they are correlated. Since W(l) is distributed as standard nonnal, the numerator of the expression for DFP could be written as (x 2 (1) - I )/2. But this would obscure the point being made here.
575
Unit-Root Econometrics
The usual !-test for the null of p = I, too, has a limiting distribution. It is just a matter of simple algebra to show that the !-value for the null of p = I can be written as (9.3.7)
where s is the standard error of regression: T
s=
"'<
I I L. Yr - py,_J) 2 . TA
(9.3.8)
t=l
It is easy to prove (see Review Question 3) that s 2 is consistent for y0 (= Var(LI.y1)). It is then immediate from this and (9.3.4) and (9.3.5) that l(W(l) 2 - I)
t-;t
JJoiW(r)2dr =DF,.
(9.3.9)
This limiting distribution, too, is free from nuisance parameters. So the t-value can be used for testing, but the critical values for the standard normal distribution cannot be used; they have to come from a tabulation (provided below) of D F,. This test is called the DF t test. We summarize the discussion so far as Proposition 9.3 (Dickey-Fuller tests of a driftless random walk, the case without intercept): Suppose that {y1 } is a driftless random walk (so {LI.yr} is independent white noise) with E(yJ) < oo. Consider the regression of y, on Yr-I (without intercept) fort = I, 2..... T. Then
DF p statistic: T · (p- I) --+ DFp. d
DFt statistic: t --+ DF,, d
where p is the OLS estimate of the y,_ 1 coefficient, t is the t-value for the hypothesis that the y,_ 1 coefficient is 1, and DFp and DF, are the random variables defined in (9.3.6) and (9.3.9). Thus, we have two test statistics, T · (p- I) and the t-value, for the same I(I) null. Comparison of the (generalizations of) these OF tests will be discussed in the next section.
576
Chapter 9
Table 9.1: Critical Values for the Dickey-Fuller p Test Sample size (T)
25 50 100 250 500 00
25 50 100 250 500 00
25 50 100 250 500 00 SOURCE:
0.01 -11.8 -12.8 -13.3 -13.6 -13.7 -13.8 -17.2 -18.9 -19.8 -20.3 -20.5 -20.7 -22.5 -25.8 -27.4 -28.5 -28.9 -29.4
Probability that the statistic is less than entry 0.025 0.05 0.10 0.90 0.95 0.975
-9.3 -9.9 -10.2 -10.4 -10.4 -10.5
Panel (a): T · (p- 1) -7.3 -5.3 1.01 -7.7 -5.5 0.97 -7.9 -5.6 0.95 -8.0 -5.7 0.94 -8.0 -5.7 0.93 -8.1 -5.7 0.93
-14.6 -15.7 -16.3 -16.7 -16.8 -16.9
Panel (b): T · (jY' - I) -12.5 -10.2 -0.76 -13.3 -10.7 -0.81 -13.7 -11.0 -0.83 -13.9 -II. I -0.84 -14.0 -11.2 -0.85 -14.1 -11.3 -0.85
-20.0 -22.4 -23.7 -24.4 -24.7 -25.0
Panel (c): T. (p'- I) -17.9 -15.6 -3.65 -19.7 -16.8 -3.71 -20.6 -17.5 -3.74 -21.3 -17.9 -3.76 -21.5 -18.1 -3.76 -21.7 -18.3 -3.77
0.99
1.41 1.34 1.31 1.29 1.28 1.28
1.78 1.69 1.65 1.62 1.61 1.60
2.28 2.16 2.09 2.05 2.04 2.03
0.00 -0.11 -0.13 -0.14 -0.14
0.65 0.53 0.47 0.44 0.42 0.41
1.39 1.22 1.14 1.08 1.07 1.05
-2.51 -2.60 -2.63 -2.65 -2.66 -2.67
-1.53 -1.67 -1.74 -1.79 -1.80 -1.81
-0.46 -0.67 -0.76 -0.83 -0.86 -0.88
-om
Fuller (1976, Table 8.5.1 ), corrected in Fuller (1996, Table 10.A.1 ).
Panel (a) of Table 9. I has critical values for the finite-sample distribution of the random variable T · (p- I) for various sample sizes. The solid curve in Figure 9.3 graphs the finite-sample distribution of p for T = 100, which shows that the OLS estimate of p is biased downward (i.e., the mean of p is less than 1). These are obtained by Monte Carlo assuming that £, is Gaussian and y 0 = 0. By Proposition 9.3, as the sample size T increases, those critical values converge to the critical values for the distribution of DFP, reported in the row for T = 00. (Since Yo does not have to be zero and £, does not have to be Gaussian for the asymptotic results in Proposition 9.3 to hold, we should obtain the same limiting critical values if the finite-sample critical values were calculated with a different assumption about the initial value y 0 and the distribution of£,.) From the critical values for T = oo,
577
Unit-Root Econometrics
7.0,-----------------------------------------------, 6.0
5.0
cm
4.0
~
m
···.
a. 3.0
2.0
.·· 1.0
0.0 -t-----1""·---=-~0.6 0.7
=..:.:.:..r-·=··""""";.;-~··=::::;:::: ;· ___···.:;o·-~-.·::;;_\,__j 0.8
- - - without intercept or time
1.0
0.9
········· with intercept
·····- with time
Figure 9.3: Finite-Sample Distribution of OLS Estimate of AR( l) Coefficient, T = I 00
we can see that the DFp distribution, too, is skewed to the left, with the 5 percent lower tail critical value of -8.1 (so Prob(DFp < -8.1) = 5%) and the 5 percent upper tail critical value of 1.28. The small-sample bias in p is reflected in the skewness in the limiting distribution ofT· (p- 1). Critical values for DF, (the limiting distribution of the 1-value) are in the row forT = oo of Table 9.2, Panel (a). Like DFP, it is skewed to the left. The table also has critical values for finite-sample sizes. Again, they assume that y 0 = 0 and e, is normally distributed.
Incorporating the Intercept
A shortcoming of the tests based on the AR(l) equation without intercept is its lack of invariance to an addition of a constant to the series. If the test is performed on a series of logarithms (as in Example 9.1 below), then a change in units of measurement of the series results in an addition of a constant to the series, which affects the value of the test statistic. To make the test statistic invariant to an addition of a constant, we generalize the model from which the statistic is derived. Consider the model
y,
=a+ z,, z,
=
pz,_,
+ e,,
{e,} independent white noise.
(9.3.10)
578
Chapter 9
Table 9.2: Critical Values for the Dickey-FuUer t-Test Sample size (T)
25 50 100 250 500 00
25 50 100 250 500 00
25 50 100 250 500 00
SOURCE:
0.01 -2.65 -2.62 -2.60 -2.58 -2.58 -2.58 -3.75 -3.59 -3.50 -3.45 -3.44 -3.42 -4.38 -4.16 -4.05 -3.98 -3.97 -3.96
Probability that the statistic is less than entry 0.025 0.05 0.90 0.95 0.975 0.10
0.99
-2.26 -2.25 -2.24 -2.24 -2.23 -2.23
Panel (a): t -1.95 -1.60 -1.95 -1.61 -1.95 -1.61 -1.95 -1.62 -1.95 -1.62 -1.95 -1.62
0.92 0.91 0.90 0.89 0.89 0.89
1.33 1.31 1.29 1.28 1.28 1.28
1.70 1.66 1.64 1.63 1.62 1.62
2.15 2.08 2.04 2.02 2.01 2.01
-3.33 -3.23 -3.17 -3.14 -3.13 -3.12
Panel (b): t~ -2.99 -2.64 -0.37 -2.93 -2.60 -0.41 -2.90 -2.59 -0.42 -2.88 -2.58 -0.42 -2.87 -2.57 -0.44 -2.86 -2.57 -0.44
0.00 -0.04 -0.06 -0.07 -0.07 -0.08
0.34 0.28 0.26 0.24 0.24 0.23
0.71 0.66 0.63 0.62 0.61 0.60
-3.95 -3.80 -3.73 -3.69 -3.67 -3.67
Panel (c): t' -3.60 -3.24 -1.14 -3.50 -3.18 -1.19 -3.45 -3.15 -1.22 -3.42 -3.13 -1.23 -3.42 -3.13 -1.24 -3.41 -3.13 -1.25
-0.81 -0.87 -0.90 -0.92 -0.93 -0.94
-0.50 -0.58 -0.62 -0.64 -0.65 -0.66
-0.15 -0.24 -0.28 -0.31 -0.32 -0.32
Fuller (1976. Table 8.5.2). corrected in Fuller (1996, Table IO.A.2).
The process {y,} obtains when a constant a is added to a process satisfying (9.3.1). Underthe null of p = I, {z,J is a driftless random walk. Since y, can be written as
y, =a+ zo + eJ
+ e2 + · · · + &,,
{y,} too is a driftless random walk with the initial value of vo = a + z0 • Under the alternative of p < I, {y,] is stationary AR ( 1) with mean a. Thus, the class of I (0) processes included in (9.3.10) is larger than that in (9.3.1 ).
By eliminating {z,] from (9.3.10), we obtain an AR(l) equation with intercept
Yr =a'+ PYr-I
+ e,,
(9.3.11)
579
Unit-Root Econometrics
where
a'= (I- p)a.
(9.3.12)
Since a' = 0 when p = I. the I (I) null (that the DGP is a driftless random walk) is the joint hypothesis that p = I and a' = 0 in terms of the coefficients of regression (9.3.11 ). Without the restriction a' = 0, {y,} could be a random walk with drift. We will develop tests of a random walk with drift in a moment. Until then, we continue to take the null to be a random walk without drift. Let fj" here be the OLS estimate of pin (9.3.11) and t" be the !-value for the null of p = I. It should be clear that a would not affect the value of p" or its OLS standard error; adding the same constant toy, for all t merely changes the estimated intercept. Therefore, the finite-sample as well as large-sample distributions of the DF p" statistic, T · (p" - I), and the DF t" statistic, t", will not depend on a regardless of the value of p. As you will be asked to prove in Analytical Exercise 2, T · (p" - I) converges in distribution to a random variable represented as DF~=
H[W"(l)f- [W"(0)] 2
-
1)
(9.3.13)
Jd [W~'(r)]2 dr
and t" converges in distribution to
DF'(
_ !(\W"(l)] 2 =
-
[W~'(0)] 2 -
JJ [W~'(r)]2 1 0
I)
.
(9.3.14)
dr
Here, W"(-) is a demeaned standard Wiener process introduced in Section 9.2. These limiting distributions are free from nuisance parameters such as a. The basic idea of the proof is the following. First, obtain a demeaned series by subtracting the sample mean from {y, l (or equivalently, as the OLS residuals from the regression of {y,} on a constant). Second, use the Frisch-Waugh Theorem (introduced in Analytical Exercise 4 of Chapter I) to write T · (p"- I) and the !-value in formulas involving the demeaned series. Finally, use Proposition 9.2(c) and (d). Summarizing the discussion, we have Proposition 9.4 (Dickey-Fuller tests of a driftless random walk, the case with intercept): Suppose that {y, l is a driftless random walk (so {t.y, l is independent white noise) with E(yij) < oo. Consider the regression of y, on (I, Yr-l) for t=i,2, ... ,T. Then
DF p~' statistic: T · (p"
-
I) --> DF%, d
580
Chapter 9
DF tM statistic: tM --> DF{', d
where pM is the OLS estimate of the y,_ 1 coefficient, tM is the t-value for the hypothesis that the y,_ 1 coefficient is 1, and DF};' and DF{' are the two random variables defined in (9.3.13) and (9.3.14).
Critical values for DF};' are in Table 9.1, Panel (b), and those for DF{' are in Table 9.2, Panel (b), in rows forT= oo. The distribution is more strongly skewed to the left than in the case without intercept. The table also has critical values for finite-sample sizes based on the assumption that £ 1 is normal. The finite-sample distribution of pM for T = 100 is graphed in Figure 9.3. It shows a greater bias for pM than for p (the OLS estimate without intercept). Unlike in the case without intercept, the finite-sample critical values do not depend on YO· This is because, when p = I, a change in the initial value Yo only adds the same constant toy, for alit, so the values of the DF pM and tM statistics are invariant to y 0 • We started out this subsection by saying that the I( I) null is the joint hypothesis that p = I and a' = 0. Yet the test statistics are for the hypothesis that p = I. For the unit-root tests, this is the practice in the profession; test statistics for the joint hypothesis are rarely used.
Example 9.1 (Are the exchange rates a random walk?): Let y, be the log of the spot yen/$ exchange rate. To make the tests invariant to the choice of units (e.g., yen/$ vs. yen/¢), we fit the AR(I) model with intercept, so Proposition 9.4 is applicable. For the weekly exchange rate data used in the empirical application of Chapter 6 and graphed in Figure 6.3 on page 424, the AR( I) equation estimated by OLS is y,
=
0.162 (0.435)
+
0.9983376 Yr-1, R 2 (0.0019285)
= 0.997,
SER
= 2.824.
Standard errors are in parentheses. Because of the lagged variable y,_ 1, the actual sample period is from the second week rather than the first week of 1975. So T = 777 and
T. (pM- I)= 777 · (0.9983376- I)= -1.29, 0.9983376 - I tM = 0.0019285 = -O. 86 ' Since the alternative hypothesis is that p < I, a one-tailed test is called for. (If you think that the process might be an explosive AR(l) with p > I, you
581
Unit-Root Econometrics
should perform a two-tailed test.) From Table 9.1, Panel (b), for DFff, the (asymptotic) critical value that gives 5 percent to the lower tail is -14.1. That is, Prob(DF~ < -14. I) = 0. 05. The test statistic of -1.29 is well above this critical value, so the null is easily accepted at the 5 percent significance level. Turning to the OF 1 test, from Table 9.2, Panel (b), the 5 percent critical value is -2.86. The 1-value of -0.86 is greater, so the null hypothesis is easily accepted. Incorporating Time Trend
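For readers who want to reproduce this kind of calculation, here is a minimal sketch (not the book's code; the series y is a placeholder you would replace with the log exchange-rate data, and Python with NumPy is assumed) of the DF ρ^μ and t^μ statistics obtained from the AR(1) regression with intercept:

```python
import numpy as np

def df_stats_with_intercept(y):
    """DF rho^mu and t^mu statistics from regressing y_t on (1, y_{t-1})."""
    y = np.asarray(y, dtype=float)
    Y, X = y[1:], np.column_stack([np.ones(len(y) - 1), y[:-1]])
    b = np.linalg.solve(X.T @ X, X.T @ Y)          # OLS: intercept and rho-hat
    e = Y - X @ b
    s2 = e @ e / (len(Y) - 2)
    se_rho = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
    T = len(Y)
    return T * (b[1] - 1.0), (b[1] - 1.0) / se_rho  # DF rho^mu statistic, DF t^mu statistic

# y = np.log(spot_rate)   # placeholder: the weekly yen/dollar series of Example 9.1
# rho_stat, t_stat = df_stats_with_intercept(y)
# Compare with the 5 percent critical values -14.1 (rho^mu) and -2.86 (t^mu) from Tables 9.1 and 9.2.
```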
As was mentioned in Section 9.1, most economic time series appear to have linear time trends. To make the OF tests applicable to time series with trend, we generalize the model once again, this time by adding a linear time trend:
y, =a + 8 · I + z,, z, = pz,_ 1 + e,, [e,] independent white noise. (9.3.15) Here, 8 is the slope of the linear trend which may or may not be zero. Under the null of p = I, [z,J is a driftless random walk, andy, can be written as y,
=a + 8 · 1 + zo +
e1 + e2 + · · · +
e,
=Yo+ 8 ·I+ (e1 + · · · + e,)
(9.3.16)
with Yo = a + zo. So [y,} is a random walk with drift if 8 # 0 and without drift if 8 = 0. Under the alternative that p < I, the process, being the sum of a linear trend and a zero-mean stationary AR( I) process, is trend stationary. By eliminating [z,} from (9.3.15) we obtain y,
=a'+ 8' ·I+ PYt-1 + e,,
(9.3.17)
where a'=(l-p)a+p8, 8'=(1-p)8.
(9.3.18)
Since 8' = 0 when p = I, the 1(1) null (that the OGP is a random walk with or without drift) is the joint hypothesis that p = I and 8' = 0 in terms of the regression coefficients. In practice, however, unit-root tests usually focus on the single restriction p = I. As in the AR(l) model with intercept, the statement of Proposition 9.3 can be readily modified to the AR(l) model with trend. Let p' be the OLS estimate of the pin (9.3.17) and 1' be the 1-value for the null of p = I. As you will be asked to
582
Chapter 9
prove in Analytical Exercise 3, T · (p'
DF'
- I) converges in distribution to
1)
2 l([W'(l)] 2
- [W'(0)] 2 ~ "'--'-'----:';:..___..:---,___:_:__-'-
(9.3.19)
Jd[W'(r)]2dr
P-
and t' converges in distribmion to 1{[W'(l)] 2
-
[W'(0)] 2
-
I)
1 )J0 [W'(r)]2dr
(9.3.20)
Here, W' (-) is a detrended standard Wiener process introduced in Section 9.2. The basic idea of the proof is to use the Frisch-Waugh Theorem to write T · (p'- I) and t' in formulas involving detrended series and then apply Proposition 9.2(e) and (f). Summarizing the discussion, we have Proposition 9.5 (Dickey-Fuller tests of a random walk with or without drift): Suppose that [y,} is a random walk with or without drift with E(yJ) < oo. Considerthe regression of y, on (I, t, YH) fort = I, 2, ... , T. Then
DF p' statistic: T · (/5' - I)
--+
DF;,
(9.3.21)
d
DFt' statistic: t'--+ DF;,
(9.3.22)
d
where p' is the OLS estimate of the y,_ 1 coefficient, t' is the t -value for the hypothesis that the y,_ 1 coefficient is 1, and DF; and DF; are the two random variables defined in (9.3.19) and (9.3.20). Critical values for DF; are in Table 9.1, Panel (c), and those for DF; are in Table 9.2, Panel (c). The table also has critical values for finite sample sizes based on the assumption that e, is normal (the initial value Yo as well as the trend parameters a and 8 need not be specified to calculate the finite-sample distributions because they do not affect the numerical values of the test statistics under the null of p = I). For both the DF p' and t' statistics, the distribution is even more strongly skewed to the left than that for the case with intercept. Now for all sample sizes displayed in the tables, the 99 percent critical value is negative, meaning that more than 99 percent of the time /5' is less than unity when in fact the true value of p is unity! This is illustrated by the chained curve in Figure 9.3, which graphs the finite-sample distribution of /5' for T = I00.
583
Unit-Root Econometrics
Two comments are in order about Proposition 9.5. • (Numerical invariance) The test statistics fj' and I' are invariant to (a, 8) regardless of the value of p; adding a+ 8 ·I to the series {y,} merely changes the OLS estimates of a' and 8' but not fj' or its 1-value. Therefore, the finite-sample as well as large-sample distributions ofT· (fj' - I) and 1' will not depend on (a, 8), regardless of the value of p. • (Should the regression include time?) Since the test statistics are invariant to the value 8 in the DGP, Proposition 9.5 is applicable even when 8 = 0. That is, the p' and I' statistics can be used to test the null of a driftless random walk. However, if you are confident that the null process has no trend, then the OF p" and I" tests of Proposition 9.4 should be used, with the regression not including time, because the finite-sample power against stationary alternatives is generally greater with fj" and 1" than with fj' and 1' (you will be asked to verify this in Monte Carlo Exercise I). On the other hand, if you think that the process might have a trend, then you should use fj' and 1', with time in the regression. If you fail to include time in the regression and use critical values from Table 9.1, Panel (b), and Table 9.2, Panel (b), when indeed the DGP has a trend, then the test will not be correctly sized in large samples (i.e., the probability of rejecting the null when the null is true will not approach the prespecified significance level (the nominal size) as the sample size goes to infinity). This is simply because the limiting distributions of fj" or I" when the DGP has a trend are not DF% or DF/'".
QUESTIONS FOR REVIEW
1. (Superconsistency) Let p be the OLS coefficient estimate as in (9.3.2). Verify that T 1- ' • (fj- I) -->pOforO < ry
Jd
W(r) 2 dr
where'-'= long-run variance of {LI.yr} and y0 = Var(LI.y,).
584
Chapter 9
3. (Consistency of s 2 ) The usual OLS standard error of regression for the AR( l) equation. s. is defined in (9.3.8). If IPI < l. we know from Section 6.4 that s 2 is consistent for Var(e,). Show that. under the null of p = l. s 2 remains consistent for Var(e, ). Since t:.y, = e, when p = l. s 2 is consistent for y0 (= Var(t:.y,)) as well. Hint: Rewrite s 2 as T
s2
=
l I " ' TL_.(t:.y,(p- l)y,_,) 2 A
1=1
2 :::-T---:-1 L(t:.y,)'- T- I . [T. T
(p-
J l)]· T
1=1
+
I T- I
T
L t:.y, y,_, 1=1
. [T. (p- I)],II: .(y,_,) . T
2
A
T'
l=l
Also, E[(t:.y,) 2 ] = y0 . Use the result we have used repeatedly: if xr ~, 0 and YT ~d y, then xr · YT ~, 0. 4. (Two tests for a random walk) Consider two regressions. One is to regress t:.y, on t:. y,_,, and the other is to regress t:. y, on y,_,. To test the null that {y,} is a driftless random walk, which distribution should be used, the 1-value from
the first regression or the 1-value from the second regression? 5.
(Consistency of OF tests) Recall that a test is said to be consistent against a set of alternatives if the probability of rejecting a false null hypothesis when the true OGP is any of the alternatives approaches unity as the sample size increases. LetT· (fj- I) be as defined in (9.3.3). (a) Suppose that {y,} is a zero-mean !(0) process and that its first-order autocorrelation coefficient is Jess than l (so E(y;) > E(y, y,_ 1)). Show that plim T · (p r-oo
-
l) = -oo.
Thus, the OF p test is consistent against general 1(0) alternatives. Hint: Show that, if {y,} is 1(0), •
A
pImp= )
r-oo
E(y,y,_,) E(y,_,) 2
.
(b) Show that the OF I test is consistent against general zero-mean 1(0) alternatives. Hint: s converges in probability to some positive number when {y,} is zero·mean 1(0).
585
Unit-Root Econometrics
6. (Effect of Yo on test statistics) For p~ and r". verify that the initial value y 0 does not affect their values in finite samples when p = I. Does this hold true when p < I? [Answer: No, because the effect of a change in y0 on y, depends on r .] Does the finite-sample power against the alternative of p < I depend on Yo? [Answer: Yes.] 7. (Sargan-Bhargava, 1983, statistic) Basing the test statistic on the OLS coefficient estimator is not the only way to derive a unit-root test. For a sample of (y0 , .VI, ... , YT ), the Sargan-Bhargava statistic is defined as SB=
(a) Verify that it is the reciprocal of the Durbin-Watson statistic. (b) Show that, if {y,} is a driftless random walk, then SB-> d
"T
"T '
. H In!: L,~o }',2 = L,~I Yi-I
t [W(r)]' dr.
Jo
+ YT·2
AI SO, h 'fT' -> p 0 ·
(c) Verify that, under an 1(0) alternative with E[(l'ly,) 2 ] fo 0, SB the test rejects the I(I) null for small values of SB.)
9.4
->p
0. (So
Augmented Dickey-Fuller Tests
The I (I) null process in the previous section was a random walk, which does not allow for serial correlation in first differences. The tests of this section control for serial correlation by adding lagged first differences to the autoregressive equation. They are due to Dickey and Fuller ( 1979) and came to be known as the Augmented Dickey-Fuller (ADF) tests. 'rhe Augmented Autoregression The I (I) null in the simplest ADF tests is that {y,] is ARIMA(p, I, 0), that is, {l'ly,] is a zero-mean stationary AR(p) process: L'ly, ={I l'ly,_I + {2 L'ly,_, + · · · + {p L'ly,_, +£,, E(£?) := a 2 , (9.4.1)
586
Chapter 9
where {£ 1 I is independent white noise and where the associated polynomial equation, I - \IZ- \2z 2- · · ·- \pzP = 0, has all the roots outside the unit circle. (Later on we will allow {L'.y,} to be stationary ARMA(p, q).) The first difference process (L'>y, I is thus an 1(0) process because it can be written as L'>y, = l{r(L)£1 where (9.4.2) The model used to derive the test statistics includes this process as a special case: (9.4.3) Indeed, (9.4.3) reduces to (9.4.1) when p =I. Equation (9.4.3) can also be written as an AR(p Yt = !IYt-I
+ 1):
+ ¢2Yt-2 +,,, + !p+IYt-p-I + £,,
(9.4.4)
where the ¢'s are related to (p, II- ... , \p) by
=
(j = I, 2, ... , p).
(9.4.5)
Conversely, any (p + 1)-th order autoregression can be written in the form (9.4.3). This form (9.4.3) was originally proposed by Fuller (1976, p. 374). It is called the augmented autoregression because it can be obtained by adding lagged changes to the level AR(I) equation (9.3.1 ). It is easy to show (Review Question 2) that the polynomial equation associated with (9.4.4) has a root inside the unit circle if p > I. Thus, if we take the position that the DGP is either I(I) or 1(0), then p cannot be greater than I and the 1(0) alternative is that p < I. Limiting Distribution of the OLS Estimator
Having set up the model, we now tum to the task of deriving the limiting distribution of the OLS estimate of p in the augmented autoregression (9.4.3) under the I (I) null. Numerically the same estimate of p can be obtained from the AR(p + I) equation (9.4.4) by calculating the sum of the OLS estimates of(¢~> ¢2, ... ,
587
Unit-Root Econometrics
To simplify the calculation, consider the special case of p change in y, or AR(2)), so that the augmented autoregression is
I (one lagged
(9.4.6)
Yz = PYz-I +II L'>Yz-1 + e,.
We assume that the sample is (y_ 1, y 0 , ..• , y 7 ) so that the augmented autoregression can be estimated for t = I, 2, ... , T (or alternatively, the sample is (y 1, Y2· ... , YT) but the summations in what follows are from t = 3 to t = T rather than from t = I tot = T). Write this as y, =
x;p + e,
with Xr (2xl)
If b
-=
=
p -
Yz-1 ] [ .6.Yr-1 '
(2xl)-
[p]
II .
(9.4.7)
(p, ~ 1 )' is the OLS estimator of the augmented autoregression coefficients
p <= (p, lil'), the sampling error is given by T
_ 1 T
b-P=(I;x,x;) L;x,-e,. 1=1
(9.4.8)
1=1
The individual terms in (9.4.8) are given by
'"" T
~Xr. E:r
=
[ '\'T L.,.....r=I Yt-1 E:r ]
I= I
T
(9.4.9)
•
Lr=l .6.Yr-I E:r
(2 X I)
We can now apply the same trick we used for the time regression of Section 2.12. For some non singular matrix 'Y" 7 to be specified below, rewrite (9.4.8) as ')" T ·
(b- {J) = ')" T
[!- p] II- II
= Ay 1 Cr,
(9.4.10)
where T
Cr
= Y'T
1
I;xr · E:r. t=I
(9.4.11)
588
Chapter 9
We seek a matrix Y" 7 such that Y" 7 · (b - {J) has a nondegenerate limiting distribution under the I( I) null. This is accomplished by setting (9.4.12) With this choice of the scaling matrix Y" 7 , it follows from (9.4.9) and (9.4.11) that
CT =
[
TI '\'T L,...t=I T ft I:,~~ I
Yr-1
et
]
(9.4.13)
.
Ll.y,_l ,,
Let us examine the elements of A 7 and c 7 and derive their limiting distributions under the null of p = I. The (I, I) element of A 7 • Since [y,} is I( I) without drift, it converges in distribution to A2 • W 2 dr by Proposition 9.2(a) where A2 = long-run variance of {LI.y,}.
J
The (2, 2) element of A 7 • It converges in probability (and hence in distribution) to y0 (= Var(LI.y, )) because {LI.y,}, being stationary AR(l ). is ergodic stationary with mean zero. The off-diagonal elements of Ar. They are I/ ../T times I T
T
L Ll.y,_l Yt-J,
(9.4.14)
1=1
which is the average of the product of zero-mean !(0) and driftless I( I) variables. By Proposition 9.2(b), the similar product (1/T) I:;~I Ll.y, Yt-1 converges in distribution to a random variable. It is easy to show (Review Question 3) that (9.4.14), too, converges to a random variable. So lf../T times (9.4.14), which is the off-diagonal elements of A 7 , converges in probability (hence in distribution) to 0. Therefore, A 7 , the properly rescaled X'X matrix, is asymptotically diagonal: 1
Ar--> d
[!.. 2 -J0 W(r) 2 dr OJ 0
Yo
'
589
Unit-Root Econometrics
so (Ar)-I --->
2 [(A
ri
2
· Jo W(r) dr
)-I
0
d
~~J.
(9.4.15)
Yo
We now turn to the elements of cr. The first element of cr. Using the Beveridge-Nelson decomposition. it can be shown (Analytical Exercise 4) that T
-
I
T
L
Yr-I
e,--->
C1
ct
t=I
I 2 = >"" · J/r(l) · [W(l) 2 -
(9.4.16)
1],
for a general zero-mean 1(0) process D.y, = ljr(L)e,.
(9.4.17)
Jnthepresentcase. D.y, = \ID.Yr-I+e,underthenull. soljr(L) = (l-\ 1L)- 1 and 1/r(l)=
I I - II
(9.4.18)
.
Second element of cr. It is easy to show (Review Question 4) that {D.Yr-I e,} is a stationary and ergodic martingale difference sequence. so I
second element of cr =
T
M'
yT
where y0
=Var(D.y
1)
and
0"
2
L D.y,_I e,---> t=I
=Var(e
c2
~ N(O, Yo·
2
0" ),
(9.4.19)
d
1 ).
Substituting (9.4.15)-(9.4.19) into (9.4.1 0) and noting that the true value of p is unity under the I (I) null, we obtain, for the OLS estimate of the augmented autoregression, 5
5Here, we are using Lemma 2.3(b) to claim that AT 1 CT converges in distribution to A- 1c where A and c are the limiting random variables of AT and c-r. respectively. Note that. unlike in situations encountered for the stationary case, the convergence of AT to A is in distribution. not in prohahility. We have proved in the text that AT converges in distribution to A and cT to c, but we have not shown that (AT. cT) converges in distribution jointly to (A. c). which is needed when using Lemma 2.3(b). However, by. e.g., Theorem 2.2 of Chan and Wei (1988). the convergence is joint.
590
Chapter 9
Thus,
T,
(p-
I) ---+ u21jr(l) d
A2
HW(l)2- I]
A2
or ----,---- · T · (p
Jd W(r) 2 dr
u21fr(l)
-
I) ---+ D F d
P•
(9.4.20) (9.4.21) where DFp is none other than the DF p distribution defined in (9.3.6). Note that the estimated coefficient of the zero-mean 1(0) variable (L'>y,_ 1) has the usual .Jf asymptotics, while that of the driftless I (I) variable (Yr-~) has a nonstandard limiting distribution. Deriving Test Statistics
There are no nuisance parameters entering the limiting distribution of the statistic "'~(IJ · T · (p - I) in (9.4.20), but the statistic itself involves nuisance parameters through "'~(IJ' By the formula (9.2.4) for the long-run variance A2 of {L'>y,), this ratio equals ljr(l), which in tum equals 1 ~ , by (9.4.18) under the null. Now go
1
back to the augmented autoregression (9.4.6). Since f 1 , the OLS estimate of ( 1, is consistent by (9.4.21), a consistent estimator of 1 ~ , is ( I - f 1)- 1 • It then follows 1 from (9.4.20) that (9.4.22) That is, the required correction on T · (p- I) is done through the estimated coefficient of lagged L'>y in the augmented autoregression. After the correction is made, the limiting distribution of the statistic does not depend on nuisance parameters controlling serial correlation of {L'>y,) such as A2 and y0 . The OLS t-value for the hypothesis that p = I, too, has a limiting distribution. It can be written as
591
Unit-Root Econometrics
p-]
t=--c=====~~======= T
s·
(1, I) element of
c~=x,X, l=l
r
I
T. (p-I)
s·
Jo. I) element of (A 7 )
1
(9.4.23) '
where s is the usual OLS standard error of regression. It follows easily from (9.4.15), (9.4.20), the consistency of s 2 for a 2 , and Proposition 9.2(a) and (b) for ~~
= y, that
(9.4.24)
t-+ DF,, d
where DF, is the DF
t
distribution defined in (9.3.9). Therefore, for the r-value,
there is no need to correct for the fact that lagged !:. y is included in the augmented
autoregression. All these results for p = I can be generalized easily:
Proposition 9.6 (Augmented Dickey-Fuller test of a unit root without intercept): Suppose that (y,) is ARIMA(p, I, 0) so that (t:.y,) is a zero-mean stationary AR(p) process following (9.4.1). Consider estimating the augmented autoregression, (9.4.3), and Jet (p, f" f,, ... , fp) be the OLS coefficient estimates. Also lett be the t-value for the hypothesis that p = I. Then . . ADF p statJStlc:
•
T. (p-I) •
•
I -II - 12- ... - \p
-+
(9.4.25)
DFP,
d
(9.4.26)
ADFt statistic: t-+ DF,, d
where DFp and DF, are the two random variables defined in (9.3.6) and (9.3.9). Testing Hypotheses about
1
The argument leading to this surprisingly simple result is lengthy, but it has a by-product. The obvious generalization of (9.4.21) to the p-th order autoregres-
sion is
fT.
- II II12-12 -
\p- \p
-I
0 -+N
0
, a'.
Yo
Y1
Yl
Yo
Yp-1 Yp-2
Yp-1
Yp-2
Yo
d
0
(9 .4 .27)
592
Chapter 9
where y1 is the j-th order autocovariance of {!>y, }. By Proposition 6.7, this is the same asymptotic distribution you would get if the augmented autoregression (9.4.3) were estimated by OLS with the restriction p = I. that is, if (9.4.1) is estimated by OLS. It is easy to show (Review Question 6) that hypotheses involving only the coefficients of the zero-mean I(O) regressors ({ 1 , { 2 , ... , {p) can be tested in the usual way, with the usual I and F statistics asymptotically valid. In deriving the limiting distributions of the regression coefficients, we have exploited the fact that one of the regressors is zero-mean I(O) while the other is driftless !(I). A systematic treatment of more general cases can be found in Sims, Stock, and Watson (1990) and Watson (1994, Section 2). What to Do When p Is Unknown?
Proposition 9.6 assumes that the order of autoregression, p, for t>y, is known. What can we do when the order p is unknown? To distinguish between the true order p and the number of lagged changes included in the augmented autoregression, we denote the latter by p in this section. We have considered a similar problem in Section 6.4. The difference is that the DGP here is a stationary process in first differences, not in levels as in Section 6.4, and that the estimation equation is an augmented autoregression with p freely estimated. We display below, without proof, three large-sample results about the choice of p under which the conclusion of Proposition 9.6 remains valid. These results are applicable to a class of processes more general than is assumed in Proposition 9.6: {!>y,} is zero-mean stationary and invertible ARMA(p, q) of unknown order (albeit with an additional assumption on £ 1 that its fourth moment is finite) 6 So, written as an autoregression, the order of autoregression for !>y1 can be infinite, as when q > 0, or finite, as when q = 0. The first result is that the conclusion of Proposition 9.6 continues to hold when p, the number of lagged first differences in the augmented autoregression, is increased with the sample size at an appropriate rate. (I) (Said and Dickey, 1984, generalization of Proposition 9.6) Suppose that
p
satisfies
p --+
oo but
as T
----*
oo.
(9.4.28)
(That is, p goes to infinity but at a rate slower than T '1 3 .) Then the two statistics, the ADF p and ADF 1, based on the augmented autoregression with 6See Section 6.2 for the definition of invertible ARMA(p, q) processes. If an ARMA(p. q) process is invertible, then it can be wriuen as an infinite-order autoregression.
593
Unit· Root Econometrics
p lagged changes,
have the same limiting distributions as those indicated in
Proposition 9.6. This result, however, does not give us a practical guide for the lag length selection because there are infinitely many rules for p satisfying (9.4.28). A natural question is whether any of the rules for selecting the lag length considered in Section 6.4- the general-to-specific sequential I rule or the information-criterionbased rules-can be used in the present context. To refresh your memory, the information-criterion-based rule is to set p to the j that minimizes log (
SSR·) C(T) T + (j +I) x ----;;:--,
(9.4.29)
where SSRi is the sum of squared residuals from the autoregression with j lags: t.y, = P Yr-1
+ ~1 t.y,_, + {2 t..vr-2 + ·· · + ~j t.y,_j + E:,.
(9.4.30)
The term C(T)/T is multiplied by j +I, because the equation has j +I coefficients including the lagged y, coefficient. For the Akaike information criterion (AIC), C(T) = 2, while for the Bayesian information criterion (BlC), also called Schwartz information criterion (SIC), C(T) = log(T). In either rule, pis selected from the set j = 0, I, 2, . , . , Pmax· In Section 6.4, this upper bound Pm= was set to some known integer greater than or equal to the true order of the finite-order autoregressive process. The p selected by either of these rules is a function of data (not just a function of T), and hence is a random variable. To be sure, when {t.y,} is stationary and invertible ARMA(p, q), the autoregressive order is infinite when q > 0 and the upper bound Pm= obviously cannot be set to the true order, But it can be made to increase with the sample size T. To emphasize the dependence of Pmax on T, write it as Pmax
7This is Theorem 5.3 of Ng and Perron (1995). Their conditionAl is (9.4.28) here. Their condition A2 11 is implied by the second condition indicated here.
594
Chapter 9
(3) Suppose that pis selected by AIC or by BIC with p,..x(T) satisfying condition (9.4.28). Then the same conclusion holds 8 Thus, no matter which rule-the sequential t, AIC, or BIC-for selecting p you employ, the large-sample distributions of the ADF statistics are the same as m Proposition 9.6. A Suggestion for the Choice of Pmax(T)
The finite-sample distribution, however, depends on which rule you use and also on the choice of the upper bound p"""'(T), and there are infinitely many valid choices for the function p"""'(T). For example, p"""'(T) = [T 114 ] (the integer part ofT 114 ) satisfies the conditions in result (2) for the sequential t-test. So does Pm=(T) = [IOOT 3i 10 ]. It is important, therefore, that researchers use the same Pm=(T) when deciding the order of the augmented autoregression. In this context, the Monte Carlo literature examining the small-sample properties of various unitroot tests is relevant. Simulations run by Schwer! (1989) show that in small and moderately large samples (from T = 25 to I ,000), including enough lags in the autoregression is important to minimize size distortions? The choice of p (the number of lags included in the autoregression) that was more or less successful in 114 controlling the actual size in Schwert's study is [12 · C~) J. The upper bound Pm=(T) should therefore give lag selection rules a chance of selecting a p as large as this. Therefore, in the examples and application of this chapter, we use the function
1 ~f ]
4
4
Pm=(T) = [12 · (
(integer part of 12 · (
1 ~f )
(9.4.31)
in any of the rules for selecting the lag length. This function satisfies the conditions of results (2) and (3) above. The sample period when selecting pis t = p"""'(T) + 2, p"""'(T) + 3, ... , T. The first t is Pm=(T) + 2 because Pm=(T) + I observations are needed to calculate Pm=(T) lagged changes in the augmented autoregression. Since only T Pm=(T) - 1 observations are used to estimate the autoregression (9.4.30) for j = I, 2, ... , p"""'(T), the sample size in the objective function in (9.4.29) should be
8This is Theorem 4.3 of Ng and Perron (1995). The theorem does not indicate the upper bound PmiU(T), but (as pointed out by Pierre Perron in private communications with the author) Hannan and Deistler (1988, Section 7.4 (ii)) implies that (9.4.28) can be an upper bound. 9 Recall that a size distortion refers to the difference between the actual or elCact size (the finite-sample probability of rejecting a true null) and the nominal size (the probability of rejecting a true null when the sample size
is infinite),
595
Unit-Root Econometrics
T- Pmax(T)- I rather than T: SSRj ) ( log T - PmaoJT) - I
.
+ (] +
C(T - Pmax(T) - I) I) x T - Pmax(T) - I .
(9.4.32)
Including the Intercept in the Regression As in the DF tests, we can modify the ADF tests so that they are invariant to an addition of a constant to the series. We allow [y,] to differ by an unspecified constant from the process that follows augmented autoregression without intercept (9.4.3). This amounts to replacing they, in (9.4.3) by y, -a, which yields y,
=a'+ PY<-1 + ~~
l>y,_l
+ ~2 t.y,_2 + · · · + ~P t.y,_P + &,
(9.4.33)
where a' = (I - p)a. As in the AR(l) model with intercept, the 1(1) null is the joint hypothesis that p = I and a' = 0. Nevertheless, unit-root tests usually focus on the single restriction p = I. It is left as an analytical exercise to prove Proposition 9.7 (Augmented Dickey-Fuller test of a unit root with intercept): Suppose that [y,) is an ARIMA(p, I, 0) process so that [t.y,) is a zero-mean stationary AR(p) process following (9.4.1). Consider estimating the augmented autoregression with intercept, (9.4.33), and Jet(&,/)". f~o be the OLS coefficient estimates. Also Jet t" be the 1-value for the hypothesis that p = I. Then
f2 .... , frl
,, .. ADF p statJStJc:
T · (p'" - I) • •
•
(9.4.34)
ADFt 1' statistic: t 1' --> DF/',
(9.4.35)
I -
~I -
~2 -
... -
~p
d
where the random variables DF% and DF/' are defined in (9.3.13) and (9.3.14). The basic idea of the proof is to use the Frisch-Waugh Theorem to write /)" and the t-value in formulas involving the demeaned series created from [y, ]. Before turning to an example, two comments about Proposition 9.7 are in order. • (Numerical invariance) As was true in Proposition 9.4, since the regression includes a constant, the test statistics are invariant to an addition of a constant to the series. • (Said-Dickey extension) The Said-Dickey-Ng-Perron extension is also applicable here: if (t.y,] is stationary and invertible ARMA(p, q) (so t.y, can be
596
Chapter 9
written as a possibly infinite-order autoregression), then the ADF p 1' and t 1' statistics have the limiting distributions indicated in Proposition 9.7, provided that p (the number of lagged changes to be included in the autoregression) is set by the sequential t rule or the AIC or BIC rules. 10 Example 9.2 (ADF on T-bill rate): In the empirical exercise of Chapter 2, we remarked rather casually that the nominal interest rate R, might have a trend for the sample period studied by Fama (1975). Here we test whether it has a stochastic trend (i.e., whether it is driftless 1(1)). The data set we used in the empirical application of Chapter 2 was monthly observations on the U.S. one-month Treasury bill rate. We take the sample period to be from January 1953 to July 1971 (T = 223 observations), the sample period of Fama's (1975) study. The interest rate series does not exhibit time trend. So we include a constant but not time in the autoregression. The maximum lag length Pm=(T) is set equal to 14 according to the function (9.4.31). In the process of choosing the actual lag length p, we fix the sample period from t = Pm=(T) + 2 = 16 (April 1954) to t = 223 (July 1971), with a sample size of208 (= T- Pm=(T)- 1). The sequential t rule picks a p of 14. The objective function in the information-criterion-based rule when the sample size is fixed is (9.4.32). The AlC picks p = 14, while the BlC picks p = I. Given p = I, we estimate the augmented autoregression on the maximum sample period, which is t = p +2, ... , T (from March 1953 to July 1971) with 221 observations. The estimated regression is R, = 0.088 + 0.97705 (0.057) (0.0162)
R,_, -
0.20 L'.R,_" (0.067)
R2 = 0. 944, sample size = 221. From this estimated regression, we can calculate _ ADF p~ = 221
0.97705- I X
I+ 0.20
= -4.22, t 1'
0.97705- I ---,---,-,--,----- = - 1.4 2. 0.0162
From Table 9.1, Panel (b), the 5 percent critical value for the ADF p 1' statistic is -14.1. The 5 percent critical value for the ADF t~ statistic is -2.86 from Table 9.2, Panel (b). For either statistic, the 1(1) null is easily accepted.
10 Ng and Perron (1995) proved this result about the selection of lag length only for the ca~e without intercept. That it can be extended to the case with intercept and also to the case with trend was confinned in a private communication with one of the authors of the paper.
597
Unit-Root Econometrics
Incorporating Time Trend Allowing for a linear time trend, too, can be done as in the DF tests. Replacing the y, in augmented autoregression (9.4.3) by y, - a - 8 · 1, we obtain the following augmented autoregression with time trend: y, = a• + 8• ·I+ PY<-I +{I t'.y,_I + {2 t'.y,_2 + · · · + {p t'.y,_P + &,,
(9.4.36) where a• =(I- p)a
+ (p- {I- {2- · · · - {p)8,
8• = ( I - p)8.
(9.4.37)
Since 8• = 0 when p = I, the 1(1) null (that the process is 1(1) with or without drift) implies that p = I and 8• = 0 in (9.4.36), but, as usual, we focus on the single restriction p = I. Combining the techniques used for proving Propositions 9.5 and 9.7, it is easy to prove
Proposition 9.8 (Augmented Dickey-Fuller Test of a unit root with linear time trend): Suppose that (y,] is the sum of a linear time trend and an ARIMA(p, I, 0) process so that (t'.y,) is a stationary AR(p) process whose mean may or may not be zero. Consider estimating the augmented autoregression with trend, (9.4.36), and Jet (a, 8, p', {I> {2 , ... , {p) be the OLS coefficient estimates. Also Jet r' be the 1-va/ue for the hypothesis that p = I. Then ADF p' statistic:
I
T · (jj' - I) " _ -{I -
(2 -
... -
" ~ DF;, (p
(9.4.38)
d
ADF 1' statistic: 1' -> DF,',
(9.4.39)
d
where the random variables DF; and DF,' are defined in (9.3.19) and (9.3.20). Before turning to an example, three comments are in order. • (Numerical invariance to trend parameters) As in Proposition 9.5, since the regression includes a constant and time, neither a nor 8 affects the OLS estimates of (p, {I, ... , {p) and their standard errors, so the finite-sample as well as large-sample distribution of the ADF statistics will not be affected by (a, 8). • (Said-Dickey extension) Again, the Said-Dickey-Ng-Perron extension IS applicable here: if (t'.y,] is stationary and invertible ARMA(p, q) with possibly a nonzero mean, then the ADF p' and t' statistics have the limiting distributions
59B
Chapter 9
indicated in Proposition 9.8, provided that jj is set by the sequential1-rule or the AIC or BIC rules. • (Should the regression include time?) The same comment we made about the choice between the OF tests with or without trend is applicable to the ADF tests. If you are confident that the null process has no trend, then the ADF p" and 11' tests of Proposition 9.7 should be used, with the augmented autoregression not including time, because the finite-sample power of the ADF tests is generally greater without time in the augmented autoregression. On the other hand, if you think that the process might have a trend, then you should use the ADF p' and I' tests of Proposition 9.8 with time in the augmented autoregression. Example 9.3 (Does GDP have a unit root?): As was shown in Figure 9.1, the log of U.S. GOP clearly has a linear time trend, so we include time in the augmented autoregression. Using the logarithm of U.S. real GOP quarterly data from 1947:QI to 1998:Ql (205 observations), we estimate the augmented autoregression (9.4.36). As in Example 9.2, we set Pm=(T) = [12. (205/100) 114 ] (which is 14). Also as in Example 9.2, the sample period is fixed (1 = Pm=(T) + 2, ... , T) in the process of selecting the number of lagged changes. The sequential I picks jj = 12, while both A1C and B1C pick jj = I . Given jj = I, the augmented autoregression estimated on the maximal sample of 1 = jj + 2, ... , T (1947:Q3 to 1998:Ql) is y, =
0.22 (0.10)
+
0.00022 · I (0.00011) R2
= 0.999,
+
0.9707 y,_, (0.0137)
sample size
+
0.348 L'.yH, (0.066)
= 203.
From this estimated regression, we can calculate , Q 0.9707388 - I I_ _ = -9.11, AD F p = 2 3 x 0 348 I'=
0.9707388 - I = -2.13. 0.0137104
From Table 9.1, Panel (c), the 5 percent critical value for the ADF p' statistic is -21.7. The 5 percent critical value for the ADF 1' statistic is -3.41 from Table 9.2, Panel (c). For either statistic, the 1(1) null is easily accepted. A little over 50 years of data may not be enough to discriminate between the 1(1) and 1(0) alternatives. (As noted by Perron (1991 ), among others, the span of the data rather than the sampling frequency, e.g., quarterly vs.
599
Unit-Root Econometrics
annual, is more important for the power of unit-root tests, because I - p is much smaller on quarterly data than on annual data.) We now use annual GDP data from 1869 to 1997 for the ADF tests." SoT = 129. As before, we set Pmox(T) = [12 · (129/100) 114 ] (which is 12) and fix the sample period in the process of choosing the lag length. The sequential t picks 9 for p, while the value chosen by AIC and BIC is I. When p = I, the regression estimated on the maximum sample oft= fj + 2, ... , T (from 1871 to 1997, 127 observations) is
y, =
0.67 (0.18)
+
0.0048 · t (0.0014)
R 2 = 0.999,
+
0.85886 Yt-1 (0.03989)
+
0.32 t. Vt-1, (0.086)
sample size = 127.
The ADF p' and t' statistics are -26.2 and -3.53, respectively, which are less (greater in absolute value) than their respective 5 percent critical values of -21.7 and -3.41. With the annual data covering a much longer span of time, we can reject the 1(1) null that GDP has a stochastic trend. The impression one gets from Examples 9.1-9.3 is the ease with which the 1(1) null is accepted, which leads to the suspicion that the DF and ADF tests may have low power in finite samples. The power issue will be examined in the next section. Summary of the OF and ADF Tests and Other Unit-Root Tests The various 1(1) processes we have considered in this and the previous section satisfy the model (a set of DGPs)
y, =
d,
+ z,, z,
=
pz,_ 1 + u,,
{u,)
~zero-mean
1(0).
(9.4.40)
The restrictions on (9.4.40), besides the restriction that p = I, that characterize the I( I) null for each of the cases considered are the following. (I) Proposition 9.3: d, = 0, {u,] independent white noise,
(2) Proposition 9.4: d, =a, {u,} independent white noise, (3) Proposition 9.5: d, =a
+ 8 · t, {u,} independent white noise,
(4) Said-Dickey extension of Proposition 9.6: d, = 0, {u,} is zero-mean ARMA(p, q), II Annual GDP data from 1929 are available from the Bureau of Economic Ana1ysis web site. The BalkeGordan GNP data (Balke and Gordon, 1989) were used to ell:lrapolate GDP back to 1869.
600
Chapter 9
(5) Said-Dickey extension of Proposition 9.7: d, =a, {u,) is zero-mean ARMA(p,q), (6) Said-Dickey extension of Proposition 9.8: d, = ARMA(p,q).
a+ 8 · t,
{u 1 } is zero-mean
QUESTIONS FOR REVIEW
1. (Non-uniqueness of the decomposition between zero-mean !(0) and driftless !(I)) Assume p = 2 in (9.4.4). Show that the third-order autoregression can be written equivalently as y, =a, · y,_,
+a, · (y,_,
- y,_,)
+a,· (y,_,
- y,_,)
+ '•·
How are (a 1 , a 2 , a 3 ) related to (¢ 1 , ¢>2, ¢3)? Which regressor is zero-mean 1(0) and which one is driftless !(!)?Verify that the OLS estimate of they,_, coefficient from this equation is numerically the same as the OLS estimate of p from (9.4.3) with p = 2. 2. Let >(z) = I - ¢ 1z- · · ·-
'i'·
3. Prove that (9.4.14) -+d A2 W(l) 2 + Hint: The asymptotic distribution of } L;~, t>y,_, y,_, is the same as that of } L;~, t>y, y,. Also, t>y, y, = t>y, y,_, + (t>y, ) 2 • Use Proposition 9.2(b). 4. Let {l>y,] be a zero-mean 1(0) process satisfying (9.2.1)-(9.2.3). (The zeromean stationary AR(p) process for {t>y,] considered in Proposition 9.6 is a special case of this.) (a) Show that {l>yt-Jotl is a stationary and ergodic martingale difference sequence. Hint: Since {8 1 ] and {t>y,} are jointly ergodic stationary, {t>y,_ 1· o,} is ergodic stationary. To show that {t>y,_,,,] is a martingale difference sequence, use the Law of Iterated Expectations. {8 1 _ 1, 8 1 _ 2 , •.. ] has more information than {t>y,_ 2 ,,_ 1, t>y,_3 ,,_,, ... }. (b) Prove (9.4.19). Hint: You need to show that E[(t>y,_ 1 o1) 2 ] = Yo.
<J 2
601
Unit-Root Econometrics
5. Verify (9.4.24). Hint: From (9.4.20) and (9.2.4), ,
!{W(l) 2
I
- I) T · (p - I) -> - - .o..'--:7'---'d 1/r( I) W'dr .
Jd
Also, by (9.4.15) and (9.2.4), s-J(I, l)elementof(Ar)- 1
--;;
I/)t(l) 2 Jd W 2 dr.
(9.4.41)
Since ljr(l) > 0 by the stationarity condition, ./1/r(l) 2 = ljr(l). 6. (Hest on 0 It is claimed in the text (right below (9.4.27)) that the usual 1value for each I; is asymptotically standard normal. Verify this for the case of p = I. Hint: (9.4.27) for p = I is ' - 1; )-> N(O. a 2 /Yo). v"" 1 · (1; 1 1 d
You need to show that v'f times the standard error of { 1 converges in probability to the square root of a 2 jy0 . v'f times the standard error of { 1 is s · (2. 2) element of (Ar) 1•
J
9.5 Which Unit-Root Test to Use? Besides the OF and ADF tests, a number of other unit-root tests are available (see Maddala and Kim, 1998, Chapter 3, for a catalogue). The most prominent among them are the Phillips (1987) test for case (4) (mentioned at the end of the previous section) and the Phillips-Perron (1988) test, a generalization of the Phillips test to cover cases (5) and (6). (The Phillips tests are derived in Analytical Exercise 6.) Their tests are based on the OLS estimate of the y,_ 1 coefficient in an AR(l) equation, rather than an augmented autoregression, with the long-run variance of {t.y,) estimated from the residuals of the AR(l) equation. We did not present them in the text because the finite-sample properties of the tests are rather poor (see below). There is a new generation of unit-root tests with reasonably low size distortions and good power. They include the ADF-GLS test of Elliott, Rothenberg, and Stock (I 996) and the M -tests of Perron and Ng (1996).
602
Chapter 9
Local-to-Unity Asymptotics
These and most other unit-root tests are consistent against all fixed 1(0) alternatives.12 Which consistent test should be chosen over the other? Recall that in Section 2.4 we defined the asymptotic power against a sequence of local alternatives that drift toward the null at rate ./'f. lt turns out that in the unit-root context the appropriate rate is T rather than .ff (see, e.g., Stock, 1994, pp. 2772-2773). That is, the probability of rejecting the null of p = l when the true DGP is
c
P =I+T
(9.5.1)
has a limit between 0 and l as T goes to infinity. This limit is the asymptotic power. The sequence of alternatives described by (9.5.1) is called local-to-unity alternatives. By definition, the asymptotic power equals the nominal size when c = 0. It can be shown that the DF or ADF p tests are more powerful than the DF or ADF t tests against local-to-unity alternatives (see Stock, 1994, pp. 27742776). That is, for any nominal size and c, the asymptotic power of the DF or ADF p statistic is greater than that of the DF or ADF t-statistic. On this score, we should prefer the p-based test to the t-based test, but the verdict gets overturned when we examine their finite-sample properties. Small-Sample Properties
Turning to the issue of testing an I( l) null against a fixed, rather than drifting, 1(0) alternative, the two desirable finite-sample properties of a test are a reasonably low size distortion and high power against 1(0) alternatives. There is a large body of Monte Carlo evidenoe on the finite-sample properties of the DF, ADF, and other unit-root tests (for a reference, see the papers cited on p. 2777 of Stock, 1994, and see Ng and Perron, 1998, for a comparison of the ADF-GLS and the M -tests). Major findings are the following. • Virtually all tests suffer from size distortions, particularly when the null is in a sense close to being 1(0). TheM-test generally has the smallest size distortion, followed by the ADF t. The size distortion of the ADF p is much larger than that of the ADF t. For example, the I( I) null considered by Schwer! (1989) is
y,- Yr-1 = cr When
+ e,,_,, (c,} Gaussian i.i.d.
(9.5.2)
e is close to -1, this l(l) process is very close to a white noise process,
12 That the DF tests are consistent against general 1(0) alternatives was verified in Review Question 5 to Section 9.3. For the consistency of the ADF tests, see, e.g., Stock (1994, p. 2770).
603
Unit-Root Econometrics
because the AR root of unity nearly equals the MA root of -e. Simulations by Schwert (1989) show that the ADF p and Phillips-Perron Zp and Z, tests (see Analytical Exercise 6) reject the I (I) null far too often in finite samples when e is close to -I. In particular, for the Phillips-Perron tests, the actual size when the nominal size is 5 percent is more than 90 percent even for sample sizes as large as I ,000 when e = -0.8. • In the case of ADF tests, how the lag length p is selected affects the finitesample size and power. For a given rule for choosing fj, the power is generally higher for the ADF p-test than for the ADF !-test. • The power is generally much higher for the ADF-GLS and theM-tests than for the ADF t- and p-tests. Unlike the M -statistic, the ADF-GLS statistic can be calculated easily by standard regression packages. In the empirical exercise of this chapter, you will be asked to use the ADF-GLS to test an I (I) null.
9.6
Application: Purchasing Power Parity
The theory of purchasing power parity (PPP) states that, for any two countries, the currency exchange rate equals the ratio of the price levels of the two countries. Put differently, once converted to a common currency, national price levels should be equal. Let P, be the price index for the United States, P." be the price index for the foreign country in question (say, the United Kingdom), and S, be the exchange rate in dollars per unit of the foreign currency. PPP states that P, = S, · P,*,
(9.6.1)
or taking logs and using lower-case letters for them, 13
Pr = s,
+ p;.
(9.6.2)
PPP is related to but different from the law of one price, which states that (9.6.1) holds for any good, with P, representing the dollar price of the good in question 13 Equation (9.6.1) or (9.6.2) is a statement of "absolute PPP.'' What is called relative PPP requires only that the first difference of (9.6.2) hold, that is, the rate of growth of the exchange rate should offset the differential between the rate of growth in home and foreign price indexes. Relative PPP should not be confused with the weaker version of PPP to be introduced below.
604
Chapter 9
and P,' the foreign currency price of the good. If international price arbitrage works smoothly, the law of one price should hold at every date r. If the law of one price holds for all goods and if the basket of goods for the price index is the same between two countries, then PPP should hold. Due to a variety of factors limiting international price arbitrage, PPP does not hold exactly at every date r. 14 A weaker version of PPP is that the deviation from PPP Zr
=
Sr - Pr
+ P1• •
(9.6.3)
which is the real exchange rate, is stationary. Even if it does not enforce the law of one price in the short run, international arbitrage should have an effect in the long run. The Embarrassing Resiliency of the Random Walk Model?
For many years researchers found it difficult to reject the hypothesis that real exchange rates follow a random walk under floating exchange regimes. As Rogoff (1996) notes, this was somewhat of an embarrassment because every reasonable theoretical model suggests the weaker version of PPP. However, studies since the mid 1980s looking at longer spans of data have been able to reject the random walk null. Recent work by Lothian and Taylor (1996) is a good example. It uses annual data spanning two centuries from 1791 to 1990 for dollar/sterling and franc/sterling real exchange rates to find that the random walk null can be rejected. In what follows, we apply the ADF r-test to their data on the dollar/sterling (dollars per pound) exchange rate. The dollar/sterling real exchange rate, calculated as (9.6.3) and plotted in Figure 9.4, exhibits no time trend. We therefore do not include time in the augmented autoregression. If we were dealing with the weak version of the law of one price, with P, and P,' denoting the home and foreign prices of a particular good, then we could exclude the intercept from the augmented autoregression because a change in units of measuring the good does not affect z,. But P, and P,' here are price indexes, and a change in the base year leads to adding a constant to z,. To make the test invariant to such changes, a constant is included in the augmented autoregression. By applying the ADF test, rather than the DF test, we can allow for serial correlation in the first differences of z, under the null, but the lag length chosen by BIC (with Pmm(T) = [12 · (200/I
605
Unit-Root Econometrics
0.5 0.4 0.3 0.2 0.1 0 ·0.1 ·0.2 -0.3 -0.4 -0.5 1791
1811
1831
1851
1871
1891
1911
1931
1951
1971
Figure 9.4: Dollar/Sterling Real Exchange Rate, 1791-1990, 1914 Value Set to 0
z,= 0.179 (0.052)
+
0.8869z,_ 1, SER=0.07i, R2 =0.790, t= 1792, ... ,1990. (0.0326)
The OFt-statistic (t") is -3.5 (= (0.8869- 1)/0.0326), which is significant at I percent. Thus, we can indeed reject the hypothesis that the sterling/dollar real exchange rate is a random walk. An obvious caveat to this result is that the sample period of 1791-1990 includes fixed and floating rate periods. It might be that international price arbitrage works more effectively under fixed exchange rate regimes. If so, the z,_ 1 coefficient should be closer to 0 during fixed rate periods. Lothian and Taylor ( 1996) report that if one uses a simple Chow test, one cannot reject the hypothesis that the z,_ 1 coefficient is the same before and after floating began in 1973. In the empirical exercise of this chapter, you will be asked to verify their finding and also use the ADF-GLS to test the same hypothesis. PROBLEM SET FOR CHAPTER 9 _ _ _ _ _ _ _ _ _ _ _ __ ANALYTICAL EXERCISES
1. Prove Proposition 9.2(b). Hint: By the definition of t.~,.
So
t.~,,
we
have~~
=
~,_ 1
+
606
Chapter 9
From this, derive
Sum this overt = I, 2, ... , T to obtain T
T
:L t>~~ · ~~-' = GJ[(~d- (~ol l- m:L(t>~~) . 2
2
t=l
Divide both sides by T to obtain
L 6.~1 . ~I-I =
I T T
I(
2
~T
2
.jT
)
-
I(
2
~o )
2
I
T
- 2T L(f>~ 1 )
.jT
1=1
2 •
t=l
Apply Proposition 6.9 and the Ergodic Theorem to the terms on the right-hand side of(*).
2. (Proof of Proposition 9.4) Let p~ be the OLS estimate of p in the AR(l) equation with intercept, (9.3.11), and define the demeaned series {y;_,) (I = I, 2, ... , T) by ~
Y1_,
This
=
Y1-1-
- y, Y
=
Yo+Yl+ .. ·+YT-1 T (I= I, 2, ... , T).
(I)
y;_, equals the OLS residual from the regression of y1_ 1 on a constant for
t=l,2, ... ,T. (a) Derive "JJ- _
p
l _ -
"T ~ Ltt=l f::l.yt Yr-1 "T ~ L-l=i(YI-1)
(2)
2
Hint: By the Frisch-Waugh Theorem, p~ is numerically equal to the OLS
coefficient estimate in the regression of y 1 on
y;_, (without intercept). Thus, (3)
Use the fact that
L;=l (yl
I;;=, y;_,
- Y1-1l · Y:'-1·
= 0 to show that
I;;=, (y y;_,) · y;_, 1 -
607
Unit-Root Econometrics
(b) (Easy) Show that, under the null hypothesis that [y,] is driftless 1(1) (not necessarily driftless random walk),
A'
-([W"(l)] 2
T · (p"
-
I)--+ 2
A2 •
d
Jd
y
-
[W"(O)J 2) - ~
2
[W"(r)) 2 dr
(4)
where Yo is the variance of D.y,, A2 is the long-run variance of D.y,, and W"(-) is a demeaned standard Brownian motion introduced in Section 9.2. Hint: Use Proposition 9.2(c) and (d) with~' = y,. (c) (Trivial) Verify that T · (p"- I) converges to DF~ (defined in (9.3.13)) if [y,} is a driftless random walk. Hint: If y, is a random walk, then A2 = y0 . (d) (Optional) Lets be the standard error of regression for the AR(I) equation with intercept, (9.3.11). Show that s 2 is consistent for y0 (= Var(D.y,)) under the null that [y,} is driftless I (I) (so D.y, is zero-mean 1(0) satisfying (9.2.1)--{9.2.3)). Hint: Let(&*, p") be the OLS estimates of (a*, p). Show that &* converges to zero in probability. Write s 2 as T
s' =
I
T-1
L[(D.y,- &*) -
(p" -
l)y,_JJ'
t=l
T
=
I T- I
L( D.y, -a'*)' t=l
2 . [T. T-1
T
(p"- I)]·.!._ L(D.y,- &*). Yt-) T
t=l
T
+
I , 1"" 2 2 T _I · [T · (p"- I)] · T' L)y,_,) . (5) t=l
Since plim &* = 0, it should be easy to show that the first term on the righthand side of (5) converges in probability to y0 = E[ (D.y,) 2]. To prove that the second term converges to zero in probability, take for granted that I I LY<-i T --+ A ..;T T ct f'i' -
t=l
1'
W(r) dr.
(6)
0
(e) Show that t" --+ct DF'/ if [y,] is a driftless random walk. Hint: Since [y,] is a driftless random walk, A2 = y0 = a 2 (=variance oft:,). So sis
608
Chapter 9
consistent for
rJ.
.0"- 1
(7)
t" = s-x-,;--;=(2=.c::2C")'=el=e=m=e=nt=o=f=o(x=·x"")=c1 Show that the square root of the (2, 2) element of (I:;~,
x,x;) _,
is
1
x, = [1, y,_Jl'.
where 3.
(Proof of Proposition 9.5) Let
p'
be the OLS estimate of p in the AR(l)
equation with time trend, (9.3.17), and define the detrended series [y,'_ 1) (t =
1, 2, ... , T) by
' y,_ 1
8)
where (a,
=y,_
1
. .
-a-8·t,
(8)
are the OLS coefficient estimates for the regression of y,_ 1 on
(1, t) for t = 1, 2, ... , T. (a) Derive
., p - 1-
'i;"'T
L.....t=l
b.
'i;"'T
(
' Yr Yr-1 '
L....r=l Yr-l
Hint: By the Frisch-Waugh Theorem,
,0'
(9)
)2
is numerically equal to the OLS
estimate of the y J_ 1 coefficient in the regression of y, on y 1'_ 1 (without intercept or time). Thus, "'r
p =
'i;"'T
L....t=l
Yr
'i;"'T
(
'
Yr-1 '
Lt=l Yr-1
(10)
)2
Show that I:;~, (y,- y;_,) · y1'_ 1 = I:;~, (y,- y,_,) · y,'_ 1 • (By construction,
I:;~,
y,'_ 1 =
0 and I:;~, t ·
y;_, =
0 (these are the normal equations).)
(b) (Easy) Show that, under the assumption that y,
=a+ 8. t +~~where[~,)
is driftless !(1),
A' ~ -([W'(1)] 2 - [W'(0)] 2 ) - ~ T·<.D'-1)-+ 2 d
A' j~[W'(r)]2dr
2
(11)
609
Unit-Root Econometrics
where
Yo is the variance of
t>y,, A2 is the long-run variance of t>y,, and
W' O is a detrended standard Brownian motion introduced in Section 9.2. Hint: Let ~,'_ 1 be the residual from the hypothetical regression of ~r-l on (I, t) lor t = I, 2, ... , T where 1;, = y, - a - 8 · t (this regression is hypothetical because we do not observe{!;,)). Then 1;,'_ 1 = fort= I, 2, .... T, because 1;, differs from y, only by a linear trend a + 8 · t. Use Proposition 9.2(e) and (I). Since t>y, = 8 + <'>~, under the null, the variance of !>~, equals that of !> y, and the long-run variance of <'>~, equals that of
_v;_,
t>v,. (c) (Trivial) Verify that T · (jj'- I) converges to DF; (defined in (9.3.19)) if {y,} is a random walk with or without drift.
4. (Proof of (9.4.16)) In the text,
L .Yr-1 e, T
-I T
--+ c, d
t=l
=
I 2 · Y,(l) · [W(l)'- I] -u 2 0
(9.4.16)
was left unproved. Prove this under the condition that {y,} is driftless I (I) so that {t>y,) is zero-mean 1(0) satisfying (9.2.1)-(9.2.3).
Hint: Thee, in
(9.4.16) is the innovation process in the MA representation of {t>yr). Using the Beveridge-Neisen decomposition, we have
Yr =
l/J(l)w, +~~+(Yo- ~o), w,
= e, + e, + · · · + e,.
So
IT
IT
T LYr-J e, = 1/1(1) T L t=I
IT
w,_, e, + T
t=I
IT
L ~r-J e, +(Yo- ~o) T L e,. t=I
t=I
(*) Show that the second and the third term on the right-hand side of this equation converge in probability to zero. Once you have done Analytical Exercise 1 above, it should be easy to show that I
T
T
2
L w,_, e,-; (~ )lW(l)'- 1]. t=I
610
Chapter 9
5. (Optional, proof of Proposition 9.7) To simplify the calculation, set p = I, so the augmented autoregression is y,
Let (a', p~,
=a'+ PY<-1 + ~~
t>y,_l
+ <,.
fil be the OLS coefficient estimates. Show that T__. .:::
Hint: By the Frisch-Waugh Theorem, (p~, { 1) are numerically the same as the
OLS coefficient estimates from the regression of y, on (y;_,, (l>y,_,)l~l), where is the residual from the regression of on a constant and (t>y,_,)l~l is the residual from the regression of t>y,_ 1 on a constant lor 1 = I, 2 ... , T. Therefore, if b = (p~. f,)', fJ = (p, ~J)'. and x, = (y;_,, (t>y,_,)l~l)', then they satisfy (9.4.8). So
y;_,
y,_,
T
1"'~2 (I, I) element of A 7 = T2 L)Y,_,) , t=l
I 1st element of c7 = T
T
LY:'- 1 • s,. t= 1
Apply Proposition 9.2(c) and (d) with 1;, = y, to these expressions. 6. (Phillips tests) Suppose that (y,} is driftless I (I), so t>y, can be serially correlated. Nevertheless, estimate the AR(l) equation without intercept, rather than the augmented autoregression without intercept, on (y,}. Let p be the OLS estimate of they,_, coefficient. Define Zp
= Z,
,
T · (p- I) -
= ;cs · I A
-
I
- ·
2
I
2·
(T
2 ·
s2
uj) , ·
2
,
(A -Yo),
(T·u~>) '2 , , · (A - Yo), s ·A
where &P is the standard error of p, s is the OLS standard error of regression, X2 is a consistent estimate of A2 , y0 is a consistent estimate of y0 , and 1 is the 1-value for the hypothesis that p = I.
611
Unit-Root Econometrics
(a) Show that Zp and Z, can be written as
Hint: By the algebra of least squares, I
(b) Show that
Zp ---> DFp and Z, ---> DF,. d
d
Hint: Use Proposition 9.2(a) and (b)
with~~
= y,.
7. (Optional) Show that a(L) in (9.2.5) is absolutely summable. Hint: A one-line
proof is 00
00
00
00
L:: Vt; < L:: L:: It;)= "L:ilt;) < i=j+l
00.
j=Oi=j+l
Justify each of the equalities and inequalities. MONTE CARLO EXERCISES
1. (You have less power with time) In this simulation, we wish to verify that the finite-sample power against stationary alternatives is greater with the OF t~-test than with the OFt' -test, that is, inclusion of time in the AR( I) equation reduces power. The OGP we examine as the stationary alternative is a stationary AR(l) process: y,
= py,_, + e,,
p
= 0.95,
e,
~
i.i.d. N(O. !), Yo= 0.
Choose the sample size T to be I 00 and the nominal size to be 5 percent. Your computer program for calculating the finite-sample power should have the following steps.
612
Chapter 9
(1) Set counters to zero. There should be two counters, one for the DF t~ and the other fort'. They will record the number of times the (false) null gets rejected. (2) Start a do loop of a large number of replications (I million, say). In each replication, do the following. (i) Generate the data vector (yo, Y1, ... , YT ).
Gauss Tip: As was mentioned in the Monte Carlo exercise for Chapter I, to generate (y 1, y,, ... , YT), avoid creating a do loop within the current do loop; use the matrix operators. The matrix formula is r
=
y (Txl)
·yo+
A
e
(TxT}(Txi)
(Txl)
where p
YI Y2
y=
p' r= ' PT
YT
0
1
0
p
A=
.......
p'
p
PT-1
PT-2
0 0 0 ' f=
fi
s, fT
p
To define r, use the Gauss command seqm. To define A, use toepli tz and lowma t. rand A should be defined in step (1), outside the current do loop. (ii) Calculate I~ and t' from (yo. Y1o ..• , YT). (So the sample size is actually T +I.)
by I if I~ < -2.86 (the 5 percent critical value for DF'/). Increase the counter fort' by I if t' < -3.41 (the 5 percent critical value for D F,').
(iii) Increase the counter for
I~
(3) After the do loop, for each statistic, divide the counter by the number of replications to calculate the frequency of rejecting the null. This frequency converges to the finite-sample power as the number of Monte Carlo replications goes to infinity.
613
Unit·Aoot Econometrics
Power should be about 0.124 for the OF t"-test (so only about 12 percent of the time can we reject the false null!) and 0.092 for the OFt' -test. So indeed, power is less when time is included in the regression, at least against this particular OGP. 2. (Size distortions) In this Monte Carlo simulation, we calculate the size distortions for the AOF p"- and !~'-tests. Our null OGP is the one examined by Schwer! (I 989): y, = py,_,
e,
~
+e, +lie,_,,
p =I, II= -0.8,
i.i.d. N(O, 1), Yo= 0, eo= 0.
For T = I 00 and the nominal size of 5 percent, calculate the finite-sample or exact size for /3" and t". It would be useful to select the number of lagged changes in the augmented autoregression by one of the data-dependent methods (the sequential t, the AIC, or the BIC), but to keep the computer program simple and also to save CPU time, set it equal to 4. Thus, the regression equation, as opposed to the OGP, is
\1 · (y,_, - Yt-2) + \2 · (Yt-2- y,_,) + \1 · (y,_,- y,_.) + I• · (y,_•- Yt-s) +error.
y, =canst.+ PYt-1 +
Since the sample includes y0 , the sample period that can accommodate four lagged changes is from t = 5 (not p + 2 = 6) to t = T and the actual sample size is T- 4. Verify that the size distortion is less for the AOF t"- than for the AOF p"-test. (The size should be about 0.496 for the AOF p" and 0.290 for the AOF t~'.) EMPIRICAL EXERCISES
Read Lothian and Taylor (I 996, pp. 488-509) (but skip over their discussion of the Phillips-Perron tests) before answering. The file LT.ASC has annual data on the following: Column I: Column 2: Column 3: Column 4:
year dollar/sterling exchange rate (call itS,) U.S. WPI (wholesale price index), normalized to 100 for 1914 (P,) U.K. WPI, normalized to 100 for 1914 (P/).
The sample period is 1791 to 1990 (200 observations). These are the same data used by Lothian and Taylor ( 1996) and were made available to us. For data sources
614
Chapter 9
and how the authors combined them to construct consistent time series, see the Appendix of Lothian and Taylor ( 1996). (According to the Appendix, the exchange rate (and probably WPis) are annual averages. This is rather unfortunate and pointin-time data should have been used, because of the time aggregation problem: if a variable follows a random walk on a daily basis, the annual averages of the variable do not follow a random walk. See part (f) below for more on this.) Calculate the dollar/sterling real exchange rate as
z, = ln(S,) ~ ln(P,) + ln(P,').
(I)
Since S, is in dollars per pound, an increase means sterling appreciation. The plot of the real exchange rate in Figure 9.4 shows no clear time trend, so we apply the AOF t"-test. Therefore, the augmented autoregression to be estimated is
z,
= const. + Plr-1 + \1 · (Z:-1 ~ Zr-2) + · · · + \p · (Zr-p ~ Zr-p-1) +error,
(2) where p is the number of lagged changes included in the augmented autoregression (which was denoted p in the text). (a) (AOF with automatic lag selection)
For the entire sample period of 17911990, apply the sequential t-rule, the AIC, and the BIC to select p (the number of lagged changes to be included). Set Pmax (the maximum number of lagged changes in the augmented autoregression, denoted Pmax(T) in the text) by (9.4.31) (so Pm"x = 14). For the sequential t test, use the significance level of I 0 percent. (You should find p = 14 by the sequential t, 14 by AIC, and 0 by BIC.) Verify that the value of the t"-statistic for p = 0 (if estimated on the maximum sample) matches the value reported in Table I of Lothian and Taylor (1996) as r,.
RATS Tip: RATS allows you to change the number of regressors in a do loop. Since the RATS OLS procedure LINREG does not allow the standard errors to be retrieved, the OFt" cannot be calculated in the program. So calculate the OF t"-statistic as the t-value on the z,_ 1 coefficient in the regression of z, ~ Zr-l· Also, neither the AIC nor the BIC is calculated by LINREG. So use COMPUTE. In the RATS codes below, z is the log real exchange rate and dz is its first difference. In applying the sequential t, AIC, and BIC rules, fix the sample period to be 1806 to 1990 (which corresponds to Pmax + 2, ... , T). The RATS codes for AIC and BIC are linreg dz 1806:1 1990:1;# constant z{1)
615
Unit·Aoot Econometrics
compute ssrr=%rss compute akaike = log(%rss/%nobs)+%nreg*2.0/%nobs compute schwarz = log(%rss/%nobs)+%nreg*log(%nobs) /%nobs display akaike schwarz * run ADF with 1 through 14 lagged changes do P=L 14 linreg dz 1806:1 1990:1;# constant z{1) dz{1 top) compute akaike = log(%rss/%nobs)+%nreg*2.0/%nobs compute schwarz = log(%rss/%nobs)+%nreg *log(%nobs)/%nobs display akaike schwarz end do TSP Tip: Unlike RATS, it appears that TSP does not allow you to change the sample period and the number of regressors in a do loop. The OLSQ pro-
cedure prints out the BIC, bm not the AIC, so the AIC must be calculated using the SET command. (b) (OF test on the floating period) Apply the OF 1g-test to the floating exchange rate period of 1974-1990. Because of the lagged dependent variable Z<-J, the first equation is fort = 1975: Zt975
=canst.+
PZt974
+error.
So the sample size is 16. Can you reject the random walk null at 5 percent? (c) (OF test on the gold standard period) Apply the OF tM-test to the gold standard period of 1870--1913. Take t =I for 1871, so the sample size is 43. Can you reject the random walk null at 5 percent? Is the estimate of p (the z,_ 1 coefficient in the regression of z, on a constant and z,_ 1) closer to zero than in (b)? (d) (Chow test) Having rejected the I(l) null, we proceed under the maintained assumption that the log real exchange rate z, follows a stationary AR(l) process. To test the null hypothesis that there is no structural change in the AR(l) equation, split the sample into two subsamples, 1791-1973 and 1974-1990, and apply the Chow test. You may recall that the Chow test was originally developed for models with strictly exogenous regressors. If K is the number of regressors including a constant (2 in the present case), it is an F-test with (K, T - 2K) degrees of freedom. This result is not directly applicable here, because in the AR(l) equation the regressor is the lagged dependent variable
616
Chapter 9
which is not strictly exogenous. However, as you verified in Analytical Exercise 12 of Chapter 2, if the variables (the regressors and the dependent variable) are ergodic stationary (which is the case here because (z,} is ergodic stationary when IPI < I), then in large samples, with the sizes of both subsamples increasing, K · F is asymptotically x2 (K). This result does not require the regressors to be strictly exogenous. Calculate K · F and verify the claim (see pp. 498-502 of Lothian and Taylor. 1996) that "there is no sign of a structural shift in the parameters during the floating period." In the present case of a lagged dependent variable, there is an issue of whether the equation for 1974, ZJ974
= const. +
PZI973
+error,
should be included in the first subsample or in the second. It does not matter in large samples; include the equation in the first subsample (so the sample size of the first subsample is 183). (Note: K · F should be about 1.26.) (e) (The ADF-GLS) As was mentioned in Section 9.5, the ADF-GLS test achieves
substantially higher power at a cost of only slightly higher size distortions. The ADF-GLS t 1'- and t' -statistics are calculated as follows. As in the ADF t"- and t' -tests, the null process is
y,
= d, +
z,. z,
=
PZt-l
+
u,. p =I, {u,} zero-mean 1(0) process.
(3)
Write d, = x;p. For the case where d, is a constant. x, = I and p =a, while for the case where d, is a linear time trend, x, = (I. t )' and p = (a. 8)'. We assume the sample is of size T with (y~o y2 , ••. , YT ). (i) (GLS estimation of fJ) In estimating the regression equation
y,
= x;p +error,
(t
=
I, 2, .... T),
(4)
do GLS assuming that the error process is AR(l) with the autoregressive coefficient of jj = I+ cIT. (The value of c will be specified in a moment.) That is, estimate p by OLS by regressing -
-
-
-
-
-
YI· Y2- PYI· Y3- PY2· · · ·, YT- PYT-l on x1.x2 -px,.x3 -px2 .... ,xr- PXr-1·
(Unlike in the genuine GLS, the first observations, y 1 and x 1, are not
617
Unit-Root Econometrics
weighted by /I ~ p 2 , but this should not matter in large samples.) Set c according to
in the demeaned case (with x, = 1), c=~I3.5
in the detrended case. (Roughly speaking, this choice of censures that the asymptotic power at c = c where p =I+ efT is nearly 50 percent which is the asymptotic power achieved by an ideal test specifically tailored to this alternative of c = c.) Call the resulting estimator PoLs and define y, = y, ~ x;PoLs· (This should not be confused with the GLS residual, which is (y, ~ PYr-Il ~ (x, ~ PXr-Il'PoLs·l (ii) (ADF without intercept on the residuals) Estimate the augmented autore-
gression without intercept using
y, instead of y,:
.Vt = PYr-1 + \1 · (Yr-1 ~ Yr-2) + · · · + \r · (Yr-p ~ .Vr-p-I) +error. The !-value for p = I is the ADF-GLS !-statistic; call it the ADF-GLS t~ if x, = I and the ADF-GLS t' if x, = (I, t)'. The limiting distribution of the ADF-GLS t~ is that of DF, (which is tabulated in Table 9.2, Panel (a)), while that of the ADF-GLS t' is tabulated in Table I. C. of Elliott eta!. (1996). Reproducing their lower tail critical values ~3.48
for I%,
~3.15
for 2.5%.
~2.89
for 5%, and
~2.57
for 10%.
Elliott et a!. ( 1996) show for the ADF-GLS test that the asymptotic power is higher than those of the ADF p- and t -tests, and that compared with the ADF tests, the finite-sample power is substantially higher with only a minor deterioration in size distortions. Apply the ADF-GLS 1~-test to the gold standard subsample 1870-1913. Choose the lag length by BIC with Pmax (the maximum number of lagged changes in the augmented autoregression, denoted PmaxCTl in the text) by (9.4.31) (so Pmax = 9). Can you reject the null that the real exchange rate is driftless 1(1)? (Note: The BIC will pick 0 for p. The statistic should be ~2.24. You should be able to reject the 1(1) null at 5 percent.)
TSP Tip: An example of TSP codes assuming p = 0 for this part is the following.
618
Chapter 9 ?
7 (e) ADF-GLS ?
7 Step 1: GLS demeaning ?
set rhobar=l-7/(1913-1870+1); smpl 1870 1913; const=l; Ct=l; zt=z; ? z lS the log real exchange rate smpl 1871 1913; zt=z-rhobar*z(-1); ct=const-rhobar*const(-1); smpl 1870 1913; olsq zt ct; z=z-@coef(l); ? @coef(l) is the GLS estimate of the mean ? ? Step 2:
DF with demeaned variable
?
smpl 1871 1913; olsq z z (-1); set t=(@coef(l)-1)/@ses(l) ;print t; (f) (Optional analytical exercise on time aggregation)
Suppose y, is a random walk without drift. Create from this time-averaged series by the formula Y1 Yl =
+ Y2 + · ·· + Yn , n
Yn+l Y2 =
+ Yn+2 + · · · + Y2n , · · · · n
(Think of y, as the exchange rate on day t and Yj as an annual average of daily exchange rates for year j.) Show that {yj} is not a random walk. More specifically, show that
619
Unit-Root Econometrics
ANSWERS TO SELECTED Q U E S T I O N S - - - - - - - - - ANALYTICAL EXERCISES
1. Since E(;Jl < oo, ; 0 ; ft converges in mean square (and hence in probability) to 0 as T ---> oo. Thus, the second term on the right-hand side of(*) in the hint can be ignored. Regarding the first term, we have
The first term on the right-hand side of this equation can be ignored. By Proposition 6.9, the second term converges to N(O, A2 ) because !l.;, is a linear process with absolutely summable MA coefficients. So the first term on the right-hand side of(*) converges in distribution to (A 2 j2)X where X ~ x 2 (1). Since {!l.;,} is ergodic stationary, the last term on the right-hand side of(*) converges in probability to ~ E[(!l.;,) 2 ] by the Ergodic Theorem. 5. The (1, I) element of Ar converges to the random variable indicated in Proposition 9.2(c). Regarding the first element of cr, by Beveridge-Nelson, y, = >f!(l)w,
+ ry, +(yo~
ryo), w,
= '' + '' + · · · + t:,.
(I)
Therefore, (2)
where M
-
X 1_ 1 = Xr-I- x,
_
X=
Xo+···+XT-I T
for x
= w, TJ,
and
The second term on the right-hand side of this equation can be ignored asymptotically. By Proposition 9.2(d) for;, = w, (a driftless random walk), the first term on the right-hand side converges to
620
Chapter 9
From this, we have I T
I
T
LY;'_ 1 • s, --;j 2 · a 2 · >fr(l) · ([W"(l)f- [W"(O)f- I).
(4)
/o=l
The A 7 matrix is asymptotically diagonal for the demeaned case too. The offdiagonal elements of A 7 are equal to 1/ft times (5)
Since
L;=, y;'_
1
= 0, this equals
(6) It is easy to show, by a little bit of algebra analogous to that in Review Question
3 of Section 9.4 and Proposition 9.2(d), that (6) has a limiting distribution. So 1/ ft times (6), which is the off-diagonal elements of A 7 , converges to zero in probability. Then, using the same argument we used in the text for the case without intercept, it is easy to show that ~ 1 is consistent for \ 1 . So >fr(l) is consistently estimated by 1/(1- ~I). Taken together, T_·_.:::(p_'"-,--:.:.1) _::_ , --> DF:. I - \I d EMPIRICAL EXERCISES
(a) With p = 0, the estimated AR(l) equation is Zt
=canst.+ 0.8869 Zt-1, SSR = 0.995135, R 2 = 0.790, T = 199. (0.0326)
The value oft" is -3.47, which matches the value in Table I of Lothian and Taylor (1996). The 5 percent critical value is -2.86 from Table 9.2, Panel (b). (b) The estimated AR(l) equation for 1974-1990 is
z, =canst.+ 0.7704z,_ 1 ,
SSR = 0.150964, R 2 = 0.539, T = 16.
(0.1904) t" = -1.21 > -2.86. We fail to reject the null at 5 percent.
621
Unit-Root Econometrics
(c) The estimated AR(l) equation for 1870-1913 is
z,
= const. + 0.7222 z,_ 1 , SSR (0.1 045)
= 0.069429,
R2
= 0.538,
T
= 43.
t" = -2.66 > -2.86. We fail to reject the null at 5 percent.
(d) The estimated AR(l) equation for 1791-1974 is
z,
=cons!.+ 0.8993z,_ 1 , SSR (0.0329)
= 0.837772,
R2
= 0.805,
T = 183.
From this and the results in (a) and (b), 0.995135- (0.837772 + 0.150964) K .F =
o 837772+0.t5o964
1·26 ·
(199 4)
The 5 percent critical value for x 2 (2) is 5.99. So we accept the null hypothesis of parameter stability. (e) The ADF-GLS t" = -2.24. The 5 percent critical value is -1.95 from Table 9.2, Panel (a). So we can reject the 1(1) null.
References Baillie, R., 1996, "Long Memory Processes and Fractional Integration in Econometrics." Journal of Econometrics, 73, 5-59. Balke, N., and R. Gordon, 1989, "The Estimation of Prewar Gross National Product: Methodology and New Evidence," Journal of Political Economy, 97. 39-92. Chan, N., and C. Wei, 1988, "Limiting Distributions of Least Squares Estimates of Unstable Autore· gressivc Processes," Annals of Statistics, 16, 367-401. Dickey, D., 1976, "Estimation and Hypothesis Testing in Nonstationary Time Series," Ph.D. disser· tation, Iowa State University. Dickey, D., and W. Fuller, 1979, "Distribtion of the Estimators for Autoregressive Time Series with a Unit Root," Journal of the American Statistical Association, 74, 427-431. Elliott, G., T. Rothenberg, and J. Stock, 1996, "Efficient Tests for an Autoregressive Unit Root," Econometrica, 64, 813-836. Fama, E., 1975, "Short-Tenn Interest Rates as Predictors of Inflation," American Economic Review, 65, 269--282. Fuller, W., 1976, Introduction to Statistical Time Series, New York: Wiley. _ _ _ , 1996, Introduction to Statistical Time Series (2d ed.), New York: Wiley. Hall, P., and C. Heyde, 1980, Martingale Limit Theory and Its Application, New York; Academic. Hamilton, 1., 1994, Time Series Analysis, Princeton: Princeton University Press. Hannan, E. 1., and M. Deistler, 1988, The Statistical Theory of Linear Systems, New York: Wiley.
622
Chapter 9
Kwiatkowski, D., P. Phillips, P. Schmidt, and Y. Shin, 1992, ''Testing the Null Hypothesis of Stationarity against the Alternative of a Unit Root," Journal of Econometrics, 54, I 59-178. Leyboume, S., and B. McCabe, 1994, "A Consistent Test for a Unit Root," Journal of Business and
Economic Statistics, 12, 157-166. Lothian, J., and M. Taylor, 1996, "Real Exchange Rate Behavior: The Recent Float from the Perspective of the Past Two Centuries," Journal of Political Economy, 104-,488-509. Maddala, G. S., and In-Moo Kim, 1998, Unit Root.~. Cointegration, and Structural Change, New York: Cambridge University Press. Ng, S .. and P. Perron, 1995, "Unit Root Tests in ARMA Models with Data-Dependent Methods for the Selection of the Truncation Lag," Journal of the American Statistical Association, 90, 268-
281. - - - · 1998, "Lag Length Selection and the Construction of Unit Root Tests with Good Size and Power," working paper, mimeo. Perron, P., 1991, "Test Consistency with Varying Sampling Frequency,'' Econometric Theory, 7, 341-368. Perron, P., and S. Ng, 1996, "Useful Modifications to Unit Root Tests with Dependent Errors and Their Local Asymptotic Properties," Review of Economic Studies, 63, 435-463. Phillips, P., 1987, "Time Series Regression with a Unit Root," Econometn'ca, 55, 277-301. Phillips, P., and S. Durlauf, 1986, "Multiple Time Series Regression with Integrated Processes," Review of Economic Studies, 53,473-496. Phillips, P., and P. Perron, 1988, "Testing for a Unit Root in Time Series Regression," Biometrika, 75, 335-346. Rogoff, K., 1996, "The Purchasing Power Parity Puzzle," Journal of Economic Literature, 34,647668. Said, S., and D. Dickey, 1984, "Testing for Unit Roots in Autoregressive~ Moving Average Models of Unknown Order," Biometrika, 71,599-607. Sargan, D., and A. Bhargava, 1983, "Testing Residuals from Least Squares Regression for Being Generated by the Gaussian Random Walk," Econometrica, 51, 153-174. Schwert, G., 1989, "Tests for Unit Roots: A Monte Carlo Investigation," Journal of Business and Economic Statistics, 7, 147-159. Sims, C., J. Stock, and M. Watson, 1990, "Inference in Linear Time Series Models with Some Unit Roots," Econometrica, 58, 113-144-. Stock, J., 1994, "Unit Roots, Structural Breaks and Trends," Chapter 46 in R. Engle and D. McFadden (eds.), Handbook of Econometrics, Volume IV, New York: North Holland. Tanaka, K., 1990, ''Testing for a Moving Average Unit Root," Econometric Theory, 6, 433-444-. _ _ _ , 1996, Time Series Analysis, New York: Wiley. Watson, M., 1994, "Vector Autoregressions and Cointegration," Chapter 47 in R. Engle and D. McFadden (eds.), Handbook of Econometrics, Volume IV, New York: North Holland.
CHAPTER
10
Cointegration
ABSTRACT As the examples of the previous section amply illustrate, many economic time series can be characterized as being 1(1 ). But very often their linear combinations appear to be stationary. Those variables are said to be co integrated and the weights in the linear combination are called a co integrating vector. This chapter studies cointegrated systems, namely, vector 1(1) processes whose elements are cointegrated.
In the previous chapter, we have defined univariate 1(0) and 1(1) processes.
The first section of this chapter presents their generalization to vector processes and defines co integration for vector I ( 1) processes. Section I 0.2 presents alternative representations of a cointegrated system. Section I 0.3 examines in some detail a test of whether a vector 1(1) process is cointegrated. Techniques for estimating cointegrating vectors and making inferences about them are developed in Section I 0.4.
As an application, Section 10.5 takes up the money demand function. The log real money supply, the nominal interest rate, and the log real income appear to be 1(1), but the existence of a stable money demand function implies that those variables are cointegrated, with the coefficients in the money demand function forming a cointegrating vector. The techniques of Section I0.4 are utilized to estimate the cointegrating vector. Unlike in other chapters, we will not present results in a series of propositions, because the level of the subject matter is such that it is difficult to state assumptions without introducing further technical concepts. However, for technically oriented readers, we indicate in the text references where the assumptions are formally stated. A Note on Notation: Unlike in other chapters, the symbol "n" is used for the dimension of the vector processes, and as in the previous chapter, the symbol "T" is used for the sample size and "t" is the observation index. Also, if Yr is the t-th observation of ann-dimensional vector process, its j-th element will be denoted Yjr instead of Yrj. We make these changes to be consistent with the notation used in the majority of original papers on cointegration.
624
Chapter tO
10.1
Co Integrated Systems
Figure IO.l(a) plots the logs of quarterly real per capita aggregate personal disposable income and personal consumption expenditure for the United States from 1947:Ql to 1998:Ql. Both series have linear trends and-as will be verified in an example below- stochastic trends. However, these two 1(1) series tend to move together, suggesting that the difference is stationary. This is an example of cointegration. Cointegration relations abound in economics. In fact, many of the variables we have examined in the book are cointegrated: prices of the same commodity in two different countries (the difference should be stationary under the weaker version of Purchasing Power Parity), long- and short-term interest rates (even though they have trends, yield differential may be stationary), forward and spot exchange rates, and so forth. Cointegration may characterize more than two variables. For example, the existence of a stable money demand function implies that a linear combination of the log of real money stock. the nominal interest rate, and log aggregate income may be stationary even though each of the three variables is I (I). We start out this section by extending the definitions of univariate 1(0) and 1(1) processes of the previous chapter to vector processes. The concept of cointegration for multivariate I (I) processes will then be introduced formally. 1.4,----------------------------, 1.3
~
1 .1
.... ... --·"" ...
1.0 0.9
... -·--- .-
/ ~.,.-...--::=::-::::/--~ _..........
1.2
-
,·-----
.r;;.,...:.:.....................----.. ·····"
0.8 ~~::.···
0.7
06~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~-M~YWDM~Mron~~~~~M~MOO~M~~
- - log per capita income
year · · · · • log per capita consumption
Figure 10.1 (a): Log Income and Consumption
Linear Vector 1(0) and 1(1) Processes
Our definition of vector 1(0) and 1(1) processes follows Hamilton (1994) and Johansen (1995). To introduce the multivariate extension of a zero-mean linear
625
Cointegration
univariate !(0) process, consider a zero-mean n-dimensional VMA (vector moving average) process:
u, = lii(L) e, , W(L) (11xl)
== 1, + lii,L + lii 2 L 2 + · · ·,
(10,1,])
S2 , S2 positive definite.
(10.1.2)
(11xn) (11xl)
where e, is i.i.d, with E(e 1) = 0, E(e,e;) =
(11 XII)
(The error process e, could be a vector Martingale Difference Sequence satisfying certain properties, but we will not allow for this degree of generality,) As in the univariate case, we impose two restrictions. The first is one-summability: 1 {lilj) is one-summable,
(10,1.3)
Since {lilj) is absolutely summable when one-summability holds, the multivariate version of Proposition 6. I (d) means that u, is (strictly) stationary and ergodic. The second restriction on the zero-mean VMA process is Iii(!)
f=
(n x n matrix ofzeros).
0
(10.1.4)
(n XII)
That is, at least one element of then x n matrix Iii (I) is not zero. This is a natural multivariate extension of condition (9.2.3b) in the definition of univariate 1(0) processes on page 564. A (linear) zero-mean n-dimensional vector 1(0) process or a (linear) zeromean 1(0) system is a VMA process satisfying (IO.l.l)-(10.1.4). Given a zeromean 1(0) system u,, a (linear) vector 1(0) process or a (linear) 1(0) system can be written as ~
+u,,
where~
is ann-dimensional vector representing the mean of the stationary process. Using the formula (6.5.6) with (6.3.15), the long-run covariance matrix of the 1(0) system is lil(l)S21i1(1)'.
(10.1.5)
Since S2 is positive definite and Iii (I) satisfies (10.1.4 ), at least one of the diagonal 1A sequence of matrices {lit j J is said to be one-summable if Nj,kel is one-summable (i.e., L:~o jl\l!j,ke I < oo) for all k, f = I, 2, ... , n, where \l.lj,ke is the (k, f) element of then x n matrix ~ j.
626
Chapter 10
elements of this long-run variance matrix is positive, implying that at least one of the elements of an 1(0) system is individually 1(0) (i.e., 1(0) as a univariate process). We do not necessarily require \11(1) to be nonsingular. In fact, accommodating its singularity is the whole point of the theory of cointegration. Consequently, the long-run variance matrix, too, can be singular. Example 10.1 (A bivariate 1(0) system): Consider the following bivariate first-order VMA process:
{
u,, =Bit -
U21
=
fJ,/-1
+ Yf2,t-I'
or u, = e,
+ 'IIIer-h
(10.1.6)
B2r,
where
e,
=
[elt] E:2r
[-I
r]
0 0 .
'If, =
'
(10.1.7)
Since this is a finite-order VMA, the one-summability condition is trivially satisfied. Requirement (10.1.4) is satisfied because
\11(1)=12+"',=[
0 0
y]-~ 0. I (2x2)
(10.1.8)
So this example fits our definition of 1(0) systems. If y = 0, then the first element, u It• is actually I( -I) as a univariate process. The n-dimensionall(1) system y, associated with a zero-mean 1(0) system u, can be defined by the relation f),y, =
~
+ u,
0
(10.1.9)
So the mean of f),y, is ~- Since not every element of the associated zero-mean 1(0) process u, is required to be individually 1(0), some of the elements of the I (I) system y, may not be individually 1(1). 2 Substituting (10.1.1) into this, we obtain f),y, = li + \II(L)e,.
(10.1.10)
This is called the vector moving average (VMA) representation of an 1(1) 2Jn many textbooks and also in Engle and Granger (1987), all elements of an I (I) system are assumed to be individually 1(1), but that assumption is not needed for the development of cointegration theory. This point is emphasized in, e.g., Lt.itkepohl ( !993, pp. 352-353) and Johansen (1995, Chapter 3).
627
Cointegration
system. In levels, y, can be written as
y, =Yo+ 8 ·I+ U1 + U2 + · · · + u,,
(10.1.11)
or, written out in full, Yit
YI,O
Y2t
Y2.0
Ynt
Yn,O
OJ ·I
+
82 ·I
u1,2
+
+ Un,I
u2,2
+ .. ·+
Un,2
Regarding the initial value y0 , we assume either that it is a vector of fixed constants or that it is random but independent of e, for alit.
Example 10.2 (A bivariate 1(1) system): An I (I) system whose associated zero-mean 1(0) process is the bivariate process in Example 10.1 is
(10.1.12) where \11 1 is given in (10.1.7). In levels, Yit = YI,o +OJ ·t +(£It - fJ,o) + Y · (s2,o + £2,1 + · · · + f2,t-J), { Y2t = Y2.o + 02 ·I+ (£2,1 + £2,2 + · · · + f2t ).
(10.1.13) The second element of y, is I(I) as a univariate process. If y = 0, then the first element y 1, is trend stationary because the deviation from the trend function (y 1,o-fJ.o)+o 1·I is stationary (actually i.i.d.). Otherwise y 1, is 1(1). The Beverldge-Nelson Decomposition
For an I (I) system whose associated zero-mean I (0) process is u, satisfying (I 0.1.1 )-(I 0.1.4), it is easy to obtain the Beveridge-Nelson (BN) decomposition. Recall from Section 9.2 that the univariate BN decomposition is based on the rewriting of the MA filter as (9.2.5) on page 564. The obvious multivariate version
628
Chapter 10
of it is 00
\II(L) = \11(1) +(I - L)a(L), a(Ll = (nxn)
(n xn)
(n xn)
Clj
= -('i'j+l
I>ju. j=O
+ "'j+2 + ... ).
(10.1.14)
(n xn)
So the multivariate analogue of (9.2.6) on page 565 is
u, = \ll(l)e,
+ ~~- ~,_ 1 ,
~~ (n xI)
= a(L)e,.
(10.1.15)
Since \II(L) is one-summable, a(L) is absolutely summable and ~~ is a welldefined covariance-stationary process. Substituting this into (10.1.11), we obtain the multivariate version of the BN decompnsition:
y, =
~ ·t
+ \ll(l)(e, + e, + · · · + e,)
+~~+(yo- ~ 0 ).
(10.1.16)
As in the univariate case, the I (I) system y, is decomposed into a linear trend~· t, a stochastic trend \11(1)(£ 1 + e 2 + · · · + e,), a stationary component~,. and the initial condition Yo - ~ 0 • By construction, ~ 0 is a random variable. So unless Yo is perfectly correlated with ~ 0 • the initial condition is random. Example 10.2 (continued): For the bivariate I (I) system in Example 10.2, the matrix version of a(L) equals -\11 1 so that the stationary component~~ in the BN decomposition is
- [£" -0 ye,,] .
~~-
(10.1.17)
(It should be easy for you to verify this from the matrix version of (9.2.5).)
The two-dimensional stochastic trend is written as
\ll(l)(e 1 + e2 + · · · + e,) =
[~
Thus, the two-dimensional stochastic trend is driven by one common stochastic trend, L;~, e,_,. This is an example of the "common trend" representation of Stock and Watson (1993) (see Review Question 2 below).
629
Cointegration
Cointegration Defined
The 1(1) system y, is not stationary because it contains a stochastic trend W(l) · (e 1 + · · · + e,), and, if~ i 0, a deterministic trend~· 1. The deterministic and stochastic trends, however, may disappear if we take a suitable linear combination of the elements of y,. To pursue this idea, 3 premultiply both sides of the BN decomposition (I 0.1. I 6) by some conformable vector a to obtain a'y, =a'~· 1 + a'W(I)(e 1 + e 2 + · · · + e,) + a'q, + a'(y0
-
q0 ). (10.1.19)
If a satisfies a'W( I)= 0' ,
(10.1.20)
(1 xn)
then the stochastic trend is eliminated and a'y, becomes
"
a ' y, = a o · 1 +a 'q, +a '(Yo - q 0 ) .
(10.1.21)
Strictly speaking, this process is not trend stationary because the initial condition a' (Yo - q0 ) can be correlated with the subsequent values of a' q1 (see Review Question 3 below for an illustration). The process would be trend stationary if, for example, the initial value y0 were such that a' (Yo - q0 ) = 0. We are now ready to define cointegration. 4 Definition lO.l (Cointegration): Let y, be ann-dimensional I (I) system whose associated zero-mean 1(0) system u, satisfies (10.1.1)-(10.1.4). We say that y, is cointegrated with ann-dimensional cointegrating vector a if a i 0 and a'y, can be made trend stationary by a suitable choice of its initial value y0 . Defining cointegration in this way, although dictated by logical consistency, does not necessarily mean that the theory of cointegration requires that the initial condition Yo be chosen as indicated in the definition. The process in (10.1.21) is not stationary because q, and q0 are correlated. However, since q, = a.(L)e, and a.(L) is absolutely summable, q, and q 0 will become asymptotically independent as 1 increases. In this sense, the process in (10.1.21) is "asymptotically stationary," and asymptotic stationarity is all that is needed for estimation and inference for cointegrated I(I) systems. 5
3The idea was originally suggested by Granger (1981). See also Aoki (1968) and Box and Tiao (1977). 40ur definition i.~ the same as Definition 3.4 in Johansen ( 1995). 5The distinction between stationarity and asymptotic stationarity is discussed in Li.itkepohl (1993, pp. 348349)
630
Chapter 10
With cointegration thus defined, we can define the following related concepts. • (Cointegration rank) The cointegrating rank is the number of linearly independent cointegrating vectors, and the cointegrating space is the space spanned by the cointegrating vectors. As the preceding discussion shows, an n-dimensional vector a is a cointegrating vector if and only if (10.1.20) holds. Since there are h linearly independent such vectors if the cointegration rank is h, it follows that rank [ljl(l)] = n- h.
(10. 1.22)
(nxn)
Put differently, the cointegration rank h equals n - rank[ljl (I)]. • (Cointegration among subsets of y1 ) At least one of the elements of a cointegrating vector is not zero. Suppose, without loss of generality, that the first element of the cointegrating vector is not zero. Then we say that y 11 (the first element of y,) is cointegrated with Y21 (the remaining n - I elements of y,) or that y 11 is part of a cointegrating relationship. We can also define cointegration for a subset ofy 1 . For example, the n-1 variables in Y2 1 are not cointegrated if there does not exist an (n - I )-dimensional vector b oft 0 such that (I 0. 1.20) holds with a' = (0, b'). Therefore, y 21 is not cointegrated if and only if the last n- I rows of ljl(l) are linearly independent. • (Stochastic cointegration) Note that the deterministic trend a'8 ·tis not eliminated from a'y, unless the cointegrating vector also satisfies a'8 = 0.
(10.1.23)
In most applications, a cointegrating vector eliminates both the stochastic and deterministic trends in (10.1.19). That is, a vector a that satisfies (10.1.20) necessarily satisfies (I 0. 1.23), so that a'y, which can now be written as a'y, = a'q,
+ a'(yo- qo),
(I 0.1.24)
is stationary (not just trend stationary) for a suitable choice of y0 . This implies that 8 is a linear combination of the columns of ljl(l), so rank [8
: ljl(l)] = n- h.
(10.1.25)
(nx(n+l))
Unless otherwise noted, we will assume that (10.1.25) as well as (10.1.22) are
Cointegration
631
satisfied when the cointegration rank is h. If we wish to describe the case where a cointegrating vector eliminates the stochastic trend but not necessarily the detenninistic trend, we will use the term stochastic cointegration. Of course, ify, does not contain a deterministic trend (i.e., if 8 = 0), then there is no difference between cointegration and stochastic cointegration. Example 10.2 (continued): In the bivariate 1(1) system (10.1.12) in Example 10.2, the matrix >It (I) is given in (10.1.8), so
The rank of >It (I) is I, so the cointegration rank is I (= 2- 1). All cointegrating vectors can be written as (c, -cy)', c # 0. The assumption that the cointegration vector also eliminates the detenninistic trend can be written as c8, - cy8, = 0 or 81 = y8,, which implies that the rank of [8 : >It (I)] shown above is one. Quite a few implications for 1(1) systems follow from the definition of cointegration. For example, For an n -dimensional I(I) system, the cointegration rank cannot be as large as n. If it were n, then rank[W(l)] = 0 and >It (I) would have to be a matrix of zeros, which is ruled out by requirement (I 0.1.4).
• (h < n)
• (Implications of the positive definiteness of the long-run variance of L'>y1) The long-run variance matrix of L'>y, is given by >lt(l)Q>It(l)' (see (10.1.5)), which is positive definite if and only if W(l) is nonsingular. Therefore, y, is not cointegrated if and only if the long-run variance matrix of L'>y, is positive definite 6 Since the long-run variance of each element of L'>y, is positive if the long-run variance matrix of L'>y1 is positive definite, it follows that each element of y, is individually I (I) (i.e., I (I) as a univariate process) if y, is not cointegrated. The same is true for a subset of y,. For example, consider y2,, the last n - I elements of y,. The long-run variance matrix of L'>y 2, is given by >it 2 (l)Q>it 2 (1)' where >it 2 (1) is the last n - I rows of >it (1). It is positive definite if and only if the rows of W2 (1) are linearly independent or y 2, is not cointegrated. In particular, each element of y2 , is individually I (I) if y 2, is not cointegrated.
6This equivalence is specific to linear processes. In general, the positive definiteness of the long-run variance matrix of !:::.yz is sufficient, but not always necessary, for y1 to be not cointegrated. See Phillips (1986, p. 321).
632
Chapter 10
o
Suppose that y 11 is cointegrated with y,,. Then y 2, is not cointegrated if h = l and cointegrated if h > l. The reason is as follows. Cointegration of y 1, with y 2 , implies that there exists a cointegrating vector, call it a 1, whose first element is not zero. lf h = 1, then there should not exist an (n - ])-dimensional vector b of 0 such that a~ = (0, b') is a cointegrating vector, because a 1 and a, are linearly independent. A vector such as a2 can be found if h > 1.
o
(VAR in first differences?) If y, is difference stationary (without drift), it is
tempting to model it as a stationary finite-order VAR (L)!!.y, = e, where (L) is a matrix polynomial of degree p such that all the roots of l
= 1<1>(1)- 1 1= 1<1>(1)1-l
of 0.
So 'II ( 1) is nonsingular and the long-run variance matrix of t.y, is positive definite.
QUESTIONS FOR REVIEW
1. Let (yJ 1 , y,,) be a' in Example 10.2 withy = 0. Show that {Ytr + Y21 } is 1(1). Hint: You need to show that the long-run variance of { !!.y 11 + !!.y 21 ) is positive. 2. (Stock and Watson (1988) common-trend representation) e2 + · · · + e 1 so that the BN decomposition is
Let w,
= e1 +
A result from linear algebra states that if C is ann x n matrix of rank n - h, then there exists ann x n nonsingular matrix G and an n x (n - h) matrix F of full column rank such that
CG=[ F :
(nxn)(nxn)
(nx(n-h))
o].
(nxh)
Show that the permanent component '11(l)w1 can be written as Fr,, where F is ann x (n- h) matrix of full column rank and r, is an (n- h)-dimensional
633
Cointegration
random walk with Var(I'>T 1 ) positive definite. Hint: lJt(l)w1 = lJt(l)GG- 1 w1 . Let T1 be the first n - h elements of G - I w 1 . This representation makes clear that an I (I) system with a cointegrating rank of h has n- h common stochastic trends. 3. (a'y 1 is not quite stationary) To show that the process in (1 0.1.21) is not quite stationary, consider the simple case where q1 = e1 - e,_,, 8 = 0, and Yo = 0. Verify that
Cov(q" q0 ) =
252
fort= 0,
0
fort= 0,
-Q
fort = 1, Var(a'y,) =
6a'Qa
fort = 1,
0
fort>l,
4a'Qa
fort>l,
where Q = Var(e 1 ). Verify that (a'y1 } is stationary fort = 2. 3, .... 4. (What if some elements are stationary?) Let y, be an I( 1) system and suppose that the first element of y, is stationary. Show that the cointegrating rank is at least I. 5. (Matrix of cointegrating vectors) Suppose that the cointegration rank of an n dimensional I (I) system y1 is h and let A be an n x h matrix collecting h linearly independent cointegrating vectors. Let F be any h x h nonsingular matrix. Show that columns of AF are h linearly independent cointegrating vectors. Hint: Multiplication by a nonsingular matrix does not alter rank.
10.2
Alternalive Representations of Cointegraled Systems
Jn addition to the common trend representation (see Review Question 2 of the previous section), there are three other useful representations of cointegrated vector 1(1) processes: the triangular representation of Phillips (1991), the VAR representation, and the VECM (vector error-correction model) of Davidson et al. (1978). This section introduces these representations. Phillips's Triangular Representation This representation is convenient for the purpose of estimating cointegrating vectors. The representation is valid for any cointegration rank, but we initially assume that the cointegrating rank h is I. Let a be a cointegrating vector, and suppose, without loss of generality, that the first element of a is not zero (if it is zero, change
634
Chapter 10
the ordering of the variables of the system so that the first element after reordering is not zero). Partition y, accordingly:
y,
=
(nxl)
Y11
Y21
]
[ ((n-l)xl)
(10.2.1)
.
Thus, y 1, is cointegrated with Y2 1. Since a scalar multiple of a cointegrating vector, too, is a cointegrating vector, we can normalize the cointegrating vector a so that its first element is unity:
(I 0.2.2)
a=
We have seen in the previous section that if a cointegrating vector a eliminates
not only the stochastic trend (i.e., a'ljl(l) = 0') but also the deterministic trend (i.e., a'8 = 0), then a'y, can be written as (10.1.24). Setting the a in (10.1.24) to the cointegrating vector in (I 0.2.2), we obtain
Yll = Y 'Y21
+ z,• + M.
(10.2.3)
where (10.2.4) Since q1 is jointly stationary, z~ is stationary. This equation, with z~ viewed as an error term and M as an intercept, is called a cointegrating regression. Its regression coefficients y, relating the permanent component in y 11 to those in y2, can be interpreted as describing the long-run relationship between y 11 and y 2,. The cointegrating regression will have the trend term a' 8 · t as an additional regressor if the cointegrating vector does not eliminate the deterministic trend in a'y,. The triangular representation is an n -equation system consisting of this cointegrating regression and the la't n - I rows of (10.1.9): D.y2, = 82
+ n2,
=
82 ((n-l)xl)
+
lji2(L)
e, ,
(I 0.2.5)
((n-l)xn) (nxl)
where 82 and n 2, are the last n - I elements of the n-dimensional vectors 8 and n,, respectively, and lji 2(L) is the last n - I rows of the lji(L) in the VMA representation (10.1.10). The implication of the h = I assumption is that, as noted in
Cointegration
635
the previous section, y 2, is not cointegrated (it would be cointegrated if h > I). In particular, each element of y2, is individually I (I). More generally, consider the case where the cointegration rank h is not necessarily I. By changing the order of the elements ofy, if necessary, it is possible (see, e.g., Hamilton, 1994, pp. 576-577) to select h linearly independent cointegrating vectors, a 1, a 2 , ... , ah, such that
A
(n xh)
(I 0.2.6)
=[a,
Partition y, conformably as
=
y, (n xI)
y,,
(hx 1)
l
(10.2.7)
Y21 [ ((n-h)xl)
Multiplying both sides of the BN decomposition (I 0.1.16) by this A' and noting that A''i'(l) = 0 (since the columns of A are cointegrating vectors), A'~ = 0 (if those cointegrating vectors also eliminate the deterministic trend), and A'y 1 = y 1, - f'y 2,, we obtain the following h cointegrating regressions:
y,,
r'
=
(hxl)
Y21
+
(hx(n-h))((n-h)xl)
/L
+ z,*
(hxl)
(I 0.2.8)
,
(hxl)
where /L = A'(y0 - ~ 0 ) and z; =A'~,. Since~~ is jointly stationary, so is z;. The triangular representation is these h cointegrating regressions, supplemented by the rest of the VMA representation: ~'>Y2t ((n-h)xl)
=
~2 ((n-h)xl)
+
U21
~2
((n-h)xl)
((n-h)xl)
+
W2(L)
e, .
(I 0.2.9)
((n-h)xn) (nxl)
Here, 'it 2 (L) is the last n-h rows of the 'it (L) in the VMA representation (I 0.1.1 0). It is easy to show (see Review Question I) that y 2, is not cointegrated. To illustrate the triangular representation and how it can be derived from the VMA representation, consider Example 10.3 (From VMA to triangular): In the bivariate system (I 0. 1.12) of Example I 0.2, the cointegration rank is 1. The cointegrating vector whose first element is unity is (1, -y)'. So the and f.l in (10.2.3) are
z;
636
Chapter 10
z; = (1, -y)~, =
e,,- ye2,.
11- = (1, -y)(yo- ~o) = (Yt.o- YY2.ol- (eJ,O- Yt:2,o), (10.2.10)
and the triangular representation is
Ytt = 11- + YY2t
+ (t:It -
yt:2,),
( 10.2.11)
{ L'>y2, =82+t:21· VAR and Cointegration
For the stationary case, we found it useful and convenient to model a vector process as a finite-order VAR. Although, as seen above, no cointegrated I (I) system can be represented as a finite-order VARin first differences, some cointegrated systems may admit a finite-order VAR representation in levels. So suppose a cointegrated 1(1) system y, can be written as y, (n x 1)
= ct
+ 8 · t + ~1 •
-
= e,,
· · ·-
(10.2.12)
(nxn)
For later reference, we eliminate
~
from this to obtain
y, = ct' + 8' · t +
(10.2.13)
where ct' (nxl)
= (
+24>2 + ... + p
8' (nxl)
= 4>(1)8.
(10.2.14)
How do we know that this finite-order VARin levels is a cointegrated I (I) system? An obvious way to find out is to derive the VMA representation, L'>~, = lji(L)e,, from the VAR representation,
= e, and noting that L'> = 1- L
L)e,.
(10.2.15)
637
Cointegration
Substitute
1'>~ 1
= lf!(L)e, into this to obtain
(10.2.16)
This equation has to hold for any realization of e 1 , so
(10.2.17)
This can be solved for >1! (L) by multiplying both sides from the left by
(nxn)
(nxn)
Since the rank of lf!(l) equals n -h when the cointegration rank of~~ ish, the rauk of <1>(1) is at most h. As shown below, the rank is actually h. To state a necessary and sufficient condition, let U(L) and V(L) ben x n matrix lag polynomials with all their roots outside the unit circle and let M(L) be a matrix polynomial satisfying (1 - L)In-h M(L) =
[
(hx(~--h))
(1 0.2.19)
That is, the first n -h diagonal elements of the diagonal matrix M(L) are 1-L and the remaining h diagonal elements are unity. A necessary and sufficient condition for a finite-order VAR process {~ 1 } following
638
Chapter 10
and the rank of
(10.2.20)
Then it follows from basic linear algebra that there exist two n x h matrices of full column rank, B and A, such that
B
A' .
(10.2.21)
(nxh) (hxn)
This is sometimes called the reduced rank condition. The choice of B and A is not unique; if F is an h x h nonsingular matrix, then B(F') -I and AF in place of B and A also satisfy (l 0.2.21 ). Substituting (l 0.2.21) into ( 10.2.18), we obtain BA'ljl(l) = 0. Since B is offull column rank, this equation implies A'ljl(l) = 0. So the columns of the n x h matrix A are cointegrating vectors. The Vector Error-Correction Model (VECM)
For the univariate case, we derived the augmented autoregression (9.4.3) on page 586 from an AR equation. The same idea can be applied to the VAR here. It is a matter of simple algebra to show that the VAR representation (L)~, = e, m (10.2.12) can be written equivalently as
where
= ... , +
p
... , + ... + p.
(10.2.23)
(nxn)
= -(Hl + H2 + · · · + p)
~'
for s
= I, 2, ... , p-I.
(10.2.24)
(nxn)
~r-l
from both sides of (10.2.22) and noting that p - In = -(In ,- ,-···- p) = -<1>(1), we can rewrite (10.2.22) as
Subtracting
L'>~, = -
= -BA'~r-1 + ~~L'>~r-1 + ~,L'>~r-2 + · .. + ~p-lL'>~r-p+l + e,. (10.2.25)
Using the relation y, = L'>y, =a'+~·
· t-
=a'+~·· t -
a+~
·t
+~"this
can be translated into an equation in y,:
B z,_, + (nxh)(hxl)
~1
L'>y,_, + · · · + ~p-JL'>Yr-p+l + e,, (10.2.27)
639
Cointegratton
where a* and 8' are as in ( !0.2.14) and z, is given by Zr (hxl)
=
A' Yr ·
(10.2.28)
(hxn)(nxl)
Since, as noted above, the n x h matrix A collects h cointegrating vectors, z, is trend stationary (with a suitable choice of the initial value y0 ) 9 This representation is called the vector error-correction model (VECM) or the error-correction representation. If it were not for the term Bz,. 1 in the VECM, the process, expressed as a VAR in first differences, could not be cointegrated. The VECM accommodates h cointegrating relationships by including h linear combinations of levels of the variables. If there are no time trends in the cointegrating relations, that is, if A'8 = 0, then 8* = cfl(l)8 = BA'8 = 0. So the VAR and the VECM representations do not involve time trends despite the possible existence of time trends in the elements ofy,. That the same I(O) process has the VMA, VAR, and VECM representations is known as the Granger Representation Theorem. Example 10.4 (From VMA to VARIVECM): In the previous example, we derived the triangular representation from the VMA representation (I 0. I .12). In this example, we derive the VAR and VECM representations from the same VMA representation. For the '11 (L) in ( !0.1.12), it is easy to verify that ( !0.2.17) is satisfied with (10.2.29) So the VMA can be represented by a finite-order VAR. For this VAR, we have (I 0.2.30)
which can be written as (!0.2.21) with
9 Just in case you are wondering why Yo is relevant. It is true that you can solve (10.2.27) for z as 1 Zt-1
= (B'B)- 1B'[a:* +a*. t - !:::.yt
+ s1!:::.Yr-1 + ... + Sp-t!:::.Yr-p+l + etl.
So you might think that z1 is trend stationary regardless of the choice of y0 . This is not true, strictly speaking, because, unlike in the VMA representation, !:::.y1 in the VARNECM representation is not defined for t -1, -2, .... Only with a judicious choice of Yo can one make !:::.y1 (t I, 2.... ) stationary. This point is mentioned in Johansen (1995, Theorem 4.2).
=
=
640
Chapter 10
B= [
~]
(10.2.31)
and A= [ _;] (for example).
By (10.2.27) and (10.2.28), the VECM for this choice ofB and A is [
~~::] = [y82+~:- ya2] + [8' -t2] .1_ [~] 1+e,, 21 _
z, =
A'y, = Ylt- YY2t·
(10.2.32)
y82,that is, if the cointegrating
The deterministic trend disappears if 8 1 = vector also eliminates the deterministic trend.
Johansen's ML Procedure
In closing this section, having introduced the VECM representation, we here provide an executive summary of the maximum likelihood (ML) estimation of cointegrated systems proposed by Johansen (1988). Go back to the VAR representation (10.2.13). For simplicity, we assume that there is no trend in the cointegrating relations, so that 8* = 0. 10 We have considered the conditional ML estimation (conditional on the initial values of y) of a VARin Section 8.7. If the error vector e1 is jointly normal N(O, Q) and if we have a sample ofT+ p observations (Y-r+I, y -r+2· ... , YT ), then the (average) log likelihood of (y 1 , ... , YT) conditional on the initial values (Y-r+l, Y-r+2• ... , Yo) is given by QT(II) =-
~ log(2rr) + ~ log(IQ-
T
1
1)-
~ L{(y,- II'x,)'Q- 1(y,- II'x,)l. 2 t=l
(10.2.33) where n is the dimension of the system (not the sample size), II= (II, Q), and I
n
'- [a* =
(nxl)
y,_, ,
<1>2
(nxn)
(n xn)
...
p ]
(n xn)
'
x,
Yt-2
. (10.2.34)
((l+np)xl)
Yr-p
1°For a thorough treatment of time trends in the VECM, see Johansen (1995, Sections 5.7 and 6,2).
641
Cointegration
Let
So = -4>(1) =-[In- (4>, + · · · + 4>p)].
(I 0.2.35)
(nxn)
Then the reduced rank condition is that 1; 0 = -BA' and the VECM (10.2.26) can be written as
t;.y, =
a'
+ I; oYt-1 + I; 1t;.y,_, + · · · +I; p-J t;.yt-p+l +
e,.
(10.2.36)
Equations (10.2.35) and (I 0.2.24) provide a one-to-one mapping between (4> 1 , ••• , 4>p) and (1; 0 , •.. , Sp- 1). Thus the same average log likelihood Qr(IJ) can be rewritten in terms of the VECM parameters as
n -:zlog(2rr)
l
T
+
I _, I '""'{ _,_ ' _, _,_ :zlog(IQ I)- T L.... (t;.y,- II x,) Q (t;.y,- II x,) , 2
t;:;;:: I
(10.2.37) where I
-' = [ (nxl) a' n
So
(nxn)
1;, (nxn)
.. · sr-'] (nxn)
Yt-1
x,
t;.y,_,
. (10.2.38)
• ((l+np)xl)
The ML estimate of the VECM parameters is the (a', I; 0 , ... , I; p-J, Q) that maximizes this objective function, subject to the reduced rank constraint that 1; 0 = -BA' for some n x h full rank matrices A and B. This constraint accommodates h cointegrating relationships on the 1(1) system. Obviously, the maximized log likelihood is higher the higher the assumed cointegration rank h. Thus, we can use the likelihood ratio statistic to test the null hypothesis that h = ho against the alternative hypothesis of more cointegration. Unlike in the stationary VAR case, the limiting distribution of the likelihood ratio statistic is nonstandard. Given the cointegration rank h thus determined, the ML estimate of h cointcgrating vectors can be obtained as the estimate of A. For a more detailed exposition of this procedure, see Hamilton (1994, Chapter 20) and also Johansen (1995, Chapter 6).
642
Chapter 10
QUESTIONS FOR REVIEW
1. (Error vector in the triangular representation) Consider the error vector (z~, u2 1) in the triangular representation (10.2.8) and (10.2.9). It can be written as
z~
(hxl)
[
U2r
l
_
-
''''(L) _ 'l" Er =
[ l w;(L) (hxn)
Er.
(10.2.39)
W2(L)
{(n-h)xl)
((n-h)xn)
Here,
w;(L) = (1,
-r') a(L),
(hxn)
(I 0.2.40)
(nxn)
where a(L) is from the BN decomposition and 'i' 2 (L) is the last n- h rows of 'i'(L) in the VMA representation (a) Show that 'i' 2 ( I) is of full row rank (i.e., the n-h rows of 'i' 2 (1) are linearly independent). Hint: Suppose, contrary to the claim, that there exists an (n-h)-dimensionalvectorb f' Osuchthatb''i' 2 (1) = 0'. Showthat(O', b') would be an (n-dimensional) cointegrating vector and the cointegration rank would be at least h + I. (b) Verify that 'it'(L) is absolutely summable. (c) Write 'i''(L) ='it~+ w;L + w;L 2 + · · ·. For the bivariate process of Example I 0.3, verify that 'it~ is not diagonal. (Therefore, even if the elements of e, are uncorrelated, z~ and u 2, can be correlated.) 2. (From triangular to VMA representations) For Example 10.3, start from the triangular representation (10.2.1 I) and recover the VMA representation. Hint: Take the first difference of the cointegrating regression. 3. (An alternative decomposition of 4>(1)) For the bivariate system of Example I 0.4, verify that B = (y,O)', A= (ljy, -I)'
is an alternative decomposition of 4>(1). Write down the VECM corresponding to this choice of B and A.
643
Cointegration
4. (A trivariate VAR that is 1(2)) Consider a trivariate VAR given by
Ylr = 2y,,,_,- YJ.r-2 + '" (i.e., 1'. 2 y,, Y2r = PY2,r-J + e2r with IPI < I, and
= e 1,),
!
YJt = E:Jr ·
Verify that this is an 1(2) system by writing 1'. 2 y, as a vector moving average process. Write down (L) for this system and show that
10.3 Testing the Null of No Cointegration Having introduced the notion of cointegration, we need to deal with two issues. The first is how to determine the cointegration rank, and the second is how to estimate and do inference on cointegrating vectors. The first will be discussed in this section, and the second will be the topic of the next section. There are several procedures for determining the cointegration rank. Among them are Johansen's (1988) likelihood ratio test derived from the maximum likelihood estimation of the VECM (briefly covered at the end of previous section) and the common trend procedure of Stock and Watson (1988). These procedures allow us to test the null of h = h 0 where h 0 is some arbitrary integer between 0 and n - I. We will not cover these procedures.'' Here, we cover only the simple test suggested by Engle and Granger (1987) and extended by Phillips and Ouliaris (1990). In that test, the null hypothesis is that h = 0 (no cointegration) and the alternative is that h > I. Spurious Regressions
The test of Engle and Granger ( 1987) is based on OLS estimation of the regression
y,, =
f.I
+ y'Y2r + z;.
(10.3.1)
where Yit is the first element of y,, Y2r is the vector of the remaining n - I elements, and
z; is an error term. This regression would be a cointegrating regression
if h = I and Y1t were part of the cointegrating relationship. Under the null of h = 0 (no cointegration), however, this regression does not represent a cointegrating relationship. Let ({j, y) be the OLS coefficient estimates of (JI, y ). It turns out that y 11 See Maddala and Kim (1998, Section 7) for a catalogue of available procedures.
Chapter 10
644
does not provide consistent estimates of any population parameters of the system' For example, even if Yir is unrelated to Y2r (in that Ll.y 11 and Ll.y2, are independent for all s, 1), the 1- and F -statistics associated with the OLS estimates become arbitrarily large as the sample size increases, giving a false impression that there is a close connection between y 11 and y2,. This phenomenon, called the spurious regression, was first discovered in Monte Carlo experiments by Granger and Newbold (1974). Phillips (1986) theoretically derived the large-sample distributions of the statistics for spurious regressions. For example, the 1-value, if divided by ft. converges to a nondegenerate distribution. The Residual-Based Test tor Cointegration
The regression (10.3.1) nevertheless provides a useful device for testing the null of no cointegration, because the OLS residuals, y 11 - fi- f/y 2,, should appear to have a stochastic trend if y, is not cointegrated and be stationary otherwise. Engle and Granger (1987) suggested applying the ADF 1-test to the residuals in order to test the null of no cointegration. Because of the use of the residuals, the test is called the residual-based test for cointegration. The asymptotic distributions of the test statistic for some leading unit-root tests were derived theoretically and the (asymptotic) critical values tabulated by Phillips and Ouliaris (1990) and Hansen (1992a). In contrast to the univariate unit-root tests, the asymptotic distributions (and hence the asymptotic critical values) depend on the dimension n of the system. This is because the residuals depend on ({i, y), which, being estimates based on data, are random variables. Here we indicate the appropriate asymptotic critical values when the unit-root test applied to the residuals is the ADF t-test of Proposition 9.6 for autoregressions without a constant or time. That is, the ADF 1-statistic is the 1value on the x,_ 1 coefficient in the following augmented autoregression estimated on residuals: Ll.x, = (p- l).t,_ 1 +
~1
Ll.x,_ 1 + · · · +
~PLI.x,_P
+error,
(10.3.2)
where x, here is the residual from regression (10.3.1). There is no need to include a constant in this augmented autoregression because, with the regression already including a constant, the sample mean of the residuals is guaranteed to be zero. There is no need to include time either, because the variables of the regression (10.3.1) implicitly or explicitly include time trends (see below for more on this). The number of lagged changes, p, needs to be increased to infinity with the sample
645
Cointegration
size T, but at a rate slower than T 113 More precisely, 12 p ---> oo but
~13
T
---> 0 as T ---> oo. P
(10.3.3)
The critical values are the same if Phillips' Z1-test (see Analytical Exercise 6 of the previous chapter) is used instead of the ADF t. There are three cases to consider. I. E(L'>y21) = 0 and E(L'>y1 1) = 0, so no elements of the 1(1) system have drift. The appropriate critical values are in Table IO.I(a), which reproduces Table lib of Phillips and Ouliaris (1990). For a statement of the conditions on the VMA representation under which the asymptotic distribution is derived, see Hamilton (1994, Proposition 19.4).
2. E(L'>y21) i" 0 but E(~'>Yir) may or may not be zero. This case was discussed by Hansen (1992a). Let g (= n - I) be the number of regressors besides a constant in the regression (I 0.3.1 ). Some of the g regressors have drift. Since linear trends from different regressors can be combined into one, 13 the regression (10.3.1) can be rewritten a' a regression of y 11 on a constant, g- I I( I) regressors without drift, and one I (I) regressor with drift. Now, since linear trends dominate stochastic trends (in the sense made precise in the discussion of the BN decomposition in the previous chapter), the 1(1) regressor with a trend behaves very much like time in large samples. So the residuals are "asymptotically the same" as the residuals from a regression with a constant, g - I driftless 1(1) variables, and time as regressors, in the sense that the limiting distribution of a statistic based on the residuals from the former regression is the same as that from the latter regression. The critical values for the ADF t test based on the latter regression are tabulated in Table lie of Phillips and Ouliaris (1990). Therefore, to find the appropriate critical value when the regression (I 0.3.1) has a constant and g regressors but not time, turn to this table for g - I regressors. If the regression has only one regressor besides the constant (i.e., if g = 1), then the regression is asymptotically equivalent to a regression of y 11 on a constant and time. For this case, the limiting distribution of the ADF t-statistic calculated from the residuals turns out to be the Dickey-Fuller distribution (D Fr') of Proposition 9.8. Table IO.l(b) combines these distributions: for the case of one 1(1) 12see Phillips and Ouliaris (1990, Theorem 4.2). The lag length p can be a random variable because it can be data-dependent. This condition is the same as in the Said-Dickey extension of the ADF tests in the previous chapter (see Section 9.4), except that p here can be a random variable. 13 For example, let n == 3 so that there are two regressors, Y2l and Y3t with coefficients Yl and Y2· respectively. If 82 and 83 are the drifts in Y2t and Y3r• respectively, then the linear trends in regression (10.3.1) are Y18zt and J128Jt, which can he combined into a single time trend (y182 + n83)t.
Chapter to
646 Table 10.1: Critical Values for the ADF t-Statistic Applied to Residuals Estimated Regression: Ytr = 1-' + y'Y2r Number of regressors, excluding constant (a) Regressors have no drift I
2 3 4 5
1%
2.5%
5%
10%
-3.96 -4.31 -4.73 -5.07 -5.28
-3.64 -4.02 -4.37 -4.71 -4.98
-3.37 -3.77 -4.11 -4.45 -4.71
-3.07 -3.45 -3.83 -4.16 -4.43
-3.67 -4.07 -4.39 -4.77 -5.02
-3.41 -3.80 -4.16 -4.49 -4.74
-3.13 -3.52 -3.84 -4.20 -4.46
(b) Some regressors have drift
2 3 4 5
-3.96 -4.36 -4.65 -5.04 -5.36
For panel (a), Phillips and Ouliaris (1990, Table lib). For panel (b), the first row is from Fuller ( 1996, Table IO.A.2), and the other rows are from Phillips and Ouliaris (1990, Table lie). SOURCE:
regressor (g = I), it shows the ADF t' -distribution, and for the case of g (> I) regressors, it shows the critical values from Table lie of Phillips and Ouliaris (1990) for g - I regressors. For example, if g = 2, the 5 percent critical value is -3.80, which is the 5 percent critical value in Table lie in Phillips and Ouliaris (1990) for one regressor. 3. This leaves the case where E(C>y2,) = 0 and E(C>Ytrl 'I 0. Since Ytr has drift and Y2r does not, we need to include time in the regression (I 0.3.1) in order to remove a linear trend from the residuals. The discussion for the previous case makes it clear that the ADF 1-statistic calculated using the residuals from a regression of y 1, on a constant, g driftless I (I) regressors y 2,, and time has the asymptotic distribution tabulated in Table I 0.1 (b) for g + I regressors (or Table lie of Phillips and Ouliaris, 1990, for g regressors). For example, if g = 2, the regressor (I 0.3.1) has a constant, two I (I) regressors, and time; the critical values can be found from Table IO.l(b) for three regressors.
Cointegration
647
Several comments are in order about the residual-based test for cointegration. • (Test consistency) The alternative hypothesis is that y, is cointegrated (i.e., h ::0: l ). The test is consistent against the alternative as long as Y1t is cointegrated with y 2,. 14 The reason (we will discuss it in more detail below for the case with h = l) is that the OLS residuals from the regression (10.3.1) will converge to a stationary process. However, if y 11 is I(l) and not part of the cointegration relationship, then the test may have no power against the alternative of cointegration because the OLS residuals, y 11 - y'y 21 , with a nonzero coefficient (of unity) on the 1(1) variable y 11 , will not converge to a stationary process. Thus, the choice of normalization (of which variable should be used as the dependent variable) matters for the consistency of the test. • (Should time be included in the regression?) If time is included in the regression (10.3.1), then the drift in y 11 , E(t:.y 11 ), affects only the time coefficient, making the numerical value of the residuals (and hence the ADF t-value) invariant to E(t:.y 11 ). This means that the case 3 procedure, which adds time to the regressors, can be used for case l, where E(t:.y1 1 ) happens to be zero. That is, if you include time in the regression (l 0.3. 1), then the appropriate critical value for case I is given from Table IO.I(b) for g +I regressors. The procedure is also valid for case 2, because if time is included in addition to a constant and g I( l) regressors with drift, then the regression can be rewritten as a regression with a constant, g driftless l(l) regressors, and time that combines the drifts in the g 1(1) regressors. This regression falls under case 3 and the critical values provided by Table 10.1 (b) for g + I regressors apply. Therefore, when time is included in the regression (10.3.1), the same critical values can be used, regardless of the location of drifts. A possible disadvantage is reduced power in finite samples. The finite-sample power is indeed lower at least for the DGPs examined by Hansen (1992a) in his simulations. • (Choice of lag length) The requirement (10.3.3) does not provide a practical rule in finite samples for selecting the lag length pin the augmented autoregression to be estimated on the residuals. There seems to be no work in the context of the residual-based test comparable to that of Ng and Perron ( 1995) for univariate I( l) processes. The usual practice is to proceed as in the univariate context, which is to use the Akaike information criterion (AIC) or the Bayesian information criterion (BIC), also called the Schwartz information criterion (SIC), to determine the number of lagged changes. t4For the case of h = I, this is an implication of Theorem 5.1 of Phillips and Ouliaris (1990). Their remark (d) to this theorem shows that the test is consistent when h > I and hence Y2r is cointegrated.
648
Chapter 10
• (Finite-sample considerations) Besides the residual-based ADF 1-test, a number of tests are available for testing the null of no cointegration. They include tests proposed by Phillips and Ouliaris (1990), Stock and Watson (1988), and Johansen ( 1988). Haug ( 1996) reports a recent Monte Carlo study examining the finite-sample penormance of these and other tests. It reveals a tradeoff between power and size distortions (that is, tests with the least size distortion tend to have low power). The residual-based ADF t-test with the lag length chosen by AIC, although less powenul than some other tests, has the least size distortion for the DGPs examined. The following example applies the residual-based test with the ADF !-statistic to the consumption-income relationship.
Example 10.5 (Are consumption and income cointegrated?): As was already mentioned in connection with Figure I 0.1 (a), log income (y1 ) and log consumption (c1 ) appear to be cointegrated with a cointegrating vector of (I, -I). This figure, however, is rather deceiving. The plot of y, - c1 (which is the log of the saving rate) in Figure I0.1 (b) shows an upward drift right after the war and a downward drift since the mid 1980s. (The latter is the well-publicized fact that the U.S. personal saving rate has been declining.) As seen below, the test results depend on whether to include these periods or not We initially focus on the sample period of 1950:Ql to 1986:Q4 (148 obervations ). We first test whether the two series are individually 1(1), by conducting the ADF t-test with a constant and time trend in the augmented autoregression. In applying the BIC to select the number of lagged changes in the augmented augoregression, we follow the same practice of the previous chapter: the maximum length (Pmaxl is [ 12 · (T I I 00) 114] (the integer part of [ 12· (T I I00) 114 ]), the sample period is fixed at t = Pmax + 2, Pmax +3, ... , T in the process of choosing the lag length p, and, given p, the maximum sample oft = p + 2, p + 3, ... , T is used to estimate the augmented autoregression with p lagged changes. For the present sample size, Pmax is 13. For disposable income, the BIC selects the lag length of 0 and the ADF !-statistic (t') is -1.80. The 5 percent critical value from Table 9.2 (c) is -3.41, so we accept the hypothesis that y, is I(!). For consumption, the lag length by the BIC is I and the ADF t statistic is -2.07. So we accept the I (I) null at 5 percent. Thus, both series might well be described as I( I) with drift Turning to the residual-based test for cointegration, the OLS estimate of the static regression is
649
Cointegration
0.07 0.06 0.05 0.04 0.03 0.02 0.01 0.00
47
52
57
62
67
72
77
82
87
92
97
Figure 10.1 (b): Log Income Minus Log Consumption
c, = -0.046 (0.009)
+
0.975 y,, R 2 = 0.998, (0.0037)
t =
1950:Ql to 1986:Q4. (10.3.4)
To conduct the residual-based ADF t test, an augmented autoregression without constant or time is estimated on the residuals with the lag length of 0 selected by the BIC. The ADF t statistic is -5.49. Since the series have time trends, we turn to Table I 0.1 (b), rather than Table I 0.1 (a), to find critical values. For g = I, the 5 percent critical value is -3.41, so we can reject the null of no cointegration.
If the residual-based test is conducted on the entire sample of 47:Q I to 98:QI, the ADF t statistic is -2.94 with the lag length of I determined by BIC, and thus, we cannot reject the hypothesis that the consumption-onincome regression is spurious! Testing the Null of Cointegration
In the above test, cointegration is taken as the alternative hypothesis rather than the null. But very often in economics the hypothesis of economic interest is whether the variables in question are cointegrated, so it would be desirable to develop tests where the null hypothesis is that h = I rather than that h = 0. Very recently, several tests of the null of cointegration have been proposed. For a catalogue of such tests, see Maddala and Kim ( 1998, Section 4.5). As was true in the testing of
650
Chapter 10
the null that a univariate process is 1(0), there is no single test of cointegration used by the majority of researchers yet. 15 QUESTION FOR REVIEW
1. For the residual-based test for cointegration with the regression (10.3.1) not including time as a regressor, verify that there is very little difference in the critical value between case I and case 2.
10.4
Inference on Cointegrating Vectors
In the previous section, we examined whether an 1(1) system is cointegrated. In this section, we assume that the system is known to be cointegrated and that the cointegration rank and the associated triangular representation are known. 16 Our interest is to estimate the cointegrating vectors in the triangular representation and to make inference about them. For the most part, we focus on the special case where the cointegration rank is I; the general case is briefly discussed at the end of the section. The SOLS Estimator
For the h = 1 case, the triangular representation is

    y_{1t} = μ + γ'y_{2t} + z*_t        (1 × 1)
    Δy_{2t} = δ_2 + u_{2t}              ((n-1) × 1)

with

    [z*_t, u_{2t}']' = Ψ*(L) ε_t,    (10.4.1)
       (n × 1)       (n × n)(n × 1)
where y_{2t} is not cointegrated. (This is just reproducing (10.2.3) and (10.2.5) with (10.2.39) for h = 1.) So the regression (10.3.1) is now the cointegrating regression. Therefore, there exists a unique (n-1)-dimensional vector γ such that y_{1t} - γ̃'y_{2t} is a stationary process (z*_t) plus some time-invariant random variable (μ) when γ̃ = γ and has a stochastic trend when γ̃ ≠ γ. This suggests that the OLS estimate (μ̂, γ̂) of the coefficient vector, which minimizes the sum of squared residuals, is consistent. In fact, as was shown by Phillips and Durlauf (1986) and Stock (1987) for the case where δ_2 in (10.4.1) (= E(Δy_{2t})) is zero and by Hansen (1992a, Theorem 1(b)) for the δ_2 ≠ 0 case, the OLS estimate γ̂ is superconsistent for the cointegrating vector γ, with the sampling error converging to 0 at a rate faster than the usual rate of √T.17 Also, the R² converges to unity.18 The OLS estimator of γ from the cointegrating regression will be referred to as the "static" OLS (SOLS) estimator of the cointegrating vector. Since the SOLS estimator is consistent, the residuals converge to a zero-mean stationary process. Thus, if a univariate unit-root test such as the ADF test is applied to the residuals, the test will reject the I(1) null in large samples. This is why the residual-based test of the previous section is consistent against cointegration. This fact, that the OLS coefficient estimates are consistent when the regressors are I(1) and not cointegrated, is a remarkable result, in sharp contrast to the case of stationary regressors. To appreciate the contrast, remember from Chapter 3 what it takes to obtain a consistent estimator of γ when the regressors y_{2t} are stationary: for the OLS estimator to be consistent, the error term has to be uncorrelated with y_{2t}; otherwise we need instrumental variables for y_{2t}. In contrast, if y_{1t} is cointegrated with y_{2t} and if the I(1) regressors y_{2t} are not cointegrated, as here, then we do not have to worry about simultaneity bias, at least in large samples, even though the error term z*_t and the I(1) regressors are correlated.19 In finite samples, however, the bias of the SOLS estimator (the difference between the expected value of the estimator and the true value) can be large, as noted by Banerjee et al. (1986) and Stock (1987). Another shortcoming of SOLS is that the asymptotic distribution of the associated t-value depends on nuisance parameters (which are the coefficients in Ψ*(L) in (10.4.1)), so it is difficult to do inference. Later in this section we will introduce another estimator (to be referred to as the "DOLS" estimator) which is efficient and whose associated test statistics (such as the t- and Wald statistics) have conventional asymptotic distributions.

15 The procedures of Johansen (1988) and Stock and Watson (1988) cannot be used for testing the null of cointegration against the alternative of no cointegration, because in their tests the alternative hypothesis specifies more cointegration than the null.

16 In practice, we rarely have such knowledge, and the decision to entertain a system with h cointegrating relationships is usually based on the outcome of prior tests (such as those mentioned in the previous section) designed for determining the cointegration rank. This creates a pretest problem, because the distribution of the estimated cointegrating vectors does not take into account the uncertainty about the cointegration rank. This issue has not been studied extensively.
17 The speed of convergence is T if δ_2 = 0, and T^{3/2} if δ_2 ≠ 0. All the studies cited here assume that μ is a fixed constant. The same conclusion should hold even if μ is treated as random. As mentioned in Section 10.1 (see the paragraph right below (10.1.21)), strictly speaking, μ + z*_t is not stationary even if z*_t is stationary. However, since μ and z*_t are asymptotically independent as t → ∞, μ + z*_t is asymptotically stationary, which is all that is needed for the asymptotic results here.

18 For a statement of the conditions under which these results hold, see Hamilton (1994, Proposition 19.2). Those conditions are restrictions on Ψ*(L)ε_t in (10.4.1).

19 If y_{2t} is cointegrated, then h ≥ 2 and the regression (10.3.1), with n - 1 regressors, is no longer a cointegrating regression (a cointegrating regression should have n - h regressors; see the triangular representation (10.2.8)). This case can be handled by the general methodology presented by Sims, Stock, and Watson (1990). See Hamilton (1994, Chapter 18) or Watson (1994, Section 2) for an accessible exposition of the methodology.
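The superconsistency result above is easy to see in a small simulation. The following sketch uses made-up data-generating parameters of our own choosing (gamma = 1, correlation rho between z*_t and u_{2t}); the root mean squared error of the SOLS slope shrinks roughly at rate 1/T, not 1/√T, even though the error is correlated with the regressor.

```python
import numpy as np

rng = np.random.default_rng(0)

def sols_error(T, gamma=1.0, rho=0.8):
    """One draw of gamma_hat - gamma from SOLS when the stationary error z*_t
    is contemporaneously correlated (rho) with the regressor's innovation u_2t."""
    u2 = rng.standard_normal(T)
    z = rho * u2 + np.sqrt(1.0 - rho**2) * rng.standard_normal(T)
    y2 = np.cumsum(u2)                      # driftless I(1) regressor
    y1 = gamma * y2 + z                     # cointegrating relation with correlated error
    X = np.column_stack([np.ones(T), y2])
    b = np.linalg.lstsq(X, y1, rcond=None)[0]
    return b[1] - gamma

for T in (100, 400, 1600):
    errs = np.array([sols_error(T) for _ in range(500)])
    print(T, np.sqrt(np.mean(errs**2)))     # RMSE falls roughly in proportion to 1/T
```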
The Bivariate Example
To be clear about the source and nature of the correlation between the error term and the I(1) regressor, and also to pave the way for the introduction of the DOLS estimator, consider the bivariate version of (10.4.1). For the time being, assume Ψ*(L) = Ψ*_0 (so there is no serial correlation in (z*_t, u_{2t})), δ_2 = 0, and μ = 0. The triangular representation can be written as

    y_{1t} = γ y_{2t} + z*_t,    Δy_{2t} = u_{2t}.    (10.4.2)
The error term z*_t and the I(1) regressor y_{2t} in the cointegrating regression in (10.4.2) can be correlated because

    Cov(y_{2t}, z*_t) = Cov(y_{2,0} + Δy_{2,1} + Δy_{2,2} + ··· + Δy_{2,t}, z*_t)
                      = Cov(Δy_{2,1} + Δy_{2,2} + ··· + Δy_{2,t}, z*_t)    (since Cov(y_{2,0}, z*_t) = 0)
                      = Cov(u_{2,1} + u_{2,2} + ··· + u_{2,t}, z*_t)       (since Δy_{2t} = u_{2t})
                      = Cov(u_{2t}, z*_t)                                  (since (z*_t, u_{2t}) is i.i.d.).    (10.4.3)
Since Ψ*_0 and Ω are not restricted to be diagonal, z*_t and u_{2t} can be contemporaneously correlated. To isolate this possible correlation, consider the least squares projection of z*_t on a constant and u_{2t} (= Δy_{2t}). Recall from Section 2.9 that Ê(y | 1, x) = [E(y) - β_0 E(x)] + β_0 x where β_0 = Cov(x, y)/Var(x), and that the least squares projection error, y - Ê(y | 1, x), has mean zero and is uncorrelated with x. Here, the mean is zero for both z*_t and u_{2t}. So Ê(z*_t | 1, u_{2t}) = β_0 u_{2t} and, if we denote the least squares projection error by v_t = z*_t - Ê(z*_t | 1, u_{2t}), we have

    z*_t = β_0 u_{2t} + v_t = β_0 Δy_{2t} + v_t,    E(v_t) = 0,  Cov(u_{2t}, v_t) = Cov(Δy_{2t}, v_t) = 0.    (10.4.4)
Substituting this into the cointegrating regression, we obtain

    y_{1t} = γ y_{2t} + β_0 Δy_{2t} + v_t.    (10.4.5)
This regression will be referred to as the augmented cointegrating regression. Now we show that the I(1) regressor y_{2t} is strictly exogenous in that Cov(y_{2s}, v_t) = 0 for all t, s. By construction, Cov(Δy_{2t}, v_t) = 0. For Cov(Δy_{2s}, v_t) with t ≠ s,

    Cov(Δy_{2s}, v_t) = Cov(u_{2s}, v_t) = Cov(u_{2s}, z*_t - β_0 u_{2t})
                      = Cov(u_{2s}, z*_t) - β_0 Cov(u_{2s}, u_{2t}) = 0,    (10.4.6)
because (z*_t, u_{2t}) is i.i.d. So Δy_{2t} is strictly exogenous. The strict exogeneity of Δy_{2t} implies the same for y_{2t} because

    Cov(y_{2t}, v_s) = Cov(y_{2,0} + Δy_{2,1} + Δy_{2,2} + ··· + Δy_{2,t}, v_s)
                     = Cov(Δy_{2,1} + Δy_{2,2} + ··· + Δy_{2,t}, v_s)    (since Cov(y_{2,0}, v_s) = 0)
                     = 0   for all t, s, by (10.4.6).    (10.4.7)
Continuing with the Bivariate Example
To recapitulate, we have shown that, in the augmented cointegrating regression (10.4.5), the I(1) regressor is strictly exogenous if Ψ*(L) = Ψ*_0. Let (γ̂, β̂_0) be the OLS coefficient estimate of (γ, β_0) from the augmented cointegrating regression. (This γ̂ should be distinguished from the SOLS estimator of γ in the cointegrating regression in (10.4.2).) This regression, (10.4.5), is very similar to the augmented autoregression (9.4.6) in that one of the two regressors is zero-mean I(0) and the other is driftless I(1). We have shown for the augmented autoregression that the "X'X" matrix, if properly scaled by T and √T, is asymptotically diagonal, so the existence of I(0) regressors can be ignored for the purpose of deriving the limiting distribution of γ̂, the OLS estimator of the coefficient of the I(1) regressor. The same is true here, and the same argument exploiting the asymptotic diagonality of the suitably scaled "X'X" matrix shows that the usual t-value for the hypothesis that γ = γ_0 is asymptotically equivalent to

    t̃ = [ (1/T) Σ_{t=1}^T y_{2t} v_t ] / sqrt[ σ_v² · (1/T²) Σ_{t=1}^T (y_{2t})² ],    (10.4.8)

where σ_v² is the variance of v_t. That is, the difference between the usual t-value and t̃ in (10.4.8) converges to zero in probability, so the limiting distribution of t and that of t̃ are the same. In the augmented autoregression used for the ADF test, the limiting distribution of the ADF t-statistic was nonstandard (it is the Dickey-Fuller t-distribution). In contrast, the asymptotic distribution of t̃ (and hence that of the usual t-value) is standard normal. To see why, first recall that the I(1) regressor y_{2s} is strictly exogenous in that Cov(y_{2s}, v_t) = 0 for all t, s. To develop intuition, temporarily assume that (z*_t, u_{2t}) is jointly normal. Then y_{2s} and v_t are independent for all (s, t), not just uncorrelated. Consequently, the distribution of v_t conditional on (y_{2,1}, y_{2,2}, ..., y_{2T}) is the same as its unconditional distribution, which is N(0, σ_v²).
So the distribution of the numerator of (10.4.8) conditional on (y_{2,1}, y_{2,2}, ..., y_{2T}) is normal with mean zero and variance σ_v² · (1/T²) Σ_{t=1}^T (y_{2t})². But the standard deviation of this normal distribution equals the denominator of t̃, so

    t̃ | (y_{2,1}, y_{2,2}, ..., y_{2T}) ~ N(0, 1).
Since this conditional distribution of t̃ does not depend on (y_{2,1}, y_{2,2}, ..., y_{2T}), the unconditional distribution of t̃, too, is N(0, 1). (This type of argument is not new to you; see the paragraph containing (1.4.3) of Chapter 1.) Recall from Chapter 2 that, even if the normality assumption is dropped, the distribution of the usual t-value is standard normal, albeit asymptotically, in large samples. The same conclusion is true here: the limiting distribution of t̃ is N(0, 1) even if (z*_t, u_{2t}) is not jointly normal. Proving this requires an argument different from the type used in Chapter 2, because y_{2t} here has a stochastic trend. We will not give a formal proof here, just an executive summary. It consists of two parts. The first is to derive the limiting distribution of the numerator of (10.4.8) (which can be done using a result stated in, e.g., Proposition 18.1(e) of Hamilton, 1994, or Lemma 2.3(c) of Watson, 1994). This distribution is nonstandard. The second is to show that the normalization in (10.4.8) converts the nonstandard distribution into a normal distribution (Lemma 5.1 of Park and Phillips, 1988). This example of a bivariate I(1) system is special in several respects: (a) there is no serial correlation in the error process (z*_t, u_{2t}), (b) the I(1) regressor y_{2t} is a scalar, (c) y_{2t} has no drift, and (d) μ = 0. Of these, relaxing (a) requires some thought.

Allowing for Serial Correlation
If it is not the case that Ψ*(L) = Ψ*_0 as in (10.4.2), then (z*_t, u_{2t}) is serially correlated. Consequently, the I(1) regressor y_{2t} in the augmented cointegrating regression (10.4.5) is no longer strictly exogenous, because Cov(Δy_{2s}, v_t), while still zero for t = s, is no longer guaranteed to be zero for t ≠ s. To remove this correlation, consider the least squares projection of z*_t on the current, past, and future values of u_2, not just on the current value of u_2. Noting that Δy_{2t} = u_{2t}, the projection can
be written as

    z*_t = β(L) Δy_{2t} + v_t,    β(L) = Σ_{j=-∞}^{∞} β_j L^j.    (10.4.9)
(This v_t is different from the v_t in (10.4.5).) By construction, E(v_t) = 0 and Cov(Δy_{2s}, v_t) = 0 for all t, s. The two-sided filter β(L) can be of infinite order, but suppose, for now, that it is not and β_j = 0 for |j| > p. Thus,

    z*_t = β_0 Δy_{2t} + β_{-1} Δy_{2,t+1} + ··· + β_{-p} Δy_{2,t+p} + β_1 Δy_{2,t-1} + ··· + β_p Δy_{2,t-p} + v_t.    (10.4.10)
Substituting this into the cointegrating regression in (10.4.2), we obtain the (vastly) augmented cointegrating regression

    y_{1t} = γ y_{2t} + β_0 Δy_{2t} + β_{-1} Δy_{2,t+1} + ··· + β_{-p} Δy_{2,t+p} + β_1 Δy_{2,t-1} + ··· + β_p Δy_{2,t-p} + v_t.    (10.4.11)
Since Cov(Δy_{2s}, v_t) = 0 for all t and s, Δy_{2t} is strictly exogenous, which means (as seen above) that the level regressor y_{2t}, too, is strictly exogenous. Thus, by including not just the current change but also the past and future changes of the I(1) regressor in the augmented cointegrating regression, we are able to maintain the strict exogeneity of y_{2t}. The OLS estimator of the cointegrating vector γ based on this augmented cointegrating regression is referred to as the "dynamic" OLS (DOLS) estimator, to distinguish it from the SOLS estimator based on the cointegrating regression without changes in y_2.
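The mechanics of DOLS are simple: build the lead and lag changes of the I(1) regressor, stack them next to its level, and run OLS. The sketch below is our own helper for the bivariate case (a constant is included; nothing here is a library routine) and illustrates the construction for a given truncation p.

```python
import numpy as np

def dols(y1, y2, p):
    """DOLS for one I(1) regressor: regress y1_t on a constant, y2_t, and
    Delta y2_{t-p}, ..., Delta y2_{t+p}.  Returns (mu_hat, gamma_hat, residuals)."""
    dy2 = np.diff(y2)                       # dy2[t-1] = y2[t] - y2[t-1]
    T = len(y1)
    rows, dep = [], []
    for t in range(p + 1, T - p):           # keep only t for which all leads and lags exist
        rows.append([1.0, y2[t]] + [dy2[t - 1 + j] for j in range(-p, p + 1)])
        dep.append(y1[t])
    X, dep = np.asarray(rows), np.asarray(dep)
    b = np.linalg.lstsq(X, dep, rcond=None)[0]
    return b[0], b[1], dep - X @ b
```

The residuals returned here are exactly the series to which the long-run variance estimator discussed below would be applied.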
There are 2 + 2p regressors, the first of which is driftless I(1) and the rest zero-mean I(0). As before, with suitable scaling by T and √T, the I(1) regressor is asymptotically uncorrelated with the 2p + 1 zero-mean I(0) regressors, so that the suitably scaled X'X matrix is again block diagonal and the zero-mean I(0) regressors can be ignored for the purpose of deriving the limiting distribution of the DOLS estimate of γ. Therefore, the expression for the random variable that is asymptotically equivalent to the t-value for γ = γ_0 is again given by (10.4.8), the expression derived for the case where (z*_t, u_{2t}) is serially uncorrelated. However, when (z*_t, u_{2t}) is serially correlated, the derivation of the asymptotic distribution of (10.4.8) is not the same as when (z*_t, u_{2t}) is serially uncorrelated, because the error term v_t can now be serially correlated; the projection (10.4.9), while eliminating the correlation between Δy_{2s} and v_t for all s, t, does not remove the serial correlation in v_t itself. To examine the asymptotic distribution,
let V (T × T) be the autocovariance matrix of T successive values of v_t, and let λ_v² be the long-run variance of v_t. Also, let X in the rest of this subsection be the T × 1 matrix (y_{2,1}, y_{2,2}, ..., y_{2T})'. Again, to develop intuition, temporarily assume that (z*_t, u_{2t}) is jointly normal. Then, since y_{2t} is strictly exogenous, the distribution of the numerator of (10.4.8) conditional on X is normal with mean zero and the conditional variance

    (1/T²) X'VX.    (10.4.12)
The square root of this would have to replace the denominator of (10.4.8) to make the ratio standard normal in finite samples. Fortunately, it turns out (see Corollary 2.7 of Phillips, 1988) that, in large samples, the distribution of the numerator is the same as the distribution you would get if v_t were serially uncorrelated but with a variance of λ_v² rather than σ_v². Therefore, all that is required to modify the ratio (10.4.8) to accommodate serial correlation in v_t is to replace σ_v² (= Var(v_t)) by λ_v² (the long-run variance of v_t); namely, if t̃ is given by (10.4.8) with σ_v still in the denominator, its rescaled value

    (σ_v / λ_v) · t̃    (10.4.13)

is asymptotically N(0, 1)! Since the OLS t-value for γ is asymptotically equivalent to t̃, it follows that
    (s / λ̂_v) · t  →_d  N(0, 1),    (10.4.14)

where λ̂_v is some consistent estimator of λ_v and s is the usual OLS standard error of regression (it is easy to show that s is consistent for σ_v). Put differently, if we rescale the usual standard error of γ̂ (the OLS estimate of the y_{2t} coefficient in (10.4.11)) as

    rescaled standard error = (λ̂_v / s) × usual standard error,    (10.4.15)
then the t-value based on this rescaled standard error is asymptotically N(0, 1). The foregoing argument is applicable only to the coefficient of the I(1) regressor, so the t-values for the β's in (10.4.11), even when their standard errors are rescaled as just described, are not necessarily N(0, 1) in large samples.
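In code the correction is a one-line rescaling of the usual OLS output. A minimal sketch of our own, assuming gamma_hat and its usual standard error have already been obtained from the augmented cointegrating regression, and that s (the OLS standard error of regression) and an estimate lam_v of the long-run standard deviation of v_t are available:

```python
def rescaled_se_and_t(gamma_hat, usual_se, s, lam_v, gamma_null=0.0):
    """Rescale the usual OLS standard error of gamma_hat as in (10.4.15); the
    resulting t-value is compared with the N(0, 1) critical values."""
    se = (lam_v / s) * usual_se
    return se, (gamma_hat - gamma_null) / se
```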
Since the regressors are strictly exogenous, the discussion in Sections 1.6 and 6.7 suggests that GLS might be applicable. However, the fact noted above that X'VX behaves asymptotically like λ_v² X'X implies that the GLS estimator of γ (to be referred to as the DGLS estimator) is asymptotically equivalent to the DOLS estimator (or more precisely, T times the difference converges to 0 in probability) (see Phillips, 1988, and Phillips and Park, 1988). Therefore, there is no efficiency gain from correcting for the serial correlation in the error term v_t by GLS. A consistent estimate of λ_v² is easy to obtain. Recall that in Section 6.6 we used the VARHAC procedure to estimate the long-run variance matrix of a vector process. The same procedure can be applied to the present case of a scalar error process. Consider fitting an AR(p) process to the residuals v̂_t from the augmented cointegrating regression,

    v̂_t = φ_1 v̂_{t-1} + φ_2 v̂_{t-2} + ··· + φ_p v̂_{t-p} + e_t    (t = p + 1, ..., T).    (10.4.16)
(This p should not be confused with the p in the augmented cointegrating regression.) Use the BIC to pick the lag length by searching over the possible lag lengths of 0, 1, ..., T^{1/3}. Using the relation (6.2.21) on page 385 and noting that the long-run variance is the value at z = 1 of the autocovariance-generating function, the long-run variance λ_v² of v_t can be estimated by

    λ̂_v² = σ̂_e² / (1 - φ̂_1 - φ̂_2 - ··· - φ̂_p)²,    (10.4.17)

where φ̂_j (j = 1, 2, ..., p) are the estimated AR(p) coefficients, ê_t is the residual from the AR(p) equation (10.4.16), and σ̂_e² is its estimated variance.
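A sketch of this long-run variance estimator, written as a small numpy helper of our own (the lag length p is taken as given for brevity; the BIC search over 0, 1, ..., T^{1/3} described above would simply wrap around it):

```python
import numpy as np

def long_run_variance(v, p):
    """AR(p)-based (VARHAC-style) estimate of the long-run variance of a
    zero-mean scalar series v, as in (10.4.16)-(10.4.17)."""
    v = np.asarray(v)
    T = len(v)
    X = np.column_stack([v[p - j:T - j] for j in range(1, p + 1)])  # v_{t-1},...,v_{t-p}
    dep = v[p:]
    phi = np.linalg.lstsq(X, dep, rcond=None)[0]
    e = dep - X @ phi
    sigma2 = (e @ e) / (T - p)              # innovation variance estimate
    return sigma2 / (1.0 - phi.sum()) ** 2  # autocovariance-generating function at z = 1
```

Note that the divisor used for sigma2 is a choice; the application below divides by T - p - K to replicate published estimates.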
General Case
We now turn to the general case of an n-dimensional cointegrated system with possibly nonzero drift. We focus on the h = 1 case because extending it to the case where h > 1 is straightforward. So the triangular representation is (10.4.1) and the augmented cointegrating regression is
    y_{1t} = μ + γ'y_{2t} + β_0'Δy_{2t} + β_{-1}'Δy_{2,t+1} + ··· + β_{-p}'Δy_{2,t+p} + β_1'Δy_{2,t-1} + ··· + β_p'Δy_{2,t-p} + v_t.    (10.4.18)

Here, β_j (j = 0, 1, 2, ..., p, -1, -2, ..., -p) are the least squares projection coefficients in the projection of z*_t (the error in the cointegrating regression) on the current, past, and future values of Δy_2. The DOLS estimator of the cointegrating
vector γ is simply the OLS estimator of γ from this augmented cointegrating regression.
The following results are proved in Saikkonen (1991) and Stock and Watson (1993).20

1. The bivariate results derived above carry over. That is, the DOLS estimator of γ is superconsistent, and the properly rescaled t- and Wald statistics for hypotheses about γ have the conventional asymptotic distributions (standard normal and chi-squared). The proper rescaling is to multiply the usual t-value by (s/λ̂_v) and the Wald statistics by the square of (s/λ̂_v). This simple rescaling, however, does not work for the t-value and the Wald statistic for hypotheses involving μ or the β_j.
2. If the two-sided filter β(L) in the projection of z*_t on the leads and lags of Δy_{2t} is of infinite order, then the error term v_t includes the truncation remainder,

    Σ_{j=p+1}^{∞} β_j'Δy_{2,t-j} + Σ_{j=p+1}^{∞} β_{-j}'Δy_{2,t+j}.
All the results just mentioned above carry over, provided that p in (10.4.18) is made to increase with the sample size T at a rate slower than T^{1/3}. See Saikkonen (1991, Section 4) for a statement of the required regularity conditions.

3. The estimator is efficient in some precise sense.21 The estimator is asymptotically equivalent to other efficient estimators such as Johansen's maximum likelihood procedure based on the VECM, the Fully Modified estimator of Phillips and Hansen (1990), and a nonlinear least squares estimator of Phillips and Loretan (1991) (and also to the DGLS, as mentioned above). For all these efficient estimators, the t- and Wald statistics for hypotheses about γ, correctly rescaled if necessary, have standard asymptotic distributions (see Watson, 1994, Section 3.4.3, for more details).

Other Estimators and Finite-Sample Properties
Stock and Watson (1993) examined the finite-sample performance of the estimators just mentioned (excluding the Phillips-Loretan estimator). For the DGPs and the sample sizes (T = 100 and 300) they examined, the following results emerged. (1) The bias is smallest for the Johansen estimator and largest for SOLS. (2) However, the variance of the Johansen estimator is much larger than those of the other efficient estimators, and DOLS has the smallest root mean square error (RMSE). (3) For all estimators examined, the distribution of the t-values (correctly rescaled if necessary) tends to be spread out relative to N(0, 1), suggesting that the null will be rejected too often. These conclusions are broadly consistent with the assessment of the Monte Carlo studies found in Maddala and Kim (1998, Section 5.7).

20 These authors assume that μ is a fixed constant. However, the same conclusions would hold even if μ were a time-invariant random variable.

21 The t-value is asymptotically N(0, 1), but the DOLS estimator itself is not asymptotically normal. So the usual criterion of comparing the variances of the asymptotic distributions among asymptotically normal estimators is not applicable here. See Saikkonen (1991, Section 3) for an appropriate definition of efficiency.

QUESTION FOR REVIEW
1. As we have emphasized, there is a similarity between the augmented autoregression (9.4.6) on page 587 and the augmented cointegrating regression (10.4.5). Then why is the t-value on the I(1) regressor in the augmented cointegrating regression (10.4.5) asymptotically N(0, 1) while that in (9.4.6) is not?
10.5 Application: The Demand for Money in the United States
The literature on the estimation of the money demand function is very large (see Goldfeld and Sichel, 1990, for a review). The money demand equation typically estimated in the literature is

    m_t - p_t = μ + γ_y y_t + γ_R R_t + z*_t,    (10.5.1)
where m_t is the log of the money stock in period t, p_t is the log of the price level, y_t is log income, R_t is the nominal interest rate, and z*_t is some error term. γ_y is the income elasticity and γ_R is the interest semielasticity of money demand (it is a semielasticity because the interest rate enters the money demand equation in levels, not in logs). Most empirical analyses that predate the literature on unit roots and cointegration suffer from two drawbacks. First, as will be verified below, all the variables involved, m_t - p_t, y_t, and R_t, appear to contain stochastic trends. If the variables have trends, conventional inference under the stationarity assumption is not valid. Second, the regressors may be endogenous. This is likely to be a serious problem in the case of money demand, because in virtually all macro models a shift in money demand represented by the error term z*_t affects the nominal interest rate and perhaps income. A resolution of both problems is provided by the econometric technique for estimating cointegrating regressions presented in the previous section.
Another issue addressed in this section is the stability of the money demand function. The consensus in the literature is that money demand is unstable. This is challenged by Lucas (1988), who, using annual data since 1900, argues that the interest semielasticity is stable if the income elasticity is constrained to be unity. We will examine Lucas' claim by applying the state-of-the-art econometric techniques developed in this chapter.

The Data
We base our analysis on the annual data studied by Lucas (1988), extended by Stock and Watson (1993) to cover 1900-1989. Their measures of the money stock, income, and the nominal interest rate are M1, net national product (NNP), and the six-month commercial paper rate (in percent at an annual rate), respectively. The use of net national product, rather than gross national product, follows the tradition since Friedman (1959) that the scale variable (y_t) in the money demand equation should be a measure of wealth rather than a measure of transaction volume. There are no official statistics on M1 that date back as far as 1900, so one needs to do some splicing of series from multiple sources. For the money stock and the interest rate, monthly data were averaged to obtain annual observations. See Appendix B of Stock and Watson (1993) for more details on data construction.

(m - p, y, R) as a Cointegrated System
Figures 10.2 and 10.3 plot m - p, y, and R. The three series have clear trends. Log real money stock (m - p) grew rapidly over the first half of the century, but experienced almost no growth thereafter until 1981. Log income (y), on the other hand, grew steadily over the entire sample period with a major interruption from 1930 to the early 1940s, so M1 velocity (y - (m - p)), also plotted in Figure 10.3, dropped during that period and then grew steadily until 1981. This movement in M1 velocity is fairly closely followed by the nominal interest rate, which suggests that m - p, y, and R are cointegrated, with a cointegrating vector whose element corresponding to y is about 1. Inspection of these figures suggests that the three variables, m - p, y, and R, might be well described as being individually I(1). In fact, Stock and Watson (1993), based on the ADF tests, report that m - p and y are individually I(1) with drift, while R is I(1) with no drift. They also report, based on the Stock-Watson (1988) common trend test, that the three-variable system is cointegrated with a cointegrating rank of 1. In what follows, we proceed under the assumption that m - p is cointegrated with y and R. (In the empirical exercise, you will be asked to check the validity of this premise.)
Figure 10.2: U.S. Real NNP and Real M1, in Logs
Figure 10.3: Log M1 Velocity (left scale) and Short-Term Commercial Paper Rate (right scale)
DOLS
The first line of Table 10.2 reports the SOLS estimate of the cointegrating regression (10.5.1). (The sample period is 1903-1987 because in the DOLS estimation below two lead and two lagged changes will be included in the augmented cointegrating regression.) The standard errors are not reported because the asymptotic distribution of the associated t-ratio, being dependent on nuisance parameters, is unknown. The SOLS estimate of the income elasticity of 0.943 is close to unity, but we cannot tell whether it is insignificantly different from unity. To be able to do inference, we turn to the DOLS estimation of the cointegrating vector, which is to estimate the parameters (μ, γ_y, γ_R) in the cointegrating regression (10.5.1) by adding Δy, ΔR, and their leads and lags. Following Stock and Watson (1993), the number of leads and lags is (arbitrarily) chosen to be 2, so the augmented cointegrating regression associated with (10.5.1) is

    m_t - p_t = μ + γ_y y_t + γ_R R_t + β_{y0} Δy_t + β_{y,-1} Δy_{t+1} + β_{y,-2} Δy_{t+2} + β_{y1} Δy_{t-1} + β_{y2} Δy_{t-2}
                + β_{R0} ΔR_t + β_{R,-1} ΔR_{t+1} + β_{R,-2} ΔR_{t+2} + β_{R1} ΔR_{t-1} + β_{R2} ΔR_{t-2} + v_t.    (10.5.2)

This dictates the sample period to be t = 1903, ..., 1987. The DOLS point estimate of (γ_y, γ_R), reported in the second line of Table 10.2, is obtained from estimating this equation by OLS. To calculate appropriate standard errors, the long-run variance of the error term v_t needs to be estimated. For this purpose we fit an autoregressive process to the DOLS residuals v̂_t. The order of the autoregression is (again arbitrarily) set to 2. With the DOLS residuals calculated for t = 1903, ..., 1987, the sample period for the AR(2) estimation is t = 1905, ..., 1987 (sample size = 83). The estimated autoregression is
    v̂_t = 0.93806 v̂_{t-1} - 0.13341 v̂_{t-2},    SSR = 0.19843.    (10.5.3)
We then use the formula (10.4.17), but to be able to replicate the DOLS estimates by Stock and Watson (1993, Table III), we divide the SSR by T - p - K rather than by T - p to calculate σ̂_e², where K here is the number of regressors in the augmented cointegrating regression (which is 13 here). Thus σ̂_e² = 0.19843/(83 - 2 - 13) = 0.0029181 and

    λ̂_v² = 0.0029181 / (1 - 0.93806 + 0.13341)² = 0.07647.    (10.5.4)
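The arithmetic behind (10.5.4) and the standard-error rescaling used in Table 10.2 can be checked in a few lines (pure arithmetic on the numbers reported above):

```python
# Arithmetic behind (10.5.4) and the standard-error rescaling used in Table 10.2
sigma2 = 0.19843 / (83 - 2 - 13)                 # 0.0029181
lam2 = sigma2 / (1 - 0.93806 + 0.13341) ** 2     # 0.07647
lam = lam2 ** 0.5                                # about 0.277
print(lam2, lam, lam / 0.096)                    # rescaling factor of about 2.89
```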
Table 10.2: SOLS and DOLS Estimates, 1903-1987

                               γ_y        γ_R        R²      Std. error of regression
SOLS estimates                0.943     -0.082      0.959    0.135
DOLS estimates                0.970     -0.101      0.982    0.096
(Rescaled standard errors)   (0.046)    (0.013)

SOURCE: The point estimates and standard errors are from Table III of Stock and Watson (1993). The R² and standard error of regression are the author's calculation.
Table 10.3: Money Demand in Two Subsamples

                                   1903-1945                1946-1987
                               γ_y        γ_R           γ_y        γ_R
SOLS estimates                0.919     -0.085         0.192     -0.016
DOLS estimates                0.887     -0.104         0.269     -0.027
(Rescaled standard errors)   (0.197)    (0.038)       (0.213)    (0.025)

SOURCE: Table III of Stock and Watson (1993).
The estimate λ̂_v is the square root of this, which is 0.277. Since the standard error of the DOLS regression is 0.096, the factor in the formula (10.4.15) for rescaling the usual OLS standard errors from the augmented cointegrating regression is 0.277/0.096 = 2.89. The numbers in parentheses in Table 10.2 are the adjusted standard errors thus calculated. Now we can see that the estimated income elasticity is not significantly different from unity.

Unstable Money Demand?
Table 10.3 reports the parameter estimates by SOLS and DOLS for two subsamples, 1903-1945 and 1946-1987. (Following Stock and Watson (1993), the break date of 1946 was chosen both because of the natural break at the end of World War II and because it divides the full sample nearly in two.) In sharp contrast to the estimates based on the first half of the sample, the postwar estimates are very different from those based on the full sample. Lucas (1988) argues that estimates like these are consistent with a stable demand for money with a unitary income elasticity. To support this view, he points
Figure 10.4: Plot of m - p - y against R (horizontal axis: commercial paper rate; one set of symbols for observations until 1945, another for observations since 1946)
to the evidence depicted in Figure 10.4. It plots m_t - p_t - y_t (the log inverse velocity, which would be the dependent variable in the money demand equation (10.5.1) if the income elasticity γ_y were set to unity) against the interest rate, with the postwar observations indicated by a different set of symbols from the prewar observations.22 The striking feature is that, although postwar interest rates are substantially higher, the postwar observations lie on the line defined by the prewar observations. This finding suggests that the postwar estimates suffer from a near-multicollinearity problem: because of the similar trends in y and R in the postwar period (shown in Figures 10.2 and 10.3), γ_y and γ_R estimated on postwar data are strongly correlated. Even though for each element the postwar estimate of (γ_y, γ_R) is different from the prewar estimate, they may not be jointly significantly different. This possibility could be easily checked by the Chow test of structural change, at least if the regressors were stationary. In the Chow test, you would estimate

    m_t - p_t = μ + γ_y y_t + γ_R R_t + δ_0 D_t + δ_y D_t y_t + δ_R D_t R_t + z*_t,    (10.5.5)

where D_t is a dummy variable whose value is 1 if t ≥ 1946 and 0 otherwise. Thus, the coefficients of the constant, y_t, and R_t are (μ, γ_y, γ_R) until 1945 and (μ + δ_0, γ_y + δ_y, γ_R + δ_R) thereafter. If the regressors are stationary, the Wald statistic for the null hypothesis that δ_0 = 0, δ_y = 0, δ_R = 0 is asymptotically χ²(3) as both subsamples grow larger with the relative size held constant (see Analytical Exercise 12 in Chapter 2). Now in the present case where the regressors are I(1) and not cointegrated, it has been shown by Hansen (1992b, Theorem 3(a)) that the Wald statistic (properly rescaled if necessary) has the same asymptotic distribution (of χ²) if the parameters of the cointegrating regression are estimated by an efficient method such as DOLS (which adds Δy_t, ΔR_t, and their leads and lags to the cointegrating regression (10.5.5)), the Fully Modified estimation, or the Johansen ML.23 The DOLS estimates of (10.5.5) augmented with two leads and two lags of Δy and ΔR are shown in Table 10.4. The insignificant Wald statistic supports Lucas' view that money demand in the United States has been stable in the twentieth century.

22 In Lucas' (1988) original plot, the sample period is 1900-1985 and the break date is 1957, but the message of the figure is essentially the same.

PROBLEM SET FOR CHAPTER 10

EMPIRICAL EXERCISES
(Skim Stock and Watson, 1993, Section 7 and Appendix B, before answering.) The file MPYR.ASC has annual data on the following:

Column 1: natural log of M1 (to be referred to as m)
Column 2: natural log of the NNP price deflator (p)
Column 3: natural log of NNP (y)
Column 4: the commercial paper rate in percent at an annual rate (R)
The sample period is 1900 to 1989. This is the same data used by Stock and Watson (1993) in their study of the U.S. money demand. See their Appendix B for data sources.
Questions (a) and (b) are for verifying the presumption that m - p is cointegrated with (y, R). The rest of the questions are about estimating the cointegrating vector.

(a) (Univariate unit-root tests) Stock and Watson (1993, Appendix B) report that the ADF t^μ and t^τ statistics, computed with 2 and 4 lagged first differences, fail to reject a unit root in each of m - p and R at the 10 percent level, and that the unit-root hypothesis is not rejected for y with 4 lags, but is rejected at the 10 percent (but not 5 percent) level with 2 lags. Verify these findings. In your tests, use the t^μ test for R, and t^τ for m - p and y, and use asymptotic critical values (those critical values for T = ∞).

23 We noted in the previous section that the Wald statistic is not asymptotically chi-squared even after rescaling if the hypothesis involves μ. Hansen's (1992b) result, however, shows that the Wald statistic for the hypothesis about the difference in μ is asymptotically chi-squared. Hansen (1992b) also develops tests for structural change at unknown break dates.
Table 10.4: Test for Structural Change

                               γ_y       γ_R       δ_0       δ_y       δ_R      Wald statistic
DOLS estimates                0.925    -0.090     1.36     -0.52      0.048     1.85
(Rescaled standard errors)   (0.142)   (0.026)   (0.72)    (0.31)    (0.034)    (p-value = 0.60)

SOURCE: Author's calculation, to be verified in the empirical exercise.
(b) (Residual-based tests) Conduct the Engle-Granger test of the null that m - p is not cointegrated with y and R. Set p = 1 (which is what is selected by the BIC). (If you do not include time in the regression, the ADF t-statistic derived from the residuals should be about -4.7. Case 2 discussed in Section 10.3 should apply. So the 5 percent critical value should be -3.80.)

(c) (DOLS) Reproduce the SOLS and DOLS estimates of the cointegrating vector reported in Table 10.2.

(d) (Chow test) Reproduce the DOLS estimates reported in Table 10.4. They are based on the following augmented cointegrating regression:

    m_t - p_t = μ + γ_y y_t + γ_R R_t + δ_0 D_t + δ_y D_t y_t + δ_R D_t R_t + β_{y0} Δy_t + β_{y,-1} Δy_{t+1} + ··· + v_t

(with the same two leads and two lags of Δy_t and ΔR_t as in (10.5.2)). The null hypothesis is that δ_0 = δ_y = δ_R = 0. Set the p in (10.4.16) to 2. Calculate the Wald statistic for the null hypothesis of no structural change.
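For part (d), one way to compute the statistic, under the rescaling described in Section 10.4 (multiply the usual Wald statistic by (s/λ̂_v)²), is sketched below; the helper and its arguments are our own, assuming b, (X'X)^{-1}, s², λ̂_v², and the restriction matrix R have already been computed from the augmented regression above.

```python
import numpy as np

def rescaled_wald(b, XtX_inv, s2, lam2, R):
    """Wald statistic for R b = 0 from the DOLS regression, rescaled by
    (s/lam_v)^2 as in Section 10.4.  b: OLS coefficients, XtX_inv: (X'X)^{-1},
    s2: OLS error variance, lam2: estimated long-run variance of v_t,
    R: restriction matrix selecting (delta_0, delta_y, delta_R)."""
    V = s2 * XtX_inv                              # usual OLS covariance matrix
    Rb = R @ b
    W = Rb @ np.linalg.solve(R @ V @ R.T, Rb)     # usual Wald statistic
    return (s2 / lam2) * W                        # compare with chi-squared(3)
```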
References

Aoki, Masanao, 1968, "Control of Large-Scale Dynamic Systems by Aggregation," IEEE Transactions on Automatic Control, AC-13, 246-253.
Banerjee, A., J. Dolado, D. Hendry, and G. Smith, 1986, "Exploring Equilibrium Relationships in Econometrics through Static Models: Some Monte Carlo Evidence," Oxford Bulletin of Economics and Statistics, 48, 253-277.
Box, G., and G. Tiao, 1977, "A Canonical Analysis of Multiple Time Series," Biometrika, 64, 355-365.
Davidson, J., D. Hendry, F. Srba, and S. Yeo, 1978, "Econometric Modelling of the Aggregate Time-Series Relationships between Consumers' Expenditure and Income in the United Kingdom," Economic Journal, 88, 661-692.
Engle, R., and C. Granger, 1987, "Co-Integration and Error Correction: Representation, Estimation, and Testing," Econometrica, 55, 251-276.
Engle, R., and S. Yoo, 1991, "Cointegrated Economic Time Series: An Overview with New Results," in R. Engle and C. Granger (eds.), Long-Run Economic Relations: Readings in Cointegration, New York: Oxford University Press.
Friedman, M., 1959, "The Demand for Money: Some Theoretical and Empirical Results," Journal of Political Economy, 67, 327-351.
Fuller, W., 1996, Introduction to Statistical Time Series (2d ed.), New York: Wiley.
Goldfeld, S., and D. Sichel, 1990, "The Demand for Money," Chapter 8 in B. Friedman and F. Hahn (eds.), Handbook of Monetary Economics, Volume I, New York: North-Holland.
Granger, C., 1981, "Some Properties of Time Series Data and Their Use in Econometric Model Specification," Journal of Econometrics, 16, 121-130.
Granger, C., and P. Newbold, 1974, "Spurious Regressions in Econometrics," Journal of Econometrics, 2, 111-120.
Hamilton, J., 1994, Time Series Analysis, Princeton: Princeton University Press.
Hansen, B., 1992a, "Efficient Estimation and Testing of Cointegrating Vectors in the Presence of Deterministic Trends," Journal of Econometrics, 53, 87-121.
Hansen, B., 1992b, "Tests for Parameter Instability in Regressions with I(1) Processes," Journal of Business and Economic Statistics, 10, 321-335.
Haug, A., 1996, "Tests for Cointegration: A Monte Carlo Comparison," Journal of Econometrics, 71, 89-115.
Johansen, S., 1988, "Statistical Analysis of Cointegrated Vectors," Journal of Economic Dynamics and Control, 12, 231-254.
Johansen, S., 1995, Likelihood-Based Inference in Cointegrated Vector Auto-Regressive Models, New York: Oxford University Press.
Lucas, R., 1988, Money Demand in the United States: A Quantitative Review, Carnegie-Rochester Conference Series on Public Policy, Volume 29, 137-168.
Lütkepohl, H., 1993, Introduction to Multiple Time Series Analysis (2d ed.), New York: Springer-Verlag.
Maddala, G. S., and In-Moo Kim, 1998, Unit Roots, Cointegration, and Structural Change, New York: Cambridge University Press.
Ng, S., and P. Perron, 1995, "Unit Root Tests in ARMA Models with Data-Dependent Methods for the Selection of the Truncation Lag," Journal of the American Statistical Association, 90, 268-281.
Park, J., and P. Phillips, 1988, "Statistical Inference in Regressions with Integrated Processes: Part 1," Econometric Theory, 4, 468-497.
Phillips, P., 1986, "Understanding Spurious Regressions in Econometrics," Journal of Econometrics, 33, 311-340.
Phillips, P., 1988, "Weak Convergence to the Matrix Stochastic Integral ∫B dB'," Journal of Multivariate Analysis, 24, 252-264.
Phillips, P., 1991, "Optimal Inference in Cointegrated Systems," Econometrica, 59, 283-306.
Phillips, P., and S. Durlauf, 1986, "Multiple Time Series Regression with Integrated Processes," Review of Economic Studies, 53, 473-495.
Phillips, P., and B. Hansen, 1990, "Statistical Inference in Instrumental Variables Regression with I(1) Processes," Review of Economic Studies, 57, 99-125.
Phillips, P., and M. Loretan, 1991, "Estimating Long-Run Economic Equilibria," Review of Economic Studies, 58, 407-436.
Phillips, P., and S. Ouliaris, 1990, "Asymptotic Properties of Residual Based Tests for Cointegration," Econometrica, 58, 165-193.
Phillips, P., and J. Park, 1988, "Asymptotic Equivalence of OLS and GLS in Regressions with Integrated Regressors," Journal of the American Statistical Association, 83, 111-115.
Saikkonen, P., 1991, "Asymptotically Efficient Estimation of Cointegration Regressions," Econometric Theory, 7, 1-21.
Sims, C., J. Stock, and M. Watson, 1990, "Inference in Linear Time Series Models with Some Unit Roots," Econometrica, 58, 113-144.
Stock, J., 1987, "Asymptotic Properties of Least Squares Estimators of Cointegrating Vectors," Econometrica, 55, 1035-1056.
Stock, J., and M. Watson, 1988, "Testing for Common Trends," Journal of the American Statistical Association, 83, 1097-1107.
Stock, J., and M. Watson, 1993, "A Simple Estimator of Cointegrating Vectors in Higher Order Integrated Systems," Econometrica, 61, 783-820.
Watson, M., 1994, "Vector Autoregressions and Cointegration," Chapter 47 in R. Engle and D. McFadden (eds.), Handbook of Econometrics, Volume IV, New York: North-Holland.