0 and λ is bounded and bounded away from 0. Typically λ is primarily a function of the direction of x rather than its magnitude. Optimally choosing λ is equivalent to optimally choosing Tjøstheim's m (see below). Furthermore, the geometric drift condition can be expressed in terms of the drift of the logarithm of V(X_t) (the log-drift condition). For the following theorems let Λ be the class of all nonnegative measurable functions on R^p that are bounded and bounded away from 0.

Theorem 1 Assume {X_t} is an aperiodic, φ-irreducible T-chain in R^p such that E( ||X_1||^r / (1 + ||x||^r) | X_0 = x ) is bounded for some r > 0. The following are equivalent conditions, each sufficient for {X_t} to be geometrically ergodic.
(i) limsup_{||x||→∞} E( λ(X_1)||X_1||^r / (λ(x)||x||^r) | X_0 = x ) < 1 for some λ ∈ Λ, r > 0.

(ii) limsup_{||x||→∞} E( (1 + ||X_n||^r) / (1 + ||X_m||^r) | X_0 = x ) < 1 for some r > 0, n > m ≥ 0.

(iii) limsup_{||x||→∞} E( log( δ + λ(X_1)||X_1||^r / (λ(x)||x||^r) ) | X_0 = x ) < 0 for some δ > 0, r > 0, λ ∈ Λ.

(iv) limsup_{||x||→∞} E( log( δ + (1 + ||X_n||^r) / (1 + ||X_m||^r) ) | X_0 = x ) < 0 for some δ > 0, r > 0, n > m ≥ 0.
NONLINEAR TS STABILITY
159
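As a toy numerical check of a condition of type (i), consider a linear AR(1), X_t = aX_{t-1} + e_t, with λ ≡ 1 and r = 1, for which E(||X_1|| / ||x|| | X_0 = x) tends to |a| as ||x|| → ∞. The AR(1) example, the coefficient value and the Gaussian noise below are our own assumptions, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
a = 0.8  # AR(1) coefficient (assumed example value)

def drift_ratio(x, n_sim=100_000):
    """Monte Carlo estimate of E(|X_1| / |x| | X_0 = x) for X_t = a*X_{t-1} + e_t."""
    e = rng.standard_normal(n_sim)
    return np.mean(np.abs(a * x + e)) / abs(x)

# The ratio approaches |a| = 0.8 < 1 as |x| grows, so a condition of
# type (i) (with lambda = 1, r = 1) holds for this linear chain.
for x in (10.0, 100.0, 1000.0):
    print(x, drift_ratio(x))
```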
Proof The equivalences follow from Cline and Pu (1999a, Lem. 4.1, Lem. 4.2) and the sufficiency for geometric ergodicity from condition (10) with test function V(x) = 1 + λ(x)||x||^r. □

If the noise terms in an FCAR(p) model do not dominate when the time series is very large then they may be at least partly ignored, as the following theorem suggests.

Theorem 2 Assume {X_t} is an aperiodic, φ-irreducible T-chain defined by (1) and (6) such that a_0, ..., a_p are bounded, sup_{||x||≤M} E( |c(e_1; x)|^r ) < ∞ for some r > 0 and all M < ∞, and lim_{||x||→∞} E( |c(e_1; x)|^r / ||x||^r ) = 0. Let a be given by (8) and θ(x) = (a(x), x_1, ..., x_{p-1}) / (1 + ||x||) for x = (x_1, ..., x_p). The following are equivalent conditions, each sufficient for {X_t} to be geometrically ergodic.

(i) limsup_{||x||→∞} E( λ(X_1)/λ(x) | X_0 = x ) ||θ(x)||^r < 1 for some r > 0, λ ∈ Λ.

(ii) limsup_{||x||→∞} E( ∏_{j=m}^{n} ||θ(X_j)||^r | X_0 = x ) < 1 for some r > 0, n ≥ m ≥ 1.

(iii) limsup_{||x||→∞} E( log( δ + (λ(X_1)/λ(x)) ||θ(x)||^r ) | X_0 = x ) < 0 for some δ > 0, r > 0, λ ∈ Λ.

(iv) limsup_{||x||→∞} E( log( δ + ∏_{j=m}^{n} ||θ(X_j)||^r ) | X_0 = x ) < 0 for some δ > 0, r > 0, n ≥ m ≥ 1.

□
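A sketch of how θ(x) in Theorem 2 can be computed and the product in a condition of type (ii) monitored along a simulated path. The FCAR(2) coefficient functions below are bounded but otherwise assumed purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed FCAR(2) coefficient functions (bounded, illustration only).
def a1(x): return 0.5 + 0.2 * np.tanh(x[0])
def a2(x): return 0.2

def a(x):
    """Conditional mean a(x) = a1(x)*x_1 + a2(x)*x_2."""
    return a1(x) * x[0] + a2(x) * x[1]

def theta(x):
    """theta(x) = (a(x), x_1) / (1 + ||x||) for p = 2."""
    return np.array([a(x), x[0]]) / (1.0 + np.linalg.norm(x))

# Monitor the product of ||theta(X_j)||^r (r = 1) along a path started
# far from the origin, as in condition (ii) with m = 1, n = 5.
x = np.array([100.0, 100.0])
prod = 1.0
for _ in range(5):
    x = np.array([a(x) + rng.standard_normal(), x[0]])
    prod *= np.linalg.norm(theta(x))
print(prod)  # below 1: consistent with condition (ii)
```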
Similar theorems express equivalent conditions for transience of a Markov chain (e.g., Cline and Pu (2001, Thm. 2.2)).

Example 6. We refer to conditions (iii, iv) in Theorem 1 and (iii, iv) in Theorem 2 as "log-drift" conditions. A model which illustrates the benefits of using a log-drift condition is the periodic FCAR(1) model,

ξ_t = a_0(ξ_{t-1}) + a_1(ξ_{t-1})ξ_{t-1} + e_t,

where the coefficient function a_1(x) is periodic with period τ. This is an unusual model but there is a suggestion of periodicity in the coefficient functions fitted by Chen and Tsay (1993a) for the sunspot number data, and it was at the urging of Prof. Chen that we studied the stability of this particular model. At any rate, it is a useful example. With n = m = 1, the condition in Theorem 2(iv) can be reexpressed as

limsup_{|x|→∞} E( log(δ + |a_1(ξ_1)|) | ξ_0 = x ) < 0 for some δ > 0. (13)
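Conditions of this type can be explored numerically. A sketch assuming a_1(x) = c + d cos(x), a_0 = 0 and standard normal noise; the parameter values are our own, chosen so that the skeleton is unstable while the period-average criterion (14) below still holds:

```python
import numpy as np

rng = np.random.default_rng(2)
c, d, delta = 0.3, 1.2, 0.01   # assumed values: |c| + |d| > 1, |c| < |d| < 2

def a1(x):
    """Periodic coefficient function a1(x) = c + d*cos(x)."""
    return c + d * np.cos(x)

def one_step_log_drift(x, n_sim=100_000):
    """Monte Carlo estimate of E(log(delta + |a1(xi_1)|) | xi_0 = x),
    taking a0 = 0 and standard normal noise (our assumptions)."""
    xi1 = a1(x) * x + rng.standard_normal(n_sim)
    return np.mean(np.log(delta + np.abs(a1(xi1))))

# The one-step drift oscillates in x (so (13) involves a genuine limsup),
# while the average of log|a1| over one period, as in (14), is sharp:
# here it is about log(0.6) < 0, even though sup|a1| = |c| + |d| = 1.5 > 1,
# so the skeleton is unstable.
u = np.linspace(0.0, 2.0 * np.pi, 200_000, endpoint=False)
period_average = np.mean(np.log(np.abs(a1(u))))
print(one_step_log_drift(100.0), period_average)
```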
160
CLINE AND PU
(This can also be determined from the 2-step condition with V(x) = 1 + |x|.) The function E( log(δ + |a_1(ξ_1)|) | ξ_0 = x ) is close to being periodic in |a_1(x)x|. Thus the left-hand side of (13) is only a limsup, not a limit, which has the unfortunate consequence that condition (13) is not sharp. The solution is to choose n = m = 2 instead. Assume a_1(x) is continuously differentiable with a derivative that is 0 on a set of measure 0. Then the function E( log(δ + |a_1(ξ_2)|) | ξ_0 = x ) does in fact have a limit and therefore does lead to a sharp condition, namely if

∫_0^τ log(|a_1(u)|) du < 0 (14)

then {ξ_t} is geometrically ergodic and if ∫_0^τ log(|a_1(u)|) du > 0 then {ξ_t} is transient (Cline and Pu (1999a, Thm. 3.4; 2001, Thm. 3.2)). Note that the skeleton process, x_t = a_0(x_{t-1}) + a_1(x_{t-1})x_{t-1}, is geometrically stable if and only if sup_x |a_1(x)| < 1. Therefore, this is an example where stability of the time series does not coincide with stability of its skeleton. For a specific example, suppose a_1(x) = c + d cos(x) with |c| + |d| > 1 and |c| < |d| < 2. Then the time series is geometrically ergodic but the skeleton is not stable. Although condition (14) does not explicitly refer to the noise distribution, the noise does play a major role in determining the condition by causing the drift to be averaged.

Example 7. The directional method, on the other hand, seems to work well with many threshold models of order 1, including ones that employ a delay. Consider, for example, the TAR(1) time series with delay d given by

ξ_t = a_0(X_{t-1}) + a_1(X_{t-1})ξ_{t-1} + e_t, X_{t-1} = (ξ_{t-1}, ..., ξ_{t-d}),

with a_0(x) bounded and a_1(x) depending only on (sgn(x_1), ..., sgn(x_d)) for x = (x_1, ..., x_d). There are thus 2^d regions R_1, ..., R_{2^d}, each corresponding to a coefficient: a_{1j} = a_1(x), x ∈ R_j. Notice that the thresholds (boundaries of the regions) are the axial hyperplanes. As long as the time series remains large (which is all that is of concern for stability), ξ_t avoids the thresholds and hence X_t cycles among some subset of the regions. There may be several such subsets possible, depending on the signs of the a_{1j}'s, and it is the drift of the "worst case" cycle that is critical. Thus the geometric drift condition for stability is exactly (Cline and Pu (1999b, Cor. 2.4))

max_C ∏_{j ∈ C} |a_{1j}| < 1,
where the maximum is taken over the possible cycles described above. This corresponds precisely to the geometric stability of the skeleton. The test function for establishing this condition takes the directional form, V(x) = 1 + λ(x)||x||^r, and the optimal choice for λ(x) is constant on each of the 2^d regions.

6. The Piggyback Method. To this point we have presented examples studied with what one might call the traditional methods of drift analysis. Not all models yield to this analysis, however, including some surprisingly simple models. In this section we present a new approach, as yet somewhat informal, that employs a much more sophisticated m-step test function. We call it the piggyback method because it relies on finding a stable Markov chain similar to a process embedded in the one of interest and building a new test function on top of one that works for the known stable chain.

We will first present a sketch of the piggyback method and then provide three examples. The sketch, however, is quite rough because in fact the method is applied somewhat differently for each of the examples. (See the papers referenced below.) Indeed, the concept is elegant but its application is messy and as yet we do not know how generally useful it may prove to be.

The time series {ξ_t} is embedded in a Markov chain {X_t}. At the same time we consider another Markov chain {Y_t} similar to a simpler process embedded in {X_t}. The chain {Y_t} is assumed to be geometrically ergodic and, in particular, to satisfy the geometric drift condition with test function V_1(y). Its stationary distribution we denote G. If there is a function H(y) which somehow exemplifies (or bounds) the relative change in magnitude of X_1 when X_0 is large and Y_0 = y then, intuitively, the stationary value of H(Y_t) will measure the geometric drift of {X_t}. Thus, a log-drift condition for geometric stability would be

exp( ∫ log(δ + H(y)) G(dy) ) < 1 for some δ > 0. (15)
To obtain a test function that will yield such a condition, we first define h(y) = δ + H(y) and let y(x) identify the "embedding" of Y_t into X_t. An integer m is chosen suitably large, a "correction" function c(x) is constructed and the ultimate test function is (something like)

V(x) = c(x) ( ∏_{j=1}^{m} E( h(Y_j) | Y_0 = y(x) ) )^{1/m}.
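The criterion (15) can be estimated by an ergodic average along a simulated path of {Y_t}. Everything concrete in the sketch below is assumed purely for illustration: {Y_t} is taken to be a stable Gaussian AR(1) (standing in for the geometrically ergodic chain with stationary law G) and H is an arbitrary bounded multiplier:

```python
import numpy as np

# Assumed multiplier H(y), bounded and bounded away from 0 (illustration).
def H(y):
    return 0.6 * abs(np.tanh(y)) + 0.2

def criterion_15(delta=0.01, burn=1_000, n=200_000, seed=3):
    """Estimate exp( int log(delta + H(y)) G(dy) ) by an ergodic average
    along a simulated path of Y_t = 0.5*Y_{t-1} + e_t (assumed chain)."""
    rng = np.random.default_rng(seed)
    y, total = 0.0, 0.0
    for t in range(burn + n):
        y = 0.5 * y + rng.standard_normal()
        if t >= burn:
            total += np.log(delta + H(y))
    return np.exp(total / n)

print(criterion_15())  # a value below 1 indicates geometric stability
```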
The key point for this paper is that the piggyback method and the resulting condition for ergodicity capture the implicit stochastic behavior of {Y_t}, not the behavior of a deterministic skeleton of {X_t} or {ξ_t}. Even if such a skeleton can be identified, its stability properties will not coincide with those of {ξ_t}. Alternatively, one may think of {X_t} as having a sort of stochastic skeleton which must be analyzed for stability.

Example 8. Our first example is a bivariate threshold model (Cline and Pu (1999a, Ex. 3.2; 2001, Ex. 3.2)). Indeed it is the simplest such model that is not just two independent univariate models joined together. Suppose

X_{t,1} = a_1(X_{t-1,1})X_{t-1,1} + e_{t,1}, X_{t,2} = a_2(X_{t-1,1})X_{t-1,2} + e_{t,2},

where a_i(x_1) = a_{i1}1_{x_1<0} + a_{i2}1_{x_1≥0}, i = 1, 2. Note that the nonlinearity of the second component X_{t,2} is driven by the univariate TAR(1) process {X_{t,1}}. The latter is our "embedded" process and is stable when

max(a_{11}, a_{12}, a_{11}a_{12}) < 1.
(16)
Let G be its stationary distribution. The function |a_2(x_1)| plays the role of H(y) so that the resulting (sharp) stability condition is, in addition to (16),

∫ log(|a_2(y)|) G(dy) < 0,

which represents the stationary value of the relative change in magnitude of X_{t,2} when it is very large. This condition neither implies nor is implied by the stability condition for the corresponding skeleton process: (16) plus max(a_{21}, a_{22}, a_{21}a_{22}) < 1.
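The sharp criterion for the second component can be estimated by averaging log|a_2(X_{t,1})| along a simulated path of the stable first component. The coefficient values and Gaussian noise below are assumed for illustration:

```python
import numpy as np

def second_component_drift(n=200_000, seed=4):
    """Estimate int log|a2(y)| G(dy), the sharp criterion for X_{t,2},
    by an ergodic average along the first component {X_{t,1}}."""
    rng = np.random.default_rng(seed)
    # Assumed coefficients (illustration only): (16) holds for the first
    # component since max(a11, a12, a11*a12) = 0.5 < 1.
    a11, a12 = -1.5, 0.5
    a21, a22 = 1.3, 0.6
    x1, total = 0.0, 0.0
    for _ in range(n):
        x1 = (a11 if x1 < 0 else a12) * x1 + rng.standard_normal()
        total += np.log(abs(a21 if x1 < 0 else a22))
    return total / n

# A negative value indicates geometric ergodicity of the bivariate chain,
# even though |a21| = 1.3 > 1, so no deterministic condition on a2 alone
# (i.e., on the skeleton) would reveal it.
print(second_component_drift())
```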
Example 9. The second example (cf. Cline and Pu (1999c)) is the threshold ARMA(1, q) model (TARMA) with a delay d:

ξ_t = a_0(X_{t-1}) + a_1(X_{t-1})ξ_{t-1} + e_t + b_1(X_{t-1})e_{t-1} + ⋯ + b_q(X_{t-1})e_{t-q},

where X_{t-1} = (ξ_{t-1}, ..., ξ_{t-d}), a_0, b_1, ..., b_q are bounded and a_1(x) is (asymptotically) piecewise constant. We further assume the thresholds are affine, which implies the regions on which a_1(x) is constant are cones in R^d. This is the simplest interesting example of a TARMA process. See also Brockwell, Liu and Tweedie (1992) and Liu and Susko (1992). In the case q > 0, the time series must be embedded in the Markov chain {(X_t, U_t)} where X_t is as above and U_t = (e_t, ..., e_{t-q+1}).

Threshold ARMA models have not seen a lot of study, perhaps in part because the moving average terms can affect the irreducibility and periodicity
properties of the chain in complicated ways as yet not well understood. (See, for example, Cline and Pu (1999c).) This by itself is a major role played by the noise but we will pass by it here.

Let the regions R_1, ..., R_m be the partition of R^d such that a_1(x) is constant on each region, with a_{11}, ..., a_{1m} being the corresponding constants. There are basically two types of situations that arise in these models when ξ_t is very large: cyclical and noncyclical. For the cyclical situation, {X_t} essentially cycles close to certain rays having the form

( ∏_{i=1}^{d-1} a_{1j_i}, ∏_{i=2}^{d-1} a_{1j_i}, ..., 1 ) x_1, (17)

if all are in the interior of the conical regions. Noise plays no role in determining the stability in this situation since X_t avoids the thresholds; all that matters is the product of coefficients realized by moving through the cycle. For a model which is purely cyclical the stability condition is based on the "worst case" cycle; it is deterministic and corresponds to that of the skeleton process, and it is very much like that of the TAR(1) process with delay discussed in section 5.

The model may also have, however, situations where one or more of the rays of type (17) actually lie on a threshold. In such a case, X_t can fall on either side of the threshold, and thus into one of two possible regions, at random but depending on both the present error e_t and the past errors e_{t-1}, ..., e_{t-q}. If J_t denotes the region that X_t is in then {(J_t, U_t)} behaves something like a Markov chain where the first component is one of a finite number of states and the second component is stationary. (If q = 0 then {J_t} itself is like a finite state Markov chain.) We relate {(J_t, U_t)} to such a Markov chain denoted, say, {(J~_t, U~_t)}. This chain is not necessarily irreducible or aperiodic but clearly every invariant measure is finite. Indeed it may be decomposed into a finite number of uniformly ergodic subprocesses. The coefficients |a_{1j}| play the role of H(y) in this model. Now let G be any stationary distribution for {(J~_t, U~_t)} and define π_j = ∫_{R^q} G(j, du). If (condition (15))

exp( ∑_j π_j log(δ + |a_{1j}|) ) < 1 for some δ > 0
regardless of the choice of G then {(X_t, U_t)} is geometrically ergodic and, again, the condition is sharp. Because at least one ray lies on a threshold, the noncyclical models are special cases, but the stability condition for a noncyclical process can be quite different from that of nearby purely cyclical processes. See the parameter spaces for the TARMA(1,1) with delay 2 in Cline and Pu (1999c).
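The region-weighted criterion can be estimated by tracking how often each region is visited along a simulated path. A sketch for a two-regime TARMA(1,1)-style recursion; all numerical values are assumed for illustration:

```python
import numpy as np

def region_weighted_drift(a1=(-1.2, 0.7), b1=0.5, delta=0.01,
                          n=200_000, seed=5):
    """Estimate sum_j pi_j * log(delta + |a1j|), with pi_j the stationary
    frequency of region j, for a two-regime TARMA(1,1)-style recursion
    (all numerical values are assumed for illustration)."""
    rng = np.random.default_rng(seed)
    xi, e_prev, total = 0.0, 0.0, 0.0
    for _ in range(n):
        j = 0 if xi < 0 else 1             # region J_t from the sign of xi
        total += np.log(delta + abs(a1[j]))
        e = rng.standard_normal()
        xi = a1[j] * xi + e + b1 * e_prev  # MA term ties J_t to past errors
        e_prev = e
    return total / n

# A negative value corresponds to the exponential criterion being below 1,
# hence geometric ergodicity of {(X_t, U_t)}.
print(region_weighted_drift())
```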
Example 10. The third example combines the nonlinearity of piecewise continuous coefficient functions with a piecewise conditional heteroscedasticity, a model called the threshold AR-ARCH time series:

ξ_t = a(X_{t-1}) + b(X_{t-1})e_t + c(e_t; X_{t-1}),

where a(x) and b(x) are piecewise linear, {c(e_1; x)} is uniformly integrable and X_{t-1} = (ξ_{t-1}, ..., ξ_{t-p}). We further suppose a(x) and b(x) are homogeneous, b(x) is locally bounded away from zero except at x = 0, the thresholds are subspaces containing the origin and the regions of constant behavior are cones. Note that these assumptions need only hold asymptotically (in an appropriate sense) as x gets large.

Once again, the Markov chain under study is {X_t}. The basic idea on which we piggyback is that the process {X_t} collapsed to the unit sphere behaves very much like a Markov chain. The compactness of the unit sphere serves to make this chain stable and then the stability condition for the original chain can be computed. More specifically, define

ξ*_t = a(X*_{t-1}) + b(X*_{t-1})e_t, X*_t = (ξ*_t, ..., ξ*_{t-p+1}) and θ*_t = X*_t / ||X*_t||.
Then, due to the homogeneity of a(x) and b(x), {θ*_t} is a Markov chain on the unit sphere and is uniformly ergodic with stationary distribution G, say. By the piggyback method, therefore, X_t has geometrically stable drift if

∫ E( log( |a(θ) + b(θ)e_1| / |θ_1| ) ) G(dθ) < 0.

For a simple demonstration, suppose p = 1, a(x) = (a_1 1_{x<0} + a_2 1_{x>0})x and b(x) = (b_1 1_{x<0} + b_2 1_{x>0})|x|. With p and q the masses that G assigns to the negative and positive half-lines, the criterion reduces to

(p/(p+q)) E( log(|a_1 + b_1 e_1|) ) + (q/(p+q)) E( log(|a_2 + b_2 e_1|) ) < 0;
then {ξ_t} is geometrically ergodic. This example and generalizations of it will be considered fully in a forthcoming paper (Pu and Cline (2001)).

Example 11. A very simple model which has not been analyzed fully is the ordinary TAR(2) model with additive noise,

ξ_t = a_1(X_{t-1})ξ_{t-1} + a_2(X_{t-1})ξ_{t-2} + e_t,

where X_{t-1} = (ξ_{t-1}, ξ_{t-2}) and a_1(x) and a_2(x) are piecewise constant. The precise stability condition is not known even when there is but one affine threshold. The results of this section, however, suggest that the key will be to identify an appropriate stochastic skeleton process to study.
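Returning to the AR-ARCH criterion of Example 10 with p = 1, the collapsed chain lives on {-1, +1} and the sphere-averaged Lyapunov-type quantity can be estimated directly by simulation. The coefficient values and Gaussian noise below are assumed for illustration:

```python
import numpy as np

def ar_arch_lyapunov(a=(-0.9, 0.4), b=(0.8, 0.6), n=200_000, seed=6):
    """For p = 1 the collapsed chain theta_t lives on {-1, +1}; estimate
    int E(log(|a(theta) + b(theta)e_1| / |theta_1|)) G(dtheta) by an
    ergodic average (coefficients assumed for illustration)."""
    rng = np.random.default_rng(seed)
    theta, total = 1.0, 0.0
    for _ in range(n):
        i = 0 if theta < 0 else 1
        z = a[i] * theta + b[i] * rng.standard_normal()  # |theta| = 1 here
        total += np.log(abs(z))                          # |theta_1| = 1
        theta = 1.0 if z > 0 else -1.0
    return total / n

# A negative estimate indicates geometrically stable drift for {X_t}.
print(ar_arch_lyapunov())
```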
7. The Role of the Noise Distribution Tails. Spieksma and Tweedie (1994) pointed out how, with appropriate assumptions on the error distribution tails, an ordinary drift condition (such as (9) with V(x) = 1 + |x|) can be boosted to ensure geometric ergodicity of the process. We generalize the result as follows.

Theorem 3 Assume {X_t} is an aperiodic, φ-irreducible T-chain in R^p and V : R^p → [1, ∞) is locally bounded. Suppose there exists a random variable W(x) for each x such that V(X_1) ≤ W(x) whenever X_0 = x, { |W(x) - V(x)| + e^{r(W(x) - V(x))} } is uniformly integrable for some r > 0 and

limsup_{||x||→∞} E( W(x) - V(x) ) < 0. (18)

Then there exist s > 0 and V_1(x) = e^{sV(x)} such that {X_t} is V_1-uniformly ergodic (and hence geometrically ergodic).
Proof This follows directly from the drift condition for V-uniform ergodicity (cf. Meyn and Tweedie (1993, Thm. 16.0.1)) and uniform convergence (cf. Cline and Pu (1999a, Lem. 4.2)). (See also the proof of Theorem 4.) □

Essentially, this is the log-drift condition in another guise: if the test function in (10), for example, is replaced with V_1(x) = e^{sV(x)} with some sufficiently small s > 0 then (18) is a log-drift version of the condition. As a bonus, if V(x) is norm-like, satisfying ||x|| ≤ V(x) ≤ M + K||x||, one gets strong laws and central limit theorems for all the sample moments (Meyn and Tweedie (1992), Chan (1993a,b)) and exponentially damping tails in the stationary distribution (Tweedie (1983a,b)).

Example 12. For example, consider the FCAR(p) model discussed in section 3. If the noise term c(e_t; X_{t-1}) is such that sup_x E( e^{r|c(e_1;x)|} ) < ∞ for some r > 0 then it frequently is possible to satisfy the requirements of Theorem 3 with a norm-like V(x). To illustrate how this can work, consider the FCAR(1) process, ξ_t = a_1(ξ_{t-1}) + c(e_t; ξ_{t-1}) with

-L ≤ a_1(x) ≤ a_{11}x + a_{01} if x < -L, L ≥ a_1(x) ≥ a_{12}x + a_{02} if x > L, (19)

where a_{11}a_{12} = 1, a_{11} < 0 and L < ∞. We assume here that E( c(e_1; x) ) = 0 for all x ∈ R and sup_x E( e^{r|c(e_1;x)|} ) < ∞ for some r > 0. For the special case of equality on the right in (19) (the SETAR(1) model of Example 4), Chan et al. (1985) showed {ξ_t} is ergodic if and only if γ = a_{11}a_{02} + a_{01} < 0. We thus assume γ < 0. Let λ_1 = λ_2^{-1} = sqrt(-a_{11}) and choose δ_i > 1, i = 1, 2 so that -λ_1 a_{02} + δ_1 - δ_2 = λ_2 a_{01} + δ_2 - δ_1 = λ_2 γ / 2. Define

V(x) = (λ_1|x| + δ_1)1_{x<0} + (λ_2|x| + δ_2)1_{x≥0}.
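The drift of this piecewise-linear test function can be checked by simulation. A sketch for the SETAR(1) special case, with parameter values and standard normal noise that are our own assumptions:

```python
import numpy as np

rng = np.random.default_rng(7)

# Assumed parameter values: a11 = -2, a12 = -0.5 (so a11*a12 = 1),
# a01 = 0, a02 = 1, giving gamma = a11*a02 + a01 = -2 < 0.
a11, a12, a01, a02 = -2.0, -0.5, 0.0, 1.0
lam1 = np.sqrt(-a11)
lam2 = 1.0 / lam1
gamma = a11 * a02 + a01
d2 = 1.1
d1 = d2 + lam1 * a02 + lam2 * gamma / 2.0  # solves the delta equations

def V(x):
    """Piecewise-linear test function from Example 12."""
    return lam1 * abs(x) + d1 if x < 0 else lam2 * abs(x) + d2

def drift(x, n_sim=100_000):
    """Monte Carlo estimate of E(V(X_1) - V(x) | X_0 = x) for the
    SETAR(1) chain with standard normal noise (an assumption)."""
    e = rng.standard_normal(n_sim)
    x1 = (a01 + a11 * x if x < 0 else a02 + a12 * x) + e
    return np.mean([V(v) for v in x1]) - V(x)

# For large |x| the drift settles near lam2*gamma/2 < 0, so Theorem 3
# applies although the skeleton (a11*a12 = 1) is not geometrically stable.
print(drift(500.0), lam2 * gamma / 2.0)
```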
Then it is a simple computation to show that for some ε > 0 and K < ∞,

V(X_1) - V(x) ≤ (λ_2 1_{x<0} - λ_1 1_{x>0}) c(e_1; x) + λ_2 γ / 2 + K |c(e_1; x)| 1_{|c(e_1;x)| > ε|x|}

when |X_0| = |x| is sufficiently large, which satisfies the conditions of Theorem 3 with the limit in (18) being λ_2 γ / 2. The time series is thus geometrically ergodic. On the other hand its skeleton, while stable, is not geometrically stable since a_{11}a_{12} = 1. In fact we would say both have only a linear drift. Tanikawa (1999) studied this example and Cline and Pu (1999b) looked at similar first order threshold-like models, but with a possible delay. Using a similar approach but with stronger stability conditions, Diebolt and Guegan (1993) studied multivariate examples and An and Chen (1997) investigated FCAR(p) models with p > 1.

One of the drawbacks to a log-drift condition such as the one in Theorem 1(iii) is that it guarantees geometric ergodicity only with test functions of the form V(x) = 1 + λ(x)||x||^r where r may be arbitrarily small and therefore it fails to imply needed limit theorems for sample moments. To be able to conclude V_1-uniform ergodicity with an exponential-like V_1, the condition must again be boosted and then the desired limit theorems will hold.

Theorem 4 Assume {X_t} is an aperiodic, φ-irreducible T-chain in R^p and V : R^p → [1, ∞) is locally bounded and V(x) → ∞ as ||x|| → ∞. Suppose there exists a random variable W(x) for each x such that V(X_1) ≤ W(x) whenever X_0 = x, { |log(W(x)/V(x))| + e^{(W(x))^r - (V(x))^r} } is uniformly integrable for some r > 0 and

limsup_{||x||→∞} E( log(W(x)/V(x)) ) < 0. (20)
Then there exist s > 0 and V_1(x) = e^{(V(x))^s} such that {X_t} is V_1-uniformly ergodic (and hence geometrically ergodic).

Proof For v ≥ w ≥ 1 and 0 < s ≤ r we have (1/s)(e^{w^s - v^s} - 1) ≥ (w^s - v^s)/s ≥ (w^r - v^r)/r and, as s ↓ 0, (1/s)(e^{w^s - v^s} - 1) → log(w/v) ≤ 0. By the uniform integrability of {log(W(x)/V(x))}, truncation and uniform convergence (as s ↓ 0), for ε > 0 and s > 0 small enough,

limsup_{||x||→∞} E( (1/s)(e^{(W(x))^s - (V(x))^s} - 1) 1_{W(x)≤V(x)} | X_0 = x ) ≤ limsup_{||x||→∞} E( log(W(x)/V(x)) 1_{W(x)≤V(x)} | X_0 = x ) + ε. (21)

For w ≥ v ≥ 1 and 0 < s ≤ r/2, we have 0 ≤ log(w/v) ≤ (1/s)(e^{w^s - v^s} - 1) ≤ (1/r)(e^{w^r - v^r} - 1), and if w^r - v^r ≤ K and v ≥ M ≥ 1 then (1/s)(e^{w^s - v^s} - 1) ≤ e^K K / (r M^{r/2}). By the uniform integrability of {e^{(W(x))^r - (V(x))^r}}, truncation and V(x) → ∞ as ||x|| → ∞,

0 ≤ limsup_{||x||→∞} E( (1/s)(e^{(W(x))^s - (V(x))^s} - 1) 1_{W(x)>V(x)} | X_0 = x ) ≤ ε, (22)

for ε > 0 and s > 0 small enough. From (20)–(22), therefore, we conclude there exists s > 0 small enough that

limsup_{||x||→∞} E( (1/s)(e^{(V(X_1))^s - (V(x))^s} - 1) | X_0 = x ) < 0.

Also, sup_{||x||≤M} E( e^{(V(X_1))^s} | X_0 = x ) < ∞ for all M < ∞, and hence geometric ergodicity is assured with test function V_1. □
Example 13. We again consider an FCAR(1) model, ξ_t = a_1(ξ_{t-1}) + c(e_t; ξ_{t-1}), satisfying (19) but now we assume a_{11} < 0 < a_{11}a_{12} < 1 and |c(e_1; x)| ≤ c_1|x|^β|e_1| where c_1 > 0, 0 < β < 1 and E( e^{η|e_1|} ) < ∞ for some η > 0. Let λ_1 = sqrt(-a_{11}), λ_2 = sqrt(-a_{12}) and V(x) = 1 + (λ_1 1_{x<0} + λ_2 1_{x≥0})|x|.
REFERENCES
Brockwell, P.J., Liu, J. and Tweedie, R.L. (1992). On the existence of stationary threshold autoregressive moving-average processes. J. Time Series Anal. 13, 95-107.
Chan, K.-S. (1989). A note on the geometric ergodicity of a Markov chain, Adv. Appl. Probab. 21, 702-704.
Chan, K.-S. (1990). Deterministic stability, stochastic stability, and ergodicity, Appendix 1 in Non-linear Time Series Analysis: A Dynamical System Approach, by H. Tong, Oxford University Press (London).
Chan, K.-S. (1993a). A review of some limit theorems of Markov chains and their applications, Dimensions, Estimation and Models, ed. by H. Tong, World Scientific (Singapore), 108-135.
Chan, K.-S. (1993b). On the central limit theorem for an ergodic Markov chain, Stoch. Proc. Appl. 47, 113-117.
Chan, K.-S., Petruccelli, J.D., Tong, H. and Woolford, S.W. (1985). A multiple threshold AR(1) model, J. Appl. Probab. 22, 267-279.
Chan, K.-S. and Tong, H. (1985). On the use of the deterministic Lyapunov function for the ergodicity of stochastic difference equations, Adv. Appl. Probab. 17, 666-678.
Chan, K.-S. and Tong, H. (1986). On estimating thresholds in autoregressive models, J. Time Series Anal. 7, 179-190.
Chan, K.-S. and Tong, H. (1994). A note on noisy chaos, J. Royal Stat. Soc. 56, 301-311.
Chen, R. and Härdle, W. (1995). Nonparametric time series analysis, a selective review with examples, Bulletin of the International Statistical Institute, 50th session of ISI, August, 1995, Beijing, China.
Chen, R. and Tsay, R.S. (1991). On the ergodicity of TAR(1) processes, Ann. Appl. Probab. 1, 613-634.
Chen, R. and Tsay, R.S. (1993a). Functional-coefficient autoregressive models, J. Amer. Stat. Assoc. 88, 298-308.
Chen, R. and Tsay, R.S. (1993b). Nonlinear additive ARX models, J. Amer. Stat. Assoc. 88, 955-967.
Cline, D.B.H. and Pu, H.H. (1998). Verifying irreducibility and continuity of a nonlinear time series, Stat. & Prob. Letters 40, 139-148.
Cline, D.B.H. and Pu, H.H. (1999a). Geometric ergodicity of nonlinear time series, Stat. Sinica 9, 1103-1118.
Cline, D.B.H. and Pu, H.H. (1999b). Stability of nonlinear AR(1) time series with delay, Stoch. Proc. Appl. 82, 307-333.
Cline, D.B.H. and Pu, H.H. (1999c). Stability of threshold-like ARMA time series, tech. rpt., Statistics Dept., Texas A&M Univ.
Cline, D.B.H. and Pu, H.H. (2001). Geometric transience of nonlinear time series, Stat. Sinica 11, 273-287.
Collomb, G. and Härdle, W. (1986). Strong uniform convergence rates in robust nonparametric time series analysis and prediction: kernel regression estimation from dependent observations, Stoch. Proc. Appl. 23, 77-89.
Diebolt, J. and Guegan, D. (1993). Tail behaviour of the stationary density of general non-linear autoregressive processes of order 1, J. Appl. Probab. 30, 315-329.
Foster, F.G. (1953). On the stochastic matrices associated with certain queueing processes. Ann. Math. Stat. 24, 355-360.
Guegan, D. (1987). Different representations for bilinear models, J. Time Series Anal. 8, 389-408.
Guegan, D. and Diebolt, J. (1994). Probabilistic properties of the β-ARCH model, Stat. Sinica 4, 71-87.
Guo, M. and Petruccelli, J.D. (1991). On the null-recurrence and transience of a first order SETAR model, J. Appl. Probab. 28, 584-592.
Härdle, W., Lütkepohl, H. and Chen, R. (1997). A review of nonparametric time series analysis, Internat. Stat. Review 65, 49-72.
Härdle, W. and Vieu, P. (1992). Kernel regression smoothing of time series, J. Time Series Anal. 13, 209-232.
La Salle, J.P. (1976). The Stability of Dynamical Systems, CBMS 25, Society for Industrial and Applied Mathematics (Philadelphia).
Lim, K.S. (1992). On the stability of a threshold AR(1) without intercepts, J. Time Series Anal. 13, 119-132.
Liu, J. (1992). On stationarity and asymptotic inference of bilinear time series models, Stat. Sinica 2, 479-494.
Liu, J. and Brockwell, P.J. (1988). On the general bilinear time series model, J. Appl. Probab. 25, 553-564.
Liu, J., Li, W.K. and Li, C.W. (1997). On a threshold autoregression with conditional heteroscedastic variances, J. Stat. Plan. Inference 62, 279-300.
Liu, J. and Susko, E. (1992). On strict stationarity and ergodicity of a nonlinear ARMA model, J. Appl. Probab. 29, 363-373.
Lu, Z. (1996). A note on the geometric ergodicity of autoregressive conditional heteroscedasticity (ARCH) model, Stat. Prob. Letters 30, 305-311.
Lu, Z. (1998a). Geometric ergodicity of a general ARCH type model with applications to some typical models, in Advances in Operations Research and Systems Engineering, J. Gu, G. Fan and S. Wang, eds., Global-Link Informatics Ltd., 76-86.
Lu, Z. (1998b). On the geometric ergodicity of a non-linear autoregressive model with an autoregressive conditional heteroscedastic term, Stat. Sinica 8, 1205-1217.
Masry, E. and Tjøstheim, D. (1995). Nonparametric estimation and identification of nonlinear ARCH time series: strong convergence and asymptotic normality, Econometric Theory 11, 258-289.
Meyn, S.P. and Tweedie, R.L. (1992). Stability of Markovian processes I: Criteria for discrete-time chains, Adv. Appl. Probab. 24, 542-574.
Meyn, S.P. and Tweedie, R.L. (1993). Markov Chains and Stochastic Stability, Springer-Verlag (London).
Nummelin, E. (1984). General Irreducible Markov Chains and Non-negative Operators. Cambridge University Press, Cambridge.
Nummelin, E. and Tuominen, P. (1982). Geometric ergodicity of Harris recurrent Markov chains with applications to renewal theory, Stoch. Proc. Appl. 12, 187-202.
Petruccelli, J.D. and Woolford, S.W. (1984). A threshold AR(1) model, J. Appl. Probab. 21, 270-286.
Pham, D.T. (1985). Bilinear Markovian representation and bilinear models, Stoch. Proc. Appl. 20, 295-306.
Pham, D.T. (1986). The mixing property of bilinear and generalised random coefficient autoregressive models, Stoch. Proc. Appl. 23, 291-300.
Pham, D.T. (1993). Bilinear time series models, Dimensions, Estimation and Models, ed. by H. Tong, World Scientific Publishing (Singapore), 191-223.
Popov, N. (1977). Conditions for geometric ergodicity of countable Markov chains, Soviet Math. Dokl. 18, 676-679.
Priestley, M.B. (1980). State-dependent models: A general approach to non-linear time series analysis, J. Time Series Anal. 1, 47-71.
Pu, H.H. and Cline, D.B.H. (2001). Stability of threshold AR-ARCH models, tech. rpt., Dept. Stat., Texas A&M University (forthcoming).
Quinn, B.G. (1982). A note on the existence of strictly stationary solutions to bilinear equations, J. Time Series Anal. 3, 249-252.
Spieksma, F.M. and Tweedie, R.L. (1994). Strengthening ergodicity to geometric ergodicity for Markov chains, Stoch. Models 10, 45-74.
Tanikawa, A. (1999). Geometric ergodicity of nonlinear first order autoregressive models, Stoch. Models 15, 227-245.
Tjøstheim, D. (1990). Non-linear time series and Markov chains, Adv. Appl. Probab. 22, 587-611.
Tjøstheim, D. (1994). Non-linear time series: a selective review, Scand. J. Stat. 21, 97-130.
Tjøstheim, D. and Auestad, B.H. (1994a). Nonparametric identification of nonlinear time series: projections, J. Amer. Stat. Assoc. 89, 1398-1409.
Tjøstheim, D. and Auestad, B.H. (1994b). Nonparametric identification of nonlinear time series: selecting significant lags, J. Amer. Stat. Assoc. 89, 1410-1419.
Tong, H. (1981). A note on a Markov bilinear stochastic process in discrete time, J. Time Series Anal. 2, 279-284.
Tong, H. (1990). Non-linear Time Series Analysis: A Dynamical System Approach, Oxford University Press (London).
Tuominen, P. and Tweedie, R.L. (1979). Markov chains with continuous components, Proc. London Math. Soc. (3) 38, 89-114.
Tweedie, R.L. (1975). Sufficient conditions for ergodicity and recurrence of Markov chains on a general state space. Stoch. Proc. Appl. 3, 385-403.
Tweedie, R.L. (1976). Criteria for classifying general Markov chains. Adv. Appl. Probab. 8, 737-771.
Tweedie, R.L. (1983a). Criteria for rates of convergence of Markov chains with application to queueing and storage theory, in Probability, Statistics and Analysis, London Math. Society Lecture Note Series, ed. by J.F.C. Kingman and G.E.H. Reuter, Cambridge Univ. Press (Cambridge).
Tweedie, R.L. (1983b). The existence of moments for stationary Markov chains, J. Appl. Probab. 20, 191-196.
Institute of Mathematical Statistics LECTURE NOTES — MONOGRAPH SERIES
TESTING NEUTRALITY OF mtDNA USING MULTIGENERATION CYTONUCLEAR DATA

Susmita Datta
Department of Mathematics and Statistics
Georgia State University

Abstract

The neutrality theory of evolutionary genetics assumes that DNA markers distinguishing individuals and species are neutral and have little effect on individual fitness (Kimura, 1983). Under this hypothesis, the action of genetic drift, or genetic drift in combination with mutation or migration, can be used to describe the evolution of most DNA markers. In recent years, scientists have set up experiments to collect cytonuclear data over several generations to test whether the empirical evidence is consistent with this theory. In this paper, we review the existing statistical tests for neutrality based on such data and propose a new test that we believe is vastly superior. The new test arises from likelihood theory after embedding the neutral model in a larger class of selection models, where the selection effect takes place due to a difference in fertility of various gametes. A power study based on Monte Carlo simulation is presented to demonstrate the superior performance of the new test.
1. Introduction
A major debate amongst evolutionary geneticists in recent years is whether most DNA markers distinguishing individuals and species are neutral and have little effect on individual fitness (Kimura, 1983). As a profound application of this theory, DNA sequence differences between extant species have been used to reconstruct the history of life. The classical theoretical development of random genetic drift is built around this assumption. Under this hypothesis, the action of genetic drift, or genetic drift in combination with mutation or migration, can be used to describe the evolution of most DNA markers. The recent attacks on the neutrality theory are twofold. Firstly, it has been pointed out that in some cases non-neutral models can also explain behavior consistent with empirical evidence. For example, Gillespie (1979) showed that his model of selection in a random environment has the same
stationary distribution as the infinite allele neutral model. Therefore, the agreement between observations and that predicted by the infinite allele model noted by Fuerst et al. (1977) can be used with equal strength to support Gillespie's model of natural selection. In another context, Rothman and Templeton (1980) showed that, under some departure from the model assumptions, a neutral model (Watterson, 1977; Ewens, 1972) can yield frequency spectra and homozygosity similar to those expected from heterosis. In addition to the above results, a number of recent experiments suggest apparent non-neutral behaviors of mtDNA markers (Clark and Lyckegaard, 1988; MacRae and Anderson, 1988; Fos et al., 1990; Nigro and Prout, 1990; Pollak, 1991; Arnason, 1991; Kambhampati et al., 1992; Scribner and Avise, 1994a, b; Hutter and Rand, 1995; etc.). Singh and Hale (1990) suggested that the apparent "non-neutral" behavior may also be caused by mating preference and that any attempt to understand the role of selection on mtDNA variants should first begin with simpler conspecific variants rather than with interspecific variants; however see MacRae and Anderson (1990), Jenkins et al. (1996). Multi-locus empirical comparisons have been undertaken by Karl and Avise (1992; also see McDonald, 1996), Berry and Kreitman (1993) and McDonald (1994). In view of these recent experimental developments it is important to test whether the apparent non-neutral behavior of the markers is indeed statistically significant. Consequently it is more important than ever to devise appropriate statistical tests for testing the neutrality of a mtDNA marker. As we will see in Section 3, the existing statistical tests are often too limited to take full advantage of the multi-generation cytonuclear data that are now available. As a result, a new test based on the recent works by Datta (1999, 2001) is proposed.
This test is based on an approximate likelihood for the full available data constructed from a broad parametric selection model and is therefore expected to perform well in practice. The data collection scheme and the underlying model of random drift for genetic evolution are introduced in the next section. This neutral model serves as the null model for the statistical tests which are introduced in Sections 3 and 4. A numerical power study based on Monte Carlo simulation is reported in Section 5. The paper ends with some concluding remarks in Section 6.
2 Data Collection Scheme and the Random Drift Model
In recent kitty-pool experiments, there are two potential sources of variation in cytonuclear frequencies, namely, genetic sampling variation and statistical sampling variation (Weir, 1990). Genetic sampling variation arises
NEUTRALITY OF mtDNA
from genetic drift, the sampling of gametes from a finite breeding pool of individuals in nature to constitute the next generation. Statistical sampling variation arises from sampling individuals from a population and using the genotypic frequencies from the sample in subsequent calculations. In Datta et al. (1996), test statistics based on cytonuclear disequilibria were constructed which can account for both sources of variation. The sampling scheme is described below. Such sampling schemes were introduced by Fisher and Ford (1947) and subsequently considered by Schaffer et al. (1977). Kiparsky (1995) also collected data on the fruit fly Drosophila melanogaster following such a scheme. We feel that these types of sampling schemes will become increasingly important in prospective tests for selection (White et al., 1998) using molecular markers in which a cytoplasmic marker is included as a control. Consider a population propagating through discrete non-overlapping generations. Although this is a simplifying and restrictive assumption, it can be achieved for an experimental population with specially selected species, such as Gambusia and fruit flies. At each generation, a portion of the adult population is collected by simple random sampling and sent for genotyping after they produce the next generation of eggs by random mating. The eggs are then collected and placed in a cage to form the next generation. Thus, in this case, only the sample genotypic relative frequencies are available, and they are therefore subject to an additional source of sampling variation. We let g denote the number of consecutive generations from which samples were drawn. Throughout the rest of the paper, we will simultaneously concentrate on a nuclear site with possible alleles A and a and a cytoplasmic site with possible alleles C and c. The various relative frequencies at the genotypic and the gametic levels are indicated in Tables 1 and 2, respectively.
Note that since the cytoplasmic marker is only maternally inherited, its representation remains the same at both levels. Also, when needed, we will denote the generation number (i.e., time) in parentheses, and the corresponding quantities at the sample level will be indicated by the hat notation.

Table 1  Genotypic frequencies

                Nuclear genotype
  Cytoplasm    AA     Aa     aa     Total
  C            p_1    p_2    p_3    q
  c            p_4    p_5    p_6    1-q
  Total        u      v      w      1
DATTA
Table 2  Gametic frequencies

                Nuclear allele
  Cytoplasm    A      a      Total
  C            e_1    e_3    q
  c            e_2    e_4    1-q
  Total        p      1-p    1
Under the action of genetic drift alone, the evolution of the population through the generations can be modeled by the following Markov chain. Under the RUZ (random union of zygotes) model (Watterson, 1970), the probability of observing an offspring which received gametic types f and m, respectively, from the two parents is e_f e_m. Thus, the probability distribution of the counts x(t+1) = (x_11(t+1), ..., x_44(t+1)) in generation t+1, given the gametic combination counts up to time t, is multinomial and is given by

    x(t+1) | x(t), ..., x(1) ~ Multinomial(N_{t+1}; {e_f(t) e_m(t) : 1 ≤ f, m ≤ 4}),    (1)

where N_{t+1} = Σ_{f,m} x_fm(t+1) is the size of the (t+1)st generation. Finally, note that this in turn determines the distribution of the genotypic and the gametic proportions p(t+1) and e(t+1), since they are just linear combinations of the x(t+1); viz.,

    p_k(t) = {Σ_{f,m} a_fmk x_fm(t)} / N_t    and    e_i(t) = Σ_k β_ik p_k(t).    (2)

The coefficients a and β are given in Tables 3 and 4, respectively. For example, since the mtDNA is only maternally transmitted, the genotype AA/C can be formed by either A/C from the father and A/C from the mother, or by A/c from the father and A/C from the mother, leading to p_1(t) = {x_11(t) + x_21(t)}/N_t.
Table 3  The coefficients a_fmk

  f         m     k=1   k=2   k=3   k=4   k=5   k=6
  1 or 2    1      1     0     0     0     0     0
  1 or 2    2      0     0     0     1     0     0
  1 or 2    3      0     1     0     0     0     0
  1 or 2    4      0     0     0     0     1     0
  3 or 4    1      0     1     0     0     0     0
  3 or 4    2      0     0     0     0     1     0
  3 or 4    3      0     0     1     0     0     0
  3 or 4    4      0     0     0     0     0     1
Table 4  The coefficients β_ik

  i     k=1   k=2   k=3   k=4   k=5   k=6
  1      1    1/2    0     0     0     0
  2      0     0     0     1    1/2    0
  3      0    1/2    1     0     0     0
  4      0     0     0     0    1/2    1
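As an illustration only, the drift update implied by equations (1)-(2) and Tables 3-4 can be sketched in code. The gamete ordering (1 = A/C, 2 = A/c, 3 = a/C, 4 = a/c), the function names and the array layout are our own assumptions, not part of the paper.

```python
import numpy as np

# beta[i, k]: Table 4 coefficients giving e_i = sum_k beta_ik p_k
BETA = np.array([
    [1.0, 0.5, 0.0, 0.0, 0.0, 0.0],   # e_1 = A/C
    [0.0, 0.0, 0.0, 1.0, 0.5, 0.0],   # e_2 = A/c
    [0.0, 0.5, 1.0, 0.0, 0.0, 0.0],   # e_3 = a/C
    [0.0, 0.0, 0.0, 0.0, 0.5, 1.0],   # e_4 = a/c
])

def genotype_index(f, m):
    # Offspring genotype k (0-based) from father's gamete f and mother's
    # gamete m (0-based: 0=A/C, 1=A/c, 2=a/C, 3=a/c); this encodes Table 3.
    cyto_c = m in (1, 3)                       # cytoplasm is maternally inherited
    n_a = int(f in (2, 3)) + int(m in (2, 3))  # number of 'a' alleles
    return 3 * int(cyto_c) + n_a               # 0..2 are C genotypes, 3..5 are c

def drift_generation(x, N_next, rng):
    # One step of the RUZ drift chain: from gametic-combination counts
    # x[f, m] in generation t, draw the 4x4 counts of generation t+1.
    N = x.sum()
    p = np.zeros(6)                    # genotypic proportions p_k(t), eq. (2)
    for f in range(4):
        for m in range(4):
            p[genotype_index(f, m)] += x[f, m]
    p /= N
    e = BETA @ p                       # gametic proportions e_i(t), eq. (2)
    probs = np.outer(e, e)             # RUZ: pr(f from father, m from mother) = e_f e_m
    return rng.multinomial(N_next, probs.ravel()).reshape(4, 4)
```

Since the β columns each sum to one, the cell probabilities e_f e_m automatically sum to one, so the draw in equation (1) is well defined.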
3 Existing Neutrality Tests Based on Multigeneration Data
Here we review two existing tests of neutrality based on multigeneration data. The first one, due to Schaffer et al. (1977), compares the relative frequencies at a single locus with that expected under random drift over the generations. The test due to Datta et al. (1996) takes advantage of simultaneous data collected at a nuclear site and a cytoplasmic site over generations and compares the pathways of association measures, called the cytonuclear disequilibria, with their expected values over time under random drift. Other existing tests for neutrality generally compare various empirical characteristics based on the very last generation data with the corresponding
asymptotic value reached at the equilibrium distribution under random drift. The obvious criticisms of such methods are (i) they don't take full advantage of the multigeneration data and (ii) the theoretical basis is questionable unless the population has been in existence for a long time.
3.1 The Schaffer-Yardley-Anderson tests: This test is a modification of a classical test due to Fisher and Ford (1947). They considered the variance-stabilizing angular transformation (2 sin^{-1} sqrt(relative frequency)) of the proportions at a single locus and compared them with their constant expected value under the action of genetic drift, leading to the asymptotically chi-squared distributed test statistic

    T_1 = (Y - μ̂ 1)ᵀ W^{-1} (Y - μ̂ 1),    where    μ̂ = (1ᵀ W^{-1} Y) / (1ᵀ W^{-1} 1),

Y is the vector of transformed relative gene frequencies, 1 is the vector of ones, and W is the covariance matrix of Y under drift, whose (i,j)-th element accumulates the drift contributions from the population sizes N_k over the generations k ≤ min(i, j); see Schaffer et al. (1977) for its exact form.
In an effort to improve the power properties of their test, Schaffer et al. (1977) also proposed an alternative test which effectively tests for linear trend in the transformed frequencies. Although such a selection model may be hard to justify biologically, this approach does lead to a usable test statistic.
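For concreteness, the quadratic-form computation above can be sketched as follows; the function names are ours, and the drift covariance matrix W must be supplied by the user (its exact form is given in Schaffer et al., 1977).

```python
import numpy as np

def angular(freq):
    # variance-stabilizing angular transformation 2 * arcsin(sqrt(freq))
    return 2.0 * np.arcsin(np.sqrt(freq))

def sya_statistic(y, W):
    # T1 = (y - mu_hat 1)' W^{-1} (y - mu_hat 1), with the constant drift
    # mean estimated by mu_hat = (1' W^{-1} y) / (1' W^{-1} 1)
    Winv = np.linalg.inv(W)
    one = np.ones_like(y)
    mu_hat = (one @ Winv @ y) / (one @ Winv @ one)
    r = y - mu_hat
    return float(r @ Winv @ r)
```

A perfectly constant transformed series gives T_1 = 0; under drift, T_1 is referred to a chi-squared distribution (with one degree of freedom lost to the estimated mean).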
3.2 The disequilibria test due to Datta et al.: The Schaffer et al. tests do not make full use of cytonuclear data because they were constructed for tracking the information at a single locus. Datta et al. (1996) proposed testing the dynamics of the sample cytonuclear disequilibria coefficients (see Arnold 1993) with those expected under a drift model. This resulted in a test statistic of the form
    T_2 = (D - μ)ᵀ Σ̂^{-1} (D - μ),

which has an approximate chi-squared distribution under the null hypothesis of random drift. Here D is the vector of sample cytonuclear disequilibria, μ is its expectation under the neutral model of random drift and Σ̂ is its estimated variance-covariance matrix. The above test is somewhat difficult to implement in the sense that the formulas for Σ̂ are complicated. Moreover, its power properties may be poor due to its omnibus nature.
4 A New Test Based on a Selection Model
In order to take full advantage of the multigeneration cytonuclear data, very recently Datta (2001) considered a fairly broad selection model that includes the neutrality model of random drift as a special case. The selection effect takes place because of a difference in the fertility of the various gametes. We propose the resulting likelihood based score test for testing the neutrality hypothesis. Not only does it arise very naturally, it incorporates the entire cytonuclear information present in the data; furthermore, since it is derived by embedding the null hypothesis of random drift into a fairly rich parametric alternative selection model, it should enjoy reasonable power properties, at least when the selection model holds. A simulation based power comparison study reported in the next section shows that this is indeed the case. Consider the following selection model as an alternative to the random drift model describing the evolution of the population. The probability of observing an offspring which received gametic types f and m, respectively, from the two parents is e_f(w,t) e_m(w,t), where e_i(w,t) = w_i e_i(t) / (Σ_j w_j e_j(t)); here w_i denotes the relative fertility of gametic type i, 1 ≤ i ≤ 4. Therefore, the distribution of the counts x(t+1) given the t-th generation is given by (1) with the product e_f(t) e_m(t) replaced by e_f(w,t) e_m(w,t). Note that when w_i = 1/4, one has the random drift model. Therefore, a test of neutrality can be based on the score statistic s(w_0), with w_0 = (1/4, ..., 1/4). Since s, the derivative of the log-likelihood based on the population gametic relative frequencies, is not computable from the observed data, Datta (2001) suggested using an approximate version of it obtained by replacing them with the corresponding sample versions. This results in additional terms for the variance-covariance matrix, all of which can be consistently estimated from the observed data.
One can show that the approximate log likelihood has a simple closed form expression given by

    l̃(w) = Σ_{t=1}^{g-1} Σ_{k=1}^{6} N_{t+1} p̂_k(t+1) log(L_k(w,t)),

where

    L_1(w,t) = e_1²(w,t) + e_1(w,t) e_2(w,t),
    L_2(w,t) = 2 e_1(w,t) e_3(w,t) + e_2(w,t) e_3(w,t) + e_1(w,t) e_4(w,t),
    L_3(w,t) = e_3²(w,t) + e_3(w,t) e_4(w,t),
    L_4(w,t) = e_2²(w,t) + e_1(w,t) e_2(w,t),
    L_5(w,t) = e_2(w,t) e_3(w,t) + 2 e_2(w,t) e_4(w,t) + e_1(w,t) e_4(w,t),
    L_6(w,t) = e_4²(w,t) + e_3(w,t) e_4(w,t),

and e_i(w,t) = ê_i(t) w_i / (Σ_j ê_j(t) w_j), 1 ≤ i ≤ 4. See Datta (2001) for the details of the algebraic calculations. Of course, the approximate score is defined as s(w) = (d/dw) l̃(w), where we interpret l̃ as a function of w = (w_1, w_2, w_3) (i.e., we replace w_4 by 1 - w_1 - w_2 - w_3). Datta (2001) was able to calculate the estimated asymptotic variance-covariance matrix of s, given by

    Σ̂ = (d/dw) s(w)|_{w=w_0} + C_1ᵀ C_1 + C_2ᵀ C_2 + ... + C_gᵀ C_g,

where

    C_t = [∂ s(w; p̂(1), ..., p̂(g)) / ∂ p̂(t)] [(diag(p̂(t)) - p̂(t) p̂(t)ᵀ) / n_t]^{1/2},    1 ≤ t ≤ g.

Therefore, a test statistic for the neutrality hypothesis is given by T = s(w_0)ᵀ Σ̂^{-1} s(w_0) with w_0 = (1/4, 1/4, 1/4). The neutrality hypothesis would be rejected if T exceeds χ²_{1-α}(3).
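As a sanity check on the expressions for the L_k's, they can be coded directly; this is an illustrative sketch under our own gamete ordering (1 = A/C, 2 = A/c, 3 = a/C, 4 = a/c) and function names. For any fertilities w the six genotype probabilities must sum to one.

```python
import numpy as np

def e_w(w, e):
    # fertility-weighted gametic proportions e_i(w,t) = w_i e_i / sum_j w_j e_j
    we = np.asarray(w) * np.asarray(e)
    return we / we.sum()

def genotype_probs(w, e):
    # offspring genotype probabilities L_1(w,t), ..., L_6(w,t)
    e1, e2, e3, e4 = e_w(w, e)
    return np.array([
        e1**2 + e1*e2,              # L1: AA/C
        2*e1*e3 + e2*e3 + e1*e4,    # L2: Aa/C
        e3**2 + e3*e4,              # L3: aa/C
        e2**2 + e1*e2,              # L4: AA/c
        e2*e3 + 2*e2*e4 + e1*e4,    # L5: Aa/c
        e4**2 + e3*e4,              # L6: aa/c
    ])

def approx_loglik(w, counts, e_hats):
    # l~(w) = sum_t sum_k N_{t+1} p^_k(t+1) log L_k(w,t); counts[t] holds the
    # genotype counts N_{t+1} p^(t+1) and e_hats[t] the sample e^(t)
    return float(sum(c @ np.log(genotype_probs(w, e))
                     for c, e in zip(counts, e_hats)))
```

With equal fertilities (w_i = 1/4) the weighting cancels and the drift model is recovered, matching the role of w_0 in the score test.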
5 Power Studies
We now report the results of a simulation study in which we compare the power of Datta's (2001) test with those of the earlier tests by Schaffer et al. (1977). The experimental setup is as follows. The w are parametrized by a single parameter μ so that w_1 = w_2 = 1/2 - μ and w_3 = w_4 = μ. We simulated 2000 multigeneration samples, each of g = 5 generations. The (constant) population size N_t equaled 1000 and the (constant) sample size n_t equaled 100. The initial population frequencies were given by p_1 = p_3 = p_4 = p_6 = 0, p_2 = 0.5, p_5 = 0.5. The counts x at the population level of successive generations are generated recursively using the multinomial model described in the second paragraph of Section 4. Next, the genotypic and the gametic proportions are obtained by the formulas p_k = Σ_{f,m} a_fmk x_fm / N and e_i = Σ_k β_ik p_k. Finally, at every generation, given the population p_k, the sample p̂_k are generated by multinomial(n; p_1, ..., p_6) sampling. We simulated the powers of three different tests: (i) the omnibus test by Schaffer et al. for the mitochondrial locus, (ii) the linear trend test by Schaffer et al., and (iii) the new approximate score test by Datta described in the previous section. A nominal level of α = 0.05 was used in each case.
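The data-generating loop of this setup can be sketched as follows. This is illustrative only: the function name, seeding and array conventions are ours, with the gamete order 1 = A/C, 2 = A/c, 3 = a/C, 4 = a/c and the Table 4 coefficients hard-coded.

```python
import numpy as np

# Table 4 coefficients: e_i = sum_k beta_ik p_k
BETA = np.array([
    [1.0, 0.5, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.5, 0.0],
    [0.0, 0.5, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.5, 1.0],
])

def simulate_samples(mu, g=5, N=1000, n=100, seed=0):
    # Generate g generations of sample genotype proportions p^(t) under the
    # one-parameter fertility model w1 = w2 = 1/2 - mu, w3 = w4 = mu
    # (mu = 1/4 is the random drift null); initial p2 = p5 = 0.5.
    rng = np.random.default_rng(seed)
    w = np.array([0.5 - mu, 0.5 - mu, mu, mu])
    p = np.array([0.0, 0.5, 0.0, 0.0, 0.5, 0.0])
    samples = []
    for _ in range(g):
        samples.append(rng.multinomial(n, p) / n)  # statistical sampling
        e = BETA @ p                               # gametic proportions
        ew = w * e / (w * e).sum()                 # fertility-weighted gametes
        e1, e2, e3, e4 = ew
        L = np.array([e1**2 + e1*e2, 2*e1*e3 + e2*e3 + e1*e4, e3**2 + e3*e4,
                      e2**2 + e1*e2, e2*e3 + 2*e2*e4 + e1*e4, e4**2 + e3*e4])
        p = rng.multinomial(N, L) / N              # genetic sampling (drift)
    return np.array(samples)
```

Each replicate of the power study would apply the competing test statistics to one such g x 6 array of sample proportions.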
Figure 5.1: Power of 5% tests at selected values of μ

Figure 5.1 describes the findings. To reduce the computational time involved, we computed the power at only four values at and near the null hypothesis value of μ = 0.25. The figure clearly illustrates the superior performance of Datta's approximate score test over the older tests. Whereas the power of this test reaches nearly one for μ = 0.22, the power of the other tests remains flat over the entire range of μ values under consideration.
6 Concluding Remarks
Testing the neutrality of DNA markers is an important problem in evolutionary biology. In recent years, experiments have been designed in a controlled setting where a population with discrete generations can be allowed to propagate while random samples from each generation are collected. Often genotyping is done at two loci, say one nuclear and one cytoplasmic, simultaneously. Utilization of such multigeneration cytonuclear data in a setup where both genetic and statistical variabilities are present is a challenging problem. We present some recent work by Datta (2001) in this direction, where a neutrality test is constructed by correctly identifying an approximate likelihood for such a setup. A numerical comparison of power with earlier tests shows great promise. It will be interesting to investigate the
power properties more extensively covering a broad range of non-neutrality models, possibly going beyond the selection model considered here. Such a study is underway and will be reported elsewhere.
References
Arnold, J., 1993: Cytonuclear disequilibria in hybrid zones. Annu. Rev. Ecol. Syst. 24, 521-554.

Arnason, E., 1991: Perturbation-reperturbation test of selection vs. hitchhiking of the two major alleles of Esterase-5 in Drosophila pseudoobscura. Genetics 129, 145-168.

Berry, A. J. and Kreitman, M., 1993: Molecular analysis of an allozyme cline: alcohol dehydrogenase in Drosophila melanogaster on the east coast of North America. Genetics 134, 869-893.

Clark, A. G. and Lyckegaard, E. M. S., 1988: Natural selection with nuclear and cytoplasmic transmission. III. Joint analysis of segregation and mtDNA in Drosophila melanogaster. Genetics 118, 471-481.

Datta, S., 2001: Estimation of selection parameters using multi-generation cytonuclear data. Biometrical Journal 43, 219-233.

Datta, S., 1999: Hypotheses testing for different selection models using multigeneration cytonuclear data. In Proceedings of the American Statistical Association, Biometrics Section, 157-161, Alexandria, USA.

Datta, S., Kiparsky, M., Rand, D. M. and Arnold, J., 1996: A statistical test of a neutral model using the dynamics of cytonuclear disequilibria. Genetics 144, 1985-1992.

Ewens, W. J., 1972: The sampling theory of selectively neutral alleles. Theor. Popul. Biol. 3, 87-112.

Fisher, R. A. and Ford, E. B., 1947: The spread of a gene in natural conditions in a colony of the moth Panaxia dominula L. Heredity 1, 143-174.

Fos, M., Dominguez, M. A., LaTorre, A. and Moya, A., 1990: Mitochondrial DNA evolution in experimental populations of Drosophila pseudoobscura. Proc. Natl. Acad. Sci. USA 87, 4198-4201.

Fuerst, P. A., Chakraborty, R. and Nei, M., 1977: Statistical studies on protein polymorphism in natural populations. I. Distribution of single locus heterozygosity. Genetics 86, 455-483.
Gillespie, J. H., 1979: Molecular evolution and polymorphism in a random environment. Genetics 74, 175-195.

Hutter, C. M. and Rand, D. M., 1995: Competition between mitochondrial haplotypes in distinct nuclear genetic environments: Drosophila pseudoobscura vs. Drosophila persimilis. Genetics 140, 537-548.

Jenkins, T. M., Babcook, C., Geiser, D. M. and Anderson, W. W., 1996: Cytoplasmic incompatibility and mating preference in Colombian Drosophila pseudoobscura. Genetics 142, 189-194.

Karl, S. A. and Avise, J. C., 1992: Balancing selection at allozyme loci in oysters: implications from nuclear RFLPs. Science 256, 100-102.

Kambhampati, S., Rai, K. S. and Verleye, D. M., 1992: Frequencies of mitochondrial DNA haplotypes in laboratory cage populations of the mosquito Aedes albopictus. Genetics 132, 205-209.

Kimura, M., 1983: The Neutral Theory of Molecular Evolution. Cambridge University Press, New York.

Kiparsky, M., 1995: Cytonuclear genetics of experimental Drosophila melanogaster population. Unpublished Honor's Thesis, Department of Ecology and Evolutionary Biology, Brown University.

MacRae, A. and Anderson, W. W., 1988: Evidence of non-neutrality of mitochondrial DNA haplotypes in Drosophila pseudoobscura. Genetics 120, 485-494.

MacRae, A. and Anderson, W. W., 1990: Can mating preference explain changes in mtDNA haplotype frequency? Genetics 124, 999-1001.

McDonald, J. H., 1994: Detecting natural selection by comparing geographic variation in protein and DNA polymorphisms. In Non-Neutral Evolution, B. Golding, Ed., 88-100, Chapman and Hall, New York.

McDonald, J. H., 1996: Lack of geographic variation in anonymous polymorphisms in the American oyster Crassostrea virginica. Mol. Biol. Evol. 13, 1114-1118.

Nigro, L. and Prout, T., 1990: Is there selection on RFLP differences in mitochondrial DNA? Genetics 125, 551-555.

Pollak, P. E., 1991: Cytoplasmic effects on components of fitness in tobacco hybrids. Evolution 45, 785-790.

Rothman, E. D. and Templeton, A. R., 1980: A class of models of selectively neutral alleles. Theor. Popul. Biol. 18, 135-150.

Schaffer, H. E., Yardley, D. and Anderson, W. W., 1977: Drift or selection: test of gene frequency variation over generations. Genetics 87, 371-379.
Scribner, K. T. and Avise, J. C., 1994: Population cage experiments with a vertebrate: genetics of hybridization in Gambusia fishes. Evolution 48, 155-171.

Singh, R. S. and Hale, L. R., 1990: Are mitochondrial DNA variants selectively non-neutral? Genetics 124, 995-997.

Watterson, G. A., 1970: The effect of linkage in finite random-mating population. Theor. Popul. Biol. 1, 72-87.

Watterson, G. A., 1977: Heterosis or neutrality? Genetics 85, 789-814.

Weir, B. S., 1990: Genetic Data Analysis. Sinauer Associates, Sunderland.

White, T., Marr, K. A. and Bowden, R. A., 1998: Clinical, cellular, and molecular factors that contribute to antifungal drug resistance. Clinical Microbiology Reviews 11, 382-402.
Institute of Mathematical Statistics LECTURE NOTES — MONOGRAPH SERIES
INFERENCE ON RANDOM COEFFICIENT MODELS FOR HAPLOTYPE EFFECTS IN DYNAMIC MUTATIONS USING MCMC

Richard M. Huggins¹, Guoqi Qian¹ and Danuta Z. Loesch²

¹Department of Statistical Sciences, La Trobe University, Bundoora, 3083, Australia.
²School of Psychological Science, La Trobe University, Bundoora, 3083, Australia.

Abstract

Modern genetic research involves complex stochastic processes and difficult inference problems, and research in this area is necessarily a collaboration with geneticists and biologists. The major difficulty is in defining stochastic processes which are biologically meaningful yet amenable to analysis. To illustrate this we examine a class of random effects models for dynamic mutations. Dynamic mutations characterize several inherited disorders in humans. In these disorders a mutated segment of the gene typically increases in size as it is transmitted from generation to generation until the gene fails. Biological interest is in the effect of various genetic markers on the rate of expansion of the mutated segment and in the extent to which these markers describe alternative pathways. We concentrate on the widely studied fragile X disorder as there are data sets available for statistical analysis. We use hierarchical Bayes models fitted via MCMC methods to examine some data and hence determine a class of random coefficient, branching, time series models which have applications in genetical research.
1 Introduction
The realistic modelling of the stochastic processes occurring in biology quickly leads to analytically intractable models. However, these models may be of enormous practical importance. Here we examine a model for the recently discovered phenomenon of dynamic mutations, which lead to a number of genetic disorders. In these disorders, rather than an on-off allele being transmitted from a parent to their offspring, the mutation itself changes upon transmission. The mutation arises through expansion at a region of the gene, i.e. the insertion of extra genetic material in the form of trinucleotide
HUGGINS, QIAN AND LOESCH
repeats at that site. This induces instability at that site with the size of the expansion usually increasing from generation to generation. Ultimately the expansion becomes large enough to affect the functioning of the gene resulting in phenotypic expression of the disorder. A feature of these dynamic mutations is that they are clinically undetectable until the expansion crosses a threshold and phenotypically abnormal individuals arise. The fragile X Syndrome is the most widely studied dynamic mutation. It is a common X-linked genetic disorder associated with intellectual disability. The fragile X syndrome arises as the result of progressive CGG trinucleotide repeat expansion in the FMR1 gene, which culminates in the failure of gene expression. The mechanism for the expansion of the repeat size has not been determined. The fragile X syndrome is classified according to the number of CGG repeats. The premutation (55-200 repeats) and full mutation (200+ repeats) categories are unstable and some instability has also been reported within grey zone (35-55 repeats) alleles. Previously the rate of expansion of CGG repeats on transmission has usually been modelled as a function of the size of CGG repeat sequence on the maternal X chromosome. However, some molecular characteristics of the gene other than the size of the CGG repeat in this chromosome have recently been shown to affect the rate of expansion. These genetic covariates are certain combinations of microsatellite markers (haplotypes) flanking the repeat sequence, and the number and position of the AGG triplets which normally interrupt CGG repeats at intervals of 9-10 units. Eichler et al. (1996) have recently hypothesised that at least two different mutational pathways leading to the fragile X syndrome related to those genetic covariates are in action in the population. 
However, the data obtained by Dana et al (2000) from an Afro-American population showed that this hypothesis should be viewed from a broader evolutionary perspective. Several mathematical models for the transmission of fragile X have been proposed. These include comprehensive transmission models (Ashley and Sherman 1995, Morton & Macpherson 1992, Morris et al 1995), iterated branching models (Gawel and Kimmel 1996) and Bernoulli counting process models (Bat et al 1997). The model of Ashley and Sherman (1995) accounts for the dynamics of fragile X mutation by assuming the multi-step expansion of CGG repeats (meiotic expansion in either sex, and mitotic expansion restricted to somatic alleles of maternal origin). They also included population assumptions, such as selection against full mutation, as well as molecular mechanisms, such as the loss of AGG interspersions, which in their model constitutes the initial mutation in the FMR1 gene. These previous models were biologically meaningful but were difficult to analyse statistically and tended to become outdated as more data on molecular characteristics of the gene came to hand.
RANDOM COEFFICIENT MODELS
Huggins, Loesch & Sherman (1998) introduced a multi-step non-linear time series model for the transmission of dynamic mutations which could be analysed using standard statistical methods. They used a random intercepts model to examine the effects of haplotypes or genetic markers on the transmission of the mutation. Here we extend that model in several directions, with an emphasis on the effect of the genetic markers on the rate of transmission. Firstly, we use cubic splines to model the relationship between the length of the repeat sequence in offspring X chromosomes and that in the parent X chromosome. Secondly, we allow the rate of transmission to be random, rather than just the intercept. Available data consist of observations on several generations of affected families as, due to the lengths of human generations, it is not possible to collect DNA to construct longer chains. However, one purpose of our modelling is to be able to model distantly related affected individuals who are known to have a common ancestor. We illustrate the inference method using published data of Murray et al (1997), who report a variety of haplotypes which may be related to the rate of expansion in fragile X. The biological meaning of our results will be discussed elsewhere, as will formal tests of hypotheses.
2 Notation
We label the observed haplotypes by h = 1, 2, ..., H. We suppose there are N_h families with haplotype h, labelled hf, f = 1, ..., N_h. Let n_hf denote the number of parents that produce observed offspring in family hf, labelled hfk, k = 1, ..., n_hf. Also, let w_hfk be the number of offspring of individual hfk. The offspring of individual hfk are denoted by hfkl, l = 1, ..., w_hfk. Note that individuals may be labelled both as parents and as offspring of their parents. This notation aids in constructing the conditional densities required below. In this paper, inference is conducted conditional on the initiating individual in each family, the observed family structure and the observed haplotypes. Further, the probands, or the individuals initially detected with the disorder in each family, have been omitted. This was done in order to reduce the ascertainment bias, as the probands were usually detected by their having an extreme phenotype and hence genotype. Let Y be the vector of observations of the CGG triplet repeat lengths on all offspring. Namely, Y = {Y_hfkl : (h, f, k, l) ∈ S}, where S is defined by S = {(h, f, k, l) : h = 1, 2, ..., H; f|h = 1, 2, ..., N_h; k|hf = 1, 2, ..., n_hf; l|hfk = 1, 2, ..., w_hfk}. Here f|h is understood as the value of f given h.
3 The Hierarchical Model
Our full model for the length of the CGG triplet repeat sequence in offspring l of individual hfk is of the general form

    g(Y_hfkl) = X_hfk {β + (β_h + β_f^{(h)}) I_F(hfk)} + δ_F I_F(hfk) ζ_hfkl + σ ε_hfkl,    (3.1)

h = 1, ..., H, f|h = 1, ..., N_h, k|hf = 1, ..., n_hf, l|hfk = 1, ..., w_hfk, where the ζ_hfkl's and ε_hfkl's are independent normal random variables with 0 means and variance 1, X_hfk is a design matrix and I_F(hfk) takes the value 1 if individual hfk is female and zero otherwise. The response g(Y_hfkl) is taken to be log log Y_hfkl in this paper, as this gave errors that appeared normally distributed, although other choices are possible. We suppose β_h follows a common probability distribution for all individuals with haplotype h and β_f^{(h)} follows a common distribution for all individuals in family f within haplotype h. The introduction of the sex effect allows us to mimic the Ashley-Sherman (1995) model and the subsequent model of Huggins et al (1998), which were two-step models with an initial step common to transmissions from males and females, and a second step that only occurs in transmissions from females. This is difficult to model directly in a linear framework, so instead we allow different rates in transmissions from males and females. Available data (see the plots of Huggins & Loesch 1998) suggest there is little difference between the haplotypes or families in transmissions from males, and a single vector β is used to model this transition. The conditional mean X_hfk {β + (β_h + β_f^{(h)}) I_F(hfk)} of g(Y_hfkl), as a function of the parameters and log Y_hfk, was modelled using cubic splines (Dierckx 1993). The corresponding B-splines are contained in X_hfk. As the present work is exploratory, ten degrees of freedom were used in the cubic splines to allow flexible modelling. This gave similar results to models with different degrees of freedom (6, 8 and 14). Let m denote the number of columns of X_hfk. The main interest is in the posterior distributions of β_h and β_f^{(h)}.
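Purely to fix ideas, one transmission under model (3.1) can be sketched as follows; we use a two-column design row in place of the B-spline basis, and all names and parameter values are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def transmit(x_row, beta, beta_h, beta_f, female, delta_F, sigma, rng):
    # One offspring response g(Y_hfkl) under model (3.1): the coefficient
    # vector is beta plus the haplotype and family effects for females only,
    # and females receive the extra noise term delta_F * zeta.
    coef = beta + (beta_h + beta_f) * female
    mean = x_row @ coef
    return float(mean + delta_F * female * rng.normal() + sigma * rng.normal())
```

Since g(Y) = log log Y here, a simulated repeat length would be recovered as Y = exp(exp(g)).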
We will use the following prior distributions for the parameters β, β_h, β_f^{(h)}, σ² and δ_F²:

1. β ~ b_1(β | β_0, Σ_0) = MVN(β_0, Σ_0),
2. β_h ~ b_2(β_h | Σ_h0) = MVN(0, Σ_h0),
3. β_f^{(h)} ~ b_3(β_f^{(h)} | Σ_f0) = MVN(0, Σ_f0),
4. σ² ~ b_4(σ² | ν, λ) = IGa(ν/2, νλ/2),
5. δ_F² ~ b_5(δ_F² | ν_F, λ_F) = IGa(ν_F/2, ν_F λ_F/2),

h = 1, ..., H, f|h = 1, ..., N_h, which are all assumed to be independent. Here MVN denotes the multivariate normal distribution and IGa the inverse Gamma distribution (e.g. Robert 1994, p. 153). The values of ν, λ, ν_F and λ_F do not have any significant effect on the analysis, so they will be specified in the study (see Appendix A). The hyperprior distributions for the hyper-parameters in the above prior distributions are taken to be:

1. β_0 ~ b_0(β_0 | b, Σ) = MVN(b, Σ),
2. Σ_0 ~ W^{-1}(ξ_0 + m + 1, ξ_0 R_0), ξ_0 > m,
3. Σ_h0 ~ W^{-1}(ξ_h0 + m + 1, ξ_h0 R_h0), ξ_h0 > m,
4. Σ_f0 ~ W^{-1}(ξ_f0 + m + 1, ξ_f0 R_f0), ξ_f0 > m,

which are also assumed to be independent. Here W^{-1}(ξ. + m + 1, ξ. R.) denotes an inverted Wishart distribution for an m × m random matrix Σ. with degrees of freedom ξ. + m + 1 and positive definite scale matrix ξ. R.. It can be shown that E(Σ.^{-1}) = R.^{-1} and E(Σ.) = ξ. R. / (ξ. - m - 1) if these exist (Muirhead 1981, p. 113). The values of b, Σ, ξ_0, R_0, ξ_h0, R_h0, ξ_f0 and R_f0 will be specified in Appendix A. We denote the last three densities by w_0(Σ_0 | ξ_0, R_0), w_h0(Σ_h0 | ξ_h0, R_h0) and w_f0(Σ_f0 | ξ_f0, R_f0), respectively. In order to simplify the presentation, we use Θ and γ respectively to denote the parameters and hyper-parameters in the hierarchical model defined here. Namely, Θ = (β, β_1, ..., β_H, β_1^{(1)}, ..., β_{N_H}^{(H)}, σ², δ_F²)ᵗ and γ = (β_0, Σ_0, Σ_h0, Σ_f0). Note that Θ is a vector of dimension (1 + H + Σ_{h=1}^{H} N_h) m + 2 and γ is an m × (1 + 3m) matrix. The equation (3.1) together with the prior distribution of Θ and the hyperprior distribution of γ comprise a hierarchical random effects model for the CGG triplet repeat sequence. In order to make inferences about Θ and γ, and in particular β_h and β_f^{(h)}, from the data, we need the posterior distributions of β_h and β_f^{(h)}. Because of their mathematical complexity we will not be able to make inferences directly from these posterior distributions. Rather, we will first generate a sample from these distributions and then make inferences from the sample. The Markov chain Monte Carlo (MCMC) technique has been shown to be powerful in simulating a sample from probability distributions which are analytically intractable. Recall that the basic idea of MCMC is to use a relatively simple transition probability kernel to generate a Markov chain in such a way that its invariant distribution is that from which we want to generate a sample. After a sufficient number of generations of the chain have been simulated, subsequent generations can be regarded as a sample from the target distribution. The two most basic algorithms used in MCMC methods are Gibbs sampling and the Metropolis-Hastings
algorithm. We use a mixture of these two algorithms, called the MCMC block-at-a-time algorithm (Chib and Greenberg, 1995). Several methods are available (Robert 1998) for monitoring the convergence to stationarity of the resulting Markov chain. We employ the Gelman-Rubin statistic (Gelman and Rubin 1992), which compares the within- and between-chain variations of the simulated multiple Markov chains. When the Gelman-Rubin statistic becomes very close to 1 (usually less than 1.2 or 1.1 is enough in practice), the chains can be regarded as having achieved convergence. Once a sample has been generated from the joint posterior distribution, it can be examined to detect the random effects of haplotypes and families. The sample can also be used to simulate the posterior predictive distribution for the response variable CGG repeats which, after comparison with the true CGG observations, allows determination of the goodness of fit of the model. The various joint and conditional densities and the MCMC procedure used to fit the model are described in the appendices.
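For a scalar parameter, the monitoring statistic can be sketched as follows; this is a common form of the potential scale reduction factor, coded by us for illustration rather than taken from the authors' software.

```python
import numpy as np

def gelman_rubin(chains):
    # chains: m x n array holding m parallel chains of length n for one parameter
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)            # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()      # within-chain variance
    var_hat = (n - 1) / n * W + B / n          # pooled posterior variance estimate
    return float(np.sqrt(var_hat / W))         # values near 1 indicate convergence
```

Chains started from overdispersed points that have not yet mixed inflate the between-chain term B, pushing the statistic well above the 1.1-1.2 rule of thumb quoted above.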
4 Results
The data of Murray et al (1997) contain observations on CGG repeat sequences for 124 individuals and their parents after the probands were omitted. The remaining 124 individuals are the offspring of 86 observed parents who come from 57 different families, nested in 18 different haplotypes. In addition, 10 of the 86 observed parents are fathers, who between them have 18 offspring. When applying the hierarchical model of Section 3, the parameter vector Θ contains 762 components and the hyper-parameter γ is a 10 × 31 matrix. For each of these parameters we generated 3 Markov chains of length 10,000, from which the posterior samples were formed. We first checked the convergence of the simulated Markov chains. It was found that for the current model at most 2400 transitions would be sufficient for the simulated Markov chain to be stationary. In the simulations the slowest convergence occurred at the Markov chain of Θ(88) = β_8(8) (component 8 of the haplotype 8 effect) among all parameters and at γ(6, 17) = Σ_h0[6, 6] (the 6th diagonal element of Σ_h0) among all hyper-parameters. Figure 1 displays the sub-chain of a simulated Markov chain (obtained by taking every 50th value) for each of β_8(8) and Σ_h0(6, 6), and their Gelman-Rubin statistic value sequences based on the 3 parallel chains. To form a posterior sample for each parameter or hyper-parameter, we discard the first half of each of the 3 generated chains and combine the rest into one sequence; we then take every 50th value of this sequence, giving a sequence of length 300. Shown in Figure 2 are histogram densities of the posterior samples for the 10 components of the haplotype
RANDOM COEFFICIENT MODELS
191
Figure 1: Row 1 gives Markov chains generated for β_8(8) and Σ_h0(6,6). Row 2 gives the corresponding Gelman-Rubin statistic values based on the three parallel chains.
8 effect β_8 (the 81st-90th elements of Θ, i.e. Θ[81:90]) and the 10 components of the family effects β_1^(2) = Θ[261:270] (effects of the first family in the second haplotype, which are the 261st-270th components of Θ). These histograms are also typical of other haplotype and family effects. The haplotype effects can be analysed by examining whether or not there is any heterogeneity among the marginal posterior distributions of {β_h(i), h = 1, …, 18} (i = 1, …, 10). Shown in the first column of Figure 3 are the Xbar-chart and S-chart with 95% confidence level for posterior samples of {β_1(8), …, β_18(8)}, and the histogram density and the QQ-plot for the corresponding posterior means. The other two columns of Figure 3 are for {β_1(6), …, β_18(6)} and {β_1(3), …, β_18(3)}. Figure 3 gives evidence of haplotype effects in the expansion process. The family effects given the haplotype effects can be similarly analysed based on the marginal posterior distributions of β_f(i) (i = 1, …, 10). The three columns in Figure 4 give the results of these effects from the posterior samples of the β_f(8)'s, β_f(4)'s and β_f(1)'s respectively. Again family effects are evident in Figure 4. The goodness of fit of the model can be assessed by comparing the response observations g(Y) to their posterior predictive distributions (Gelman et al 1995, sec. 6.3). For each response observation g(Y_hfkl) a sample of size 300, regarded as the replicates of g(Y_hfkl) which could have been observed,
HUGGINS, QIAN AND LOESCH

Figure 2: Histograms for all the 20 components of β_8 and β_1^(2).

Figure 3: Effects of haplotype factors evaluated at components 8, 6 and 3.

Figure 4: Effects of family factors evaluated at components 8, 4 and 1.
was generated from its posterior predictive distribution. Then the Bayes p-value was calculated for each sample, which is the proportion of the replicates that are greater than g(Y_hfkl). The histogram density for the replicates of loglog Y_9, which is typical for all 124 response observations, is given in Figure 5. A histogram of the 124 Bayes p-values is also given in Figure 5. Since most of the Bayes p-values are very close to 0.5 (all but one are between 0.4 and 0.563), the response observations are typical under the posterior predictive distributions. In addition, we provide in Figure 5 the plots of loglog Y^new|Y, the sample means of the posterior predictive distributions, against loglog Y, and of E(loglog Y | Θ^pos), the fitted response values using the posterior means of the distributions of the β's, against loglog Y. These plots show the reasonable fit of the proposed model to the data, which is also supported by the values of the goodness of fit statistic computed from (4.1) below. It was found that ||loglog Y − loglog Y^new|Y||² = 14.10 and ||loglog Y − E(loglog Y | Θ^pos)||² = 7.26. However, the histogram density in Figure 5 shows that the posterior predictive distribution of Y has a quite wide range and a large variance, particularly considering the nature of the loglog(·) transformation of the response Y. This implies that the proposed model has a large variation in its predictability, although prediction is not the main interest of this paper. One could possibly obtain better predictions if another transformation g(Y) or a discrete model such as a Poisson distribution for Y were used. Finally we propose to evaluate the goodness of fit of the model by the
summary(bayesPIO) Min. 1st Qu. Median Mean 3rd Qu. Max. 0.317 0.477 0.497 0.497 0.517 0.563
[Figure 5 panels: histogram density of log(log(Y_9)) replicates; histogram of the Bayes p-values; predicted and fitted values plotted against log(log(Y)), with observed fathers and mothers marked separately.]
Figure 5: Model assessment based on posterior predictive distributions.
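The Bayes p-values shown in Figure 5 can be computed with a short sketch (illustrative code, not the authors' implementation): each observation is compared with its posterior predictive replicates.

```python
def bayes_p_value(replicates, observed):
    """Proportion of posterior predictive replicates exceeding the
    observed value.

    Values near 0.5 mean the observation is typical under the posterior
    predictive distribution; values near 0 or 1 flag lack of fit.
    """
    return sum(1 for r in replicates if r > observed) / len(replicates)
```

With 300 replicates per observation, as in the paper, the 124 resulting p-values can be summarized or plotted as a histogram.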
following statistic

T = (1/N) ∑_{i=1}^{N} ∑_{hfkl} {Y_{hfkl} − E(Y_{hfkl} | Θ_i)}² / Var(Y_{hfkl} | Θ_i),   (4.1)
where Θ_1, …, Θ_N is a posterior sample of Θ. Although no rigorous justification is available, it seems reasonable to approximate the distribution of T by a χ²_{124} distribution. Based on the posterior Θ sample generated in our study we found T = 87.26, which corresponds to the 0.5-percentile of a χ²_{124} and thus indicates that the proposed model gives an adequate fit.
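The statistic (4.1) averages, over posterior draws, the standardized squared deviations of the observations. A hedged sketch (argument names are illustrative, and the conditional means and variances are assumed to be available for each draw):

```python
def chi_square_discrepancy(y_obs, post_means, post_vars):
    """Posterior-averaged chi-square discrepancy in the spirit of (4.1).

    y_obs      : observed responses, one per index hfkl
    post_means : per posterior draw i, conditional means E(Y | Theta_i)
    post_vars  : per posterior draw i, conditional variances Var(Y | Theta_i)
    """
    n_draws = len(post_means)
    total = 0.0
    for means_i, vars_i in zip(post_means, post_vars):
        # sum of standardized squared deviations for this draw
        total += sum((y - m) ** 2 / v
                     for y, m, v in zip(y_obs, means_i, vars_i))
    return total / n_draws
```

The resulting value can then be compared against the quantiles of a chi-square distribution with one degree of freedom per observation, as done informally in the paper.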
5 Discussion
Many of the advances in inference for stochastic processes over the past few decades have arisen from branching processes and time series models. A population model for the transmission of dynamic mutations combines time series models with branching processes, for, as well as (3.1) holding, the number of offspring W_hfk of individual hfk has a distribution that depends on both the sex of individual hfk and the value of Y_hfk. The outstanding problem is to find conditions under which the population model has a stationary distribution. We have concentrated on developing and fitting time series type models for the transmission of a dynamic mutation given the observed haplotypes
and family structures. The model can be viewed as an extension of the simpler bifurcating autoregressive models for cell lineages. The work on cell lineage studies starting with Cowan (1984), Cowan & Staudte (1986) and others considers the bifurcating autoregressive process for cell lineage studies. Cell lineage trees arise from the binary splitting of cells and the processes of interest are time series, typically low order autoregressive models, down lines of descent, with correlated errors for sister cells and, in later models (Huggins & Basawa 1999), correlations between individuals in the same generation. Bui & Huggins (1999) considered random effects models for the bifurcating autoregressive(1) model. The processes of interest here have a random number of offspring rather than just the two of the bifurcating autoregressive process, and the models down each line of descent are more sophisticated than the ARMA type processes. Moreover, individuals with a given identified haplotype may differ at other gene sites, which naturally gives rise to the random effects models. Hence we take the transition rates to be random: they are common for individuals in the same family and there is communality between unrelated individuals with the same haplotype. Markov models are typically used to model the transmission of a characteristic from parent to offspring. That is, the genetic make-up of the offspring depends only on that of the parents. The indiscriminate use of such Markov models may be unrealistic in the branching situations that occur in family studies. For example, in related work on cell lineage studies, Staudte et al (1996) determined that the correlation between cousins in the data examined by them was larger than what was predicted by AR(1) type Markov models. In Markovian AR(1) models for cell lineages, correlations drop off as powers of the mother-daughter correlation θ.
That is, if the mother-daughter correlation is θ, the sister-sister correlation is θ² (plus environmental effects), the cousin-cousin correlation is θ⁴, etc. Huggins & Basawa (1999) proposed several models which allow higher correlations between cousins and other more distant relatives, including an AR(2) type model as well as random effects models of generation effects. This situation may also occur in dynamic mutations unless genetic covariates that affect the expansion rates are taken into account. This motivated our hierarchical random effects model. The models examined here are based on transmissions over one or two generations and may not reflect the full complexity of the dynamic mutations. Moreover, the processes are sparsely observed and, as the process is unobserved in its initial stages, there is little information concerning the mutation rate currently available. The nested random effects, which depend on haplotypes that are themselves evolving according to some stochastic process, add another level of complexity. However, the haplotypes occur on a portion of the gene which does not appear to code for a protein, hence they should not be exposed to selection, so that their evolution should be dependent only on random mutations and rare recombinations and be simpler than that of the FMR1 gene itself. Future work will involve the collection and modelling of data on distantly related affected individuals, where the evolution of the haplotypes will need to be considered. The strength of the relationship between individuals will be determined either through family trees or through genetic markers with known mutation rates. It is not necessary to use MCMC to estimate random effects, although it was convenient in the present example. For example, Park & Basawa (2000) have developed an optimal estimating function approach. However, they have not yet fully developed the inferential procedures. The use of cubic splines has advantages in that the models are linear and may to some extent be regarded as non-parametric. Moreover, it was not necessary to introduce a threshold as in Huggins et al (1998). However, the non-linear models of Huggins et al (1998) did have some biological advantages in that they more naturally modelled the expansions in the meiotic and mitotic phases. In the non-linear case it was possible to let the parameters in the model depend on the size of the parent's repeat sequence, which is more difficult in linear models. The increasing interest in molecular genetics will result in new and complex stochastic processes. However, as in cell lineage studies and the example considered here, these new processes are combinations of familiar simpler stochastic processes with perhaps extended error structures. Nevertheless, it appears certain that many new problems in statistical inference and applied probability will arise in this area. Acknowledgements This research was supported by NICHD grant # HD36071 and an Australian Research Council small grant.
References
[1] Ashley, E.A. & Sherman, S.L. (1995) Population dynamics of a meiotic/mitotic expansion model for the fragile X syndrome. Am. J. Hum. Genet. 57: 1414-1425.
[2] Bui, Q.M. & Huggins, R.M. (1999) The random coefficient bifurcating autoregression model in cell lineage studies. J. Stat. Plan. & Inf. 81: 253-262.
[3] Bat, O., Kimmel, M. & Axelrod, D.E. (1997) Computer simulation of expansions of DNA triplet repeats in the fragile X syndrome and Huntington's disease. J. Theor. Biol. 188: 53-67.
[4] Chib, S. and Greenberg, E. (1995) Understanding the Metropolis-Hastings algorithm. American Statistician 49(4): 327-335.
[5] Cowan, R. (1984). Statistical concepts in the analysis of cell lineage data. Proceedings of the 1983 Workshop on Cell Growth and Division, 18-22, LaTrobe University.
[6] Cowan, R. and Staudte, R.G. (1986). The bifurcating autoregression model in cell lineage studies. Biometrics 42: 769-783.
[7] Crawford, D.C., Schwartz, C.E., Meadows, K.L., Newman, J.L., Taft, L.F., Gunter, C., Brown, W.T., Carpenter, N.J., Howard-Peebles, P.N., Monaghan, K.G., Nolin, S.L., Reiss, A.L., Feldman, G.L., Rohlfs, E.M., Warren, S.T. and Sherman, S.L. (2000) Survey of the fragile X syndrome CGG repeat and the short-tandem-repeat and single-nucleotide-polymorphism haplotypes in an African American population. Am. J. Hum. Genet. 66: 480-493.
[8] Dierckx, P. (1993). Curve and Surface Fitting with Splines. Clarendon: New York.
[9] Eichler, E.E., Macpherson, J.N., Murray, A., Jacobs, P.A., Chakravarti, A. and Nelson, D.L. (1996) Haplotype and interspersion analysis of the FMR1 CGG repeat identifies two different mutational pathways for the origin of the fragile X syndrome. Human Molecular Genetics 5: 319-330.
[10] Gawel, B. & Kimmel, M. (1996) The iterated Galton-Watson process. J. Appl. Prob. 33: 949-959.
[11] Gelman, A. and Rubin, D.B. (1992). Inference from iterative simulation using multiple sequences (with discussion). Statistical Science 7: 457-511.
[12] Gelman, A., Carlin, J.B., Stern, H.S. and Rubin, D.B. (1995). Bayesian Data Analysis. Chapman & Hall: London.
[13] Huggins, R.M. & Loesch, D.Z. (1998) On the analysis of mixed longitudinal growth data. Biometrics.
[14] Huggins, R.M., Loesch, D.Z. & Sherman, S.L. (1998) A branching non-linear autoregressive model for the transmission of the fragile X dynamic repeat mutation. Ann. Hum. Genet. 62: 337-347.
[15] Huggins, R.M. & Basawa, I.V. (1999) Extensions of the bifurcating autoregressive model for cell lineage studies. J. Appl. Prob. 36: 1225-1233.
[16] Morris, A., Morton, N.E., Collins, A., Macpherson, J., Nelson, D. and Sherman, S. (1995) An n-allele model for progressive amplification in the FMR1 locus. Proc. Natl. Acad. Sci. USA 92: 4833-4837.
[17] Morton, N.E. and Macpherson, J.N. (1992) Population genetics of the fragile X syndrome: multiallelic model for the FMR1 locus. Proc. Natl. Acad. Sci. USA 89: 4215-4217.
[18] Muirhead, R.J. (1981). Aspects of Multivariate Statistical Theory. John Wiley: New York.
[19] Murray, M.A., Macpherson, J.N., Pound, M.C., Sharrock, A., Youings, S.A., Dennis, N.R., McKechnie, N., Lineham, P., Morton, N.E. and Jacobs, P.A. (1997). The role of size, sequence and haplotype in the stability of FRAXA and FRAXE alleles during transmission. Hum. Mol. Genet. 6: 173-184.
[20] Park, Jeong-gun & Basawa, I.V. (2000) Optimal estimating equations for mixed effects models with dependent observations. Technical report 2000-16, Department of Statistics, University of Georgia.
[21] Robert, C.P. (1994) The Bayesian Choice. Springer-Verlag: New York.
[22] Robert, C.P. (1998). Discretization and MCMC Convergence Assessment. Springer-Verlag: New York.
[23] Staudte, R.G., Zhang, J., Huggins, R.M. and Cowan, R. (1996). A reexamination of the cell lineage data of E. O. Powell. Biometrics 52: 1214-1222.
Appendices

In the appendices we derive the joint and conditional densities required to estimate the model parameters using MCMC, and then give the MCMC algorithm.
A Joint Densities
Using the notation introduced in the paper, it follows that the joint conditional density of g(Y) given Θ, γ and the observed ancestor of each family is

p{g(Y) | Θ, γ} = p{g(Y) | Θ} = ∏_{h=1}^{H} ∏_{f=1}^{N_h} ∏_{k=1}^{n_{hf}} ∏_{l=1}^{f_{hfk}} p{g(Y_{hfkl}) | β, β_h, β_f^{(h)}, σ², δ_F²}

∝ exp[ − ∑_{h,f,k,l} {g(Y_{hfkl}) − X_{hfkl}(β + β_h + β_f^{(h)})}² / {2(δ_F² I_F(hfk) + σ²)} ].   (A.1)
The joint prior density of the parameter Θ given γ is

π(Θ | γ) ∝ ∏_{h=1}^{H} |Σ_{h0}|^{−1/2} exp{−(1/2) tr Σ_{h0}^{−1} β_h β_h'} × ∏_{h=1}^{H} ∏_{f=1}^{N_h} |Σ_{f0}|^{−1/2} exp{−(1/2) tr Σ_{f0}^{−1} β_f^{(h)} β_f^{(h)'}} × |Σ_0|^{−1/2} exp{−(1/2) tr Σ_0^{−1} (β − β_0)(β − β_0)'}.
(A.2)

The joint prior density of the hyper-parameter γ is

π(γ) = ω_{f0}(Σ_{f0} | ξ_{f0}, R_{f0}) ω_{h0}(Σ_{h0} | ξ_{h0}, R_{h0}) ω_0(Σ_0 | ξ_0, R_0) · b_0(β_0 | b, Σ)

= ω_{f0}(Σ_{f0} | ξ_{f0}, R_{f0}) ω_{h0}(Σ_{h0} | ξ_{h0}, R_{h0}) ω_0(Σ_0 | ξ_0, R_0) × (2π)^{−p/2} |Σ|^{−1/2} exp{−(1/2)(β_0 − b)' Σ^{−1} (β_0 − b)},   (A.3)
where Γ_m(·) is the multivariate gamma function (see Muirhead 1981, pp. 61-63). In the simulations, the value of b was taken to be the estimate of β in the spline regression model g(Y_hfkl) = X_hfkl β. We also chose the following values so that they should have very weak effects on the simulation results: ν = λ = ν_F = λ_F = 10, Σ = R_0 = R_h0 = R_f0 = 10 I_10 with I_10 being the 10 x 10 identity matrix, and ξ_0 = ξ_h0 = ξ_f0 = 50.
B Posterior Conditional Densities
In order to simulate the posterior distributions using MCMC methods we need a number of posterior conditional densities. The joint posterior density of (Θ, γ) given the observed ancestor of each family is

π(Θ, γ | Y) ∝ p{g(Y) | Θ, γ} π(Θ | γ) π(γ).

The conditional posterior density of Θ given γ and the observed ancestor of each family is

π(Θ | Y, γ) = p{g(Y) | Θ, γ} π(Θ | γ) / ∫ p{g(Y) | Θ, γ} π(Θ | γ) dΘ.
The conditional posterior density of γ given Θ and the observed ancestor of each family is, since by (A.1) the density p{g(Y) | Θ, γ} does not depend on γ,

π(γ | Y, Θ) ∝ π(Θ | γ) π(γ).
C Simulating the Posterior Densities by MCMC
To simulate the posterior distributions of the β_h's and β_f's, we first generate a random sample for (Θ, γ) from the posterior density π(Θ, γ | Y). Then we extract the values of β_h and β_f in the sample, which clearly comprise a random sample from the posterior distributions of (β_h, β_f). So the question is how to generate a random sample from the posterior density π(Θ, γ | Y). To answer this question we apply a Markov chain Monte Carlo (MCMC) method which uses a Metropolis-Hastings acceptance-rejection algorithm alternately on the two blocks of the random quantity (Θ, γ); in other words, this is the so-called MCMC block-at-a-time algorithm (Chib and Greenberg, 1995). The algorithm is as follows.

Obtain the initial values (γ^(0), Θ^(0)):
1. Generate a matrix γ^(0) from the prior π(γ).
2. Generate a vector Θ^(0) from the prior π(Θ | γ^(0)).

Update (γ^(j), Θ^(j)) to (γ^(j+1), Θ^(j+1)); repeat for j = 0, 1, …, N − 1:
1. Use an M-H acceptance-rejection algorithm to generate a matrix γ^(j+1) from the conditional posterior density π(γ | Y, Θ^(j)):
   - Generate an initial matrix γ_0^(j) from π(γ).
   - Repeat for i = 0, 1, …, I − 1:
     - Generate a matrix γ' from π(γ) and u from the uniform distribution U(0, 1).
     - If u < α_γ(γ_i^(j), γ'), set γ_{i+1}^(j) = γ'; otherwise set γ_{i+1}^(j) = γ_i^(j). The acceptance probability is defined by

       α_γ(γ_i^(j), γ') = min{ π(Θ^(j) | γ') / π(Θ^(j) | γ_i^(j)), 1 }.

   - Finally set γ^(j+1) = γ_I^(j).
2. Use an M-H acceptance-rejection algorithm to generate a vector Θ^(j+1) from the conditional posterior density π(Θ | Y, γ^(j+1)):
   - Generate an initial vector Θ_0^(j) from π(Θ | γ^(j+1)).
   - Repeat for q = 0, 1, …, Q − 1:
     - Generate a vector Θ' from π(Θ | γ^(j+1)) and u from the uniform distribution U(0, 1).
     - If u < α_Θ(Θ_q^(j), Θ'), set Θ_{q+1}^(j) = Θ'. The acceptance probability is defined by

       α_Θ(Θ_q^(j), Θ') = min{ p{g(Y) | Θ', γ^(j+1)} / p{g(Y) | Θ_q^(j), γ^(j+1)}, 1 }.

     - If u ≥ α_Θ(Θ_q^(j), Θ'), set Θ_{q+1}^(j) = Θ_q^(j).
   - Finally set Θ^(j+1) = Θ_Q^(j).

Return the sequence {(γ^(0), Θ^(0)), (γ^(1), Θ^(1)), …, (γ^(N), Θ^(N))}. As in any MCMC method, the sequence generated by the above algorithm is a Markov chain with π(Θ, γ | Y) as its invariant density. So, by ignoring the first N_0 values, the sequence can be approximately regarded as a random sample from the posterior density π(Θ, γ | Y), provided that both N_0 and N are sufficiently large. In our simulations we found that when taking I = Q = 50 the simulated Markov chain became convergent after N_0 ≥ 48 runs.
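As a toy illustration of this block-at-a-time scheme (a sketch under assumed settings, not the authors' model), consider a two-block target in which each block has a standard normal prior and a Gaussian likelihood centered at 1 and −1. With independence proposals drawn from the priors, the prior terms cancel and the acceptance probability reduces to a likelihood ratio, exactly as in the formulas for α_γ and α_Θ above. All names and parameter values here are illustrative.

```python
import math
import random

def log_lik(x, y):
    # toy "data" likelihood: components pulled toward 1 and -1 (assumed)
    return -((x - 1.0) ** 2 + (y + 1.0) ** 2) / (2 * 0.25)

def block_mh(n_outer=4000, n_inner=5, seed=1):
    """Block-at-a-time Metropolis-Hastings with independence proposals
    from the N(0,1) priors; the acceptance ratio is a likelihood ratio."""
    rng = random.Random(seed)
    x, y = rng.gauss(0, 1), rng.gauss(0, 1)
    draws = []
    for _ in range(n_outer):
        # block 1: update x given y
        for _ in range(n_inner):
            xp = rng.gauss(0, 1)
            if math.log(rng.random()) < log_lik(xp, y) - log_lik(x, y):
                x = xp
        # block 2: update y given x
        for _ in range(n_inner):
            yp = rng.gauss(0, 1)
            if math.log(rng.random()) < log_lik(x, yp) - log_lik(x, y):
                y = yp
        draws.append((x, y))
    return draws
```

Averaging the retained draws approximates the posterior means, which by conjugacy are 0.8 and −0.8 in this toy setup.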
Institute of Mathematical Statistics LECTURE NOTES — MONOGRAPH SERIES
SEMIPARAMETRIC INFERENCE FOR SYNCHRONIZATION OF POPULATION CYCLES P.E. Greenwood Dept. of Mathematics, Arizona State University, Tempe, Arizona and Dept. of Mathematics, University of British Columbia, Canada D.T. Haydon Centre for Tropical Veterinary Medicine Easter Bush, Roslin, Midlothian, U.K.
1 Introduction
We consider a dynamic random field. On each of a discrete array of sites is located a hyperbolic dynamical system perturbed by noise. We assume that the dynamics are identical, with a unimodal limit cycle, and that the perturbations are independent, centered, and with the same distribution. In addition the individual processes are coupled with one another in a homogeneous pattern. The coupling may be global, in which case we are thinking of a mean-field type of system. Or the coupling may be local, the coupling strength between each site and its neighbors attenuating with distance. The application we have in mind is to cycling populations of animals, where the log of each local population increases roughly linearly to an apparent critical point from which it falls precipitously to a minimum. The data is discrete in space and time, being based on periodic reports from catchment or reporting regions. In a previous study [1] we focussed on data from Canadian lynx populations. Lynx population cycles are known to follow those of snowshoe hare. Previous analyses of this data have been concerned with inferring the length and regularity of the evident population cycles. We consider in [1] the very different challenge of estimating a parameter identified as strength of coupling among populations. That paper is primarily addressed to data analysis and interpretation. The emphasis in this report is on the steps involved in arriving at a suitable model and estimator. We explore some of the difficulties posed by this rather unusual problem. In forthcoming studies we, with colleagues, will apply the method described here to cycling population data from Canadian muskrat and mink, and from the grey-sided vole of Hokkaido.
2 Development of a Coupled Evolving Phase Field Model
To begin, let us introduce a general definition of synchronization applicable to random fields. Let X = {X_it, i ∈ I, t = 0, 1, 2, …} denote the values at site i and time t of a dynamic random field. Let S = {S(·)_t, t ≥ 0} denote a real-valued functional or collection of functionals of {X_is, i ∈ I, s ≤ t}. Note that we allow S evaluated at time t to depend on data up to and including time t. Define S-synchronization to mean (SX)_t = 0, almost surely, for t in some time-set T. This gives us a flexible definition of exact synchronization. But a random field will not be exactly synchronized. In order to formulate a statistically useful definition we need to allow for, and to measure, departures from synchronization. For example, we probably want the mean and variance of S applied to random data to be small. We are led to the following:

Definition. A dynamic random field X departs from S-synchronization by no more than ε over the time interval T if E(SX)_t² ≤ ε for all t ∈ T.

One might use other norms, e.g. L¹ or Kullback-Leibler instead of L². However this definition has a familiar form and is computationally convenient. At each site i in an array of sites I is located a "skeleton" process which, in the absence of any noise or coupling, can be written

X_{i,t+1} = f(X_{it}),  i ∈ I, t = 0, 1, 2, ….   (2.1)
In this treatment we write the dynamics as of first order in discrete time. We assume there is a unimodal limit cycle ℓ(t), t = 0, …, p, ℓ(0) = ℓ(p), where, since f is the same for each i, the form of the cycle, ℓ(·), and its period, p, are the same for each site i. For definiteness let ℓ(0) be the minimum point of the unimodal cycle. If we picture the deterministic field (2.1) running in equilibrium, we see the periodic cycle ℓ(·) being executed at each site, the only difference between sites being the phase, which will differ if the initial points X_{i0} were not identical. Now suppose we watch, instead, the stochastic field

X_{i,t+1} = f(X_{it}) + ε_{it},  i ∈ I, t = 0, 1, 2, …,   (2.2)
where ε_it represents a small, identically distributed, centered noise. A theorem of deterministic dynamics called the "shadowing lemma" implies that the stochastic paths of (2.2) shadow the paths of (2.1), where "shadow" means that they remain in a distributional neighborhood of the deterministic path up to a time shift. The differences among the paths at the various sites, then, in addition to the small width of this neighborhood, are the phases, which change in a random way. The neighborhood width depends on the
noise variance. This picture motivates a model for coupling based on the randomly evolving phases of the components. We now move our attention from the evolving random field of population levels, (2.2), to a corresponding evolving random field of phases. Going back to the deterministic model (2.1) we can unambiguously define the phase φ_it at site i, time t, to be the time fraction of the current cycle, ℓ(0), …, ℓ(p), which has been accomplished at time t. Then each phase φ_it is in the interval [0, 1). The arithmetic for phases is mod 1. We think of the set of points {φ_it, i ∈ I} as a set on the circle of circumference 1. As time advances the points progress around the circle. Since in fact we wish to consider the phase field associated with the stochastic field (2.2), where the "limit cycle" is a stochastic perturbation of the deterministic one, we may not see a unique minimum for each cycle, and the definition of φ_it may be ambiguous. We will assume that the noise ε_it is small enough so that the ambiguity can be resolved by a device described in the next section. For the moment let us ignore this problem and assume that in the stochastic model the phase φ_it is defined as the fraction of the current orbit which has been traversed at time t by each path X_i of the stochastic field (2.2). We describe the structure of the phase field φ_it, i ∈ I, t = 0, 1, 2, …, by writing

φ_{i,t+1} = φ_it + g_it + ε_it  mod 1,   (2.3)
where g_it is the fraction of the current orbit traversed at site i at time t. Now we introduce a hypothetical coupling force into the phase field which will shift the phase at each site i and time t in the direction of the "mean phase". For this purpose we need to devise an appropriate definition of mean for a set of random points on a circle. As mentioned above we identify the phase values with points on a circle of circumference 1. Between points x, y on this circle, let Δ(x, y) denote the signed smallest arc measured counter-clockwise between them. Then Δ(x, y) is positive or negative according to whether x leads or lags y, and |Δ(x, y)| ≤ 0.5. For each set of phases φ_it with t fixed, let φ̄_t be a solution of

∑_i Δ(φ_it, φ̄_t) = 0.   (2.4)
Then φ̄_t is almost surely uniquely defined and can be interpreted as the mean phase at time t on the circle, or equivalently on [0, 1) mod 1. To model phase coupling among sites i in the array I, we insert a coupling term in expression (2.3) which moves the phases at time t + 1 in the direction of the mean phase at time t, and write

φ_{i,t+1} = φ_it + g_it − c Δ(φ_it, φ̄_t) + ε_it  mod 1,  i ∈ I, t = 0, 1, ….   (2.5)
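The signed arc Δ and the mean phase defined by (2.4) can be computed directly. The sketch below is illustrative (the simple fixed-point iteration is an assumed solution device that works for reasonably clustered phases; the paper only asserts that the mean is almost surely unique):

```python
def circ_diff(x, y):
    """Signed smallest arc Delta(x, y) between points on the circle of
    circumference 1, with |Delta| <= 0.5; sign indicates lead vs. lag."""
    d = (x - y) % 1.0
    return d - 1.0 if d > 0.5 else d

def mean_phase(phases, tol=1e-10, max_iter=1000):
    """Solve sum_i Delta(phi_i, m) = 0, as in (2.4), by repeatedly
    moving a candidate mean by the average signed deviation."""
    m = phases[0]
    for _ in range(max_iter):
        shift = sum(circ_diff(p, m) for p in phases) / len(phases)
        m = (m + shift) % 1.0
        if abs(shift) < tol:
            break
    return m
```

Note that the mean correctly handles phases straddling the wrap point: the mean of 0.9 and 0.1 is 0.0, not 0.5.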
The coupling strength c can take values in [0, 1). Expression (2.5) says that in each time step, from t to t + 1, three changes occur in the phase at each site: (a) it advances according to the phase dynamic g_it inherited from the limit cycle of the "skeleton" process (2.1), (b) it is shifted in the direction of the mean of the phases as defined by (2.4) by a proportion c of its distance from this mean, and (c) it is perturbed by a centered random effect. About the ε_it we assume only that for each t they are conditionally independent given the past of the process {φ_it}, with the same conditional distribution for each i and t. In order to simplify our computations we would like to omit the mod 1 in relation (2.5). This is possible if it does not happen, or happens only rarely, that the combined effect of (a), (b), and (c) at a particular time step t is large enough that the phase increment at a single step exceeds 1. In the model we make this assumption. If it is occasionally violated in the data we make an appropriate adjustment. Assuming, then, that the noise is sufficiently small relative to the coupling, (2.5) can be written as

Δ(φ_{i,t+1}, φ̄_{t+1}) = (1 − c) Δ(φ_it, φ̄_t) + ε_it + (g_it − φ̄_{t+1} + φ̄_t).   (2.6)
The last grouped term in (2.6) has mean 0, as can be seen from considerations of symmetry. In addition, for each t the various values in i are conditionally independent given the past of the phase field. We have assumed that the noise increments ε_it are small and centered and conditionally independent given the past of the phase field. Without altering this assumption we can regard the last grouped term in (2.6) as part of the noise and write (2.6) as

Δ(φ_{i,t+1}, φ̄_{t+1}) = (1 − c) Δ(φ_it, φ̄_t) + η_it,  i ∈ I, t = 0, 1, ….   (2.7)
Relation (2.7) describes a stochastic process of AR(1) type. It differs from the usual AR(1) model in that the process is a vector-valued function of a distinct vector process. The underlying process has vectors of phases as its values, whereas the process defined by (2.7) has values which are vectors of differences of these phases from the evolving central phase. Nevertheless the least squares type estimator is consistent for (1 − c), and that is what we used in [1]. One needs to know that there is a stationary law for the process defined by (2.7). This can be shown by an argument along the lines of arguments in Meyn and Tweedie's book [2]. The fact that the state space is bounded simplifies the situation.
If we wish to model local rather than global coupling, then the central phase as seen from site i should be computed from the phases at sites near i. We then define the local centre of phases at i to be the almost surely unique phase φ̄_it which is the solution of

∑_j a_ij Δ(φ_jt, φ̄_it) = 0,  i ∈ I, t = 0, 1, …,   (2.8)

where the numbers a_ij are weights reflecting the fraction of the total coupling at site i to be attributed to interaction between sites i and j, with ∑_j a_ij = 1. We
arrive at a version of (2.7) with φ̄_it replacing φ̄_t in each equation. If edge effects are disregarded, and if a_ij depends only on i − j, the system remains spatially homogeneous in law. In any case, a stationary distribution exists for the process and the least-squares-type estimator is consistent for 1 − c. Let us return to the notion of departure from S-synchronization by no more than ε, as defined early in this section. Perfect synchronization corresponds to departure zero, and higher levels of departure correspond to less synchronization. The synchronization measure we choose here is the mean-square deviation of the phase field from its mean φ̄_t, or φ̄_it, as defined by (2.4) or (2.8),

(SX)_t² = (1/n) ∑_i Δ(φ_it, φ̄_t)².   (2.9)
The departure from synchronization, then, is defined as

E(SX)_t² = (1/n) ∑_i E Δ(φ_it, φ̄_t)².   (2.10)
3 Estimation
In an initial application of the model presented here [1], we study synchronization in the well-known Canadian lynx data sets. Here we summarize a few of the results and refer to [1] for the full treatment and references. The first data set, compiled by Elton and Nicholson in 1942 [4], spans the years 1821 to 1891 and is organized over six trading regions of the Hudson's Bay Company. The second data set, compiled by Statistics Canada [5], is organized over eight Canadian provinces and territories and spans the years 1919-1990. We used estimators of (1 − c) and σ²_η, where η is the noise increment in (2.7), which are analogous to well-known linear estimators for parameters of AR(1) time series. The data analysis involves two difficulties peculiar to the phase-coupled model (2.7). As mentioned earlier it was not possible to identify uniquely
the time points where minima of cycles occur in the data. Our treatment in [1] involves identifying the sets of possible minima and repeating the estimation procedure many times using random choices from these sets. We are exploring other possible ways to handle this problem in other data sets. A second difficulty involves discontinuities in the function Δ(φ_it, φ̄_t) at ±0.5, and a very few instances where Δ(φ̄_{t+1}, φ̄_t) < 0. Our handling of these is detailed in [1]. We mention these problems here to indicate the novel difficulties posed by data analysis in this context. The estimators we use for θ = 1 − c and for σ²_η are

θ̂ = ∑_{i=1}^{n} ∑_{t=1}^{N−1} Δ(φ_{i,t+1}, φ̄_{t+1}) Δ(φ_it, φ̄_t) / ∑_{i=1}^{n} ∑_{t=1}^{N−1} Δ(φ_it, φ̄_t)²,

σ̂²_η = ∑_{i=1}^{n} ∑_{t=1}^{N−1} {Δ(φ_{i,t+1}, φ̄_{t+1}) − θ̂ Δ(φ_it, φ̄_t)}² / (n(N − 1) − 2).   (3.1)
We can distinguish, by looking at the residuals from these estimates, the times at which synchrony is maintained by intermittent synchronizing events from the time periods of constant phase-coupling dynamics as expressed in expression (2.7). At the times of synchronizing events we see outliers in the residuals, whereas corresponding to periods of consistent coupling the data are clustered around the regression line. Times, thus detected, of synchronizing events correspond to times of decrease of the estimator (2.9) of asynchrony. The coupling estimates we obtained were ĉ ≈ 0.054 for the earlier data set and ĉ ≈ 0.011 for the later one. With local weighting the estimate doubled to ĉ ≈ 0.096 for the earlier data set and remained about the same, at ĉ ≈ 0.005, for the later data set. The numbers for the earlier data set are significant according to tests using the estimated variances, and are significantly greater than estimates produced from simulated cycles with randomized phases from a model fitted to this data. Thus, although the estimates ĉ from the lynx data seem small, we believe that they measure the phenomenon of coupling as intended. An alternative statistical approach to the study of synchronization of population cycles uses the correlation structure of the logged abundance data, as in e.g. Jolliffe [3]. With this method, amplitude variability may confound the detection of synchronization. Further, since phase coupling is not modelled, correlational methods measure phase coupling only very indirectly.
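To see the least-squares estimator at work, the following self-contained sketch (not the authors' code; all parameter values are assumed for illustration) simulates the coupled field (2.5) with a common drift and recovers θ = 1 − c via the ratio estimator in (3.1). The circular-difference helper is repeated here to keep the sketch self-contained.

```python
import random

def circ_diff(x, y):
    # signed smallest arc from y to x on the unit-circumference circle
    d = (x - y) % 1.0
    return d - 1.0 if d > 0.5 else d

def simulate_phases(n_sites=8, n_steps=400, c=0.1, g=0.02, sigma=0.01, seed=3):
    """Simulate the coupled phase field (2.5) with a common drift g and
    i.i.d. Gaussian noise; the central phase is the circular mean, which
    for clustered phases is obtained in one correction step from site 0."""
    rng = random.Random(seed)
    phases = [rng.random() * 0.1 for _ in range(n_sites)]  # nearly in phase
    history = []
    for _ in range(n_steps):
        m = phases[0]
        m = (m + sum(circ_diff(p, m) for p in phases) / n_sites) % 1.0
        history.append((list(phases), m))
        phases = [(p + g - c * circ_diff(p, m) + rng.gauss(0, sigma)) % 1.0
                  for p in phases]
    return history

def estimate_theta(history):
    """Least-squares estimator (3.1) of theta = 1 - c, regressing the
    phase deviations at t+1 on those at t as in the AR(1) relation (2.7)."""
    num = den = 0.0
    for (phis0, m0), (phis1, m1) in zip(history, history[1:]):
        for p0, p1 in zip(phis0, phis1):
            d0, d1 = circ_diff(p0, m0), circ_diff(p1, m1)
            num += d1 * d0
            den += d0 * d0
    return num / den
```

With small noise the estimator sits close to the true 1 − c, illustrating the consistency claimed in the text.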
INFERENCE FOR SYNCHRONIZATION
Acknowledgements  This work was supported by NSERC Canada and the Crisis Points Group of the Peter Wall Institute for Advanced Studies at U.B.C. The second author acknowledges support from the Wellcome Trust.
References

[1] Haydon, D.T. and Greenwood, P.E. (2000). Spatial coupling in cyclic population dynamics: models and data. Theoretical Population Biology, 58, 239-254.

[2] Meyn, S.P. and Tweedie, R.L. (1993). Markov Chains and Stochastic Stability. Springer, Heidelberg.

[3] Jolliffe, I.T. (1986). Principal Components Analysis. Springer, New York.

[4] Elton, C. and Nicholson, M. (1942). The ten-year cycle in numbers of lynx in Canada. J. Anim. Ecol., 11, 215-244.

[5] Statistics Canada (1921-1990). Catalogue 23-207, Fur Production. Ottawa, Canada.
Institute of Mathematical Statistics LECTURE NOTES — MONOGRAPH SERIES
PLUG-IN ESTIMATORS IN SEMIPARAMETRIC STOCHASTIC PROCESS MODELS

Ursula U. Müller, Universität Bremen
Anton Schick*, Binghamton University
Wolfgang Wefelmeyer, Universität Siegen

Abstract  Consider a locally asymptotically normal semiparametric model with a real parameter $\vartheta$ and a possibly infinite-dimensional parameter $F$. We are interested in estimating a real-valued functional $a(F)$. If $\hat a_\vartheta$ estimates $a(F)$ for known $\vartheta$, and $\hat\vartheta$ estimates $\vartheta$, then the plug-in estimator $\hat a_{\hat\vartheta}$ estimates $a(F)$ if $\vartheta$ is unknown. We show that $\hat a_{\hat\vartheta}$ is asymptotically linear and regular if $\hat a_\vartheta$ and $\hat\vartheta$ are, and calculate the influence function and the asymptotic variance of $\hat a_{\hat\vartheta}$. If $a(F)$ can be estimated adaptively with respect to $\vartheta$, then $\hat a_{\hat\vartheta}$ is efficient if $\hat a_\vartheta$ is efficient. If $a(F)$ cannot be estimated adaptively, then for $\hat a_{\hat\vartheta}$ to be efficient, $\hat\vartheta$ must also be efficient. We illustrate the results with stochastic process models, in particular with time series models, and discuss extensions of the results.

Key Words: Empirical estimator, asymptotically linear estimator, influence function, regular estimator, Markov chain model, nonlinear regression, residual distribution, nonlinear autoregression, innovation distribution, stochastic equicontinuity, stochastic differentiability.
1  Introduction
Let $\mathcal{P}_n = \{P_{n\vartheta F} : \vartheta \in \Theta, F \in \mathcal{F}\}$ denote a sequence of semiparametric models, with $\Theta$ one-dimensional and $\mathcal{F}$ a possibly infinite-dimensional set. We are interested in estimating a real-valued functional $a(F)$. For each $\vartheta$

*Research partially supported by NSF Grant DMS 0072174.
let $\hat a_\vartheta$ be an estimator for $a(F)$ when $\vartheta$ is known, and let $\hat\vartheta$ be an estimator for $\vartheta$. The plug-in estimator for $a(F)$ is $\hat a_{\hat\vartheta}$, obtained by substituting $\hat\vartheta$ for $\vartheta$ in $\hat a_\vartheta$. In Section 2 we introduce a semiparametric version of local asymptotic normality, and a concept of asymptotically linear estimators for functionals on such models. This concept requires embedding the semiparametric model into a "nonparametric" supermodel, which is to some extent arbitrary, although for specific types of stochastic models there usually is a natural choice. We recall two results which are essentially known, at least for i.i.d. models and Markov chain models. The first characterizes regular estimators among asymptotically linear ones; the second characterizes efficient estimators. In Section 3 we show that $\hat a_{\hat\vartheta}$ is asymptotically linear if $\hat a_\vartheta$ and $\hat\vartheta$ are, and calculate the influence function of $\hat a_{\hat\vartheta}$. In Section 4 we assume that both $\hat a_\vartheta$ and $\hat\vartheta$ are asymptotically linear and regular, and show that then $\hat a_{\hat\vartheta}$ is also regular. The characterization of regular estimators then allows comparison of the asymptotic variance of the plug-in estimator with the minimal asymptotic variance. In particular, if $a(F)$ can in principle be estimated adaptively with respect to $\vartheta$, i.e., if knowledge of $\vartheta$ does not contain information about $a(F)$, then $\hat a_{\hat\vartheta}$ is efficient whenever $\hat a_\vartheta$ is efficient for known $\vartheta$. If $a(F)$ cannot be estimated adaptively with respect to $\vartheta$, then both $\hat a_\vartheta$ and $\hat\vartheta$ must be efficient for $\hat a_{\hat\vartheta}$ to be efficient. For i.i.d. observations and estimators of the form $\hat a_\vartheta = a(\hat F_\vartheta)$, these results are due to Klaassen and Putter (1999). Sections 5 and 6 contain applications to models with i.i.d. observations and to Markov chain models, respectively. The discussion in these two sections is heuristic. Some extensions of the results are outlined in Section 7.
2  Characterizing Regular and Efficient Estimators
In this section we briefly recall characterizations of regular and of efficient estimators of real-valued functionals in semiparametric models. For the i.i.d. case we refer to Bickel, Klaassen, Ritov and Wellner (1998), Chapter 3 and in particular Section 3.4. For Markov chains see Wefelmeyer (1999); for general models and counting process models see Andersen, Borgan, Gill and Keiding (1993, Chapter VIII); for general parametric models see Le Cam and Yang (1990). We will embed the semiparametric model in a "nonparametric" supermodel. The reason is that we want a sufficiently rich class of "asymptotically linear" estimators, and asymptotic linearity will be defined in terms of statistics approximating local likelihood ratios; see below. For $n \in \mathbb{N}$ let $\mathcal{P}_n$ denote a collection of probability measures on some
measurable space $(\Omega_n, \mathcal{F}_n)$. Fix a sequence $P_n$ in $\mathcal{P}_n$ and assume that the model is locally asymptotically normal at $P_n$ in the following sense. There is a linear space $H$ with inner product $(h, h')$ and corresponding norm $\|h\|$, there is a sequence of random linear functionals $S_n$ on $H$, and for each $h \in H$ there are perturbations $P_{nh}$ of $P_n$ within $\mathcal{P}_n$ such that

$$\log\frac{dP_{nh}}{dP_n} = S_n(h) - \tfrac12\|h\|^2 + o_{P_n}(1), \qquad (2.1)$$

$$S_n(h) \Rightarrow \|h\| N \quad\text{under } P_n, \qquad (2.2)$$

with $N$ denoting a standard normal random variable. The linear space $H$ may be interpreted as (approximate) tangent space of $\mathcal{P}_n$ at $P_n$. Now consider a sequence of semiparametric submodels $\{P_{n\vartheta F} : \vartheta \in \Theta, F \in \mathcal{F}\}$ of $\mathcal{P}_n$, with $\Theta$ one-dimensional and $\mathcal{F}$ an arbitrary, possibly infinite-dimensional set. We fix $\vartheta$ and $F$ and consider perturbations $\vartheta_{nu} = \vartheta + n^{-1/2}u$ of $\vartheta$ and $F_{nv}$ of $F$, with $v$ in some linear space $V$. In general the appropriate rate may be different from $n^{-1/2}$, but it will be $n^{-1/2}$ in our applications, which is why we have taken it to be $n^{-1/2}$ here. We assume that (2.1) and (2.2) hold at $P_n = P_{n\vartheta F}$, and that the semiparametric model is locally asymptotically normal in the following sense. There are an element $m$ in $H$ and a linear operator $D : V \to H$ such that, for $P_{nuv} = P_{n\vartheta_{nu}F_{nv}}$,

$$\log\frac{dP_{nuv}}{dP_{n\vartheta F}} = S_n(um + Dv) - \tfrac12\|um + Dv\|^2 + o_{P_{n\vartheta F}}(1).$$

From (2.2),

$$S_n(um + Dv) \Rightarrow \|um + Dv\| N \quad\text{under } P_{n\vartheta F}.$$

The tangent space of the semiparametric model is $\bar H = [m] + DV$, with $[m]$ denoting the linear span of $m$. The one-dimensional space $[m]$ is the tangent space for known $F$, obtained by perturbing $\vartheta$. The space $DV$ is the tangent space for known $\vartheta$, obtained by perturbing $F$. We assume that $DV$ is closed, and that $m$ does not belong to $DV$. We may think of $m$ as the score function for $\vartheta$. Let $a(\vartheta, F)$ be a real-valued functional. It is called differentiable at $(\vartheta, F)$ with gradient $g$ if $g \in H$ and

$$n^{1/2}(a(\vartheta_{nu}, F_{nv}) - a(\vartheta, F)) \to (g, um + Dv) \quad\text{for } u \in \mathbb{R},\ v \in V. \qquad (2.3)$$

The canonical gradient $\bar g$ is the projection of $g$ onto $\bar H$; it is therefore of the form $\bar u m + D\bar v$.
An estimator $\hat a$ is called asymptotically linear for $a(\vartheta, F)$ with influence function $b$ if $b \in H$ and

$$n^{1/2}(\hat a - a(\vartheta, F)) = S_n(b) + o_{P_{n\vartheta F}}(1). \qquad (2.4)$$

An estimator $\hat a$ is called regular for $a(\vartheta, F)$ with limit $L$ if $L$ is a random variable such that

$$n^{1/2}(\hat a - a(\vartheta_{nu}, F_{nv})) \Rightarrow L \quad\text{under } P_{nuv} \text{ for all } u \in \mathbb{R},\ v \in V. \qquad (2.5)$$

The convolution theorem of Hájek (1970) says that

$$L = \|\bar g\| N + M \quad\text{in distribution},$$

with $M$ independent of $N$. This justifies calling a regular estimator $\hat a$ efficient for $a(\vartheta, F)$ if it is asymptotically normal with variance $\|\bar g\|^2$, i.e. $n^{1/2}(\hat a - a(\vartheta, F)) \Rightarrow \|\bar g\| N$ under $P_{n\vartheta F}$. We have the following two characterizations: (1) An asymptotically linear estimator is regular for $a(\vartheta, F)$ if and only if its influence function is a gradient of $a(\vartheta, F)$. (2) A regular estimator is efficient for $a(\vartheta, F)$ if and only if it is asymptotically linear with influence function equal to the canonical gradient of $a(\vartheta, F)$. Now it becomes clear why we have introduced the "nonparametric" model. If we restrict attention to the semiparametric model, then there is only one gradient, the canonical one, and all regular and asymptotically linear estimators are asymptotically equivalent. In the examples of Sections 5 and 6 we will need to consider estimators whose influence functions are non-canonical gradients. The concept of asymptotically linear estimators is arbitrary in that it depends on the choice of the "nonparametric" model; see Wefelmeyer (1991).
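The defining identity (2.4) can be seen concretely by anticipating the i.i.d. specialization of Section 5. The Python sketch below is our illustration, not from the paper; the distribution $Q = N(0,1)$ and the functional $k(x) = x^2$ are assumptions. It takes the empirical estimator of $a = Ek(X)$, whose influence function is $b(x) = k(x) - Ek(X)$, and checks that $n^{1/2}(\hat a - a)$ coincides with $S_n(b) = n^{-1/2}\sum_i b(X_i)$, here with zero remainder.

```python
import numpy as np

# Hypothetical illustration (not the paper's example): for i.i.d. data the
# empirical estimator a_hat = (1/n) sum k(X_i) of a = E k(X) is asymptotically
# linear with influence function b(x) = k(x) - E k(X), since
#   n^{1/2}(a_hat - a) = S_n(b) = n^{-1/2} sum_i b(X_i)  holds exactly here.
rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(size=n)          # Q = N(0, 1), an assumption for the sketch
k = lambda y: y ** 2            # functional a = E X^2 = 1 under Q
a_true = 1.0
a_hat = k(x).mean()
S_n_b = (k(x) - a_true).sum() / np.sqrt(n)
lhs = np.sqrt(n) * (a_hat - a_true)
gap = abs(lhs - S_n_b)          # zero up to floating-point rounding
```

The point of the sketch is only that (2.4) holds with no $o_{P_n}(1)$ remainder for this simplest estimator; plug-in estimators below satisfy it with a nontrivial remainder.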
3  Asymptotic Linearity of Plug-in Estimators
Consider the problem of estimating a real-valued functional $a(F)$ in the semiparametric model $\{P_{n\vartheta F} : \vartheta \in \Theta, F \in \mathcal{F}\}$. Fix $\vartheta$ and $F$. Suppose that for each $\tau$ near $\vartheta$ we have an asymptotically linear estimator $\hat a_\tau$ of $a(F)$, with influence function $b_\tau$. We assume that asymptotic linearity holds locally uniformly in shrinking neighborhoods of $\vartheta$,

$$\sup_{|u| \le \Delta} |n^{1/2}(\hat a_{\vartheta_{nu}} - a(F)) - S_n(b_{\vartheta_{nu}})| = o_{P_{n\vartheta F}}(1), \qquad (3.1)$$

and that $S_n(b_{\vartheta_{nu}})$ is stochastically differentiable at $u = 0$,

$$\sup_{|u| \le \Delta} |S_n(b_{\vartheta_{nu}}) - S_n(b_\vartheta) + (m, b_\vartheta)u| = o_{P_{n\vartheta F}}(1). \qquad (3.2)$$

Now let $\hat\vartheta$ be asymptotically linear for $\vartheta$, with influence function $c$,

$$n^{1/2}(\hat\vartheta - \vartheta) = S_n(c) + o_{P_{n\vartheta F}}(1).$$

By conditions (3.1) and (3.2), the plug-in estimator $\hat a_{\hat\vartheta}$ fulfills

$$n^{1/2}(\hat a_{\hat\vartheta} - a(F)) = S_n(b_{\hat\vartheta}) + o_{P_{n\vartheta F}}(1) = S_n(b_\vartheta - (m, b_\vartheta)c) + o_{P_{n\vartheta F}}(1).$$

This means that $\hat a_{\hat\vartheta}$ is asymptotically linear for $a(F)$ with influence function

$$b = b_\vartheta - (m, b_\vartheta)c. \qquad (3.3)$$

If $(m, b_\vartheta) = 0$, then we can relax the assumption that $\hat\vartheta$ is asymptotically linear to $n^{1/2}$-consistency and still conclude that the plug-in estimator is asymptotically linear, now with influence function $b = b_\vartheta$. Asymptotic linearity (3.3) of the plug-in estimator also follows if we replace conditions (3.1) and (3.2) by a non-uniform version of (3.1),

$$n^{1/2}(\hat a_\vartheta - a(F)) = S_n(b_\vartheta) + o_{P_{n\vartheta F}}(1), \qquad (3.4)$$

and an expansion of $\hat a_{\hat\vartheta}$,

$$n^{1/2}(\hat a_{\hat\vartheta} - \hat a_\vartheta) = -(m, b_\vartheta)n^{1/2}(\hat\vartheta - \vartheta) + o_{P_{n\vartheta F}}(1). \qquad (3.5)$$
An application is Example 3 in Section 6.

Remark 1. (Plug-in and sample splitting.) Our requirements (3.1) and (3.2) are stronger than the following conditions: for every bounded sequence $u_n$,

$$n^{1/2}(\hat a_{\vartheta_{nu_n}} - a(F)) = S_n(b_{\vartheta_{nu_n}}) + o_{P_{n\vartheta F}}(1), \qquad (3.6)$$

$$S_n(b_{\vartheta_{nu_n}}) - S_n(b_\vartheta) = -u_n(m, b_\vartheta) + o_{P_{n\vartheta F}}(1). \qquad (3.7)$$

Property (3.7) appears quite frequently in the literature. It has been verified by Drost, Klaassen and Werker (1997), Jeganathan (1995), Koul and Schick (1997) and Kreiss (1987) for some time series models. A simple sufficient condition for (3.7) is given in Schick (2000) in the context of i.i.d. observations.
It was shown by Klaassen and Putter (1999) in the context of i.i.d. observations that under these weaker conditions one can use sample splitting techniques to construct a modified version of the plug-in estimator with influence function $b$ as in (3.3). Their construction can be generalized to stationary and ergodic Markov chains using the sample splitting techniques developed in Schick (2001). But this will not be pursued here.

Remark 2. (Stochastic differentiability.) We will check stochastic differentiability (3.2) for specific types of processes in Sections 5 and 6, using stochastic equicontinuity of empirical processes. Here we indicate that (3.2) also follows from a locally uniform version of local asymptotic normality. Compare the proof of Theorem 6.2 in Bickel (1982). Since $F$ is fixed in (3.2), we will omit it from the notation. Fix $\vartheta$ and set $\tau = \vartheta_{nu} = \vartheta + n^{-1/2}u$. Assume that the parametric family $\{P_{n\vartheta} : \vartheta \in \Theta\}$ is locally asymptotically normal at $\vartheta$,

$$\log\frac{dP_{n\tau}}{dP_{n\vartheta}} = uS_{n\vartheta}(m) - \tfrac12 u^2\|m\|_\vartheta^2 + o_{P_{n\vartheta}}(1).$$

Assume also that for $\tau$ near $\vartheta$ there is a tangent space $H_\tau$ such that for each $b_\tau \in H_\tau$ we have local asymptotic normality at $\tau$,

$$\log\frac{dP_{n\tau b_\tau}}{dP_{n\tau}} = S_{n\tau}(b_\tau) - \tfrac12\|b_\tau\|_\tau^2 + o_{P_{n\tau}}(1).$$

Then

$$\log\frac{dP_{n\tau b_\tau}}{dP_{n\vartheta}} = S_{n\tau}(b_\tau) + uS_{n\vartheta}(m) - \tfrac12\|b_\tau\|_\tau^2 - \tfrac12 u^2\|m\|_\vartheta^2 + o_{P_{n\vartheta}}(1). \qquad (3.8)$$

If $b_\tau$ is continuous at $\tau = \vartheta$ in an appropriate sense, $P_{n\tau b_\tau}$ will be approximately equal to $P_{n\vartheta, um + b_\vartheta}$. (For a more explicit argument it would be convenient if the sequence of "nonparametric" supermodels $\mathcal{P}_n$ were indexed by an infinite-dimensional parameter; see Le Cam (1986, Chapter 11) and Greenwood and Wefelmeyer (1991, Section 4).) Hence

$$\log\frac{dP_{n\tau b_\tau}}{dP_{n\vartheta}} = \log\frac{dP_{n\vartheta, um + b_\vartheta}}{dP_{n\vartheta}} + o_{P_{n\vartheta}}(1) = S_{n\vartheta}(um + b_\vartheta) - \tfrac12\|um + b_\vartheta\|_\vartheta^2 + o_{P_{n\vartheta}}(1). \qquad (3.9)$$

If both $S_{n\tau}$ and $\|\cdot\|_\tau$ are continuous at $\tau = \vartheta$ in an appropriate sense, we obtain from (3.8) and (3.9):

$$S_{n\tau}(b_\tau) - S_{n\vartheta}(b_\vartheta) + u(m, b_\vartheta) = o_{P_{n\vartheta}}(1).$$

Stochastic differentiability (3.2) requires that the supremum over $|u| \le \Delta$ is stochastically negligible. For this, we need a corresponding strong version of local asymptotic normality, as introduced in Fabian and Hannan (1985, Section 9.1).
4  Efficient and Adaptive Plug-in Estimators
We continue the discussion of plug-in estimators under additional assumptions. As in Section 3, let $a(F)$ be a real-valued functional of $F$. Fix $\vartheta$ and $F$. For $\tau$ near $\vartheta$ let $\hat a_\tau$ be a locally uniformly asymptotically linear estimator of $a(F)$ in the sense of (3.1), and let $\hat\vartheta$ be an asymptotically linear estimator of $\vartheta$. Now assume, in addition, that $a(F)$ is differentiable (2.3) at $(\vartheta, F)$, that $\hat a_\vartheta$ is regular at $(\vartheta, F)$ for known $\vartheta$, and that $\hat\vartheta$ is regular at $(\vartheta, F)$. Then we can decompose the canonical gradient of $a(F)$ and the influence function of the plug-in estimator $\hat a_{\hat\vartheta}$ as follows. Let $\bar c_F$ denote the canonical gradient of $\vartheta$ when $F$ is known. It is of the form $\bar c_F = tm$, with $t$ determined by $n^{1/2}(\vartheta_{nu} - \vartheta) = u = (tm, um)$, i.e., $t = \|m\|^{-2}$ and $\bar c_F = \|m\|^{-2}m$. The squared length $\|m\|^2$ of $m$ may be called the Fisher information for $\vartheta$ when $F$ is known. Let $D_V m$ be the projection of $m$ onto $DV$. When $F$ is unknown, the canonical gradient of $\vartheta$ is characterized by three properties: it is in $\bar H = [m] + DV$, orthogonal to $DV$, and its projection onto $[m]$ is $\bar c_F$. Hence it is of the form $\bar c = t(m - D_V m)$, with $t$ determined by $(\bar c - \bar c_F, m) = 0$, i.e., $t = (m - D_V m, m)^{-1} = \|m - D_V m\|^{-2}$. Hence the canonical gradient of $\vartheta$ is

$$\bar c = \|m - D_V m\|^{-2}(m - D_V m).$$

The squared length $\|m - D_V m\|^2$ of $m - D_V m$ is the Fisher information for $\vartheta$. Let $\bar g_\vartheta$ denote the canonical gradient of $a(F)$ when $\vartheta$ is known. The canonical gradient $\bar g$ of $a(F)$ for unknown $\vartheta$ is characterized by three properties: it is in $\bar H = [m] + DV$, orthogonal to $m$, and its projection onto $DV$ is $\bar g_\vartheta$. Hence $\bar g - \bar g_\vartheta$ is of the form $t\bar c$, with $t$ determined by $(m, \bar g) = 0$, i.e., $t = -(m, \bar g_\vartheta)$, i.e.,

$$\bar g = \bar g_\vartheta - (m, \bar g_\vartheta)\bar c. \qquad (4.1)$$
Since $\hat\vartheta$ is regular for $\vartheta$, we obtain from characterization (1) of Section 2 that the influence function of $\hat\vartheta$ is a gradient of $\vartheta$, say $c$. Hence $(m, c) = 1$, $c \perp DV$, and the projection of $c$ onto $\bar H$ is $\bar c$,

$$c - \bar c \perp \bar H. \qquad (4.2)$$

Since $\hat a_\vartheta$ is regular for $a(F)$ when $\vartheta$ is known, we obtain from characterization (1) of Section 2 that the influence function of $\hat a_\vartheta$ is a gradient of $a(F)$ for known $\vartheta$, say $g_\vartheta$. Hence

$$g_\vartheta - \bar g_\vartheta \perp DV. \qquad (4.3)$$
From (3.3) we obtain that the plug-in estimator $\hat a_{\hat\vartheta}$ is asymptotically linear for $a(F)$, with influence function

$$g = g_\vartheta - (m, g_\vartheta)c. \qquad (4.4)$$

From (4.1) and (4.4) we obtain

$$g - \bar g = g_\vartheta - \bar g_\vartheta - (m, g_\vartheta - \bar g_\vartheta)c - (m, \bar g_\vartheta)(c - \bar c). \qquad (4.5)$$

It follows from $(m, c) = 1$ and relations (4.2) and (4.3) that $g - \bar g$ is orthogonal to $\bar H$. Hence $g$ is a gradient. Characterization (1) of Section 2 now implies that the plug-in estimator $\hat a_{\hat\vartheta}$ is regular. Since $g - \bar g$ and $c - \bar c$ are orthogonal to $\bar H$, the asymptotic variance of the plug-in estimator $\hat a_{\hat\vartheta}$ is

$$\|g\|^2 = \|\bar g\|^2 + \|g_\vartheta - \bar g_\vartheta\|^2 + (m, g_\vartheta)^2\|c - \bar c\|^2 - \|m - D_V m\|^{-2}(m, g_\vartheta - \bar g_\vartheta)^2 - 2(c - \bar c, g_\vartheta - \bar g_\vartheta)(m, g_\vartheta). \qquad (4.6)$$

If $\hat a_\vartheta$ is efficient for $a(F)$ when $\vartheta$ is known, and $\hat\vartheta$ is efficient for $\vartheta$, then $g_\vartheta = \bar g_\vartheta$ and $c = \bar c$. Hence $g = \bar g$ by (4.5), and the plug-in estimator $\hat a_{\hat\vartheta}$ is efficient for $a(F)$. By (4.1), its asymptotic variance is

$$\|\bar g\|^2 = \|\bar g_\vartheta\|^2 + \|m - D_V m\|^{-2}(m, \bar g_\vartheta)^2. \qquad (4.7)$$

If $\hat a_\vartheta$ is efficient for $a(F)$ when $\vartheta$ is known, then by (4.5) the influence function of the plug-in estimator $\hat a_{\hat\vartheta}$ is

$$g = \bar g - (m, \bar g_\vartheta)(c - \bar c), \qquad (4.8)$$

and by (4.6) the asymptotic variance of $\hat a_{\hat\vartheta}$ is

$$\|g\|^2 = \|\bar g\|^2 + (m, \bar g_\vartheta)^2\|c - \bar c\|^2. \qquad (4.9)$$

If $\hat\vartheta$ is efficient for $\vartheta$, then by (4.5) the influence function of the plug-in estimator $\hat a_{\hat\vartheta}$ is

$$g = \bar g + g_\vartheta - \bar g_\vartheta - (m, g_\vartheta - \bar g_\vartheta)\bar c, \qquad (4.10)$$

and by (4.6) the asymptotic variance of $\hat a_{\hat\vartheta}$ is

$$\|g\|^2 = \|\bar g\|^2 + \|g_\vartheta - \bar g_\vartheta\|^2 - \|m - D_V m\|^{-2}(m, g_\vartheta - \bar g_\vartheta)^2. \qquad (4.11)$$

We say that $a(F)$ can be estimated adaptively with respect to $\vartheta$ if the asymptotic variance bound for $a(F)$ is not decreased by knowing $\vartheta$. This is the case if and only if $\bar g = \bar g_\vartheta$. By (4.1), this is equivalent to $(m, \bar g_\vartheta) = 0$. Then the canonical gradient $\bar g = \bar g_\vartheta$ does not involve $\bar c$. Hence the plug-in estimator $\hat a_{\hat\vartheta}$ is efficient whenever $\hat a_\vartheta$ is efficient for known $\vartheta$, and the asymptotic variance is $\|\bar g_\vartheta\|^2$. We say that $F$ can be estimated adaptively with respect to $\vartheta$ if for every differentiable functional $a(F)$ the asymptotic variance bound for $a(F)$ is not decreased by knowing $\vartheta$. This is equivalent to orthogonality of $m$ and $DV$. Then $\vartheta$ can also be estimated adaptively with respect to $F$, and $\bar c = \bar c_F = \|m\|^{-2}m$.
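The projection identities behind (4.1) and (4.7) can be verified in a finite-dimensional sketch. Here the choice $H = \mathbb{R}^d$ with the Euclidean inner product, and the random $m$, $D$ and coefficients, are our assumptions purely for illustration.

```python
import numpy as np

# Finite-dimensional sketch (assumption: H = R^d, Euclidean inner product).
# DV = column span of D, m a score vector, g_th a canonical gradient in DV.
# With P the orthogonal projection onto DV,
#   c_bar = (m - P m) / ||m - P m||^2,
#   g_bar = g_th - <m, g_th> c_bar           (this is (4.1)),
# and (4.7) reads ||g_bar||^2 = ||g_th||^2 + <m, g_th>^2 / ||m - P m||^2.
rng = np.random.default_rng(2)
d, k = 8, 3
D = rng.normal(size=(d, k))
m = rng.normal(size=d)
g_th = D @ rng.normal(size=k)                  # canonical gradient, lies in DV
P = D @ np.linalg.solve(D.T @ D, D.T)          # projection onto col(D)
m_res = m - P @ m
c_bar = m_res / (m_res @ m_res)                # canonical gradient of theta
g_bar = g_th - (m @ g_th) * c_bar
orth = abs(m @ g_bar)                          # should vanish: g_bar _|_ m
lhs = g_bar @ g_bar
rhs = g_th @ g_th + (m @ g_th) ** 2 / (m_res @ m_res)
```

Since $\bar c \perp DV$ and $\bar g_\vartheta \in DV$, the two terms in (4.7) are the squared lengths of orthogonal components, which is what `lhs == rhs` checks.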
5  The i.i.d. Case
If we have i.i.d. observations $X_1, \ldots, X_n$, then a natural candidate for the "nonparametric" model of Section 2 is the usual nonparametric model, with completely unknown distribution $Q(dx)$ of $X_i$. (Larger nonparametric models are obtained by allowing the observations to be dependent.) Fix $Q$. Set $Qh = \int h(x)Q(dx)$ and $H = L_{2,0}(Q) = \{h \in L_2(Q) : Qh = 0\}$. For $h \in H$ set

$$Q_{nh}(dx) = Q(dx)(1 + n^{-1/2}h(x)).$$

The approximation is meant in the sense of Hellinger differentiability. The joint law of $X_1, \ldots, X_n$ is $P_n = Q^n$. We have local asymptotic normality:

$$\log\frac{dP_{nh}}{dP_n} = S_n(h) - \tfrac12\|h\|^2 + o_{P_n}(1), \qquad S_n(h) \Rightarrow (Qh^2)^{1/2}N,$$

with $N$ standard normal. This is (2.1) and (2.2) with

$$S_n(h) = n^{-1/2}\sum_{i=1}^n h(X_i), \qquad (h, h') = Q(hh').$$

Remark 3. (Stochastic differentiability.) In the stochastic differentiability condition (3.2), the parameter $F$ is fixed, and we may omit it. Let $\{Q_\vartheta : \vartheta \in \Theta\}$ be a parametric family of distributions of $X_i$. For $\tau$ near $\vartheta$ let $b_\tau \in L_{2,0}(Q_\tau)$. Stochastic differentiability (3.2) follows if $b_\tau$ is differentiable at $\tau = \vartheta$ in an appropriate sense. By Taylor expansion,

$$\sup_{|u| \le \Delta} \Big|S_n(b_{\vartheta_{nu}}) - S_n(b_\vartheta) - u\,\frac1n\sum_{i=1}^n \dot b_\vartheta(X_i)\Big| = o_{P_{n\vartheta}}(1). \qquad (5.1)$$

From $Q_\tau b_\tau = 0$ we obtain, by taking the derivative under the integral,

$$Q_\vartheta \dot b_\vartheta = -Q_\vartheta(m_\vartheta b_\vartheta), \qquad (5.2)$$

with $m_\vartheta = \frac{d}{d\tau}\big|_{\tau=\vartheta}\frac{dQ_\tau}{dQ_\vartheta}$ the score function for $\vartheta$. Relations (5.1) and (5.2) imply stochastic differentiability (3.2). A proof for fixed bounded sequences $u = u_n$ is in Schick (2000). For (5.2) it is essential that $Q_\tau b_\tau = 0$, in other words, that $b_\tau$ is in the nonparametric tangent space, i.e., that $b_\tau(X_i)$ is a statistic which approximates a local likelihood ratio.
For stochastic differentiability (3.2) to hold, the function $b_\vartheta$ need not be differentiable. For $\tau$ near $\vartheta$ consider the empirical process

$$\nu_{n\tau} = n^{-1/2}\sum_{i=1}^n (b_\tau(X_i) - Q_\vartheta b_\tau).$$

If the collection of functions $b_\tau$, $\tau$ near $\vartheta$, fulfills an appropriate bracketing condition, then $\nu_{n\tau}$ is stochastically equicontinuous at $\tau = \vartheta$: for each $\varepsilon, \eta > 0$ there is $\delta > 0$ such that

$$\limsup P_{n\vartheta}\Big(\sup_{|\tau - \vartheta| \le \delta} |\nu_{n\tau} - \nu_{n\vartheta}| > \eta\Big) < \varepsilon.$$

Such a result was first proved by Daniels (1961) and Huber (1967) to obtain asymptotic normality of the maximum likelihood estimator under weak conditions on the score function; for a general version see Pollard (1985). For $\tau = \vartheta_{nu}$ we obtain

$$\sup_{|u| \le \Delta} |S_n(b_{\vartheta_{nu}}) - S_n(b_\vartheta) - n^{1/2}Q_\vartheta(b_{\vartheta_{nu}} - b_\vartheta)| = o_{P_{n\vartheta}}(1). \qquad (5.3)$$

From $Q_\tau b_\tau = 0$ we obtain

$$Q_\vartheta b_\tau = Q_\vartheta b_\tau - Q_\tau b_\tau = -(\tau - \vartheta)Q_\vartheta(m_\vartheta b_\vartheta) + o(|\tau - \vartheta|). \qquad (5.4)$$

Relations (5.3) and (5.4) imply stochastic differentiability (3.2).

Example 1. (Nonlinear regression.) Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be pairs of real observations with

$$Y_i = r(\vartheta, X_i) + \varepsilon_i.$$

The $\varepsilon_i$ are i.i.d. with density $f(y)$ which has mean zero and finite variance but is unknown otherwise. The $X_i$ are independent and independent of the $\varepsilon_i$ and have unknown distribution function $G$. The pair $(X_i, Y_i)$ plays the role of $X_i$ in the general setting. The model is semiparametric. The distribution of $(X_i, Y_i)$ is

$$Q_{\vartheta f G}(dx, dy) = dG(x)f(y - r(\vartheta, x))dy.$$

Fix $\vartheta$, $f$ and $G$. We introduce perturbations

$$\vartheta_{nu} = \vartheta + n^{-1/2}u, \qquad f_{nz}(y) = f(y)(1 + n^{-1/2}z(y)), \qquad dG_{nw}(x) = dG(x)(1 + n^{-1/2}w(x)).$$
The density $f_{nz}$ must integrate to one and have mean zero, so $z$ runs through

$$Z = \Big\{z \in L_2(f) : \int z(y)f(y)dy = 0,\ \int yz(y)f(y)dy = 0\Big\}.$$

The function $G_{nw}$ must be a distribution function, so $w$ runs through

$$L_{2,0}(G) = \Big\{w \in L_2(G) : \int w(x)dG(x) = 0\Big\}.$$

The perturbed distribution of $(X_i, Y_i)$ is

$$Q_{nuzw}(dx, dy) = Q_{\vartheta_{nu} f_{nz} G_{nw}}(dx, dy) = dG_{nw}(x)f_{nz}(y - r(\vartheta_{nu}, x))dy$$
$$= Q_{\vartheta f G}(dx, dy)\big(1 + n^{-1/2}(u\dot r(\vartheta, x)\ell(y - r(\vartheta, x)) + z(y - r(\vartheta, x)) + w(x))\big),$$

where $\dot r(\vartheta, x)$ is the derivative of $r(\vartheta, x)$ with respect to $\vartheta$, and $\ell(y) = -f'(y)/f(y)$ is the score function for location of the error distribution. Hence the tangent space $\bar H$ of the nonlinear regression model consists of the functions

$$h(x, y) = u\dot r(\vartheta, x)\ell(y - r(\vartheta, x)) + z(y - r(\vartheta, x)) + w(x).$$

It is therefore of the form $\bar H = [m] + DV$ of Section 2, with $V = Z \times L_{2,0}(G)$,

$$m(x, y) = \dot r(\vartheta, x)\ell(y - r(\vartheta, x)), \qquad Dv(x, y) = z(y - r(\vartheta, x)) + w(x).$$

Note that by taking the derivative under the integral,

$$E(\varepsilon\ell(\varepsilon)) = \int y\ell(y)f(y)dy = -\int yf'(y)dy = 1. \qquad (5.5)$$

Note also that $w(x)$ is orthogonal to both $m(x, y)$ and $z(y - r(\vartheta, x))$, so that both $\vartheta$ and $f$ are adaptive with respect to $G$. We want to estimate the expectation

$$a(f) = Ek(\varepsilon) = \int k(y)f(y)dy$$

of an $f$-square-integrable function $k$ under the error distribution. The usual estimator is the empirical estimator based on the estimated errors,

$$\hat a_{\hat\vartheta} = \frac1n\sum_{i=1}^n k(\hat\varepsilon_i),$$

with $\hat\varepsilon_i = Y_i - r(\hat\vartheta, X_i)$. A natural estimator of $\vartheta$ is the least squares estimator $\hat\vartheta$, which solves

$$\sum_{i=1}^n \dot r(\vartheta, X_i)(Y_i - r(\vartheta, X_i)) = 0.$$
The least squares estimator is asymptotically linear with influence function

$$c(x, y) = (E\dot r(\vartheta, X)^2)^{-1}\dot r(\vartheta, x)(y - r(\vartheta, x)).$$

We have $(m, c) = 1$ by (5.5), and $c \perp DV$ since $\int yz(y)f(y)dy = 0$ for $z \in Z$. Hence $c$ is a gradient of $\vartheta$. The empirical estimator of $a(f) = Ek(\varepsilon)$ based on the true errors is

$$\hat a_\vartheta = \frac1n\sum_{i=1}^n k(\varepsilon_i).$$

Its influence function is

$$g_\vartheta(x, y) = k(y - r(\vartheta, x)) - Ek(\varepsilon).$$

For $v = (z, w) \in Z \times L_{2,0}(G)$ we have

$$n^{1/2}(a(f_{nz}) - a(f)) \to E(k(\varepsilon)z(\varepsilon)) = (g_\vartheta, Dv).$$

Hence $g_\vartheta$ is a gradient of $a(f)$ when $\vartheta$ is known. It fulfills

$$(m, g_\vartheta) = E\dot r(\vartheta, X)E(\ell(\varepsilon)k(\varepsilon)).$$

Hence by Remark 3, an appropriate bracketing condition on the collection of functions $b_\tau(x, y) = k(y - r(\tau, x)) - Ek(\varepsilon)$ implies stochastic differentiability (3.2) of the form

$$\sup_{|u| \le \Delta} \big|S_n(b_{\vartheta_{nu}}) - S_n(b_\vartheta) + uE\dot r(\vartheta, X)E(\ell(\varepsilon)k(\varepsilon))\big| = o_{P_{n\vartheta f G}}(1).$$

It follows from (3.3) that the plug-in estimator $\hat a_{\hat\vartheta}$ is asymptotically linear for $a(f)$ with influence function

$$g = g_\vartheta - (m, g_\vartheta)c = k(\varepsilon) - Ek(\varepsilon) - E\dot r(\vartheta, X)E(\ell(\varepsilon)k(\varepsilon))(E\dot r(\vartheta, X)^2)^{-1}\dot r(\vartheta, x)\varepsilon,$$

where $\varepsilon = y - r(\vartheta, x)$. Efficient estimators for $\vartheta$ are constructed in Schick (1993). The canonical gradient $\bar g$ and an efficient estimator for $Ek(\varepsilon)$ are in Müller and Wefelmeyer (2000a).
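The influence function of the plug-in estimator in Example 1 can be checked by simulation. The sketch below uses the linear special case $r(\vartheta, x) = \vartheta x$ with standard normal errors and $X \sim N(1, 1)$; all of these numerical choices are our assumptions, made only so that the asymptotic variance can be computed in closed form.

```python
import numpy as np

# Simulation sketch of Example 1 in the linear special case r(theta, x) = theta*x
# with eps ~ N(0,1) and X ~ N(1,1) (our assumptions), estimating
# a(f) = E eps^3 = 0 from least-squares residuals. With k(eps) = eps^3,
# l(eps) = eps, E r'(theta,X) = EX = 1 and E r'(theta,X)^2 = EX^2 = 2, the
# influence function is g = eps^3 - (3*EX/EX^2) x eps = eps^3 - 1.5 x eps,
# with variance 15 + 4.5 - 9 = 10.5.
rng = np.random.default_rng(3)
R, n, theta = 4000, 400, 2.0
x = rng.normal(1.0, 1.0, size=(R, n))
eps = rng.normal(size=(R, n))
y = theta * x + eps
theta_hat = (x * y).sum(axis=1) / (x ** 2).sum(axis=1)   # least squares
res = y - theta_hat[:, None] * x                         # estimated errors
stat = np.sqrt(n) * (res ** 3).mean(axis=1)              # n^{1/2}(a_hat - 0)
v_plug = stat.var()                                      # near ||g||^2 = 10.5
```

The empirical variance of the normalized plug-in statistic is close to $\|g\|^2 = 10.5$, smaller here than the known-$\vartheta$ value 15.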
6  Markov Chain Models
Let $X_0, \ldots, X_n$ be observations from a homogeneous and uniformly ergodic Markov chain with transition distribution $Q(x, dy)$ and invariant law $\pi(dx)$. Assume for simplicity that the chain is stationary. The natural "nonparametric" model of Section 2 is described by the collection of all such transition distributions. Fix $Q$. Set $Q_x h = \int Q(x, dy)h(x, y)$ and $H = \{h \in L_2(\pi \otimes Q) : Q_x h = 0\}$. For $h \in H$ set

$$Q_{nh}(x, dy) = Q(x, dy)(1 + n^{-1/2}h(x, y)).$$

The approximation is meant in the sense of Hellinger differentiability for Markov chains. The joint law of $X_0, \ldots, X_n$ is

$$P_n(dx_0, \ldots, dx_n) = \pi(dx_0)Q(x_0, dx_1)\cdots Q(x_{n-1}, dx_n).$$

We have local asymptotic normality:

$$\log\frac{dP_{nh}}{dP_n} = S_n(h) - \tfrac12\|h\|^2 + o_{P_n}(1), \qquad S_n(h) \Rightarrow \|h\| N,$$

with $N$ standard normal. This is (2.1) and (2.2) with

$$S_n(h) = n^{-1/2}\sum_{i=1}^n h(X_{i-1}, X_i), \qquad (h, h') = \pi \otimes Q(hh').$$

Remark 4. (Stochastic differentiability.) The arguments of Remark 3 translate to stochastic process models. Stochastic equicontinuity for Markov chains was obtained by Ogata (1980) in connection with asymptotic normality of maximum likelihood estimators. Results for general discrete-time stochastic processes are in Andrews (1994) and Andrews and Pollard (1994). Let $\{Q_\vartheta : \vartheta \in \Theta\}$ be a parametric family of transition distributions of $X_i$. For $\tau$ near $\vartheta$ let $b_\tau$ be $\pi_\tau \otimes Q_\tau$-square-integrable with $\int Q_\tau(x, dy)b_\tau(x, y) = 0$ for all $x$. The score function for $\vartheta$ is $m_\vartheta = \frac{d}{d\tau}\big|_{\tau=\vartheta}\frac{dQ_\tau}{dQ_\vartheta}$. If the functions $b_\tau$ fulfill an appropriate bracketing condition for $\tau$ near $\vartheta$, we have stochastic differentiability (3.2) of the form

$$\sup_{|u| \le \Delta} |S_n(b_{\vartheta_{nu}}) - S_n(b_\vartheta) + u(m_\vartheta, b_\vartheta)| = o_{P_{n\vartheta}}(1). \qquad (6.1)$$
Example 2. (Nonlinear autoregression.) The observations $X_0, \ldots, X_n$ are real with

$$X_i = r(\vartheta, X_{i-1}) + \varepsilon_i.$$

The $\varepsilon_i$ are i.i.d. with density $f(x)$ which has mean zero and finite variance but is unknown otherwise. Conditions for uniform ergodicity are in Bhattacharya and Lee (1995) and An and Huang (1996). The model is semiparametric, with transition distribution

$$Q_{\vartheta f}(x, dy) = f(y - r(\vartheta, x))dy.$$

Fix $\vartheta$ and $f$. We introduce perturbations

$$\vartheta_{nu} = \vartheta + n^{-1/2}u, \qquad f_{nv}(x) = f(x)(1 + n^{-1/2}v(x)).$$

As in the regression example, Example 1, the function $v$ runs through

$$V = \Big\{v \in L_2(f) : \int v(x)f(x)dx = 0,\ \int xv(x)f(x)dx = 0\Big\}.$$

The perturbed transition distribution is

$$Q_{nuv}(x, dy) = f_{nv}(y - r(\vartheta_{nu}, x))dy = Q_{\vartheta f}(x, dy)\big(1 + n^{-1/2}(u\dot r(\vartheta, x)\ell(y - r(\vartheta, x)) + v(y - r(\vartheta, x)))\big),$$

where $\dot r(\vartheta, x)$ is the derivative of $r(\vartheta, x)$ with respect to $\vartheta$, and $\ell(x) = -f'(x)/f(x)$ is the score function for location of the innovation distribution. Hence the tangent space $\bar H$ of the nonlinear autoregressive model consists of the functions

$$h(x, y) = u\dot r(\vartheta, x)\ell(y - r(\vartheta, x)) + v(y - r(\vartheta, x)).$$

It is therefore of the form $\bar H = [m] + DV$ of Section 2, with

$$m(x, y) = \dot r(\vartheta, x)\ell(y - r(\vartheta, x)), \qquad Dv(x, y) = v(y - r(\vartheta, x)).$$

We want to estimate the expectation

$$a(f) = Ek(\varepsilon) = \int k(x)f(x)dx$$

of an $f$-square-integrable function $k$ under the innovation distribution. The usual estimator is the empirical estimator based on the estimated innovations,

$$\hat a_{\hat\vartheta} = \frac1n\sum_{i=1}^n k(\hat\varepsilon_i),$$

with $\hat\varepsilon_i = X_i - r(\hat\vartheta, X_{i-1})$. A natural estimator of $\vartheta$ is the least squares estimator $\hat\vartheta$, which solves

$$\sum_{i=1}^n \dot r(\vartheta, X_{i-1})(X_i - r(\vartheta, X_{i-1})) = 0.$$

The least squares estimator is asymptotically linear with influence function

$$c(x, y) = (E\dot r(\vartheta, X)^2)^{-1}\dot r(\vartheta, x)(y - r(\vartheta, x)).$$

We have $(m, c) = 1$ by (5.5), and $c \perp DV$ since $\int xv(x)f(x)dx = 0$ for $v \in V$. Hence $c$ is a gradient of $\vartheta$. The empirical estimator of $a(f) = Ek(\varepsilon)$ based on the true innovations is

$$\hat a_\vartheta = \frac1n\sum_{i=1}^n k(\varepsilon_i).$$

Its influence function is

$$g_\vartheta(x, y) = k(y - r(\vartheta, x)) - Ek(\varepsilon).$$

We have

$$n^{1/2}(a(f_{nv}) - a(f)) \to E(k(\varepsilon)v(\varepsilon)) = (g_\vartheta, Dv) \quad\text{for } v \in V.$$

Hence $g_\vartheta$ is a gradient of $a(f)$ when $\vartheta$ is known. It fulfills

$$(m, g_\vartheta) = E\dot r(\vartheta, X)E(\ell(\varepsilon)k(\varepsilon)).$$

By Remark 4, an appropriate bracketing condition on the functions $b_\tau(x, y) = k(y - r(\tau, x)) - Ek(\varepsilon)$ implies stochastic differentiability (6.1). It follows from (3.3) that the plug-in estimator $\hat a_{\hat\vartheta}$ is asymptotically linear for $a(f)$ with influence function

$$g = g_\vartheta - (m, g_\vartheta)c = k(\varepsilon) - Ek(\varepsilon) - E\dot r(\vartheta, X)E(\ell(\varepsilon)k(\varepsilon))(E\dot r(\vartheta, X)^2)^{-1}\dot r(\vartheta, x)\varepsilon,$$

where $\varepsilon = y - r(\vartheta, x)$.
Efficient estimators for $\vartheta$ are constructed in Drost, Klaassen and Werker (1997) and Koul and Schick (1997). The canonical gradient $\bar g$ and an efficient estimator for $Ek(\varepsilon)$ are in Schick and Wefelmeyer (2000).

Example 3. (Heteroscedastic linear autoregression.) The observations $X_0, \ldots, X_n$ are real with

$$X_i = \vartheta X_{i-1} + s(X_{i-1})\varepsilon_i.$$

The $\varepsilon_i$ are independent and, for simplicity, standard normal. Conditions for uniform ergodicity and efficient estimators for $\vartheta$ are in Maercker (1997) and Schick (1999). The model is semiparametric, with transition distribution

$$Q_{\vartheta s}(x, dy) = \frac{1}{s(x)}\varphi\Big(\frac{y - \vartheta x}{s(x)}\Big)dy,$$

where $\varphi$ is the standard normal density. Fix $\vartheta$ and $s$. Introduce perturbations

$$\vartheta_{nu} = \vartheta + n^{-1/2}u, \qquad s_{nv}(x) = s(x)(1 + n^{-1/2}v(x)).$$

The function $v$ runs through $V = L_2(f)$, where $f$ is the stationary density. The perturbed transition distribution is

$$Q_{nuv}(x, dy) = Q_{\vartheta_{nu} s_{nv}}(x, dy) = Q_{\vartheta s}(x, dy)\big(1 + n^{-1/2}(um(x, y) + Dv(x, y))\big),$$

with

$$m(x, y) = \frac{x(y - \vartheta x)}{s(x)^2}, \qquad Dv(x, y) = v(x)\Big(\Big(\frac{y - \vartheta x}{s(x)}\Big)^2 - 1\Big).$$

Since the normal distribution is symmetric, $m$ and $DV$ are orthogonal, and $s$ can be estimated adaptively with respect to $\vartheta$. Suppose we want to estimate the functional

$$a(s) = \int_0^1 s(x)^2 dx.$$

For all $u \in \mathbb{R}$ and $v \in V$ we have

$$n^{1/2}(a(s_{nv}) - a(s)) \to 2\int_0^1 s(x)^2 v(x)dx = (Dv_a, Dv + um)$$

with $v_a = 1_{[0,1]}s^2/f$. Hence $a(s)$ is differentiable at $(\vartheta, s)$, with canonical gradient

$$Dv_a(x, y) = 1_{[0,1]}(x)\frac{s(x)^2}{f(x)}\Big(\Big(\frac{y - \vartheta x}{s(x)}\Big)^2 - 1\Big).$$
Assume first that $\vartheta$ is known. Then we can estimate $a(s)$ by

$$\hat a_\vartheta = \int_0^1 \hat h(x)dx, \quad\text{where}\quad \hat h(x) = \frac{\sum_{i=1}^n w_n(x - X_{i-1})(X_i - \vartheta X_{i-1})^2}{\sum_{i=1}^n w_n(x - X_{i-1})}.$$

Here $w_n(x) = c_n^{-1}w(c_n^{-1}x)$, where $w$ is a continuously differentiable symmetric density with compact support $[-1, 1]$, and $c_n$ is a bandwidth of order $n^{-1/3}$. We show that $\hat a_\vartheta$ is asymptotically linear with influence function $Dv_a$. We do so under the assumption that $s$ is twice continuously differentiable. Write

$$(X_i - \vartheta X_{i-1})^2 = s(X_{i-1})^2(\varepsilon_i^2 - 1) + s(X_{i-1})^2.$$

Expand $s(X_{i-1})^2$ around $s(x)^2$ to obtain

$$\hat a_\vartheta - a(s) = \int_0^1 \frac{A(x) + 2s(x)s'(x)c_n\hat f_1(x)}{\hat f_0(x)}dx + \ldots,$$

where

$$A(x) = \frac1n\sum_{i=1}^n w_n(x - X_{i-1})s(X_{i-1})^2(\varepsilon_i^2 - 1), \qquad \hat f_0(x) = \frac1n\sum_{i=1}^n w_n(x - X_{i-1}),$$
$$\hat f_1(x) = \frac{1}{nc_n}\sum_{i=1}^n w_n(x - X_{i-1})(X_{i-1} - x).$$

The assumptions imply that $f$ is twice continuously differentiable. Hence we obtain, uniformly for $x \in [0, 1]$,

$$EA(x)^2 = O(n^{-1}c_n^{-1}) = O(n^{-2/3}),$$
$$E(\hat f_0(x) - f(x))^2 = O(n^{-1}c_n^{-1} + c_n^4) = O(n^{-2/3}),$$
$$E(\hat f_1(x) - c_n f'(x))^2 = O(n^{-1}c_n^{-1} + c_n^4) = O(n^{-2/3}).$$

We can also show that $\sup_{0 \le x \le 1} |\hat f_0(x) - f(x)|$ converges to zero in probability. From this and the fact that $f$ is bounded away from zero on $[0, 1]$, we can conclude that

$$\hat a_\vartheta - a(s) = \int_0^1 \frac{A(x)}{f(x)}dx + O_{P_{n\vartheta s}}(c_n).$$

Now write
$$\int_0^1 \frac{A(x)}{f(x)}dx = \frac1n\sum_{i=1}^n I_n(X_{i-1})\frac{s(X_{i-1})^2}{f(X_{i-1})}(\varepsilon_i^2 - 1),$$

with

$$I_n(y) = \int_0^1 \frac{f(y)}{f(x)}w_n(y - x)dx.$$

It is easy to check that $I_n$ converges in $L_2(f)$ to the indicator of $[0, 1]$. Combining the above lets us conclude that $\hat a_\vartheta$ has influence function $Dv_a$. Suppose now that $\vartheta$ is unknown. Let $\hat\vartheta$ be an $n^{1/2}$-consistent estimator of $\vartheta$. We prove that the plug-in estimator $\hat a_{\hat\vartheta}$ is efficient. We have already shown above that $\hat a_\vartheta$ fulfills (3.4) with $b_\vartheta = Dv_a$. By the argument of Section 3, it remains to show (3.5). Since $(m, Dv_a) = 0$ by adaptivity, (3.5) reduces to asymptotic equivalence of $\hat a_{\hat\vartheta}$ and $\hat a_\vartheta$, i.e., $n^{1/2}(\hat a_{\hat\vartheta} - \hat a_\vartheta) = o_{P_{n\vartheta s}}(1)$. To prove this, we note first that

$$n^{1/2}(\hat a_{\hat\vartheta} - \hat a_\vartheta) = -2n^{1/2}(\hat\vartheta - \vartheta)\int_0^1 \frac{B(x)}{\hat f_0(x)}dx + o_{P_{n\vartheta s}}(1), \quad\text{where}\quad B(x) = \frac1n\sum_{i=1}^n w_n(x - X_{i-1})X_{i-1}s(X_{i-1})\varepsilon_i.$$

Since $\int_0^1 B(x)/\hat f_0(x)dx$ converges to zero in probability, we obtain the desired result.
7  Extensions
1. We have assumed $\vartheta$ and $a(F)$ to be one-dimensional. Extension to finite-dimensional $a(F)$ is straightforward; infinite-dimensional $a(F)$ require additional technicalities. In nonlinear regression, Example 1, we may, e.g., be interested in estimating the error distribution function $F$, defined by $F(t) = P(\varepsilon \le t)$. For linear regression we refer to Klaassen and Putter (1999). Extension to finite-dimensional $\vartheta$ is also straightforward. We note that it may happen that $a(F)$ is adaptive with respect to certain components of $\vartheta$ only. For efficiency of $\hat a_{\hat\vartheta}$, efficient estimators are required only for the non-adaptive components of $\vartheta$. Extensions of nonlinear regression, Example 1, are treated in Müller and Wefelmeyer (2000a). Extensions of nonlinear autoregression, Example 2, are treated in Schick and Wefelmeyer (2000).

2. We have restricted attention to functionals $a(F)$ of $F$ only. The results may be extended to functionals $a(\vartheta, F)$ which depend also on $\vartheta$. An interesting application is estimation of invariant distributions of time series, for example in linear autoregression $X_i = \vartheta X_{i-1} + \varepsilon_i$. Since $\sum_{j=0}^\infty \vartheta^j\varepsilon_j$ is distributed as the invariant law, we can write the expectation of a function $k$ under the invariant law as

$$Ek(X) = Ek\Big(\sum_{j=0}^\infty \vartheta^j\varepsilon_j\Big) = a(\vartheta, F),$$

where $F$ is the innovation distribution function. Hence $Ek(X)$ can be estimated by a von Mises statistic or a U-statistic based on estimated innovations; see Schick and Wefelmeyer (2001).

3. The results extend from semiparametric models $\{P_{n\vartheta F} : \vartheta \in \Theta, F \in \mathcal{F}\}$ to parametric families $\{\mathcal{P}_{n\vartheta} : \vartheta \in \Theta\}$ of nonparametric models. This is of interest when we start from a nonparametric model $\mathcal{P}_n$ and impose a restriction which depends on an unknown parameter, say $r_\vartheta(P_n) = 0$, leading to $\mathcal{P}_{n\vartheta} = \{P_n : r_\vartheta(P_n) = 0\}$. For example, let $X_0, \ldots, X_n$ be observations from a Markov chain with transition distribution fulfilling $\int Q(x, dy)y = r(\vartheta, x)$ for some $\vartheta$. This is the nonlinear autoregressive model $X_i = r(\vartheta, X_{i-1}) + \varepsilon_i$, where the $\varepsilon_i$ are martingale increments, not i.i.d. as in Example 2. For estimators of $\vartheta$ see Wefelmeyer (1994), (1996), (1997a), (1997b); for estimators of the stationary law see Schick and Wefelmeyer (1999). The model may be written as a semiparametric model by introducing transition distributions $F(x, dy)$ with $\int F(x, dy)y = 0$ and writing

$$Q(x, dy) = F(x, dy - r(\vartheta, x)).$$

This is, however, technically inconvenient because we perturb $\vartheta$ and would need differentiability of $F$. Another example are i.i.d. observations $(X_1, Y_1), \ldots, (X_n, Y_n)$ with joint law fulfilling the constraint $E(a(X, Y, \vartheta)|X) = 0$, where $a(X, Y, \vartheta)$ is a given function. For plug-in estimators in such models see Müller and Wefelmeyer (2000b). A special case is $a(X, Y, \vartheta) = Y - r(\vartheta, X)$, i.e., $Y_i = r(\vartheta, X_i) + \varepsilon_i$, which differs from Example 1 in that we do not assume $\varepsilon_i$ and $X_i$ to be independent.
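Extension 2 can be illustrated with a simple functional of both arguments. In the sketch below the specific functional (the stationary variance of a linear AR(1)) and all numerical values are our assumptions, chosen so that $a(\vartheta, F)$ has the closed form $\mathrm{Var}(\varepsilon)/(1 - \vartheta^2)$; the corresponding plug-in estimator combines the least squares $\hat\vartheta$ with the empirical variance of the estimated innovations.

```python
import numpy as np

# Hypothetical, simplified sketch of Extension 2: a functional a(theta, F).
# For the linear autoregression X_i = theta X_{i-1} + eps_i, the stationary
# variance is a(theta, F) = Var(eps) / (1 - theta^2). The plug-in estimator
# substitutes the least-squares theta_hat and the empirical variance of the
# estimated innovations.
rng = np.random.default_rng(6)
n, theta = 20_000, 0.6
eps = rng.normal(size=n)                      # Var(eps) = 1
x = np.zeros(n + 1)
for t in range(n):
    x[t + 1] = theta * x[t] + eps[t]
xp, xc = x[:-1], x[1:]
theta_hat = (xp * xc).sum() / (xp ** 2).sum() # least squares
res = xc - theta_hat * xp                     # estimated innovations
a_hat = res.var() / (1.0 - theta_hat ** 2)    # plug-in for a(theta, F)
a_true = 1.0 / (1.0 - theta ** 2)             # = 1.5625
```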
References

Andersen, P. K., Borgan, Ø., Gill, R. D. and Keiding, N. (1993). Statistical Models Based on Counting Processes. Springer Series in Statistics, Springer, Berlin.

Andrews, D. W. K. (1994). Asymptotics for semiparametric econometric models via stochastic equicontinuity. Econometrica 62, 295-314.

Andrews, D. W. K. and Pollard, D. (1994). An introduction to functional central limit theorems for dependent stochastic processes. Internat. Statist. Rev. 62, 119-132.

An, H. Z. and Huang, F. C. (1996). The geometrical ergodicity of nonlinear autoregressive models. Statist. Sinica 6, 943-956.
Bhattacharya, R. and Lee, C. (1995). On geometric ergodicity of nonlinear autoregressive models. Statist. Probab. Lett. 22, 311-315.

Bickel, P. J. (1982). On adaptive estimation. Ann. Statist. 10, 647-671.

Bickel, P. J., Klaassen, C. A. J., Ritov, Y. and Wellner, J. A. (1998). Efficient and Adaptive Estimation for Semiparametric Models. Springer, New York.

Daniels, H. E. (1961). The asymptotic efficiency of a maximum likelihood estimator. Proc. Fourth Berkeley Symp. Math. Statist. Probab. 1, 151-163.

Drost, F. C., Klaassen, C. A. J. and Werker, B. J. M. (1997). Adaptive estimation in time-series models. Ann. Statist. 25, 786-817.

Greenwood, P. E. and Wefelmeyer, W. (1991). Efficient estimating equations for nonparametric filtered models. In: Statistical Inference in Stochastic Processes (N. U. Prabhu, I. V. Basawa, eds.), 107-141, Marcel Dekker, New York.

Fabian, V. and Hannan, J. (1985). Introduction to Probability and Mathematical Statistics. Wiley, New York.

Hájek, J. (1970). A characterization of limiting distributions of regular estimates. Z. Wahrsch. Verw. Gebiete 14, 323-330.

Huber, P. J. (1967). The behavior of maximum likelihood estimates under nonstandard conditions. Proc. Fifth Berkeley Symp. Math. Statist. Probab. 1, 221-233.

Jeganathan, P. (1995). Some aspects of asymptotic theory with applications to time series models. Econometric Theory 11, 818-887.

Klaassen, C. A. J. and Putter, H. (1999). Efficient estimation of Banach parameters in semiparametric models. Technical Report, Department of Mathematics, University of Amsterdam.

Koul, H. L. and Schick, A. (1997). Efficient estimation in nonlinear autoregressive time series models. Bernoulli 3, 247-277.

Kreiss, J.-P. (1987). On adaptive estimation in stationary ARMA processes. Ann. Statist. 15, 112-133.

Le Cam, L. (1986). Asymptotic Methods in Statistical Decision Theory. Springer-Verlag, New York.

Le Cam, L. and Yang, G. L. (1990). Asymptotics in Statistics. Springer Series in Statistics, Springer, Berlin.
PLUG-IN ESTIMATORS
Maercker, G. (1997). Statistical Inference in Conditional Heteroskedastic Autoregressive Models. Shaker, Aachen.
Müller, U. U. and Wefelmeyer, W. (2000a). Estimating parameters of the residual distribution in nonlinear regression. In preparation.
Müller, U. U. and Wefelmeyer, W. (2000b). Regression type models and optimal estimators. In preparation.
Ogata, Y. (1980). Maximum likelihood estimates of incorrect Markov models for time series and the derivation of AIC. J. Appl. Probab. 17, 59-72.
Pollard, D. (1985). New ways to prove central limit theorems. Econometric Theory 1, 295-314.
Schick, A. (1993). On efficient estimation in regression models. Ann. Statist. 21, 1486-1521. Correction: 23 (1995), 1862-1863.
Schick, A. (1999). Efficient estimation in a semiparametric heteroscedastic autoregressive model. Technical Report, Department of Mathematical Sciences, Binghamton University. http://math.binghamton.edu/anton/preprint.html
Schick, A. (2000). On asymptotic differentiability of averages. Statist. Probab. Lett. 51, 15-23.
Schick, A. (2001). Sample splitting with Markov chains. Bernoulli 7, 33-61.
Schick, A. and Wefelmeyer, W. (1999). Efficient estimation of invariant distributions of some semiparametric Markov chain models. Math. Meth. Statist. 8, 426-440.
Schick, A. and Wefelmeyer, W. (2000). Estimating the innovation distribution in nonlinear autoregressive models. To appear in: Ann. Inst. Statist. Math.
Schick, A. and Wefelmeyer, W. (2001). Estimating invariant laws of linear processes by U-statistics. Technical Report, Department of Mathematics, University of Siegen. http://www.math.uni-siegen.de/statistik/wefelmeyer.html
Wefelmeyer, W. (1991). A generalization of asymptotically linear estimators. Statist. Probab. Lett. 11, 195-199.
Wefelmeyer, W. (1994). Improving maximum quasi-likelihood estimators. In: Asymptotic Statistics (P. Mandl, M. Husková, eds.), 467-474, Physica-Verlag, Heidelberg.
Wefelmeyer, W. (1996). Quasi-likelihood models and optimal inference. Ann. Statist. 24, 405-422.
Wefelmeyer, W. (1997a). Adaptive estimators for parameters of the autoregression function of a Markov chain. J. Statist. Plann. Inference 58, 389-398.
Wefelmeyer, W. (1997b). Quasi-likelihood regression models for Markov chains. In: Selected Proceedings of the Symposium on Estimating Functions (I. V. Basawa, V. P. Godambe and R. L. Taylor, eds.), 149-173, IMS Lecture Notes - Monograph Series, Institute of Mathematical Statistics, Hayward, California.
Wefelmeyer, W. (1999). Efficient estimation in Markov chain models: an introduction. In: Asymptotics, Nonparametrics, and Time Series (S. Ghosh, ed.), 427-459, Statistics: Textbooks and Monographs 158, Dekker, New York.
Institute of Mathematical Statistics LECTURE NOTES — MONOGRAPH SERIES
NUISANCE PARAMETER ELIMINATION AND OPTIMAL ESTIMATING FUNCTIONS T. M. Durairajan and Martin L. William Loyola College, Madras, India
Abstract In the context of obtaining optimal estimating functions for interesting parameters in the presence of nuisance parameters in parametric models, a method of elimination of nuisance parameters is proposed in this paper. The proposed method is direct and does not impose any 'factorization' conditions on the likelihood. In this direction, a sequence of lower bounds for the variance-covariance matrix of estimating functions is derived. A recipe which gives a transparent approach for obtaining optimal estimating functions is suggested. It is shown that minimum variance unbiased estimators could be obtained using the recipe. Keywords and Phrases: Lower bounds, nuisance parameter elimination, optimal estimating function.
1 Introduction
In the theory of estimating functions applied to parametric models involving nuisance parameters, the 'elimination' of nuisance parameters to obtain optimal estimating functions (EFs) for the interesting parameters is a very important task. In a pioneering work, Godambe (1976) suggested a method of eliminating nuisance parameters by multiplying the score function by, and adding to it, suitable functions, and formally established the optimality of the conditional score function. Lloyd (1987) and Bhapkar and Srinivasan (1993) claimed the optimality of the marginal score function. However, the errors in the results of Lloyd (1987) and Bhapkar and Srinivasan (1993) were pointed out by Bhapkar (1995, 1997), who imposed some further conditions and established the optimality of the marginal score function. The conditional and marginal factorization properties were used by the above authors in the elimination of nuisance parameters. Heyde (1997) proposed a method of
obtaining optimal EFs by eliminating nuisance parameters from a suitably chosen function that possesses the 'likelihood score property'. Heyde gives the optimal EF of 'first order theory' but not of the higher orders. The present work is an attempt in this direction. In this paper, a straightforward recursive method of eliminating nuisance parameters, without going into the factorization aspects of the likelihood, is proposed. In this direction, a theorem which gives a sequence of lower bounds for the variance-covariance matrix of the EFs is established in Section 2. This is achieved by considering higher order derivatives with respect to the nuisance parameters, drawing inspiration from Godambe (1984). Consequently, a recipe which gives a systematic approach for the possible elimination of nuisance parameters leading to an optimal EF is suggested. Section 3 presents several examples to illustrate the recipe. In Section 4, as another outcome of the main result of Section 2, a sequence of lower bounds for the variance-covariance matrix of unbiased estimators of the interesting parameters is given. This sequence is different from the sequence of Bhattacharyya bounds both in context and in content. Further, it is shown that minimum variance bound unbiased estimators of the interesting parameters could be obtained by the suggested recipe.
2 The Main Result and the Recipe
Let X be a random vector with sample space 𝒳 and probability density function p(x; ω) with respect to some σ-finite measure μ on (𝒳, B(𝒳)). The family of densities is indexed by ω = (θ, φ) ∈ Ω with θ ∈ Ω_1 ⊂ R^r, φ ∈ Ω_2 ⊂ R^m, Ω = Ω_1 × Ω_2. The interesting parameter is θ, the nuisance parameter is φ, and estimation of θ in the presence of φ is considered. We assume the usual regularity conditions on the density p and on the EFs g = (g_1, ..., g_r)' : 𝒳 × Ω_1 → R^r (refer Godambe (1976, 1984), Bhapkar (1995, 1997)). Let D_g = ((E(∂g_i/∂θ_j))). Let the class of EFs satisfying the regularity conditions be denoted by G_0, and let M_g(ω) = D_g^{-1} E(gg') (D_g')^{-1}, the variance-covariance matrix of the standardized EF. In the sequel, the following notation is used:

l_θ = (∂ log p/∂θ_1, ..., ∂ log p/∂θ_r)',   (2.1)

I_11 = E(l_θ l_θ'),   (2.2)
with l_φ^(k) built from the k-th order derivatives of the density with respect to the nuisance parameter φ (so that l_φ^(1) = l_φ is the usual nuisance score), and

I_12^(k) = E(l_θ l_φ^(k)'),   I_22^(k) = E(l_φ^(k) l_φ^(k)'),   (2.3)

L_θ^(1) = l_θ − I_12^(1) (I_22^(1))^{-1} l_φ^(1),   (2.4)

L_θ^(k+1) = L_θ^(k) − I_12^(k+1) (I_22^(k+1))^{-1} l_φ^(k+1),  k = 1, 2, ...   (2.5)

I_11 and the I_22^(k) are assumed non-singular and, for simplicity, we write l_φ^(1) = l_φ, I_12^(1) = I_12, I_22^(1) = I_22, I_21^(1) = I_21. Let

B_k = [ I_11 − Σ_{s=1}^k I_12^(s) (I_22^(s))^{-1} I_21^(s) ]^{-1},  k = 1, 2, ...   (2.6)

Since the I_22^(k) are positive definite we have B_{k+1} ≥ B_k. Also,

E[ L_θ^(k) L_θ^(k)' ] = B_k^{-1}  for all k.   (2.7)
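As a numerical illustration of (2.6) and (2.7), consider a toy model of our own choosing: x_i ~ N(θ, 1) and y_i ~ N(θ + φ_i, 1), i = 1, ..., n, with θ interesting and φ = (φ_1, ..., φ_n) nuisance. The score blocks work out to I_11 = 2n, I_12 = 1_n', I_22 = I_n, so B_1 = 1/n, while L_θ^(1) = Σ(x_i − θ) has variance n = B_1^{-1}. A minimal sketch:

```python
import numpy as np

# Toy model (our own choice): x_i ~ N(theta, 1), y_i ~ N(theta + phi_i, 1).
# Scores: l_theta = sum(x_i - theta) + sum(y_i - theta - phi_i),
#         l_phi_i = y_i - theta - phi_i.
n = 7
I11 = np.array([[2.0 * n]])          # E(l_theta^2)
I12 = np.ones((1, n))                # E(l_theta * l_phi')
I22 = np.eye(n)                      # E(l_phi l_phi')

# First-order bound (2.6) with k = 1.
B1 = np.linalg.inv(I11 - I12 @ np.linalg.inv(I22) @ I12.T)

# L_theta^(1) = l_theta - I12 I22^{-1} l_phi = sum(x_i - theta),
# whose variance is n, matching (2.7): E[L L'] = B1^{-1}.
assert np.isclose(B1[0, 0], 1.0 / n)
assert np.isclose(1.0 / B1[0, 0], float(n))
```

Since L_θ^(1) here is free of φ, Remark 1 below implies the recursion stops at k = 1.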
Theorem 2.1: For every g ∈ G_0, M_g ≥ B_k, k = 1, 2, ..., with equality if and only if g = A(θ, φ) L_θ^(k), where A(θ, φ) is a non-singular matrix and the functions L_θ^(k) are defined recursively in (2.4) and (2.5).

Proof: For g ∈ G_0, we observe that E[g l_φ^(k)'] = 0 and E[g L_θ^(k)'] = −D_g for all k. Now, considering the n.n.d. matrix

E[ (L_θ^(k)', g')' (L_θ^(k)', g') ] = [ B_k^{-1}  −D_g' ; −D_g  E(gg') ]   (2.8)

and applying matrix theory arguments, we get

Rank [ B_k^{-1}  −D_g' ; −D_g  E(gg') ] = Rank [ B_k^{-1}  0 ; 0  E(gg') − D_g B_k D_g' ].

Also, E(gg') − D_g B_k D_g' ≥ 0 by the non-negative definiteness of the full matrix in (2.8). This gives M_g ≥ B_k. Further, M_g = B_k if and only if the rank of the matrix in (2.8) is r.

Now, if g = A(θ, φ) L_θ^(k) for some non-singular matrix A(θ, φ), then clearly the matrix in (2.8) has rank r, so that M_g = B_k. Conversely, if M_g = B_k, i.e. the rank of the matrix in (2.8) is r, then as B_k^{-1}, D_g and E(gg') are all non-singular, it is necessary that g = A(θ, φ) L_θ^(k) for some non-singular matrix A(θ, φ). Hence the theorem.

Remark 1: If for some k ≥ 1 the bound B_k is attained by the EF A(θ, φ) L_θ^(k) for a suitable choice of A(θ, φ) (i.e. A(θ, φ) L_θ^(k) is free of φ), then we have I_12^(k+1) = 0, which gives L_θ^(k+1) = L_θ^(k) and B_{k+1} = B_k. Similarly, for all s ≥ k + 1, I_12^(s) = 0, L_θ^(s) = L_θ^(k) and B_s = B_k.
In view of this, we now propose the following.
Definition 2.1: If there exists a g* ∈ G_0 such that M_{g*} attains a lower bound B_k for some k, then g* is said to be a minimum variance bound estimating function (MVBEF).

Based on the theorem established and the above remark, we suggest the following recipe for eliminating nuisance parameters and obtaining the MVBEF: "Starting with the score function l_θ, consider recursively the functions L_θ^(1) = l_θ − I_12 I_22^{-1} l_φ, L_θ^(2) = L_θ^(1) − I_12^(2) (I_22^(2))^{-1} l_φ^(2), and so forth. If the nuisance parameters are essentially eliminated or appear as a multiplicative factor of an EF in some recursion, stop the process: the EF thus obtained is optimal and further recursions would result in the same EF."

Remark 2: The first bound in the sequence derived above, namely B_1 = (I_11 − I_12 I_22^{-1} I_21)^{-1}, was derived as the lower bound by Chandrasekar and Kale (1984), who gave a different line of proof. In a recent book, Heyde (1997) has proposed a 'first order' theory for obtaining optimal EFs. He suggests the possibility of a higher order theory but does not give any explicit method for it. The recipe suggested above gives a formal and transparent method to proceed to second and higher order theories and carries out Heyde's suggestion. However, the forms of the optimal EFs obtained in Theorem 2.1 do not follow as a consequence of the technique of Heyde (1997).
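Computationally, each step of the recipe is a projection: regress the interesting score on the (higher order) nuisance scores and keep the residual. A minimal Monte Carlo sketch of the first step, for a model of our own choosing (x_i ~ N(θ + φ_i, 1), y_i ~ N(φ_i, 1), for which every coordinate of I_12 I_22^{-1} equals 1/2 and L_θ^(1) = n(x̄ − ȳ − θ)/2):

```python
import numpy as np

rng = np.random.default_rng(0)
n, theta = 3, 0.0
phi = np.array([0.5, -0.2, 0.1])     # nuisance values (arbitrary)
R = 200_000                          # Monte Carlo replications

x = rng.normal(theta + phi, 1.0, size=(R, n))
y = rng.normal(phi, 1.0, size=(R, n))

l_theta = (x - theta - phi).sum(axis=1)       # score for theta
l_phi = (x - theta - phi) + (y - phi)         # scores for phi_1..phi_n

# Estimate I12 I22^{-1} by Monte Carlo (population value: 1/2 per coordinate).
I12_hat = (l_theta[:, None] * l_phi).mean(axis=0)
I22_hat = l_phi.T @ l_phi / R
b = np.linalg.solve(I22_hat, I12_hat)
assert np.allclose(b, 0.5, atol=0.02)

# The projected score matches the closed form n(xbar - ybar - theta)/2.
L1 = l_theta - l_phi @ b
closed = n * (x.mean(axis=1) - y.mean(axis=1) - theta) / 2
assert np.max(np.abs(L1 - closed)) < 0.2
```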
3 Applications
In this section, a number of examples are discussed to illustrate the recipe suggested in the previous section. Throughout this section, we reserve the symbol θ for the interesting (real or vector) parameter.

Example 3.1: Let x = (x_1, ..., x_n) and y = (y_1, ..., y_n) be independent, where the x_i are i.i.d. with density φ exp(−φx), x > 0, and the y_i are i.i.d. with density φθ^{-1} exp(−φθ^{-1}y), y > 0, θ, φ > 0. Here, (2θ²/(nφ)) L_θ^(1) = ȳ − θx̄ is the MVBEF attaining the bound B_1.

Example 3.2: Let z_1, ..., z_n be i.i.d. with z_i = (x_i, y_{1i}, ..., y_{ri}), where the x_i are i.i.d. exponential with mean φ and, for each fixed j = 1, ..., r, the y_{ji} are i.i.d. exponential with mean φθ_j. Denote x = Σ_{i=1}^n x_i and y_j = Σ_{i=1}^n y_{ji}, j = 1, ..., r. Here, φ L_θ^(1) = (g_1*, ..., g_r*)' is the MVBEF, with

g_j* = y_j/θ_j² − (x + Σ_{l=1}^r y_l/θ_l) / ((r + 1) θ_j),  j = 1, ..., r.
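For Example 3.1 the claimed standardized variance can be checked by simulation; the closed form ȳ − θx̄ used below for the standardized MVBEF is our own computation, and the target value M = 2θ²/n matches the figure quoted in Remark 3:

```python
import numpy as np

rng = np.random.default_rng(1)
n, theta, phi = 50, 2.0, 1.0
R = 100_000

x = rng.exponential(1.0 / phi, size=(R, n))      # density phi*exp(-phi*x)
y = rng.exponential(theta / phi, size=(R, n))    # density (phi/theta)*exp(-(phi/theta)*y)

g = y.mean(axis=1) - theta * x.mean(axis=1)      # candidate EF, free of phi

# Unbiasedness and standardized variance M_g = E(g^2)/E(dg/dtheta)^2.
D = -x.mean()                                    # E(dg/dtheta) = -E(xbar) = -1/phi
M = (g ** 2).mean() / D ** 2
assert abs(g.mean()) < 0.01
assert abs(M - 2 * theta ** 2 / n) < 0.01        # M close to 2*theta^2/n = 0.16
```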
Remark 3: Examples 3.1 and 3.2 have been discussed respectively by Lloyd (1987) and Bhapkar and Srinivasan (1993) in the context of marginal factorization of the likelihood. These authors claimed that a marginal score function is the optimal EF. However, from the above discussion, we find that the optimal EFs obtained above do not coincide with the EFs claimed by these authors as optimal. In Example 3.1, we have M_{g^(1)*} = 2θ²/n, whereas for the EF of Lloyd (1987), namely g_0 = n/θ − 2n/(θ + Σy_i/Σx_i), we have M_{g_0} = 2(n + 1)θ²/n² > M_{g^(1)*}. This shows that Lloyd's claim that g_0 is the optimal EF is incorrect. The errors in Lloyd (1987) and Bhapkar and Srinivasan (1993) have been pointed out also by Bhapkar (1995, 1997), who has found the correct optimal EF for the model in Example 3.1 but not for Example 3.2. In contrast, the recipe of Section 2 and the explicit form L_θ^(1) for the optimal EF have enabled us to achieve this for Example 3.2 as well in an elegant manner.

Example 3.3: Let x_1, y_1, ..., x_n, y_n be independent normal with E(x_i) = θ, E(y_i) = θ + φ_i, V(x_i) = V(y_i) = 1. Here, L_θ^(1) = Σ_{i=1}^n (x_i − θ) is the MVBEF.

Example 3.4: Let x_1, y_1, ..., x_n, y_n be independent normal with E(x_i) = θ + φ_i, E(y_i) = φ_i, V(x_i) = V(y_i) = 1. Here, the MVBEF is L_θ^(1) = n(x̄ − ȳ − θ)/2.

Remark 4: Examples 3.3 and 3.4 were discussed by Godambe (1976) in the context of nuisance parameter elimination when the conditional factorization property for the likelihood holds. He has shown that the same EFs are optimal. In the above discussion, we have demonstrated the straightforward applicability of our recipe without investigating the factorization aspects required in Godambe's approach.

Example 3.5: Let x = (x_1, ..., x_n) and y = (y_1, ..., y_m) be independent, where the x_i are i.i.d. Poisson(θφ) and the y_j are i.i.d. Poisson(φ), θ, φ > 0. Here, L_θ^(1) = mn(x̄ − θȳ)/(θ(nθ + m)) is the MVBEF.
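Since L_θ^(1) = 0 in Example 3.5 is equivalent to x̄ − θȳ = 0, the resulting estimate is θ̂ = x̄/ȳ; a quick simulation check (parameter values are our own choices):

```python
import numpy as np

rng = np.random.default_rng(2)
theta, phi = 2.0, 3.0
n, m = 30_000, 20_000

x = rng.poisson(theta * phi, size=n)    # x_i ~ Poisson(theta*phi)
y = rng.poisson(phi, size=m)            # y_j ~ Poisson(phi)

# L_theta^(1) = 0  <=>  xbar - theta*ybar = 0, i.e. theta_hat = xbar/ybar.
theta_hat = x.mean() / y.mean()
assert abs(theta_hat - theta) < 0.05
```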
This example with m = n = 1 has been discussed by Reid (1995) in illustrating the roles of conditioning in inference in the presence of nuisance parameters, wherein the estimation is based on conditioning upon a statistic called a 'cut' (Barndorff-Nielsen 1978) and involves a suitable reparametrization. In contrast, our approach is straightforward and does not require reparametrization.

Example 3.6: Consider the linear model of a randomized block design y_ij = μ + t_i + b_j + e_ij, i = 1, ..., k, j = 1, ..., r, where the e_ij are i.i.d. N(0, σ²) and Σ_i t_i = Σ_j b_j = 0. Suppose estimation of the effect of the first treatment, t_1, alone is of interest; that is, θ = t_1 and φ = (μ, t_2, ..., t_{k−1}, b_1, ..., b_{r−1}). Here, (σ²(k − 1)/(kr)) L_θ^(1) = ȳ_1. − ȳ.. − t_1 is the MVBEF, where ȳ_1. = r^{-1} Σ_{j=1}^r y_{1j} and ȳ.. is the grand mean.
Example 3.7: Consider a 2 × 2 contingency table ((n_ij)), i, j = 1, 2, following a multinomial distribution with fixed sample size n = ΣΣ n_ij and with probabilities ((π_ij)), i, j = 1, 2, ΣΣ π_ij = 1. Let θ = π_11 be the parameter of interest with φ = (π_12, π_21). Here L_θ^(1) = (n_11 − nθ)/(θ(1 − θ)) is the MVBEF. This example was discussed by Bhapkar (1989) in investigating conditioning and loss of information in the presence of nuisance parameters.

Example 3.8: Let {X(t), t ≥ 0} and {Y(t), t ≥ 0} be two independent Poisson processes with parameters (θ + φ) and φ respectively, θ, φ > 0. Suppose data on the states of {X(t)} at times t_1 < ... < t_m and of {Y(t)} at times S_1 < ... < S_n are available. It is found that, for a suitable non-singular A(θ, φ),

A(θ, φ) L_θ^(1) = [S_n X(t_m) − t_m Y(S_n)]/(S_n t_m) − θ,

which is the MVBEF attaining the bound B_1.

Example 3.9: Let x_ij, j = 1, ..., n, be i.i.d. N(μ_i, σ_i²) for i = 1, ..., k, with all observations independent, θ = (σ_1², ..., σ_k²) the interesting parameter and φ = (μ_1, ..., μ_k) the nuisance parameter. Here L_θ^(1) is not free of φ, so we proceed to the second recursion. With S_i² = (n − 1)^{-1} Σ_{j=1}^n (x_ij − x̄_i)² and choosing A(θ, φ) suitably, we get

g^(2)* = A(θ, φ) L_θ^(2) = (S_1² − σ_1², ..., S_k² − σ_k²)'

as the MVBEF attaining the bound B_2. This MVBEF was shown to be an optimal EF not attaining the bound B_1 by Chandrasekar and Kale (1984) and, in their framework, served as an example of an optimal EF which is not an MVBEF. From our discussion above, it is found that g^(2)* is an MVBEF attaining, however, the higher bound B_2.

Example 3.10: Let x_1, y_1, ..., x_n, y_n be independent normal with E(x_i) = E(y_i) = φ_i and V(x_i) = V(y_i) = θ, i = 1, ..., n. Here, L_θ^(1) is not an EF for θ, and so we proceed to the second recursion to get

L_θ^(2) = −n/(2θ) + (1/(4θ²)) Σ_{i=1}^n (x_i − y_i)²

as the MVBEF attaining the bound B_2.
The above example has been discussed by Godambe (1976) in the context of conditional factorization, wherein he contrasts the consistency of the solution of the above optimum estimating equation with the inconsistency of the usual maximum likelihood estimate discussed by Neyman and Scott (1948). Again, investigation into factorization aspects is not required in our approach.

Example 3.11: Let x_1, y_1, ..., x_n, y_n be independent normal with E(x_i) = θ, E(y_i) = φ, V(x_i) = V(y_i) = θ, φ ∈ R, θ > 0. Here, L_θ^(1) is not an EF. So, we proceed to the second recursion to get

L_θ^(2) = −(1/(2θ²)) [ nθ² + (2n − 1)θ − n(s_x² + s_y² + x̄²) ]

as the MVBEF attaining the bound B_2. Here, s_x² = (1/n) Σ(x_i − x̄)² and s_y² = (1/n) Σ(y_j − ȳ)². The solution of the equation L_θ^(2) = 0 is a consistent estimate, namely

θ̂ = { [ (2n − 1)² + 4n²(s_x² + s_y² + x̄²) ]^{1/2} − (2n − 1) } / (2n).
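The closed-form root θ̂ in Example 3.11 can be sanity-checked by simulation (parameter values are our own choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n, theta, phi = 20_000, 2.0, 5.0

x = rng.normal(theta, np.sqrt(theta), size=n)   # E(x)=theta, V(x)=theta
y = rng.normal(phi, np.sqrt(theta), size=n)     # E(y)=phi,   V(y)=theta

sx2 = ((x - x.mean()) ** 2).mean()
sy2 = ((y - y.mean()) ** 2).mean()
xb2 = x.mean() ** 2

# Positive root of n*t^2 + (2n-1)*t - n*(sx2 + sy2 + xb2) = 0.
theta_hat = (np.sqrt((2 * n - 1) ** 2 + 4 * n ** 2 * (sx2 + sy2 + xb2))
             - (2 * n - 1)) / (2 * n)
assert abs(theta_hat - theta) < 0.05            # consistency check
```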
4 Minimum Variance Unbiased Estimators
The main result of Section 2 leads us to minimum variance bounds for unbiased estimators of the interesting parameters. From Theorem 2.1, it is immediately evident that, for unbiased estimators T = (T_1, ..., T_r)' of θ,

Var-Cov(T) ≥ B_k,  k = 1, 2, ...

where the B_k are given in (2.6). This is verified by considering EFs of the form T − θ. If any of the L_θ^(k) defined recursively in (2.4) and (2.5) is such that A(θ, φ) L_θ^(k) is of the form T* − θ for a suitable choice of A(θ, φ), then T* is the minimum variance unbiased estimator (MVUE) of θ. Thus, the recipe of Section 2 could possibly be of help in finding T*. The following examples illuminate this point.

Example 4.1: Consider the model in Example 3.3. Here L_θ^(1)/n = x̄ − θ, so that x̄ is the MVUE of θ attaining the bound B_1.

Example 4.2: Consider the model in Example 3.4. Here 2L_θ^(1)/n = x̄ − ȳ − θ, so that x̄ − ȳ is the MVUE of θ attaining the bound B_1.
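A quick Monte Carlo check that the MVUE of Example 4.2 is unbiased with variance attaining B_1 (for that model B_1 = 2/n, our own computation from (2.6)):

```python
import numpy as np

rng = np.random.default_rng(6)
n, theta = 10, 1.5
phi = rng.normal(size=n)                 # arbitrary nuisance means
R = 200_000

x = rng.normal(theta + phi, 1.0, size=(R, n))
y = rng.normal(phi, 1.0, size=(R, n))
T = x.mean(axis=1) - y.mean(axis=1)      # xbar - ybar, MVUE of theta

# Information blocks: I11 = n, I12 = 1_n', I22 = 2*I_n, so B1 = 2/n.
B1 = 2.0 / n
assert abs(T.mean() - theta) < 0.01      # unbiased
assert abs(T.var() - B1) < 0.01          # variance attains B1 = 2/n
```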
Example 4.3: Consider the linear model in Example 3.6. Here, (σ²(k − 1)/(kr)) L_θ^(1) = ȳ_1. − ȳ.. − t_1, so that ȳ_1. − ȳ.. is the MVUE of t_1.
Example 4.4: Consider Example 3.7. Here, θ(1 − θ) L_θ^(1)/n = n_11/n − θ, so that n_11/n is the MVUE of θ.

Example 4.5: Consider Example 3.8. Here, ((θ + φ) S_n t_m)^{-1} L_θ^(1) = 0 gives [S_n X(t_m) − t_m Y(S_n)]/(S_n t_m) as the MVUE of θ.

Example 4.6: Consider Example 3.9. Here, (S_1², ..., S_k²) is the MVUE of (σ_1², ..., σ_k²) attaining the bound B_2.

Example 4.7: Consider the Neyman-Scott model in Example 3.10. Here 2θ² L_θ^(2)/n = Σ(x_i − y_i)²/(2n) − θ, so that Σ(x_i − y_i)²/(2n) is the MVUE of θ attaining the bound B_2.

It is noted that the MVBEF need not produce an MVUE when it cannot be reduced to the form T* − θ. This is evident in Examples 3.1, 3.2, 3.5 and 3.11. This fact is also reported by Thavaneswaran and Abraham (1988), who considered EFs for non-linear time-series models.

Acknowledgement: We thank the referee for bringing to our notice some of the important references in this area and for the valuable comments which improved the quality and content of the paper substantially.
References
Barndorff-Nielsen, O. E. (1978). Information and Exponential Families in Statistical Theory. Wiley, New York.
Bhapkar, V. P. (1989). Conditioning on ancillary statistics and loss of information in the presence of nuisance parameters. J. Statist. Plann. Inference 21, 139-160.
Bhapkar, V. P. (1990). Conditioning, marginalization and Fisher information functions. Proc. R. C. Bose Symposium, Delhi (R. R. Bahadur, ed.), 123-136, Wiley Eastern Limited, New Delhi.
Bhapkar, V. P. (1991). Loss of information in the presence of nuisance parameters and partial sufficiency. J. Statist. Plann. Inference 28, 185-203.
Bhapkar, V. P. (1995). Completeness and optimality of marginal likelihood estimating equations. Comm. Statist. A - Theory Methods 24, 945-952.
Bhapkar, V. P. (1997). Estimating functions, partial sufficiency and insufficiency in the presence of nuisance parameters. Proc. Symposium on Estimating Functions, Georgia (I. V. Basawa, V. P. Godambe and R. L. Taylor, eds.), IMS Lecture Notes - Monograph Series 32, 83-104.
Bhapkar, V. P. and Srinivasan, C. (1993). Estimating functions: Fisher information and optimality. Probability and Statistics: Proc. First International Triennial Calcutta Symposium on Probability and Statistics (S. K. Basu and B. K. Sinha, eds.), 165-172, Narosa Publishing House, New Delhi.
Bhapkar, V. P. and Srinivasan, C. (1994). On Fisher information inequalities in the presence of nuisance parameters. Ann. Inst. Statist. Math. 46, 593-604.
Chandrasekar, B. and Kale, B. K. (1984). Unbiased statistical estimation functions for parameters in presence of nuisance parameters. J. Statist. Plann. Inference 9, 45-54.
Godambe, V. P. (1976). Conditional likelihood and unconditional optimum estimating equations. Biometrika 63, 277-284.
Godambe, V. P. (1984). On ancillarity and Fisher information in the presence of a nuisance parameter. Biometrika 71, 626-629.
Heyde, C. C. (1997). Quasi-Likelihood and Its Application: A General Approach to Optimal Parameter Estimation. Springer, New York.
Lloyd, C. J. (1987). Optimality of marginal likelihood estimating equations. Comm. Statist. A - Theory Methods 16, 1733-1741.
Neyman, J. and Scott, E. L. (1948). Consistent estimates based on partially consistent observations. Econometrica 16, 1-32.
Reid, N. (1995). The roles of conditioning in inference. Statistical Science 10, 138-157.
Thavaneswaran, A. and Abraham, B. (1988). Estimation for nonlinear time series using estimating equations. J. Time Ser. Anal. 9, 99
Institute of Mathematical Statistics
LECTURE NOTES — MONOGRAPH SERIES
OPTIMAL ESTIMATING EQUATIONS FOR MIXED EFFECTS MODELS WITH DEPENDENT OBSERVATIONS Jeong-gun Park and I.V. Basawa University of Georgia Abstract Optimal joint estimating equations for fixed and random parameters are derived via an extension of the Godambe criterion. Applications to autoregressive processes and generalized linear mixed models for Markov processes are discussed. Marginal optimal estimating functions for fixed parameters are also discussed. Key Words: Optimal Estimating Functions; Generalized Linear Mixed Models; Autoregressive Processes; Markov Processes.
1 Introduction
Mixed effects models containing both fixed and random parameters are used extensively in both the applied and methodological literature. A review of linear mixed models and their applications is given by Robinson (1991). Generalized linear mixed models are discussed by Breslow and Clayton (1993), among others. Work on the extension of mixed effects models to dependent observations appears relatively scarce. Our main goal in this paper is to develop optimal estimating equations for mixed effects models with dependent observations. See Basawa et al. (1997), Heyde (1997), and Godambe (1991) for recent literature on optimal estimating functions. Desmond (1997) gives an overview of estimating functions.

Suppose Y_t is a vector of observations on n individuals at time t, and Y(t−1) = (Y_1, ..., Y_{t−1}). Conditional on Y(t−1) and a random parameter γ, the density of Y_t is denoted by p(y_t | y(t−1), β, γ), where β is a fixed parameter. Let π(γ|α) denote the (prior) density of γ, which may depend on a parameter α. Suppose, for simplicity, α is known, and we wish to estimate β and γ from a sample Y(T) = (Y_1, ..., Y_T). The likelihood function, conditional on γ, is given by

L(β, γ) = p(y_0) Π_{t=1}^T p(y_t | y(t−1), β, γ).   (1.1)
The joint density of γ and Y(T) is

p(γ, y(T) | α, β) = L(β, γ) π(γ|α).   (1.2)

An intuitive way, often used in practice, to estimate the "mixed effects" β and γ is to maximize p(γ, y(T) | α, β) with respect to β and γ and obtain formally the estimating equations:

∂ log L(β, γ)/∂β = 0,   (1.3)

∂ log L(β, γ)/∂γ + ∂ log π(γ|α)/∂γ = 0.   (1.4)
Example 1. Linear mixed models. Take T = 1, and Y_0 = y_0 fixed (given). Suppose, conditional on γ, Y_1 is normal with mean vector X'β + Z'γ and covariance matrix Σ, where X and Z are known covariate matrices. Further, assume that γ is a normal vector with mean zero and covariance matrix Γ. It is assumed, for simplicity, that Σ and Γ are known. Equations (1.3) and (1.4) then lead to the well known mixed linear model equations; see, for instance, Robinson (1991) for a review.

Example 2. Generalized linear mixed models. Again, set T = 1, and suppose

E[Y_1 | β, γ] = μ(β, γ),   (1.5)

and

Var[Y_1 | β, γ] = V(β, γ) = U(μ(β, γ)).   (1.6)

Without further assumptions regarding the conditional density of Y_1 given γ, one may be interested in estimating β and γ. If we choose

h(μ(β, γ)) = X'β + Z'γ,   (1.7)

for an appropriate link function h(·), and retain the normality assumption regarding γ (i.e. γ ~ N(0, Γ)), a penalized quasi-likelihood approach (see Breslow and Clayton (1993)) yields the estimating equations:

(∂μ/∂β')' V(β, γ)^{-1} (Y_1 − μ) = 0,   (1.8)

(∂μ/∂γ')' V(β, γ)^{-1} (Y_1 − μ) − Γ^{-1} γ = 0,   (1.9)

where μ = μ(β, γ) = h^{-1}(X'β + Z'γ) = g(X'β + Z'γ), say. Equations (1.8) and (1.9) correspond to (1.3) and (1.4) if the conditional density of
Y_1 given γ is a member of an exponential family. See also Sutradhar and Godambe (1997), and Waclawiw and Liang (1993).

Analogous to (1.8) and (1.9), we now propose a general method based on the penalized quasi-likelihood equations for dependent data. Let

E[Y_t | Y(t−1), β, γ] = μ_t(β, γ),   (1.10)

and

Var[Y_t | Y(t−1), β, γ] = V_t(β, γ).   (1.11)

Suppose further that the prior density of γ is known to be π(γ|α). Consider the estimating equations:

Σ_{t=1}^T (∂μ_t/∂β')' V_t^{-1} (Y_t − μ_t) = 0,   (1.12)

Σ_{t=1}^T (∂μ_t/∂γ')' V_t^{-1} (Y_t − μ_t) + ∂ log π(γ|α)/∂γ = 0.   (1.13)

If the conditional density of Y_t given Y(t−1) and γ belongs to an exponential family, the equations in (1.12) and (1.13) correspond to (1.3) and (1.4). Suppose that γ has a prior density π(γ|α) with mean 0 and variance Γ(α). Consider a modified equation for γ:

Σ_{t=1}^T (∂μ_t/∂γ')' V_t^{-1} (Y_t − μ_t) − Γ(α)^{-1} γ = 0.   (1.14)

If γ ~ N(0, Γ(α)), we have

∂ log π(γ|α)/∂γ = − Γ(α)^{-1} γ.
Hence, in this special case, (1.14) reduces to (1.13). We show in this paper that the equations in (1.12) and (1.14) are optimal estimating equations in the sense of a generalized Godambe criterion. The question of the performance (sampling properties) of the estimates obtained from (1.12) and (1.14) needs to be addressed via asymptotics. In the special case of linear mixed models (example 1), the estimates are known to be best linear unbiased predictors(BLUP), see Robinson(1991). In the general case, we may use an extension of the asymptotic optimality criterion discussed by Wefelmeyer(1996), at least for the estimation of the fixed effects parameter β. See also Heyde(1997). This topic will not be considered in this paper.
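For Example 1, equations (1.3) and (1.4) are linear in (β, γ) and can be solved in one linear system; a minimal numerical sketch (dimensions, designs and parameter values are our own choices, with Σ = I):

```python
import numpy as np

rng = np.random.default_rng(4)
n_obs, p, q = 50, 2, 3
Xd = rng.normal(size=(n_obs, p))      # rows of X' (fixed-effect design)
Zd = rng.normal(size=(n_obs, q))      # rows of Z' (random-effect design)
Gamma = 0.5 * np.eye(q)

beta0 = np.array([1.0, -2.0])
gamma0 = rng.multivariate_normal(np.zeros(q), Gamma)
y1 = Xd @ beta0 + Zd @ gamma0 + rng.normal(size=n_obs)   # Sigma = I

# Joint maximization of L(beta,gamma)*pi(gamma) gives the linear system
# [X X'   X Z'            ] [beta ]   [X y1]
# [Z X'   Z Z' + Gamma^-1 ] [gamma] = [Z y1]      (Sigma = I).
A = np.block([[Xd.T @ Xd, Xd.T @ Zd],
              [Zd.T @ Xd, Zd.T @ Zd + np.linalg.inv(Gamma)]])
b = np.concatenate([Xd.T @ y1, Zd.T @ y1])
sol = np.linalg.solve(A, b)
beta_hat, gamma_hat = sol[:p], sol[p:]

# The solution satisfies the estimating equations (1.3) and (1.4) exactly.
r = y1 - Xd @ beta_hat - Zd @ gamma_hat
assert np.allclose(Xd.T @ r, 0.0, atol=1e-8)                       # (1.3)
assert np.allclose(Zd.T @ r - np.linalg.inv(Gamma) @ gamma_hat,
                   0.0, atol=1e-8)                                 # (1.4)
```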
In general, the information matrix (or its estimate) corresponding to the optimal estimating functions can be used to compute the standard errors of the estimates. The paper is organized as follows. The general optimality criterion for the estimation of mixed effects is introduced in Section 2. The optimal estimating equations are derived in Section 3. Section 4 discusses applications to autoregressive processes and generalized linear mixed models for Markov processes. Section 5 considers an alternative method based on a marginal model specification. Finally, Section 6 contains some concluding remarks on work in progress.
2 Optimality Criterion
Let 𝒴 be the sample space and (Θ_1 × Θ_2) ⊂ R^k the parameter space, where Θ_1 ⊂ R^m, m ≤ k. Assume that π_β(γ) is a prior density of γ for fixed β, where γ ∈ Θ_1 and β ∈ Θ_2. Let ℒ be the set of all functions g : 𝒴 × (Θ_1 × Θ_2) → R^k such that E[g(Y, γ, β) | β] = 0 for all β ∈ Θ_2, where the expectation is with respect to the joint distribution of Y and γ, for a fixed β. Assume that E[g(Y, γ, β) g(Y, γ, β)'] is finite for any g ∈ ℒ. We shall use the notation

⟨g_1, g_2⟩_β = E[g_1(Y, γ, β) g_2(Y, γ, β)' | β],  for all g_1, g_2 ∈ ℒ,   (2.1)

where the expectation is with respect to the joint distribution of Y and γ, for a fixed β. Let p(Y, γ, β) denote the joint density of Y and γ. We assume throughout that the following conditions are satisfied:

C.1. For any g ∈ ℒ, g is differentiable with respect to both γ and β; both E[∂g/∂γ] and E[∂g/∂β] exist.
C.2. The joint density p(y, γ, β) is differentiable with respect to both γ and β. The support of p(y, γ, β) does not depend on the parameters.
C.3. For any g ∈ ℒ, E[g] is differentiable with respect to β under the integral sign.
C.4. The support of the conditional density p_β(y|γ) does not depend on the parameters γ and β. For any g ∈ ℒ, E[g|γ] is differentiable with respect to γ under the integral sign; E[(∂/∂γ) E(g|γ)] and E[E(g|γ) ∂ log π_β(γ)/∂γ] exist.

The above conditions ensure the existence of the various quantities in (2.4) and the validity of the derivation of the information function defined in (2.6) below.
Let s = (s_1, s_2, ..., s_k)', where

s_j = ∂ log p(Y, γ, β)/∂γ_j,  j = 1, ..., m,    s_{m+l} = ∂ log p(Y, γ, β)/∂β_l,  l = 1, ..., k − m.

For any g ∈ ℒ_0, ℒ_0 ⊂ ℒ,

E[g s_j] = E[ g ∂ log p(Y, γ, β)/∂γ_j ],  j = 1, ..., m.

Note that p(Y, γ, β) = p_β(Y|γ) π_β(γ), where p_β(Y|γ) is the conditional density of Y given γ with respect to an appropriate measure μ(y) and π_β(γ) is the density of γ with respect to a measure ν(γ). We then have

E[g s_j] = E[ g ( ∂ log p_β(Y|γ)/∂γ_j + ∂ log π_β(γ)/∂γ_j ) ]
         = E[ − ∂g/∂γ_j + ∂E[g|γ]/∂γ_j + E[g|γ] ∂ log π_β(γ)/∂γ_j ].   (2.2)

Now,

E[g s_{m+l}] = E[ g ∂ log p(Y, γ, β)/∂β_l ] = ∫∫ g (∂p(y, γ, β)/∂β_l) dμ(y) dν(γ) = − E[ ∂g/∂β_l ],  l = 1, ..., k − m,   (2.3)

since E[g] = 0 for all β and C.3 allows differentiation under the integral sign. Combining (2.2) and (2.3), we obtain

E[g s'] = − E[ ∂g/∂(γ', β') ] + ( E[ ∂E(g|γ)/∂γ' ] + E[ E(g|γ) ∂ log π_β(γ)/∂γ' ],  0 ).   (2.4)

Note that if E[g|γ] = 0, (2.4) reduces to

E[g s'] = − E[ ∂g/∂(γ', β') ].   (2.5)

We define the information function by

I_g(β) = E[g s'] E[g g']^{-1} E[g s']',   (2.6)

for any g ∈ ℒ_0, where E[g s'] is given by (2.4). The criterion of optimality is to maximize the information function I_g(β) over ℒ_0.
Definition 2.1 A function g* is an optimal estimating function in ℒ_0 if I_{g*}(β) − I_g(β) is nonnegative definite for all g ∈ ℒ_0 and all β.

Case I (Fixed Effects): m = 0, i.e. the model involves only fixed-effect parameters. When m = 0, E[g s'] = −E[∂g/∂β']. So the information function is given by

I_g(β) = E[∂g/∂β']' E[g g']^{-1} E[∂g/∂β'].

Let g* be the optimal estimating function for β in ℒ_0. Then it is easily verified that the nonnegative definiteness of I_{g*}(β) − I_g(β) is equivalent to that of

E(g g') − H (H*)^{-1} E[g* g*'] (H*')^{-1} H',

where H = E[∂g/∂β'] and H* = E[∂g*/∂β']. The latter turns out to be the optimality criterion given by Godambe (1985) when only fixed parameters are present. See also Godambe (1960) and Durbin (1960).

Case II (Random Effects): m = k, i.e. the model involves only random-effect parameters. When m = k,

E[g s'] = − E[∂g/∂γ'] + E[∂E(g|γ)/∂γ'] + E[E(g|γ) ∂ log π_β(γ)/∂γ'].

Thus the information function is given by

I_g(β) = J' E[g g']^{-1} J,

where J' = − E[∂g/∂γ'] + E[∂E(g|γ)/∂γ'] + E[E(g|γ) ∂ log π_β(γ)/∂γ']. Let g* be the optimal estimating function for γ in ℒ_0. Then we require

I_{g*}(β) − I_g(β) = J*' E[g* g*']^{-1} J* − J' E[g g']^{-1} J

to be nonnegative definite, where J* is J with g replaced by g*. This turns out to be Chan and Ghosh (1998)'s optimality criterion when only the parameters of random effects are present. See also Ferreira (1981, 1982) and Godambe (1998).
3 Optimal Estimating Equations
Here we derive optimal estimating functions for parameters of mixed effects. Let 𝒴 be the sample space and F_i^Y the σ-field generated by a specified partition of 𝒴. Let h_i be a real-valued function of Y, γ, and β such that

E[h_i | F_i^Y, γ, β] = 0,  i = 1, 2, ..., n.

Let u(·, β) : Θ_1 → R^m with E[u(·, β)] = 0, for fixed β. We consider the linear estimating space given by

ℒ_0 = { g : g = Σ_{i=1}^n a_i(γ, β) h_i + B(β) u(γ, β) },

where a_i(γ, β) is a k × 1 vector which is measurable with respect to the σ-field F_i^Y and B(β) is a k × m non-random matrix. Let

a_i*(γ, β) = − E[ (∂h_i/∂γ', ∂h_i/∂β')' | F_i^Y, γ, β ] [ Var(h_i | F_i^Y, γ, β) ]^{-1}   (3.1)

and

B*(β) = − E[ (∂u/∂γ', ∂u/∂β')' ] [ E(u u') ]^{-1}.   (3.2)

Theorem 3.1 Let

g* = Σ_{i=1}^n a_i*(γ, β) h_i + B*(β) u(γ, β).

Suppose that the h_i's are mutually orthogonal in the sense that E[h_i* h_j*' | F_i^Y, γ, β] = 0 for all i ≠ j, i, j = 1, 2, ..., n, where h_i* = a_i*(γ, β) h_i. Then g* is an optimal estimating function in ℒ_0.

Proof: From Chan and Ghosh (1998), it is enough to show that g* is an orthogonal projection of s into ℒ_0, i.e., ⟨g, s − g*⟩ = 0 for any g ∈ ℒ_0. We have
⟨g, s⟩ = Σ_{i=1}^n E[a_i h_i s'] + B E[u s'].   (3.3)

From (2.4), since E[h_i | F_i^Y, γ, β] = 0 for all i and any β,

E[a_i h_i s'] = − E[ a_i E( ∂h_i/∂(γ', β') | F_i^Y, γ, β ) ].

Also, from (2.4), we have E[u s'] = −E[∂u/∂(γ', β')] + ( E[∂E(u|γ)/∂γ'] + E[E(u|γ) ∂ log π_β(γ)/∂γ'], 0 ). Since the function u is only a function of γ and β, the conditional expectation of u given γ and β is u itself, and integration by parts gives E[u ∂ log π_β(γ)/∂γ'] = −E[∂u/∂γ']. Thus we have

⟨g, s⟩ = − Σ_{i=1}^n E[ a_i E( ∂h_i/∂(γ', β') | F_i^Y, γ, β ) ] − B E[ ∂u/∂(γ', β') ].   (3.4)

Next, the second term on the right-hand side of (3.3) can be found as

⟨g, g*⟩ = Σ_{i=1}^n Σ_{j=1}^n E[ a_i h_i h_j*' ] + B E[u u'] B*' = Σ_{i=1}^n E[ a_i E( h_i² | F_i^Y, γ, β ) a_i*' ] + B E[u u'] B*',

since the functions a_i* h_i and a_j* h_j are orthogonal for all i ≠ j. Substituting a_i* and B*, we get

⟨g, g*⟩ = − Σ_{i=1}^n E[ a_i E( ∂h_i/∂(γ', β') | F_i^Y, γ, β ) ] − B E[ ∂u/∂(γ', β') ].   (3.5)

Hence, from (3.4) and (3.5), we have ⟨g, s⟩ − ⟨g, g*⟩ = 0. This completes the proof.
Independent Observations. If the elementary estimating functions h_i, i = 1, ..., n, are independent with E[h_i(Y, γ, β) | γ, β] = 0, i = 1, 2, ..., n, it is not necessary to partition 𝒴, i.e. F_i^Y is 𝒴 itself. For example, if Y = (y_1, y_2, ..., y_n) and the y_i's are independent, we take h_i to be a function of y_i, γ and β such that E[h_i | γ, β] = 0 for all i. Here the h_i's are independent. Hence, the optimal estimating function in the space ℒ_0 is given by g* in Theorem 3.1 with a_i*(γ, β) replaced by

a_i^0(γ, β) = − E[ (∂h_i/∂γ', ∂h_i/∂β')' | γ, β ] [ Var(h_i | γ, β) ]^{-1}.   (3.6)
Application to discrete time stochastic processes As an example of dependent data, we develop an optimal estimating function for a discrete time stochastic process. Let {Y_1, Y_2, ..., Y_T} be a discrete time stochastic process. Let h_t be a real-valued function of Y_1, ..., Y_t, γ, and β such that

E[h_t | F_{t-1}^Y, γ, β] = 0,

where F_{t-1}^Y is the σ-field generated by the past observations Y_1, Y_2, ..., Y_{t-1}. Let u(γ, β) : Θ_1 → R^m with E[u(·, β)] = 0, for fixed β. We consider the estimating space

L_0 = {g : g = Σ_{t=1}^T a_{t-1}(γ, β) h_t + B u(γ, β)},

where a_{t-1}(γ, β) is a k × 1 vector measurable with respect to F_{t-1}^Y and B is a k × m non-random matrix. Let

a_{t-1}*(γ, β) = −E[∂h_t/∂(γ, β)' | F_{t-1}^Y, γ, β] [Var(h_t | F_{t-1}^Y, γ, β)]^{-1}

and

B*(β) = −E[∂u(γ, β)/∂(γ, β)'] [Var(u(γ, β))]^{-1}.   (3.8)

Theorem 3.2 Let

g* = Σ_{t=1}^T a_{t-1}*(γ, β) h_t + B*(β) u(γ, β).

Then g* is an optimal estimating function in L_0. The proof follows from Theorem 3.1 with F_i^Y = F_{t-1}^Y.
4
Applications
In this section, we will discuss two applications and demonstrate the derivation of optimal estimating functions.
4.1
Autoregressive Processes
Suppose that y_t(j), t = 1, 2, ..., T, j = 1, 2, ..., n, are observed at time t on the j-th subject from the first order autoregressive (AR(1)) process

y_t(j) = φ(j) y_{t-1}(j) + ε_t(j),   (4.1)

and

φ(j) = x_j'β + z_j'γ,   (4.2)

where γ ~ N_q(0, Γ) and ε_t = (ε_t(1), ..., ε_t(n))', t = 1, 2, ..., T, are iid random vectors with E[ε_t] = 0 and Var[ε_t] = V_n(α). Assume that γ and ε_t are independent for any t. Here β is an (m × 1) vector of fixed parameters, γ is a (q × 1) vector of random parameters, and x_j and z_j are vectors of known covariates. Note that we are not making distributional assumptions regarding ε_t other than the mean and variance assumptions. Let

y_t = (y_t(1), ..., y_t(n))', φ = (φ(1), ..., φ(n))' and Y_{t-1} = diag(y_{t-1}(1), ..., y_{t-1}(n)).

Then (4.1) can be written in vector form as

y_t = Y_{t-1} φ + ε_t.

We consider optimal estimation for (γ, β), assuming α known. We choose elementary estimating functions h_t and u such that E(h_t | F_{t-1}^Y, γ, β) = 0 and E(u | β) = 0. Let

h_t = y_t − Y_{t-1}(X'β + Z'γ) and u = γ,

where X = (x_1, ..., x_n) and Z = (z_1, ..., z_n). Consider the estimating space L_0 = {g : g = Σ_{t=1}^T A_{t-1} h_t + B u}.
The score function s here is based on the joint density of (y, γ), p(y_1, ..., y_T | γ, β) π(γ), where π(γ) is the prior distribution of γ. Then A_{t-1}* and B* are computed as

A_{t-1}* = [ Z Y_{t-1} V_n(α)^{-1} ; X Y_{t-1} V_n(α)^{-1} ] and B* = [ −Γ^{-1} ; 0 ].

Thus, the optimal estimating function for (γ, β) in the space L_0 is given by g* = (g_1*', g_2*')', where

g_1* = Σ_{t=1}^T Z Y_{t-1} V_n(α)^{-1} (y_t − Y_{t-1}(X'β + Z'γ)) − Γ^{-1}γ,   (4.3)

and

g_2* = Σ_{t=1}^T X Y_{t-1} V_n(α)^{-1} (y_t − Y_{t-1}(X'β + Z'γ)).   (4.4)
From (4.3) and (4.4), the optimal estimates (γ*, β*) of (γ, β) are given by

γ* = Δ^{-1} Z Σ_{t=1}^T Y_{t-1} V_n(α)^{-1} (y_t − Y_{t-1} X'β*),

and β* is obtained by substituting γ* into (4.4) and solving g_2* = 0 for β, where

Δ = Z (Σ_{t=1}^T Y_{t-1} V_n(α)^{-1} Y_{t-1}) Z' + Γ^{-1}.

From (2.4), the information function corresponding to the optimal estimating function g* is obtained as

[ Z H_n Z' + Γ^{-1}    Z H_n X' ]
[ X H_n Z'             X H_n X' ],

where H_n = Σ_{t=1}^T E[Y_{t-1} V_n(α)^{-1} Y_{t-1}].

As a special case, if the subjects are uncorrelated, i.e. V_n(α) = diag(σ_j², j = 1, ..., n), then the optimal estimating equations from (4.3) and (4.4) turn out to be

Σ_{j=1}^n Σ_{t=1}^T z_j σ_j^{-2} y_{t-1}(j) (y_t(j) − y_{t-1}(j)(x_j'β + z_j'γ)) − Γ^{-1}γ = 0

and

Σ_{j=1}^n Σ_{t=1}^T x_j σ_j^{-2} y_{t-1}(j) (y_t(j) − y_{t-1}(j)(x_j'β + z_j'γ)) = 0.

Thus the optimal estimates (γ*, β*) of (γ, β) are obtained as

γ* = Δ^{-1} Σ_{j=1}^n Σ_{t=1}^T z_j σ_j^{-2} y_{t-1}(j) (y_t(j) − y_{t-1}(j) x_j'β*),

with β* solving the second equation after γ is replaced by γ*, where

Δ = Σ_{j=1}^n Σ_{t=1}^T z_j z_j' σ_j^{-2} y_{t-1}²(j) + Γ^{-1}.
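Because h_t is linear in (γ, β), the uncorrelated-subjects estimating equations above form a linear system and can be solved directly. The following Python sketch illustrates this under the stated assumptions (known σ_j² and Γ); the function name and data layout are hypothetical, not from the paper.

```python
import numpy as np

def ar1_mixed_estimates(y, X, Z, sigma2, Gamma):
    """Sketch: solve the joint optimal estimating equations g1* = g2* = 0 for
    (gamma, beta) in y_t(j) = (x_j'beta + z_j'gamma) y_{t-1}(j) + eps_t(j),
    with uncorrelated subjects, Var[eps_t] = diag(sigma2).
    y: (T+1, n) observations; X: (n, m); Z: (n, q); Gamma: (q, q) prior variance."""
    Tp1, n = y.shape
    m, q = X.shape[1], Z.shape[1]
    # The estimating equations are linear in (gamma, beta); accumulate the system.
    S = np.zeros((q + m, q + m))
    r = np.zeros(q + m)
    for j in range(n):
        ylag, ycur = y[:-1, j], y[1:, j]
        w = np.sum(ylag ** 2) / sigma2[j]       # sum_t y_{t-1}^2(j) / sigma_j^2
        s = np.sum(ylag * ycur) / sigma2[j]     # sum_t y_{t-1}(j) y_t(j) / sigma_j^2
        zx = np.concatenate([Z[j], X[j]])
        S += w * np.outer(zx, zx)
        r += s * zx
    S[:q, :q] += np.linalg.inv(Gamma)           # shrinkage from the N(0, Gamma) prior on gamma
    sol = np.linalg.solve(S, r)
    return sol[:q], sol[q:]                     # (gamma*, beta*)
```

With a weak prior (large Γ) and noiseless data the solver recovers the generating β exactly, which is a quick sanity check on the linear-system form of the equations.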
We extend the method of obtaining optimal estimating functions to the p-th order autoregressive processes. Suppose that y_t(j), t = 1, 2, ..., T, j = 1, 2, ..., n, are observed at time t on the j-th subject from the p-th order autoregressive (AR(p)) process

y_t(j) = Σ_{k=1}^p φ_k(j) y_{t-k}(j) + ε_t(j),   (4.5)

and

φ(j) = (φ_1(j), ..., φ_p(j))' = X_j'β + Z_j'γ,   (4.6)

where γ ~ N_q(0, Γ) and ε_t = (ε_t(1), ..., ε_t(n))', t = 1, ..., T, are iid random vectors with E[ε_t] = 0 and Var[ε_t] = V_n(α). Assume that γ and ε_t are independent for any t. Here β is an (m × 1) vector of fixed parameters, γ is a (q × 1) vector of random parameters, and X_j and Z_j are matrices of known covariates. Let Y_pt denote the n × np block-diagonal matrix

Y_pt = diag(y_pt(1)', y_pt(2)', ..., y_pt(n)'),

where y_pt(j) = (y_{t-1}(j), y_{t-2}(j), ..., y_{t-p}(j))'. Elementary estimating functions h_t and u for γ and β are chosen to be
h_t = y_t − Y_pt(X'β + Z'γ) and u = γ,

where X = (X(1), ..., X(n)) and Z = (Z(1), ..., Z(n)). These elementary estimating functions satisfy the required condition of unbiasedness, i.e., E[h_t | F_{t-1}^Y, γ, β] = 0 and E[u | β] = 0. In the estimating space L_0 = {g : g = Σ_{t=1}^T A_{t-1} h_t + B u}, the optimal estimating function for (γ, β) is given by g** = (g_1**', g_2**')', where

g_1** = Σ_{t=1}^T Z Y_pt' V_n(α)^{-1} (y_t − Y_pt(X'β + Z'γ)) − Γ^{-1}γ,

and

g_2** = Σ_{t=1}^T X Y_pt' V_n(α)^{-1} (y_t − Y_pt(X'β + Z'γ)).

From g_1** and g_2**, the optimal estimates of γ and β are found as

γ** = Δ^{-1} Z Σ_{t=1}^T Y_pt' V_n(α)^{-1} (y_t − Y_pt X'β**),

with β** obtained by solving g_2** = 0 after γ is replaced by γ**, where

Δ = Z (Σ_{t=1}^T Y_pt' V_n(α)^{-1} Y_pt) Z' + Γ^{-1}.

4.2
Generalized Linear Mixed Models for Markov Processes
Suppose that {y_t(j), t = 0, 1, ..., T, j = 1, 2, ..., n} is a Markov process with a transition density

f(y_t(j) | y_{t-1}(j), φ(j)) = c exp{φ(j) m_t(y_t(j), y_{t-1}(j)) − q_t(y_{t-1}(j), φ(j))}   (4.7)

and

φ(j) = x_j'β + z_j'γ,   (4.8)

where γ ~ N(0, Γ) and c is a function of y_t(j) and y_{t-1}(j). Here β is an (m × 1) vector of fixed parameters, γ is an (r × 1) vector of random parameters, and x_j and z_j are known vectors of covariates. It is assumed for simplicity that, conditional on γ, the processes {y_t(j)} are independent for different j = 1, 2, ..., n.
Let X = (x_1, x_2, ..., x_n) and Z = (z_1, z_2, ..., z_n). Then (4.8) can be written as

φ = (φ(1), φ(2), ..., φ(n))' = X'β + Z'γ.

We consider joint optimal estimation for (γ, β), assuming that Γ is known. We choose elementary estimating functions h_t and u for γ and β such that E[h_t | F_{t-1}^Y, γ, β] = 0 and E[u | β] = 0. Let

h_t = m_t − q̇_t and u = γ,

where

m_t = (m_t(y_t(1), y_{t-1}(1)), ..., m_t(y_t(n), y_{t-1}(n)))'

and

q̇_t = (q̇_t(1), ..., q̇_t(n))', with q̇_t(j) = (∂/∂φ(j)) q_t(y_{t-1}(j), φ(j)).

Consider the estimating space L_0 = {g : g = Σ_{t=1}^T A_{t-1} h_t + B u}. Let

A_{t-1}* = −E[∂h_t/∂(γ, β)' | F_{t-1}^Y, γ, β] [Var(h_t | F_{t-1}^Y, γ, β)]^{-1}.

Then we have B* = −Γ^{-1}, where the relevant conditional variance is q̈_t = diag{(∂²/∂φ(j)²) q_t(y_{t-1}(j), φ(j)), j = 1, ..., n}. Thus, in the space L_0 the joint optimal estimating function for (γ, β) is given by g* = (g_1*', g_2*')', where

g_1* = Σ_{j=1}^n Σ_{t=1}^T z_j (m_t(y_t(j), y_{t-1}(j)) − q̇_t(j)) − Γ^{-1}γ,   (4.9)

and

g_2* = Σ_{j=1}^n Σ_{t=1}^T x_j (m_t(y_t(j), y_{t-1}(j)) − q̇_t(j)).   (4.10)

The optimal estimates (γ*, β*) for (γ, β) can be obtained by solving the equation g* = 0 for γ and β simultaneously.
From (2.4), the information function corresponding to the optimal estimating function g* is obtained as

[ Z H_n Z' + Γ^{-1}    Z H_n X' ]
[ X H_n Z'             X H_n X' ],

where H_n = Σ_{t=1}^T E[q̈_t]. We now illustrate the model by two examples.

Example 1. Consider an AR(1) process with normal errors, i.e.

y_t(j) = φ(j) y_{t-1}(j) + ε_t(j)   (4.11)

and

φ(j) = x_j'β + z_j'γ,   (4.12)

where ε_t(j) ~ indep. N(0, σ_j²) and γ ~ N(0, Γ). The conditional density is given by

f(y_t(j) | y_{t-1}(j), φ(j)) = c exp{−(1/(2σ_j²)) (y_t(j) − φ(j) y_{t-1}(j))²},   (4.13)

where c = 1/√(2πσ_j²). Since (4.13) can be written in the form (4.7), we have

m_t(y_t(j), y_{t-1}(j)) = σ_j^{-2} y_{t-1}(j) y_t(j) and q_t(y_{t-1}(j), φ(j)) = (1/(2σ_j²)) φ²(j) y_{t-1}²(j).

This gives q̇_t(j) = σ_j^{-2} y_{t-1}²(j) φ(j) and q̈_t(j) = σ_j^{-2} y_{t-1}²(j). Hence the optimal estimating function for (γ, β) is given by g* = (g_1*', g_2*')', where

g_1* = Σ_{j=1}^n Σ_{t=1}^T z_j σ_j^{-2} y_{t-1}(j) (y_t(j) − φ(j) y_{t-1}(j)) − Γ^{-1}γ,

and

g_2* = Σ_{j=1}^n Σ_{t=1}^T x_j σ_j^{-2} y_{t-1}(j) (y_t(j) − φ(j) y_{t-1}(j)).

The optimal estimates of γ and β are obtained by solving the equations g_1* = 0 and g_2* = 0, which yields the same estimates as γ* and β* in Section 4.1.

Example 2. Let {y_t(j)}, t = 0, 1, 2, ..., be a Markov chain, for each j = 1, 2, ..., n, defined on the binary state space {0, 1}. Denote π_tj = P(y_t(j) = 1 | y_{t-1}(j)) and θ_tj = logit(π_tj) = log(π_tj/(1 − π_tj)). We then have

π_tj = exp{θ_tj} / (1 + exp{θ_tj}).

Consider the model

θ_tj = β_0 + φ_j y_{t-1}(j),   (4.14)

where φ_j = x_j'β + z_j'γ. Conditionally on γ, the Markov chains {y_t(j)}, j = 1, ..., n, are assumed to be independent. Suppose γ ~ N(0, Γ). Conditional on γ, the transition densities are given by

p(y_t(j) | y_{t-1}(j), γ) = π_tj^{y_t(j)} (1 − π_tj)^{1 − y_t(j)}
  = exp{θ_tj y_t(j) − log(1 + exp{θ_tj})}
  = exp{β_0 y_t(j) + φ_j y_{t-1}(j) y_t(j) − log(1 + exp{β_0 + φ_j y_{t-1}(j)})}.

The optimal estimating functions for (γ, β, β_0) then reduce to

g_1* = Σ_{j=1}^n Σ_{t=1}^T z_j y_{t-1}(j) (y_t(j) − π_tj) − Γ^{-1}γ,

g_2* = Σ_{j=1}^n Σ_{t=1}^T x_j y_{t-1}(j) (y_t(j) − π_tj),

and

g_3* = Σ_{j=1}^n Σ_{t=1}^T (y_t(j) − π_tj).

The third estimating function g_3* here corresponds to the common intercept β_0 in the model for θ_tj.
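Each component of g* in Example 2 is a sum of conditionally centered residuals y_t(j) − π_tj. The following hedged Python sketch (hypothetical function name and data layout) evaluates the stacked estimating function at a given parameter value; a root could then be found by, e.g., Newton iteration.

```python
import numpy as np

def logistic_chain_score(y, X, Z, Gamma, theta):
    """Sketch: stacked estimating function (g1*, g2*, g3*) for the binary
    Markov chain model logit P(y_t(j)=1 | y_{t-1}(j)) = beta0 + phi_j y_{t-1}(j),
    phi_j = x_j'beta + z_j'gamma.
    y: (T+1, n) array of 0/1 states; X: (n, m); Z: (n, q); Gamma: (q, q);
    theta: stacked (gamma, beta, beta0)."""
    n, m = X.shape
    q = Z.shape[1]
    gamma, beta, beta0 = theta[:q], theta[q:q + m], theta[-1]
    phi = X @ beta + Z @ gamma                   # subject-specific slopes, shape (n,)
    ylag, ycur = y[:-1, :], y[1:, :]             # (T, n) lagged and current states
    eta = beta0 + ylag * phi                     # theta_tj
    pi = 1.0 / (1.0 + np.exp(-eta))              # pi_tj
    resid = ycur - pi                            # conditionally centered residuals
    g_gamma = ((resid * ylag) @ Z).sum(axis=0) - np.linalg.solve(Gamma, gamma)
    g_beta = ((resid * ylag) @ X).sum(axis=0)
    g_beta0 = resid.sum()
    return np.concatenate([g_gamma, g_beta, [g_beta0]])
```

At θ = 0 every π_tj is 1/2, so the score can be checked by hand on a tiny data set.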
5
Marginal Quasi-likelihood Estimation
Let μ_t(y(t−1), β) = E[y_t | y_{t-1}, ..., y_1, β] and V_t(y(t−1), β) = Var[y_t | y_{t-1}, ..., y_1, β]. The optimal estimating equation for β is

Σ_{t=1}^T (∂μ_t(y(t−1), β)/∂β) V_t^{-1}(y(t−1), β) (y_t − μ_t(y(t−1), β)) = 0.   (5.1)

Let β̃ denote a solution of (5.1). Define μ_t(y(t−1), γ, β) = E[y_t | y_{t-1}, ..., y_1, γ, β] and V_t(y(t−1), γ, β) = Var[y_t | y_{t-1}, ..., y_1, γ, β]. For fixed β, let

h_t = y_t − μ_t(y(t−1), γ, β)

denote an elementary estimating function for γ. Suppose γ has a prior density π(γ | α) with mean 0 and variance Γ. Consider the estimating equation

Σ_{t=1}^T (∂μ_t(y(t−1), γ, β)/∂γ) V_t^{-1}(y(t−1), γ, β) (y_t − μ_t(y(t−1), γ, β)) − Γ^{-1}γ = 0.   (5.2)

For fixed β, (5.2) is an optimal estimating equation when only the random parameter γ is present in the model. We now substitute the marginal quasi-likelihood estimate β̃ in (5.2) when β is unknown. Denote the resulting estimate of γ (obtained from (5.2) after replacing β by β̃) by γ̃. The estimates β̃ and γ̃ will be referred to as the marginal quasi-likelihood (MQL) estimates. Note, however, that for fixed β, (5.2) is the same as the estimating equation (1.14) for γ corresponding to the joint optimal estimation of β and γ. It must be noted that (5.1) and (5.2) are not jointly optimal; on the other hand, (5.1) is optimal for β with respect to the marginal density of y = (y_T, ..., y_1), and (5.2) is optimal for γ (when β is known) with respect to the joint density of y and γ.

Application to Autoregressive Processes: Suppose that y_t(j), t = 1, 2, ..., T, j = 1, 2, ..., n, are observed at time t on the j-th subject from the first order autoregressive (AR(1)) process:

y_t(j) = φ(j) y_{t-1}(j) + ε_t(j),   (5.3)

and

φ(j) = x_j'β + z_j'γ,   (5.4)

where γ ~ N_q(0, Γ) and ε_t = (ε_t(1), ..., ε_t(n))', t = 1, 2, ..., T, are iid random vectors with E[ε_t] = 0 and Var[ε_t] = V_n(α). Assume that γ and ε_t
are independent for any t. Here β is an (m × 1) vector of fixed parameters, γ is a (q × 1) vector of random parameters, and x_j and z_j are vectors of known covariates. Let y_t = (y_t(1), ..., y_t(n))' and Y_{t-1} = diag(y_{t-1}(j), j = 1, ..., n), i.e., Y_{t-1} is a diagonal matrix with (j, j)-th diagonal element y_{t-1}(j), j = 1, ..., n. We consider optimal estimation for β, treating γ as a nuisance parameter, when α is known. We assume that ε_t has a normal distribution for our illustration. Now

μ_t(y(t−1), β) = Y_{t-1}(X'β + Z' E[γ | y_{t-1}, ..., y_1])

and

V_t(y(t−1), β) = Y_{t-1} Z' Var[γ | y_{t-1}, ..., y_1] Z Y_{t-1} + V_n(α),

where X = (x_1, ..., x_n) and Z = (z_1, ..., z_n). To find the conditional expectation and variance, we derive the posterior density π(γ | y_{t-1}, ..., y_1) of γ, conditional on the past observations:

π(γ | y_{t-1}, ..., y_1) ∝ p(y_{t-1}, ..., y_1 | γ) π(γ)
  ∝ exp{ −(1/2) Σ_{r=1}^{t-1} (y_r − Y_{r-1}(X'β + Z'γ))' V_n(α)^{-1} (y_r − Y_{r-1}(X'β + Z'γ)) − (1/2) γ'Γ^{-1}γ }
  ∝ exp{ −(1/2) (γ − Δ_{t-1}^{-1} λ_{t-1})' Δ_{t-1} (γ − Δ_{t-1}^{-1} λ_{t-1}) },

where

Δ_{t-1} = Z (Σ_{r=1}^{t-1} Y_{r-1} V_n(α)^{-1} Y_{r-1}) Z' + Γ^{-1}

and

λ_{t-1} = Z Σ_{r=1}^{t-1} Y_{r-1} V_n(α)^{-1} (y_r − Y_{r-1} X'β).

We then have μ_t(y(t−1), β) and V_t(y(t−1), β) given by

μ_t(y(t−1), β) = Y_{t-1}(X'β + Z' Δ_{t-1}^{-1} λ_{t-1}),

and

V_t(y(t−1), β) = Y_{t-1} Z' Δ_{t-1}^{-1} Z Y_{t-1} + V_n(α).
Thus the optimal estimating equation for β is given by (5.1), and the optimal estimating function is computed as

g* = Σ_{t=1}^T A_{t-1}* B_{t-1}*^{-1} (y_t − μ_t(y(t−1), β)),

where

A_{t-1}* = ∂μ_t(y(t−1), β)/∂β

and

B_{t-1}* = Y_{t-1} Z' Δ_{t-1}^{-1} Z Y_{t-1} + V_n(α).

Hence the optimal estimate β̃ of β is found by solving g* = 0, which is linear in β. From (2.4), the information function corresponding to the optimal estimating function g* is obtained as

Σ_{t=1}^T E[A_{t-1}* B_{t-1}*^{-1} A_{t-1}*'].
Suppose that, for fixed β, we wish to estimate γ. We choose an elementary estimating function h_t for γ as

h_t = y_t − Y_{t-1}(X'β + Z'γ).

Then the optimal estimating equation for γ is given by (5.2):

Σ_{t=1}^T Z Y_{t-1} V_n(α)^{-1} (y_t − Y_{t-1}(X'β + Z'γ)) − Γ^{-1}γ = 0.

The above equation yields an estimate γ* of γ when β is known:

γ* = [Z (Σ_{t=1}^T Y_{t-1} V_n(α)^{-1} Y_{t-1}) Z' + Γ^{-1}]^{-1} Z Σ_{t=1}^T Y_{t-1} V_n(α)^{-1} (y_t − Y_{t-1} X'β).

The estimate γ* is optimal for γ when β is known. When β is unknown, we replace it by β̃ in γ* to obtain γ̃.
6
Concluding Remarks
In this paper, we have derived optimal estimating equations for estimating fixed and random parameters in mixed effects models with dependent observations. Among the issues that are not addressed in this paper are: (i) asymptotic distribution theory, (ii) estimation of variance components, and (iii) computational aspects. We hope to return to these topics in future work. Here, we content ourselves with offering the following remarks on these important issues.

(i) Asymptotic Distribution Theory. The consistency and asymptotic normality of the marginal quasi-likelihood estimate β̃ in Section 5 can be established, under regularity conditions, from the general theory discussed by Heyde (1997). Appropriate extensions to develop asymptotic distribution theory for the joint estimation of β and γ are needed. Moreover, work on the asymptotic efficiency of the estimates will be useful.

(ii) Estimation of Variance Components. Assuming γ ~ N(0, Γ), one can estimate Γ from the marginal quasi-likelihood in Section 5. An extension of the REML approach discussed by Breslow and Clayton (1993) can be used in practice.

(iii) Computational Aspects. Extensions of the Fisher scoring method discussed by Breslow and Clayton (1993) need to be developed. For most of the examples in our paper, however, we have obtained explicit solutions of the estimating equations. Consistent parameter estimates of the information matrices can easily be obtained, and hence estimates of the standard errors of the estimates can be computed, in principle.
OPTIMAL ESTIMATING EQUATIONS
267
References

Basawa, I.V., Godambe, V.P., and Taylor, R.L. (Eds.) (1997). Selected Proceedings of the Symposium on Estimating Functions. IMS Lecture Notes-Monograph Series, Vol. 32.

Breslow, N.E. and Clayton, D.G. (1993). Approximate Inference in Generalized Linear Mixed Models. Journal of the American Statistical Association 88, 421, 9-25.

Chan, S. and Ghosh, M. (1998). Orthogonal Projections and the Geometry of Estimating Functions. Journal of Statistical Planning and Inference 67, 227-245.

Desmond, A.F. (1997). Optimal Estimating Functions, Quasi-likelihood and Statistical Modelling. Journal of Statistical Planning and Inference 60, 77-121.

Durbin, J. (1960). Estimation of Parameters in Time-series Regression Models. Journal of the Royal Statistical Society B22, 139-153.

Ferreira, P.E. (1981). Extending Fisher's Measure of Information. Biometrika 68, 3, 695-698.

Ferreira, P.E. (1982). Estimating Equations in the Presence of Prior Knowledge. Biometrika 69, 3, 667-669.

Godambe, V.P. (1960). An Optimum Property of Regular Maximum Likelihood Estimation. Annals of Mathematical Statistics 31, 1208-1211.

Godambe, V.P. (1985). The Foundations of Finite Sample Estimation in Stochastic Processes. Biometrika 72, 2, 419-428.

Godambe, V.P. (Ed.) (1991). Estimating Functions. Oxford University Press, Oxford.

Godambe, V.P. (1998). Linear Bayes and Optimal Estimation. To appear in Annals of the Institute of Statistical Mathematics.

Heyde, C.C. (1997). Quasi-Likelihood and Its Application: A General Approach to Optimal Parameter Estimation. Springer-Verlag, New York.

Robinson, G.K. (1991). That BLUP is a Good Thing: The Estimation of Random Effects. Statistical Science 6, 1, 15-51.

Sutradhar, B.C. and Godambe, V.P. (1997). On the Estimating Function Approach in the Generalized Linear Mixed Model. In Selected Proceedings of the Symposium on Estimating Functions (Basawa, I.V., Godambe, V.P., and Taylor, R.L., Eds.), IMS Lecture Notes-Monograph Series 32, 193-213.

Waclawiw, M.A. and Liang, K.Y. (1993). Prediction of Random Effects in the Generalized Linear Model. Journal of the American Statistical Association 88, 421, 171-178.

Wefelmeyer, W. (1996). Quasi-likelihood Models and Optimal Inference. Annals of Statistics 24, 1, 405-422.
271 Institute of Mathematical Statistics LECTURE NOTES — MONOGRAPH SERIES
RECONSTRUCTION OF A STATIONARY SPATIAL PROCESS FROM A SYSTEMATIC SAMPLING Karim Benhenni LABSAD, Batiment des Sciences Humaines et Mathematiques, Universite Pierre Mendes France e-mail: [email protected]
Abstract We consider the problem of predicting a spatial stationary process over a fixed unit region [0, 1]^d, d ≥ 1. We derive a linear nonparametric predictor using an extended linear interpolation formula based on a regular sampling design of size m^d. Under an appropriate assumption on the spectral density, we give the rate of convergence of the corresponding integrated mean squared error when the observations become dense in the whole region. Key words: spatial process, linear interpolation, spectral density, rate of convergence.
1
Introduction and Results
The prediction of a spatial process from its observations at chosen sites is relevant to problems in geology and the environment, and is known as kriging. Parametric methods have been used to predict a process by means of a linear model. The best linear unbiased estimator of the underlying parameter was studied by many authors, such as Cressie (1993), Matern (1986), and Sacks, Welch, Mitchell and Wynn (1989). We wish to predict the process X(t), t ∈ [0, 1]^d, from observations based on a systematic (regular) sampling design in the unit region [0, 1]^d, which is divided into m^d cells each of side 1/m. The best linear predictor depends on the inverse of a covariance matrix generated by the m^d observations, and thus may be subject to serious numerical instabilities. We consider in this paper a nonparametric approach to predict
272
BENHENNI
the process X(t), t ∈ [0, 1]^d. We consider a weakly stationary process with a spectral density φ_X that satisfies, for q ≥ 1,

∫_{R^d} |ω_i|^{2q} |ω_j|^{2q} φ_X(ω) dω < ∞, i, j = 1, ..., d.

For q = 1, the predictor X̂(t), when t belongs to a given cell, is derived by applying an extended Lagrange interpolation formula in each direction. When t = (t^1, ..., t^d) ∈ [k/m, (k+1)/m) for some k = (k_1, ..., k_d) ∈ {0, ..., m−1}^d, the predictor is obtained by applying the Lagrange interpolation formula for every t^l ∈ [t_{k_l}, t_{k_l+1}) = [k_l/m, (k_l+1)/m), k_l = 0, ..., m−1, l = 1, ..., d:

X̂(t) = Σ_{j=(j_1,...,j_d) ∈ {0,1}^d} C_d(t, k, j) X(t_{k+1−j}),

where C_d(t, k, j) = m^d Π_{l=1}^d (−1)^{j_l} (t^l − t_{k_l+j_l}) and t_k = (t_{k_1}, ..., t_{k_d}). The whole process can then be reconstructed for every t ∈ [0, 1]^d, and the error of prediction is measured through an integrated mean squared error:

IMSE = ∫_{[0,1]^d} E(X(t) − X̂(t))² dt.
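For d = 2 this predictor is ordinary bilinear interpolation. A small Python sketch of the weights C_d(t, k, j) follows; the helper name `predictor` is hypothetical, and the exactness check reflects the Lemma of the Proofs section (the weights reproduce functions that are affine in each coordinate).

```python
import numpy as np
from itertools import product

def predictor(sample, t, m):
    """Sketch of X_hat(t) = sum_{j in {0,1}^d} C_d(t,k,j) X(t_{k+1-j}) from
    samples on the regular grid with spacing 1/m.
    sample(u) returns the field value at a grid point u (tuple of floats)."""
    d = len(t)
    k = [min(int(ti * m), m - 1) for ti in t]      # cell containing t
    val = 0.0
    for j in product((0, 1), repeat=d):
        # C_d(t,k,j) = m^d * prod_l (-1)^{j_l} (t^l - t_{k_l + j_l})
        w = np.prod([m * (-1) ** jl * (t[l] - (k[l] + jl) / m)
                     for l, jl in enumerate(j)])
        node = tuple((k[l] + 1 - j[l]) / m for l in range(d))
        val += w * sample(node)
    return val
```

Evaluating the predictor on a coordinatewise-affine field returns the field exactly, which is a direct check of the interpolation weights.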
In the one-dimensional case d = 1 we obtain the classical interpolation formula, which was used by Su and Cambanis (1993) and Muller-Gronbach and Ritter (1997) for predicting a second order stochastic process. For higher dimension d, Stein (1993, 1995) considered the prediction of integrals of spatial processes and studied the asymptotic properties of the mean squared error of approximation. In his book, Stein (1999) gives a summary of his results and some discussions on spatial interpolation. When the observations become dense in the whole region (infill asymptotics), the following theorem gives the rate of convergence of the integrated mean squared error and the corresponding asymptotic constant in terms of the spectral density.

Theorem 1. If a regular sampling design of size n = m^d is used, and if the spectral density φ_X of the process X satisfies ∫ |ω_i|² |ω_j|² φ_X(ω) dω < ∞, i, j = 1, ..., d, then as m → ∞, m⁴ (IMSE) converges to an explicit constant multiple of

∫_{R^d} h_d(ω) φ_X(ω) dω,

where h_d(ω) = (Σ_{l=1}^d ω_l²)².
273
RECONSTRUCTION
Example. In the practical case d = 2, the spectral density φ_X(ω) = (1 + |ω|²)^{-4} satisfies the condition of the theorem for q = 1. In this case the rate is of order n^{-2}.
The rate of convergence for the predictor X̂(t) cannot be faster than n^{-4/d} for smoother processes q > 1. However, the rate can be improved by using a more sophisticated linear predictor, constructed by applying the extension of Newton's interpolation formulae to the spatial case up to some appropriate order; see Benhenni and Cambanis (1992) and Benhenni (1998) for d = 1. For any positive integer i, let Δ^i be the increment operator of the i-th order,

Δ^i g(t_k) = Σ_{r=0}^i (i choose r) (−1)^{i−r} g(t_{k+r}), 0 ≤ k + i ≤ m.

We define recursively the d-dimensional finite difference operator

Δ^{i_1,...,i_d} g(t_{k_1}, ..., t_{k_d}) = Δ^{i_1} (Δ^{i_2,...,i_d} g)(t_{k_1}, ..., t_{k_d}),

where Δ^{i_1} acts on the first argument. When t ∈ [k/m, (k+1)/m) for some k = (k_1, ..., k_d) ∈ {0, ..., m−1}^d, the predictor is:

X̂_q(t) = Σ_{i_1,...,i_d=0}^{2q−1} C_d(t, k, i_1, ..., i_d) Δ^{i_1,...,i_d} X(t_k),

where

C_d(t, k, i_1, ..., i_d) = Π_{l=1}^d w_{i_l}(m(t^l − t_{k_l})) / i_l!

and w_i(u) = u(u − 1) ... (u − i + 1), i ≥ 1, w_0(u) = 1.

Theorem 2. If a systematic sampling design of size n = m^d is used, and if the spectral density φ_X of the process X satisfies for q ≥ 1

∫ |ω_i|^{2q} |ω_j|^{2q} φ_X(ω) dω < ∞, i, j = 1, ..., d,

then as m → ∞,

m^{4q} (IMSE) → ∫_{R^d} h_{d,q}(ω) φ_X(ω) dω,

where

h_{d,q}(ω) = (1/((2q)!)²) [ Σ_{l=1}^d ω_l^{4q} ∫_0^1 w_{2q}²(u) du + 2 Σ Σ_{i<l} ω_i^{2q} ω_l^{2q} (∫_0^1 w_{2q}(u) du)² ].
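In one dimension the higher-order predictor is Newton's forward-difference interpolation of order 2q − 1 from equally spaced samples. A hedged Python sketch (hypothetical helper name; boundary cells are shifted so the 2q nodes stay inside the grid, a detail not spelled out in the text):

```python
import numpy as np
from math import factorial

def newton_predictor(xs, t, m, q):
    """Sketch of the d = 1 predictor: Newton forward interpolation of order
    2q - 1 from samples xs[k] = X(k/m), k = 0..m."""
    k = min(int(t * m), m - 2 * q + 1)             # keep nodes k, ..., k+2q-1 in range
    k = max(k, 0)
    u = m * t - k                                   # scaled position
    # Accumulate sum_i w_i(u)/i! * Delta^i xs[k], with w_i(u) = u(u-1)...(u-i+1)
    val, diff, w = 0.0, np.array(xs[k:k + 2 * q], dtype=float), 1.0
    for i in range(2 * q):
        val += w * diff[0] / factorial(i)
        w *= (u - i)                                # w_{i+1}(u) = w_i(u) (u - i)
        diff = np.diff(diff)                        # Delta^{i+1}
    return val
```

Since the formula is exact for polynomials of degree at most 2q − 1, interpolating a cubic with q = 2 recovers it to machine precision.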
For high dimension d and low regularity q the rate of convergence n^{-4q/d} becomes slower. Therefore in this case the above predictors may not be very efficient. It would then be interesting to study the interplay between the dimensionality and the regularity of some specific processes. It is a harder problem to study whether these predictors are asymptotically optimal within the class of systematic samplings, that is, whether the above predictors have the same asymptotic performance as the optimal linear predictors for any fixed sample size n. This is true for d = 1 and q = 1; see Su and Cambanis (1993). The same rules can be applied to predict the partial quadratic mean derivatives X^{(1,...,d)}(t) of the process under stronger conditions on the existing derivatives. If t ∈ [k/m, (k+1)/m) for some k = (k_1, ..., k_d) ∈ {0, ..., m−1}^d, then the predictor is obtained by differentiating X̂_q(t) once with respect to each coordinate.

Corollary. If a systematic sampling design of size n = m^d is used, and if the spectral density φ_X of the process X satisfies for q ≥ 1

∫ g_d(ω) |ω_i|^{2q} |ω_j|^{2q} φ_X(ω) dω < ∞, i, j = 1, ..., d,

where g_d(ω) = Π_{l=1}^d ω_l², then as m → ∞,

m^{4q} (IMSE) → ∫_{R^d} g_d(ω) h_{d,q}(ω) φ_X(ω) dω,

where h_{d,q}(ω) is as in Theorem 2.

2
Proofs
Proof of Theorem 1. The stationary spatial process X(t), t ∈ [0, 1]^d, can be expressed by Cramer's representation:

X(t) = ∫_{R^d} e^{iω't} dW(ω),

where W is a process with orthogonal increments associated to the spectral measure with Radon-Nikodym derivative φ_X with respect to the Lebesgue measure. Then the prediction error can be written as:

E(X(t) − X̂(t))² = ∫_{R^d} |R_d(t, k)|² φ_X(ω) dω,

where

R_d(t, k) = exp(−iω't) − Σ_{j∈{0,1}^d} C_d(t, k, j) exp(−iω't_{k+1−j}).

The Taylor expansion up to order two for every t^l ∈ (t_{k_l}, t_{k_l+1}) in the neighborhood of t_{k_l} gives:

exp(−iω_l t^l) = exp(−iω_l t_{k_l}) {1 − iω_l(t^l − t_{k_l}) + (iω_l)² (t^l − t_{k_l})²/2 + o((t^l − t_{k_l})²)}.

Then exp(−iω't) = Π_{l=1}^d exp(−iω_l t^l), and likewise for t = t_{k+1−j} we have:

exp(−iω't_{k+1−j}) = exp(−iω't_k) Π_{l=1}^d {1 − iω_l(t_{k_l+1−j_l} − t_{k_l}) + (iω_l)² (t_{k_l+1−j_l} − t_{k_l})²/2 + o(m^{-2})}.

The remainder can then be expressed as

R_d(t, k) = exp(−iω't_k) {A_d(t, k) + B_d(t, k) + o(m^{-2})},

where A_d(t, k) and B_d(t, k) collect, respectively, the second-order terms and the remaining cross terms of the expansion.
V>
(Λ
Λ
\
Cd(t 5 k,j)
For every positive integer d, and every t = (i 1 ,...,* 4 *) G s o m e k = (fci,...,fed)€ { 0 , . . . , m - l } d we have:
Σ
(0
j=ϋi. j
cd(t,k,j) = i.
Proof, (i) and (ii) are proved by using the definition of Cd(t,k,j): (0
Σ
Q(t,k,j) =
m^i
}=(h,-Jd)e{o,i}d ίίί* - tki) - (f - tki+ι)
Σ
(a)
=0i> jd)e{o,i}
»=i
Σ
ι
n?=ifi
1=1
from (i). Concentrating on A_d(t, k) and using the Lemma, the zero- and first-order terms of the expansion cancel, leaving only the second-order terms. Then again using the Lemma with the appropriate orders, we have

A_d(t, k) = −(1/2) Σ_{l=1}^d ω_l² (t^l − t_{k_l})(t^l − t_{k_l+1}) + o(m^{-2}).

Now we integrate |A_d(t, k)|² over the cell Π_{l=1}^d [t_{k_l}, t_{k_l+1}] and sum over the m^d cells. Using that

∫_{t_{k_l}}^{t_{k_l+1}} (t^l − t_{k_l})^i dt^l = 1/((i + 1) m^{i+1}),

we obtain the stated m^{-4} rate and the asymptotic constant. The final result follows from the assumption of the theorem.
Proof of Theorem 2. Let f(t) = exp(−iω't) and

f̂(t) = Σ_{i_1,...,i_d=0}^{2q−1} C_d(t, k, i_1, ..., i_d) Δ^{i_1,...,i_d} f(t_k).

Then the prediction error can be written as:

E(X(t) − X̂(t))² = ∫_{R^d} |f(t) − f̂(t)|² φ_X(ω) dω.

We apply Newton's formulae for f with respect to each argument up to order 2q − 1 and obtain, for any d ≥ 1, a remainder of the form

R_{d,q}(t, k) = Σ_{l=1}^d Σ_{i_1,...,i_{l−1}=0}^{2q−1} C_{l−1}(t, k, i_1, ..., i_{l−1}) (w_{2q}(m(t^l − t_{k_l})) / ((2q)! m^{2q})) (∂^{2q}/∂(t^l)^{2q}) f(t_{k_1}, ..., t_{k_{l−1}}, ξ_l, t^{l+1}, ..., t^d),

where ξ_l ∈ (t_{k_l}, t^l), l = 1, ..., d, and C_0(t, k) = 1. It is clear that for some i_j ≠ 0, C_{l−1}(t, k, i_1, ..., i_{l−1}) w_{2q}(m(t^l − t_{k_l})) = O(1/m). This implies that:

R_{d,q}(t, k) = (1/((2q)! m^{2q})) Σ_{l=1}^d w_{2q}(m(t^l − t_{k_l})) (∂^{2q}/∂(t^l)^{2q}) f(t_{k_1}, ..., t_{k_{l−1}}, ξ_l, t^{l+1}, ..., t^d) + O(m^{-(2q+1)}).

Now f(t) = Π_{l=1}^d exp(−iω_l t^l), so that (∂^{2q}/∂(t^l)^{2q}) f(t) = (iω_l)^{2q} f(t). Using the Taylor expansion we have

(∂^{2q}/∂(t^l)^{2q}) f(t) = (iω_l)^{2q} exp(−i Σ_{l=1}^d ω_l t_{k_l}) {1 + o(1)}.

Moreover, since the intermediate points satisfy ξ_l = t_{k_l} + o(1), it follows that:

R_{d,q}(t, k) = (1/((2q)! m^{2q})) Σ_{l=1}^d w_{2q}(m(t^l − t_{k_l})) (iω_l)^{2q} exp(−i Σ_{l=1}^d ω_l t_{k_l}) {1 + o(1)}.

Now we integrate |f(t) − f̂(t)|² over [0, 1]^d. Since

∫_{t_{k_l}}^{t_{k_l+1}} w_{2q}(m(t^l − t_{k_l}))^j dt^l = (1/m) ∫_0^1 w_{2q}^j(u) du, j = 1, 2,

the squared terms contribute ∫_0^1 w_{2q}²(u) du and the cross terms contribute (∫_0^1 w_{2q}(u) du)², yielding the stated limit. The final result follows from the assumption of the theorem.
References

[1] Benhenni, K. and Cambanis, S. (1992). Sampling designs for estimating integrals of stochastic processes. Ann. Statist. 20, 161-194.

[2] Benhenni, K. (1998). Predicting integrals of stochastic processes: Extensions. J. Appl. Prob. 35, 843-855.

[3] Cressie, N.A.C. (1993). Statistics for Spatial Data. Wiley, New York.

[4] Matern, B. (1986). Spatial Variation (2nd edition). Lecture Notes in Statistics, Vol. 36, Springer, New York.

[5] Muller-Gronbach, T. and Ritter, K. (1997). Uniform reconstruction of Gaussian processes. Stoch. Proc. and their Appl. 69, 55-70.

[6] Sacks, J., Welch, W.J., Mitchell, T.J. and Wynn, H.P. (1989). Design and analysis of computer experiments. Statist. Sci. 4, 409-435.

[7] Su, Y. and Cambanis, S. (1993). Sampling designs for the estimation of a random process. Stoch. Proc. and their Appl. 46, 47-89.

[8] Stein, M.L. (1993). Asymptotic properties of centered systematic sampling for predicting integrals of spatial processes. Ann. Appl. Prob. 3, 874-880.

[9] Stein, M.L. (1995). Predicting integrals of stochastic processes. Ann. Appl. Prob. 5, 158-170.

[10] Stein, M.L. (1999). Interpolation of Spatial Data: Some Theory for Kriging. Springer-Verlag, New York.
ESTIMATING THE VARIANCE OF THE MAXIMUM PSEUDO-LIKELIHOOD ESTIMATOR
Lynne Seymour Department of Statistics University of Georgia Athens, GA 30602-1952 E-mail: [email protected] Abstract The use of the pseudo-likelihood estimator for Gibbs-Markov random field models has a distinct advantage over more conventional approaches, mainly due to its computational efficiency. Indeed, the maximum pseudo-likelihood estimator (MPLE) is often used as the Monte Carlo parameter in Markov chain Monte Carlo (MCMC) simulations. The MPLE itself has some very nice estimation properties, though its variance is still undiscovered. In this paper, the moving-block bootstrap is employed to estimate the variance of the MPLE in the Ising model. KEY WORDS: parameter estimation, Gibbs random fields, Markov random fields, parametric bootstrap, moving-block bootstrap, subsampling
1
Introduction
Gibbs-Markov random fields (GMRFs) are statistical models used to study the spatial relationships among data taken on a grid. Although these models were developed around the turn of the century in physics as models for particle interactions (Gibbs, 1902), it has only been within the past fifteen years or so that they have been seriously considered by statisticians and other scientists as models for spatially related data. The form of the distribution used in practice is fairly simple:
282
SEYMOUR
where x_Λ represents observations taken on a grid Λ, θ represents the parameters of the distribution, V(·) is a function strictly of the data, and Z is a normalizing constant. This model has many applications throughout the scientific community. The pioneering paper of Besag (1974) used simple GMRFs to study the occurrence of new plant growth in a defunct mine, and to study grain and straw yield of wheat plots. In each of these models, the area of interest was partitioned into a grid, and the presence of new growth and grain/straw yields, respectively, were the data taken at each grid site. Classical development of this model has been in image analysis, in which the grid consists of the pixels on the computer screen, and the observed data are the colors at each pixel. The model then aids in, for example, image restoration (Geman and Geman, 1984; Tjelmeland and Besag, 1998), detecting boundaries (Geman, Geman, Graffigne, and Dong, 1990), or recognizing and simulating textures (Geman and Graffigne, 1986; Seymour, 1993). More recently, Smith et al. (2000) and Seymour (2000) use such a model to study social networks; more specifically, the clients of a social service agency are considered the "grid", and the data indicate whether the clients changed case managers. A first step in making inferences with any statistical model is parameter estimation, which can be especially challenging for GMRFs due to the size of the grid and/or the dependence of the data. Maximum likelihood estimation (MLE) is a standard technique for most models, but the nature of the GMRF model makes the MLE computationally intractable for even reasonable grid sizes. A commonly-used way of circumventing this problem is to approximate the MLE using Markov chain Monte Carlo (MCMC) methods (Geyer and Thompson, 1992), but this can become very cumbersome in some applications and can be unstable for cases in which the dependence in the data is strong (Seymour and Ji, 1996).
Another standard technique is method-of-moments estimation, which requires the dependence between data at neighboring grid points to be weak (Sherman and Carlstein, 1994). A non-standard parameter estimation technique that is very easy to implement is the maximum pseudo-likelihood estimate (MPLE). Instead of maximizing the likelihood based on the model in (1), a pseudo-likelihood is maximized. The pseudo-likelihood which Besag (1975) first proposed simply multiplies the conditional distributions at grid sites given the values at neighboring sites:

PL(θ; x_Λ) = Π_{i∈Λ*} P(X_i = x_i | X_{N_i} = x_{N_i}),   (2)

where Λ* ⊂ Λ is the set of all sites in Λ with a complete set of neighbors in Λ, and N_i ⊂ Λ is the set of points which are neighbors of the site i ∈ Λ*.
283
PSEUDO MLE
These single-site conditional distributions are easily calculated, making the MPLE extremely appealing as an estimation technique. The MPLE is often used to initialize the MCMC method of parameter estimation, which is preferred over the MPLE because it gives an approximation of the likelihood function. The MPLE is used in this way because it is quickly computed, and because it is close to the parameter the MCMC method is trying to estimate (which cuts down on MCMC iterations).
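For the Ising model with states in {−1, +1}, each local characteristic is a logistic function of the neighbor sum, so maximizing (2) is a logistic-type regression problem. The following Python sketch (not the author's implementation; free boundary, plain gradient ascent, hypothetical function name) illustrates how easily the MPLE can be computed:

```python
import numpy as np

def ising_mple(x, iters=200, lr=0.1):
    """Sketch: MPLE of the external field h and interaction J in an Ising
    model on an n x n grid, x in {-1,+1}^(n x n). Maximizes
    sum_i log P(x_i | neighbors) over interior sites by gradient ascent."""
    # neighbor sums s_i for interior sites (all four neighbors present)
    s = (x[:-2, 1:-1] + x[2:, 1:-1] + x[1:-1, :-2] + x[1:-1, 2:]).ravel()
    xi = x[1:-1, 1:-1].ravel()
    h, J = 0.0, 0.0
    for _ in range(iters):
        # d/dh log P(x_i | s_i) = x_i - tanh(h + J s_i); d/dJ gets an extra s_i
        r = xi - np.tanh(h + J * s)
        h += lr * r.mean()
        J += lr * (r * s).mean()
    return h, J
```

On independent (J = 0) data the estimates should be near zero, which gives a quick plausibility check.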
2
Modelling Background
Let Λ_n be an n × n lattice in Z², and let X_i be a random variable associated with the site i ∈ Z². Then X_{Λ_n} = {X_i, i ∈ Λ_n} is called a random field on Λ_n. The state space S is the collection of all possible values of X_i, i ∈ Z², and Ω_n = S^{Λ_n} is the collection of all possible realizations of the random field X_{Λ_n}. For a site i ∈ Z², a collection N_i of sites having the properties i ∉ N_i and i ∈ N_j ⟺ j ∈ N_i is called the neighborhood of the site i. The collection of all neighborhoods N = {N_i, i ∈ Z²} is called the neighborhood system. A random field is called a Markov random field (MRF) with respect to the neighborhood system N if its probability distribution P on Ω_{Z²} satisfies

P(X_i = x_i | X_{Z²∖{i}} = x_{Z²∖{i}}) = P(X_i = x_i | X_{N_i} = x_{N_i})   (3)

for each i ∈ Z² and x ∈ Ω_{Z²}, where Z²∖{i} is the set of all sites in Z² except site i. These single-site conditional probabilities are called the local characteristics of the MRF. The pair potential U = {u(x_i, x_j) : i, j ∈ Z², i ≠ j} is a collection of deterministic functions which quantify how values at pairs of sites interact. If each of the functions in U is zero for every pair of sites farther than a fixed finite range R from each other, then the collection is called a pair-potential of range R. The energy associated with x ∈ Ω_{Z²} on Λ_n, denoted H_{Λ_n}(x), is a functional of u(·, ·) which summarizes all of the pair interactions of the random field on Λ_n. For example, an energy function may take the form

H_{Λ_n}(x) = (1/2) Σ_{i∈Λ_n} Σ_{j∈N_i, j∈Λ_n} u(x_i, x_j),

which gives an additive summary of the pair interactions on Λ_n. A random field is called a Gibbs random field (GRF) induced by the pair-potential U if its probability distribution P on Ω_{Z²} satisfies

P(X_{Λ_n} = x_{Λ_n} | X_{Λ_n^c} = x_{Λ_n^c}) = exp{−H_{Λ_n}(x_{Λ_n}; x_{Λ_n^c})} / Z_{Λ_n},   (4)

where Z_{Λ_n} is a normalizing factor which is a sum over all possible x ∈ Ω_{Λ_n}. One of the more difficult problems with using GRFs is that, in some cases, these conditional distributions do not uniquely determine the distribution of the field. Such a condition is called phase transition, and implies spatial long-range dependence in the field. A random field is an MRF if and only if it is a GRF induced by finite-range potentials (Besag, 1974; Geman, 1991). The model given in (1) is an exponential family random field which satisfies both (3) and (4); hence the name Gibbs-Markov random field.
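The local characteristics (3) are also what drive simulation of a GMRF: a single-site Gibbs sampler resamples each site from P(x_i | x_{N_i}). A hedged sketch for an Ising-type pair potential (hypothetical helper; periodic boundary for simplicity):

```python
import numpy as np

def gibbs_sweep_ising(x, J, h, rng):
    """Sketch: one systematic-scan Gibbs sweep for an Ising-type GMRF on a
    torus; each site is resampled from its local characteristic, a logistic
    function of the neighbor sum. Entries of x are in {-1, +1}."""
    n = x.shape[0]
    for i in range(n):
        for j in range(n):
            s = (x[(i - 1) % n, j] + x[(i + 1) % n, j]
                 + x[i, (j - 1) % n] + x[i, (j + 1) % n])
            p = 1.0 / (1.0 + np.exp(-2.0 * (h + J * s)))  # P(x_ij = +1 | neighbors)
            x[i, j] = 1 if rng.random() < p else -1
    return x
```

With J = h = 0 every site is resampled as a fair coin, so a few sweeps wash out any initial configuration.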
3
The MPLE and its Properties
The likelihood function is computationally intractable for GMRFs due to the normalizing factor in (4): For a 100 x 100 binary grid, which is trivial in most applications, this factor is a sum of 2 1 0 0 terms. Hence alternatives to the likelihood are very desirable for these models. Computationally intensive alternatives include using a Markov chain Monte Carlo approximation to the likelihood function (Geyer and Thompson, 1992), and estimating the normalizing factor via simulation. The alternative which was first proposed by Besag (1975) is to use the pseudo-likelihood in (2), rather than the likelihood. Note that the pseudolikelihood is simply the product of the local characteristics (using the Markov property), and that the conditional distributions at a single site make the normalizing factor in (4) much easier to compute. Parameter estimates derived by maximizing the pseudo-likelihood are called maximum pseudolikelihood estimates (MPLEs). Asymptotics in a random field setting are traditionally taken on a single realization of an n x n region as n goes to infinity. The existence, uniqueness, and strong consistency of the MPLE were established by Geman and Graffigne (1986), while Comets (1992) established that the convergence rate of the MPLE (and, incidentally, of the MLE) is on the order of e~n as n —> oo. Consistency of the MPLE is not restricted to a grid: Jensen and M0ller (1991) show the MPLE to be consistent for point processes, and Mase (1995) shows it to be consistent for continuous-state-space Gibbs processes. The moderate deviation probabilities for the MPLE have been shown to decay as n~ α as n -> oo, where a is not necessarily less than one (Ji and Seymour, 1996). Each of these convergence results have no requirements on the strength of the dependence in the random field. 
In addition, a model selection criterion based on penalized pseudo-likelihood, similar to the Bayesian information criterion (BIC) of time series, was shown to be weakly consistent without assuming any conditions on the strength of dependence. In contrast, the traditional BIC for GMRFs using
PSEUDO MLE
285
penalized likelihood (and using the MCMC approximation to the likelihood) was shown to approximate the Bayesian solution (i.e., the solution which minimizes the risk under a 0-1 loss function) to the model selection problem only under a weak-dependence condition (Seymour and Ji, 1996). This would seem to imply that the MPLE in some way makes better use of the dependence structure in the observed GMRF than the MLE or its MCMC approximation. Further study is needed into the sufficiency of these estimates. Guyon (1987) has shown the asymptotic normality of the MPLE under weak dependence conditions (specifically, under spatial mixing with an exponentially decaying mixing coefficient). Jensen and Künsch (1994) established the asymptotic normality of a stochastically normed MPLE for certain point processes, regardless of the strength of dependence. Janzura and Lachout (1995) showed that the sampling distribution of the MPLE is a normal mixture when the strength of spatial dependence is large, and Comets and Janzura (1998) more recently established the asymptotic normality of the stochastically normed MPLE in complete generality for conditionally centered random fields. Understanding the variance of the MPLE is a difficult proposition, even for the simplest GMRF models (Cressie, 1993), though some results are known. Guyon (1987) derived the asymptotic variance under weak spatial dependence, but it depends upon the intractable joint distribution of the field. The (restricted) mean square error has been shown to decrease asymptotically like n^{-2} (Ji and Seymour, 1996) regardless of the strength of dependence. Sherman (1996) developed a moving-block bootstrap, or sub-sampling, method which can be used for estimating the variance, and which is L²-consistent as long as the MPLE itself is asymptotically normal (i.e., under weak mixing conditions). The variance is an important component for MPLE-based inference, but much remains to be done in both deriving and estimating it.
Once the variance of the MPLE is better understood, the efficiency of the MPLE relative to the MLE may be investigated. The MLE has been shown to be efficient for GMRFs (Mase, 1984). The efficiency of the MPLE relative to the MLE has been tabulated (Besag, 1977; Kashyap and Chellappa, 1983), and it appears that the MPLE is not as efficient as the MLE.
4 The Variance of the MPLE via Simulation
Before one can evaluate methods for estimating the variance of the MPLE, one must have an idea of what the target value is. Thus a simulation study was conducted in an effort to understand what the variance of the MPLE
SEYMOUR
286
looks like. This small simulation study employed the well-known Ising model, in which the neighborhood around each site consists of its four nearest neighbors, and the interactions between sites i and j behave according to θ x_i x_j. The parameter θ governs how strongly the data are dependent, and in fact the Ising model is the only Gibbs distribution for which there are necessary and sufficient conditions for no phase transition in terms of an explicit critical value: |θ| < ½ ln(1 + √2) ≈ 0.44 (Ellis, 1985). Thus, in the case that θ = 0.1, there is no phase transition and the data are weakly dependent, as seen in Figure 1; if θ = 1 then there is phase transition and they are very strongly dependent, as seen in Figure 2. The Gibbs sampler (Geman and Geman, 1984) was used to generate 100 random fields of several sizes from these Ising models. The MPLE was then calculated on each of these fields to simulate its sampling distribution, and the sample variance of this sampling distribution was computed.

Figure 1: A Realization for θ = 0.1
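A minimal single-site Gibbs sampler for this Ising model can be sketched as follows (illustrative only, not the code used in the study; the sweep count, free boundaries, and function name are assumptions):

```python
import numpy as np

def gibbs_ising(n, theta, sweeps=200, rng=None):
    """Generate an n x n Ising field (+/-1 states) by Gibbs sampling.

    Each sweep visits every site and resamples it from its conditional
    distribution given the current values of its four nearest neighbors.
    """
    rng = np.random.default_rng(rng)
    x = rng.choice([-1, 1], size=(n, n))
    for _ in range(sweeps):
        for i in range(n):
            for j in range(n):
                s = 0
                if i > 0:
                    s += x[i - 1, j]
                if i < n - 1:
                    s += x[i + 1, j]
                if j > 0:
                    s += x[i, j - 1]
                if j < n - 1:
                    s += x[i, j + 1]
                # P(x_ij = +1 | neighbors) under interaction theta*x_i*x_j.
                p_plus = 1.0 / (1.0 + np.exp(-2.0 * theta * s))
                x[i, j] = 1 if rng.random() < p_plus else -1
    return x

field = gibbs_ising(20, theta=0.1, sweeps=50, rng=0)
```

Repeating this for many independent runs and computing the MPLE on each run simulates the MPLE's sampling distribution, as done in this section.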
Figure 2: A Realization for θ = 1
Table 1 and Table 2 give the average of 10 replications of the sample variances for the simulated sampling distribution of the MPLE for several n x n random fields. The third column of each table shows (variance) × n² in an effort to understand whether the variance is proportional to n^{-2}. Figure 3 and Figure 4 plot the 10 sample variances, as well as these values multiplied by n², against n. In the weak-dependence case (Table 1; Figure 3), it is clear that the variance behaves as O(n^{-2}). As is to be expected, this same rate is not as apparent in the strong-dependence case (Table 2; Figure 4); however, if one discounts the case n = 100 as possibly being too small for the asymptotics, then the relationship is more clear.

Table 1: Average Simulated Variances, θ = 0.1

    n       average variance    (average variance) × n²
    50      0.00021125          0.5281
    100     0.00004664          0.4664
    300     0.00000570          0.5120
    500     0.00000200          0.4889
    1000    0.00000040          0.4437
Figure 3: Simulated Variances, θ = 0.1
Table 2: Average Simulated Variances, θ = 1

    n       average variance    (average variance) × n²
    100     0.00542753          54.26
    300     0.00040862          36.78
    500     0.00014896          37.24
    1000    0.00003478          34.78
Figure 4: Simulated Variances, θ = 1
5 Estimating the Variance via Sub-sampling
In Section 4, the variance of the MPLE was estimated by sampling many random fields. Rather than sampling repeatedly, one may simply subsample the random field at hand, potentially (though not necessarily) saving much computing time. This will yield accurate results in many cases, most notably when dependence is weak (Sherman, 1996). Sherman and Carlstein (1994) introduced a sub-sampling procedure to estimate a general statistic (cf. also Politis and Romano, 1993) which is useful in estimating the variance of the MPLE. However, the MPLE must be asymptotically normal for their estimate to be consistent, which is the case only under weak dependence. The conjecture explored herein is that a sub-sampling scheme may be used to estimate the variance of the MPLE whenever there is a limiting distribution for the MPLE, as is the case whether dependence is weak or strong. In the following examples, we again look at the two cases of the Ising model described in Section 4. In each case, we sample one n x n GMRF, n = 100, 300, 500, 1000. We then partition that GMRF into sub-blocks that are 10%, 15%, 20%, and 25% of the size of the sampled GMRF. In addition, we allow three different degrees of overlap: none (shifting a whole sub-block width), half (shifting a half sub-block width), and one pixel (shifting one pixel, the maximum overlap possible while still shifting). Table 3 and Table 4 summarize the results of the sub-blocking schemes given above and compare the results to the simulated variances found in Section 4. Several interesting phenomena may be observed. Table 3 contains the results for the weak dependence case, the case for which much is known. As is easily seen, the size of the sub-blocks and the amount of overlap matter very little, and all sub-sampled estimates are close to the simulated variance.
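A minimal version of the sub-sampling scheme looks like the following (an illustrative sketch, not Sherman and Carlstein's exact construction; the (block/n)² rescaling is an assumption matching the O(n^{-2}) rate observed in Section 4):

```python
import numpy as np

def subsample_variance(field, block, shift, estimator):
    """Sub-sampling variance estimate for a statistic of an n x n field.

    The statistic is recomputed on every block x block sub-field obtained
    by moving the window 'shift' pixels at a time; the rescaled sample
    variance of these values estimates the variance of the full-field
    statistic.
    """
    n = field.shape[0]
    values = np.asarray([
        estimator(field[i:i + block, j:j + block])
        for i in range(0, n - block + 1, shift)
        for j in range(0, n - block + 1, shift)
    ])
    # Rescale from block-sized fields to the full n x n field, assuming
    # the variance of the statistic decays like (field size)^-2.
    return values.var(ddof=1) * (block / n) ** 2

rng = np.random.default_rng(0)
x = rng.choice([-1.0, 1.0], size=(100, 100))
est = subsample_variance(x, block=20, shift=20, estimator=np.mean)
```

In practice `estimator` would be the MPLE computed on the sub-block; `np.mean` is used here only to keep the sketch self-contained.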
Note that for the larger cases of n = 500 and n = 1000, the one-pixel shift is omitted because the number of sub-blocks for the one-pixel shift is prohibitively large in these cases.
Table 3: Variance Estimates when θ = 0.1
(estimated and simulated variances in units of 10^-6)

    N     Sub-block  Shift  Number of    Average  Estimated  Simulated
          size              sub-blocks   MPLE     Var(MPLE)  Var(MPLE)
    100   10         Whole  100          0.1152   80         46.6
          10         Half   361          0.1128   89
          10         One    8281         0.1157   83
          15         Whole  36           0.1182   69
          15         Half   121          0.1107   54
          15         One    7396         0.1113   59
          20         Whole  25           0.1097   63
          20         Half   81           0.1081   62
          20         One    6561         0.1100   49
          25         Whole  16           0.1110   74
          25         Half   36           0.1102   50
          25         One    5776         0.1091   42
    300   30         Whole  100          0.0993   6          5.7
          30         Half   361          0.0988   6
          30         One    73441        0.0991   6
          45         Whole  36           0.1006   6
          45         Half   144          0.0988   6
          45         One    65536        0.0989   5
          60         Whole  25           0.0994   4
          60         Half   81           0.0987   5
          60         One    58081        0.0991   5
          75         Whole  16           0.0991   6
          75         Half   36           0.0997   4
          75         One    51076        0.0994   5
    500   50         Whole  100          0.1003   2          2.0
          50         Half   361          0.1013   2
          75         Whole  36           0.1012   2
          75         Half   144          0.1016   2
          100        Whole  25           0.1002   2
          100        Half   81           0.1010   2
          125        Whole  16           0.1003   2
          125        Half   36           0.1033   2
    1000  100        Whole  100          0.0998   0.5        0.4
          100        Half   361          0.0997   0.5
          150        Whole  36           0.1003   0.6
          150        Half   144          0.1000   0.4
          200        Whole  25           0.0997   0.3
          200        Half   81           0.1000   0.5
          250        Whole  16           0.0997   0.6
          250        Half   49           0.1000   0.5
Table 4, which presents the results in the strong dependence case, contains much more interesting phenomena. Notice how the schemes with the smaller sub-blocks give bad estimates of the variance, and indeed of the MPLE itself. This may be understood intuitively by realizing that if the sub-blocks do not contain an accurate representation of the dependence structure in the entire field, the sub-sampling scheme will fail to provide an accurate estimate. For θ = 1, it is clear that good estimates of the MPLE can be obtained from sub-blocks of size 75 or larger. Once the sub-block size becomes large enough to capture the dependence structure in the entire field, the degree of overlap begins to play a role in the accuracy of the estimate. Even that effect, however, seems to fade, as seen in the n = 500 and n = 1000 cases. Again, the cases employing maximum overlap were too computationally intense for inclusion here.
Table 4: Variance Estimates when θ = 1

    N     Sub-block  Shift  Number of    Average  Estimated  Simulated
          size              sub-blocks   MPLE     Var(MPLE)  Var(MPLE)
    100   10         Whole  100          4.5685   0.021924   0.00542753
          10         Half   361          4.6020   0.019078
          10         One    8281         5.2807   0.013587
          15         Whole  36           4.2588   0.067236
          15         Half   121          4.6205   0.066193
          15         One    7396         4.6828   0.064116
          20         Whole  25           3.7488   0.160362
          20         Half   81           3.5458   0.161188
          20         One    6561         4.2046   0.170413
          25         Whole  16           3.0882   0.313485
          25         Half   36           2.6370   0.191927
          25         One    5776         3.5785   0.324719
    300   30         Whole  100          2.8113   0.045737   0.00040862
          30         Half   361          2.9351   0.048044
          30         One    73441        3.1094   0.051915
          45         Whole  36           2.1644   0.087682
          45         Half   144          2.0011   0.078927
          45         One    65536        1.9984   0.079410
          60         Whole  25           1.3051   0.043379
          60         Half   81           1.2719   0.042598
          60         One    58081        1.3693   0.060681
          75         Whole  16           1.0137   0.000794
          75         Half   36           0.9917   0.000387
          75         One    51076        1.0665   0.015967
    500   50         Whole  100          1.4124   0.015656   0.00014896
          50         Half   361          1.4113   0.016536
          75         Whole  36           1.0015   0.000129
          75         Half   144          1.0388   0.002883
          100        Whole  25           1.0032   0.000159
          100        Half   81           1.0008   0.000139
          125        Whole  16           0.9984   0.000118
          125        Half   36           0.9955   0.000130
    1000  100        Whole  100          1.0077   0.000042   0.00003478
          100        Half   361          1.0084   0.000042
          150        Whole  36           1.0014   0.000044
          150        Half   144          1.0000   0.000033
          200        Whole  25           0.9990   0.000034
          200        Half   81           0.9988   0.000041
          250        Whole  16           0.9972   0.000040
          250        Half   49           0.9954   0.000039
6 Conclusion
Several fundamental questions about the properties of the MPLE remain unresolved, including the form of its sampling variance. However, because of its simplicity and ease of use, it is an appealing estimator. Even the modest accomplishment of determining the variance of the MPLE's sampling distribution would aid scientists in drawing inferences from their data and, in some cases, render MCMC estimation redundant. This simulation study demonstrates that the variance of the MPLE can be estimated in spite of the fact that its analytical form is unknown. Weaknesses in the sub-sampling strategy which result from overlapping sub-blocks are currently being addressed by recasting the sub-sampled values as a stochastic process in which correlation (induced by overlap) is routine.

Acknowledgement

The author would like to thank Michael Sherman of the Department of Statistics at Texas A&M University for helpful comments and suggestions. The author would also like to thank Ingram Olkin and the Department of Statistics at Stanford University for sponsorship under which this work was completed, as well as to acknowledge partial support of this work by National Science Foundation Grants DMS-9631278 and SES-0074456.

References

Besag, J. (1974), "Spatial interaction and the statistical analysis of lattice systems," Journal of the Royal Statistical Society, Series B 36, 192-236.

Besag, J. (1975), "Statistical analysis of non-lattice data," The Statistician 24, 179-195.

Besag, J. (1977), "Efficiency of pseudolikelihood estimators for simple Gaussian fields," Biometrika 64, 616-618.

Comets, F. (1992), "On consistency of a class of estimators for exponential families of Markov random fields on the lattice," Annals of Statistics 20, 455-468.

Comets, F., and Janzura, M. (1998), "A central limit theorem for conditionally centered random fields with an application to Markov fields," Journal of Applied Probability 35, 608-621.

Cressie, N. (1993), Statistics for Spatial Data, Wiley, New York.

Ellis, R. S. (1985), Entropy, Large Deviations, and Statistical Mechanics, Springer-Verlag, New York.

Geman, D. (1991), "Random fields and inverse problems in imaging," Lecture Notes in Mathematics 1427, 113-193, Springer-Verlag, New York.
Geman, D., Geman, S., Graffigne, C., and Dong, P. (1990), "Boundary detection by constrained optimization," IEEE Transactions on Pattern Analysis and Machine Intelligence 12, 609-628.

Geman, S., and Geman, D. (1984), "Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images," IEEE Transactions on Pattern Analysis and Machine Intelligence 6, 721-741.

Geman, S., and Graffigne, C. (1986), "Markov random field image models and their applications to computer vision," Proceedings of the International Congress of Mathematicians, 1496-1517, Berkeley, CA.

Geyer, C. J., and Thompson, E. A. (1992), "Constrained Monte Carlo maximum likelihood for dependent data," Journal of the Royal Statistical Society, Series B 54, 657-699.

Gibbs, J. W. (1902), Elementary Principles of Statistical Mechanics, Yale University Press.

Guyon, X. (1987), "Estimation d'un champ par pseudo-vraisemblance conditionnelle: étude asymptotique et application au cas Markovien," in Spatial Processes and Spatial Time Series Analysis (Proceedings of the Sixth Franco-Belgian Meeting of Statisticians, 1985), F. Droesbeke, ed., Publications des Facultés Universitaires Saint-Louis, Brussels, Belgium, 16-62.

Guyon, X., and Künsch, H. R. (1992), "Asymptotic comparison of estimators in the Ising model," in Stochastic Models, Statistical Methods and Algorithms in Image Analysis (P. Barone, A. Frigessi, and M. Piccioni, eds.), Lecture Notes in Statistics 74, 177-198, Springer, New York.

Ising, E. (1925), "Beitrag zur Theorie des Ferromagnetismus," Zeitschrift für Physik 31, 253-258.

Janzura, M. (1994), "Asymptotic normality of the maximum pseudo-likelihood estimate of parameters for Gibbs random fields," Transactions of the 12th Prague Conference on Information Theory, Statistical Decision Functions, and Random Processes, 119-121.

Janzura, M., and Lachout, P. (1995), "A central limit theorem for stationary random fields," Mathematical Methods of Statistics 4, 463-471.

Jensen, J. L., and Künsch, H. R. (1994), "On asymptotic normality of pseudo likelihood estimates for pairwise interaction processes," Annals of the Institute of Statistical Mathematics 46, 475-486.

Jensen, J. L., and Møller, J. (1991), "Pseudolikelihood for exponential family models of spatial point processes," Annals of Applied Probability 1, 445-461.

Ji, C., and Seymour, L. (1996), "A consistent model selection procedure for Markov random fields based on penalized pseudo-likelihood," Annals of Applied Probability 6, 423-443.
Kashyap, R., and Chellappa, R. (1983), "Estimation and choice of neighbors in spatial-interaction models of images," IEEE Transactions on Information Theory 29, 60-72.

Mase, S. (1984), "Locally asymptotic normality of Gibbs models on a lattice," Advances in Applied Probability 16, 585-602.

Mase, S. (1995), "Consistency of the maximum pseudo-likelihood estimator for continuous state space Gibbsian processes," Annals of Applied Probability 5, 603-612.

Politis, D. N., and Romano, J. P. (1993), "On the sample variance of linear statistics derived from mixing sequences," Stochastic Processes and Their Applications 45, 155-167.

Propp, J. G., and Wilson, D. B. (1995), "Exact sampling with coupled Markov chains and applications to statistical mechanics," Random Structures and Algorithms 9, 223-252.

Seymour, L. (1993), "Parameter estimation and model selection in image analysis using Gibbs-Markov random fields," Dissertation, The University of North Carolina.

Seymour, L., and Ji, C. (1996), "Approximate Bayes model selection procedures for Gibbs-Markov random fields," Journal of Statistical Planning and Inference, Special Issue on Spatial Statistics 51, 75-97.

Seymour, L. (2000), "Gibbs regression and a test for goodness of fit," in Goodness-of-Fit Tests and Model Validity (C. Huber-Carol, N. Balakrishnan, M. S. Nikulin, and M. Mesbah, eds.), Birkhäuser, Boston. To appear.

Seymour, L., Smith, R. L., Calloway, M. O., and Morrissey, J. (2000), "Lattice models for social networks with binary data," Technical Report #2000-24, Department of Statistics, University of Georgia.

Sherman, M. (1996), "Variance estimation for statistics computed from spatial lattice data," Journal of the Royal Statistical Society, Series B 58, 509-523.

Sherman, M., and Carlstein, E. (1994), "Nonparametric estimation of the moments of a general statistic computed from spatial data," Journal of the American Statistical Association 89, 496-500.

Tjelmeland, H., and Besag, J. (1998), "Markov random fields with higher-order interactions," Scandinavian Journal of Statistics 25, 415-433.
Institute of Mathematical Statistics LECTURE NOTES — MONOGRAPH SERIES
A REVIEW ON INHOMOGENEOUS MARKOV POINT PROCESSES Eva B. Vedel Jensen and Linda Stougaard Nielsen Laboratory for Computational Stochastics University of Aarhus Abstract Recent models for inhomogeneous spatial point processes with interaction are reviewed. The focus is on models derived from homogeneous Markov point processes. For some of the models, the interaction is location dependent. A new type of transformation related model with this property is also suggested. The statistical inference based on likelihood and pseudolikelihood is discussed for the different models. In particular, it is shown that for transformation models, the pseudolikelihood function can be decomposed in a similar fashion as the likelihood function. Keywords: Cox point processes, Gibbs point processes, inhomogeneity, interaction, likelihood, Markov point processes, Papangelou conditional intensity, Poisson point processes, pseudolikelihood, thinning, transformation
1 Introduction
In recent years, models for inhomogeneous point processes with interaction have been suggested by several authors. We will in the present paper concentrate on three ways of introducing inhomogeneity into a Markov model, i.e. inhomogeneity induced by a non-constant first-order interaction (Stoyan and Stoyan (1998); see also Ogata and Tanemura (1986)), by thinning of a homogeneous Markov point process (Baddeley et al. (2000)) and by transformation of a homogeneous Markov point process (Jensen and Nielsen (2000)). The aim is to give a unified exposition of these models in order to be able to assess their relative merits and point to research problems that remain to be solved in this area. We restrict attention to finite point processes. For any of the three point process models to be considered, the inhomogeneity may be described by a function λ defined on the same set as the points. In the case where the
inhomogeneous point process is Poisson, λ is the ordinary intensity function. In addition to the point pattern, explanatory variables may be observed at each point, for the purpose of explaining the inhomogeneity. Such information may be included in any of the models. The interaction specified in the models may or may not be location dependent. In Section 2, inhomogeneous Poisson and Cox point processes are considered. In Section 3.1, a short summary of homogeneous Markov point processes is given, followed in Section 3.2 by a formal description of the three types of inhomogeneous point processes derived from homogeneous Markov point processes. In Section 3.3, parametric specification of the inhomogeneity is discussed, while in Section 3.4 parametric statistical inference is outlined for each of the three model types. Section 4 contains a discussion and some considerations concerning future research.
2 Poisson and Related Point Processes
We will throughout the paper consider point processes on a k-dimensional manifold 𝒳 in ℝ^m. Often, 𝒳 will be full-dimensional such that k = m. (Formally, we will call 𝒳 full-dimensional if 𝒳 is a regular compact, that is, a non-empty compact subset of ℝ^m which is the closure of its interior.) But 𝒳 may for instance also be a planar curve or a spatial surface. We will assume that 0 < λ_k(𝒳) < ∞, where λ_k is the k-dimensional volume measure in ℝ^m (Hausdorff measure). We let B₀ be the bounded Borel subsets of 𝒳. It is easy to introduce inhomogeneity within the class of Poisson point processes. Let μ be a locally finite, non-atomic measure on 𝒳 with density λ with respect to λ_k, and let n(·) be the number of elements of a set. A point process X on 𝒳 (a random finite subset of 𝒳) is then said to be a Poisson point process with intensity function λ if

• for all A ∈ B₀: n(X ∩ A) ~ Po(∫_A λ(u) du^k);
• for A₁, ..., A_s ∈ B₀ disjoint: n(X ∩ A₁), ..., n(X ∩ A_s) are independent.

We have here used the short notation du^k for λ_k(du). It can be shown that the first requirement is enough, cf. e.g. Stoyan et al. (1995). If λ is constant the Poisson point process is said to be homogeneous; otherwise the process is inhomogeneous. A homogeneous Poisson point process is often used as a reference (null) model. The reason is the following result:
• Let X be a homogeneous Poisson point process on 𝒳 and let A ∈ B₀. Then, given n(X ∩ A) = n, X ∩ A is a binomial process with n points, i.e. X ∩ A is distributed as {X₁, ..., X_n} where X₁, ..., X_n are independent and identically uniformly distributed on A.
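The binomial process property gives an immediate simulation recipe for the homogeneous case: draw the count from the Poisson distribution, then scatter that many points uniformly. A minimal sketch (the function name is an illustrative assumption):

```python
import numpy as np

def homogeneous_poisson(intensity, a, b, rng=None):
    """Simulate a homogeneous Poisson process on the rectangle [0,a] x [0,b].

    First draw the total count n(X) ~ Po(intensity * area); then, by the
    binomial process property, place that many points independently and
    uniformly over the rectangle.
    """
    rng = np.random.default_rng(rng)
    n = rng.poisson(intensity * a * b)
    return rng.uniform([0.0, 0.0], [a, b], size=(n, 2))

pts = homogeneous_poisson(100.0, 1.0, 1.0, rng=0)
```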
Figure 1: Realizations of inhomogeneous point processes in the unit circular disc with inhomogeneity function λ(η) ∝ exp(θ d_C(η)), where d_C(η) is the distance from η to C, the centre of the disc: (a) Poisson; (b) Strauss. The point pattern to the left is inhomogeneous Poisson, i.e. no interaction between points, whereas the point pattern to the right is inhomogeneous Strauss with γ = 0.01 and therefore shows inhibition between points. For details, see Section 3.3 and Appendix III. The number of points in the Poisson process has been chosen to equal the number of points in the Strauss process. For both processes, the distribution of the points in the shaded sampling window T remains the same if T is rotated around C.

The independence property of the homogeneous Poisson point process ensures that there is no interaction in the process; the uniformity means that the process is homogeneous. The inhomogeneity of a point process may depend on explanatory variables. One simple geometric example is an intensity function of the form

    λ(η) ∝ exp(θ d_C(η)),
where d_C(η) is the distance from η to a reference structure C. For m = k = 2, the reference structure may be a point or a planar curve. In Figure 1, C is a point (the centre of a circle) while, in Figure 2, C is a straight line (the centre of a linear band). For m = k = 3, the reference structure may be a point, a spatial curve or a spatial surface. See also Berman (1986) and references therein. Points lying on curves in two or three dimensions or points lying on spatial surfaces may also show inhomogeneity. In Figures 3 and 4, point processes on the unit circle S¹ and unit sphere S² are shown. (Points are represented as directions in Figure 3.) In any of the Figures 1 to 4, Poisson point processes are shown to the left while corresponding processes with inhibition between points are shown to the right (for details, see Section 3.3).

Figure 2: Realizations of inhomogeneous point processes in the unit band {η ∈ ℝ² : d_C(η) < 1}, where d_C(η) is the distance from η to C, the full-drawn horizontal line: (a) Poisson; (b) Strauss. The inhomogeneity function is as in Figure 1 and likewise the right-hand point pattern is inhomogeneous Strauss and the left is inhomogeneous Poisson with the same number of points. The distribution of points in the shaded sampling window T remains the same under horizontal translations of T.

Statistical inference for inhomogeneous Poisson processes with a parametrically specified intensity function can be performed as follows. Let X be an inhomogeneous Poisson point process with intensity function λ_θ, θ ∈ Θ ⊆ ℝ^l. Then, the likelihood function of θ with respect to the homogeneous Poisson point process with intensity 1 takes the form
    L₀(θ; x) = exp(−∫_𝒳 [λ_θ(u) − 1] du^k) ∏_{η ∈ x} λ_θ(η).    (1)
We use index 0 in this likelihood because later it enters into more complicated likelihoods. In Berman and Turner (1992), log-linear models for λ_θ are discussed, i.e. log λ_θ(η) = θ · τ(η), where τ(η) = (τ₁(η), ..., τ_l(η)) is a list of explanatory variables evaluated at η and · indicates the Euclidean inner product. After approximation of the integral by a finite sum, the likelihood takes the same analytical form as the likelihood of a generalized linear model with Poisson responses, and standard software can be used to analyze the model. See also Rathbun (1996). Alternatively, the intensity function can be estimated non-parametrically, using kernel estimation (Silverman (1986)) or a Bayesian method (Heikkinen and Arjas (1998)).
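The reduction to a finite sum can be sketched in one dimension as follows (hypothetical data and grid, equal quadrature weights; this is an illustration of the idea, not Berman and Turner's exact weighting scheme): the integral in (1) is replaced by a weighted sum over grid points, after which the log-likelihood can be maximized directly or handed to Poisson-GLM software.

```python
import numpy as np

# Hypothetical observed points on [0, 1] and a quadrature grid (assumptions).
points = np.array([0.55, 0.62, 0.71, 0.80, 0.88, 0.93, 0.97])
grid = np.linspace(0.0, 1.0, 201)
w = np.full(grid.size, 1.0 / grid.size)        # equal quadrature weights

def tau(u):
    # Single explanatory variable tau(u) = u, so log lambda_theta(u) = theta*u.
    return u

def approx_loglik(theta):
    # log L0(theta) ~ sum_i theta*tau(x_i) - sum_j w_j exp(theta*tau(u_j)),
    # dropping the terms of (1) that do not involve theta.
    return np.sum(theta * tau(points)) - np.sum(w * np.exp(theta * tau(grid)))

# Crude maximization over a grid of theta values.
thetas = np.linspace(-5.0, 10.0, 1501)
theta_hat = thetas[np.argmax([approx_loglik(t) for t in thetas])]
```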
Figure 3: Realizations of inhomogeneous point processes on the unit circle S¹: (a) Poisson; (b) Strauss. The situation is the same as in Figure 1 except that d_C(η) is the distance along the circle from η to the point C = (cos(2π/3), sin(2π/3)) marked with an arrow. The points are shown as directions.
Figure 4: Realizations of inhomogeneous point processes on the unit sphere S²: (a) Poisson; (b) Strauss. The situation is the same as in Figure 1 except that d_C(η) is the geodesic distance from η to C, which is the north pole.
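Inhomogeneous Poisson patterns like the left panel of Figure 1 can be generated by thinning a dominating homogeneous process. A minimal sketch (the function name and the base intensity β are illustrative assumptions) uses λ(η) = β exp(θ d_C(η)) on the unit disc, which is bounded by β exp(θ) since d_C ≤ 1:

```python
import numpy as np

def inhom_poisson_disc(beta, theta, rng=None):
    """Inhomogeneous Poisson process on the unit disc with intensity
    lambda(eta) = beta * exp(theta * d_C(eta)), d_C = distance to centre,
    simulated by thinning a dominating homogeneous process."""
    rng = np.random.default_rng(rng)
    lam_max = beta * np.exp(theta)
    # Dominating homogeneous process on the bounding square [-1, 1]^2.
    n = rng.poisson(lam_max * 4.0)
    pts = rng.uniform(-1.0, 1.0, size=(n, 2))
    d = np.hypot(pts[:, 0], pts[:, 1])
    # Retain a point with probability lambda(eta) / lam_max, inside the disc.
    keep = (d <= 1.0) & (rng.random(n) < beta * np.exp(theta * d) / lam_max)
    return pts[keep]

pts = inhom_poisson_disc(beta=100.0, theta=1.0, rng=0)
```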
As a generalization, one may consider inhomogeneous Cox processes, i.e. inhomogeneous doubly stochastic Poisson point processes. The definition of a Cox process is as follows. Let Λ be a random intensity function on 𝒳. Then, X is a Cox process if, given Λ = λ, X is a Poisson point process with intensity function λ, cf. Stoyan et al. (1995, p. 154). In Møller et al. (1998) and Brix and Møller (1998), log Gaussian Cox processes are discussed, i.e. Cox processes for which Λ = e^Y and Y = {Y_s}_{s∈𝒳} is a Gaussian field. Inhomogeneity is introduced by letting the mean value of Y_s depend on s, see also Møller (1999a). Clustered inhomogeneous point patterns may be modelled by this process, and this appears to be a natural model if the aggregation is due to stochastic environmental heterogeneity. This type of model has in Brix and Møller (1998) been used to describe the spatio-temporal development of two types of weeds in an organic barley field.
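A rough discretized sketch of such a log Gaussian Cox process (the covariance choice and parameter names are illustrative assumptions, not those of the cited papers): draw a Gaussian field on a grid with an exponential covariance, exponentiate it to obtain the random intensity, and draw cell counts as conditionally independent Poisson.

```python
import numpy as np

def lgcp_counts(nx, mean, sd, corr_len, rng=None):
    """Cell counts of a log Gaussian Cox process on an nx x nx grid.

    A Gaussian field Y with exponential covariance is drawn on the grid,
    the random intensity is Lambda = exp(Y), and, given Lambda, the cell
    counts are independent Poisson. Inhomogeneity would enter through a
    non-constant mean of Y (constant here for simplicity).
    """
    rng = np.random.default_rng(rng)
    xs, ys = np.meshgrid(np.arange(nx), np.arange(nx))
    sites = np.column_stack([xs.ravel(), ys.ravel()]).astype(float)
    dist = np.linalg.norm(sites[:, None, :] - sites[None, :, :], axis=2)
    cov = sd**2 * np.exp(-dist / corr_len)
    y = rng.multivariate_normal(np.full(nx * nx, mean), cov)
    return rng.poisson(np.exp(y)).reshape(nx, nx)

counts = lgcp_counts(8, mean=0.5, sd=0.5, corr_len=2.0, rng=0)
```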
3 Markov Point Processes
If one wants to describe inhibition in addition to clustering, then the class of Markov point processes is useful, cf. Ripley and Kelly (1977), Baddeley and Møller (1989), the recent monograph van Lieshout (2000) and references therein. Let us start by recalling a few preliminaries for Markov point processes.

3.1 Homogeneous Markov Point Processes
Let ~ be a reflexive and symmetric relation on 𝒳. Two points ξ, η ∈ 𝒳 are called neighbours if ξ ~ η. A finite subset x of 𝒳 is called a clique if all points of x are neighbours. By convention, sets of 0 and 1 points are cliques. The set of cliques is denoted C. If 𝒳 ⊂ ℝ^m is full-dimensional then the relation induced by Euclidean distance is often used. If 𝒳 is a planar curve, distances along the curve may be more natural. If 𝒳 is the unit sphere S^{m−1}, such that the observed points in fact are directions, then geodesic distance is natural. Markov point processes are characterized by the Hammersley-Clifford theorem, cf. Ripley and Kelly (1977). This theorem states that a point process X on 𝒳, with density f with respect to the homogeneous Poisson point process with intensity 1, is a Markov point process iff

    f(x) = ∏_{y ⊆ x} φ(y),    (2)

where φ > 0 is an interaction function with respect to ~, i.e. φ(x) = 1 unless x ∈ C. Note that then the Papangelou conditional intensity
    λ(η; x) = f(x ∪ η)/f(x) = ∏_{y ⊆ x} φ(y ∪ η)

depends only on those points in x which are neighbours of η. Here, x ∪ η is short for x ∪ {η}. Let ψ_k be the restriction of φ to subsets consisting of k points. A pairwise interaction process is then a process for which ψ_k = 1 for k > 2. The famous Strauss process (Strauss (1975) and Kelly and Ripley (1976)) is the pairwise interaction process with
    φ(x) = α   if n(x) = 0,
           β   if n(x) = 1,
           γ   if n(x) = 2 and x ∈ C.
The density of the Strauss process becomes, cf. (2),

    f(x) = α β^{n(x)} γ^{s(x)},
where s(x) is the number of neighbour pairs in x. If || · || denotes the usual Euclidean distance and the neighbourhood relation is given by

    ξ ~ η ⟺ ||ξ − η|| ≤ R,
then the process is called a Strauss process with interaction radius R. If 𝒳 ⊂ ℝ^m is full-dimensional, a Markov point process X on 𝒳 is said to be homogeneous if ψ is translation invariant, cf. e.g. Stoyan and Stoyan (1998) and Baddeley et al. (2000). (We assume that φ is defined on all finite subsets of ℝ^m.) Other definitions of homogeneity are of course possible, cf. Jensen and Nielsen (2000). Note that translation invariance implies that ψ₁ is constant and, for k > 1, ψ_k(y) only depends on the relative positions of the k points in y. A homogeneous pairwise interaction process has a density of the form

    f(x) ∝ β^{n(x)} ∏_{{η,ξ} ⊆ x} ψ₂(η − ξ),    (3)

where {η, ξ} ⊆ x indicates that η and ξ are different. For lower dimensional manifolds 𝒳, homogeneity may be defined in terms of invariance under other types of transformations. For instance, for 𝒳 = S^{m−1} a natural set of transformations are the rotations. Recall that the group O(m) of rotations consists of the m × m real matrices O(m) = {A : AA^T = A^T A = I_m}.
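The unnormalized Strauss density is easy to evaluate; the following sketch (illustrative, with the intractable normalizing constant α omitted) computes log(β^{n(x)} γ^{s(x)}) by counting point pairs within the interaction radius R:

```python
import numpy as np

def strauss_unnorm_logdensity(x, beta, gamma, R):
    """Unnormalized log density of a Strauss process, log(beta^n * gamma^s),
    where n is the number of points and s the number of pairs at distance
    at most R (the normalizing constant alpha is omitted)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    diff = x[:, None, :] - x[None, :, :]
    dist = np.linalg.norm(diff, axis=2)
    # Count each unordered pair once via the strict upper triangle.
    s = int(np.sum(dist[np.triu_indices(n, k=1)] <= R))
    return n * np.log(beta) + s * np.log(gamma)
```

With 0 < γ < 1 the density penalizes close pairs, producing the inhibition visible in the right-hand panels of Figures 1 to 4.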
A homogeneous, with respect to this choice, pairwise interaction process on S^{m−1} has a density of the form

    f(x) ∝ β^{n(x)} ∏_{{η,ξ} ⊆ x} ψ₂(η · ξ).

Recall that η · ξ = cos θ, where θ is the angle between η and ξ.

3.2 Introducing Inhomogeneity
Throughout this section, X is a homogeneous Markov point process with respect to ~ on 𝒳 with density

    f(x) = ∏_{y ⊆ x} ψ(y).    (4)

Note that ψ₁ is then constant. Below, we describe three ways of introducing inhomogeneity into the model. The resulting inhomogeneous point process is denoted by Y and is a point process on a k-dimensional manifold 𝒴 in ℝ^d, say. For the first two ways of constructing inhomogeneity, 𝒴 = 𝒳, i.e. the homogeneous process and the associated inhomogeneous process are defined on the same space. The inhomogeneity is described by a function λ(η), η ∈ 𝒴, which we will call the inhomogeneity function. Common to each of the three constructions is the feature that if X is a homogeneous Poisson point process, then the associated inhomogeneous point process Y is Poisson with intensity function proportional to λ. The three inhomogeneous models are therefore extensions of the inhomogeneous Poisson model. An obvious way of introducing inhomogeneity is by making the first-order interaction non-constant. We will call this type I inhomogeneity. The associated inhomogeneous Markov point process has then a density of the form
    f_Y(y) ∝ ∏_{η ∈ y} λ(η) ∏_{z ⊆ y} φ(z).    (5)
This type of model is natural if the interaction does not depend on the local intensity of points. This set-up has been studied in Ogata and Tanemura (1986), Stoyan and Stoyan (1998) and Baddeley and Turner (2000), among others. In Ogata and Tanemura (1986), log λ(η) is a polynomial in Cartesian coordinates, while in Stoyan and Stoyan (1998), a piecewise (region-wise) constant function is studied. It is also interesting to note that in the hierarchical point process models described in Högmander and Särkkä (1999), densities of the form (5) appear.
305
INHOMOGENEOUS MARKOV PROCESSES
Type II inhomogeneity is obtained by using an independent inhomogeneous thinning of the homogeneous Markov point process. Let us suppose that the inhomogeneity function λ(η), η ∈ Y, is bounded by λ_max, and let p(η) = λ(η)/λ_max, η ∈ Y. The inhomogeneous process is then obtained by thinning with p,

Y = {x_i ∈ X : U_i ≤ p(x_i)},

where the U_i are independent uniform random variables on [0, 1].
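The thinning construction takes only a few lines to implement. The sketch below is our own illustration (a uniform binomial sample stands in for the homogeneous Markov process X, and all function names and parameter values are hypothetical):

```python
import math
import random

def thin(points, lam, lam_max):
    """Independent thinning: retain each point eta with probability
    p(eta) = lam(eta) / lam_max, independently of the other points."""
    return [eta for eta in points if random.random() <= lam(eta) / lam_max]

random.seed(0)
# Stand-in for the homogeneous process X: 500 uniform points in the unit square.
X = [(random.random(), random.random()) for _ in range(500)]

# Exponential inhomogeneity in the first coordinate: lam(eta) = exp(theta * eta_1),
# bounded by lam_max = 1 since theta < 0.
theta = -3.0
lam = lambda eta: math.exp(theta * eta[0])
Y = thin(X, lam, lam_max=1.0)
print(len(X), len(Y))
```

If X were Poisson, the thinned pattern Y would again be Poisson with intensity proportional to λ, which is the defining property shared by all three constructions.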
Type III inhomogeneity is obtained by transformation. Let h : X → Y be a 1-1 differentiable mapping. We consider on Y the induced relation

η_1 ≈ η_2 ⟺ h^{-1}(η_1) ~ h^{-1}(η_2),  η_1, η_2 ∈ Y.
Using the induced relation, interactions become location dependent. Note that in this case we have inhomogeneity both in the intensity and in the strength of the interaction. The transformation approach may be extended by using a series of transformations. It can be shown, cf. Jensen and Nielsen (2000, Corollary 3.3), that h(X) = {h(ξ) : ξ ∈ X} is Markov with respect to ≈ on Y and has the density

f_Y(y) = exp(−∫_Y [λ(η) − 1] dη^k) ∏_{η∈y} λ(η) ∏_{z⊆y} ψ(h^{-1}(z)),  (6)

where λ(η) = Jh^{-1}(η), the Jacobian of the inverse transformation h^{-1}. This transformation result can be proved by the coarea formula in geometric measure theory, cf. Jensen (1998). Note that if X is Poisson, then the last product in (6) is of the form β^{n(y)}
and therefore Y is an inhomogeneous Poisson point process with intensity function βλ(·). It is not always easy to find an appropriate transformation which introduces an inhomogeneity of a given form. (The problem to be solved is to find h such that Jh^{-1} = λ, where λ is a given inhomogeneity function.) It is therefore useful to construct approximate transformation models with the same qualitative properties as the original transformation models. Let us suppose that d = m = k, i.e. the manifolds X and Y are full-dimensional. Furthermore, suppose that the original process is a homogeneous pairwise interaction process, cf. (3). Then, the density of the transformed point process is, cf. (6),
f_Y(y) ∝ β^{n(y)} ∏_{η∈y} λ(η) ∏_{{η,ξ}⊆y} φ(h^{-1}(η) − h^{-1}(ξ)).
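The Poisson special case noted below (6) — that transforming a Poisson process yields an inhomogeneous Poisson process with intensity given by the Jacobian of the inverse transformation — is easy to verify numerically. The sketch below is our own illustration with a hypothetical transformation h(x) = x² on [0, 1], for which Jh^{-1}(y) = 1/(2√y):

```python
import math
import random

def poisson_process(rate, rng):
    """Homogeneous Poisson process on [0, 1], simulated via exponential gaps."""
    t, pts = 0.0, []
    while True:
        t += rng.expovariate(rate)
        if t > 1.0:
            break
        pts.append(t)
    return pts

rng = random.Random(1)
rho = 2000.0
X = poisson_process(rho, rng)

# Transform by h(x) = x^2.  The inverse h^{-1}(y) = sqrt(y) has Jacobian
# Jh^{-1}(y) = 1 / (2 sqrt(y)), so Y is inhomogeneous Poisson with
# intensity rho / (2 sqrt(y)) on (0, 1].
Y = [x * x for x in X]

# Expected count in [0, b]: rho * integral_0^b dy / (2 sqrt(y)) = rho * sqrt(b).
b = 0.25
count = sum(1 for y in Y if y <= b)
print(count, rho * math.sqrt(b))
```

The empirical count in [0, b] should agree with ρ√b up to Poisson fluctuations, illustrating that the transformed intensity is indeed proportional to Jh^{-1}.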
Recalling that for a transformation model Jh^{-1} = λ, an obvious way of avoiding the construction of the transformation is to replace

h^{-1}(η) − h^{-1}(ξ)  (7)

by an expression of the form

λ(η)^v λ(ξ)^v (η − ξ),  (8)

where v > 0 is some suitably chosen power. The density of the transformation related model becomes

f_Y(y) ∝ β^{n(y)} ∏_{η∈y} λ(η) ∏_{{η,ξ}⊆y} φ(λ(η)^v λ(ξ)^v (η − ξ)).  (9)
This type of model has also been considered in Baddeley and Turner (2000). Note that for point processes on the real line (k = 1), (8) can, for η and ξ close, be regarded as an approximation to (7) if v = 1/2. This will generally not be the case for k > 1.
3.3 Exponential Inhomogeneity
The inhomogeneity function may be modelled parametrically, non-parametrically or both. If no prior knowledge about the inhomogeneity is available, non-parametric modelling may be useful, at least initially. With knowledge of the inhomogeneity (e.g. monotonically decreasing in a known direction), it can be worthwhile to consider parametrically modelled inhomogeneity such as that of exponential form

λ_θ(η) = a(θ) e^{θ·τ(η)},
where θ ∈ Θ ⊆ R^l and τ(η) ∈ R^l. Let us concentrate on a comparison between type I and type III exponential inhomogeneity. If the inhomogeneity is of type I, then the density of the inhomogeneous point process takes the form, cf. (5),

f_Y(y; θ) ∝ a(θ)^{n(y)} e^{θ·t(y)} ∏_{z⊆y} ψ(z),  (10)
where t(y) = Σ_{η∈y} τ(η). Note that if the homogeneous Markov point process is an exponential family model, then the associated inhomogeneous model is too. In particular, if the homogeneous model is a Strauss model, then
f_Y(y; θ, β, γ) ∝ e^{θ·t(y)} (a(θ)β)^{n(y)} γ^{s(y)}.  (11)
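For concreteness, the unnormalized log-density in (11) can be evaluated directly. The following sketch is our own (function names and the toy configuration are hypothetical); it computes θ·t(y) + n(y) log(a(θ)β) + s(y) log γ, with τ(η) taken as the first coordinate as in the simulations of Figure 5:

```python
import math
from itertools import combinations

def log_density_type1_strauss(y, theta, beta, gamma, a_theta, tau, R):
    """Unnormalized log-density of model (11):
       theta * t(y) + n(y) * log(a(theta) * beta) + s(y) * log(gamma)."""
    t = sum(tau(eta) for eta in y)                    # t(y)
    n = len(y)                                        # n(y)
    s = sum(1 for eta, xi in combinations(y, 2)       # s(y): pairs closer than R
            if math.dist(eta, xi) < R)
    return theta * t + n * math.log(a_theta * beta) + s * math.log(gamma)

# Toy configuration on the unit square: two close points and one distant point.
y = [(0.10, 0.20), (0.12, 0.22), (0.80, 0.90)]
val = log_density_type1_strauss(y, theta=-3.0, beta=200.0, gamma=0.01,
                                a_theta=1.0, tau=lambda eta: eta[0], R=0.05)
print(val)
```

Since 0 < γ < 1, each close pair multiplies the density by γ, penalizing clustered configurations; the exponential factor e^{θ·t(y)} tilts the intensity in the direction determined by τ.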
This is a nice three-parameter exponential family model. If instead the transformation approach is used, we need to find a parametrized class of transformations h_θ, θ ∈ Θ, such that
Jh_θ^{-1}(η) = a(θ) e^{θ·τ(η)},  η ∈ Y.  (12)
Let us give a fairly general geometric example where this problem has a simple solution.

Example 3.1 The example concerns inhomogeneity for point patterns in R^d which depends on the distance d_C to a p-dimensional linear subspace C in R^d, p = 0, 1, ..., d − 1. We will define the transformation h_θ on the whole set {η ∈ R^d : d_C(η) < 1} and let τ(η) depend only on the distance of η to C, i.e. τ(η) = τ(d_C(η)), say. The cases (d, p) = (2, 0) and (2, 1) with τ the identity are illustrated in Figures 1 and 2, respectively. Then, (12) has a unique solution among transformations of the form
h_θ(η) = p_C(η) + g_θ(d_C(η)) (η − p_C(η)) / d_C(η),  (13)
where p_C(η) is the orthogonal projection of η onto C and g_θ is an increasing function of [0,1] onto itself. The solution is given by, cf. Appendix I,

g_θ^{-1}(u) = ((d − p) a(θ) ∫_0^u t^{d−p−1} e^{θτ(t)} dt)^{1/(d−p)},  u ∈ [0,1].  (14)
It follows also from Appendix I that for h_θ defined by (13) and (14),

Jh_θ(η) = a(θ)^{-1} e^{−θ·τ(g_θ(d_C(η)))},

where g_θ is determined by (14).
For p > 0, this model may be used locally also in the case where C is curved. •
Likewise, it is possible to construct transformations on S^1 or S^2 for the case of exponential inhomogeneity with τ(η) = d_C(η), where C is a point on S^1 or S^2 and d_C is the geodesic distance to C, cf. Jensen and Nielsen (2000). Illustrations are given in Figures 3 and 4. For general functions τ, the construction of an appropriate set of transformations may be difficult. An example is the inhomogeneity of the hickory tree data from Stoyan and Stoyan (1998, Figure 1). The density of a type III exponential inhomogeneous point process becomes
f_Y(y; θ) ∝ a(θ)^{n(y)} e^{θ·t(y)} ∏_{z⊆y} ψ(h_θ^{-1}(z)),
compare with (10). In particular, if the homogeneous model is a Strauss model then

f_Y(y; θ, β, γ) ∝ e^{θ·t(y)} (a(θ)β)^{n(y)} γ^{s(h_θ^{-1}(y))}.  (15)

For the transformation related model (9) with λ = λ_θ, the density becomes

f_Y(y; θ) ∝ e^{θ·t(y)} (a(θ)β)^{n(y)} ∏_{{η,ξ}⊆y} φ(a(θ)^{2v} e^{vθ(τ(η)+τ(ξ))} (η − ξ)).

In particular, in the Strauss case we get

f_Y(y; θ, β, γ) ∝ e^{θ·t(y)} (a(θ)β)^{n(y)} γ^{s_θ(y)},  (16)

where s_θ(y) is the number of ~_θ-neighbour pairs in y. If two points η, ξ are related in the homogeneous Strauss process when ||η − ξ|| < R, then the relation ~_θ is defined by
η ~_θ ξ ⟺ a(θ)^{2v} e^{vθ(τ(η)+τ(ξ))} ||η − ξ|| < R.

The three models (11), (15) and (16) are compared by simulation in Figure 5. Note that the type I process appears somewhat more homogeneous than the other processes, because the relation for this process is not location dependent. This feature becomes more pronounced if more points are forced into the point patterns by increasing β. Furthermore, the intensity in the type I point process appears to be lower than the intensity in the other two point processes. These two are, however, similar both regarding the relation and the point intensity. Thus, the inhomogeneity function and the parameters from the associated homogeneous
process play quite different roles in the actual point intensity and point interaction in the different types of inhomogeneous models. These issues of course have to be examined in more detail.

Figure 5: Realizations of inhomogeneous point processes on the unit square: (a) Type I, (b) Type III, (c) Type III related. The densities used are, from left to right, (11), (15) and (16), respectively, all with τ(η_1, η_2) = η_1. The parameter values used are β = 200, γ = 0.01, R (interaction radius) = 0.05 and θ = −3. In (16), the exponent is v = 1/4. The numbers of observed points are, from left to right, n(y) = 87, 95 and 93, respectively.
3.4 Parametric Statistical Inference

3.4.1 Likelihood Inference
If a Markov point process is observed in a sampling window T ⊆ X, cf. Figures 1 and 2, the conditional density of the point pattern observed in T, given the remaining points, may be used for inference. As in Baddeley et al. (2000), let for disjoint point patterns y and x

χ(y | x) = f(x ∪ y) / f(x).  (17)
For a homogeneous Markov point process with density (4), the conditional density, with respect to the homogeneous Poisson point process on T with intensity 1, is then of the form, cf. e.g. Møller (1999b, formula (14)),

f(x_T | x_{T^c}) ∝ χ(x_T | x_{∂T}),

where x_T = x ∩ T, T^c is the complement of T, and ∂T = {ξ ∈ T^c : ∃η ∈ T with η ~ ξ}. Note that this density depends on x_{T^c} only via x_{∂T}.
Let us suppose that the interaction function ψ can be parametrized by ψ ∈ Ψ. Then, the likelihood function based on observation of the homogeneous process in T becomes

L_T(ψ; x) = c_T(ψ; x_{∂T}) χ(x_T; ψ | x_{∂T}),  (18)

where c_T(ψ; x_{∂T}) is a normalizing constant and χ(·; ψ | ·) is defined as in (17) with ψ parametrized by ψ. In the Strauss case, we get
L_T(β, γ; x) = c_T(β, γ; x_{∂T}) β^{n(x_T)} γ^{s(x_T)+s(x_T; x_{∂T})},  (19)
where for disjoint point patterns x and y, s(x; y) is the number of pairs (η, ξ) with η ∈ x, ξ ∈ y and η ~ ξ. Likelihood inference based on these likelihood functions requires Markov chain Monte Carlo (MCMC) approximations, since the normalizing constant is not known explicitly, cf. Geyer (1999) and Møller (1999b). Let us now compare likelihood inference for the two Markovian inhomogeneous processes, viz. types I and III. Likelihood inference for the type II case, based on MCMC techniques for missing data problems, has been discussed in detail in Baddeley et al. (2000). For type I inhomogeneity, the conditional density is given by
f_Y(y_T | y_{T^c}) ∝ ∏_{η∈y_T} λ(η) χ(y_T | y_{∂T}),

where χ refers to the homogeneous process. If λ_θ(η) = a(θ)e^{θ·τ(η)} and ψ is parametrized by ψ, we get

L_T(θ, ψ; y) = c_T(θ, ψ; y_{∂T}) a(θ)^{n(y_T)} e^{θ·t(y_T)} χ(y_T; ψ | y_{∂T}).

In particular, if the homogeneous process is a Strauss process we have

L_T(θ, β, γ; y) = c_T(θ, β, γ; y_{∂T}) e^{θ·t(y_T)} (a(θ)β)^{n(y_T)} γ^{s(y_T)+s(y_T; y_{∂T})}.
Again, MCMC is required for the analysis. Statistical inference in the case of type III inhomogeneity is based on the following conditional density, derived from (6),

f_Y(y_T | y_{T^c}) ∝ ∏_{η∈y_T} λ(η) χ(h^{-1}(y_T) | h^{-1}(y_{∂T})).
If h_θ is chosen such that Jh_θ^{-1} = λ_θ and ψ is parametrized by ψ, we get, cf. Appendix II,

L_T(θ, ψ; y) = L_0(θ; y_T) L_{h_θ^{-1}(T)}(ψ; h_θ^{-1}(y)),  (20)
where L_0(·; y_T) is the likelihood (1) of an inhomogeneous Poisson point process with intensity function λ_θ, and L_{h_θ^{-1}(T)}(·; h_θ^{-1}(y)) is the likelihood function (18) for the homogeneous Markov point process with observation h_θ^{-1}(y), observed in h_θ^{-1}(T). Recall that (18) reduces to (19) in the Strauss case. Note that for the transformations derived in Example 3.1 and windows T as shown in Figures 1 and 2, h_θ^{-1}(T) does not depend on θ. In fact, h_θ(T) = T. Likelihood inference is simpler for type I than for type III models, since the inhomogeneity parameter is a nuisance parameter in an exponential family model in the latter case. However, simulation studies indicate that the estimate θ̂_0 of θ based on L_0(·; y_T), cf. (20), is close to θ. In that case, the interaction parameter ψ can be estimated on the basis of

L_{h_{θ̂_0}^{-1}(T)}(ψ; h_{θ̂_0}^{-1}(y)),
and the analysis will be no more complicated than the analysis of a homogeneous Markov point process.

3.4.2 Pseudolikelihood Inference
A less demanding inference procedure is based on the pseudolikelihood function, which is the likelihood function for a Poisson point process with intensity function equal to the Papangelou conditional intensity of the process, cf. e.g. Besag (1975) and Jensen and Møller (1991). Recently, pseudolikelihood inference has been discussed by Baddeley and Turner (2000). If the homogeneous process is parametrized by ψ, the pseudolikelihood function based on observation in T becomes

PL_T(ψ; x) = exp(−∫_T [λ_ψ(u; x) − 1] du^k) ∏_{η∈x_T} λ_ψ(η; x\η),

where x\η means x \ {η} and

λ_ψ(η; x) = f_ψ(x ∪ {η}) / f_ψ(x)

is the Papangelou conditional intensity. If the homogeneous process is a Strauss process, then

λ_{β,γ}(η; x) = βγ^{s(η; x)}, where s(η; x) is the number of points of x\η that are neighbours of η,
and

PL_T(β, γ; x) = exp(−∫_T [βγ^{s(u; x)} − 1] du^k) β^{n(x_T)} γ^{2s(x_T)+s(x_T; x_{T^c})}.
Compared to likelihood inference, the 'normalizing' constant is much simpler. Let us now look at pseudolikelihood inference for the two Markovian inhomogeneous processes. For type I inhomogeneity of exponential form, the pseudolikelihood function takes the form

PL_T(θ, ψ; y) = exp(−∫_T [a(θ)e^{θ·τ(u)} λ_ψ(u; y) − 1] du^k) a(θ)^{n(y_T)} e^{θ·t(y_T)} ∏_{η∈y_T} λ_ψ(η; y\η).

In the Strauss case, we get

PL_T(θ, β, γ; y) = exp(−∫_T [a(θ)β e^{θ·τ(u)} γ^{s(u; y)} − 1] du^k) e^{θ·t(y_T)} (a(θ)β)^{n(y_T)} γ^{2s(y_T)+s(y_T; y_{T^c})}.
Note that for fixed θ and γ the maximum pseudolikelihood estimate of β is known explicitly. This is an example where the Papangelou conditional intensity is of log-linear form, and the analysis suggested by Baddeley and Turner (2000) can therefore be used. Using the approximations described in Berman and Turner (1992) and Baddeley and Turner (2000), one should, however, be careful when choosing the dummy points involved in the approximation. In the case of type III inhomogeneity, it can be shown, cf. Appendix II, that

PL_T(θ, ψ; y) = L_0(θ; y_T) PL_{h_θ^{-1}(T)}(ψ; h_θ^{-1}(y)).  (21)
The pseudolikelihood function thus decomposes as the likelihood function, cf. (20). Pseudolikelihood inference for type III processes appears to be more complicated but again it is expected that the inference can be split into two parts.
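To make the pseudolikelihood computations concrete, the sketch below (our own; function names, grid resolution and parameter values are hypothetical) evaluates the log-pseudolikelihood of a homogeneous Strauss process on the unit square, approximating the integral term by a midpoint rule on a grid — a crude stand-in for the Berman-Turner device mentioned above:

```python
import math

def s_close(u, pts, R):
    """Number of points of pts within distance R of u (u itself excluded)."""
    return sum(1 for xi in pts if xi is not u and math.dist(u, xi) < R)

def log_pl_strauss(x, beta, gamma, R, grid=50):
    """log PL_T = sum_eta log lambda(eta; x \\ eta) - int_T [lambda(u; x) - 1] du,
    where lambda(u; x) = beta * gamma**s(u; x) is the Strauss Papangelou
    conditional intensity and T = [0,1]^2; the integral uses a midpoint rule."""
    lam = lambda u, pts: beta * gamma ** s_close(u, pts, R)
    data_term = sum(math.log(lam(eta, x)) for eta in x)
    h = 1.0 / grid
    integral = sum((lam(((i + 0.5) * h, (j + 0.5) * h), x) - 1.0) * h * h
                   for i in range(grid) for j in range(grid))
    return data_term - integral

x = [(0.2, 0.3), (0.25, 0.33), (0.7, 0.8)]
print(log_pl_strauss(x, beta=100.0, gamma=0.5, R=0.1))
```

With γ = 1 the interaction vanishes and the expression reduces exactly to the Poisson log-likelihood n(x) log β − (β − 1), which gives a simple check of the implementation.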
4 Discussion
In the present paper, we have discussed three types of inhomogeneous point processes, derived from homogeneous Markov point processes. It is of course also of interest to study how inhomogeneity can be introduced into other classical classes of point processes. For instance, one may consider inhomogeneous Neyman-Scott point processes (the Poisson point process of the mothers is inhomogeneous), inhomogeneous Matérn hard-core processes (the unthinned Poisson point process is inhomogeneous), inhomogeneous
simple sequential inhibition point processes (the size of the circular region around each point depends on the position of the point) and inhomogeneous Gibbs processes (e.g. transformations of homogeneous Gibbs processes). See Clausen et al. (2000).

The emphasis in the present paper has been on parametrically modelled inhomogeneity. This is a new approach for type II processes. Dually, it will also be of interest to study non-parametric estimation of the transformation involved in type III models.

Summary statistics like the K-, F- and G-functions have been developed for the initial study of the interaction in homogeneous point processes and for checking of models for homogeneous point processes, cf. Stoyan et al. (1995). In Baddeley et al. (2000), an analogue of the K-function is suggested for the inhomogeneous case. For a type II process, this analogue has the nice property of being identical to the K-function of the unthinned process. It still remains, however, to find versions of the F- and G-functions that can be used in the inhomogeneous case. For type III processes, an alternative is to estimate the transformation, either parametrically or non-parametrically, and then use the traditional summary statistics for the homogeneous case on the inversely transformed point pattern.

Type III processes have the special feature that the neighbourhood relation induced by the transformation is location dependent. The relation is generally not isotropic, in the sense that the relationship does not depend only on the distance between the points. Another quite promising idea is to introduce inhomogeneity in Markov point processes by location dependent scaling. This is a topic for future research.
5 Acknowledgements
The authors want to thank Ute Hahn, Jesper Møller and Aila Särkkä for fruitful discussions while preparing this paper. This research has been supported by the Centre for Mathematical Physics and Stochastics (MaPhySto), funded by a grant from the Danish National Research Foundation.
References

Baddeley, A. J. and Møller, J. (1989). Nearest-neighbour Markov point processes and random sets. Int. Statist. Rev., 57:89-121.
Baddeley, A. J., Møller, J., and Waagepetersen, R. (2000). Non- and semi-parametric estimation of interaction in inhomogeneous point patterns. Statistica Neerlandica, 54:329-350.
Baddeley, A. J. and Turner, R. (2000). Practical maximum pseudolikelihood for spatial point patterns. Australian and New Zealand Journal of Statistics, 42:283-322.
Berman, M. (1986). Testing for spatial association between a point process and another stochastic process. Appl. Statist., 35:54-62.
Berman, M. and Turner, T. R. (1992). Approximating point process likelihoods with GLIM. Appl. Statist., 41:31-38.
Besag, J. E. (1975). Statistical analysis of non-lattice data. The Statistician, 24:179-195.
Brix, A. and Møller, J. (1998). Space-time multitype log Gaussian Cox processes with a view to modelling weed data. Research Report R-98-2012, Department of Mathematical Sciences, Aalborg University. To appear in Scand. J. Statist.
Clausen, W. H. O., Rasmussen, H. H., and Rasmussen, M. H. (2000). Inhomogeneous Point Processes. Master Thesis, Department of Mathematical Sciences, Aalborg University.
Geyer, C. J. (1999). Likelihood inference for spatial point processes. In Barndorff-Nielsen, O. E., Kendall, W. S., and van Lieshout, M. N. M., editors, Stochastic Geometry: Likelihood and Computation, pages 79-140, London. Chapman and Hall/CRC.
Heikkinen, J. and Arjas, E. (1998). Non-parametric Bayesian estimation of a spatial Poisson intensity. Scand. J. Statist., 25:435-450.
Högmander, H. and Särkkä, A. (1999). Multitype spatial point patterns with hierarchical interactions. Biometrics, 55:1051-1058.
Jensen, E. B. V. (1998). Local Stereology. World Scientific, Singapore.
Jensen, E. B. V. and Nielsen, L. S. (2000). Inhomogeneous Markov point processes by transformation. Bernoulli, 6:761-782.
Jensen, J. L. and Møller, J. (1991). Pseudolikelihood for exponential family models of spatial point processes. The Annals of Applied Probability, 1:445-461.
Kelly, F. P. and Ripley, B. D. (1976). A note on Strauss' model for clustering. Biometrika, 63:357-360.
Møller, J. (1999a). Aspects of Spatial Statistics, Stochastic Geometry and Markov Chain Monte Carlo. Doctoral Thesis, Aalborg University.
Møller, J. (1999b). Markov chain Monte Carlo and spatial point processes. In Barndorff-Nielsen, O. E., Kendall, W. S., and van Lieshout, M. N. M., editors, Stochastic Geometry: Likelihood and Computation, pages 141-172, London. Chapman and Hall/CRC.
Møller, J., Syversveen, A. R., and Waagepetersen, R. P. (1998). Log Gaussian Cox processes. Scand. J. Statist., 25:451-482.
Ogata, Y. and Tanemura, M. (1986). Likelihood estimation of interaction potentials and external fields of inhomogeneous spatial point patterns. In Francis, I. S., Manly, B. F. J., and Lam, F. C., editors, Proc. Pacific Statistical Congress - 1985, pages 150-154. Amsterdam, Elsevier.
Perrin, O. (1997). Modèle de Covariance d'un Processus Non-Stationnaire par Déformation de l'Espace et Statistique. Ph.D. Thesis, Université de Paris I Panthéon-Sorbonne.
Rathbun, S. L. (1996). Estimation of Poisson intensity using partially observed concomitant variables. Biometrics, 52:226-242.
Ripley, B. D. and Kelly, F. P. (1977). Markov point processes. J. London Math. Soc., 15:188-192.
Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall, London.
Stoyan, D., Kendall, W. S., and Mecke, J. (1995). Stochastic Geometry and its Statistical Applications. Wiley, Chichester, second edition.
Stoyan, D. and Stoyan, H. (1998). Non-homogeneous Gibbs process models for forestry - a case study. Biometrical Journal, 40:521-531.
Strauss, D. J. (1975). A model for clustering. Biometrika, 63:467-475.
van Lieshout, M. N. M. (2000). Markov point processes and their applications. World Scientific, London.
Appendix I

Let us start by finding Jh_θ^{-1} in the case p = 0, where (13) reduces to

h_θ(η) = g_θ(||η||) η / ||η||.

Let B^d(0,1) be the unit ball in R^d. Using polar decomposition twice, we get for an arbitrary function f on B^d(0,1),

∫_{B^d(0,1)} f(h_θ(η)) dη^d = ∫_{S^{d-1}} ∫_0^1 f(g_θ(t)ω) t^{d-1} dt dω^{d-1}
 = ∫_{S^{d-1}} ∫_0^1 f(uω) (g_θ^{-1}(u))^{d-1} (g_θ^{-1})'(u) du dω^{d-1}
 = ∫_{B^d(0,1)} f(η) (g_θ^{-1}(||η||))^{d-1} (g_θ^{-1})'(||η||) ||η||^{-(d-1)} dη^d.
Therefore,

Jh_θ^{-1}(η) = (g_θ^{-1}(||η||))^{d-1} (g_θ^{-1})'(||η||) ||η||^{-(d-1)}  (22)

for p = 0. This result can now be used to find Jh_θ^{-1} for general p. Let

T^d(0,1) = {η ∈ R^d : d_C(η) < 1}.
For an arbitrary function f on T^d(0,1) we then get, writing η = p_C(η) + η' with η' orthogonal to C,

∫_{T^d(0,1)} f(h_θ(η)) dη^d = ∫_C ∫_{B^{d-p}(0,1)} f(c + h_θ(η')) dη'^{d-p} dc^p
 = ∫_{T^d(0,1)} f(η) (g_θ^{-1}(d_C(η)))^{d-p-1} (g_θ^{-1})'(d_C(η)) d_C(η)^{-(d-p-1)} dη^d,

where we at the second equality sign have used (22). It follows that

Jh_θ^{-1}(η) = (g_θ^{-1}(d_C(η)))^{d-p-1} (g_θ^{-1})'(d_C(η)) d_C(η)^{-(d-p-1)}.

Since we also have Jh_θ^{-1}(η) = a(θ)e^{θτ(d_C(η))}, g_θ must satisfy

[(g_θ^{-1}(u))^{d-p}]' = (d − p) u^{d-p-1} a(θ) e^{θτ(u)},  u ∈ [0,1].  (23)

Since g_θ is an increasing 1-1 mapping of [0,1] onto itself, the unique solution of (23) is (14).
Appendix II

In order to derive (20), we need to find the constant c_T(θ, ψ; y_{∂T}) in the expression for the conditional density

f_Y(y_T; θ, ψ | y_{T^c}) = c_T(θ, ψ; y_{∂T}) χ(h_θ^{-1}(y_T); ψ | h_θ^{-1}(y_{∂T})).

Since this is a density with respect to the Poisson point process on T with intensity measure λ_θ, we get, using that λ_θ(η) = Jh_θ^{-1}(η) and the well-known
317
INHOMOGENEOUS MARKOV PROCESSES
expansion of the distribution of the Poisson point process, cf. Møller (1999b, Section 2),

c_T(θ, ψ; y_{∂T})^{-1} = e^{−∫_T λ_θ(η) dη^k} Σ_{n=0}^∞ (1/n!) ∫_{h_θ^{-1}(T)} ⋯ ∫_{h_θ^{-1}(T)} χ({x_1, ..., x_n}; ψ | h_θ^{-1}(y_{∂T})) dx_1^k ⋯ dx_n^k,

where the substitution x_i = h_θ^{-1}(η_i), with Jacobian λ_θ(η_i), has been used.
The result now follows by noting that

ν_k(h_θ^{-1}(T)) = ∫_{h_θ^{-1}(T)} dξ^k = ∫_T Jh_θ^{-1}(η) dη^k = ∫_T λ_θ(η) dη^k,

where ν_k denotes k-dimensional volume.
The proof of (21) is obtained as follows. The Papangelou conditional intensity of the transformed process becomes, cf. (6),

λ_{θ,ψ}(η; y) = λ_θ(η) λ_ψ(h_θ^{-1}(η); h_θ^{-1}(y)),

where λ_ψ is the Papangelou conditional intensity of the untransformed process. The pseudolikelihood function of the transformed process therefore becomes

PL_T(θ, ψ; y) = [∏_{η∈y_T} λ_θ(η)] exp(−∫_T [λ_θ(u) − 1] du^k)
 × exp(−∫_T λ_θ(u)[λ_ψ(h_θ^{-1}(u); h_θ^{-1}(y)) − 1] du^k) ∏_{η∈y_T} λ_ψ(h_θ^{-1}(η); h_θ^{-1}(y\η)).
The result is now obtained by noting that

∫_T λ_θ(u)[λ_ψ(h_θ^{-1}(u); h_θ^{-1}(y)) − 1] du^k = ∫_{h_θ^{-1}(T)} [λ_ψ(v; h_θ^{-1}(y)) − 1] dv^k.
Appendix III

Simulations from the inhomogeneous Strauss point process (15) are shown in the right-hand sides of Figures 1 to 4. In Table 1 the model parameters and the resulting number of points in the simulated point patterns are given. Note, however, that the number of points in Figure 2, n(y) = 355, is for a 33% wider rectangle. A larger area was used to reduce edge problems. The point patterns shown in Figures 1 to 4, right, and the three point patterns in Figure 5 have been simulated using the Metropolis-Hastings birth-death algorithm with 500,000 iterations, cf. e.g. Møller (1999b).

Figure   β      γ      R     θ     n(y)
1        1000   0.01   0.1   −3    163
2        400    0.01   0.1   −3    355
3        70     0.01   0.1   −1    40
4        100    0.01   0.2   −2    143

Table 1: Parameters used for simulation and the resulting number of points for the point patterns in the right-hand sides of Figures 1 to 4.
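The Metropolis-Hastings birth-death algorithm used for these simulations can be sketched compactly. The following is our own minimal illustration for the homogeneous Strauss density f(x) ∝ β^{n(x)} γ^{s(x)} on the unit square, cf. Møller (1999b); the inhomogeneous variants only change the density ratios, and all names and parameter values here are hypothetical:

```python
import math
import random

def n_close(eta, pts, R):
    """Number of points of pts within distance R of eta."""
    return sum(1 for xi in pts if math.dist(eta, xi) < R)

def mh_birth_death(beta, gamma, R, iters, seed=0):
    """Metropolis-Hastings birth-death chain targeting the Strauss density
    f(x) ∝ beta**n(x) * gamma**s(x) on the unit square.  Each step proposes,
    with probability 1/2, a birth at a uniform location, and otherwise the
    death of a uniformly chosen existing point."""
    rng = random.Random(seed)
    x = []
    for _ in range(iters):
        if rng.random() < 0.5:                       # birth proposal
            eta = (rng.random(), rng.random())
            # Hastings ratio: [f(x + eta) / f(x)] * [1 / (n + 1)]
            r = beta * gamma ** n_close(eta, x, R) / (len(x) + 1)
            if rng.random() < min(1.0, r):
                x = x + [eta]
        elif x:                                      # death proposal
            i = rng.randrange(len(x))
            eta, rest = x[i], x[:i] + x[i + 1:]
            r = len(x) / (beta * gamma ** n_close(eta, rest, R))
            if rng.random() < min(1.0, r):
                x = rest
    return x

x = mh_birth_death(beta=100.0, gamma=0.5, R=0.05, iters=5000)
print(len(x))
```

The birth acceptance ratio βγ^{t}/(n+1) and its death counterpart make the chain reversible with respect to the Strauss density; in practice long runs (the paper uses 500,000 iterations) are needed for approximate equilibrium, whereas perfect-sampling schemes remove this burn-in question altogether.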
Institute of Mathematical Statistics LECTURE NOTES — MONOGRAPH SERIES
Perfect sampling for posterior landmark distributions with an application to the detection of disease clusters

Marc A. Loizeaux and Ian W. McKeague
Department of Statistics, Florida State University, Tallahassee, FL 32306-4330

Abstract. We study perfect sampling for the posterior distribution in a class of spatial point process models introduced by Baddeley and van Lieshout (1993). For Neyman-Scott cluster models, perfect sampling from the posterior is shown to be computationally feasible via a coupling-from-the-past type algorithm of Kendall and Møller. An application to data on leukemia incidence in upstate New York is presented.
1 Introduction
Bayesian cluster models based on spatial point processes were originally introduced by Baddeley and van Lieshout (1993), primarily for applications in computer vision. Disease clustering applications have also played a prominent role in the development of these models, as surveyed by Lawson and Clarke (1999). An important special case is the Neyman-Scott process in which the observations arise from a superposition of inhomogeneous Poisson processes associated with underlying landmarks (Neyman and Scott, 1972); van Lieshout (1995) focused on this case. Markov chain Monte Carlo (MCMC) techniques are indispensable for the application of point process models in statistics, see, e.g., the survey of Møller (1999). Following the seminal work of Propp and Wilson (1996), Kendall and Møller (2000) developed a version of perfect simulation for locally stable point processes; see the article of Møller (2001) in this volume. This raises the possibility of constructing perfect samplers for the posterior distribution in Bayesian cluster models. Perfect samplers deliver an exact draw from the target distribution; this is a distinct advantage over traditional
322
LOIZEAUX AND MCKEAGUE
MCMC schemes, which are often plagued by convergence problems. For some recent applications of perfect simulation in statistics, see Green and Murdoch (1999), Møller and Nicholls (1999) and Casella et al. (1999). In this article we show that the posterior in the Baddeley-van Lieshout class of Bayesian cluster models is locally stable, provided the prior is locally stable and the likelihood satisfies some mild conditions. This has two important consequences: the posterior density is proper (has unit total mass), and the Kendall-Møller algorithm is potentially applicable. However, the Kendall-Møller algorithm is known to be computationally feasible only under a monotonicity condition: the Papangelou conditional intensity needs to be attractive (favoring clustered patterns), repulsive (discouraging clustered patterns), or a product of such terms. We show that perfect sampling is feasible for the Neyman-Scott process when the prior satisfies this monotonicity condition. We present an application to data on leukemia incidence in an eight county area of upstate New York during the years 1978-82. The study area includes 11 inactive hazardous waste sites. We assess the possibility of an increased leukemia incidence rate in the proximity of these sites. There is an extensive literature on the analysis of these data, recent contributions being Ghosh et al. (1999), who applied a hierarchical Bayes generalized linear model approach, and Ahrens et al. (1999), who adjusted for covariate effects using a log-linear model. Our results suggest that there is an elevated leukemia incidence rate in the neighborhood of one of the sites. The paper is organized as follows. In Section 2 we develop the main result of the paper showing that the posterior is locally stable, and examine the Neyman-Scott model in detail. Section 3 contains the application to disease clustering. Some concluding remarks are given in Section 4.
2 Bayesian cluster models

2.1 Preliminaries
The basic framework comes from Carter and Prenter (1972), see also Møller (1999). Let W be a compact subset of the plane representing the study region. A realization of a point process in W is a finite set of points x = {x_1, x_2, ..., x_{n(x)}} ⊆ W, where n(x) is the number of points in x. If n(x) = 0, write x = ∅ for the empty configuration. Let Ω denote the exponential space of all such finite point configurations in W, and furnish it with the σ-field F generated by sets of the form {x : n(x ∩ B) = k}, where B ∈ B, the Borel σ-field on W, and k = 0, 1, 2, .... A standard way of constructing an Ω-valued point process X is by specifying an unnormalized density f with respect to the distribution π of the unit
rate Poisson process on W. The unnormalized density f (or corresponding process X) is said to be locally stable if there is a constant K > 0 such that f(x ∪ {ξ}) ≤ K f(x) for all x ∈ Ω, ξ ∈ W\x. Local stability implies that the Papangelou conditional intensity

g(x, ξ) = f(x ∪ {ξ}) / f(x)
(with 0/0 = 0) is bounded. Most point processes that have been suggested for modeling spatial point patterns are locally stable, including the Strauss (1975) process and the area-interaction process of Baddeley and van Lieshout (1995). The Strauss process, used later in this article, has unnormalized density f(x) = β^{n(x)} γ^{t(x)}, where β > 0, 0 < γ < 1 and t(x) is the number of unordered pairs of points in x which are within a specified distance r of each other. The Strauss process only models repulsive pairwise interaction.

2.2 Posterior distribution
The observed point configuration which arises from the landmarks x will be denoted y = {y_1, y_2, ..., y_{n(y)}} ⊆ W, and assumed to be non-empty. The prior and observation models are specified by point processes on W. The prior distribution of landmarks corresponds to a point process X having density p_X(x) with respect to π. The likelihood is defined in terms of an unnormalized density f(·|x). Thus, for a given set of landmarks x, the density of the observed point process Y with respect to π is

p_{Y|X=x}(y) = a_Y(x) f(y | x),

where

a_Y(x) = (∫_Ω f(v | x) π(dv))^{-1}

is the normalizing constant. We assume that f(y|x) is jointly measurable in x and y. From Bayes formula, the posterior density of X with respect to π is

p_{X|Y=y}(x) ∝ a_Y(x) f(y | x) p_X(x).  (2.1)
The following theorem provides sufficient conditions for the posterior to be locally stable. We assume local stability of the prior pχ(-) and of the likelihood /(y| ) (for each fixed y). In addition, /(y| ) is assumed to satisfy the following local growth condition: there exists a constant L > 0 such that /(y|χu{ξ})>L/(y|χ)
(2.2)
324
LOIZEAUX AND MCKEAGUE
for all x, y ∈ Ω, ξ ∈ W\x. The term 'local growth condition' is used here because we view it as being dual to local stability.

Theorem 2.1. Suppose p_X(·) and f(y|·) (for each y) are locally stable, and f(y|·) satisfies the local growth condition (2.2). Then the posterior (2.1) is locally stable.

Proof. It suffices to show that a_Y(·) is locally stable, because p_X(·) and f(y|·) are assumed to be locally stable, and local stability is preserved under products. Given ξ ∈ W\x,
a_Y(x ∪ {ξ})^{-1} = ∫_Ω f(v | x ∪ {ξ}) π(dv) ≥ L ∫_Ω f(v | x) π(dv) = L a_Y(x)^{-1},

completing the proof. □

The Kendall-Møller algorithm uses the method of dominated coupling-from-the-past to obtain perfect samples from a locally stable point process as the equilibrium distribution of a spatial birth-and-death process. The algorithm is computationally feasible if the Papangelou conditional intensity g(x, ξ) is either attractive or repulsive, or a product of such terms. In the attractive case, g(x, ξ) ≤ g(x', ξ) whenever ξ ∉ x' and x ⊆ x'; in the repulsive case g(x, ξ) ≥ g(x', ξ) whenever ξ ∉ x' and x ⊆ x'.
2.3 Neyman-Scott model
In this section we focus on the Neyman-Scott model in which the observation process Y is the superposition of n(x) independent inhomogeneous Poisson processes Z_{x_i} and a background Poisson noise process of intensity ε > 0. The intensity h(·|x_i) of Z_{x_i} is specified parametrically, and the prior p_X(x) is assumed to be locally stable. Here and in the application in the next section we assume the Thomas intensity model

h(t | x) = κ/(2πσ²) e^{−||t−x||²/(2σ²)},  (2.3)

where κ, σ > 0. For convenience, denote κ* = κ/(2πσ²). In this case,

f(y | x) = ∏_{j=1}^{n(y)} λ(y_j | x), where

λ(t | x) = ε + Σ_{i=1}^{n(x)} h(t | x_i)
is the conditional intensity at t of Y given x. We now check the relevant conditions of Theorem 2.1. To show that f(y | ·) is locally stable, note that for ξ ∈ W\x

f(y | x ∪ {ξ}) = ∏_{j=1}^{n(y)} [λ(y_j | x) + h(y_j | ξ)] ≤ (1 + κ*/ε)^{n(y)} f(y | x),

since h(t | ξ) ≤ κ* and λ(t | x) ≥ ε. To check the local growth condition, note that for ξ ∈ W\x

f(y | x ∪ {ξ}) ≥ f(y | x)

uniformly in y, and we can use L = 1. Thus the conditions of Theorem 2.1 are satisfied, so the posterior is locally stable. The Papangelou conditional intensity corresponding to f(y | ·) is
f(y | x ∪ {ξ}) / f(y | x) = ∏_{j=1}^{n(y)} [1 + h(y_j | ξ) / λ(y_j | x)]

for ξ ∈ W\x, which is clearly decreasing in x, thus repulsive. Noting that

a_Y(x)^{-1} = ∫_Ω f(v | x) π(dv) = exp(∫_W λ(t | x) dt − |W|),

we find that the Papangelou conditional intensity corresponding to a_Y(·) is

a_Y(x ∪ {ξ}) / a_Y(x) = exp(−∫_W h(t | ξ) dt)

for ξ ∈ W\x, which does not depend on x. We conclude that the posterior is of a feasible form for implementing the Kendall-Møller algorithm if the Papangelou conditional intensity corresponding to the prior is a product of repulsive or attractive components.
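The quantities just discussed are straightforward to compute. The sketch below is our own (parameter values hypothetical); it implements the Thomas intensity (2.3) and the conditional intensity λ(t|x) = ε + Σ_i h(t|x_i), and checks the bound h(t|x) ≤ κ* used in the local stability argument:

```python
import math

def thomas(t, x, kappa=1.0, sigma=0.1):
    """Thomas intensity (2.3): h(t|x) = kappa/(2 pi sigma^2) exp(-||t-x||^2/(2 sigma^2))."""
    d2 = (t[0] - x[0]) ** 2 + (t[1] - x[1]) ** 2
    return kappa / (2.0 * math.pi * sigma ** 2) * math.exp(-d2 / (2.0 * sigma ** 2))

def cond_intensity(t, landmarks, eps=0.05, **kw):
    """lambda(t|x) = eps + sum_i h(t|x_i): offspring clusters around each
    landmark superposed with background noise of intensity eps."""
    return eps + sum(thomas(t, xi, **kw) for xi in landmarks)

kappa_star = 1.0 / (2.0 * math.pi * 0.1 ** 2)   # kappa* = kappa / (2 pi sigma^2)
landmarks = [(0.3, 0.3), (0.7, 0.6)]
print(cond_intensity((0.3, 0.3), landmarks))
```

The Gaussian kernel attains its maximum κ* at the landmark itself, so λ(t|x) always lies between ε and ε + n(x)κ*, the bounds exploited in the verification of Theorem 2.1.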
3 Application
In this section we present an application to data on leukemia incidence in an eight county area of upstate New York. There is an extensive literature on the analysis of these data, see, e.g., Waller et al. (1992, 1994), Ghosh et al. (1999) and Ahrens et al. (1999). The study area comprises 790 census tracts, and leukemia incidence was recorded by the New York Department of Health for each census tract during the years 1978-82, see Waller et al. (1992). The study area includes 11 inactive hazardous waste sites. The goal is to assess the possibility of an increased leukemia incidence rate in the proximity of these sites.
Figure 1: Left: locations of 552 leukemia cases in upstate New York, along with an approximate outline of the eight county study region. The rectangular region is 1 × 1.2 square units. Right: contour plot of the population density λ(t), and the locations of the 11 hazardous waste sites.

The locations of the centroids of the census tracts are available, but precise locations of the leukemia cases are not. Our methods require the precise locations, so we randomly dispersed the cases throughout their corresponding census tracts; if there was exactly one case in a tract, we placed it at the centroid, see the left panel of Figure 1. (Sensitivity analysis indicates that this approximation makes no difference to our conclusions.) In some instances a case could not be associated with a unique census tract, resulting in fractional counts. Our approach does not accommodate this type of data,
so we follow Ghosh et al. (1999) and group the 790 census tracts into 281 blocks in order to identify most of the cases with a specific block. Less than 10% of all cases could not be identified with a specific block, and such cases are excluded from our analysis.
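The dispersal step described above can be sketched as follows; the tract table, the counts, and the rectangular tract outlines are hypothetical stand-ins for the census data (real tracts would need point-in-polygon sampling over their true boundaries):

```python
import random

def disperse_cases(tracts, counts, rng=random.Random(0)):
    """Scatter each tract's case count uniformly over the tract.

    tracts: dict tract_id -> ((xmin, ymin, xmax, ymax), (cx, cy))
    counts: dict tract_id -> integer number of cases
    A tract with exactly one case contributes its centroid instead.
    """
    points = []
    for tid, n in counts.items():
        (xmin, ymin, xmax, ymax), centroid = tracts[tid]
        if n == 1:
            points.append(centroid)          # single case -> centroid
        else:
            for _ in range(n):               # n cases -> uniform scatter
                points.append((rng.uniform(xmin, xmax),
                               rng.uniform(ymin, ymax)))
    return points

tracts = {"A": ((0.0, 0.0, 1.0, 1.0), (0.5, 0.5)),
          "B": ((1.0, 0.0, 2.0, 1.0), (1.5, 0.5))}
pts = disperse_cases(tracts, {"A": 1, "B": 3})
print(len(pts), pts[0])
```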
Figure 2: Posterior intensity map for the leukemia data based on the Neyman-Scott model with ε = 5.2 x 10^{-4}, σ = 0.01, K* = 0.23ε, and a Strauss prior with interaction radius r = 0.1, β₁ = 0.5 and γ = 0.1. Locations of the 11 hazardous waste sites are included.

Our analysis is based on the Neyman-Scott model with the leukemia intensity rate specified by

λ(t) ( ε + Σ_{i=1}^{n(x)} h(t | x_i) ),

where λ(t) adjusts for population density, ε > 0 and h(t|x) is the Thomas intensity (2.3). Our earlier treatment of the Neyman-Scott model extends without change to this form of the model because λ(t) does not depend on x. We use a Strauss prior for the landmarks x. For λ(t) we used a smoothed version of the population density based on the 1980 U.S. census; see the right panel of Figure 1; this plot also gives the locations of the 11 inactive hazardous waste sites suspected of causing elevated leukemia incidence rates. Figure 2 gives the posterior intensity for the landmarks based on the data shown in the left panel of Figure 1; we used 1000 samples drawn using
Figure 3: Posterior observed (solid lines) and expected (dotted lines) probabilities of at least one landmark within a given distance (in kms) of each waste site.
the Kendall-Møller algorithm. Note that one of the waste sites (site 1) is located close to an area of high posterior intensity. To assess the significance of an elevated leukemia rate in the neighborhood of a given site, we compare the 'observed' with the 'expected' posterior landmark distribution. The relevant null hypothesis here is that the leukemia cases form an inhomogeneous Poisson process with intensity ρλ(t), where ρ is the average leukemia rate throughout the study region. To sample from the null distribution, we generated an artificial data set using independent
Poisson counts for each census tract, then analyzed the artificial data the same way as the original data. In Figure 3 we compare the observed and expected posterior probabilities of at least one landmark within a given distance (0-7.2 kms) of each waste site. The 20 dotted lines correspond to samples from the null distribution, and the 5 solid lines correspond to the data (with the leukemia cases randomly dispersed throughout their corresponding census tracts). The plots provide evidence of elevated leukemia rates in the neighborhood of site 1.
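A sketch of how such an artificial null data set could be generated; the tract populations, the overall case total, and the Knuth-style Poisson sampler are illustrative assumptions rather than the authors' code:

```python
import math, random

def poisson_count(mean, rng):
    """Knuth's method for one Poisson draw (fine for small means)."""
    L, k, p = math.exp(-mean), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def sample_null_counts(tract_pop, total_cases, rng=random.Random(1)):
    """Independent Poisson counts per tract under the null hypothesis:
    each tract's mean is the overall rate times its population."""
    rho = total_cases / sum(tract_pop.values())
    return {tid: poisson_count(rho * pop, rng) for tid, pop in tract_pop.items()}

counts = sample_null_counts({"A": 1000, "B": 3000}, total_cases=40)
print(counts)
```

Each such draw is then dispersed within tracts and analyzed exactly as the real data, giving one dotted curve per draw.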
4 Conclusion
In this article we have developed perfect sampling for the posterior distribution in Bayesian cluster models for spatial point processes. We have isolated conditions under which perfect sampling using the Kendall-Møller algorithm is applicable. The algorithm is shown to be feasible under mild conditions on the prior and the likelihood, and, in particular, for the useful special case of the Neyman-Scott model when the prior is repulsive. We are currently working on a more detailed study of this topic in which we examine an extended formulation of the Baddeley-van Lieshout cluster model and provide other examples in which perfect sampling from the posterior is feasible.

Acknowledgements. We thank George Casella, John Staudenmayer and Lance Waller for help with the leukemia incidence data. We also thank Jesper Møller for several useful comments. The project was partially supported by NSA Grant MDA904-99-1-0070 and NSF Grant 9971784. Equipment support was provided under ARO Grant DAAG55-98-1-0102 and NSF Grant 9871196.
References

Ahrens, C., Altman, N., Casella, G., Eaton, M., Hwang, J. T. G., Staudenmayer, J. and Stefanscu, C. (1999). Leukemia clusters and TCE waste sites in upstate New York: How adding covariates changes the story. Preprint.
Baddeley, A. J. and van Lieshout, M. N. M. (1993). Stochastic geometry models in high-level vision. In K. V. Mardia and G. K. Kanji, editors, Advances in Applied Statistics, Statistics and Images: 1, 231-256, Carfax Publishing.
Baddeley, A. J. and van Lieshout, M. N. M. (1995). Area-interaction point processes. Ann. Inst. Statist. Math. 47, 601-619.
Casella, G., Mengersen, K. L., Robert, C. P. and Titterington, D. M. (1999). Perfect slice samplers for mixtures of distributions. ftp://ftp.ensae.fr/pub/labo_stat/CPRobert/perfect.ps.gz.
Geyer, C. J. (1999). Likelihood inference for spatial point processes. In O. E. Barndorff-Nielsen, W. S. Kendall, and M. N. M. van Lieshout, editors, Stochastic Geometry: Likelihood and Computation, pp. 79-140. Chapman and Hall.
Geyer, C. J. and Møller, J. (1994). Simulation and likelihood inference for spatial point processes. Scand. J. Statist. 21, 359-373.
Ghosh, M., Natarajan, K., Waller, L. and Kim, D. (1999). Hierarchical Bayes GLMs for the analysis of spatial data: An application to disease mapping. J. Statist. Planning Inference 75, 305-318.
Green, P. J. and Murdoch, D. J. (1999). Exact sampling for Bayesian inference: towards general purpose algorithms. In J. M. Bernardo et al., editors, Bayesian Statistics 6, Oxford University Press. Presented as an invited paper at the 6th Valencia International Meeting on Bayesian Statistics, Alcossebre, Spain, June 1998.
Häggström, O., van Lieshout, M. N. M. and Møller, J. (1999). Characterization results and Markov chain Monte Carlo algorithms including exact simulation for some spatial point processes. Bernoulli 5, 641-659.
Kendall, W. S. and Møller, J. (2000). Perfect simulation using dominating processes on ordered spaces, with application to locally stable point processes. Adv. Appl. Probab. 32, 844-865.
Lawson, A. B. and Clark, A. (1999). Markov chain Monte Carlo for putative sources of hazard and general clustering. In A. B. Lawson, D. Böhning, A. Biggeri, E. Lesaffre, J.-F. Viel, editors, Disease Mapping and Risk Assessment for Public Health. Wiley.
Møller, J. (1999). Markov chain Monte Carlo and spatial point processes. In O. E. Barndorff-Nielsen, W. S. Kendall, and M. N. M. van Lieshout, editors, Stochastic Geometry: Likelihood and Computation, pp. 141-172. Chapman and Hall.
Møller, J. (2001). A review on perfect simulation in stochastic geometry. In this volume.
Møller, J. and Nicholls, G. K. (1999). Perfect simulation for sample-based inference. http://www.math.auckland.ac.nz/~nicholls/linkfiles/papers/perfect.sim.templ.ps.gz.
Neyman, J. and Scott, E. L. (1972). Processes of clustering and applications. In P. A. W. Lewis, editor, Stochastic Point Processes, Wiley, New York, pp. 641-681.
Propp, J. G. and Wilson, D. B. (1996). Exact sampling with coupled Markov chains and applications to statistical mechanics. Random Structures Algorithms 9, 223-252.
Strauss, D. J. (1975). A model for clustering. Biometrika 62, 467-475.
Van Lieshout, M. N. M. (1995). Stochastic Geometry Models in Image Analysis and Spatial Statistics. CWI Tract 108. Stichting Mathematisch Centrum, Amsterdam.
Waller, L., Turnbull, B., Clark, L. and Nasca, P. (1992). Chronic disease surveillance and testing of clustering of disease and exposure: application to leukemia incidence and TCE-contaminated dump sites in upstate New York. Environmetrics 3, 281-300.
Waller, L., Turnbull, B., Clark, L. and Nasca, P. (1994). Spatial pattern analysis to detect rare disease clusters. In N. Lange, L. Ryan, L. Billard, D. Brillinger, L. Conquest, J. Greenhouse, editors, Case Studies in Biometry, pp. 3-23. Wiley.
Institute of Mathematical Statistics LECTURE NOTES — MONOGRAPH SERIES

A REVIEW OF PERFECT SIMULATION IN STOCHASTIC GEOMETRY

Jesper Møller
Department of Mathematical Sciences
University of Aalborg

Abstract: We provide a review and a unified exposition of the Propp-Wilson algorithm and various other algorithms for making perfect simulations, with a view to applications in stochastic geometry. Most examples of applications are for spatial point processes.

Key Words: coupling from the past (CFTP), ergodicity, falling leaves model, horizontal CFTP, locally stable point processes, noisy point processes, read-once algorithm, spatial birth and death processes, vertical CFTP, Widom-Rowlinson model.
1 Introduction

One of the most important and exciting recent developments in stochastic simulation is perfect simulation. Following the seminal work by Propp and Wilson (1996), many papers have shown that perfect simulation algorithms are particularly useful in stochastic geometry, spatial statistics and statistical physics. It seems timely to review this development with a view to applications in stochastic geometry. The aims of this paper are to provide such a review for readers with limited knowledge of perfect simulation, showing the mathematical details, and also to put things into a unified framework. From a mathematical view, the paper is self-contained, but in order to keep the paper within the limit of about 20 pages, no illustrative figures or empirical results are included (the relevant references are provided). For the same reason I have chosen to focus on the Propp-Wilson algorithm, also called vertical coupling from
the past (CFTP), in Section 3, and on so-called horizontal CFTP in Section 4, also called dominated CFTP (Kendall and Møller, 2000) and coupling into and from the past (Wilson, 2000a). Section 2 provides some background material related to CFTP. Most examples of applications in Sections 2-4 are for finite spatial point processes. Other topics such as Fill's algorithm and extensions to infinite point processes are briefly discussed in Section 5.
2 CFTP and Two Examples

Proposition 1 below has some similarities with Theorem 3 in Propp and Wilson (1996) and Theorem 2.1 in Kendall and Møller (2000), but it is stated so that it applies for both the vertical and horizontal CFTP algorithms presented in Sections 3 and 4. We consider a general setting with a given target distribution π defined on a state space E, and a discrete time process {X_t}_{t≥0} on E, which is defined by a so-called stochastic recursive sequence (SRS),

X_t = φ(X_{t-1}, R_t),  t = 1, 2, ....  (2.1)

Here the R_t are random variables and φ is a deterministic function, called the updating function. Under mild conditions, any discrete time homogeneous Markov chain can be represented as a SRS where the R_t are IID (independent and identically distributed random variables); see Foss and Tweedie (1998) and the references therein. When making simulations, R_t is generated by a vector of pseudo-random numbers V_t = (V_{t1}, ..., V_{tN_t}), where N_t ∈ N is either a constant or yet another pseudo-random number; see e.g. Example 2 below. Moreover, we include negative times and let, for any state x ∈ E and times s, t ∈ Z with s < t,

X_s^t(x) = φ(... φ(φ(x, R_{s+1}), R_{s+2}) ..., R_t)

denote the state of a process at time t when it is started in x at time s. In Section 4 we consider continuous-time jump processes X_s(x) = {X_t(x) : t ≥ s} with X_s^s(x) = x, where s ≤ J_1(s, x) < J_2(s, x) < ... is a random sequence containing the jump times of X_s(x) (this sequence is only needed for theoretical considerations; it is not used in the implementations of our perfect simulation algorithms). In order to unify the notation, for the discrete time setting used in Examples 1 and 2 and Section 3, we simply set X_s^t(x) = X_t(x) for integers s ≤ t. We require that π is the limiting distribution of X_0^t(x) as t → ∞, where in Section 4 we select a particular state x̂, while in Examples 1 and 2 and in Section 3 an
arbitrary state x can be chosen. Finally, we say that a random variable T ∈ N₀ ∪ {∞} is a stopping time with respect to R₋ = {R₋t}_{t≥0} if, for any t ∈ N₀, the event {T ≤ t} is determined by R₀, R₋₁, ..., R₋t.
Proposition 1 Assume that (i) the distribution of {R_t}_{t∈Z} is stationary in time, (ii) there exists a state x̂ ∈ E so that for any event F ⊆ E, P(X_0^t(x̂) ∈ F) → π(F) as t → ∞, and (iii) T ≥ 0 is an almost surely finite stopping time with respect to R₋ such that X_{-t}^0(x̂) = X_{-T}^0(x̂) whenever t ≥ T. Then X_{-T}^0(x̂) ∼ π.

Proof.

P(X_{-T}^0(x̂) ∈ F) = P(lim_{t→∞} X_{-t}^0(x̂) ∈ F) = lim_{t→∞} P(X_{-t}^0(x̂) ∈ F)
                 = lim_{t→∞} P(X_0^t(x̂) ∈ F) = π(F),

where the monotone convergence theorem is used for obtaining the second equality and (i) is used for obtaining the third equality.

Generally speaking, by a CFTP algorithm we understand a way of determining a time −t ≤ −T and returning X_{-t}^0(x̂) ∼ π. Occasionally, T is referred to as the running time of the algorithm, though we should keep in mind that a more precise definition of "running time" may be appropriate in each specific algorithm. A CFTP algorithm is only perfect/exact in principle, as pseudo-random numbers are used in practice, and X_{-T}^0 and T may be dependent, so long running times may cause a bias in the simulations which are actually used. Still, for short, we call this perfect simulation.
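The recipe in Proposition 1 can be illustrated on a toy chain with a small finite state space, where the stopping time T is detected by brute force over all start states; the chain and the stopping rule are illustrative assumptions, not part of the text:

```python
import random

def phi(x, r):
    """Toy updating function on the state space {0, 1, 2}."""
    if r > 2/3:
        return min(x + 1, 2)
    if r < 1/3:
        return max(x - 1, 0)
    return x

def cftp(rng=random.Random(42)):
    """Coupling from the past: push the start further back, reusing the
    same R_{-t}, until X^0_{-n}(x) agrees for every start state x."""
    R = []                                 # R[t] plays the role of R_{-t}
    n = 1
    while True:
        while len(R) < n:
            R.append(rng.random())
        states = {0, 1, 2}
        for t in range(n - 1, -1, -1):     # apply updates from time -n up to 0
            states = {phi(x, R[t]) for x in states}
        if len(states) == 1:               # all start states have coalesced
            return states.pop()
        n *= 2

print(cftp())
```

The crucial point, as in Proposition 1, is that the randomness near time 0 is reused unchanged each time the start is pushed further into the past.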
Example 1: Falling leaves model. Consider a closed set S ⊆ R^d and IID random closed sets R_t ⊆ R^d, t ∈ Z, so that with probability one, S will be covered by a finite number of the R_t's:

P(∃ t > 0 : S ⊆ R₁ ∪ ... ∪ R_t) = 1.

Let E be the set of all closed subsets of S. Defining, for x ∈ E and closed subsets R ⊆ R^d,

φ(x, R) = [(x \ R) ∪ ∂R] ∩ S,

where ∂R denotes the topological boundary of the set R, we obtain by (2.1) a Markov chain defined on E. If we think of the R_t as leaves falling on the ground (so d = 2) and there are no leaves at time t = 0, then X_0^t(∅) shows the boundaries of fallen leaves when looking down at the region S at time t > 0. This model for falling leaves has been introduced by Matheron (1968, 1975); see also Serra (1982), Jeulin (1997), Kendall and
Thönnes (1999), and the illuminating applets at Wilfrid Kendall's homepage (http://www.warwick.ac.uk/statsdept/Staff/WSK/dead.html). Since X_0^t(x) does not depend on x ∈ E whenever t ≥ inf{n > 0 : S ⊆ R₁ ∪ ... ∪ R_n}, the chain is easily seen to be uniformly ergodic (this is in fact verified in Section 3.1). In particular X_0^t(x) has some limiting distribution π as t → ∞, where π does not depend on the choice of x ∈ E. Though this may be a very complicated distribution, we can at least make perfect simulations from π: by Proposition 1, for any x ∈ E, X_{-T_fl}^0(x) ∼ π if we define

T_fl = inf{t ∈ N₀ : S ⊆ R₀ ∪ R₋₁ ∪ ... ∪ R₋t},  (2.2)

and X_{-T_fl}^0(x) does not depend on x.

Example 2: Widom-Rowlinson model. Let A ⊆ R^d be a Borel set with positive and finite Lebesgue measure |A|. Denote the homogeneous Poisson point process defined on A and of rate β > 0 by Poisson(A, β). This means the following if R ∼ Poisson(A, β) is represented as a finite subset of A: the number n(R) of points in R follows a Poisson distribution with mean β|A|; and conditionally on n(R), the points in R are IID and each point follows a uniform distribution on A. Now suppose that X^(i) ∼ Poisson(S, β_i), i = 1, 2, are independent, S ⊆ R^d is closed with 0 < |S| < ∞, and

π = P(X^(1), X^(2) | dist(X^(1), X^(2)) > δ)
(2.3)
where dist(X^(1), X^(2)) denotes the shortest distance between a point from X^(1) and a point from X^(2), and β₁, β₂, δ are positive parameters. This is the Widom and Rowlinson (1970) model, which possesses many interesting properties as discussed in Häggström, van Lieshout and Møller (1999) and the references therein. The support of π is the set of all (x^(1), x^(2)) where x^(1) and x^(2) are finite subsets of S with dist(x^(1), x^(2)) > δ. However, it becomes convenient in Section 3.2 if we define the state space E as the set of all (x^(1), x^(2)) where x^(1) and x^(2) are closed subsets of S. For the purpose of making simulations from π, it seems natural to use a two-component Gibbs sampler, since

P(X^(1) | X^(2), dist(X^(1), X^(2)) > δ) = Poisson(S \ U_{X^(2)}, β₁),

P(X^(2) | X^(1), dist(X^(1), X^(2)) > δ) = Poisson(S \ U_{X^(1)}, β₂),

where for x ⊆ S, U_x denotes the union of balls of radius δ and centers ξ ∈ x. Using a systematic updating scheme, the two-component Gibbs sampler amounts to using a SRS with R_t = (R_t^(1), R_t^(2)), where R_t^(i) ∼ Poisson(S, β_i), i = 1, 2, t ∈ Z, are independent, and the updating function φ is defined as follows: for x^(1), x^(2), r^(1), r^(2) ⊆ S and (y^(1), y^(2)) = φ((x^(1), x^(2)), (r^(1), r^(2))), we have that

y^(1) = r^(1) \ U_{x^(2)},  y^(2) = r^(2) \ U_{y^(1)}.  (2.4)

Clearly this sampler preserves π, and it is easily shown to be uniformly ergodic due to the following simple facts: if r^(1) = ∅ then y^(1) = ∅ does not depend on x^(2); similarly, if r^(2) = ∅ then y^(2) = ∅ does not depend on y^(1); and both cases can happen in each update of the Gibbs sampler, since P(R_t^(1) = ∅) = exp(−β₁|S|) > 0. Moreover, after updating the first component, the sampler stays within the support of π, no matter the choice of the initial state in E. Due to these properties, an obvious choice for the stopping time used in Proposition 1 would be

T_WR = inf{t ∈ N₀ : [R₋t^(1) = ∅] or [R₋t^(2) = ∅ and t > 0]}.  (2.5)

Indeed, for any x ∈ E, X_{-T_WR}^0(x) ∼ π and X_{-T_WR}^0(x) does not depend on x. However, except for possibly rather uninteresting cases with a very small value of β₁ or β₂, T_WR is expected to be extremely large. A much more efficient coupling time is specified in Section 3.2.
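A numerical sketch of the two-component Gibbs update (2.4) on the unit square; the window, the rates, and δ are illustrative choices, and point patterns are stored as plain coordinate lists:

```python
import math, random

DELTA, BETA1, BETA2 = 0.1, 30.0, 30.0   # illustrative parameters; S = unit square

def poisson_count(mean, rng):
    """Knuth's method for one Poisson draw."""
    L, k, p = math.exp(-mean), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def poisson_pp(beta, rng):
    """Draw Poisson(S, beta) on the unit square as a list of points."""
    return [(rng.random(), rng.random()) for _ in range(poisson_count(beta, rng))]

def minus_U(r, x):
    """r \\ U_x: keep the points of r farther than DELTA from every point of x."""
    return [p for p in r if all(math.dist(p, q) > DELTA for q in x)]

def gibbs_update(state, r1, r2):
    x1, x2 = state
    y1 = minus_U(r1, x2)      # first component of (2.4); note x1 is not used
    y2 = minus_U(r2, y1)      # second component of (2.4)
    return (y1, y2)

rng = random.Random(3)
state = ([], [])
for _ in range(50):
    state = gibbs_update(state, poisson_pp(BETA1, rng), poisson_pp(BETA2, rng))
y1, y2 = state
print(len(y1), len(y2))
```

After the very first update the state lies in the support of π, i.e. every cross-pair of points is at distance greater than δ.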
3 Vertical CFTP

In Examples 1 and 2 we noticed that the chain was uniformly ergodic. As discussed in Section 3.1, this property is in a sense a sufficient and necessary property for applying the Propp-Wilson algorithm which, for reasons which soon will become clear, henceforth is called vertical CFTP. Furthermore, Section 3.2 shows how monotonicity properties of the SRS make vertical CFTP feasible in practice. Finally, Wilson's read-once algorithm is discussed in Section 3.3. Throughout Sections 3.1-3.3 the R_t's are assumed to be IID.

3.1 Vertical CFTP and uniform ergodicity
Propp and Wilson (1996) consider what Foss and Tweedie (1998) call the smallest "vertical backward coupling time", that is, the first time before 0 for coalescence of all possible chains; denoting this by −T_PW, we have

T_PW = inf{t ∈ N₀ : X_{-t}^0(x) = X_{-t}^0(y) for all x, y ∈ E}.  (3.1)
For example, for the falling leaves model, T_PW = T_fl in (2.2), while for the Widom-Rowlinson model, T_PW ≤ T_WR in (2.5). In fact, typically T_PW < T_WR unless the rates β_i are very small (see Figure 3 in Häggström, van Lieshout and Møller, 1999). When does Proposition 1 apply if T = T_PW? Condition (i) in Proposition 1 is clearly satisfied. By the definition (3.1), T_PW is a stopping time with respect to R₋, and X_{-t}^0(x) does not depend on (x, t) ∈ E × N₀ when −t ≤ −T_PW. Moreover, as shown below, if T_PW < ∞ almost surely, the chain becomes uniformly ergodic, and so condition (ii) in Proposition 1 is satisfied. Hence, in order to verify the conditions of Proposition 1, we need only to verify that P(T_PW < ∞) = 1:

Proposition 2 If P(T_PW < ∞) = 1, then for any x ∈ E, X_{-T_PW}^0(x) ∼ π and X_{-t}^0(x) = X_{-T_PW}^0(x) whenever t ≥ T_PW.
Indeed, as noticed in Propp and Wilson (1996), if the support of π is finite, then irreducibility implies that P(T_PW < ∞) = 1. But in applications of stochastic geometry, the support of π is rarely finite, as illustrated in Examples 1 and 2. However, we can often find another backwards stopping time T so that T_PW ≤ T, where T is easily shown to be almost surely finite. One such example is T = T_WR given by (2.5). Then by Proposition 2, we can make a perfect simulation from π, provided there is a feasible way of determining T_PW when an algorithm for vertical CFTP is implemented; see Section 3.2. Now, how does the condition in Proposition 2 relate to uniform ergodicity? Recall that uniform ergodicity of a discrete time homogeneous Markov chain is equivalent to the existence of a time n > 0, a number ε > 0, and a probability measure Q such that

P(X_0^n(x) ∈ ·) ≥ ε Q(·) for all x ∈ E  (3.2)

(Meyn and Tweedie, 1993, Theorem 16.0.2). Note that in (3.2) we do not require that the chain is defined in terms of a SRS, though we are still using the notation X_0^n(x) for the state at time n when starting in x at time 0. Observe also that by time homogeneity,

T_for = inf{t ∈ N₀ : X_0^t(x) = X_0^t(y) for all x, y ∈ E} ∼ T_PW (equality in distribution).  (3.3)
Hence T_PW < ∞ almost surely if and only if there is a time t ∈ N so that P(C_t) > 0, where C_t = {X_0^t(x) = X_0^t(y) for all x, y ∈ E}. (The "only if" part follows immediately from (3.3), while the "if" part follows by considering the independent and equiprobable events C_{t,i} = {X_{(i-1)t}^{it}(x) = X_{(i-1)t}^{it}(y) for all x, y ∈ E}, i = 1, 2, ...; in this section we use only the "if" part, while the equivalence is used in Section 3.3.)
Consequently, on one hand, if T_PW < ∞ almost surely, the chain is uniformly ergodic: set n = t + 1, ε = P(C_t), and Q(F) = P(φ(Z, R_{t+1}) ∈ F | C_t), where Z denotes the common value of X_0^t(x), x ∈ E, when the event C_t happens to occur. This shows the limitations of the applicability of vertical CFTP: it never works if we don't have uniform ergodicity. On the other hand, it can be shown that uniform ergodicity implies the existence of a SRS construction so that T_PW < ∞ almost surely (one such construction is provided by Nummelin's splitting technique if n = 1 in (3.2)); we refer to Foss and Tweedie (1998) for further details. Of course this result is purely theoretical, as there exist infinitely many possible SRS's, and the art in practice is to pick one so that the chains coalesce quickly.

3.2 Monotonicity and Anti-monotonicity
One concern with, for example, the SRS construction used in Example 2 is that apparently there are uncountably many paths to take care of if we wish to determine T_PW. In contrast, only one path is needed in Example 1! However, as observed in Propp and Wilson (1996), this problem may be overcome if there is a partial ordering ≺ on E such that the updating function is monotone in its first argument,

φ(x, ·) ≺ φ(y, ·) whenever x ≺ y,  (3.4)

and if there exist a unique minimum 0̂ ∈ E and a unique maximum 1̂ ∈ E:

∀ x ∈ E : 0̂ ≺ x ≺ 1̂.

Thereby T_PW is determined by only two paths "started in the infinite past". More precisely,

T_PW = inf{n ∈ N₀ : L_{-n}^0 = U_{-n}^0},  (3.5)

where L₋n = {X_{-n}^{-t}(0̂) : t = n, n−1, ...} and U₋n = {X_{-n}^{-t}(1̂) : t = n, n−1, ...} are the lower and upper chains started at the minimal respective maximal state at time −n: set L_{-n}^{-n} = 0̂, U_{-n}^{-n} = 1̂, and

L_{-n}^{-t} = φ(L_{-n}^{-t-1}, R_{-t}),  U_{-n}^{-t} = φ(U_{-n}^{-t-1}, R_{-t}),  t = n−1, n−2, ..., 0.  (3.6)

If the monotonicity property (3.4) is replaced by the anti-monotonicity property, φ(y, ·) ≺ φ(x, ·) whenever x ≺ y, then using the cross-over trick introduced in Kendall (1998) we redefine

L_{-n}^{-t} = φ(U_{-n}^{-t-1}, R_{-t}),  U_{-n}^{-t} = φ(L_{-n}^{-t-1}, R_{-t}),  t = n−1, n−2, ..., 0,  (3.7)
whereby (3.5) remains true. Notice that in both the monotone and the anti-monotone case we have the following sandwiching property,

L_{-n}^{-t} ≺ X_{-n}^{-t}(x) ≺ U_{-n}^{-t}  for t = n, n−1, ... and n = 0, 1, ...,  (3.8)

and the funnelling property,

L_{-n}^{-t} ≺ L_{-m}^{-t} ≺ U_{-m}^{-t} ≺ U_{-n}^{-t}  for integers m ≥ n ≥ t with n ≥ 0.  (3.9)
So putting things together with Proposition 2, we obtain the following proposition.

Proposition 3 Assume that P(T_PW < ∞) = 1 and there is a partial ordering with unique minimal and maximal states. Then in both the monotone and anti-monotone case, L_{-n}^0 = X_{-T_PW}^0 ∼ π whenever L_{-n}^0 = U_{-n}^0.

Thus, when the conditions of Proposition 3 hold, it becomes much simpler to determine T_PW and to implement a vertical CFTP algorithm: generate (L₋n, U₋n) further and further back in time for any strictly increasing sequence of n's, and then return L_{-n}^0 ∼ π as soon as L_{-n}^0 = U_{-n}^0. As argued in Propp and Wilson (1996), instead of using the sequence n = 1, 2, 3, 4, ..., it may be more efficient to use a doubling scheme n = 1, 2, 4, 8, ...; see also the discussion in Wilson (2000b). Notice that by (3.6) and (3.7), in the vertical CFTP algorithm we need to store each R₋t which has been used for determining a pair of lower and upper processes.
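The two-chain scheme of Proposition 3, combined with the doubling strategy, can be sketched for a toy monotone chain; the chain and its ordering are illustrative assumptions, and only the lower and upper paths are tracked rather than every start state:

```python
import random

def phi(x, r):
    """Monotone update on {0, ..., 4}: the same r moves every state the same way."""
    if r > 2/3:
        return min(x + 1, 4)
    if r < 1/3:
        return max(x - 1, 0)
    return x

def monotone_cftp(rng=random.Random(7)):
    """Vertical CFTP tracking only the lower and upper chains of (3.6)."""
    R = []                       # R[t] is reused as R_{-t} on every restart
    n = 1
    while True:
        while len(R) < n:
            R.append(rng.random())
        lo, hi = 0, 4            # start at the minimum and maximum at time -n
        for t in range(n - 1, -1, -1):
            lo, hi = phi(lo, R[t]), phi(hi, R[t])
        if lo == hi:             # by sandwiching (3.8) every start has coalesced
            return lo
        n *= 2                   # doubling scheme

print(monotone_cftp())
```

Note that the list R must be kept, since each R_{-t} is reused when the start is pushed further back, exactly as remarked in the text.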
6 = (0,0),
ϊ = (5,5),
are unique minima and maxima, and the SRS construction for the twocomponent Gibbs sampler is seen to be anti-monotone, so vertical CFTP easily applies. Note that 0 belongs to the support of π (it is even an atom), while ϊ does not (it is at this point the definition of E becomes convenient). Moreover we can in an obvious way extend the definition (2.3) to a WidomRowlinson model of k > 2 components, and the updating function (2.4) to an anti-monotone fc-component Gibbs sampler with respect to an obvious extension of (3.10), so vertical CFTP applies for this case also. There is another partial ordering which makes the two-component Gibbs sampler monotone: suppose now that y(J), y<2> C x™,
(3.11)
341
PERFECT SIMULATION in which case 6 = (0,5),
ϊ = (5,0),
(3.12)
are unique minima and maxima. Note that now neither 0 nor 1 is in the support of π, and (3.11) does not extend to the case of k > 2 components. For a further discussion of perfect simulation using the monotone version above (but with (3.12) replaced by "quasi-minimal and quasi-maximal" states), including empirical results for the Widom-Rowlinson model, see Haggstrόm, van Lieshout and M0ller (1999). 3.3
Read-once Algorithm
A natural question is if we instead of going backwards in time could generate perfect simulations by running forwards in time. For example, by (3.3), the "vertical forwards coupling time" Tfor is distributed as the vertical backwards coupling time Tpw- However, in general X0for ^ π; a counterexample is provided by a random walk on {1,2,3} defined by the SRS φ{x,Rt) = x + Rt, where the Rt are ΠD and uniformly distributed on {±1}, and where we truncate x + Rt at 1 and 3. Moreover, we have noticed in Section 3.2 the need of storing the R-t which are reused in the lower and upper processes. Below we describe Wilson's (2000a) read-once algorithm, which runs forward in time, starting at time 0, and reading the Rt only once. As we shall see, it works whenever vertical CFTP does, and it can then be naturally used for producing IID samples from π. For m G N and i G Z, set Fi(x) = ^™ 1 ) m (z) Then the "random maps" F{, i G Z, are IID. As verified in Section 3.1, Tpw < oo almost surely if and only if p = P(range(Fo) is a singleton) is strictly positive for m sufficiently large. Set Ko = 0 and define recursively, Ki = inf {A; > Ki-\ : range(Ffc) is a singleton} , i = 1,2 ... Ki = sup {k < Ki+ι : range(Ffc) is a singleton} , i = —1, —2... Finally, set Gx = FKι, G{ = FKi-X o ... o FKi_λ for i € Z \ {1}, and n = K{ — 1 — Ki-ι for i £ Z, where o denotes composition of mappings. Proposition 4 // ?{TPW < oo) = 1, then the (Gun) with i G Z \ {1} are IID, where G{ ~ π and T{ follows a geometric distribution with mean
Proof. By Proposition 2, Go ~ π. Using that the random maps are IID, we obtain easily the following properties. The r^, ί E Z, are IID, and each follows a geometric distribution with mean (1 — p)/p. Furthermore, conditionally on the T^, the random maps are mutually independent, where the conditional distribution of Fjd is the same as the distribution of G\
342
M0LLER
for i ^ 0, while for j g {K{ : i Φ 0}, the conditional distribution of Fj is the same as the conditional distribution of Fo given that range(Fo) is not a singleton. Thus (FKi^u... ,F^_ 1 3 Ti), i E Z\ {1}, are ΠD, and so the (G i5 n) with i € Z \ {1} are IID. As noticed in the proof, the τ; (including τ\) are IID. However, in general, Gi 7^ π. A counterexample is provided by the SRS (2.4) for the WidomRowlinson model: letting m = 1, we obtain that the first component of G\ equals 0, contradicting the fact that π({0, •}) < 1. The read-once algorithm consists in generating G2,. , Gj for a given integer j > 2. Notice that we can successively determine G2,... ,CJJ from one path starting at G\ = Fκτ, since G2 = Fκ2-i o ... o Fκx, while for i = 3,4,..., we have successively that Gi = Fj^-i o ... o Fκi_λ ° Gi-i Here Gi and ϋfi, K 0.5 or equivalently E(τ2) < 1. Finally, we notice that Wilson (2000a) discusses a coupling method so that the read-once algorithm applies on locally stable point processes, as defined in Section 4.1.
4
Horizontal CFTP
As mentioned in Section 1, horizontal CFTP has other names in other papers; it is called so here in order to clarify the difference from vertical CFTP (we comment on this at the end of Section 4.1). The ideas behind horizontal CFTP are due to Kendall(1998); a general setting can be found in Kendall and M0ller (2000). Section 4.1 shows the details in the case of using spatial birth and death processes with a locally stable equilibrium distribution. Section 4.2 concerns the case of a noisy point process model. Further examples are briefly discussed in Section 5. 4.1
Perfect simulation for locally stable point processes using spatial birth and death processes
This section is based on Kendall and M0ller (2000). As the results will be used in Section 4.2, we give a detailed exposition. We consider a general setting for finite point processes, where Λ = Poisson(ιS, K) is a Poisson point process defined on a space S and with intensity measure K SO that 0 < κ(S) < 00. For convenience we assume n to be diffuse (i.e. non-atomic), whereby Λ is concentrated on the state space E = {x C S : n(x) < 00} of finite point configurations (but everything in the following easily extends to the case where n is not diffuse); here n(x) denotes the number of elements in the set x. So if X ~ Λ, then n(X) follows a
343
PERFECT SIMULATION
Poisson distribution with mean «(), and conditionally on n(X), the points in X are ΠD with distribution κ( ) = κ( )/κ(S). We restrict attention to point processes X with a target distribution π which is absolutely continuous with respect to Λ, and assume the following local stability condition: if / = dπ/dΛ denotes the density, there is a number K > 0 such that f(xU{ξ})
foτxeE,
ξeS\x.
(4.1)
In other words, defining the Papangelou conditional intensity,
Λ
^'
ξ J
~\
0
otherwise
we have assumed λ < K and / to be hereditary, i.e. for any x G E and ξ E S\x, f(x) > 0 if f(x U {ξ}) > 0. Since everything in the sequel only depends on / through λ, we need only to know an explicit expression of / up to proportionality. Consider first a spatial birth and death process X = {Xt t E R} with birth and death rates b and d. These are non-negative measurable functions defined on E x S such that, in any small time-interval [t, t + dt] and for any B £ /3, if we condition on Xt = x and the process before time t, we have the following: (i) the probability for a birth Xt+άt — x U {£}, with the newborn point ξ E B, is JB b(x,ξ)κ(dξ) x dt + O(dt); (ii) for any point η G x, the probability for a death Xt+dt = x\ {η} is d(x \ {η},η)dt + O(dt); (iii) the probability for more than one transition is O(dt). The target density / and the spatial birth and death process X are in detailed balance if f(x)b(x, ξ) = f{xU {ξ})d(x, ξ) > 0
whenever f(xU {£}) > 0.
(4.2)
In the sequel we set &(z,0=λ(z,0,
dΞl,
whereby (4.2) holds. This is seemingly the simplest choice, but everything in the following can be extended to the general case of (4.2), cf. Berthelsen and M0ller (2001). Notice that (4.1) ensures that X does not explode, 0 is an ergodic atom for X at which it regenerates, and for any event F C B , P(Xt G F) -> π(F) as t -> oo, cf. Preston (1977) (alternatively, one can verify these properties directly using the coupling construction described below). Consider next another spatial birth and death process D = {Dt : ί G R } with birth rate K and death rate 1. Then D satisfies (4.2) when f(x) oc Kn(χ\ i.e. it has equilibrium distribution Poisson(5, Kn), and 0 is an ergodic atom. Let J\ < J<ι < -.. denote the jump times of {Dt : t > 0} and define the
344
M0LLER
jump chain {Dt}t^ι = {Djt}^ι\ for convenience, set DQ = Do and JQ = 0. The key point is now that there is an explicit coupling of {Xt : t > 0} and {Dt : t > 0} so that the jump times of X are included in {Jt}, and letting {Xt}f> = { X j J i 0 and Xo = Xo, we have that D dominates X as Xt C A , t > 0.
(4.3)
Hence a natural partial ordering on E is ⊆ (set inclusion). Notice that ∅ is a unique minimum, but there exists no maximum. Specifically, let e_{t1}, e_{t2}, e_{t3}, e_{t4}, t ∈ Z, be mutually independent, where each of e_{t1}, e_{t3}, e_{t4} is uniformly distributed on [0, 1], while e_{t2} ∼ κ(·)/κ(S). Then a SRS construction for D̃ is given by

D̃_t = D̃_{t-1} ∪ {ξ_t} \ {η_t}, t = 1, 2, ...

(4.4)

where ξ_t and η_t are defined as follows. A birth happens if e_{t1} ≤ Kκ(S)/(Kκ(S) + n(D̃_{t-1})), in which case ξ_t = e_{t2} and η_t = ∅; otherwise a death happens, in which case ξ_t = ∅ and η_t is the point of D̃_{t-1} selected uniformly by means of e_{t3}. The corresponding target chain is given by X̃_t = X̃_{t-1} ∪ {ξ_t} \ {η_t}, except that a proposed birth ξ_t is only accepted if

e_{t4} ≤ λ(X̃_{t-1}, ξ_t)/K.

(4.5)
Finally, consider the exponentially distributed waiting times J_t − J_{t-1} ∼ Exp(Kκ(S) + n(D̃_{t-1})), t = 1, 2, ..., whereby {D_t : t ≥ 0} and {X_t : t ≥ 0} are obtained. Then clearly (4.3) is satisfied, and it can be straightforwardly verified that {X_t : t ≥ 0} and {D_t : t ≥ 0} define two spatial birth and death processes with death rate 1 and birth rates λ and K, respectively. So setting x = ∅, condition (ii) in Proposition 1 is satisfied. In order to see that the other conditions of Proposition 1 are satisfied, imagine we first draw e_{04} and D_0 ∼ Poisson(S, Kκ) so that the dominating process is in equilibrium. Next, imagine we simulate (D̃_t, e_{t4}) forwards in time t = 1, 2, ... and (D̃_{-t}, e_{-t4}) backwards in time −t = −1, −2, ... (by reversibility, this is an easy task). Finally, imagine we start "target chains" in x = ∅: for n = 0, 1, ..., use (4.5) to obtain X_{-n}(∅) = {X̃^{-n}_{-t}(∅) : t = n, n − 1, ...}. By Proposition 1, if T > 0 is an almost surely finite stopping time with respect to the backward variables such that X^{-n}_0(∅) = X^{-T}_0(∅) whenever n ≥ T, then X^{-T}_0(∅) ∼ π. One such stopping time is

T_0 = inf{n ≥ 0 : D̃_{-n} = ∅},
345
PERFECT SIMULATION
where T_0 < ∞ almost surely, since ∅ is an ergodic atom. But like the stopping time (2.5) for the Widom-Rowlinson model, T_0 can be extremely large. However, we can bound X_{-n}(∅) by the following lower and upper processes: set L^{-n}_{-n} = ∅, U^{-n}_{-n} = D̃_{-n}, and for t = n − 1, n − 2, ..., 0,

L^{-n}_{-t} = L^{-n}_{-t-1} ∪ {ξ_{-t}} if {ξ_{-t}} ≠ ∅ and e_{-t4} ≤ α_L(L^{-n}_{-t-1}, U^{-n}_{-t-1}, ξ_{-t}),
L^{-n}_{-t} = L^{-n}_{-t-1} \ {η_{-t}} if {ξ_{-t}} = ∅,
L^{-n}_{-t} = L^{-n}_{-t-1} otherwise;

U^{-n}_{-t} = U^{-n}_{-t-1} ∪ {ξ_{-t}} if {ξ_{-t}} ≠ ∅ and e_{-t4} ≤ α_U(L^{-n}_{-t-1}, U^{-n}_{-t-1}, ξ_{-t}),
U^{-n}_{-t} = U^{-n}_{-t-1} \ {η_{-t}} if {ξ_{-t}} = ∅,
U^{-n}_{-t} = U^{-n}_{-t-1} otherwise;

(4.6)

where

α_L(L^{-n}_{-t-1}, U^{-n}_{-t-1}, ξ_{-t}) = min{λ(x, ξ_{-t})/K : L^{-n}_{-t-1} ⊆ x ⊆ U^{-n}_{-t-1}},
α_U(L^{-n}_{-t-1}, U^{-n}_{-t-1}, ξ_{-t}) = max{λ(x, ξ_{-t})/K : L^{-n}_{-t-1} ⊆ x ⊆ U^{-n}_{-t-1}}. (4.7)

Notice that the same ξ_{-t}, η_{-t}, e_{-t4} are used for generating all (L^{-n}_{-t}, U^{-n}_{-t}) with −n ≤ −t. Thereby we obtain the following sandwiching property:

L^{-n}_{-t} ⊆ X̃^{-n}_{-t}(∅) ⊆ U^{-n}_{-t} ⊆ D̃_{-t}, for t = n, n − 1, ... and n = 0, 1, .... (4.8)
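For a repulsive model (λ decreasing in x with respect to ⊆), the extrema in (4.7) are attained at the endpoints of the sandwich: the minimum over L ⊆ x ⊆ U is λ(U, ξ)/K and the maximum is λ(L, ξ)/K. A minimal sketch, in which the Strauss-type λ and all parameter values are illustrative assumptions:

```python
import itertools
import math

K, GAMMA, R = 2.0, 0.5, 0.1   # assumed repulsive model: GAMMA <= 1

def lam(x, xi):
    # decreasing in x: adding points can only reduce lam
    return K * GAMMA ** sum(math.dist(p, xi) < R for p in x)

def alpha_L(L, U, xi):        # min in (4.7); attained at the upper endpoint
    return lam(U, xi) / K

def alpha_U(L, U, xi):        # max in (4.7); attained at the lower endpoint
    return lam(L, xi) / K

# Brute-force check of (4.7) on a tiny example, enumerating all x with L ⊆ x ⊆ U.
L = {(0.20, 0.30)}
extra = [(0.22, 0.28), (0.80, 0.70)]
U = L | set(extra)
xi = (0.21, 0.31)
vals = [lam(L | set(s), xi) / K
        for k in range(len(extra) + 1) for s in itertools.combinations(extra, k)]
assert math.isclose(alpha_L(L, U, xi), min(vals))
assert math.isclose(alpha_U(L, U, xi), max(vals))
assert alpha_L(L, U, xi) <= alpha_U(L, U, xi)
```

In the attractive case the roles of the endpoints are simply swapped; without such monotonicity the min and max generally require a search over exponentially many configurations.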
Observe also that the funnelling property (3.9) is satisfied. Finally,

T_hor = inf{n ≥ 0 : L^{-n}_0 = U^{-n}_0}

provides a stopping time which is often applicable (see below). Indeed T_hor ≤ T_0 and typically T_hor ≪ T_0.

Proposition 5 For a locally stable point process with distribution π and lower and upper processes as defined above, T_hor < ∞ almost surely, and if D_0 ∼ Poisson(S, Kκ) then L^{-n}_0 = X^{-T_hor}_0(∅) ∼ π whenever L^{-n}_0 = U^{-n}_0.

Proof. This follows immediately from Proposition 1 with x = ∅, using the above-mentioned sandwiching and funnelling properties.

A CFTP algorithm based on generating lower and upper processes may now be implemented in the same way as the vertical CFTP algorithm in Section 3.2, except that we start by drawing e_{04} and D_0 ∼ Poisson(S, Kκ) and generate, as long as needed, (D̃_{-t}, e_{-t4}) further and further back in time, i.e. until L^{-n}_0 = U^{-n}_0. We call this horizontal CFTP, since we have a horizontal coupling between the paths in (4.8), and in contrast to vertical
CFTP, we cannot use an arbitrary initial state x for the target chains; in fact we can only use x = ∅. In practice, it may only be feasible to determine (4.6) and (4.7) if λ(x, ξ), considered as a function of x, is either increasing (the so-called attractive case) or decreasing (the so-called repulsive case) with respect to the partial ordering ⊆; or, at least, if λ(x, ξ) factorizes into terms which are increasing or decreasing in x, we may modify (4.6) and (4.7) in an obvious way so that the computations become feasible. Fortunately, in most applications, the model is either attractive or repulsive.

4.2
Perfect Simulation for a Noisy Point Process
So far we have mainly for illustrative purposes considered rather simple examples of perfect simulation algorithms. More complicated algorithms are usually needed in "real applications", as exemplified in this section, where we consider an interesting noisy point process model studied in Lund and Rudemo (2000) and Lund et al. (1999), and give an alternative and shorter description of a slightly improved version of a CFTP algorithm introduced in Lund and Thönnes (2000) (the difference is explained at the end). In particular, we demonstrate how the results in Section 4.1 can be applied.

The noisy point process model can briefly be described as follows. Let A ⊂ R^d be a Borel set with 0 < |A| < ∞ and let μ = Poisson(A, 1). Suppose X is a point process with density f with respect to μ, which is subject to the following four operations.

(I) Independent thinning, where each point ξ ∈ X is retained with probability p; here 0 < p < 1 is a given parameter.

(II) If ξ ∈ X is retained, then it is translated by a vector R(ξ) with density φ with respect to Lebesgue measure on R^d; here the vectors R(η), η ∈ R^d, are IID and independent of the retained points in X.

(III) The retained displaced points Z(ξ) = ξ + R(ξ) which fall outside A are censored.

(IV) Let U ⊆ X denote the points which are either thinned in (I) or censored in (III) after the displacement in (II). Set V = X \ U, Z = {Z(ξ) : ξ ∈ V}, and Q = {(Z(ξ), R(ξ)) : ξ ∈ V}. Assume that W ∼ Poisson(A, a) is independent of (U, Q), where a > 0 is a parameter, and that only a realization of Y = Z ∪ W is observed.

Our target distribution π is the conditional distribution of (U, Q) given Y = y. From this we can obtain the conditional distribution of X given Y = y, which in the above-mentioned papers is considered as the posterior distribution of primary interest. In order to apply the results in Section 4.1, we let π be defined on an augmented state space

E = {(u, q) : u ⊆ A \ y, n(u) < ∞, q ⊂ y × R^d, n(q) < ∞},
i.e. we include marked point configurations q = {(z_1, r_1), ..., (z_n, r_n)} ⊂ y × R^d where some of the points z_1, ..., z_n ∈ y can be equal, though Q under π is concentrated on the set M consisting of those q where z_1, ..., z_n are pairwise distinct and z_i − r_i ∈ A, i = 1, ..., n. Finally, for a given point ξ ∈ A,

h(ξ) = P(ξ ∈ U | ξ ∈ X) = p ∫_{R^d \ A} φ(r − ξ) dr + 1 − p

is the probability that the point is either thinned in (I) or censored in (III) after the displacement in (II), and we define for finite point configurations u ⊂ A,

H(u) = ∏_{ξ ∈ u} h(ξ).
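The thinning/censoring probability h(ξ) can be checked by simulating operations (I)-(III) forwards for a single point. In this sketch A = [0,1]^2 and φ is taken uniform on [-1/2,1/2]^2 so that h(ξ) is available in closed form; these choices, like the value of p, are illustrative assumptions, not taken from the text:

```python
import random

random.seed(2)
P = 0.5                                  # assumed retention probability p

def in_A(z):                             # A = [0,1]^2 (assumed window)
    return 0.0 <= z[0] <= 1.0 and 0.0 <= z[1] <= 1.0

def draw_R():                            # assumed phi: uniform on [-1/2,1/2]^2
    return (random.uniform(-0.5, 0.5), random.uniform(-0.5, 0.5))

def lost(xi):
    """One pass of (I)-(III) for a single point: True if thinned or censored."""
    if random.random() > P:              # (I) thinning
        return True
    r = draw_R()                         # (II) displacement
    z = (xi[0] + r[0], xi[1] + r[1])
    return not in_A(z)                   # (III) censoring

xi = (0.1, 0.5)
# Closed form for this phi: P(xi + R in A) = P(R_x in [-0.1, 0.5]) * 1 = 0.6,
# so h(xi) = 1 - p + p * (1 - 0.6) = 1 - 0.6 p = 0.7 for p = 0.5.
est = sum(lost(xi) for _ in range(100_000)) / 100_000
assert abs(est - 0.7) < 0.01
```

The Monte Carlo estimate agrees with the displayed formula for h(ξ), since the thinning and the displacement act independently on each point.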
Proposition 6 Let

Λ' = Poisson(A \ y, κ_1) × Poisson(y × R^d, κ_2),

(4.9)

where κ_1 denotes Lebesgue measure restricted to A \ y and κ_2 is defined by

κ_2(B × C) = Σ_{z ∈ y} 1[z ∈ B] ∫_C p φ(r)/a dr

for B ⊆ y and Borel sets C ⊆ R^d. Assume that g(y) > 0, where g is the density of Y with respect to μ, see (5.2) in Appendix A. Then π has a density with respect to Λ',

(dπ/dΛ')(u, q) = c(y) H(u) f(u ∪ v) 1[q ∈ M], (u, q) ∈ E,

(4.10)

where we let v = {v_1, ..., v_n} be specified by q = {(v_1 + r_1, r_1), ..., (v_n + r_n, r_n)} and where c(y) depends on y only.

Proof. See Appendix A.

In the sequel we assume that f is locally stable, and let its Papangelou conditional intensity λ_f, say, be bounded by the constant K, cf. Section 4.1. Then f(∅) > 0, and so by (5.3) in Appendix A, the condition g(y) > 0 is satisfied. Owing to the factorization in (4.9) and the fact that (U, Q) is in one-to-one correspondence with U ∪ Q, we can then put things into the framework of Section 4.1: the density of U ∪ Q | Y = y with respect to Λ = Poisson(S, κ), where S = (A \ y) ∪ (y × R^d) and κ(B) = κ_1(B ∩ A) + κ_2(B ∩ (y × R^d)), is given by (4.10). This density is clearly hereditary, and its Papangelou conditional intensity λ is given by

λ(u ∪ q, ξ) = h(ξ) λ_f(u ∪ v, ξ) 1[q ∈ M] if ξ ∈ (A \ y) \ u,

λ(u ∪ q, (η, r)) = λ_f(u ∪ v, η) 1[q ∪ {(η, r)} ∈ M] if (η, r) ∈ (y × R^d) \ q,
so λ ≤ K, as λ_f ≤ K and h ≤ 1, i.e. local stability is satisfied. Therefore we can make perfect simulations from U ∪ Q | Y = y as described in Section 4.1.

In order to avoid any confusion with the notation used in Section 4.1, let us rename ξ_t and η_t from (4.4) by χ_t and ζ_t, respectively. Note that conditionally on a birth {χ_t} ≠ ∅, we have that χ_t ∈ A \ y with probability κ_1(A \ y)/κ(S) = |A|/(|A| + n(y)p/a), in which case χ_t follows a uniform distribution on A \ y; otherwise χ_t = (η_t, r_t), where η_t is uniformly selected from y, independently of r_t ∼ φ. Moreover, conditionally on D̃_{t-1} and a death {ζ_t} ≠ ∅, we have that ζ_t is uniformly selected from D̃_{t-1}. Finally, λ considered as a function of its first argument splits into a product of an indicator function, which is decreasing, and λ_f, which is typically decreasing or increasing, in which case the computations become feasible, cf. the last paragraph of Section 4.1.

But as the state space for D̃_t is much larger than that for X̃_t, this perfect simulation procedure can be rather inefficient, as demonstrated in Lund and Thönnes (2000). However, a better dominating jump process D̃*_t, say, can be constructed so that

X̃_t ⊆ D̃*_t ⊆ D̃_t, t ≥ 0. (4.11)

Assuming X̃_0 = ∅ and D̃*_0 ⊆ D̃_0, the jump times of D̃* and D̃ are the same. Furthermore, in the corresponding two jump chains D̃* and D̃, we use the same births χ_t, and the same deaths ζ_t whenever ζ_t ∈ (A \ y). However, if a death ζ_t = (η_t, r_t) ∈ y × R^d happens, then as X̃_t does not contain η_t, we set

D̃*_t = D̃*_{t-1} \ {(η, r) ∈ D̃*_{t-1} : η = η_t},

whereby (4.11) is seen to hold.

Note that we can write D̃*_t = Ũ_t ∪ {(ξ, r) : ξ ∈ y, r ∈ S̃_t(ξ)}, where {Ũ_t : t ≥ 0} and {S̃_t(ξ) : t ≥ 0}, ξ ∈ y, are mutually independent jump processes. Here {Ũ_t : t ≥ 0} is a spatial birth and death process on A \ y with birth rate K and death rate 1, so this is reversible with invariant distribution equal to Poisson(A \ y, Kκ_1). Further, N_t(ξ) = n(S_t(ξ)) has generator G = (g_{m,n}) given by g_{m,m+1} = Kp/a for m = 0, 1, ..., g_{m,0} = 1 for m = 1, 2, ..., and g_{m,n} = 0 otherwise for off-diagonal elements; recall that the generator for a jump process with a discrete state space has all row sums equal to 0 (see e.g. Norris, 1997). As its invariant density p_n satisfies Σ_m p_m g_{m,n} = 0, n = 0, 1, ..., we find that p_n ∝ (Kp/(a + Kp))^n specifies a geometric equilibrium distribution. Furthermore, conditionally on a birth happening in S(ξ) at time t, the newborn point r_t ∼ φ, independently of the other points in S_t(ξ) and of the previous history of the process. So the equilibrium distribution of S_t(ξ) | N_t(ξ) is simply a binomial process of IID points, each following the density φ. Therefore D̃*_0 = D*_0 can easily be started in equilibrium. Moreover, though {S̃_t(ξ) : t ≥ 0} is not
reversible, it can easily be simulated backwards in time: if G^r = (g^r_{m,n}) is the generator of the time-reversed process of N_t(ξ), then g^r_{m+1,m} = 1 + Kp/a for m = 0, 1, ..., g^r_{0,m} = (Kp/(a + Kp))^m for m = 1, 2, ..., and g^r_{m,n} = 0 otherwise for off-diagonal elements. This follows by solving the equations p_m g_{m,m+1} = p_{m+1} g^r_{m+1,m} and p_{m+1} g_{m+1,0} = p_0 g^r_{0,m+1} for m = 0, 1, ....

Now, the point is that we can obtain a horizontal CFTP algorithm along the same lines as in Section 4.1, but replacing D̃ with D̃*: as described in detail above, D̃* can easily be started in equilibrium at time 0 and simulated backwards in time (without simulating D̃), and we set U^{-n}_{-n} = D̃*_{-n}, and let the points which are going to be added or deleted at time −t be given by D̃*_{-t} \ D̃*_{-t-1} and D̃*_{-t-1} \ D̃*_{-t}, respectively.

As suggested in Lund and Thönnes (2000), it may be convenient to replace the "acceptance probabilities" for births as given by (4.6) and (4.7) by lower respectively upper bounds which do not depend on evaluating the function h, but only on properties of λ_f: as 0 < 1 − p ≤ h ≤ 1 and the function 1[q ∈ M] is decreasing, we can redefine
α_L(L^{-n}_{-t-1}, U^{-n}_{-t-1}, χ_{-t}) = 1[U^{-n}_{-t-1} ∩ (y × R^d) ∈ M] ((1 − p)/K) min{λ_f(u ∪ v, χ_{-t}) : L^{-n}_{-t-1} ⊆ (u, q) ⊆ U^{-n}_{-t-1}} if χ_{-t} ∈ A \ y,

α_L(L^{-n}_{-t-1}, U^{-n}_{-t-1}, χ_{-t}) = 1[(U^{-n}_{-t-1} ∪ {χ_{-t}}) ∩ (y × R^d) ∈ M] (1/K) min{λ_f(u ∪ v, η_{-t}) : L^{-n}_{-t-1} ⊆ (u, q) ⊆ U^{-n}_{-t-1}} if χ_{-t} = (η_{-t}, r_{-t}) ∈ y × R^d,

where we let v be determined by (u, q) ∈ E as in Proposition 6; similarly, redefine

α_U(L^{-n}_{-t-1}, U^{-n}_{-t-1}, χ_{-t}) = 1[L^{-n}_{-t-1} ∩ (y × R^d) ∈ M] (1/K) max{λ_f(u ∪ v, χ_{-t}) : L^{-n}_{-t-1} ⊆ (u, q) ⊆ U^{-n}_{-t-1}} if χ_{-t} ∈ A \ y,

α_U(L^{-n}_{-t-1}, U^{-n}_{-t-1}, χ_{-t}) = 1[(L^{-n}_{-t-1} ∪ {χ_{-t}}) ∩ (y × R^d) ∈ M] (1/K) max{λ_f(u ∪ v, η_{-t}) : L^{-n}_{-t-1} ⊆ (u, q) ⊆ U^{-n}_{-t-1}} if χ_{-t} = (η_{-t}, r_{-t}) ∈ y × R^d.
Proposition 7 Let the situation be as described above. Then T*_hor = inf{n ≥ 0 : L^{-n}_0 = U^{-n}_0} is almost surely finite, and if D̃*_0 is started in equilibrium then L^{-n}_0 ∼ π whenever L^{-n}_0 = U^{-n}_0.

Proof. This follows in a similar way as the proof of Proposition 5, recalling the possibility of coupling D̃* with D̃ so that T*_hor ≤ T_hor.

As promised, we now compare this with the perfect simulation algorithm in Lund and Thönnes (2000, Section 6.3); apparently, they include the jump
times, but since this is not needed, let us just consider the jump chains and use our notation. Lund and Thönnes do not start by simulating D̃*_0 in equilibrium; instead, for each time −n, before simulating (L_{-n}, U_{-n}) started at time −n, they start D̃* further back at time min{−T_n(ξ) : ξ ∈ y}, where −T_n(ξ) = sup{−t ≤ −n : Ñ_{-t}(ξ) = 0}; here Ñ(ξ) denotes the jump chain of N(ξ). So the running time T_LT of their algorithm is related to ours by T_LT ≈ T*_hor + e, where e is distributed as the maximum of n(y) IID random variables, each following the geometric distribution with density p_n. For the particular application considered in Lund and Thönnes (2000), e is possibly rather small. They report that typically T_LT ≪ T_hor, where T_hor is the running time for horizontal CFTP based on Section 4.1, i.e. before the better dominating jump process D̃* is introduced.
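The geometric equilibrium p_n ∝ (Kp/(a + Kp))^n of N_t(ξ) and the reversed-generator formulas can be verified directly from the balance equations; the parameter values below are illustrative assumptions:

```python
import math

K, p, a = 2.0, 0.5, 1.0                   # assumed illustrative parameter values
birth = K * p / a                         # g_{m,m+1}
ratio = K * p / (a + K * p)               # claimed equilibrium: p_n proportional to ratio^n

def p_n(n):
    return (1 - ratio) * ratio ** n

# Global balance sum_m p_m g_{m,n} = 0 for the catastrophe chain
# (births m -> m+1 at rate Kp/a, total death m -> 0 at rate 1 for m >= 1).
for n in range(1, 50):
    inflow = p_n(n - 1) * birth           # state n is entered only from n - 1
    outflow = p_n(n) * (birth + 1.0)      # leave by a birth to n + 1 or a death to 0
    assert math.isclose(inflow, outflow)

# State 0: inflow from all m >= 1 at rate 1 equals outflow p_0 * birth.
inflow0 = sum(p_n(m) for m in range(1, 10_000))
assert math.isclose(inflow0, p_n(0) * birth, rel_tol=1e-9)

# Reversed generator: g^r_{m+1,m} = 1 + Kp/a and g^r_{0,m} = ratio^m.
for m in range(0, 20):
    assert math.isclose(p_n(m) * birth, p_n(m + 1) * (1 + birth))
    assert math.isclose(p_n(m + 1) * 1.0, p_n(0) * ratio ** (m + 1))
```

The checks pass for any K, p, a > 0, confirming that the geometric ratio Kp/(a + Kp) is the one consistent with both the forward and the reversed generator.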
5
Concluding Remarks and Further Reading
Remark 1 (Fill's algorithm) Fill (1998) introduces a clever form of rejection sampling, assuming a finite state space and a monotone setting with unique minimal and maximal states. Applications and extensions of Fill's algorithm can be found in Fill et al. (2000) and the references therein. The advantage of Fill's algorithm compared to CFTP is that it is interruptible, in the sense that the output is independent of the running time (as in any rejection sampler). The disadvantages may be problems with storage and that it seems more limited in its applications than CFTP. As regards applications of Fill's algorithm in stochastic geometry, Thönnes (1999) considers the special case of the Widom-Rowlinson model (using the anti-monotone setting, this can be extended to the case of k components, see Example 2, Section 3.2), while Møller and Schladitz (1999) consider more general spatial point processes approximated by lattice processes.

Remark 2 (other horizontal CFTP algorithms) As noticed, the algorithm in Section 4.1 is only practical if the Papangelou conditional intensity λ(x, ξ) is increasing or decreasing in its first argument, and this is often the case in applications of stochastic geometry. In contrast, the horizontal CFTP algorithm in Fernandez et al. (1999) depends on λ only through its "interaction radius": the interaction radius at a point ξ ∈ S is the smallest r > 0 such that λ(x, ξ) depends on x only through the points of x within distance r of ξ. The algorithm consists in finding "clans of ancestors" in the dominating spatial birth and death process, when this is simulated backwards in time from time 0 until there are no more ancestors, and then obtaining a target process by thinning as in (4.5); so no lower and upper processes are needed. For the running time T_F of the algorithm (when jump times are ignored), we have that T_F ≥ T_hor. This may not be a fair comparison if we are using a doubling scheme for the
algorithm in Section 4.1, and the efficiency of the algorithms depends much on how close λ is to its upper bound K used in the dominating process, and also on the range of the interaction radius. There are other horizontal CFTP algorithms: Metropolis-Hastings algorithms for locally stable point processes are studied in Kendall and Møller (2000), and the use of spatial jump processes for more general classes of spatial point processes is studied in Berthelsen and Møller (2001). It is noticed in Kendall and Møller (2000) that the considered Metropolis-Hastings algorithm is geometrically ergodic, but in general it is not uniformly ergodic; recall that uniform ergodicity is a necessary condition for vertical CFTP to work.

Remark 3 (infinitely many points) Often one considers point processes with infinitely many points contained in an "infinite volume" such as R^d. In order to avoid edge effects, a perfect sample within a bounded region may be achieved by extending simulations both backwards in time and in space (Kendall 1997; Fernandez et al. 1999). This is sometimes possible, for example if λ is sufficiently close to K and the interaction radius is sufficiently small. Such coupling constructions may be of great theoretical interest, but in my opinion they remain so far impractical for applications of real interest.

Remark 4 (statistical applications) Section 4.2 provides one example of a Bayesian application of horizontal CFTP using the results in Section 4.1; another is given in Loizeaux and McKeague (2000). Van Zwet (1999) relates horizontal CFTP to likelihood based inference for a conditional Boolean model. Here a lower dominating process is used, while the upper dominating process is the same for all times; see also Kendall and Thönnes (1999).
It would be interesting to study more complicated parametric models like germ grain models of interacting geometrical objects, where possibly perfect simulated tempering (M0ller and Nicholls, 1999) could be applied for finding the maximum likelihood estimate. Acknowledgement Kasper K. Berthelsen, Katja Ickstadt, and Rasmus P. Waagepetersen have commented on an earlier version of this paper. This research has been supported by the Centre for Mathematical Physics and Stochastics (MaPhySto), funded by a grant from the Danish National Research Foundation.
Appendix A

We start by verifying Proposition 6. By (I)-(III) in Section 4.2, for finite point configurations u ⊆ x ⊂ A and v = x \ u = {v_1, ..., v_n},

P(U = u, Q ∈ G | X = x) = H(u) ∫ ⋯ ∫ 1[q ∈ G, z ⊂ A] ∏_{i=1}^n (p φ(r_i)) dr_1 ⋯ dr_n,

where q = {q_1, ..., q_n} and z = {z_1, ..., z_n} are given by z_i = v_i + r_i and q_i = (z_i, r_i), and where the n-fold integral is read as 1[∅ ∈ G] if n = 0. Using the definition of μ and Fubini's theorem, it is easily seen that

∫ Σ_{n=0}^∞ Σ_{{v_1,...,v_n} ⊆ x} ∫ ⋯ ∫ k(u, q) dr_1 ⋯ dr_n μ(dx) = ∫ Σ_{n=0}^∞ (1/n!) ∫ ⋯ ∫ k(u, q) dq_1 ⋯ dq_n μ(du)

(5.1)
for integrable functions k. Hence P(UEF,QeG) ί H(u)Σ^
= [•••
ίl[qeG,υCA)f(u[Jv)φ(r1)dq1---φ(rn)dqnμ(du).
Combining this with (IV) in Section 4.2, we obtain that P(UeF,QeG,YeN) =
exp(\A\ - a\A\) ί H(u) ί α JF
J
n W
^
f ζ /"• o
n \ J J
I l[q € G,v C A,z Uw € N]
f{u U v)φ(rχ)dqι • • • φ{rn)dqn μ(dw) μ(du) = n
exp(|Λ| - a\A\) f H(u) f a ^ f ) ( p / α ) » [••• f JF
JN
n=0
J
J
Σ {Zl,...,zn}Cy
l[q € G, v C A)f(u U v)φ(n)drι • • • φ(rn)drn μ(dy) μ(du) where the second identify is obtained from (5.1), replacing u, x, υ with w, y, z, respectively. Thereby Proposition 6 follows with
g(y) = exp(|A| − a|A|) a^{n(y)} ∫ H(u) Σ_{n=0}^∞ (p/a)^n Σ_{{z_1,...,z_n} ⊆ y} ∫ ⋯ ∫ 1[v ⊂ A] f(u ∪ v) φ(r_1) dr_1 ⋯ φ(r_n) dr_n μ(du)

(5.2)

and c(y) a normalizing constant depending on y only.
We conclude with various comments.

The condition g(y) > 0 is satisfied if f(∅) > 0: considering the term with u = v = ∅ and n = 0 in (5.2), and noting that μ({∅}) = exp(−|A|), we obtain that

g(y) ≥ exp(|A| − a|A|) exp(−|A|) a^{n(y)} f(∅) = exp(−a|A|) a^{n(y)} f(∅) > 0.

(5.3)

Proposition 6 can easily be extended to cases where the parameter p in (I) is allowed to depend on the location of a point in X (so-called inhomogeneous thinning), φ in (II) is allowed to depend on the location of a retained point, W has a density with respect to μ, and we put a prior distribution on (p, a, φ). However, as an explicit expression for the normalizing constant of f is usually not known in applications, it will usually not be feasible to introduce a prior on f. Incidentally, a simplified proof of Theorem 1 in Lund and Rudemo (2000), which concerns the expression for the conditional density of Y given X = x, can easily be obtained along similar lines as in the proof above.
References
Berthelsen, K.K. and Møller, J. (2001). Spatial jump processes and perfect simulation for spatial point processes. Research Report R-01-2008, Department of Mathematical Sciences, Aalborg University.
Fernandez, R., Ferrari, P.A. and Garcia, N.L. (1999). Perfect simulation for interacting point processes, loss networks and Ising models. Available at http://xxx.lanl.gov/abs/math.PR/9911162
Fill, J.A. (1998). An interruptible algorithm for exact sampling via Markov chains. Ann. Appl. Prob., 8, 131-162.
Fill, J.A., Machida, M., Murdoch, D.J. and Rosenthal, J.S. (2000). Extensions of Fill's perfect rejection sampling algorithm to general chains. Random Structures and Algorithms, 17, 290-316.
Foss, S.G. and Tweedie, R.L. (1998). Perfect simulation and backward coupling. Stoch. Models, 14, 187-203.
Häggström, O., van Lieshout, M.N.M. and Møller, J. (1999). Characterization results and Markov chain Monte Carlo algorithms including exact simulation for some spatial point processes. Bernoulli, 5, 641-659.
Jeulin, D. (1997). Dead leaves model: from space tessellation to random functions. Advances in Theory and Applications of Random Sets, D. Jeulin, ed., World Scientific Press, Singapore, 137-156.
Kendall, W.S. (1997). Perfect simulation for spatial point processes. Proc. 51st ISI Session, Istanbul (August 1997), 3, 163-166.
Kendall, W.S. (1998). Perfect simulation for the area-interaction point process. Probability Towards 2000, L. Accardi and C.C. Heyde, eds., Springer Verlag, New York, 218-234.
Kendall, W.S. and Møller, J. (2000). Perfect simulation using dominating processes on ordered spaces, with application to locally stable point processes. Adv. Appl. Prob., 32. To appear.
Kendall, W.S. and Thönnes, E. (1999). Perfect simulation in stochastic geometry. Pattern Recognition, 32, 1569-1586.
Loizeaux, M.A. and McKeague, I.W. (2000). Bayesian inference for spatial point processes via perfect simulation. Available at http://stat.fsu.edu/~mckeague/ps/index.html
Lund, J., Penttinen, A. and Rudemo, M. (1999). Bayesian analysis of spatial point patterns from noisy observations. Preprint 1999:57, Department of Mathematics, Chalmers University of Technology and Gothenburg University. Available at http://www.math.chalmers.se/Stat/Research/Preprints/
Lund, J. and Rudemo, M. (2000). Models for point processes observed with noise. Biometrika, 87, 235-249.
Lund, J. and Thönnes, E. (2000). Perfect simulation of point patterns from noisy observations. Research Report 366, Department of Statistics, University of Warwick.
Matheron, G. (1968). Schéma booléen séquentiel de partition aléatoire. Internal Report N-83, CMM, Fontainebleau.
Matheron, G. (1975). Random Sets and Integral Geometry, Wiley, New York.
Meyn, S.P. and Tweedie, R.L. (1993). Markov Chains and Stochastic Stability, Springer Verlag, New York.
Møller, J. and Nicholls, G. (1999). Perfect simulation for sample-based inference. Research Report R-99-2011, Department of Mathematical Sciences, Aalborg University.
Møller, J. and Schladitz, K. (1999). Extensions of Fill's algorithm for perfect simulation. J. Roy. Statist. Soc. B, 61, 955-969.
Norris, J.R. (1997). Markov Chains, Cambridge University Press, Cambridge.
Preston, C.J. (1977). Spatial birth-and-death processes. Bull. Inst. Internat. Statist., 46, 371-391.
Propp, J.G. and Wilson, D.B. (1996). Exact sampling with coupled Markov chains and applications to statistical mechanics. Random Structures and Algorithms, 9, 223-252.
Serra, J. (1982). Image Analysis and Mathematical Morphology, Academic Press, London.
Thönnes, E. (1999). Perfect simulation of some point processes for the impatient user. Adv. Appl. Prob., 31, 69-87.
Wilson, D.B. (2000a). How to couple from the past using a read-once source of randomness. Random Structures and Algorithms, 16, 85-113.
Wilson, D.B. (2000b). Layered multishift coupling for use in perfect sampling algorithms (with a primer to CFTP). Monte Carlo Methods, N. Madras, ed., Fields Institute Communications, 26, 141-176.
van Zwet, E.W. (1999). Likelihood devices in spatial statistics. PhD thesis, Universiteit Utrecht, Faculteit Wiskunde en Informatica.
IMS Lecture Notes—Monograph Series NSF-CBMS Regional Conference Series in Probability & Statistics